What “Parsing and Summarizing CSV Data” Means
CSV stands for “comma-separated values.” A CSV file is a plain-text table: each line is a row, and each row contains columns separated by a delimiter (often a comma). Many tools export data as CSV because it is simple and widely supported.
When you “parse” a CSV file, you convert each row of text into structured values you can work with in Python (for example, turning "19.99" into a number). When you “summarize” CSV data, you compute useful information from many rows: totals, averages, minimum/maximum, counts per category, and simple reports.
This chapter focuses on building a small, practical CSV reader that can: (1) read a CSV file safely, (2) interpret headers, (3) convert selected columns to numbers or dates, (4) compute common summary statistics, and (5) print a readable report. The goal is not to build a full spreadsheet engine, but a dependable script you can reuse for many small data tasks.
Why Use Python’s csv Module Instead of Splitting on Commas
A beginner’s first instinct is often to do something like line.split(','). That works only for very simple CSV files. Real CSV data often includes commas inside quoted text, like:
id,name,comment
1,Ana,"Great, fast service"

If you split that line on commas, you get the wrong number of columns. Python’s built-in csv module handles quoting rules, embedded delimiters, and different newline conventions. It also supports different delimiters (comma, semicolon, tab) and can read rows as lists or dictionaries keyed by header names.
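You can see the difference directly. The following sketch compares a naive split with csv.reader on that exact line (using io.StringIO so no file is needed):

```python
import csv
import io

line = '1,Ana,"Great, fast service"'

# Naive split breaks the quoted field into two pieces.
naive = line.split(",")
print(len(naive))  # 4 columns -- one too many

# csv.reader respects the quotes and keeps the comment as one field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['1', 'Ana', 'Great, fast service']
```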
CSV Structure: Headers, Rows, and Types
Most CSV files have a header row: the first line contains column names. Example:
date,category,amount,notes
2026-01-01,Groceries,34.20,Milk and bread
2026-01-02,Transport,12.00,Bus pass

When you read this file, every value initially arrives as text. If you want to compute totals, you must convert amount to a numeric type. If you want to group by month, you may want to parse date into a date object or at least extract the YYYY-MM portion.
A “simple CSV reader” script usually does these steps:
- Open the file with the correct encoding and newline handling.
- Read the header row and map each column name to its position.
- For each row: validate the number of columns, clean values, and convert types.
- Update summary accumulators (totals, counts, min/max).
- Print or write a report.
Step-by-Step: Reading a CSV File with csv.DictReader
csv.DictReader is convenient because it returns each row as a dictionary: keys are header names, values are strings from the file. That makes code easier to read because you can refer to row["amount"] instead of row[2].
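A minimal sketch of this behavior, using a tiny in-memory CSV so the example runs without a file:

```python
import csv
import io

# A tiny in-memory CSV standing in for a real file.
data = "date,category,amount\n2026-01-01,Groceries,34.20\n"

rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["amount"])    # '34.20' -- still a string at this point
print(rows[0]["category"])  # 'Groceries'
```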
Example CSV: expenses.csv
Assume your file looks like this:
date,category,amount,notes
2026-01-01,Groceries,34.20,Milk and bread
2026-01-02,Transport,12.00,Bus pass
2026-01-03,Groceries,18.75,Vegetables
2026-01-03,Utilities,55.10,Electricity
2026-01-04,Transport,3.50,Coffee on the go
2026-01-05,Groceries,9.99,"Snacks, chips"

Notice the last row: the notes field contains a comma inside quotes. This is exactly the kind of case where split(',') breaks and the csv module succeeds.
Minimal reader
import csv

path = "expenses.csv"

with open(path, "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

This prints dictionaries like:

{'date': '2026-01-01', 'category': 'Groceries', 'amount': '34.20', 'notes': 'Milk and bread'}

At this stage, everything is still text. Next, you’ll convert and summarize.
Designing a Simple Summarizer: What to Compute
Before writing code, decide what “summary” means for your data. For an expenses file, common summaries include:
- Total amount across all rows
- Number of transactions (rows)
- Average transaction amount
- Total per category
- Largest and smallest transaction
- Totals per month (optional)
These summaries can be computed in one pass through the file: you read each row once and update running totals. This is efficient and works even for large files.
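The one-pass idea can be sketched on a plain list before any file handling is involved; the amounts here are made-up values:

```python
# One pass over made-up amounts: every accumulator updates per item.
amounts = [34.20, 12.00, 18.75, 55.10, 3.50]

total = 0.0
count = 0
smallest = None
largest = None

for amt in amounts:
    total += amt
    count += 1
    if smallest is None or amt < smallest:
        smallest = amt
    if largest is None or amt > largest:
        largest = amt

average = total / count if count else 0.0
print(total, count, smallest, largest, average)
```

The CSV version later in this chapter uses exactly this shape; only the source of the amounts changes.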
Step-by-Step: Converting Columns Safely
Because CSV values are strings, you need conversion rules. A practical approach is to create small helper functions that convert values and handle common messy cases (extra spaces, empty strings, currency symbols).
Cleaning and converting money values
Amounts might appear as "34.20", " 34.20 ", or even "$34.20". A simple converter can strip whitespace and remove a leading currency symbol.
def parse_amount(text):
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "":
        return None
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    return float(cleaned)

If your data uses commas as thousands separators (for example "1,234.56"), you can remove commas:

cleaned = cleaned.replace(",", "")

Be careful: in some locales, commas are used as decimal separators. For a beginner-friendly script, it’s best to assume one consistent format and document it.
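Putting those pieces together, here is a quick check of the converter on a few messy inputs, assuming US-style thousands separators:

```python
def parse_amount(text):
    # Convert messy money strings to float; None signals "missing".
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "":
        return None
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    cleaned = cleaned.replace(",", "")  # assumes US-style "1,234.56"
    return float(cleaned)

print(parse_amount(" $1,234.56 "))  # 1234.56
print(parse_amount("34.20"))        # 34.2
print(parse_amount(""))             # None
```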
Extracting a month key from a date string
If dates are in ISO format (YYYY-MM-DD), you can extract the month with slicing:
def month_key(date_text):
    if not date_text:
        return None
    date_text = date_text.strip()
    if len(date_text) < 7:
        return None
    return date_text[:7]  # "YYYY-MM"

This avoids introducing extra complexity. If you later want stricter date validation, you can use datetime, but for many summaries, a month key string is enough.
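If you do want stricter validation later, one sketch using the standard-library datetime module (month_key_strict is an illustrative name) could look like this:

```python
from datetime import datetime

def month_key_strict(date_text):
    # Return "YYYY-MM" only if the text is a real ISO date; else None.
    if not date_text:
        return None
    try:
        parsed = datetime.strptime(date_text.strip(), "%Y-%m-%d")
    except ValueError:
        return None
    return parsed.strftime("%Y-%m")

print(month_key_strict("2026-01-15"))  # "2026-01"
print(month_key_strict("2026-13-99")) # None -- not a valid date
```

Unlike plain slicing, this rejects strings like "2026-13-99" that merely look date-shaped.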
Building the Summarizer Script (One Pass)
The following script reads the CSV, converts the amount, and computes totals and per-category summaries. It also tracks the largest transaction.
import csv

def parse_amount(text):
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "":
        return None
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    cleaned = cleaned.replace(",", "")
    return float(cleaned)

def month_key(date_text):
    if not date_text:
        return None
    date_text = date_text.strip()
    if len(date_text) < 7:
        return None
    return date_text[:7]

path = "expenses.csv"

row_count = 0
valid_count = 0
sum_amount = 0.0
by_category = {}
by_month = {}
max_amount = None
max_row = None

with open(path, "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)

    required_cols = {"date", "category", "amount"}
    if reader.fieldnames is None:
        raise ValueError("CSV file has no header row")
    missing = required_cols - set(reader.fieldnames)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    for row in reader:
        row_count += 1
        amt = parse_amount(row.get("amount"))
        cat = (row.get("category") or "").strip()
        dt = (row.get("date") or "").strip()
        if amt is None or cat == "" or dt == "":
            # Skip incomplete rows
            continue
        valid_count += 1
        sum_amount += amt
        by_category[cat] = by_category.get(cat, 0.0) + amt
        mkey = month_key(dt)
        if mkey:
            by_month[mkey] = by_month.get(mkey, 0.0) + amt
        if max_amount is None or amt > max_amount:
            max_amount = amt
            max_row = row

avg_amount = (sum_amount / valid_count) if valid_count else 0.0

print("CSV SUMMARY")
print("----------")
print(f"Rows read: {row_count}")
print(f"Valid rows: {valid_count}")
print(f"Total amount: {sum_amount:.2f}")
print(f"Average amount: {avg_amount:.2f}")

print("\nTOTAL BY CATEGORY")
for cat in sorted(by_category):
    print(f"- {cat}: {by_category[cat]:.2f}")

print("\nTOTAL BY MONTH")
for m in sorted(by_month):
    print(f"- {m}: {by_month[m]:.2f}")

if max_row is not None:
    print("\nLARGEST TRANSACTION")
    print(f"- amount: {max_amount:.2f}")
    print(f"- date: {max_row.get('date')}")
    print(f"- category: {max_row.get('category')}")
    print(f"- notes: {max_row.get('notes')}")

What this script is doing (in practical terms)
- It verifies the CSV has the columns you need. This prevents confusing errors later.
- It reads each row once and updates summary variables.
- It skips rows that are missing key values (date/category/amount).
- It builds two dictionaries: totals by category and totals by month.
- It tracks the maximum amount and remembers the row it came from.
This pattern is reusable: replace the column names and summary logic to fit your dataset.
Handling Common CSV “Messiness” Without Overcomplicating
Real-world CSV files often contain issues. You can keep your script simple while still being robust by adding a few targeted checks.
1) Extra whitespace and inconsistent capitalization
Categories might appear as "groceries", "Groceries", or " Groceries ". Decide on a normalization rule. For example, you can strip spaces and standardize capitalization:
cat = (row.get("category") or "").strip().title()

Be aware that .title() may not be ideal for every category name (for example, “eBay” becomes “Ebay”). Another approach is .strip() only, and keep the original text.
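One middle ground, sketched here with a hypothetical alias table, is to map known lowercase variants to a canonical spelling and pass unknown names through unchanged:

```python
# Hypothetical alias table: lowercase variants -> canonical name.
CATEGORY_ALIASES = {
    "groceries": "Groceries",
    "transport": "Transport",
    "utilities": "Utilities",
}

def normalize_category(text):
    cleaned = (text or "").strip()
    # Known variants get the canonical spelling; unknown names pass through.
    return CATEGORY_ALIASES.get(cleaned.lower(), cleaned)

print(normalize_category("  groceries "))  # "Groceries"
print(normalize_category("eBay"))          # "eBay" -- left untouched
```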
2) Blank lines in the file
Some CSV exports include blank lines. csv.DictReader skips completely blank lines, but rows that contain only delimiters still arrive with empty-string values. The “skip incomplete rows” check handles those.
3) Different delimiters (semicolon or tab)
Not all “CSV” files use commas. Some use semicolons (;) or tabs. If you open a file and everything appears in one column, the delimiter is likely different. You can pass a delimiter:
reader = csv.DictReader(f, delimiter=";")

If you are not sure which delimiter is used, you can try csv.Sniffer to detect it. This adds a bit of complexity but can be worth it for a reusable tool.
import csv

with open(path, "r", encoding="utf-8", newline="") as f:
    sample = f.read(2048)
    f.seek(0)
    dialect = csv.Sniffer().sniff(sample)
    reader = csv.DictReader(f, dialect=dialect)

For beginner scripts, it’s often fine to support one delimiter and document it, but it’s helpful to know detection exists.
4) Numbers that fail to parse
If an amount column contains unexpected text (like "N/A"), float() will fail. One approach is to treat unparseable amounts as missing and skip the row. You can do this by catching conversion errors inside parse_amount and returning None.
def parse_amount(text):
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "":
        return None
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

This keeps the main loop clean: it just checks for None.
Producing a More Readable Report: Sorting and Formatting
Summaries are more useful when they are easy to scan. Two common improvements are sorting by total amount and aligning output.
Sorting categories by total (largest first)
sorted_cats = sorted(by_category.items(), key=lambda item: item[1], reverse=True)
for cat, total in sorted_cats:
    print(f"- {cat:15s} {total:10.2f}")

{cat:15s} pads the category name to at least 15 characters, and {total:10.2f} pads the number to at least 10 characters with 2 decimals. This creates a simple table-like output.
Computing percentages per category
Percentages help you understand where most of the total goes.
for cat, total in sorted_cats:
    pct = (total / sum_amount * 100) if sum_amount else 0.0
    print(f"- {cat:15s} {total:10.2f} ({pct:5.1f}%)")

This is a small change that makes the report more informative.
Extending the Reader: Selecting Columns and Reusing the Script
A common next step is to make your CSV reader reusable for different files. Even without building a full command-line app, you can structure the code so you only change a few variables.
Configurable column names
Different CSV exports might call the same idea by different names: amount vs price vs total. You can define the column names at the top:
COL_DATE = "date"
COL_CATEGORY = "category"
COL_AMOUNT = "amount"

Then use them in the script:
required_cols = {COL_DATE, COL_CATEGORY, COL_AMOUNT}
...
amt = parse_amount(row.get(COL_AMOUNT))
cat = (row.get(COL_CATEGORY) or "").strip()
dt = (row.get(COL_DATE) or "").strip()

This makes it easier to adapt the script to new data sources.
Filtering rows before summarizing
Sometimes you want a summary for only one category or only one month. You can add a simple filter condition in the loop. For example, summarize only “Groceries”:
FILTER_CATEGORY = "Groceries"
...
if cat != FILTER_CATEGORY:
    continue

Or summarize only a specific month:
FILTER_MONTH = "2026-01"
...
if month_key(dt) != FILTER_MONTH:
    continue

Filtering is powerful because it lets you reuse the same summarizer for many questions without changing the summary logic.
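The two filters can also be combined into one small helper. This is a sketch; row_passes is an illustrative name, and None means “do not filter on this field”:

```python
def month_key(date_text):
    # Same "YYYY-MM" slice used earlier in the chapter.
    if not date_text or len(date_text.strip()) < 7:
        return None
    return date_text.strip()[:7]

def row_passes(cat, dt, filter_category=None, filter_month=None):
    # None means "do not filter on this field".
    if filter_category is not None and cat != filter_category:
        return False
    if filter_month is not None and month_key(dt) != filter_month:
        return False
    return True

print(row_passes("Groceries", "2026-01-03", filter_category="Groceries"))  # True
print(row_passes("Transport", "2026-02-01", filter_month="2026-01"))       # False
```

In the main loop, `if not row_passes(cat, dt, ...): continue` then replaces the separate checks.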
Reading CSV Without Headers (When You Must)
Some CSV files do not include a header row. In that case, you can use csv.reader, which returns each row as a list. You then decide which column index means what.
Example file without headers:
2026-01-01,Groceries,34.20,Milk and bread
2026-01-02,Transport,12.00,Bus pass

Reader example:
import csv

# parse_amount is the helper defined earlier in this chapter.

path = "expenses_no_header.csv"

IDX_DATE = 0
IDX_CATEGORY = 1
IDX_AMOUNT = 2

sum_amount = 0.0
count = 0

with open(path, "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) < 3:
            continue
        dt = row[IDX_DATE].strip()
        cat = row[IDX_CATEGORY].strip()
        amt = parse_amount(row[IDX_AMOUNT])
        if dt == "" or cat == "" or amt is None:
            continue
        sum_amount += amt
        count += 1

print(f"Rows: {count}")
print(f"Total: {sum_amount:.2f}")

This approach is slightly more fragile because it depends on column order, but it is sometimes necessary.
Practical Mini-Exercises (Use Your Own CSV)
Exercise 1: Add a “minimum transaction” tracker
In addition to the maximum amount, track the minimum amount and its row. You will need two variables (min_amount and min_row) and update them similarly to the max logic.
Exercise 2: Count transactions per category
Along with totals per category, count how many rows each category has. Use a second dictionary like count_by_category and increment it for each valid row.
count_by_category[cat] = count_by_category.get(cat, 0) + 1

Then print both count and total in the report.
Exercise 3: Detect suspicious rows
Add a check that flags rows where the amount is unusually large (for example, greater than 500). Instead of skipping them, print them to the screen so you can review the data.
if amt > 500:
    print("Suspicious row:", row)

This is a simple way to do basic data validation and catch typos.
Key Takeaways to Apply Immediately
A simple CSV reader becomes genuinely useful when it does three things well: reads CSV correctly (including quotes and commas in text), converts the specific columns you care about into usable values, and summarizes in one pass with clear, readable output. With the patterns in this chapter—csv.DictReader, small parsing helpers, and dictionary-based totals—you can build quick scripts that turn exported data into answers.