What “Parsing and Summarizing CSV Data” Means
CSV stands for “comma-separated values.” A CSV file is a plain-text table: each line is a row, and each row contains columns separated by a delimiter (often a comma). Many tools export data as CSV because it is simple and widely supported.
When you “parse” a CSV file, you convert each row of text into structured values you can work with in Python (for example, turning "19.99" into a number). When you “summarize” CSV data, you compute useful information from many rows: totals, averages, minimum/maximum, counts per category, and simple reports.
This chapter focuses on building a small, practical CSV reader that can: (1) read a CSV file safely, (2) interpret headers, (3) convert selected columns to numbers or dates, (4) compute common summary statistics, and (5) print a readable report. The goal is not to build a full spreadsheet engine, but a dependable script you can reuse for many small data tasks.
Why Use Python’s csv Module Instead of Splitting on Commas
A beginner’s first instinct is often to do something like line.split(','). That works only for very simple CSV files. Real CSV data often includes commas inside quoted text, like:
id,name,comment
1,Ana,"Great, fast service"

If you split that line on commas, you get the wrong number of columns. Python’s built-in csv module handles quoting rules, embedded delimiters, and different newline conventions. It also supports different delimiters (comma, semicolon, tab) and can read rows as lists or dictionaries keyed by header names.
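You can see the difference directly. The following sketch compares a naive split with csv.reader on that exact line (using io.StringIO so no file is needed):

```python
import csv
import io

line = '1,Ana,"Great, fast service"'

# Naive split breaks the quoted field into two pieces.
naive = line.split(",")
print(len(naive))  # 4 columns -- one too many

# csv.reader respects the quotes and keeps the comment as one field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['1', 'Ana', 'Great, fast service']
```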
CSV Structure: Headers, Rows, and Types
Most CSV files have a header row: the first line contains column names. Example:
date,category,amount,notes
2026-01-01,Groceries,34.20,Milk and bread
2026-01-02,Transport,12.00,Bus pass

When you read this file, every value initially arrives as text. If you want to compute totals, you must convert amount to a numeric type. If you want to group by month, you may want to parse date into a date object or at least extract the YYYY-MM portion.
A “simple CSV reader” script usually does these steps:
- Open the file with the correct encoding and newline handling.
- Read the header row and map each column name to its position.
- For each row: validate the number of columns, clean values, and convert types.
- Update summary accumulators (totals, counts, min/max).
- Print or write a report.
Step-by-Step: Reading a CSV File with csv.DictReader
csv.DictReader is convenient because it returns each row as a dictionary: keys are header names, values are strings from the file. That makes code easier to read because you can refer to row["amount"] instead of row[2].
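A minimal sketch of this behavior, using a tiny in-memory CSV so the example runs without a file:

```python
import csv
import io

# A tiny in-memory CSV standing in for a real file.
data = "date,category,amount\n2026-01-01,Groceries,34.20\n"

rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["amount"])    # '34.20' -- still a string at this point
print(rows[0]["category"])  # 'Groceries'
```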
Example CSV: expenses.csv
Assume your file looks like this:
date,category,amount,notes
2026-01-01,Groceries,34.20,Milk and bread
2026-01-02,Transport,12.00,Bus pass
2026-01-03,Groceries,18.75,Vegetables
2026-01-03,Utilities,55.10,Electricity
2026-01-04,Transport,3.50,Coffee on the go
2026-01-05,Groceries,9.99,"Snacks, chips"

Notice the last row: the notes field contains a comma inside quotes. This is exactly the kind of case where split(',') breaks and the csv module succeeds.
Minimal reader
import csv

path = "expenses.csv"

with open(path, "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

This prints dictionaries like:

{'date': '2026-01-01', 'category': 'Groceries', 'amount': '34.20', 'notes': 'Milk and bread'}

At this stage, everything is still text. Next, you’ll convert and summarize.
Designing a Simple Summarizer: What to Compute
Before writing code, decide what “summary” means for your data. For an expenses file, common summaries include:
- Total amount across all rows
- Number of transactions (rows)
- Average transaction amount
- Total per category
- Largest and smallest transaction
- Totals per month (optional)
These summaries can be computed in one pass through the file: you read each row once and update running totals. This is efficient and works even for large files.
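The one-pass idea can be sketched on a plain list before any file handling is involved; the amounts here are made-up values:

```python
# One pass over made-up amounts: every accumulator updates per item.
amounts = [34.20, 12.00, 18.75, 55.10, 3.50]

total = 0.0
count = 0
smallest = None
largest = None

for amt in amounts:
    total += amt
    count += 1
    if smallest is None or amt < smallest:
        smallest = amt
    if largest is None or amt > largest:
        largest = amt

average = total / count if count else 0.0
print(total, count, smallest, largest, average)
```

The CSV version later in this chapter uses exactly this shape; only the source of the amounts changes.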
Step-by-Step: Converting Columns Safely
Because CSV values are strings, you need conversion rules. A practical approach is to create small helper functions that convert values and handle common messy cases (extra spaces, empty strings, currency symbols).
Cleaning and converting money values
Amounts might appear as "34.20", " 34.20 ", or even "$34.20". A simple converter can strip whitespace and remove a leading currency symbol.
def parse_amount(text):
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "":
        return None
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    return float(cleaned)

If your data uses commas as thousands separators (for example "1,234.56"), you can remove commas:

cleaned = cleaned.replace(",", "")

Be careful: in some locales, commas are used as decimal separators. For a beginner-friendly script, it’s best to assume one consistent format and document it.
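Putting those pieces together, here is a quick check of the converter on a few messy inputs, assuming US-style thousands separators:

```python
def parse_amount(text):
    # Convert messy money strings to float; None signals "missing".
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "":
        return None
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    cleaned = cleaned.replace(",", "")  # assumes US-style "1,234.56"
    return float(cleaned)

print(parse_amount(" $1,234.56 "))  # 1234.56
print(parse_amount("34.20"))        # 34.2
print(parse_amount(""))             # None
```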
Extracting a month key from a date string
If dates are in ISO format (YYYY-MM-DD), you can extract the month with slicing:
def month_key(date_text):
    if not date_text:
        return None
    date_text = date_text.strip()
    if len(date_text) < 7:
        return None
    return date_text[:7]  # "YYYY-MM"

This avoids introducing extra complexity. If you later want stricter date validation, you can use datetime, but for many summaries, a month key string is enough.
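If you do want stricter validation later, one sketch using the standard-library datetime module (month_key_strict is an illustrative name) could look like this:

```python
from datetime import datetime

def month_key_strict(date_text):
    # Return "YYYY-MM" only if the text is a real ISO date; else None.
    if not date_text:
        return None
    try:
        parsed = datetime.strptime(date_text.strip(), "%Y-%m-%d")
    except ValueError:
        return None
    return parsed.strftime("%Y-%m")

print(month_key_strict("2026-01-15"))  # "2026-01"
print(month_key_strict("2026-13-99")) # None -- not a valid date
```

Unlike plain slicing, this rejects strings like "2026-13-99" that merely look date-shaped.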
Building the Summarizer Script (One Pass)
The following script reads the CSV, converts the amount, and computes totals and per-category summaries. It also tracks the largest transaction.
import csv

def parse_amount(text):
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "":
        return None
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    cleaned = cleaned.replace(",", "")
    return float(cleaned)

def month_key(date_text):
    if not date_text:
        return None
    date_text = date_text.strip()
    if len(date_text) < 7:
        return None
    return date_text[:7]

path = "expenses.csv"

row_count = 0
valid_count = 0
sum_amount = 0.0
by_category = {}
by_month = {}
max_amount = None
max_row = None

with open(path, "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)

    required_cols = {"date", "category", "amount"}
    if reader.fieldnames is None:
        raise ValueError("CSV file has no header row")
    missing = required_cols - set(reader.fieldnames)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    for row in reader:
        row_count += 1
        amt = parse_amount(row.get("amount"))
        cat = (row.get("category") or "").strip()
        dt = (row.get("date") or "").strip()
        if amt is None or cat == "" or dt == "":
            # Skip incomplete rows
            continue
        valid_count += 1
        sum_amount += amt
        by_category[cat] = by_category.get(cat, 0.0) + amt
        mkey = month_key(dt)
        if mkey:
            by_month[mkey] = by_month.get(mkey, 0.0) + amt
        if max_amount is None or amt > max_amount:
            max_amount = amt
            max_row = row

avg_amount = (sum_amount / valid_count) if valid_count else 0.0

print("CSV SUMMARY")
print("----------")
print(f"Rows read: {row_count}")
print(f"Valid rows: {valid_count}")
print(f"Total amount: {sum_amount:.2f}")
print(f"Average amount: {avg_amount:.2f}")

print("\nTOTAL BY CATEGORY")
for cat in sorted(by_category):
    print(f"- {cat}: {by_category[cat]:.2f}")

print("\nTOTAL BY MONTH")
for m in sorted(by_month):
    print(f"- {m}: {by_month[m]:.2f}")

if max_row is not None:
    print("\nLARGEST TRANSACTION")
    print(f"- amount: {max_amount:.2f}")
    print(f"- date: {max_row.get('date')}")
    print(f"- category: {max_row.get('category')}")
    print(f"- notes: {max_row.get('notes')}")

What this script is doing (in practical terms)
- It verifies the CSV has the columns you need. This prevents confusing errors later.
- It reads each row once and updates summary variables.
- It skips rows that are missing key values (date/category/amount).
- It builds two dictionaries: totals by category and totals by month.
- It tracks the maximum amount and remembers the row it came from.
This pattern is reusable: replace the column names and summary logic to fit your dataset.
Handling Common CSV “Messiness” Without Overcomplicating
Real-world CSV files often contain issues. You can keep your script simple while still being robust by adding a few targeted checks.
1) Extra whitespace and inconsistent capitalization
Categories might appear as "groceries", "Groceries", or " Groceries ". Decide on a normalization rule. For example, you can strip spaces and standardize capitalization:
cat = (row.get("category") or "").strip().title()

Be aware that .title() may not be ideal for every category name (for example, “eBay” becomes “Ebay”). Another approach is .strip() only, and keep the original text.
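One middle ground, sketched here with a hypothetical alias table, is to map known lowercase variants to a canonical spelling and pass unknown names through unchanged:

```python
# Hypothetical alias table: lowercase variants -> canonical name.
CATEGORY_ALIASES = {
    "groceries": "Groceries",
    "transport": "Transport",
    "utilities": "Utilities",
}

def normalize_category(text):
    cleaned = (text or "").strip()
    # Known variants get the canonical spelling; unknown names pass through.
    return CATEGORY_ALIASES.get(cleaned.lower(), cleaned)

print(normalize_category("  groceries "))  # "Groceries"
print(normalize_category("eBay"))          # "eBay" -- left untouched
```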
2) Blank lines in the file
Some CSV exports include blank lines. csv.DictReader skips completely blank lines, but rows that contain only delimiters still arrive with empty-string values. The “skip incomplete rows” check handles those.
3) Different delimiters (semicolon or tab)
Not all “CSV” files use commas. Some use semicolons (;) or tabs. If you open a file and everything appears in one column, the delimiter is likely different. You can pass a delimiter:
reader = csv.DictReader(f, delimiter=";")

If you are not sure which delimiter is used, you can try csv.Sniffer to detect it. This adds a bit of complexity but can be worth it for a reusable tool.
import csv

with open(path, "r", encoding="utf-8", newline="") as f:
    sample = f.read(2048)
    f.seek(0)
    dialect = csv.Sniffer().sniff(sample)
    reader = csv.DictReader(f, dialect=dialect)

For beginner scripts, it’s often fine to support one delimiter and document it, but it’s helpful to know detection exists.
4) Numbers that fail to parse
If an amount column contains unexpected text (like "N/A"), float() will fail. One approach is to treat unparseable amounts as missing and skip the row. You can do this by catching conversion errors inside parse_amount and returning None.
def parse_amount(text):
    if text is None:
        return None
    cleaned = text.strip()
    if cleaned == "":
        return None
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

This keeps the main loop clean: it just checks for None.
Producing a More Readable Report: Sorting and Formatting
Summaries are more useful when they are easy to scan. Two common improvements are sorting by total amount and aligning output.
Sorting categories by total (largest first)
sorted_cats = sorted(by_category.items(), key=lambda item: item[1], reverse=True)
for cat, total in sorted_cats:
    print(f"- {cat:15s} {total:10.2f}")

{cat:15s} pads the category name to at least 15 characters, and {total:10.2f} pads the number to at least 10 characters with 2 decimals. This creates a simple table-like output.
Computing percentages per category
Percentages help you understand where most of the total goes.
for cat, total in sorted_cats:
    pct = (total / sum_amount * 100) if sum_amount else 0.0
    print(f"- {cat:15s} {total:10.2f} ({pct:5.1f}%)")

This is a small change that makes the report more informative.
Extending the Reader: Selecting Columns and Reusing the Script
A common next step is to make your CSV reader reusable for different files. Even without building a full command-line app, you can structure the code so you only change a few variables.
Configurable column names
Different CSV exports might call the same idea by different names: amount vs price vs total. You can define the column names at the top:
COL_DATE = "date"
COL_CATEGORY = "category"
COL_AMOUNT = "amount"

Then use them in the script:
required_cols = {COL_DATE, COL_CATEGORY, COL_AMOUNT}
...
amt = parse_amount(row.get(COL_AMOUNT))
cat = (row.get(COL_CATEGORY) or "").strip()
dt = (row.get(COL_DATE) or "").strip()

This makes it easier to adapt the script to new data sources.
Filtering rows before summarizing
Sometimes you want a summary for only one category or only one month. You can add a simple filter condition in the loop. For example, summarize only “Groceries”:
FILTER_CATEGORY = "Groceries"
...
if cat != FILTER_CATEGORY:
    continue

Or summarize only a specific month:
FILTER_MONTH = "2026-01"
...
if month_key(dt) != FILTER_MONTH:
    continue

Filtering is powerful because it lets you reuse the same summarizer for many questions without changing the summary logic.
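The two filters can also be combined into one small helper. This is a sketch; row_passes is an illustrative name, and None means “do not filter on this field”:

```python
def month_key(date_text):
    # Same "YYYY-MM" slice used earlier in the chapter.
    if not date_text or len(date_text.strip()) < 7:
        return None
    return date_text.strip()[:7]

def row_passes(cat, dt, filter_category=None, filter_month=None):
    # None means "do not filter on this field".
    if filter_category is not None and cat != filter_category:
        return False
    if filter_month is not None and month_key(dt) != filter_month:
        return False
    return True

print(row_passes("Groceries", "2026-01-03", filter_category="Groceries"))  # True
print(row_passes("Transport", "2026-02-01", filter_month="2026-01"))       # False
```

In the main loop, `if not row_passes(cat, dt, ...): continue` then replaces the separate checks.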
Reading CSV Without Headers (When You Must)
Some CSV files do not include a header row. In that case, you can use csv.reader, which returns each row as a list. You then decide which column index means what.
Example file without headers:
2026-01-01,Groceries,34.20,Milk and bread
2026-01-02,Transport,12.00,Bus pass

Reader example:
import csv

# parse_amount is the helper defined earlier in this chapter.

path = "expenses_no_header.csv"

IDX_DATE = 0
IDX_CATEGORY = 1
IDX_AMOUNT = 2

sum_amount = 0.0
count = 0

with open(path, "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) < 3:
            continue
        dt = row[IDX_DATE].strip()
        cat = row[IDX_CATEGORY].strip()
        amt = parse_amount(row[IDX_AMOUNT])
        if dt == "" or cat == "" or amt is None:
            continue
        sum_amount += amt
        count += 1

print(f"Rows: {count}")
print(f"Total: {sum_amount:.2f}")

This approach is slightly more fragile because it depends on column order, but it is sometimes necessary.
Practical Mini-Exercises (Use Your Own CSV)
Exercise 1: Add a “minimum transaction” tracker
In addition to the maximum amount, track the minimum amount and its row. You will need two variables (min_amount and min_row) and update them similarly to the max logic.
Exercise 2: Count transactions per category
Along with totals per category, count how many rows each category has. Use a second dictionary like count_by_category and increment it for each valid row.
count_by_category[cat] = count_by_category.get(cat, 0) + 1

Then print both count and total in the report.
Exercise 3: Detect suspicious rows
Add a check that flags rows where the amount is unusually large (for example, greater than 500). Instead of skipping them, print them to the screen so you can review the data.
if amt > 500:
    print("Suspicious row:", row)

This is a simple way to do basic data validation and catch typos.
Key Takeaways to Apply Immediately
A simple CSV reader becomes genuinely useful when it does three things well: reads CSV correctly (including quotes and commas in text), converts the specific columns you care about into usable values, and summarizes in one pass with clear, readable output. With the patterns in this chapter—csv.DictReader, small parsing helpers, and dictionary-based totals—you can build quick scripts that turn exported data into answers.