
Python for Absolute Beginners: Variables, Loops, and Small Useful Scripts

Creating a Text Cleaner for Formatting and Filtering

Chapter 11

Estimated reading time: 12 minutes

What a “Text Cleaner” Does (and Why It’s Useful)

A text cleaner is a small program that takes messy text and produces a more consistent, usable version. “Messy” can mean many things: extra spaces, inconsistent line endings, random punctuation, mixed casing, unwanted characters, duplicated blank lines, or lines you want to remove (like comments, timestamps, or empty entries). “Usable” depends on your goal: maybe you want to prepare text for analysis, create a clean list of items, normalize user-submitted notes, or format content before saving it.

In this chapter you will build a practical text cleaner that can do two kinds of work:

  • Formatting: normalize whitespace, unify line endings, standardize case, and optionally remove punctuation.
  • Filtering: keep or remove lines based on rules (empty lines, duplicates, lines containing certain words, or lines matching a pattern).

The key idea is to treat text as something you can transform in stages. Each stage takes text in and returns text out. This “pipeline” approach makes the cleaner easy to extend: you can add a new stage without rewriting everything.

Designing the Cleaner as a Pipeline

Instead of writing one giant block of code that tries to do everything at once, you will create small transformation steps. Each step should be easy to understand and test. A typical pipeline might look like this:

  • Normalize line endings and split into lines
  • Trim whitespace on each line
  • Remove unwanted lines (empty lines, duplicates, lines containing banned words)
  • Normalize spacing inside lines
  • Apply casing rules (lowercase, title case, etc.)
  • Join lines back together

Not every project needs every step. The point is to have a structure where you can turn steps on or off.
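To make the idea concrete, here is a minimal sketch of the pipeline pattern. The stage names are illustrative; the cleaner built later in this chapter uses a configurable version of the same idea.

```python
# A minimal sketch of the pipeline idea: each stage is a small
# function that takes text in and returns text out.

def normalize_endings(text):
    # Unify Windows (\r\n) and old Mac (\r) line endings to \n
    return text.replace("\r\n", "\n").replace("\r", "\n")

def trim_lines(text):
    return "\n".join(line.strip() for line in text.split("\n"))

def drop_empty_lines(text):
    return "\n".join(line for line in text.split("\n") if line != "")

def run_pipeline(text, stages):
    # Apply each stage in order; stages can be turned on or off
    # simply by editing this list.
    for stage in stages:
        text = stage(text)
    return text

pipeline = [normalize_endings, trim_lines, drop_empty_lines]
print(run_pipeline("  Apples \r\n\r\nBananas  ", pipeline))
```

Adding a new stage is just a matter of writing another small function and appending it to the list.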

Step 1: Decide the Cleaning Rules

Before coding, decide what “clean” means for your situation. Here are common rules you can implement:

  • Trim edges: remove leading and trailing spaces on each line.
  • Collapse internal whitespace: turn multiple spaces/tabs into a single space.
  • Remove empty lines: delete lines that become empty after trimming.
  • Remove duplicate lines: keep only the first occurrence of each line.
  • Filter by keywords: remove lines that contain certain words (like “TODO” or “DEBUG”).
  • Keep only lines that match: for example, keep only lines that contain an “@” if you are extracting emails.
  • Normalize case: lowercase everything, or apply title case.
  • Remove punctuation: useful when you want plain words.
  • Normalize bullet lists: convert different bullet styles into a single style.

In the build below, you will implement a flexible set of options so the same cleaner can be reused for different tasks.

Step 2: Create a Configuration Object

A cleaner becomes much more useful when you can change its behavior without rewriting code. A simple way is to store options in a dictionary. This chapter avoids re-teaching basics, so focus on how the options affect the pipeline.

Here is a configuration dictionary you can use:

config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": False,
    "to_lower": False,
    "remove_punctuation": False,
    "banned_substrings": [],
    "required_substrings": [],
    "max_line_length": None
}

Meaning of each option:

  • strip_lines: trims each line.
  • remove_empty_lines: removes empty lines after trimming.
  • collapse_spaces: replaces runs of whitespace with a single space.
  • dedupe_lines: removes duplicate lines while preserving order.
  • to_lower: lowercases each line.
  • remove_punctuation: removes punctuation characters.
  • banned_substrings: if any of these appear in a line, remove the line.
  • required_substrings: if provided, keep only lines that contain at least one of these.
  • max_line_length: if set, remove lines longer than this number of characters.

Step 3: Implement Core Formatting Steps

Normalize line endings and split into lines

Text can contain Windows line endings (\r\n) or Unix line endings (\n). Python handles many cases, but it’s still useful to normalize so your logic is consistent.

def split_lines(text):
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.split("\n")

Strip edges and collapse internal whitespace

Stripping removes spaces at the start and end. Collapsing internal whitespace is slightly different: it turns multiple spaces/tabs into one space. A reliable way is to split on any whitespace and join with a single space.

def strip_line(line):
    return line.strip()

def collapse_whitespace(line):
    parts = line.split()
    return " ".join(parts)

Notice that line.split() without an argument splits on any whitespace (spaces, tabs, and newlines) and discards extra whitespace automatically.
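A quick check of this behavior:

```python
# split() with no argument treats any run of whitespace as one
# separator and ignores leading/trailing whitespace entirely.
line = "  hello\t\t world   again "
print(line.split())            # ['hello', 'world', 'again']
print(" ".join(line.split()))  # hello world again
```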

Optional: remove punctuation

Sometimes you want to remove punctuation to make matching easier (for example, turning “hello!” into “hello”). You can do this with string.punctuation.

import string

def remove_punct(line):
    table = str.maketrans("", "", string.punctuation)
    return line.translate(table)

This removes common ASCII punctuation. If you need to handle more complex Unicode punctuation, you would use a different approach, but this is a good beginner-friendly default.
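One possible approach for Unicode punctuation (an assumption on my part, not something the rest of the chapter depends on) is the standard unicodedata module, whose category() function returns codes starting with "P" for punctuation characters:

```python
import unicodedata

def remove_unicode_punct(line):
    # Keep only characters whose Unicode category is not punctuation.
    # Punctuation categories like "Po", "Pd", and "Pi" all begin with "P".
    return "".join(
        ch for ch in line
        if not unicodedata.category(ch).startswith("P")
    )

print(remove_unicode_punct("hello… “world”!"))  # hello world
```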

Optional: normalize case

Lowercasing is a common normalization step. It makes comparisons and filtering more consistent.

def to_lowercase(line):
    return line.lower()

Step 4: Implement Filtering Steps

Filtering is about deciding whether a line should be kept. A clean way to structure it is to write a function that returns True if the line should be kept and False if it should be removed.

def should_keep_line(line, config):
    if config.get("remove_empty_lines") and line == "":
        return False
    max_len = config.get("max_line_length")
    if max_len is not None and len(line) > max_len:
        return False
    banned = config.get("banned_substrings", [])
    for b in banned:
        if b in line:
            return False
    required = config.get("required_substrings", [])
    if required:
        found = False
        for r in required:
            if r in line:
                found = True
                break
        if not found:
            return False
    return True

Two important details:

  • Order matters: you usually want to remove empty lines early.
  • Case sensitivity: if you want case-insensitive filtering, apply lowercasing before filtering and also lowercase your banned/required substrings.
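For example, a case-insensitive version of the banned-substring check might look like this (a sketch; the helper name is illustrative):

```python
def line_is_banned_ci(line, banned_substrings):
    # Compare in lowercase so "TODO", "todo", and "ToDo" all match.
    lowered = line.lower()
    return any(b.lower() in lowered for b in banned_substrings)

print(line_is_banned_ci("todo: fix later", ["TODO"]))  # True
print(line_is_banned_ci("done", ["TODO"]))             # False
```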

Step 5: Implement De-duplication (Preserving Order)

Removing duplicates is a common cleaning step for lists (names, tags, tasks). You want to keep the first occurrence and remove later repeats. A set helps you remember what you have already seen.

def dedupe_preserve_order(lines):
    seen = set()
    result = []
    for line in lines:
        if line in seen:
            continue
        seen.add(line)
        result.append(line)
    return result

This works well when lines are not too large. If you need to dedupe based on a normalized version (for example, ignoring case), you can store a normalized key in seen while keeping the original line in result.
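A case-insensitive variant could look like this (the lowercased key is one illustrative choice of normalization):

```python
def dedupe_ignore_case(lines):
    # Remember a lowercased key for each line, but keep the
    # original line; the first occurrence wins.
    seen = set()
    result = []
    for line in lines:
        key = line.lower()
        if key in seen:
            continue
        seen.add(key)
        result.append(line)
    return result

print(dedupe_ignore_case(["Apples", "apples", "Bananas"]))
# ['Apples', 'Bananas']
```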

Step 6: Put It Together in a Single Cleaning Function

Now you will combine formatting and filtering into one function that returns cleaned text. The function will:

  • Split text into lines
  • Apply formatting steps to each line
  • Filter lines
  • Optionally dedupe
  • Join back into text
import string

def clean_text(text, config):
    lines = split_lines(text)
    cleaned_lines = []
    for line in lines:
        if config.get("strip_lines", True):
            line = strip_line(line)
        if config.get("collapse_spaces", False):
            line = collapse_whitespace(line)
        if config.get("remove_punctuation", False):
            line = remove_punct(line)
        if config.get("to_lower", False):
            line = to_lowercase(line)
        if should_keep_line(line, config):
            cleaned_lines.append(line)
    if config.get("dedupe_lines", False):
        cleaned_lines = dedupe_preserve_order(cleaned_lines)
    return "\n".join(cleaned_lines)

This is the heart of your text cleaner. It is short, readable, and easy to extend. If you want to add a new rule later (for example, “remove lines starting with #”), you add it in should_keep_line or as a formatting step.

Step 7: Try the Cleaner on Realistic Messy Text

Use a sample input that includes the kinds of problems you expect to see: extra spaces, blank lines, duplicates, and lines you want to remove.

messy = """   Apples   \n\nBananas\n  Carrots  \n# comment: ignore\nBananas\nTODO: later\n"""

config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": True,
    "to_lower": False,
    "remove_punctuation": False,
    "banned_substrings": ["#", "TODO"],
    "required_substrings": [],
    "max_line_length": None
}

print(clean_text(messy, config))

What happens here:

  • Leading/trailing spaces are removed.
  • Blank lines are removed.
  • The duplicate “Bananas” is removed because dedupe_lines is True.
  • Lines containing “#” or “TODO” are removed.

The printed result is three lines: Apples, Bananas, Carrots.

Building a Small Interactive Script (Menu of Cleaning Modes)

A text cleaner is often used in different ways depending on the task. Instead of editing the configuration every time, you can define a few “modes” and let the user choose. Each mode is just a different config dictionary.

Example modes:

  • Notes cleanup: remove empty lines, collapse spaces, keep punctuation.
  • Keyword filter: keep only lines containing certain words.
  • Plain word list: remove punctuation, lowercase, dedupe.
def get_mode_config(mode_name):
    if mode_name == "notes":
        return {
            "strip_lines": True,
            "remove_empty_lines": True,
            "collapse_spaces": True,
            "dedupe_lines": False,
            "to_lower": False,
            "remove_punctuation": False,
            "banned_substrings": [],
            "required_substrings": [],
            "max_line_length": None
        }
    if mode_name == "wordlist":
        return {
            "strip_lines": True,
            "remove_empty_lines": True,
            "collapse_spaces": True,
            "dedupe_lines": True,
            "to_lower": True,
            "remove_punctuation": True,
            "banned_substrings": [],
            "required_substrings": [],
            "max_line_length": None
        }
    if mode_name == "keep_emails":
        return {
            "strip_lines": True,
            "remove_empty_lines": True,
            "collapse_spaces": True,
            "dedupe_lines": True,
            "to_lower": True,
            "remove_punctuation": False,
            "banned_substrings": [],
            "required_substrings": ["@"],
            "max_line_length": None
        }
    raise ValueError("Unknown mode")

Then you can create a simple interaction where the user pastes text and chooses a mode. The exact input method depends on your environment, but the important part is that the cleaning logic stays the same and only the configuration changes.
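In a terminal, one way to sketch that interaction is to choose a mode by name and collect pasted lines until a blank one. The mode table below is deliberately shortened to keep the sketch self-contained; in the real script you would call get_mode_config instead.

```python
# Shortened stand-in for get_mode_config, so this sketch runs on its own.
MODES = {
    "notes": {"strip_lines": True, "remove_empty_lines": True},
    "wordlist": {"strip_lines": True, "to_lower": True},
}

def choose_mode(name):
    if name not in MODES:
        raise ValueError("Unknown mode: " + name)
    return MODES[name]

def read_text_until_blank():
    # Collect pasted lines until the user enters an empty line.
    print("Paste text, end with a blank line:")
    lines = []
    while True:
        line = input()
        if line == "":
            break
        lines.append(line)
    return "\n".join(lines)

config = choose_mode("notes")
print(config)
```

The cleaning logic never changes between modes; only the dictionary handed to clean_text does.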

More Advanced Filtering: Starts-With, Ends-With, and Contains

Substring checks are useful, but sometimes you want more specific rules:

  • Remove lines that start with a comment marker like # or //.
  • Remove lines that end with a certain suffix.
  • Keep only lines that contain a separator like : (useful for key-value notes).

You can extend should_keep_line with additional options:

def should_keep_line(line, config):
    if config.get("remove_empty_lines") and line == "":
        return False
    if config.get("remove_if_startswith"):
        for prefix in config["remove_if_startswith"]:
            if line.startswith(prefix):
                return False
    if config.get("keep_only_if_contains"):
        ok = False
        for token in config["keep_only_if_contains"]:
            if token in line:
                ok = True
                break
        if not ok:
            return False
    max_len = config.get("max_line_length")
    if max_len is not None and len(line) > max_len:
        return False
    banned = config.get("banned_substrings", [])
    for b in banned:
        if b in line:
            return False
    required = config.get("required_substrings", [])
    if required:
        found = False
        for r in required:
            if r in line:
                found = True
                break
        if not found:
            return False
    return True

This keeps the same structure while adding more expressive rules. When you add new options, choose names that clearly describe the behavior.

Formatting for Consistent Lists: Normalizing Bullets

Many text notes include bullets like - item, * item, or • item. If you want consistent output, you can normalize them to a single bullet style. This is a formatting step that runs before filtering.

def normalize_bullets(line):
    stripped = line.lstrip()
    bullets = ("- ", "* ", "• ")
    for b in bullets:
        if stripped.startswith(b):
            content = stripped[len(b):].strip()
            return "- " + content
    return line

If you add this step to clean_text, you can standardize lists from different sources. For example, you might apply it right after stripping and collapsing spaces.
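As one way to wire it in (a sketch: clean_lines is a simplified stand-in for the formatting loop inside clean_text, and the "normalize_bullets" option name is an assumption):

```python
def normalize_bullets(line):
    stripped = line.lstrip()
    for b in ("- ", "* ", "• "):
        if stripped.startswith(b):
            return "- " + stripped[len(b):].strip()
    return line

def clean_lines(lines, config):
    # Only the formatting part of the pipeline, showing where
    # bullet normalization fits: right after stripping.
    result = []
    for line in lines:
        if config.get("strip_lines", True):
            line = line.strip()
        if config.get("normalize_bullets", False):
            line = normalize_bullets(line)
        result.append(line)
    return result

print(clean_lines(["* milk", "• eggs", "- bread"], {"normalize_bullets": True}))
# ['- milk', '- eggs', '- bread']
```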

Practical Mini-Projects Using the Same Cleaner

1) Clean a pasted list of items for a shopping list

Goal: remove empty lines, collapse spaces, dedupe, keep punctuation, keep original case.

config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": True,
    "to_lower": False,
    "remove_punctuation": False,
    "banned_substrings": [],
    "required_substrings": [],
    "max_line_length": None
}

cleaned = clean_text(pasted_text, config)

2) Create a plain word list from notes

Goal: lowercase, remove punctuation, collapse spaces, remove empty lines, then dedupe. This is useful for building a vocabulary list or preparing tags.

config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": True,
    "to_lower": True,
    "remove_punctuation": True,
    "banned_substrings": [],
    "required_substrings": [],
    "max_line_length": None
}

cleaned = clean_text(pasted_text, config)

3) Filter log-like lines to keep only important ones

Goal: remove lines containing “DEBUG” or “heartbeat”, and remove very long lines that are likely stack traces.

config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": False,
    "to_lower": False,
    "remove_punctuation": False,
    "banned_substrings": ["DEBUG", "heartbeat"],
    "required_substrings": [],
    "max_line_length": 200
}

cleaned = clean_text(log_text, config)

Step-by-Step: Extending the Cleaner with a “Replace Map”

A common need is to replace certain characters or phrases: converting fancy quotes to normal quotes, turning tabs into spaces, or replacing “&” with “and”. You can implement a replace map as a dictionary of old to new strings and apply it to each line.

1) Add an option to the configuration

config = {
    # ... other options ...
    "replacements": {"\t": " ", "&": "and"}
}

2) Write a function that applies replacements

def apply_replacements(line, replacements):
    for old, new in replacements.items():
        line = line.replace(old, new)
    return line

3) Insert it into the pipeline

def clean_text(text, config):
    lines = split_lines(text)
    cleaned_lines = []
    for line in lines:
        if config.get("strip_lines", True):
            line = strip_line(line)
        replacements = config.get("replacements")
        if replacements:
            line = apply_replacements(line, replacements)
        if config.get("collapse_spaces", False):
            line = collapse_whitespace(line)
        if config.get("remove_punctuation", False):
            line = remove_punct(line)
        if config.get("to_lower", False):
            line = to_lowercase(line)
        if should_keep_line(line, config):
            cleaned_lines.append(line)
    if config.get("dedupe_lines", False):
        cleaned_lines = dedupe_preserve_order(cleaned_lines)
    return "\n".join(cleaned_lines)

This extension shows the main advantage of the pipeline approach: you can add a new capability with minimal changes, and the cleaner remains readable.

Testing Your Cleaner with Small, Focused Inputs

When you build a tool that transforms text, small tests help you avoid surprises. You can test each step with a tiny input and check the output manually. For example:

  • Whitespace test: input " a b " should become "a b" when collapsing spaces.
  • Punctuation test: input "hello, world!" should become "hello world" when removing punctuation.
  • Filtering test: input lines containing banned tokens should disappear.
  • Dedupe test: repeated lines should appear once.

Because the cleaner is built from small functions, you can test each function separately, and then test the full pipeline with a realistic sample.
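Such tests can be plain assert statements. Two of the helpers from earlier in the chapter are reproduced here so the example runs on its own:

```python
import string

def collapse_whitespace(line):
    return " ".join(line.split())

def remove_punct(line):
    table = str.maketrans("", "", string.punctuation)
    return line.translate(table)

# Whitespace test: runs of spaces collapse, edges are trimmed
assert collapse_whitespace("  a   b  ") == "a b"
# Punctuation test: ASCII punctuation disappears, spaces stay
assert remove_punct("hello, world!") == "hello world"
print("all tests passed")
```

If an assert fails, Python raises an AssertionError at that line, pointing you straight at the step that broke.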

Now answer the exercise about the content:

Why is designing a text cleaner as a pipeline of stages useful?

A pipeline breaks cleaning into small stages that take text in and return text out. This makes each step easier to understand and test, and you can add, remove, or reorder stages without rewriting the whole cleaner.

Next chapter

Parsing and Summarizing Data with a Simple CSV Reader
