What a “Text Cleaner” Does (and Why It’s Useful)
A text cleaner is a small program that takes messy text and produces a more consistent, usable version. “Messy” can mean many things: extra spaces, inconsistent line endings, random punctuation, mixed casing, unwanted characters, duplicated blank lines, or lines you want to remove (like comments, timestamps, or empty entries). “Usable” depends on your goal: maybe you want to prepare text for analysis, create a clean list of items, normalize user-submitted notes, or format content before saving it.
In this chapter you will build a practical text cleaner that can do two kinds of work:
- Formatting: normalize whitespace, unify line endings, standardize case, and optionally remove punctuation.
- Filtering: keep or remove lines based on rules (empty lines, duplicates, lines containing certain words, or lines matching a pattern).
The key idea is to treat text as something you can transform in stages. Each stage takes text in and returns text out. This “pipeline” approach makes the cleaner easy to extend: you can add a new stage without rewriting everything.
Designing the Cleaner as a Pipeline
Instead of writing one giant block of code that tries to do everything at once, you will create small transformation steps. Each step should be easy to understand and test. A typical pipeline might look like this:
- Normalize line endings and split into lines
- Trim whitespace on each line
- Remove unwanted lines (empty lines, duplicates, lines containing banned words)
- Normalize spacing inside lines
- Apply casing rules (lowercase, title case, etc.)
- Join lines back together
Not every project needs every step. The point is to have a structure where you can turn steps on or off.
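To make the pipeline idea concrete, here is a minimal sketch of text-in, text-out stages chained together. The step names are illustrative, not the ones the chapter builds later:

```python
# A minimal pipeline sketch: each step takes text in and returns text out.
# The step names here are illustrative; the chapter builds its own set later.

def normalize_endings(text):
    # Unify Windows/old-Mac line endings to "\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")

def trim_lines(text):
    # Strip leading/trailing whitespace on every line.
    return "\n".join(line.strip() for line in text.split("\n"))

def drop_empty_lines(text):
    # Remove lines that are empty after trimming.
    return "\n".join(line for line in text.split("\n") if line != "")

def run_pipeline(text, steps):
    # Apply each enabled step in order; turning a step off is as easy
    # as leaving it out of the list.
    for step in steps:
        text = step(text)
    return text

cleaned = run_pipeline("  a \r\n\r\n b ", [normalize_endings, trim_lines, drop_empty_lines])
print(cleaned)  # "a\nb"
```

Because each stage has the same shape (string in, string out), adding a new stage never requires touching the others.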
Step 1: Decide the Cleaning Rules
Before coding, decide what “clean” means for your situation. Here are common rules you can implement:
- Trim edges: remove leading and trailing spaces on each line.
- Collapse internal whitespace: turn multiple spaces/tabs into a single space.
- Remove empty lines: delete lines that become empty after trimming.
- Remove duplicate lines: keep only the first occurrence of each line.
- Filter by keywords: remove lines that contain certain words (like “TODO” or “DEBUG”).
- Keep only lines that match: for example, keep only lines that contain an “@” if you are extracting emails.
- Normalize case: lowercase everything, or apply title case.
- Remove punctuation: useful when you want plain words.
- Normalize bullet lists: convert different bullet styles into a single style.
In the build below, you will implement a flexible set of options so the same cleaner can be reused for different tasks.
Step 2: Create a Configuration Object
A cleaner becomes much more useful when you can change its behavior without rewriting code. A simple way is to store options in a dictionary. This chapter avoids re-teaching basics, so focus on how the options affect the pipeline.
Here is a configuration dictionary you can use:
```python
config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": False,
    "to_lower": False,
    "remove_punctuation": False,
    "banned_substrings": [],
    "required_substrings": [],
    "max_line_length": None,
}
```

Meaning of each option:
- strip_lines: trims each line.
- remove_empty_lines: removes empty lines after trimming.
- collapse_spaces: replaces runs of whitespace with a single space.
- dedupe_lines: removes duplicate lines while preserving order.
- to_lower: lowercases each line.
- remove_punctuation: removes punctuation characters.
- banned_substrings: if any of these appear in a line, remove the line.
- required_substrings: if provided, keep only lines that contain at least one of these.
- max_line_length: if set, remove lines longer than this number of characters.
Step 3: Implement Core Formatting Steps
Normalize line endings and split into lines
Text can contain Windows line endings (\r\n) or Unix line endings (\n). Python handles many cases, but it’s still useful to normalize so your logic is consistent.
```python
def split_lines(text):
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.split("\n")
```

Strip edges and collapse internal whitespace
Stripping removes spaces at the start and end. Collapsing internal whitespace is slightly different: it turns multiple spaces/tabs into one space. A reliable way is to split on any whitespace and join with a single space.
```python
def strip_line(line):
    return line.strip()

def collapse_whitespace(line):
    parts = line.split()
    return " ".join(parts)
```

Notice that line.split() without an argument splits on any whitespace (spaces, tabs) and also removes extra whitespace automatically.
Optional: remove punctuation
Sometimes you want to remove punctuation to make matching easier (for example, turning “hello!” into “hello”). You can do this with string.punctuation.
```python
import string

def remove_punct(line):
    table = str.maketrans("", "", string.punctuation)
    return line.translate(table)
```

This removes common ASCII punctuation. If you need to handle more complex Unicode punctuation, you would use a different approach, but this is a good beginner-friendly default.
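For completeness, one such Unicode-aware approach uses the standard library's unicodedata module: every character whose Unicode category starts with "P" is punctuation. This is a sketch, not something the rest of the chapter depends on:

```python
import unicodedata

def remove_punct_unicode(line):
    # Drop any character whose Unicode category starts with "P",
    # which covers curly quotes, dashes, and other non-ASCII punctuation.
    return "".join(ch for ch in line if not unicodedata.category(ch).startswith("P"))

print(remove_punct_unicode("\u201chello\u201d, world!"))  # hello world
```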
Optional: normalize case
Lowercasing is a common normalization step. It makes comparisons and filtering more consistent.
```python
def to_lowercase(line):
    return line.lower()
```

Step 4: Implement Filtering Steps
Filtering is about deciding whether a line should be kept. A clean way to structure it is to write a function that returns True if the line should be kept and False if it should be removed.
```python
def should_keep_line(line, config):
    if config.get("remove_empty_lines") and line == "":
        return False

    max_len = config.get("max_line_length")
    if max_len is not None and len(line) > max_len:
        return False

    banned = config.get("banned_substrings", [])
    for b in banned:
        if b in line:
            return False

    required = config.get("required_substrings", [])
    if required:
        found = False
        for r in required:
            if r in line:
                found = True
                break
        if not found:
            return False

    return True
```

Two important details:
- Order matters: you usually want to remove empty lines early.
- Case sensitivity: if you want case-insensitive filtering, apply lowercasing before filtering and also lowercase your banned/required substrings.
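As a sketch of the case-sensitivity point, one option is a helper that lowercases both the line and the substrings before comparing (the function name here is illustrative):

```python
def contains_banned(line, banned_substrings):
    # Case-insensitive check: lowercase both sides before comparing.
    low = line.lower()
    return any(b.lower() in low for b in banned_substrings)

print(contains_banned("Todo: fix later", ["TODO"]))  # True
print(contains_banned("All done", ["TODO"]))         # False
```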
Step 5: Implement De-duplication (Preserving Order)
Removing duplicates is a common cleaning step for lists (names, tags, tasks). You want to keep the first occurrence and remove later repeats. A set helps you remember what you have already seen.
```python
def dedupe_preserve_order(lines):
    seen = set()
    result = []
    for line in lines:
        if line in seen:
            continue
        seen.add(line)
        result.append(line)
    return result
```

This works well when lines are not too large. If you need to dedupe based on a normalized version (for example, ignoring case), you can store a normalized key in seen while keeping the original line in result.
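That case-insensitive variant can be sketched like this: the lowercased key goes into seen, while the original spelling is kept in result:

```python
def dedupe_ignore_case(lines):
    # Keep the first occurrence of each line, comparing case-insensitively,
    # but preserve the original spelling of the kept line.
    seen = set()
    result = []
    for line in lines:
        key = line.lower()  # normalized key used only for comparison
        if key in seen:
            continue
        seen.add(key)
        result.append(line)
    return result

print(dedupe_ignore_case(["Bananas", "bananas", "Apples"]))  # ['Bananas', 'Apples']
```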
Step 6: Put It Together in a Single Cleaning Function
Now you will combine formatting and filtering into one function that returns cleaned text. The function will:
- Split text into lines
- Apply formatting steps to each line
- Filter lines
- Optionally dedupe
- Join back into text
```python
import string

def clean_text(text, config):
    lines = split_lines(text)
    cleaned_lines = []
    for line in lines:
        if config.get("strip_lines", True):
            line = strip_line(line)
        if config.get("collapse_spaces", False):
            line = collapse_whitespace(line)
        if config.get("remove_punctuation", False):
            line = remove_punct(line)
        if config.get("to_lower", False):
            line = to_lowercase(line)
        if should_keep_line(line, config):
            cleaned_lines.append(line)
    if config.get("dedupe_lines", False):
        cleaned_lines = dedupe_preserve_order(cleaned_lines)
    return "\n".join(cleaned_lines)
```

This is the heart of your text cleaner. It is short, readable, and easy to extend. If you want to add a new rule later (for example, “remove lines starting with #”), you add it in should_keep_line or as a formatting step.
Step 7: Try the Cleaner on Realistic Messy Text
Use a sample input that includes the kinds of problems you expect to see: extra spaces, blank lines, duplicates, and lines you want to remove.
```python
messy = """ Apples \n\nBananas\n Carrots \n# comment: ignore\nBananas\nTODO: later\n"""

config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": True,
    "to_lower": False,
    "remove_punctuation": False,
    "banned_substrings": ["#", "TODO"],
    "required_substrings": [],
    "max_line_length": None,
}

print(clean_text(messy, config))
```

What happens here:
- Leading/trailing spaces are removed.
- Blank lines are removed.
- Duplicate “Bananas” is removed because dedupe_lines is True.
- Lines containing “#” or “TODO” are removed.
Building a Small Interactive Script (Menu of Cleaning Modes)
A text cleaner is often used in different ways depending on the task. Instead of editing the configuration every time, you can define a few “modes” and let the user choose. Each mode is just a different config dictionary.
Example modes:
- Notes cleanup: remove empty lines, collapse spaces, keep punctuation.
- Keyword filter: keep only lines containing certain words.
- Plain word list: remove punctuation, lowercase, dedupe.
```python
def get_mode_config(mode_name):
    if mode_name == "notes":
        return {
            "strip_lines": True,
            "remove_empty_lines": True,
            "collapse_spaces": True,
            "dedupe_lines": False,
            "to_lower": False,
            "remove_punctuation": False,
            "banned_substrings": [],
            "required_substrings": [],
            "max_line_length": None,
        }
    if mode_name == "wordlist":
        return {
            "strip_lines": True,
            "remove_empty_lines": True,
            "collapse_spaces": True,
            "dedupe_lines": True,
            "to_lower": True,
            "remove_punctuation": True,
            "banned_substrings": [],
            "required_substrings": [],
            "max_line_length": None,
        }
    if mode_name == "keep_emails":
        return {
            "strip_lines": True,
            "remove_empty_lines": True,
            "collapse_spaces": True,
            "dedupe_lines": True,
            "to_lower": True,
            "remove_punctuation": False,
            "banned_substrings": [],
            "required_substrings": ["@"],
            "max_line_length": None,
        }
    raise ValueError("Unknown mode")
```

Then you can create a simple interaction where the user pastes text and chooses a mode. The exact input method depends on your environment, but the important part is that the cleaning logic stays the same and only the configuration changes.
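Such an interaction might look like the sketch below. The MODES table here is a simplified stand-in for get_mode_config, and the fallback-to-notes behavior is an illustrative design choice; in the real script you would pass the chosen config on to clean_text:

```python
# Simplified stand-in for get_mode_config: each mode is just a config dict.
MODES = {
    "notes": {"strip_lines": True, "remove_empty_lines": True, "dedupe_lines": False},
    "wordlist": {"strip_lines": True, "remove_empty_lines": True, "dedupe_lines": True},
}

def choose_mode(name):
    # Fall back to "notes" for unknown names instead of crashing,
    # a friendlier choice for an interactive script.
    return MODES.get(name, MODES["notes"])

mode = "wordlist"  # in an interactive script this would come from input()
config = choose_mode(mode)
print("Using config:", config)
```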
More Advanced Filtering: Starts-With, Ends-With, and Contains
Substring checks are useful, but sometimes you want more specific rules:
- Remove lines that start with a comment marker like # or //.
- Remove lines that end with a certain suffix.
- Keep only lines that contain a separator like : (useful for key-value notes).
You can extend should_keep_line with additional options:
```python
def should_keep_line(line, config):
    if config.get("remove_empty_lines") and line == "":
        return False

    if config.get("remove_if_startswith"):
        for prefix in config["remove_if_startswith"]:
            if line.startswith(prefix):
                return False

    if config.get("keep_only_if_contains"):
        ok = False
        for token in config["keep_only_if_contains"]:
            if token in line:
                ok = True
                break
        if not ok:
            return False

    max_len = config.get("max_line_length")
    if max_len is not None and len(line) > max_len:
        return False

    banned = config.get("banned_substrings", [])
    for b in banned:
        if b in line:
            return False

    required = config.get("required_substrings", [])
    if required:
        found = False
        for r in required:
            if r in line:
                found = True
                break
        if not found:
            return False

    return True
```

This keeps the same structure while adding more expressive rules. When you add new options, choose names that clearly describe the behavior.
Formatting for Consistent Lists: Normalizing Bullets
Many text notes include bullets like - item, * item, or • item. If you want consistent output, you can normalize them to a single bullet style. This is a formatting step that runs before filtering.
```python
def normalize_bullets(line):
    stripped = line.lstrip()
    bullets = ("- ", "* ", "• ")
    for b in bullets:
        if stripped.startswith(b):
            content = stripped[len(b):].strip()
            return "- " + content
    return line
```

If you add this step to clean_text, you can standardize lists from different sources. For example, you might apply it right after stripping and collapsing spaces.
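To see the effect in isolation, here is the same function run over a few mixed-style bullets (reproduced here so the snippet runs on its own):

```python
def normalize_bullets(line):
    # Same function as above: rewrite -, *, and • bullets to "- ".
    stripped = line.lstrip()
    bullets = ("- ", "* ", "• ")
    for b in bullets:
        if stripped.startswith(b):
            content = stripped[len(b):].strip()
            return "- " + content
    return line

for line in ["* milk", "• eggs", "- bread", "plain line"]:
    print(normalize_bullets(line))
# - milk
# - eggs
# - bread
# plain line
```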
Practical Mini-Projects Using the Same Cleaner
1) Clean a pasted list of items for a shopping list
Goal: remove empty lines, collapse spaces, dedupe, keep punctuation, keep original case.
```python
config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": True,
    "to_lower": False,
    "remove_punctuation": False,
    "banned_substrings": [],
    "required_substrings": [],
    "max_line_length": None,
}
cleaned = clean_text(pasted_text, config)
```

2) Create a plain word list from notes
Goal: lowercase, remove punctuation, collapse spaces, remove empty lines, then dedupe. This is useful for building a vocabulary list or preparing tags.
```python
config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": True,
    "to_lower": True,
    "remove_punctuation": True,
    "banned_substrings": [],
    "required_substrings": [],
    "max_line_length": None,
}
cleaned = clean_text(pasted_text, config)
```

3) Filter log-like lines to keep only important ones
Goal: remove lines containing “DEBUG” or “heartbeat”, and remove very long lines that are likely stack traces.
```python
config = {
    "strip_lines": True,
    "remove_empty_lines": True,
    "collapse_spaces": True,
    "dedupe_lines": False,
    "to_lower": False,
    "remove_punctuation": False,
    "banned_substrings": ["DEBUG", "heartbeat"],
    "required_substrings": [],
    "max_line_length": 200,
}
cleaned = clean_text(log_text, config)
```

Step-by-Step: Extending the Cleaner with a “Replace Map”
A common need is to replace certain characters or phrases: converting fancy quotes to normal quotes, turning tabs into spaces, or replacing “&” with “and”. You can implement a replace map as a dictionary of old to new strings and apply it to each line.
1) Add an option to the configuration
```python
config = {
    # ... other options ...
    "replacements": {"\t": " ", "&": "and"},
}
```

2) Write a function that applies replacements
```python
def apply_replacements(line, replacements):
    for old, new in replacements.items():
        line = line.replace(old, new)
    return line
```

3) Insert it into the pipeline
```python
def clean_text(text, config):
    lines = split_lines(text)
    cleaned_lines = []
    for line in lines:
        if config.get("strip_lines", True):
            line = strip_line(line)
        replacements = config.get("replacements")
        if replacements:
            line = apply_replacements(line, replacements)
        if config.get("collapse_spaces", False):
            line = collapse_whitespace(line)
        if config.get("remove_punctuation", False):
            line = remove_punct(line)
        if config.get("to_lower", False):
            line = to_lowercase(line)
        if should_keep_line(line, config):
            cleaned_lines.append(line)
    if config.get("dedupe_lines", False):
        cleaned_lines = dedupe_preserve_order(cleaned_lines)
    return "\n".join(cleaned_lines)
```

This extension shows the main advantage of the pipeline approach: you can add a new capability with minimal changes, and the cleaner remains readable.
Testing Your Cleaner with Small, Focused Inputs
When you build a tool that transforms text, small tests help you avoid surprises. You can test each step with a tiny input and check the output manually. For example:
- Whitespace test: input " a b " should become "a b" when collapsing spaces.
- Punctuation test: input "hello, world!" should become "hello world" when removing punctuation.
- Filtering test: input lines containing banned tokens should disappear.
- Dedupe test: repeated lines should appear once.
Because the cleaner is built from small functions, you can test each function separately, and then test the full pipeline with a realistic sample.
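Reproducing two of the small helpers from earlier so the snippet runs on its own, these focused checks can be written as plain assert statements, a lightweight alternative to a test framework:

```python
import string

def collapse_whitespace(line):
    # Same helper as earlier: split on any whitespace, rejoin with one space.
    return " ".join(line.split())

def remove_punct(line):
    # Same helper as earlier: strip ASCII punctuation.
    return line.translate(str.maketrans("", "", string.punctuation))

# Whitespace test: edges trimmed, internal runs collapsed.
assert collapse_whitespace(" a   b ") == "a b"
# Punctuation test: comma and exclamation mark removed.
assert remove_punct("hello, world!") == "hello world"
print("all tests passed")
```

If an assert fails, Python raises an AssertionError pointing at the exact check that broke, which is usually enough feedback for functions this small.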