Why Data Is Called “Fuel”
AI systems don’t “learn” in the way humans do. They adjust internal parameters so that, given an input, they produce an output that matches patterns found in examples. Those examples are data. If the data is relevant, consistent, and representative of the real world, the AI can generalize well. If the data is messy, biased, or mislabeled, the AI will often fail in predictable ways.
Thinking of data as fuel is useful because it highlights three practical realities: (1) you need enough of it to power learning, (2) it must be the right type for the engine (the model), and (3) impurities (errors, bias, leakage) can damage performance. In practice, many AI projects succeed or fail more because of data choices than because of model choices.
What Counts as “Data” in AI Projects
Data can be many things, depending on the task. The key is that it must be something the model can consume as input, and (for many tasks) something you can compare against as the desired output.
Common data types
Tabular data: rows and columns, like spreadsheets (customer age, plan type, monthly usage).
Text: emails, reviews, support tickets, chat logs, documents.
Images: photos, scans, medical images, product images.
Audio: speech recordings, call center audio, machine sounds.
Video: security footage, sports clips, manufacturing line recordings.
Time series: sensor readings over time, stock prices, heart rate signals.
Logs and events: clicks, app events, server logs.
Features vs. labels (inputs vs. targets)
In many projects you can separate data into:
Features (inputs): what you feed into the model (e.g., the text of an email, or a customer’s usage stats).
Labels (targets): what you want the model to predict (e.g., “spam/not spam”, “will churn/won’t churn”, “defective/not defective”).
Not every AI approach needs labels, but when labels exist, they strongly shape what the model learns. If labels are wrong or inconsistent, the model will learn the wrong mapping even if the input data is perfect.
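To make the split concrete, here is a minimal sketch in Python of one churn example; the field names and values are invented for illustration.

# One training example for a churn model: the features are inputs known today,
# the label is the outcome observed 30 days later.
features = {
    "account_age_days": 412,
    "avg_daily_usage_minutes": 18.5,
    "open_support_tickets": 2,
    "had_billing_issue_last_90d": True,
}
label = "churned"  # or "did_not_churn"
print(features, "->", label)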
Examples: How Data Connects to Real Use Cases
It’s easier to understand data requirements by looking at concrete tasks. Below are several common AI use cases and what “good data” looks like for each.
Example 1: Email spam detection (text classification)
Input: email subject + body text, sender domain, metadata.
Label: spam or not spam.
Quality pitfalls: labels based only on user complaints may be noisy; spam evolves over time; “promotional but legitimate” emails may be inconsistently labeled.
Practical implication: you need ongoing data refresh and clear labeling rules (what counts as spam vs. marketing vs. phishing).
Example 2: Predicting customer churn (tabular prediction)
Input: account age, usage trends, support ticket count, billing issues.
Label: churned within the next 30 days (yes/no).
Quality pitfalls: “churn” definition changes (canceled vs. inactive); missing data for certain customer segments; leakage (features that only exist after churn happens).
Practical implication: define churn precisely and ensure features are available at prediction time.
Example 3: Detecting defects in manufacturing (image classification or detection)
Input: images of products from a consistent camera setup.
Label: defect type or “no defect”; sometimes bounding boxes around defect regions.
Quality pitfalls: lighting changes; rare defect types; inconsistent camera angles; labelers disagree on borderline defects.
Practical implication: standardize image capture and invest in consistent labeling guidelines, especially for edge cases.
Example 4: Speech-to-text for call centers (audio + text)
Input: audio recordings.
Label: correct transcript (and sometimes speaker turns).
Quality pitfalls: background noise; accents; domain-specific terms; privacy constraints; transcription inconsistencies.
Practical implication: include representative accents/noise conditions and maintain a glossary for domain terms.
Labels: What They Are and Why They Matter
A label is the “answer key” for learning. In supervised learning tasks, the model learns by comparing its prediction to the label and adjusting itself to reduce error. If the answer key is unreliable, training becomes like studying from a textbook full of mistakes.
Different kinds of labels
Binary labels: yes/no (fraud/not fraud).
Multi-class labels: one of many categories (issue type: billing, technical, account).
Multi-label: multiple tags can apply at once (an image can contain both “cat” and “dog”).
Continuous labels: a number (house price, delivery time).
Structured labels: bounding boxes, segmentation masks, keypoints, sequences (transcripts).
Where labels come from
Human annotation: people read, watch, or inspect and assign labels.
Business systems: outcomes recorded naturally (refund issued, payment failed, ticket resolved).
Heuristics/rules: proxy labels (e.g., “if chargeback occurred, label as fraud”).
Weak supervision: combining multiple noisy sources to create approximate labels.
Each source has trade-offs. Human labels can be high quality but expensive. System labels are cheap but may reflect business processes rather than ground truth. Rule-based labels can scale but often bake in bias and miss edge cases.
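As a rough illustration of rule-based and weak labeling, the sketch below combines two noisy signals into an approximate fraud label. The field names and thresholds are invented, and real weak-supervision tooling is more sophisticated, but the basic idea is the same.

def proxy_fraud_label(transaction):
    """Approximate label from noisy rules; treat the output as a proxy, not ground truth."""
    votes = 0
    # Rule 1: a chargeback is strong but imperfect evidence of fraud.
    if transaction.get("chargeback_occurred"):
        votes += 1
    # Rule 2: large amounts from brand-new accounts are suspicious.
    if transaction["amount"] > 5000 and transaction["account_age_days"] < 7:
        votes += 1
    # If no rule fires we default to "not fraud" -- exactly where bias can creep in.
    return "fraud" if votes >= 1 else "not_fraud"

print(proxy_fraud_label({"chargeback_occurred": False, "amount": 9000, "account_age_days": 3}))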
Label Quality: The Most Common Problems
1) Inconsistent definitions
If two labelers interpret “toxic comment” differently, the dataset becomes internally contradictory. The model will learn a blurred boundary and perform unpredictably.
Fix: create a labeling guide with examples and counterexamples, and update it when new edge cases appear.
2) Class imbalance
Some events are rare (fraud, severe defects). If 99.5% of examples are “not fraud,” a model can appear accurate by always predicting “not fraud.”
Fix: collect more positive examples, use targeted sampling, and evaluate with metrics that reflect the rare class (precision/recall) rather than only overall accuracy.
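The numbers below show why accuracy alone misleads. This is a self-contained sketch with made-up data: a "model" that always predicts the majority class scores 99.5% accuracy while catching zero fraud.

# 1,000 examples, 5 of them fraud (label 1); the model always predicts "not fraud" (0).
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # fraud correctly caught
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # fraud missed
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}")
# accuracy=0.995 looks impressive, but recall=0.000: every fraud case is missed.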
3) Label noise and human error
People make mistakes, especially when tasks are repetitive or ambiguous. Noise can be random (accidental misclick) or systematic (a labeler misunderstands a rule).
Fix: double-label a subset, measure agreement, and review disagreements to improve guidelines.
4) Proxy labels that don’t match the real goal
Sometimes you label what is easy to measure, not what you truly care about. For example, using “customer called support” as a label for “customer is unhappy” misses customers who are unhappy but never call.
Fix: validate proxies against a smaller set of high-quality ground truth labels.
5) Data leakage through labels
Leakage happens when the dataset accidentally includes information that wouldn’t be available at prediction time, making the model look better during testing than it will be in real use.
Example: predicting “will churn next month” while including a feature like “account cancellation date” or “final invoice issued.” The model learns to “cheat.”
Fix: enforce time-based rules: features must come from before the prediction point.
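A minimal sketch of that time rule, assuming each raw event carries a timestamp (the event types here are invented): only events recorded before the prediction point may become features.

from datetime import date

def build_features(events, prediction_date):
    # Keep only events that happened strictly before the prediction date.
    usable = [e for e in events if e["timestamp"] < prediction_date]
    return {
        "num_support_tickets": sum(e["type"] == "support_ticket" for e in usable),
        "num_billing_issues": sum(e["type"] == "billing_issue" for e in usable),
    }

events = [
    {"type": "support_ticket", "timestamp": date(2024, 1, 10)},
    {"type": "billing_issue", "timestamp": date(2024, 3, 2)},  # after the cutoff: excluded
]
print(build_features(events, prediction_date=date(2024, 2, 1)))
# {'num_support_tickets': 1, 'num_billing_issues': 0}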
Data Quality Beyond Labels: What “Good” Looks Like
Even with perfect labels, input data can cause problems. Data quality is not just “clean vs. dirty”; it’s about whether the dataset matches the real environment where the model will be used.
Representativeness
Your dataset should reflect the variety of real-world cases: different user groups, lighting conditions, device types, languages, seasons, and behaviors. If you train only on “easy” cases, the model will fail on the messy ones.
Example: a product photo model trained only on studio images may fail on customer-uploaded photos with cluttered backgrounds.
Coverage of edge cases
Edge cases are rare but important scenarios: unusual inputs, borderline categories, or failure modes that matter a lot (e.g., safety-related misclassifications). Good datasets intentionally include them.
Consistency and standardization
In tabular data, the same concept should be recorded the same way. If “United States,” “USA,” and “US” appear as separate values, the model may treat them as different categories.
In images, consistent camera settings reduce unwanted variation. In text, consistent formatting and encoding reduce parsing issues.
Timeliness and drift
Data changes over time. Customer behavior shifts, fraud patterns evolve, product designs change, and language trends move. A model trained on last year’s data may degrade if the world changes.
Practical implication: plan for periodic retraining and monitoring, not a one-time dataset build.
Privacy and sensitive data handling
High-quality data is also responsibly handled data. If you collect more personal data than necessary, you increase risk without improving performance. For many tasks, you can minimize or anonymize sensitive fields while still achieving good results.
Practical implication: decide which fields are truly needed, restrict access, and document how data is used.
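As a rough sketch of field minimization (the field names are invented), you can drop fields the model does not need and replace the direct identifier with a one-way hash before data reaches the training pipeline. Note that hashing is pseudonymization rather than full anonymization, so access controls still matter.

import hashlib

SENSITIVE_FIELDS = {"full_name", "email", "phone"}  # fields the model does not need

def minimize(record):
    cleaned = {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}
    # Replace the raw customer ID with a stable, non-reversible key so rows can still be joined.
    cleaned["customer_key"] = hashlib.sha256(record["customer_id"].encode()).hexdigest()[:16]
    del cleaned["customer_id"]
    return cleaned

print(minimize({"customer_id": "C123", "email": "a@b.com", "plan": "pro", "monthly_usage": 42}))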
Step-by-Step: Building a Small, High-Quality Labeled Dataset
This workflow applies whether you’re labeling text, images, or tabular outcomes. The goal is to create a dataset that is not only large enough, but also trustworthy.
Step 1: Define the prediction task precisely
Write a one-sentence statement: “Given X, predict Y at time T.”
Specify what counts as each label, including borderline cases.
List what the model will and will not be used for (scope boundaries).
Example for churn: “Given customer activity up to today, predict whether the customer will cancel within the next 30 days.”
Step 2: Identify the data sources and the unit of analysis
Decide what one “example” is: one email, one customer-month, one product image, one call recording.
List where inputs come from (databases, logs, sensors) and how they will be joined.
Check for missingness patterns (are some groups systematically missing data?).
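A minimal sketch of a missingness check by segment (the rows are invented): count how often a key field is empty within each customer group, since a gap concentrated in one group is a design problem, not random noise.

from collections import defaultdict

# Rows from a hypothetical customer table; None means the value is missing.
rows = [
    {"segment": "enterprise", "monthly_usage": 120},
    {"segment": "enterprise", "monthly_usage": 95},
    {"segment": "small_business", "monthly_usage": None},
    {"segment": "small_business", "monthly_usage": None},
]

missing, total = defaultdict(int), defaultdict(int)
for row in rows:
    total[row["segment"]] += 1
    if row["monthly_usage"] is None:
        missing[row["segment"]] += 1

for segment in total:
    print(segment, f"{missing[segment] / total[segment]:.0%} missing")
# A 100% missing rate in one segment is a red flag worth investigating.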
Step 3: Create labeling guidelines (a “labeling playbook”)
Define each label in plain language.
Provide at least 5 positive examples and 5 negative examples.
Include a section called “Common confusions” with guidance on how to decide borderline cases.
Define what to do when uncertain (e.g., use an “unclear” tag for review).
For image defects, include annotated example images showing what counts as a defect and what is acceptable variation.
Step 4: Run a pilot labeling round
Label a small batch (e.g., 100–300 items).
Use at least two labelers for a subset (e.g., 20–30%) to measure agreement.
Collect notes on unclear cases and update the playbook.
The pilot is where you discover hidden ambiguity. It is cheaper to fix definitions now than after labeling 50,000 items.
Step 5: Set up quality checks
Spot checks: review random labeled items daily.
Gold set: a small set of items with trusted labels used to evaluate labelers (see the sketch after this list).
Disagreement review: a process to resolve conflicts and refine rules.
Outlier detection: look for labelers who label unusually fast or with unusual distributions.
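The gold-set check above is easy to sketch. In this minimal version (the labeler names, items, and 80% threshold are all invented), each labeler is scored against a handful of trusted answers and flagged for review if they fall well below expectations.

# Trusted labels for a handful of "gold" items.
gold = {"item_1": "spam", "item_2": "not_spam", "item_3": "spam", "item_4": "not_spam"}

# Labels submitted by each labeler on those same items.
submissions = {
    "labeler_a": {"item_1": "spam", "item_2": "not_spam", "item_3": "spam", "item_4": "not_spam"},
    "labeler_b": {"item_1": "not_spam", "item_2": "not_spam", "item_3": "not_spam", "item_4": "not_spam"},
}

for labeler, answers in submissions.items():
    score = sum(answers[item] == truth for item, truth in gold.items()) / len(gold)
    status = "review" if score < 0.8 else "ok"
    print(f"{labeler}: {score:.0%} on gold set ({status})")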
Step 6: Label at scale with feedback loops
Label in batches and monitor label distributions (did “spam” suddenly jump from 20% to 80%?).
Hold weekly calibration sessions: labelers discuss tricky examples and align decisions.
Version your guidelines so you know which rules were used for which batch.
Step 7: Split data correctly for evaluation
To estimate real-world performance, you need a test set that reflects future usage.
Random split: common, but can hide leakage if similar items appear in both train and test.
Time-based split: train on earlier data, test on later data; useful when the world changes over time.
Group-based split: keep all data from the same customer/device in one split to avoid “memorization” (sketched below).
Choose the split that matches how the model will be used.
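A minimal sketch of a group-based split (the customer IDs are invented): hashing the customer ID sends every example from the same customer to the same side, so the test set never contains customers the model trained on.

import hashlib

def split_by_group(examples, test_percent=20):
    """Deterministically assign every example from the same customer to train or test."""
    train, test = [], []
    for ex in examples:
        # Hash the group key so the assignment is stable across runs.
        bucket = int(hashlib.md5(ex["customer_id"].encode()).hexdigest(), 16) % 100
        (test if bucket < test_percent else train).append(ex)
    return train, test

examples = [{"customer_id": f"C{i % 10}", "text": f"ticket {i}"} for i in range(50)]
train, test = split_by_group(examples)
print(len(train), len(test))
# All five tickets from any single customer land in exactly one of the two sets.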
Practical Examples of Data Quality Checks
Check 1: Simple sanity checks for tabular data
Are there impossible values (negative age, future dates)?
Are units consistent (minutes vs. seconds)?
Do categories have duplicates (“NY”, “New York”, “newyork”)?
Are there sudden spikes that indicate logging bugs?
# Example pseudo-checks (conceptual, not tied to a specific tool)
if age < 0 or age > 120: flag
if signup_date > today: flag
if duration_seconds > 86400: flag
if country in {"US", "USA", "United States"}: normalize_to("US")

Check 2: Label distribution and drift
Track how labels change over time. If “fraud” rate doubles overnight, it might be a real attack, or it might be a labeling rule change or system bug.
Plot label counts per week.
Compare distributions across regions, devices, or product lines.
Investigate sudden changes with a sample review.
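A minimal sketch of the weekly tracking (the labeled items are invented): group items by ISO week and print the fraud rate so that a sudden jump stands out.

from collections import defaultdict
from datetime import date

labeled = [
    {"label": "fraud", "labeled_on": date(2024, 5, 6)},
    {"label": "not_fraud", "labeled_on": date(2024, 5, 7)},
    {"label": "fraud", "labeled_on": date(2024, 5, 14)},
    {"label": "fraud", "labeled_on": date(2024, 5, 15)},
]

counts = defaultdict(lambda: {"fraud": 0, "total": 0})
for item in labeled:
    year, week, _ = item["labeled_on"].isocalendar()
    counts[(year, week)]["total"] += 1
    counts[(year, week)]["fraud"] += item["label"] == "fraud"

for (year, week), c in sorted(counts.items()):
    print(f"{year}-W{week}: fraud rate {c['fraud'] / c['total']:.0%}")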
Check 3: Inter-annotator agreement for human labels
When two people label the same item, how often do they agree? Low agreement suggests unclear definitions or a task that is inherently subjective.
Start with percent agreement as a simple measure.
Review disagreements and categorize why they happened (missing context, unclear rule, ambiguous input).
The goal is not perfect agreement; it is stable, explainable labeling that matches the intended use.
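A minimal sketch of percent agreement between two labelers (the labels are invented). Chance-corrected statistics such as Cohen's kappa are a common next step, but percent agreement plus a review of the disagreements is a solid start.

# Labels assigned by two people to the same ten comments.
labeler_1 = ["toxic", "ok", "ok", "toxic", "ok", "ok", "toxic", "ok", "ok", "ok"]
labeler_2 = ["toxic", "ok", "toxic", "toxic", "ok", "ok", "ok", "ok", "ok", "ok"]

agreements = sum(a == b for a, b in zip(labeler_1, labeler_2))
print(f"Percent agreement: {agreements / len(labeler_1):.0%}")

# List the disagreements so they can be reviewed and categorized.
for i, (a, b) in enumerate(zip(labeler_1, labeler_2)):
    if a != b:
        print(f"item {i}: labeler_1={a}, labeler_2={b}")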
Check 4: Dataset “slices” to find hidden failures
Overall accuracy can hide poor performance on specific groups or conditions. Evaluate on slices such as:
Short vs. long texts
Low-light vs. normal-light images
New customers vs. long-term customers
Different languages or regions
If performance is much worse on a slice, you may need more data for that slice or different labeling rules.
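A minimal sketch of slice-based evaluation (the predictions are invented): compute accuracy separately for short and long texts instead of relying on one overall number.

# Each record: the true label, the model's prediction, and the text length in words.
results = [
    {"true": "spam", "pred": "spam", "length": 5},
    {"true": "spam", "pred": "not_spam", "length": 4},
    {"true": "not_spam", "pred": "not_spam", "length": 120},
    {"true": "spam", "pred": "spam", "length": 200},
]

def accuracy(records):
    return sum(r["true"] == r["pred"] for r in records) / len(records) if records else float("nan")

short_texts = [r for r in results if r["length"] < 20]
long_texts = [r for r in results if r["length"] >= 20]

print(f"overall: {accuracy(results):.0%}")        # 75%
print(f"short texts: {accuracy(short_texts):.0%}")  # 50%: a weakness the overall number hides
print(f"long texts: {accuracy(long_texts):.0%}")    # 100%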
How Much Data Do You Need?
There is no universal number. The amount depends on task complexity, noise, and how varied the real world is. However, you can make practical decisions without guessing blindly.
Use a learning curve approach
A learning curve shows performance as you add more labeled data. If performance improves quickly and then plateaus, you may be near the limit of what the current features and labels can provide.
Label a small set (e.g., 500 items), train and evaluate.
Increase to 1,000, 2,000, 5,000, and track improvement.
If gains are small, consider improving label quality, adding better features, or redefining the task.
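A minimal sketch of that loop, where train_and_evaluate is a stand-in for your real training and validation code (the plateau it produces is simulated, not measured).

import random

def train_and_evaluate(labeled_items):
    """Stand-in for training a model and scoring it on a fixed validation set."""
    # Fake diminishing returns: the score plateaus as the labeled set grows.
    return round(0.9 - 0.4 / (1 + len(labeled_items) / 1000), 3)

all_labeled = list(range(10_000))  # placeholders for labeled examples
random.shuffle(all_labeled)        # simulate labeling a random subset first

for n in [500, 1000, 2000, 5000, 10_000]:
    score = train_and_evaluate(all_labeled[:n])
    print(f"{n:>6} labeled items -> score {score}")
# If the score barely moves between 5,000 and 10,000 items, more of the same data
# is unlikely to help; better labels or features probably will.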
Prioritize “informative” data
Not all data points are equally useful. Data that covers rare cases, confusing boundaries, and real deployment conditions often provides more value than more of the same easy examples.
Example: for defect detection, 200 images of rare defect types may be more valuable than 5,000 images of perfect products.
Data Documentation: Making Datasets Understandable
High-quality datasets are not just files; they are documented artifacts. Without documentation, teams forget what labels mean, how data was collected, and what limitations exist.
What to document (practical checklist)
Purpose: what problem the dataset supports.
Data sources: where it came from and time range.
Unit of analysis: what one row/item represents.
Label definitions: rules used, including edge cases.
Known limitations: missing segments, noisy fields, known biases.
Privacy handling: what was removed or anonymized, access controls.
Versioning: dataset version, guideline version, and changes over time.
This documentation helps future you (and your teammates) avoid repeating mistakes and makes model behavior easier to explain.
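One lightweight way to keep this checklist next to the data is a small, versioned record stored alongside the dataset. The structure below is only an example (the names and values are invented), not a standard format.

dataset_card = {
    "name": "customer-churn-30d-v2",
    "purpose": "Predict whether a customer cancels within 30 days of the prediction date.",
    "sources": "Billing database and product usage logs, 2023-01 through 2024-06.",
    "unit_of_analysis": "One customer-month.",
    "label_definitions": "Churn = account canceled within 30 days; see labeling playbook v1.2 for edge cases.",
    "known_limitations": "Sparse usage data for the enterprise segment before 2023-06.",
    "privacy": "Direct identifiers removed; access restricted to the project team.",
    "version": "2.0 (labeling guideline v1.2)",
}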
Putting It Together: A Mini Case Study (End-to-End Data Thinking)
Imagine you want to build a model that routes incoming support tickets to the right team (billing, technical, account access). The model will read the ticket text and predict the category.
Data and labels
Inputs: ticket subject, ticket body, product type, customer plan.
Label: correct routing category.
Common hidden problems
Labels reflect past routing mistakes: if tickets were historically misrouted, the “category” field may be wrong.
Category definitions drift: a new team is created, or responsibilities change.
Multi-issue tickets: one ticket may include both billing and technical issues, but the system forces a single label.
Practical steps to improve quality
Define categories with clear boundaries and examples.
Use a pilot set where experts label tickets and compare to historical categories.
Add a “multiple issues” or “needs triage” label if that reflects reality.
Evaluate performance by slice: short tickets, tickets with attachments, tickets from new products.
This case shows the central idea: data is not just collected; it is designed. Labels are not just fields; they are decisions that encode what you want the AI to do.