Computer Vision Basics: Understanding Images, Features, and Modern Pipelines

Datasets and Labeling: Getting Ground Truth You Can Trust

Chapter 7

Estimated reading time: 12 minutes

1) Define label schemas and keep them consistent

A label schema is the contract between your data, your annotators, and your training/evaluation code. If the schema changes mid-project (even subtly), you create “silent bugs”: models learn the wrong target, metrics become incomparable, and debugging becomes expensive. Define the schema once, version it, and enforce it with validation scripts.

Task-specific schema elements

  • Classification: list of class names, whether multi-class (exactly one label) or multi-label (any subset), and whether “unknown/other” is allowed.
  • Detection: class list plus bounding box format, coordinate system, and whether boxes are axis-aligned or rotated.
  • Segmentation: class list plus mask representation (polygon, raster mask, run-length encoding), resolution rules, and how overlaps are handled.

Define the class taxonomy

Write down the complete set of classes and their intended meaning. Decide early whether you want fine-grained classes (e.g., “sedan”, “SUV”, “truck”) or coarse classes (e.g., “vehicle”). Fine-grained labels can improve downstream usefulness but increase annotation ambiguity and cost. If you expect future refinement, plan for a hierarchical taxonomy (e.g., vehicle > car > sedan) so you can collapse classes later without relabeling.

Bounding box schema (detection)

Pick one box format and never mix formats across tools and exports.

  • Common formats: xyxy (xmin, ymin, xmax, ymax), xywh (xmin, ymin, width, height), or normalized variants in [0,1].
  • Coordinate origin: typically top-left of the image. Document it explicitly.
  • Pixel convention: decide whether coordinates are inclusive/exclusive at the boundary (e.g., does xmax equal the last covered pixel or one past it?). This matters when converting boxes to masks or computing IoU.
  • Rotated boxes (if needed): define angle units (degrees/radians), rotation direction, and whether angle is relative to x-axis.
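
To make the format choice concrete, here is a minimal Python sketch that converts between the common axis-aligned formats; the function names are only for illustration and assume pixel coordinates with a top-left origin.

def xywh_to_xyxy(box):
    # (xmin, ymin, width, height) -> (xmin, ymin, xmax, ymax)
    xmin, ymin, w, h = box
    return (xmin, ymin, xmin + w, ymin + h)

def xyxy_to_normalized(box, img_w, img_h):
    # Pixel xyxy -> coordinates normalized to [0, 1] by image size.
    xmin, ymin, xmax, ymax = box
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)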

Mask schema (segmentation)

Segmentation labels can be represented in multiple ways; choose based on tooling and downstream needs.

  • Raster masks: one label per pixel. Define whether you store a single-channel class-id image or one binary mask per class/instance.
  • Polygons: store vertices in image coordinates. Define winding order, whether polygons can self-intersect, and how holes are represented.
  • Overlaps: for instance segmentation, define whether instances can overlap and how to resolve pixel ownership (z-order rules or per-instance masks).
  • Void/ignore label: reserve a specific id (e.g., 255) for “ignore” pixels that should not contribute to loss/metrics.
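
As a small illustration of the void/ignore convention, the sketch below (plain NumPy, using the 255 ignore id from the example above) builds a validity mask that training and metric code can use to skip void pixels; the array shapes are assumptions for the example.

import numpy as np

IGNORE_ID = 255  # reserved "void" id: these pixels are excluded from loss and metrics

# mask: H x W single-channel class-id image exported by the annotation tool
mask = np.array([[0, 1, IGNORE_ID],
                 [1, 1, IGNORE_ID]], dtype=np.uint8)

valid = mask != IGNORE_ID          # True where the pixel should count
labels = np.where(valid, mask, 0)  # ignored pixels get a dummy id; exclude them via `valid`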

Schema versioning and validation

Treat the schema like code: version it, review changes, and validate every export.

  • Store a machine-readable schema file (e.g., JSON/YAML) with: class list, ids, colors (if used), formats, and ignore rules.
  • Write a validator that checks: class ids are valid, boxes are within image bounds, masks match image size, no NaNs, and required fields exist.
  • Lock class-id mappings. Renumbering classes midstream breaks training reproducibility.
{  "schema_version": "1.0.0",  "task": "detection",  "box_format": "xyxy",  "coords": "pixel",  "origin": "top_left",  "classes": [    {"id": 0, "name": "person"},    {"id": 1, "name": "bicycle"}  ],  "ignore": {"use": true, "reason_codes": ["too_small", "ambiguous", "heavily_occluded"]}}

2) Annotation guidelines that prevent ambiguity

Annotation guidelines are the operational definition of “ground truth.” Two annotators can look at the same image and disagree for valid reasons unless you specify how to handle edge cases. Your goal is not perfection; it is consistency that matches your intended model behavior.

Write class definitions like testable rules

For each class, include: a short definition, inclusion criteria, exclusion criteria, and visual examples (in your internal doc). Avoid definitions that rely on intent (“a person who is shopping”); prefer observable cues, while recognizing that even cue-based definitions (“a person holding a basket in a store aisle”) can remain ambiguous. If a class is hard to define visually, consider removing it or changing the task.

  • Inclusion criteria: what must be true to label this class.
  • Exclusion criteria: common confusions and how to label them.
  • Minimum size/visibility: thresholds for labeling vs ignoring.

Handle edge cases explicitly

Edge cases are where label noise is born. Create a dedicated section for them and update it as you discover new patterns during annotation.

  • Ambiguous objects: if you cannot confidently assign a class, decide whether to use “unknown/other” or “ignore.”
  • Reflections and screens: decide whether objects in mirrors, windows, or displays count.
  • Depictions: posters, drawings, mannequins—decide per class.
  • Groups: for tightly packed objects, define whether to label each instance or allow a single group label (usually avoid group labels unless the task demands it).

Occlusion and truncation rules

Occlusion (blocked by another object) and truncation (cut off by image boundary) should be treated consistently, especially for detection and segmentation.

  • Bounding boxes: typically draw the box around the visible extent only (not the full inferred object). If you choose “amodal” boxes (full extent), document it and ensure annotators can do it reliably.
  • Segmentation masks: usually label only visible pixels. If you need amodal masks, expect higher cost and lower agreement.
  • Truncation flag: add a boolean attribute (e.g., truncated=true) when an object touches the image boundary.
  • Occlusion level: optionally record coarse occlusion (none/partial/heavy) to support analysis and filtering.
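
Put together, a single detection annotation carrying these attributes might look like the record below; the field names are illustrative, not a fixed standard.

{
  "image_id": "cam03_000123",
  "class_id": 0,
  "box": [412, 80, 498, 310],
  "truncated": true,
  "occlusion": "partial",
  "ignore": false
}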

Define ‘ignore’ regions and when to use them

Ignore regions prevent penalizing the model for areas where labels are unreliable or out of scope. They are especially important in segmentation (void pixels) and dense scenes.

  • When to ignore: extreme blur, heavy occlusion, ambiguous class, tiny objects below a threshold, or areas outside the domain (e.g., reflections if excluded).
  • How to ignore: use a dedicated ignore label id for masks; for detection, either omit the object or include it with an ignore flag that your training/eval pipeline respects.
  • Be conservative: overusing ignore can hide dataset weaknesses. Track ignore frequency per class and per annotator.

Step-by-step: create guidelines that annotators can follow

  • Step 1: Draft a one-page schema summary (classes + formats) and a longer guideline doc (rules + examples).
  • Step 2: Pilot on a small set (e.g., 200–500 items) with multiple annotators.
  • Step 3: Review disagreements and convert them into explicit rules (add to “edge cases”).
  • Step 4: Update and version the guideline document (e.g., guidelines_v1.2).
  • Step 5: Train annotators with a short quiz or calibration task and feedback.
  • Step 6: Lock rules for the main annotation run; only change with a controlled process and re-audit impacted labels.

3) Quality control: measuring and improving label reliability

Quality control (QC) is how you turn “labels” into “ground truth you can trust.” QC should be continuous, not a one-time gate at the end. Combine statistical checks (agreement metrics) with targeted human review (spot checks and gold sets).

Inter-annotator agreement (IAA)

IAA quantifies how consistently different annotators label the same items. Low agreement means your task is ambiguous, your guidelines are unclear, or your annotators need calibration.

  • Classification: use metrics such as Cohen’s kappa (two annotators) or Fleiss’ kappa (multiple). Also track per-class confusion matrices.
  • Detection: compute agreement via IoU matching plus class match. For example, count a match if IoU > 0.5 and class matches; analyze missed vs extra boxes.
  • Segmentation: use pixel IoU/Dice between annotators, possibly restricted to non-ignore pixels.

Use IAA as a diagnostic: break down by class, scene type, and object size. Often, disagreement concentrates in a few edge cases that can be fixed with clearer rules.
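
As a sketch of the detection case, the snippet below counts matched, missed, and extra boxes between two annotators using the IoU > 0.5 plus class-match rule described above. The greedy matching and the (class_id, box) layout are simplifications for illustration, not a standard implementation.

def iou(box_a, box_b):
    # Boxes in xyxy pixel format.
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_agreement(anns_a, anns_b, iou_thr=0.5):
    """anns_*: lists of (class_id, box). Returns (matched, missed_by_b, extra_in_b)."""
    unmatched_b = list(anns_b)
    matched = 0
    for cls_a, box_a in anns_a:
        # Greedy: take B's highest-IoU box, then require class agreement.
        best = max(unmatched_b, key=lambda cb: iou(box_a, cb[1]), default=None)
        if best is not None and best[0] == cls_a and iou(box_a, best[1]) > iou_thr:
            matched += 1
            unmatched_b.remove(best)
    return matched, len(anns_a) - matched, len(unmatched_b)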

Spot checks (random and targeted)

Spot checks catch systematic issues early (e.g., one annotator consistently draws loose boxes, or a tool export flips coordinates). Use two modes:

  • Random sampling: review a fixed percentage per batch (e.g., 2–5%).
  • Targeted sampling: oversample risky items such as rare classes, small objects, heavy occlusion, or images from new sources.

Record outcomes as structured data (pass/fail + reason codes) so you can track trends over time.

Gold sets and calibration

A gold set is a curated subset with trusted labels (often created by senior annotators or consensus). It is used to measure annotator accuracy and drift.

  • Build a gold set that covers common cases and known edge cases.
  • Hide gold items inside normal work so annotators cannot treat them differently.
  • Score annotators regularly and provide feedback; retrain or restrict access if quality drops.

Audit trails and label provenance

When a model fails, you need to answer: “Which labels were used, who created them, with which guideline version, and what changed?” An audit trail makes this possible.

  • Store per-annotation metadata: annotator id, timestamp, tool version, guideline version, and review status.
  • Keep diffs when labels are edited (before/after), not just the latest snapshot.
  • Track dataset versions as immutable releases (e.g., dataset_v3.0) used by training runs.
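
A per-annotation provenance record of this kind might look like the following; the field names and values are illustrative.

{
  "annotation_id": "a-48210",
  "annotator_id": "ann-07",
  "created_at": "2024-05-14T10:32:00Z",
  "tool_version": "2.3.1",
  "guideline_version": "guidelines_v1.2",
  "review_status": "approved",
  "previous_value": {"box": [410, 78, 495, 305]},
  "current_value": {"box": [412, 80, 498, 310]}
}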

Step-by-step: a practical QC workflow

  • Step 1: Run automated validation on every export (schema checks, bounds checks, missing fields).
  • Step 2: Compute batch-level stats (class counts, box sizes, ignore rates) and alert on anomalies.
  • Step 3: Perform spot checks (random + targeted) and log issues with reason codes.
  • Step 4: Measure IAA weekly on a shared subset; update guidelines if disagreement clusters.
  • Step 5: Maintain a gold set and use it for ongoing annotator calibration and drift detection.
  • Step 6: Require review/approval for guideline changes and re-audit affected data.

4) Dataset splits and leakage prevention

Train/validation/test splits are not just bookkeeping; they define what “generalization” means. Leakage happens when near-duplicates or correlated samples appear in multiple splits, inflating metrics and hiding real failure modes.

Define what must be independent

Independence should match your deployment scenario. If you will deploy on new locations, new cameras, or new people, your test set should reflect that. Common leakage sources:

  • Same scene: multiple photos of the same room or intersection.
  • Same video: adjacent frames are highly correlated.
  • Same person/object identity: the same individual appears in multiple splits.
  • Near-duplicates: resized/cropped versions, burst shots, or re-encoded copies.

Group-based splitting (recommended)

Instead of splitting by individual images, split by a group key that captures correlation. Examples: video_id, scene_id, camera_id, subject_id, location_id, or capture_session. All items with the same key must go to the same split.

# Group-based split: every item sharing a group key lands in the same split.
import random

def group_split(items, group_key, ratios=(0.8, 0.1, 0.1), seed=0):
    groups = sorted({group_key(item) for item in items})
    random.Random(seed).shuffle(groups)
    cut1 = int(ratios[0] * len(groups))
    cut2 = int((ratios[0] + ratios[1]) * len(groups))
    split_of = {g: ("train" if i < cut1 else "val" if i < cut2 else "test")
                for i, g in enumerate(groups)}
    return [split_of[group_key(item)] for item in items]

Temporal splitting for video

If your deployed model will make predictions on future data, consider splitting by time: earlier sessions for training, later sessions for testing. For video, avoid putting adjacent frames in different splits; even with group splitting, ensure that clips do not overlap.

Stratification without leakage

You often want class balance across splits, but naive stratification can break group constraints. Use group-aware stratification: assign groups while tracking class distribution at the group level. If perfect balance is impossible, prioritize leakage prevention over balance; you can address imbalance during training and data collection.
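
One way to implement group-aware stratification is a greedy assignment: give each group to the split that is currently most underfilled, relative to its target ratio, for the classes that group contains. The sketch below assumes you have already computed per-group class counts; it is a heuristic illustration, not a standard algorithm.

def stratified_group_split(class_counts_by_group,
                           ratios={"train": 0.8, "val": 0.1, "test": 0.1}):
    """class_counts_by_group: dict of group_id -> {class_name: instance_count}."""
    split_counts = {s: {} for s in ratios}
    assignment = {}
    # Assign the largest groups first so smaller groups can fill remaining gaps.
    for group, counts in sorted(class_counts_by_group.items(),
                                key=lambda kv: -sum(kv[1].values())):
        def load(split):
            # Instances of this group's classes already in the split, scaled by target share.
            return sum(split_counts[split].get(c, 0) for c in counts) / ratios[split]
        best = min(ratios, key=load)
        assignment[group] = best
        for c, n in counts.items():
            split_counts[best][c] = split_counts[best].get(c, 0) + n
    return assignment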

Step-by-step: leakage-resistant splitting

  • Step 1: Decide the grouping keys that represent correlation (at least one; often multiple).
  • Step 2: Create a metadata table with one row per item and columns for group keys.
  • Step 3: Detect near-duplicates (hashing/perceptual hashing) and assign duplicates to the same group (a small hashing sketch follows this list).
  • Step 4: Perform group-based split; optionally group-aware stratify on class presence.
  • Step 5: Verify no group key appears in multiple splits; run automated checks.
  • Step 6: Freeze the split definition as part of the dataset release so results are comparable.
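
For Step 3, a rough near-duplicate check can be built from an 8x8 average hash: downscale to grayscale, threshold against the mean, and compare signatures by Hamming distance. The sketch below uses only Pillow and NumPy and is a stand-in for a proper deduplication tool; the distance threshold is an assumption you should tune on your own data.

import numpy as np
from PIL import Image

def average_hash(path, size=8):
    # Downscale to grayscale size x size and threshold against the mean -> 64-bit signature.
    pixels = np.asarray(Image.open(path).convert("L").resize((size, size)), dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def hamming(h1, h2):
    return int(np.count_nonzero(h1 != h2))

def near_duplicate(path_a, path_b, max_dist=5):
    # Small Hamming distance between hashes -> likely resized/re-encoded copy or burst shot.
    return hamming(average_hash(path_a), average_hash(path_b)) <= max_dist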

5) Class imbalance and long-tail problems

Most real datasets are long-tailed: a few classes dominate, while many classes are rare. If you ignore this, your model may achieve high overall accuracy while failing on the classes you care about. Address imbalance at the dataset level (collection and labeling) and at the sampling level (how you construct batches).

Diagnose imbalance with the right metrics

Start by measuring:

  • Per-class counts (instances and images containing the class).
  • Per-class difficulty proxies: average object size, occlusion rate, truncation rate.
  • Per-class performance: not just overall metrics; track per-class precision/recall or IoU.

Long-tail issues often show up as: rare classes with low recall, or confusion between visually similar rare classes.

Targeted data collection (often best)

Collecting more representative data usually beats algorithmic tricks because it improves coverage of real-world variation.

  • Define the gap: which classes or conditions are underrepresented (e.g., “bicycle at night,” “small objects,” “rain”).
  • Collect intentionally: capture sessions designed to include those cases, not random sampling.
  • Label efficiently: pre-filter candidate images with simple heuristics or weak signals, then annotate.

Oversampling and reweighting (use carefully)

If you cannot collect more data quickly, you can change how you sample training data.

  • Oversampling rare classes: include rare-class images more often in training batches. Risk: overfitting to a small set of examples.
  • Undersampling common classes: reduces dominance but may discard useful diversity.
  • Class-balanced sampling: sample by class presence rather than uniformly over images; for multi-label data, define rules to avoid biasing toward images that contain many classes.

To reduce overfitting when oversampling, ensure rare-class items are diverse (different scenes, viewpoints, lighting) and monitor performance on a leakage-free validation set.
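
Here is a minimal sketch of class-balanced sampling with a cap, assuming you have the set of classes present in each training image: weight each image by the inverse frequency of its rarest class, capped so no image is repeated excessively. The weighting rule and cap value are illustrative choices, not a standard recipe.

from collections import Counter

def sampling_weights(class_sets, cap=10.0):
    """class_sets: list of sets of class names present in each image."""
    counts = Counter(c for s in class_sets for c in s)
    total = sum(counts.values())
    weights = []
    for s in class_sets:
        if not s:
            weights.append(1.0)  # background-only images keep a neutral weight
            continue
        rarest = min(s, key=lambda c: counts[c])
        # Inverse frequency relative to a uniform share across classes, capped.
        weights.append(min(cap, total / (len(counts) * counts[rarest])))
    return weights  # feed into a weighted random sampler during training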

Labeling strategy for rare classes

Rare classes are where guidelines and QC matter most, because annotators see them less often and make more mistakes.

  • Provide extra examples in the guideline doc for rare classes.
  • Route rare-class items to more experienced annotators or require review.
  • Use targeted gold items that focus on rare-class edge cases.

Step-by-step: choosing between targeted collection and oversampling

  • Step 1: Quantify the tail (counts + per-class validation performance).
  • Step 2: Check diversity of rare-class examples (unique scenes/subjects). If low, prioritize collection.
  • Step 3: If collection is slow, implement oversampling with caps (do not repeat the same few items excessively).
  • Step 4: Add stricter QC for rare classes (review + gold set coverage).
  • Step 5: Re-evaluate after each dataset release; stop oversampling once the dataset becomes naturally more balanced.

Now answer the exercise about the content:

Why is group-based splitting recommended when creating train/validation/test splits for a computer vision dataset?


Answer: Group-based splitting assigns all items sharing a correlation key (like video_id or scene_id) to the same split, preventing near-duplicates or correlated samples from leaking across splits and inflating performance.

Next chapter

Evaluating Performance: Metrics That Match Real Goals
