Why data structure matters
The way your dataset is structured determines which summaries, statistical methods, and graphs are valid. If you treat categories as numbers, averages become meaningless. If you ignore time order, you can miss trends and seasonality. If one row mixes multiple observational units, you can accidentally double-count or compute incorrect rates.
Before calculating anything, answer three questions:
- What is the observational unit? (What does one row represent?)
- What are the variables? (What is measured about each unit?)
- What are the data types? (What kind of values does each variable take?)
Observational units and variables
Observational unit (the “row”)
An observational unit is the entity you observe: a person, a transaction, a day, a machine, a hospital visit, a product, etc. In a well-organized dataset, each row corresponds to one observational unit.
Example: A clinic tracks appointments. If each row is an appointment, then patient age appears repeatedly across multiple rows (one per appointment). That is not wrong, but it changes what summaries mean (e.g., average age of appointments vs average age of patients).
Variables (the “columns”)
A variable is a characteristic recorded for each observational unit. Variables can be:
- Measured (e.g., blood pressure)
- Recorded (e.g., appointment date)
- Assigned (e.g., treatment group)
Always verify that each column has a single meaning and consistent units (e.g., weight always in kg, not sometimes lb).
Common data types and what summaries/graphs make sense
Categorical vs quantitative
Categorical (qualitative) variables place units into groups or labels.
- Nominal: no natural order (e.g., blood type, country, device model).
- Ordinal: ordered categories (e.g., satisfaction: low/medium/high; pain scale: none/mild/moderate/severe).
Quantitative (numeric) variables represent amounts where arithmetic is meaningful (e.g., height, revenue, response time).
What to use:
| Variable type | Good summaries | Good graphs | Avoid |
|---|---|---|---|
| Categorical (nominal) | Counts, proportions, mode | Bar chart, stacked bar | Mean/median of category codes |
| Categorical (ordinal) | Counts by level, median category, percentiles (careful) | Ordered bar chart | Assuming equal spacing between levels unless justified |
| Quantitative | Mean, median, SD, IQR, min/max | Histogram, box plot, density plot, scatter plot | Bar chart of every unique numeric value (usually) |
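The table above can be sketched in code. A minimal example assuming pandas, with hypothetical columns `payment_method` (nominal) and `delivery_time_minutes` (quantitative):

```python
import pandas as pd

# Hypothetical example data: one row per order
df = pd.DataFrame({
    "payment_method": ["cash", "card", "card", "transfer", "card"],
    "delivery_time_minutes": [32.5, 41.0, 27.8, 55.2, 38.9],
})

# Categorical (nominal): counts and proportions, never a mean of codes
counts = df["payment_method"].value_counts()
proportions = df["payment_method"].value_counts(normalize=True)

# Quantitative: center and spread
mean_time = df["delivery_time_minutes"].mean()
median_time = df["delivery_time_minutes"].median()
iqr = (df["delivery_time_minutes"].quantile(0.75)
       - df["delivery_time_minutes"].quantile(0.25))
```

Note that nothing numeric is computed on `payment_method`: a mean of category codes would be meaningless.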
Discrete vs continuous (within quantitative)
Discrete variables take countable values (often integers): number of purchases, defects, calls.
Continuous variables can take any value in an interval: time, temperature, weight.
Why it matters: Discrete counts often have many zeros and right-skew; continuous measures may need rounding rules and unit consistency. Histograms work for both, but bin choices differ; for small-count discrete variables, a bar chart of counts can be reasonable.
Time series vs cross-sectional vs panel
Cross-sectional data: many units measured at one point in time (e.g., survey of 1,000 people in June).
Time series data: one unit measured repeatedly over time (e.g., daily website visits).
Panel/longitudinal data: many units measured repeatedly over time (e.g., monthly sales for 200 stores).
What to use:
- Cross-sectional: compare groups (bar charts, box plots, scatter plots).
- Time series: line charts, moving averages, seasonal plots; summaries should respect order (e.g., week-over-week change).
- Panel: line charts per unit (often sampled), small multiples, or summaries by time and group; be careful about repeated measures (rows are not independent in many analyses).
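For the time-series case, order-respecting summaries can be sketched as follows, assuming pandas and hypothetical daily visit counts:

```python
import pandas as pd

# Hypothetical time series: one unit (a website) measured daily
visits = pd.Series(
    [100, 120, 90, 110, 130, 95, 105],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# A 3-day moving average smooths noise while respecting time order
rolling_mean = visits.rolling(window=3).mean()

# Day-over-day change: the difference from the previous observation
day_over_day = visits.diff()
```

Both summaries depend on row order, which is exactly why shuffling or treating the rows as independent cross-sectional observations would invalidate them.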
Mini-exercise: classify variables
For each variable, classify it as categorical/quantitative, and if applicable nominal/ordinal and discrete/continuous. Then choose one appropriate summary and one appropriate graph.
- A) `payment_method` (cash, card, transfer)
- B) `number_of_logins_last_week`
- C) `customer_satisfaction` (1–5 stars)
- D) `delivery_time_minutes`
- E) `month` (Jan–Dec) in a dataset of monthly revenue
Organizing data: tidy datasets
A practical standard for analysis is tidy data. In tidy form:
- Each row is one observational unit.
- Each column is one variable.
- Each cell contains one value.
Step-by-step: reshape messy data into tidy form
Messy example (wide, repeated columns): A fitness study records step counts for each participant across days as separate columns.
participant_id | steps_day1 | steps_day2 | steps_day3

Tidy (long) version: one row per participant-day.

participant_id | day | steps

Steps to tidy it:
- 1) Identify the observational unit you need for analysis (here: participant-day).
- 2) Create a single column for the repeated dimension (here: `day`).
- 3) Create a single column for the measurement (here: `steps`).
- 4) Ensure variable names describe content, not positions (use `day = 1, 2, 3` rather than encoding in column names).
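The steps above can be sketched with pandas, using hypothetical step counts for the fitness example:

```python
import pandas as pd

# Messy wide form: one column per day, as in the example
wide = pd.DataFrame({
    "participant_id": [1, 2],
    "steps_day1": [5100, 6200],
    "steps_day2": [4800, 7000],
    "steps_day3": [5300, 6600],
})

# Reshape to tidy long form: one row per participant-day
long = wide.melt(
    id_vars="participant_id",
    var_name="day",
    value_name="steps",
)
# Turn the column name 'steps_day1' into the numeric day 1
long["day"] = long["day"].str.replace("steps_day", "", regex=False).astype(int)
```

After the reshape, `day` is a real variable you can filter, group, and plot on, instead of information hidden in column names.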
Common tidy-data pitfalls
- Multiple variables in one column: `blood_pressure` stored as `120/80` should be split into `systolic` and `diastolic`.
- One variable spread across multiple columns: `temp_morning`, `temp_evening` might be better as `time_of_day` and `temperature`, depending on the analysis.
- Multiple observational units in one table: mixing customer-level fields (birthdate) with transaction-level fields (purchase amount) can be fine, but clarify the row unit (transaction) and avoid interpreting summaries as customer-level unless you aggregate.
Mini-exercise: choose the observational unit
You have a dataset with columns: customer_id, order_id, order_date, order_total, customer_region. If each row is an order, which variables are repeated across rows? If you want average spending per customer, what transformation do you need before summarizing?
Coding schemes: making categories usable and consistent
A coding scheme defines how values are represented. Good coding reduces ambiguity and prevents errors during analysis.
Principles for coding categorical variables
- Use consistent labels: choose `Female`/`Male` or `F`/`M`, not both.
- Avoid encoding meaning in free text: prefer a controlled set of categories.
- Keep a separate codebook: do not rely on memory.
- Don’t use numbers as labels unless necessary: if you code `1 = Yes` and `2 = No`, store a label mapping and never compute averages of that column.
Ordinal coding
For ordinal variables, store both:
- Ordered labels (e.g., `low`, `medium`, `high`)
- Optionally an ordered numeric score (e.g., 1, 2, 3) if you will use methods that require numeric input, and only if the spacing assumption is acceptable for your purpose.
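One way to store ordered labels without assuming equal spacing is an ordered categorical type. A sketch assuming pandas:

```python
import pandas as pd

# Ordinal variable with an explicit level order (hypothetical ratings)
levels = ["low", "medium", "high"]
ratings = pd.Series(["medium", "low", "high", "medium", "high"])
s = pd.Series(pd.Categorical(ratings, categories=levels, ordered=True))

# Order-aware operations now work; spacing is still not assumed
lowest = s.min()
highest = s.max()
# Median category: the middle value after sorting by level order
median_level = s.sort_values().iloc[len(s) // 2]
```

Storing the order in the type (rather than in ad hoc numeric codes) keeps comparisons valid while preventing accidental arithmetic like averaging the levels.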
Step-by-step: define a coding scheme
- 1) List all expected categories (including “unknown/not provided”).
- 2) Decide the stored representation (text labels or numeric codes).
- 3) Write the mapping in a data dictionary (see below).
- 4) Validate incoming data against the allowed set (flag new/unexpected categories).
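Step 4 can be sketched as a check against the allowed set, assuming pandas and hypothetical labels:

```python
import pandas as pd

# Allowed set from the coding scheme (hypothetical)
ALLOWED_METHODS = {"delivery", "pickup", "unknown"}

# Incoming data with a typo and an inconsistent case
incoming = pd.Series(["delivery", "pickup", "Delivary", "unknown", "PICKUP"])

# Flag values outside the allowed set instead of silently keeping them
unexpected = sorted(set(incoming) - ALLOWED_METHODS)
```

Flagging (rather than dropping) unexpected categories lets you decide whether each one is a typo to fix or a genuinely new category to add to the codebook.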
Missing values: recognize, document, and avoid deletion bias
Missing values occur when a variable is not recorded for an observational unit. Handling missingness is not just a technical step; it can change results if missingness is related to the outcome.
Recognize missingness correctly
Missing values can be represented in many ways:
- Truly empty cells / system missing
- Placeholders: `NA`, `N/A`, `NULL`, `.`
- Sentinel values: `-999`, `9999`, or `0` used to mean “unknown”
- Text like `unknown`, `not sure`
Rule: convert placeholders and sentinel values to a single missing representation in your tool, but keep the original meaning documented (e.g., “not asked” vs “refused”).
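A minimal sketch of this rule, assuming pandas and hypothetical sentinel codes for an `income` column:

```python
import pandas as pd
import numpy as np

# Raw data mixing real values with placeholders and sentinels (hypothetical)
raw = pd.DataFrame({
    "income": [52000, -999, 61000, 9999, 48000],
    "source": ["survey", "N/A", "survey", "NULL", "survey"],
})

# Convert placeholders and sentinels to one missing representation (NaN)
clean = raw.replace({
    "income": {-999: np.nan, 9999: np.nan},
    "source": {"N/A": np.nan, "NULL": np.nan},
})
```

Without this step, the sentinel `-999` would silently pull the mean income down; after conversion, summaries are computed only over real values.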
Document why values are missing
When possible, distinguish types of missingness with an additional variable:
- `income` (numeric)
- `income_missing_reason` (categorical: not asked / refused / data error)
This prevents mixing different situations and supports better decisions later.
Avoid accidental deletion bias
A common mistake is to drop all rows with any missing value (often called “complete-case analysis”) without checking the pattern. This can bias results if the remaining rows are not representative.
Example: If older participants are more likely to skip an online follow-up survey, deleting missing follow-up outcomes can make the sample look younger and healthier than reality.
Step-by-step: a safe missing-data workflow
- 1) Profile missingness: compute missing counts and percentages per variable.
- 2) Check patterns: is missingness concentrated in certain groups (e.g., region, age band, time period)?
- 3) Decide per variable: drop, impute, or keep as “unknown” category (for categorical variables).
- 4) Record decisions: note rules used (e.g., “treat -999 as missing; keep unknown as its own category for education level”).
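Steps 1 and 2 of the workflow can be sketched as follows, assuming pandas and the delivery example used later in this section:

```python
import pandas as pd
import numpy as np

# Hypothetical orders: delivery time is missing for pickup orders
orders = pd.DataFrame({
    "shipping_method": ["delivery", "pickup", "delivery", "pickup", "delivery"],
    "delivery_time_minutes": [35.0, np.nan, 42.0, np.nan, 38.0],
})

# Step 1: missing percentage per variable
missing_pct = orders.isna().mean() * 100

# Step 2: is missingness concentrated in certain groups?
missing_by_group = (
    orders.assign(is_missing=orders["delivery_time_minutes"].isna())
          .groupby("shipping_method")["is_missing"].mean()
)
```

Here the pattern is obvious (all pickup orders are missing a delivery time), which points to a structural reason rather than random data loss.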
Mini-exercise: missingness decisions
You are analyzing delivery performance. Variable delivery_time_minutes is missing for 12% of orders, mostly for shipping_method = pickup. What could be the reason? Should you drop these rows, set them to 0, or treat them differently? Write one sentence describing a defensible rule.
Creating a simple data dictionary (codebook)
A data dictionary is a structured description of each variable: what it means, how it is coded, and what values are allowed. It makes analysis reproducible and prevents misinterpretation.
Recommended fields
- Variable name (exact column name)
- Description (plain-language meaning)
- Type (categorical nominal/ordinal, quantitative discrete/continuous, date/time)
- Unit (minutes, dollars, kg, etc.)
- Allowed values / range (e.g., 0–120)
- Missing value codes (how missing is represented)
- Notes (collection quirks, transformations)
Example data dictionary (excerpt)
| Variable | Description | Type | Unit/Levels | Allowed | Missing |
|---|---|---|---|---|---|
| order_id | Unique order identifier | Categorical (nominal) | — | Unique, non-empty | Not allowed |
| order_date | Date order was placed | Date | YYYY-MM-DD | Valid calendar date | Allowed (rare) |
| shipping_method | How the order is fulfilled | Categorical (nominal) | delivery/pickup | Set of known labels | `unknown` if not recorded |
| delivery_time_minutes | Minutes from dispatch to delivery | Quantitative (continuous) | minutes | >= 0 | Missing for pickup |
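A data dictionary can also be machine-readable, so that validation follows directly from it. A sketch assuming pandas, with a hypothetical dictionary structure covering two of the fields above:

```python
import pandas as pd

# Hypothetical machine-readable excerpt of the data dictionary
DICTIONARY = {
    "shipping_method": {"allowed": {"delivery", "pickup"}},
    "delivery_time_minutes": {"min": 0},
}

df = pd.DataFrame({
    "shipping_method": ["delivery", "pickup", "Delivary"],
    "delivery_time_minutes": [35.0, -5.0, 41.0],
})

problems = []

# Check categorical labels against the allowed set
bad_labels = set(df["shipping_method"]) - DICTIONARY["shipping_method"]["allowed"]
if bad_labels:
    problems.append(f"unexpected labels: {sorted(bad_labels)}")

# Check the numeric range against the declared minimum
below_min = (df["delivery_time_minutes"] < DICTIONARY["delivery_time_minutes"]["min"]).sum()
if below_min:
    problems.append(f"{below_min} value(s) below minimum")
```

Driving checks from the dictionary means the documentation and the validation cannot drift apart.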
Checking data quality before summarizing
Data quality checks catch issues that can silently distort summaries and graphs.
Range checks
Verify numeric variables fall within plausible limits.
- Human age: typically 0–120
- Percentage: 0–100
- Delivery time: >= 0
Tip: Use both hard limits (impossible values) and soft limits (unlikely values that need review).
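The hard/soft distinction can be sketched with pandas, using hypothetical age data and illustrative limits:

```python
import pandas as pd

ages = pd.Series([34, 51, -2, 27, 140, 118])

# Hard limits: impossible values that must be corrected or set to missing
impossible = ages[(ages < 0) | (ages > 120)]

# Soft limits: possible but unlikely values that need human review
# (threshold of 105 is an illustrative choice, not a standard)
review = ages[(ages >= 0) & (ages <= 120) & (ages > 105)]
```

Keeping the two lists separate matters: impossible values are data errors, while unlikely values may be real and should not be silently removed.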
Duplicates and uniqueness
Decide which columns should uniquely identify a row.
- If the observational unit is an order, `order_id` should be unique.
- If the unit is a store-day, the pair `(store_id, date)` should be unique.
Duplicates can come from repeated exports, merges, or data-entry errors.
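A duplicate check on the intended key can be sketched as follows, assuming pandas and hypothetical order data:

```python
import pandas as pd

# Hypothetical export where one order was duplicated
orders = pd.DataFrame({
    "order_id": ["A1", "A2", "A2", "A3"],
    "order_total": [10.0, 25.0, 25.0, 7.5],
})

# Flag every row involved in a duplicate on the key (keep=False marks all copies)
dup_mask = orders.duplicated(subset="order_id", keep=False)
n_dup_rows = dup_mask.sum()

# For a composite key like (store_id, date), pass both columns:
# orders.duplicated(subset=["store_id", "date"], keep=False)
```

Inspecting the flagged rows before dropping anything helps distinguish a repeated export (identical rows) from a data-entry conflict (same key, different values).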
Impossible combinations and consistency checks
- Date logic: `ship_date` should not be before `order_date`.
- Conditional missingness: if `shipping_method = pickup`, then `delivery_time_minutes` should be missing (or 0 only if defined that way).
- Unit consistency: if some weights are in grams and others in kilograms, values may cluster around 0.2 and 200.
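The first two checks can be sketched with pandas, using hypothetical rows that each violate one rule:

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-02"]),
    "ship_date": pd.to_datetime(["2024-03-02", "2024-03-01"]),
    "shipping_method": ["delivery", "pickup"],
    "delivery_time_minutes": [40.0, 35.0],
})

# Date logic: ship_date must not be before order_date
bad_dates = (df["ship_date"] < df["order_date"]).sum()

# Conditional missingness: pickup orders should have no delivery time
bad_pickup = ((df["shipping_method"] == "pickup")
              & df["delivery_time_minutes"].notna()).sum()
```

Counting violations (rather than fixing them in place) produces an audit trail you can review before deciding how to correct each case.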
Step-by-step: a basic quality checklist
- 1) Confirm the observational unit and unique identifier(s).
- 2) Check column types (numbers stored as text, dates parsed correctly).
- 3) Run missingness counts per variable and per key groups.
- 4) Run range checks for quantitative variables.
- 5) Check category levels for unexpected labels (e.g., a `Delivary` typo).
- 6) Check duplicates on the intended key.
- 7) Check logical constraints across columns (date order, nonnegative times).
Mini-exercise: pick checks and summaries
Dataset: one row per employee. Variables: employee_id, department, hire_date, salary_usd, performance_rating (poor/fair/good/excellent).
- A) Which variable(s) should be unique? What duplicate check would you run?
- B) Suggest one range check for `salary_usd` and one consistency check involving `hire_date`.
- C) Choose an appropriate summary and graph for `department`.
- D) Choose an appropriate summary and graph for `salary_usd`.
- E) For `performance_rating`, would you treat it as nominal or ordinal? What graph fits best?
Choosing summaries and graphs based on structure
Once types and structure are clear, you can select summaries that match the question and the data.
Quick selection guide
- One categorical variable: frequency table, bar chart.
- One quantitative variable: mean/median, SD/IQR, histogram or box plot.
- Categorical + quantitative: group-wise summaries (mean/median per group), side-by-side box plots.
- Two quantitative variables: scatter plot, correlation (with caution about outliers).
- Time series: line plot over time, changes and rolling summaries.
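The "categorical + quantitative" case from the guide can be sketched with pandas, using the delivery example from earlier sections:

```python
import pandas as pd

# Hypothetical orders: a quantitative outcome and a grouping variable
df = pd.DataFrame({
    "shipping_method": ["delivery", "pickup", "delivery", "delivery"],
    "delivery_time_minutes": [40.0, 12.0, 30.0, 50.0],
})

# Group-wise summaries: one median and mean per category
by_group = (df.groupby("shipping_method")["delivery_time_minutes"]
              .agg(["median", "mean"]))
```

The same grouped data would feed side-by-side box plots; the table form here is the numeric counterpart of that graph.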
Mini-exercise: match the graph to the scenario
- A) Compare `delivery_time_minutes` across `shipping_method`.
- B) Show monthly revenue over two years.
- C) Show the distribution of `number_of_support_tickets` per customer last month.
- D) Show relationship between `ad_spend_usd` and `sales_usd` across campaigns.