Data Types, Measurement, and Organizing a Dataset

Chapter 2

Estimated reading time: 10 minutes


Why data structure matters

The way your dataset is structured determines which summaries, statistical methods, and graphs are valid. If you treat categories as numbers, averages become meaningless. If you ignore time order, you can miss trends and seasonality. If one row mixes multiple observational units, you can accidentally double-count or compute incorrect rates.

Before calculating anything, answer three questions:

  • What is the observational unit? (What does one row represent?)
  • What are the variables? (What is measured about each unit?)
  • What are the data types? (What kind of values does each variable take?)

Observational units and variables

Observational unit (the “row”)

An observational unit is the entity you observe: a person, a transaction, a day, a machine, a hospital visit, a product, etc. In a well-organized dataset, each row corresponds to one observational unit.

Example: A clinic tracks appointments. If each row is an appointment, then patient age appears repeatedly across multiple rows (one per appointment). That is not wrong, but it changes what summaries mean (e.g., average age of appointments vs average age of patients).

Variables (the “columns”)

A variable is a characteristic recorded for each observational unit. Variables can be:


  • Measured (e.g., blood pressure)
  • Recorded (e.g., appointment date)
  • Assigned (e.g., treatment group)

Always verify that each column has a single meaning and consistent units (e.g., weight always in kg, not sometimes lb).

Common data types and what summaries/graphs make sense

Categorical vs quantitative

Categorical (qualitative) variables place units into groups or labels.

  • Nominal: no natural order (e.g., blood type, country, device model).
  • Ordinal: ordered categories (e.g., satisfaction: low/medium/high; pain scale: none/mild/moderate/severe).

Quantitative (numeric) variables represent amounts where arithmetic is meaningful (e.g., height, revenue, response time).

What to use:

Variable type | Good summaries | Good graphs | Avoid
Categorical (nominal) | Counts, proportions, mode | Bar chart, stacked bar | Mean/median of category codes
Categorical (ordinal) | Counts by level, median category, percentiles (careful) | Ordered bar chart | Assuming equal spacing between levels unless justified
Quantitative | Mean, median, SD, IQR, min/max | Histogram, box plot, density plot, scatter plot | Bar chart of every unique numeric value (usually)
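The distinction above drives which code you write. A minimal sketch in pandas with made-up data (the column names and values here are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical example data: one row per order
df = pd.DataFrame({
    "payment_method": ["cash", "card", "card", "transfer", "card"],
    "delivery_time_minutes": [32.5, 18.0, 45.2, 27.1, 39.9],
})

# Categorical: counts and proportions (never a mean of category codes)
counts = df["payment_method"].value_counts()
proportions = df["payment_method"].value_counts(normalize=True)

# Quantitative: center and spread
mean_time = df["delivery_time_minutes"].mean()
median_time = df["delivery_time_minutes"].median()
iqr = (df["delivery_time_minutes"].quantile(0.75)
       - df["delivery_time_minutes"].quantile(0.25))
```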

Discrete vs continuous (within quantitative)

Discrete variables take countable values (often integers): number of purchases, defects, calls.

Continuous variables can take any value in an interval: time, temperature, weight.

Why it matters: Discrete counts often have many zeros and right-skew; continuous measures may need rounding rules and unit consistency. Histograms work for both, but bin choices differ; for small-count discrete variables, a bar chart of counts can be reasonable.

Time series vs cross-sectional vs panel

Cross-sectional data: many units measured at one point in time (e.g., survey of 1,000 people in June).

Time series data: one unit measured repeatedly over time (e.g., daily website visits).

Panel/longitudinal data: many units measured repeatedly over time (e.g., monthly sales for 200 stores).

What to use:

  • Cross-sectional: compare groups (bar charts, box plots, scatter plots).
  • Time series: line charts, moving averages, seasonal plots; summaries should respect order (e.g., week-over-week change).
  • Panel: line charts per unit (often sampled), small multiples, or summaries by time and group; be careful about repeated measures (rows are not independent in many analyses).
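For time series, "summaries should respect order" often means rolling windows and differences rather than a single overall mean. A small sketch with an invented daily-visits series:

```python
import pandas as pd

# Hypothetical daily website visits (dates and counts are made up)
visits = pd.Series(
    [120, 135, 128, 150, 160, 155, 170],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# A 3-day moving average smooths day-to-day noise
rolling_mean = visits.rolling(window=3).mean()

# Change from the previous observation (the daily analogue of
# week-over-week change)
day_over_day_change = visits.diff()
```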

Mini-exercise: classify variables

For each variable, classify it as categorical/quantitative, and if applicable nominal/ordinal and discrete/continuous. Then choose one appropriate summary and one appropriate graph.

  • A) payment_method (cash, card, transfer)
  • B) number_of_logins_last_week
  • C) customer_satisfaction (1–5 stars)
  • D) delivery_time_minutes
  • E) month (Jan–Dec) in a dataset of monthly revenue

Organizing data: tidy datasets

A practical standard for analysis is tidy data. In tidy form:

  • Each row is one observational unit.
  • Each column is one variable.
  • Each cell contains one value.

Step-by-step: reshape messy data into tidy form

Messy example (wide, repeated columns): A fitness study records step counts for each participant across days as separate columns.

participant_id | steps_day1 | steps_day2 | steps_day3

Tidy (long) version: one row per participant-day.

participant_id | day | steps

Steps to tidy it:

  • 1) Identify the observational unit you need for analysis (here: participant-day).
  • 2) Create a single column for the repeated dimension (here: day).
  • 3) Create a single column for the measurement (here: steps).
  • 4) Ensure variable names describe content, not positions (use day=1,2,3 rather than encoding in column names).
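The four steps above correspond to a single reshape in pandas. A sketch using the fitness-study layout (step counts are invented):

```python
import pandas as pd

# Wide (messy) layout: one column per day
wide = pd.DataFrame({
    "participant_id": [1, 2],
    "steps_day1": [5200, 6100],
    "steps_day2": [4800, 7000],
    "steps_day3": [5500, 6600],
})

# Reshape to one row per participant-day
tidy = wide.melt(
    id_vars="participant_id",
    var_name="day",
    value_name="steps",
)

# Turn "steps_day1" into the number 1 so 'day' describes content,
# not column position
tidy["day"] = tidy["day"].str.replace("steps_day", "", regex=False).astype(int)
```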

Common tidy-data pitfalls

  • Multiple variables in one column: blood_pressure stored as 120/80 should be split into systolic and diastolic.
  • One variable spread across multiple columns: temp_morning, temp_evening might be better as time_of_day and temperature depending on the analysis.
  • Multiple observational units in one table: mixing customer-level fields (birthdate) with transaction-level fields (purchase amount) can be fine, but clarify the row unit (transaction) and avoid interpreting summaries as customer-level unless you aggregate.
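The first pitfall, two variables packed into one column, is usually fixed with a string split. A minimal sketch (readings are invented):

```python
import pandas as pd

# Hypothetical readings with two variables packed into one column
df = pd.DataFrame({"blood_pressure": ["120/80", "135/85", "118/76"]})

# Split "120/80" into separate systolic and diastolic columns
parts = df["blood_pressure"].str.split("/", expand=True)
df["systolic"] = parts[0].astype(int)
df["diastolic"] = parts[1].astype(int)
```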

Mini-exercise: choose the observational unit

You have a dataset with columns: customer_id, order_id, order_date, order_total, customer_region. If each row is an order, which variables are repeated across rows? If you want average spending per customer, what transformation do you need before summarizing?

Coding schemes: making categories usable and consistent

A coding scheme defines how values are represented. Good coding reduces ambiguity and prevents errors during analysis.

Principles for coding categorical variables

  • Use consistent labels: choose Female/Male or F/M, not both.
  • Avoid encoding meaning in free text: prefer a controlled set of categories.
  • Keep a separate codebook: do not rely on memory.
  • Don’t use numbers as labels unless necessary: if you code 1=Yes and 2=No, store a label mapping and never compute averages of that column.

Ordinal coding

For ordinal variables, store both:

  • Ordered labels (e.g., low, medium, high)
  • Optionally an ordered numeric score (e.g., 1,2,3) if you will use methods that require numeric input, and only if the spacing assumption is acceptable for your purpose.
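In pandas, both pieces can live together in an ordered categorical, which sorts by the declared order instead of alphabetically. A sketch with invented survey responses:

```python
import pandas as pd

# Hypothetical survey responses
responses = pd.Series(["medium", "low", "high", "medium", "low"])

# Store the order explicitly instead of relying on alphabetical sorting
levels = ["low", "medium", "high"]
ordered = pd.Categorical(responses, categories=levels, ordered=True)

# Optional numeric score (1, 2, 3) -- only if the equal-spacing
# assumption is acceptable for your purpose
scores = pd.Series(ordered).cat.codes + 1
```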

Step-by-step: define a coding scheme

  • 1) List all expected categories (including “unknown/not provided”).
  • 2) Decide the stored representation (text labels or numeric codes).
  • 3) Write the mapping in a data dictionary (see below).
  • 4) Validate incoming data against the allowed set (flag new/unexpected categories).
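Step 4 can be a one-line membership check. A sketch (the allowed set and incoming values are invented):

```python
import pandas as pd

# Allowed set from the codebook, including an explicit "unknown"
ALLOWED = {"cash", "card", "transfer", "unknown"}

incoming = pd.Series(["cash", "card", "Card", "transfer", "crypto"])

# Flag any value not in the allowed set for review
# (note "Card" fails too: labels must match exactly)
unexpected = incoming[~incoming.isin(ALLOWED)]
```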

Missing values: recognize, document, and avoid deletion bias

Missing values occur when a variable is not recorded for an observational unit. Handling missingness is not just a technical step; it can change results if missingness is related to the outcome.

Recognize missingness correctly

Missing values can be represented in many ways:

  • Truly empty cells / system missing
  • Placeholders: NA, N/A, NULL, .
  • Sentinel values: -999, 9999, 0 used to mean “unknown”
  • Text like unknown, not sure

Rule: convert placeholders and sentinel values to a single missing representation in your tool, but keep the original meaning documented (e.g., “not asked” vs “refused”).
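Applying that rule in pandas means mapping every placeholder and sentinel to one missing representation (NaN). A sketch with invented raw values; the placeholder set should come from your own codebook:

```python
import numpy as np
import pandas as pd

# Raw values with mixed missing conventions (sentinels and placeholders)
raw = pd.Series(["42", "N/A", "-999", "", "57", "unknown"])

# Map every placeholder/sentinel to a single missing representation,
# but keep the original meanings documented in the codebook
placeholders = {"N/A", "NA", "NULL", ".", "", "unknown", "-999", "9999"}
cleaned = raw.where(~raw.isin(placeholders), np.nan).astype(float)
```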

Document why values are missing

When possible, distinguish types of missingness with an additional variable:

  • income (numeric)
  • income_missing_reason (categorical: not asked / refused / data error)

This prevents mixing different situations and supports better decisions later.

Avoid accidental deletion bias

A common mistake is to drop all rows with any missing value (often called “complete-case analysis”) without checking the pattern. This can bias results if the remaining rows are not representative.

Example: If older participants are more likely to skip an online follow-up survey, deleting missing follow-up outcomes can make the sample look younger and healthier than reality.

Step-by-step: a safe missing-data workflow

  • 1) Profile missingness: compute missing counts and percentages per variable.
  • 2) Check patterns: is missingness concentrated in certain groups (e.g., region, age band, time period)?
  • 3) Decide per variable: drop, impute, or keep as “unknown” category (for categorical variables).
  • 4) Record decisions: note rules used (e.g., “treat -999 as missing; keep unknown as its own category for education level”).
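Steps 1 and 2 of the workflow can be sketched in a few lines of pandas (the orders table below is invented):

```python
import numpy as np
import pandas as pd

# Hypothetical orders table with some missing delivery times
df = pd.DataFrame({
    "shipping_method": ["delivery", "pickup", "delivery", "pickup", "delivery"],
    "delivery_time_minutes": [32.0, np.nan, 45.0, np.nan, np.nan],
})

# Step 1: missing percentage per variable
missing_pct = df.isna().mean() * 100

# Step 2: is missingness concentrated in certain groups?
missing_by_group = (
    df.assign(time_missing=df["delivery_time_minutes"].isna())
      .groupby("shipping_method")["time_missing"]
      .mean()
)
```

If `missing_by_group` is far from uniform (here, pickup orders are always missing a delivery time), that pattern should drive the per-variable decision in step 3.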

Mini-exercise: missingness decisions

You are analyzing delivery performance. Variable delivery_time_minutes is missing for 12% of orders, mostly for shipping_method = pickup. What could be the reason? Should you drop these rows, set them to 0, or treat them differently? Write one sentence describing a defensible rule.

Creating a simple data dictionary (codebook)

A data dictionary is a structured description of each variable: what it means, how it is coded, and what values are allowed. It makes analysis reproducible and prevents misinterpretation.

Recommended fields

  • Variable name (exact column name)
  • Description (plain-language meaning)
  • Type (categorical nominal/ordinal, quantitative discrete/continuous, date/time)
  • Unit (minutes, dollars, kg, etc.)
  • Allowed values / range (e.g., 0–120)
  • Missing value codes (how missing is represented)
  • Notes (collection quirks, transformations)

Example data dictionary (excerpt)

Variable | Description | Type | Unit/Levels | Allowed | Missing
order_id | Unique order identifier | Categorical (nominal) | (n/a) | Unique, non-empty | Not allowed
order_date | Date order was placed | Date | YYYY-MM-DD | Valid calendar date | Allowed (rare)
shipping_method | How the order is fulfilled | Categorical (nominal) | delivery/pickup | Set of known labels | unknown if not recorded
delivery_time_minutes | Minutes from dispatch to delivery | Quantitative (continuous) | minutes | >= 0 | Missing for pickup

Checking data quality before summarizing

Data quality checks catch issues that can silently distort summaries and graphs.

Range checks

Verify numeric variables fall within plausible limits.

  • Human age: typically 0–120
  • Percentage: 0–100
  • Delivery time: >= 0

Tip: Use both hard limits (impossible values) and soft limits (unlikely values that need review).
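Hard and soft limits translate directly into boolean masks. A sketch with invented ages (the 105-year soft limit is an arbitrary choice for illustration):

```python
import pandas as pd

ages = pd.Series([34, 51, -2, 119, 250, 47])

# Hard limits: impossible values (reject or correct)
impossible = (ages < 0) | (ages > 120)

# Soft limits: possible but unlikely values that need manual review
review = ages.between(0, 120) & ~ages.between(0, 105)
```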

Duplicates and uniqueness

Decide which columns should uniquely identify a row.

  • If the observational unit is an order, order_id should be unique.
  • If the unit is a store-day, the pair (store_id, date) should be unique.

Duplicates can come from repeated exports, merges, or data-entry errors.
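A duplicate check on the intended key is a one-liner in pandas. A sketch with an invented orders table:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "order_total": [25.0, 40.0, 40.0, 15.0],
})

# If the observational unit is an order, order_id must be unique;
# keep=False marks every row involved in a duplicate
dupes = orders[orders.duplicated(subset="order_id", keep=False)]

# For a store-day unit you would check the pair instead, e.g.:
# store_days.duplicated(subset=["store_id", "date"], keep=False)
```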

Impossible combinations and consistency checks

  • Date logic: ship_date should not be before order_date.
  • Conditional missingness: if shipping_method = pickup, then delivery_time_minutes should be missing (or 0 only if defined that way).
  • Unit consistency: if some weights are in grams and others in kilograms, values may cluster around 0.2 and 200.
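The first two checks above can be written as boolean masks over the relevant columns. A sketch with an invented orders table:

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-05", "2024-03-10"]),
    "ship_date": pd.to_datetime(["2024-03-02", "2024-03-04", "2024-03-11"]),
    "shipping_method": ["delivery", "delivery", "pickup"],
    "delivery_time_minutes": [35.0, 42.0, 20.0],
})

# Date logic: shipping before ordering is impossible
bad_dates = df["ship_date"] < df["order_date"]

# Conditional missingness: pickup orders should not carry a delivery time
bad_pickup = (
    (df["shipping_method"] == "pickup")
    & df["delivery_time_minutes"].notna()
)
```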

Step-by-step: a basic quality checklist

  • 1) Confirm the observational unit and unique identifier(s).
  • 2) Check column types (numbers stored as text, dates parsed correctly).
  • 3) Run missingness counts per variable and per key groups.
  • 4) Run range checks for quantitative variables.
  • 5) Check category levels for unexpected labels (e.g., Delivary typo).
  • 6) Check duplicates on the intended key.
  • 7) Check logical constraints across columns (date order, nonnegative times).

Mini-exercise: pick checks and summaries

Dataset: one row per employee. Variables: employee_id, department, hire_date, salary_usd, performance_rating (poor/fair/good/excellent).

  • A) Which variable(s) should be unique? What duplicate check would you run?
  • B) Suggest one range check for salary_usd and one consistency check involving hire_date.
  • C) Choose an appropriate summary and graph for department.
  • D) Choose an appropriate summary and graph for salary_usd.
  • E) For performance_rating, would you treat it as nominal or ordinal? What graph fits best?

Choosing summaries and graphs based on structure

Once types and structure are clear, you can select summaries that match the question and the data.

Quick selection guide

  • One categorical variable: frequency table, bar chart.
  • One quantitative variable: mean/median, SD/IQR, histogram or box plot.
  • Categorical + quantitative: group-wise summaries (mean/median per group), side-by-side box plots.
  • Two quantitative variables: scatter plot, correlation (with caution about outliers).
  • Time series: line plot over time, changes and rolling summaries.

Mini-exercise: match the graph to the scenario

  • A) Compare delivery_time_minutes across shipping_method.
  • B) Show monthly revenue over two years.
  • C) Show the distribution of number_of_support_tickets per customer last month.
  • D) Show relationship between ad_spend_usd and sales_usd across campaigns.

Now answer the exercise about the content:

A fitness study records step counts as columns steps_day1, steps_day2, and steps_day3 for each participant. Which change best converts this dataset into tidy form for analysis?


Tidy data uses one row per observational unit and one value per cell. Here the needed unit is participant-day, so the repeated day columns should be converted into a single day variable and a single steps measurement column.

Next chapter

Describing Categorical Data with Counts, Proportions, and Rates

Statistics Fundamentals: From Data to Decisions