Describing Categorical Data with Counts, Proportions, and Rates

Chapter 3

Counts: the starting point for categorical summaries

Categorical data describe membership in groups (e.g., Yes/No, Brand A/B/C, Low/Medium/High). The most basic summary is a count: how many observations fall in each category.

One-way frequency table (counts)

A one-way frequency table lists categories for a single variable and their counts. Example: a clinic records the primary reason for visit for 40 patients.

Reason for visit | Count
Checkup | 18
Flu-like symptoms | 12
Injury | 6
Other | 4
Total | 40

Counts are easy to interpret, but they are not always comparable across datasets of different sizes. That is why we often convert counts into proportions or percentages.

Proportions, relative frequency, and percentages

A proportion (also called relative frequency) is a count divided by a total. A percentage is a proportion multiplied by 100.

Denominator discipline: every proportion or percentage must clearly state what it is out of. The denominator determines the meaning.

Step-by-step: from counts to relative frequency and percent

Using the clinic table (total = 40):

  • Relative frequency for Checkup = 18 / 40 = 0.45
  • Percent for Checkup = 0.45 × 100 = 45%

Reason for visit | Count | Relative frequency | Percent
Checkup | 18 | 0.45 | 45%
Flu-like symptoms | 12 | 0.30 | 30%
Injury | 6 | 0.15 | 15%
Other | 4 | 0.10 | 10%
Total | 40 | 1.00 | 100%
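
The same conversion can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the names `visit_counts` and `relative_frequencies` are made up for the example, and the counts are the clinic data from the table above.

```python
# Clinic counts from the one-way table above.
visit_counts = {
    "Checkup": 18,
    "Flu-like symptoms": 12,
    "Injury": 6,
    "Other": 4,
}


def relative_frequencies(counts):
    """Return {category: count / total} for a dict of category counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total count is zero; proportions are undefined")
    return {category: n / total for category, n in counts.items()}


total = sum(visit_counts.values())
for category, proportion in relative_frequencies(visit_counts).items():
    # Print the count, the denominator, the proportion, and the percent together.
    count = visit_counts[category]
    print(f"{category}: {count} of {total} = {proportion:.2f} ({proportion:.0%})")
```

Computing the total once and dividing every count by it keeps the denominator identical across categories, which is why the relative frequencies sum to 1.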

Common denominator mistakes (and how to prevent them)

  • Mistake: reporting “30% had flu-like symptoms” without saying out of what. Fix: “30% of the 40 patients had flu-like symptoms.”
  • Mistake: mixing denominators in the same sentence (e.g., one percent out of all patients, another out of a subgroup). Fix: keep denominators consistent or explicitly label each one.
  • Mistake: using percentages when the total is tiny (e.g., 1 out of 3 = 33.3%). Fix: show both count and percent: “1 of 3 (33%).”
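
One way to make these fixes routine is to format every percentage together with its count and denominator. The helper below is a hypothetical sketch (the name `count_and_percent` and its parameters are assumptions for illustration), not a standard function.

```python
def count_and_percent(count, total, group_label):
    """Format a count with its denominator and percent in one phrase."""
    if total <= 0:
        raise ValueError("denominator must be positive")
    percent = 100 * count / total
    # The group label says in words what the percentage is "out of".
    return f"{count} of {total} {group_label} ({percent:.0f}%)"


print(count_and_percent(12, 40, "patients"))  # 12 of 40 patients (30%)
print(count_and_percent(1, 3, "patients"))    # 1 of 3 patients (33%)
```

Because the count and the total always appear next to the percent, a reader can see immediately when a percentage rests on a very small denominator.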

Rates: adding a time, population, or exposure denominator

A rate is a ratio where the denominator represents an amount of opportunity for the event to occur—often time, population size, or exposure. Rates are essential when groups differ in size or observation time.

Examples of rates

  • Per population: cases / population (often scaled per 1,000 or 100,000)
  • Per time: events / month, calls / hour
  • Per exposure: defects / 1,000 units, accidents / 1,000 miles

Step-by-step: computing and scaling a rate

A city reports 84 bicycle thefts in a year, population 120,000.

  • Unscaled rate: 84 / 120,000 = 0.0007 thefts per person per year
  • Per 100,000 people: 0.0007 × 100,000 = 70 thefts per 100,000 people per year

Denominator discipline for rates: always include the unit and the scaling factor (e.g., “per 100,000 per year”). Without it, the number is ambiguous.
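
The bicycle-theft calculation can be sketched as a small function. The name `scaled_rate` and its `per` parameter are illustrative assumptions; the numbers are the ones from the example above.

```python
def scaled_rate(events, population, per=100_000):
    """Return the number of events per `per` people over the observation period."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events * per / population


thefts = 84
population = 120_000

unscaled = scaled_rate(thefts, population, per=1)  # thefts per person per year
per_100k = scaled_rate(thefts, population)         # thefts per 100,000 people per year

# Always print the unit and the scaling factor next to the number.
print(f"{unscaled:.4f} thefts per person per year")
print(f"{per_100k:.0f} thefts per 100,000 people per year")
```

Making the scaling factor an explicit argument means the unit in the printed label and the unit in the calculation are chosen in the same place.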

Two-way tables: describing relationships between two categorical variables

A two-way table (contingency table) cross-classifies observations by two categorical variables. It allows you to compute:

  • Joint counts/proportions (a specific combination of categories)
  • Marginal counts/proportions (totals by row or column)
  • Conditional proportions (percentages within a row or within a column)

Example: product choice by customer segment

A store surveys 200 customers: segment (New vs Returning) and purchase (Basic vs Premium).

Segment | Basic | Premium | Total
New | 70 | 30 | 100
Returning | 40 | 60 | 100
Total | 110 | 90 | 200

Joint and marginal proportions

  • Joint proportion of New & Premium: 30 / 200 = 0.15 = 15% (out of all customers)
  • Marginal proportion Premium overall: 90 / 200 = 45% (out of all customers)
  • Marginal proportion Returning overall: 100 / 200 = 50% (out of all customers)
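
As a rough sketch, the joint and marginal proportions can be computed from the table stored as a nested dictionary. The layout `table[segment][purchase]` and the variable names are assumptions made for this illustration.

```python
# Counts from the two-way table above: table[segment][purchase].
table = {
    "New": {"Basic": 70, "Premium": 30},
    "Returning": {"Basic": 40, "Premium": 60},
}

# Grand total: every customer in the survey.
grand_total = sum(sum(row.values()) for row in table.values())

# Joint proportion: one cell divided by the grand total.
joint_new_premium = table["New"]["Premium"] / grand_total

# Marginal proportions: a column total or row total divided by the grand total.
premium_total = sum(row["Premium"] for row in table.values())
marginal_premium = premium_total / grand_total
marginal_returning = sum(table["Returning"].values()) / grand_total

print(f"New & Premium: {joint_new_premium:.0%} of all {grand_total} customers")
print(f"Premium overall: {marginal_premium:.0%} of all {grand_total} customers")
print(f"Returning overall: {marginal_returning:.0%} of all {grand_total} customers")
```

All three results share the denominator `grand_total`, which is what distinguishes joint and marginal proportions from conditional ones.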

Conditional proportions: the key to comparing groups

Conditional proportions answer questions like “Among New customers, what percent buy Premium?” Here the denominator is the subgroup total.

Compute Premium rate within each segment:

  • Among New: 30 / 100 = 30%
  • Among Returning: 60 / 100 = 60%

Notice how the denominator changed from 200 (overall) to 100 (within segment). This change is not a technicality—it changes the question being answered.

Row vs column percentages (choose based on the question)

In a two-way table, you can compute percentages in two main ways:

  • Row percentages: divide each cell by its row total (answers “within this row category, how are outcomes distributed?”)
  • Column percentages: divide each cell by its column total (answers “within this column category, where did these cases come from?”)

Using the table above:

  • Row % (Premium within segment): New = 30%, Returning = 60%.
  • Column % (segment composition among Premium buyers): among Premium buyers, New = 30/90=33.3%, Returning = 60/90=66.7%.

Both are correct, but they answer different questions because they use different denominators.
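
The two directions of percentaging can be sketched as two small functions. The function names and the nested-dictionary layout are illustrative assumptions; the counts are from the customer table above.

```python
# Counts from the customer table: table[segment][purchase].
table = {
    "New": {"Basic": 70, "Premium": 30},
    "Returning": {"Basic": 40, "Premium": 60},
}


def row_percentages(table):
    """Divide each cell by its row total (percent within each row category)."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {col: 100 * n / row_total for col, n in row.items()}
    return result


def column_percentages(table):
    """Divide each cell by its column total (percent within each column category)."""
    column_totals = {}
    for row in table.values():
        for col, n in row.items():
            column_totals[col] = column_totals.get(col, 0) + n
    return {
        row_label: {col: 100 * n / column_totals[col] for col, n in row.items()}
        for row_label, row in table.items()
    }


row_pct = row_percentages(table)
col_pct = column_percentages(table)
print(f"{row_pct['New']['Premium']:.1f}% of New customers bought Premium")
print(f"{col_pct['New']['Premium']:.1f}% of Premium buyers were New")
```

The cell count (30) is the same in both print statements; only the denominator, and therefore the sentence that describes the result, differs.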

Comparing groups with risk (probability), risk difference, and relative risk

When the outcome is binary (e.g., purchase Premium: Yes/No; churn: Yes/No; side effect: Yes/No), a conditional proportion is often called a risk (an intuitive probability within a group).

Risk in each group

From the customer example, define the “event” as buying Premium.

  • Risk (New) = 30/100 = 0.30
  • Risk (Returning) = 60/100 = 0.60

Risk difference (absolute comparison)

Risk difference (RD) is the difference between two risks:

RD = Risk(Returning) − Risk(New) = 0.60 − 0.30 = 0.30

Interpretation: Returning customers have a Premium purchase rate that is 30 percentage points higher than New customers. Risk difference is an absolute gap.

Relative risk (multiplicative comparison)

Relative risk (RR) is the ratio of risks:

RR = Risk(Returning) / Risk(New) = 0.60 / 0.30 = 2.0

Interpretation: Returning customers are buying Premium at about 2 times the rate of New customers. Relative risk is a multiplicative comparison.

Why RD and RR can tell different stories

Absolute and relative comparisons emphasize different aspects:

  • If risks are small (e.g., 1% vs 2%), the RR may look large (2×) while the RD is only 1 percentage point.
  • If risks are moderate (e.g., 30% vs 60%), both RD (30 points) and RR (2×) may feel substantial.

Good reporting often includes both, with clear denominators: “60% (60/100) vs 30% (30/100); RD = 30 points; RR = 2.0.”
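
To see the contrast side by side, the sketch below computes RD and RR for the two situations described above. The function names and the `scenarios` dictionary are illustrative assumptions, and the 2% vs 1% pair is the hypothetical small-risk case from the text, not observed data.

```python
def risk_difference(risk_a, risk_b):
    """Absolute comparison: risk in group A minus risk in group B."""
    return risk_a - risk_b


def relative_risk(risk_a, risk_b):
    """Multiplicative comparison: risk in group A divided by risk in group B."""
    if risk_b == 0:
        raise ValueError("relative risk is undefined when the reference risk is 0")
    return risk_a / risk_b


scenarios = {
    "Small risks (2% vs 1%)": (0.02, 0.01),
    "Returning vs New (60/100 vs 30/100)": (60 / 100, 30 / 100),
}

for label, (risk_a, risk_b) in scenarios.items():
    rd_points = 100 * risk_difference(risk_a, risk_b)  # in percentage points
    rr = relative_risk(risk_a, risk_b)
    print(f"{label}: RD = {rd_points:.0f} points; RR = {rr:.1f}")
```

Both scenarios give the same relative risk, while the risk differences are 1 point and 30 points, which is why reporting both measures is more informative than reporting either alone.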

How changing denominators changes interpretation

Many misunderstandings come from switching the denominator without noticing. Practice identifying the denominator in each statement.

Example: the same count, different denominators

Take the 30 customers who were New and bought Premium.

  • Out of all customers (200): 30/200 = 15% bought Premium and were New (joint proportion).
  • Out of New customers (100): 30/100 = 30% of New customers bought Premium (conditional risk).
  • Out of Premium buyers (90): 30/90 = 33.3% of Premium buyers were New (a different conditional proportion).

All three use the same numerator (30) but answer three different questions. Always label the denominator in words.

Visualizing categorical data: bar charts, pie charts, and common pitfalls

Bar charts: usually the best default

Bar charts compare category sizes using lengths, which people judge accurately. They work well for counts or percentages.

  • Use a simple bar chart for one-way tables.
  • Use grouped (side-by-side) bars to compare conditional percentages across groups.
  • Use 100% stacked bars to compare composition (conditional distribution) when you want each group to sum to 100%.

Pie charts: limited use cases

Pie charts show parts of a whole, but comparing angles is harder than comparing bar lengths. If you use a pie chart:

  • Limit to a small number of categories (often 5 or fewer).
  • Ensure it truly represents a single whole with a clear denominator (e.g., “out of all 40 patients”).
  • Avoid using multiple pie charts to compare groups; use grouped or 100% stacked bars instead.

When stacked bars mislead

Stacked bars can be confusing because only the first segment has a common baseline. Problems arise when:

  • You use stacked counts to compare groups of different sizes: the taller bar might reflect a larger group, not a higher rate.
  • You want to compare a middle segment across groups: without a shared baseline, differences are hard to see.

Fixes:

  • If the goal is comparing rates, use 100% stacked bars (each bar sums to 100%) or grouped bars of conditional percentages.
  • If the goal is comparing totals, use counts but clearly label group sizes and consider separate panels (small multiples).
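
The first fix amounts to rescaling each bar before plotting. The sketch below shows only that rescaling step, on made-up counts for two stores of different sizes; the data, the store names, and the function name `to_percent_stack` are all hypothetical.

```python
# Hypothetical counts for two stores of different sizes (illustrative only).
stacked_counts = {
    "Store A": {"Basic": 280, "Premium": 120},
    "Store B": {"Basic": 40, "Premium": 60},
}


def to_percent_stack(counts_by_group):
    """Rescale each group's segments so that every group sums to 100%."""
    result = {}
    for group, segments in counts_by_group.items():
        group_total = sum(segments.values())
        result[group] = {name: 100 * n / group_total for name, n in segments.items()}
    return result


percent_stack = to_percent_stack(stacked_counts)
for group, segments in stacked_counts.items():
    n = sum(segments.values())
    premium_pct = percent_stack[group]["Premium"]
    print(f"{group} (n={n}): {segments['Premium']} Premium buyers = {premium_pct:.0f}% of its customers")
```

In these made-up numbers, Store A has the larger Premium segment by count (120 vs 60) but the lower Premium percentage (30% vs 60%), which is the pattern a stacked-count chart can hide.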

Annotating charts to prevent misinterpretation

Clear labels enforce denominator discipline and reduce confusion.

  • Title should state the denominator: “Premium purchases as a percent of customers within each segment.”
  • Axis label should include units: “Percent of customers (within segment)” vs “Number of customers.”
  • Show group sizes: annotate “New (n=100)” and “Returning (n=100)” so viewers know denominators.
  • Label bars directly when possible: e.g., “30%” and “60%” on the bars.
  • For rates, include the scale: “per 100,000 per year” or “per 1,000 units.”
  • Avoid ambiguous percent signs: write “% of Premium buyers” or “% within segment” rather than just “%.”

Worked example: from one-way table to two-way comparisons

Scenario

A company tracks whether a customer renewed a subscription (Yes/No) and whether they received a reminder email (Yes/No). Data from 500 customers:

Reminder | Renewed: Yes | Renewed: No | Total
Reminder: Yes | 180 | 120 | 300
Reminder: No | 90 | 110 | 200
Total | 270 | 230 | 500

Step 1: one-way summaries (marginals)

  • Overall renewal rate: 270/500 = 54%
  • Reminder coverage: 300/500 = 60% received a reminder

Step 2: conditional renewal rates (the key comparison)

  • Renewal risk with reminder: 180/300 = 60%
  • Renewal risk without reminder: 90/200 = 45%

These are within-group percentages, so the denominators are 300 and 200, not 500.

Step 3: risk difference and relative risk

  • Risk difference: 0.60 − 0.45 = 0.15 → 15 percentage points higher renewal with reminders
  • Relative risk: 0.60 / 0.45 ≈ 1.33 → renewal is about 1.33× as high with reminders
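
Steps 1 to 3 can be reproduced from the raw table in one short script. This is a sketch under the assumption that the table is stored as a nested dictionary; the variable and function names are illustrative.

```python
# Counts from the renewal table: renewals[reminder_group][outcome].
renewals = {
    "Reminder: Yes": {"Renewed": 180, "Not renewed": 120},
    "Reminder: No": {"Renewed": 90, "Not renewed": 110},
}


def renewal_risk(row):
    """Conditional proportion: renewals divided by the size of that group."""
    return row["Renewed"] / sum(row.values())


# Step 1: marginals use the grand total (500) as the denominator.
n_total = sum(sum(row.values()) for row in renewals.values())
overall_rate = sum(row["Renewed"] for row in renewals.values()) / n_total
reminder_coverage = sum(renewals["Reminder: Yes"].values()) / n_total

# Step 2: conditional risks use each group's own total (300 and 200).
risk_with = renewal_risk(renewals["Reminder: Yes"])
risk_without = renewal_risk(renewals["Reminder: No"])

# Step 3: absolute and relative comparisons of the two risks.
rd = risk_with - risk_without
rr = risk_with / risk_without

print(f"Overall renewal: {overall_rate:.0%} of {n_total} customers")
print(f"Received a reminder: {reminder_coverage:.0%} of {n_total} customers")
print(f"Renewal with reminder: {risk_with:.0%}; without: {risk_without:.0%}")
print(f"RD = {100 * rd:.0f} percentage points; RR = {rr:.2f}")
```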

Step 4: choose a chart that matches the question

  • If the question is “Do reminders change renewal rates?” use a grouped bar chart of 60% vs 45% (label “% renewed within reminder group”).
  • If the question is “How many renewals came from each reminder group?” use counts (180 vs 90) but annotate group sizes (n=300, n=200) to avoid implying a rate comparison.

Now answer the exercise about the content:

In a two-way table comparing Premium purchases between New and Returning customers, which calculation correctly gives the conditional percentage of Premium purchases among New customers?

Conditional percentages use the subgroup total as the denominator. “Among New customers” means divide New-and-Premium (30) by all New customers (100), giving 30%.
