Understanding Distributions: Shape, Skew, and Outliers

Chapter 5

Distributions as “patterns” of values

A distribution is the overall pattern of values a quantitative variable takes and how often those values occur. Instead of focusing on single observations, you look for structure: where values concentrate, how they taper off into tails, whether there are multiple clusters, and whether any values look unusual compared with the rest.

When you describe a distribution, always anchor your description in the real-world meaning of the variable: what is measured and in what units. A distribution of “response time (ms)” tells a different story than a distribution of “number of logins per day,” even if the plots look similar.

A reliable description checklist

Use the same order every time so you don’t miss important features:

  • Variable + units: What is being measured? What does one unit mean?
  • Center: Where is the distribution “typical”?
  • Spread: How variable are values around the center?
  • Shape: Symmetric or skewed? Unimodal or bimodal? How heavy are tails?
  • Unusual features: Outliers, gaps, clusters, truncation, heaping (rounding).

This chapter focuses on shape and unusual features, and on how different plots reveal them.

Plots for distributions: what each one is good at

Dot plots (small to moderate sample sizes)

A dot plot places one dot per observation along a number line. It is excellent for seeing individual values, repeated values, small clusters, and gaps.

  • Best for: n up to a few dozen (sometimes a couple hundred if values don’t overlap too much).
  • Strength: You can literally see the data.
  • Weakness: Gets cluttered for large datasets.
Example (dot plot idea): quiz scores out of 10 for 20 students

If many students scored 8, you’ll see a stack of dots at 8. If nobody scored 3–4, you’ll see a gap.
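The stack-and-gap idea can be sketched as a text dot plot. This is only an illustration: the 20 scores and the helper name `text_dot_plot` are invented here, not taken from a real class.

```python
from collections import Counter

# Hypothetical quiz scores out of 10 for 20 students (made up for illustration).
scores = [8, 8, 7, 9, 8, 6, 10, 8, 5, 7, 9, 8, 2, 7, 6, 8, 9, 10, 7, 5]

def text_dot_plot(values, low, high):
    """Return one line per integer value, with one '*' per observation."""
    counts = Counter(values)
    return [f"{v:>2} | " + "*" * counts.get(v, 0) for v in range(low, high + 1)]

for line in text_dot_plot(scores, 0, 10):
    print(line)
```

The row for 8 carries the tallest stack, and the rows for 3 and 4 stay empty, which is the gap described above.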

Histograms (workhorse for quantitative distributions)

A histogram groups values into intervals called bins and draws a bar for each bin. The bar height represents how many observations fall in that interval (or the proportion, depending on scaling).

  • Best for: moderate to large n.
  • Strength: Reveals overall shape, clusters, and tails.
  • Weakness: Appearance depends on bin width and bin boundaries.

How to create a histogram (step-by-step)

  • Step 1: Choose a bin width (or number of bins). Start with something that yields about 10–20 bins for a typical dataset, then adjust.
  • Step 2: Set bin boundaries (e.g., 0–5, 5–10, 10–15). Use boundaries that make sense in the units.
  • Step 3: Count observations in each bin.
  • Step 4: Draw bars with bar width equal to the bin width and height equal to the count (or proportion).
  • Step 5: Re-check with a different bin width to ensure key features (like bimodality) are not artifacts of binning.

Common binning pitfall: A histogram can look unimodal with wide bins and bimodal with narrower bins. Treat bin choice as part of analysis, not a cosmetic decision.
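A minimal sketch of Steps 1 to 5, assuming bins that are closed on the left and open on the right. The commute times and the function name `bin_counts` are hypothetical, chosen so that the effect of bin width is easy to see.

```python
import math
from collections import Counter

# Hypothetical commute times in minutes (invented for illustration).
times = [10, 12, 14, 15, 16, 18, 19, 22, 24, 31, 38,
         46, 48, 50, 52, 53, 55, 57, 58]

def bin_counts(values, width, start=0):
    """Count observations in bins [start + k*width, start + (k+1)*width).

    Returns (left_edge, count) pairs, including empty bins between the
    smallest and largest occupied bin.
    """
    idx = [math.floor((v - start) / width) for v in values]
    counts = Counter(idx)
    return [(start + k * width, counts.get(k, 0))
            for k in range(min(idx), max(idx) + 1)]

# Step 5: compare two bin widths on the same data.
for width in (30, 5):
    print(f"bin width = {width}")
    for left, count in bin_counts(times, width):
        print(f"  {left:>2}-{left + width:<2} " + "#" * count)
```

With 30-minute bins the data collapse into two bars of similar height and the dip between the clusters is invisible. With 5-minute bins the two clusters and the sparse middle become clear.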

Common histogram misreads

  • Mistake: Reading a bar height as “the value.” Fix: The x-axis shows values; the bar height shows how many values fall in that range.
  • Mistake: Thinking bars represent individual observations. Fix: Each bar aggregates many observations (unless bins are extremely narrow).
  • Mistake: Comparing two histograms with different y-axis scales without noticing. Fix: Check whether the y-axis is counts, proportions, or density.

Density plots (conceptual view of a smoothed histogram)

A density plot is a smooth curve that represents the distribution. Conceptually, it is like a histogram with very fine bins that has been smoothed. The key idea is that area under the curve corresponds to probability (or proportion of observations), not the height at a point.

  • Best for: seeing shape clearly, comparing groups (overlay curves).
  • Strength: Smooths noise; makes skew and multimodality easier to see.
  • Weakness: The picture depends on the amount of smoothing. Over-smoothing can hide small clusters, while under-smoothing can create the illusion of structure.

Common density misread: Interpreting the peak height as “most people have that exact value.” For continuous variables, exact values are less meaningful; what matters is the area over an interval.
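The area-versus-height point can be checked with a density-scaled histogram, which uses the same scaling as a density curve: height = count / (n × bin width). The sketch below uses invented response times in seconds, and `density_bars` is a hypothetical helper name.

```python
import math
from collections import Counter

# Hypothetical response times in seconds (invented for illustration).
times_s = [0.21, 0.22, 0.24, 0.25, 0.26, 0.27, 0.28, 0.31, 0.33, 0.41]
WIDTH = 0.1  # bin width in seconds

def density_bars(values, width):
    """Return {bin_index: height} with height = count / (n * width)."""
    n = len(values)
    counts = Counter(math.floor(v / width) for v in values)
    return {k: counts[k] / (n * width) for k in sorted(counts)}

bars = density_bars(times_s, WIDTH)
# Area of each bar = height * width = proportion of observations in the bin.
areas = {k: height * WIDTH for k, height in bars.items()}

for k in bars:
    print(f"bin starting at {k * WIDTH:.1f}s: "
          f"height = {bars[k]:.1f}, area = {areas[k]:.2f}")
```

The tallest bar has a height well above 1, so a height cannot be a probability. Its area is the proportion of observations in that interval, and the areas add up to 1.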

Box plots (compact summary + outlier flagging)

A box plot summarizes a distribution using quartiles. The box spans the middle 50% of values (from Q1 to Q3), with a line at the median. “Whiskers” typically extend to the most extreme values that are not flagged as outliers, and points beyond them are plotted individually as potential outliers (often using the 1.5×IQR rule, where IQR = Q3 - Q1).

  • Best for: comparing distributions across categories (side-by-side box plots).
  • Strength: Compact; highlights center, spread, and potential outliers.
  • Weakness: Hides multimodality and detailed shape; different datasets can share the same box plot.

Box plot caution: A box plot does not show whether the distribution is unimodal or bimodal. Use it alongside a histogram/dot plot when shape matters.
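To see why the caution matters, here is a small sketch with two invented datasets that share the same five-number summary (minimum, Q1, median, Q3, maximum) and would therefore draw the same box plot. It assumes the "inclusive" quartile convention from Python's `statistics` module. Other tools use slightly different quartile rules.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Two hypothetical datasets, invented for illustration.
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
two_clusters = [1, 3, 3, 3, 5, 7, 7, 7, 9]  # stacks at 3 and at 7

print(five_number_summary(evenly_spread))
print(five_number_summary(two_clusters))
```

Both calls print the same summary, even though the second dataset has two clear stacks. Only a dot plot or histogram would reveal the difference.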

Describing shape: what to look for

Modality: unimodal vs bimodal (and beyond)

Unimodal distributions have one clear peak (one main cluster). Bimodal distributions have two peaks, often indicating two subpopulations or two processes generating the data.

Example: Commute times in a city might be bimodal if many people either live close (10–20 minutes) or far (45–60 minutes), with fewer in between.

Practical check: If you see two peaks, ask whether there is an unrecorded grouping variable (e.g., remote vs in-office, two different machines, weekday vs weekend).
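If a candidate grouping variable is available, the check can be as simple as splitting the values by group and summarizing each part. The sketch below assumes hypothetical exam records labeled by exam version. The labels, the scores, and the helper `split_by_group` are all invented.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (exam version, score), invented for illustration.
records = [
    ("A", 56), ("A", 58), ("A", 60), ("A", 62), ("A", 64),
    ("B", 85), ("B", 88), ("B", 90), ("B", 92), ("B", 94),
]

def split_by_group(rows):
    """Collect values into one list per group label."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return dict(groups)

groups = split_by_group(records)
for label, values in sorted(groups.items()):
    print(f"group {label}: n = {len(values)}, "
          f"median = {median(values)}, range = {min(values)}-{max(values)}")
```

When the groups occupy separate ranges like this, the two peaks in the combined plot are explained by the grouping variable rather than by chance.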

Symmetry vs skewness

A distribution is symmetric if the left and right sides mirror each other around the center. It is skewed if one tail extends farther than the other.

  • Right-skewed (positive skew): Long tail to the right (toward larger values). Common for incomes, waiting times, counts of events.
  • Left-skewed (negative skew): Long tail to the left (toward smaller values). Can occur with scores that have a ceiling (many high scores, few low ones).

How to identify skew quickly: Look at which side has the longer, thinner tail. The tail direction names the skew.
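The visual check can be backed up with a number. One common choice, which this chapter does not otherwise use, is the moment-based skewness coefficient: the average cubed deviation from the mean divided by the standard deviation cubed. It is positive for a longer right tail and negative for a longer left tail. The delivery times below are invented.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical delivery times in minutes with one very long delivery.
delivery = [22, 25, 27, 28, 30, 31, 33, 35, 38, 120]
mirrored = [-t for t in delivery]  # same shape, tail flipped to the left

print("right tail:", sample_skewness(delivery) > 0)
print("left tail: ", sample_skewness(mirrored) < 0)
print("symmetric: ", sample_skewness([1, 2, 3, 4, 5]))
```

Treat the sign as a cross-check on what the plot shows, not a replacement for it: a single extreme value can dominate the coefficient.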

Common misread: Assuming “most data are normal.” Many real variables are skewed, truncated, or multimodal. Treat normality as a hypothesis to check, not a default assumption.

Tails: light vs heavy, and why they matter

Tails are the extremes of the distribution. Two datasets can have similar centers and spreads but different tail behavior.

  • Light tails: Extremes are rare; values drop off quickly away from the center.
  • Heavy tails: Extremes occur more often than you might expect; outliers are more common.

Why you care: Tail behavior affects risk and planning. For example, if delivery times have a heavy right tail, “rare” long delays may be frequent enough to impact customer experience.

Clusters and gaps

Clusters are regions with many observations; gaps are ranges with few or none.

  • Clusters can indicate subgroups, seasonality, or different operating regimes.
  • Gaps can indicate data collection issues, rounding/heaping, eligibility rules, or true separation between subpopulations.

Example: A gap in “age” between 18 and 21 might reflect a sampling rule (e.g., surveying only high school students and adults aged 21 and over) rather than a real-world absence.

Outliers: spotting them and thinking clearly about them

What counts as an outlier?

An outlier is an observation that is unusually far from the rest of the data. “Unusual” depends on context, sample size, and the distribution’s tail behavior.

Box plots often flag outliers using the 1.5×IQR rule:

  • Lower fence: Q1 - 1.5 * IQR
  • Upper fence: Q3 + 1.5 * IQR

Values beyond these fences are flagged as potential outliers. This is a screening rule, not a verdict.
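A minimal sketch of the fence calculation, assuming the "inclusive" quartile method from Python's `statistics` module. Quartile conventions differ between tools, so the fences can shift slightly. The delivery times and the helper `iqr_fences` are hypothetical.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical delivery times in minutes (invented for illustration).
times = [22, 25, 27, 28, 30, 31, 33, 35, 38, 120]

lower, upper = iqr_fences(times)
flagged = [t for t in times if t < lower or t > upper]

print(f"fences: {lower} to {upper}")
print("flagged for review:", flagged)
```

Here Q1 = 27.25 and Q3 = 34.5, so IQR = 7.25 and the fences are 16.375 and 45.375. Only the 120-minute delivery falls outside them, which marks it for investigation, not for deletion.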

How to investigate an outlier (step-by-step)

  • Step 1: Verify the value (data entry error, unit mismatch, misplaced decimal point, impossible value).
  • Step 2: Check units and transformations (e.g., seconds vs milliseconds; dollars vs cents).
  • Step 3: Look for a subgroup explanation (different device, region, cohort, time period).
  • Step 4: Decide how to handle it based on purpose: keep it (real extreme), correct it (error), or analyze with and without it (sensitivity check).
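The sensitivity check in Step 4 can be sketched as a side-by-side summary with and without the suspect value. The delivery times are invented, and the mean and median were chosen here only as two convenient summaries to compare.

```python
from statistics import mean, median

# Hypothetical delivery times in minutes (invented for illustration).
times = [22, 25, 27, 28, 30, 31, 33, 35, 38, 120]
suspect = 120

def summarize(values):
    """Return the sample size, mean, and median of the values."""
    return {"n": len(values), "mean": mean(values), "median": median(values)}

with_outlier = summarize(times)
without_outlier = summarize([t for t in times if t != suspect])

for label, s in (("with", with_outlier), ("without", without_outlier)):
    print(f"{label:>7}: n = {s['n']}, "
          f"mean = {s['mean']:.1f}, median = {s['median']}")
```

The mean moves by about 9 minutes while the median barely changes. If a conclusion depends on the mean, it depends on how this one delivery is handled, and that should be reported.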

Common misread: “Outliers should be removed.” Sometimes outliers are the most important cases (fraud, failures, rare adverse events). Removal needs a defensible reason tied to data quality or study scope.

Reading plots with a structured approach (worked examples)

Example A: Histogram of delivery times (minutes)

Imagine a histogram where most deliveries fall between 20 and 40 minutes, with a long tail extending to 120 minutes.

  • Variable + units: Delivery time (minutes).
  • Center: Typical deliveries around 30 minutes.
  • Spread: Many deliveries within about 20–40 minutes, but some much larger.
  • Shape: Unimodal and right-skewed (long tail to higher times).
  • Unusual features: A few very long deliveries (potential outliers). Investigate weather, distance, staffing, or system outages.

Interpretation tip: In right-skewed time variables, the tail often represents occasional disruptions. The “typical” experience and the “worst-case” experience can be very different.

Example B: Dot plot of exam scores (points)

Suppose a dot plot shows two clusters: one around 55–65 and another around 85–95, with few scores in 70–80.

  • Variable + units: Exam score (points).
  • Center: There may not be a single meaningful center; two groups suggest two centers.
  • Spread: Each cluster has its own spread; overall spread is large.
  • Shape: Bimodal (two peaks).
  • Unusual features: A gap around 70–80; check whether two versions of the exam existed, or whether one group had prior training.

Common misread: Summarizing a bimodal distribution with a single “typical” value can be misleading. The key story is the presence of two groups.

Example C: Box plot comparison across groups

Imagine side-by-side box plots of “daily steps” for two teams. Team A has a higher median and a tighter box; Team B has a lower median, wider box, and several high outliers.

  • Variable + units: Daily steps (steps/day).
  • Center: Team A’s typical steps are higher (median line is higher).
  • Spread: Team B is more variable (wider box and longer whiskers).
  • Shape (limited): Box plots hint at skew if the median is off-center in the box or whiskers are uneven, but they do not show modality.
  • Unusual features: Team B has several high outliers—maybe a few very active individuals or device syncing issues.

Practical check: If decisions depend on whether Team B truly has two subgroups (sedentary vs very active), follow up with histograms for each team.

Subtle but important issues when interpreting distributions

Heaping and rounding

Some variables show spikes at “round” numbers (e.g., weights ending in 0 or 5, ages reported as multiples of 5). This heaping can create artificial peaks and gaps.

  • What it looks like: Dot plot stacks or histogram bars that jump at round values.
  • What to do: Consider measurement process; if precision is low, avoid over-interpreting fine-grained structure.
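A rough screen for heaping is to compare the share of values that land on round numbers with the share you would expect without rounding. The sketch assumes whole-number measurements whose last digits would otherwise be roughly evenly spread, so about 20% would be multiples of 5. The weights are invented.

```python
def share_multiples(values, base=5):
    """Proportion of values that are exact multiples of `base`."""
    hits = sum(1 for v in values if v % base == 0)
    return hits / len(values)

# Hypothetical self-reported weights in kg (invented for illustration).
weights = [70, 75, 80, 72, 65, 85, 90, 68, 75, 80, 70, 95]

observed = share_multiples(weights, base=5)
expected = 1 / 5  # assumption: last digits evenly spread if there were no rounding

print(f"observed share of multiples of 5: {observed:.0%}")
print(f"expected without heaping:         {expected:.0%}")
```

A large excess over the expected share suggests that respondents rounded, so peaks at those values say more about reporting than about the quantity itself.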

Truncation and censoring

Sometimes the data cannot go beyond a limit due to measurement or rules (e.g., survey top-coded at “$200k+”, test scores capped at 100). This can create sharp cutoffs and apparent skew.

  • What it looks like: A pile-up at a maximum/minimum value or a sudden stop in the tail.
  • What to do: Note the limit explicitly; interpret tails cautiously.
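A pile-up can be quantified by counting how many observations sit exactly at the limit. The sketch uses invented test scores capped at 100, and `share_at_limit` is a hypothetical helper name.

```python
def share_at_limit(values, limit):
    """Proportion of observations recorded exactly at the limit."""
    return sum(1 for v in values if v == limit) / len(values)

# Hypothetical test scores with a ceiling of 100 (invented for illustration).
scores = [100, 100, 98, 100, 95, 100, 91, 100, 88, 100]

print(f"share at the cap: {share_at_limit(scores, 100):.0%}")
```

When a large share of observations sits at the cap, the upper tail is not observed at all, so statements about skew or about "the highest values" describe the measurement limit rather than the underlying variable.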

Sample size and “noisy” shape

With small samples, histograms can look jagged and misleading. Dot plots help because they show the actual observations. With large samples, small bumps may appear that are statistically real but practically irrelevant.

  • Small n: Prefer dot plots or fewer/wider bins; avoid strong claims about modality.
  • Large n: Focus on meaningful features (substantial clusters, consistent skew, important tail risk).

Quick reference: what each plot emphasizes

Plot | Shows individual values? | Best for | Common pitfall
Dot plot | Yes | Small datasets, gaps, repeats | Overplotting with large n
Histogram | No (binned) | Overall shape, tails, clusters | Bin choice changes appearance
Density plot | No (smoothed) | Comparing shapes, skew | Confusing height with probability at a point
Box plot | No (summary) | Comparing groups, outlier screening | Hides multimodality and fine structure

A practical “read any distribution” script you can reuse

When you face a new plot, write 2–5 sentences using this template:

  • Variable + units: “This plot shows the distribution of ____ measured in ____.”
  • Center: “A typical value is around ____.”
  • Spread: “Most values fall between ____ and ____ (approximately).”
  • Shape: “The distribution is unimodal/bimodal and roughly symmetric/right-skewed/left-skewed, with tails that are ____.”
  • Unusual features: “Notable features include ____ (outliers/gaps/clusters/truncation).”

Use a histogram or density plot to describe shape, and use a dot plot to verify suspicious gaps or outliers. Use box plots to compare multiple groups quickly, then drill down with histograms when shape details matter.

Now answer the exercise about the content:

When interpreting a density plot for a continuous variable, which statement best describes how to judge where observations are most common?

In a density plot, probability is represented by area under the curve over a range of values. The peak height alone does not mean most observations have that exact value.
