Three Common Relationship Setups
When you ask whether two variables “move together,” the right tool depends on the variable types. The goal is to describe patterns (if any), quantify them when appropriate, and avoid causal claims unless the design supports them.
| Variable types | Primary display | Typical question | Common numeric summary |
|---|---|---|---|
| Quantitative–Quantitative | Scatterplot | As X increases, what tends to happen to Y? | Covariance, correlation, regression (later) |
| Categorical–Quantitative | Grouped summaries (side-by-side boxplots, dotplots) | Do groups differ in typical value or spread? | Group means/medians, differences, effect sizes |
| Categorical–Categorical | Contingency table (and segmented bar chart) | Are categories associated? | Conditional proportions, risk ratio/odds ratio (later) |
Quantitative–Quantitative: Exploring with Scatterplots
How to build and read a scatterplot
A scatterplot places one quantitative variable on the x-axis and the other on the y-axis. Each point is one observational unit (person, day, product, etc.).
- Step 1: Choose axes intentionally. Put the variable you think may explain or precede the other on the x-axis (often called the “explanatory” variable), and the outcome on the y-axis.
- Step 2: Plot points and scan the overall pattern. Look for direction (up/down), form (linear/curved), and strength (tight/loose).
- Step 3: Check for unusual points. Identify outliers (unusual y-values), high-leverage points (extreme x-values), and clusters.
- Step 4: Consider context and measurement. Ask whether the pattern could be driven by a third variable, restricted range, or mixing different subgroups.
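A minimal sketch of these steps in Python (the data and the variable names hours_studied and exam_score are made up for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical data: one point per student
hours_studied = [1, 2, 2.5, 3, 4, 5, 5.5, 6, 7, 8]       # explanatory variable (x-axis)
exam_score = [52, 55, 60, 58, 66, 70, 68, 75, 80, 84]     # outcome (y-axis)

plt.scatter(hours_studied, exam_score)
plt.xlabel("Hours studied (explanatory)")
plt.ylabel("Exam score (outcome)")
plt.title("Each point is one student")
plt.show()
```

Reading the result follows Steps 2–4: scan direction, form, and strength, then look for unusual points before computing any summary.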
What to describe: direction, form, strength
- Direction: Positive (higher X tends to go with higher Y) or negative (higher X tends to go with lower Y).
- Form: Linear (points follow a roughly straight trend) or nonlinear (curved, threshold, U-shaped).
- Strength: How tightly points cluster around the form. Strong relationships show little scatter around the trend; weak relationships show wide scatter.
Spotting nonlinearity (why correlation can miss it)
Correlation summarizes linear association. A strong curved relationship can have a correlation near 0 even though X and Y are clearly related.
Common nonlinear patterns to watch for:
- Curved increase/decrease: diminishing returns (fast rise then leveling off).
- U-shape or inverted U: outcomes best at moderate values of X.
- Threshold: little change until X passes a point, then rapid change.
Practical check: If the scatterplot looks curved, do not rely on correlation alone. Describe the shape and consider transforming variables or using nonlinear modeling (later topics).
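A quick simulated illustration of this pitfall (the data are artificial): Y depends strongly on X through a U-shaped curve, yet the Pearson correlation comes out close to 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.size)   # strong but U-shaped relationship

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.2f}")   # near 0 despite the clear pattern a scatterplot would show
```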
Outliers and influential points
Outliers can distort summaries and create misleading impressions.
- Vertical outlier: unusual Y for its X. It may inflate scatter and weaken correlation.
- High-leverage point: extreme X value. It can strongly affect correlation and any fitted line.
- Influential point: a point whose removal noticeably changes the perceived trend.
Step-by-step sensitivity check:
- Identify suspicious points visually.
- Compute correlation with all points.
- Recompute correlation after removing the point(s) temporarily.
- If results change a lot, report that the association is sensitive to those observations and investigate data quality or subgroup differences.
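A sketch of this sensitivity check with made-up data, where the last pair is a high-leverage point:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 20.0])   # last x-value is extreme
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.0, 2.0])

r_all = np.corrcoef(x, y)[0, 1]
r_without = np.corrcoef(x[:-1], y[:-1])[0, 1]

print(f"r with all points:        {r_all:.2f}")      # weak and negative
print(f"r without the last point: {r_without:.2f}")  # strong and positive
```

If the two values differ this much, report the sensitivity rather than quietly dropping the point.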
Covariance and Correlation: Numeric Summaries of Linear Association
Covariance: direction of joint variation
Covariance measures whether X and Y tend to be above their means together (positive) or one above while the other is below (negative). For a dataset with pairs (x_i, y_i):
cov(X, Y) = Σ (x_i - x̄)(y_i - ȳ) / (n - 1)

- Sign: Positive covariance suggests a positive linear tendency; negative suggests a negative linear tendency.
- Limitation: The magnitude depends on the units of X and Y, so it is hard to interpret across different scales.
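A minimal check of the formula against NumPy's built-in covariance (the numbers are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

cov_by_hand = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
cov_numpy = np.cov(x, y)[0, 1]       # np.cov also divides by n - 1 by default
print(cov_by_hand, cov_numpy)        # same value; its units are (units of X) x (units of Y)
```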
Correlation: standardized linear association
The Pearson correlation coefficient r standardizes covariance so it is unitless and always between −1 and +1:
r = cov(X, Y) / (s_X s_Y)

- Direction: r > 0 is positive; r < 0 is negative.
- Strength (linear): values closer to −1 or +1 indicate a stronger linear relationship; values near 0 indicate a weak linear relationship.
- Interpretation is about tendency, not certainty: even with r = 0.8, points can be scattered; correlation does not mean “Y is determined by X.”
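Continuing the small covariance example, a sketch showing that r is simply covariance rescaled by the two standard deviations:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]
r_by_hand = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # divide by s_X * s_Y
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_by_hand, r_numpy)            # same unitless value, always in [-1, 1]
```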
Limitations you must state when using correlation
- Only linear: r can be near 0 for strong nonlinear relationships.
- Outlier sensitivity: a single extreme point can change r substantially.
- Range restriction: if X varies only a little in your sample, correlation can look weak even if a relationship exists in the broader population.
- Mixing groups: combining distinct subgroups can hide or reverse patterns (a form of confounding; see below).
- Not causation: correlation alone cannot establish that changing X would change Y.
A practical “strength” guide (use cautiously)
There is no universal cutoff, but a common rough guide for linear association is:
- |r| ≈ 0.0–0.2: very weak linear association
- |r| ≈ 0.2–0.4: weak
- |r| ≈ 0.4–0.6: moderate
- |r| ≈ 0.6–0.8: strong
- |r| ≈ 0.8–1.0: very strong
Always pair r with a scatterplot and a brief note about outliers/nonlinearity.
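One possible way to encode this rough guide; the cutoffs are conventions rather than rules, so treat the labels only as a starting point for the written description:

```python
def describe_strength(r: float) -> str:
    """Rough verbal label for the strength of a linear association, based on |r|."""
    a = abs(r)
    if a < 0.2:
        return "very weak"
    if a < 0.4:
        return "weak"
    if a < 0.6:
        return "moderate"
    if a < 0.8:
        return "strong"
    return "very strong"

print(describe_strength(-0.55))   # "moderate"
```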
Categorical–Quantitative: Grouped Summaries (Comparing Distributions Across Groups)
What to use
When one variable is categorical (groups) and the other is quantitative (a measurement), you explore how the quantitative values differ by group.
- Visual: side-by-side boxplots, dotplots, or violin plots (if available).
- Numeric: group means/medians, group standard deviations/IQRs, and differences between groups.
Step-by-step workflow
- Step 1: List groups and sample sizes. Small groups can look noisy; note imbalance.
- Step 2: Compute a summary per group. Often mean (for roughly symmetric data) or median (if skewed or with outliers).
- Step 3: Compare centers. Report differences (e.g., mean difference) in the original units.
- Step 4: Compare spreads and overlap. Similar centers can hide different variability; heavy overlap suggests weak separation.
- Step 5: Look for outliers within groups. A few extreme values can shift means and create misleading group differences.
Example interpretation language
Suppose you compare commute time (minutes) across work mode (car, transit, bike). A careful description might be:
- “Median commute time is highest for transit and lowest for bike; car is in between.”
- “Transit shows the widest spread, suggesting commute times vary more for transit riders.”
- “There is substantial overlap between car and transit times, so mode alone does not strongly separate commute time for individuals.”
Notice this describes differences without claiming mode causes commute time; distance to work could confound the comparison.
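A sketch of the Steps 1–5 workflow applied to made-up commute data in pandas (the column names mode and minutes are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "mode": ["car"] * 5 + ["transit"] * 5 + ["bike"] * 5,
    "minutes": [25, 30, 28, 40, 35,      # car
                35, 50, 20, 60, 45,      # transit
                15, 20, 18, 22, 25],     # bike
})

# Steps 1-4: group sizes, centers, and spreads
print(df.groupby("mode")["minutes"].agg(["count", "mean", "median", "std"]))

# Step 4 continued: quartiles per group (for IQR and overlap)
print(df.groupby("mode")["minutes"].quantile([0.25, 0.75]))
```

Step 5 (outliers within groups) is easiest to check visually with side-by-side boxplots.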
Categorical–Categorical: Contingency Tables and Conditional Proportions
Contingency tables: counts arranged by categories
A contingency table cross-classifies two categorical variables. The key to assessing association is to compare conditional proportions (percentages within a row or within a column), not just raw counts.
Step-by-step: reading association from a table
- Step 1: Decide the conditioning direction. Ask: “Within each level of A, what is the distribution of B?” Choose rows as A and compute row percentages (or vice versa).
- Step 2: Compute conditional proportions. For each row, divide each cell by the row total.
- Step 3: Compare conditional distributions across rows. If the conditional proportions are similar across rows, there is little association. If they differ, there is association.
- Step 4: Report differences in proportions. Use percentage-point differences (e.g., 18% vs 10% is an 8-point difference).
Mini example (structure)
Imagine smoking status (smoker/non-smoker) and cough (yes/no). You would compare:
- P(cough = yes | smoker) versus P(cough = yes | non-smoker)
If these conditional proportions differ meaningfully, smoking status and cough are associated in the sample.
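A sketch of this comparison in pandas with hypothetical counts (the numbers are invented for illustration only):

```python
import pandas as pd

table = pd.DataFrame(
    {"cough_yes": [36, 20], "cough_no": [84, 160]},
    index=["smoker", "non-smoker"],
)

row_pct = table.div(table.sum(axis=1), axis=0)   # condition on smoking status (row percentages)
print(row_pct)
# P(cough = yes | smoker)     = 36 / 120 = 0.30
# P(cough = yes | non-smoker) = 20 / 180 ≈ 0.11
# A difference of roughly 19 percentage points suggests association in this sample.
```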
Association Is Not Causation
Why a pattern can be real but not causal
An observed association can occur even when X does not cause Y. Common reasons:
- Confounding: a third variable influences both X and Y, creating or masking an association.
- Reverse causation: Y influences X (the direction is opposite of what you assumed).
- Selection bias: the way data were collected makes the sample unrepresentative in a way related to both variables.
- Coincidence (random variation): especially in small samples.
Confounding: the “third variable” problem
A confounder is a variable Z that is related to both X and Y and can produce a misleading relationship between X and Y.
Classic structure:
- Z affects (or is associated with) X
- Z affects (or is associated with) Y
- When you ignore Z, X and Y appear associated (or the association changes)
How confounding can create or reverse patterns
Example scenario (conceptual): Suppose you observe that people who carry lighters have higher rates of lung disease. Carrying a lighter does not cause lung disease; smoking is a confounder because it increases the chance of carrying a lighter and increases lung disease risk.
Simpson’s paradox (pattern reversal): An overall association can reverse when you stratify by a confounder. For example, a treatment might look worse overall but better within each severity group if severity influences both treatment choice and outcomes.
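A hypothetical set of counts (invented for illustration) that produces this reversal: treatment X has the higher recovery rate within each severity group but the lower rate overall, because severe cases mostly received X.

```python
# (recovered, total) for each severity/treatment combination; made-up numbers
counts = {
    ("mild", "X"): (9, 10),    ("mild", "Y"): (70, 80),
    ("severe", "X"): (30, 90), ("severe", "Y"): (5, 20),
}

for severity in ["mild", "severe"]:
    for treatment in ["X", "Y"]:
        rec, tot = counts[(severity, treatment)]
        print(f"{severity:>6} {treatment}: {rec / tot:.0%}")   # X wins in each stratum

for treatment in ["X", "Y"]:
    rec = sum(r for (_, t), (r, n) in counts.items() if t == treatment)
    tot = sum(n for (_, t), (r, n) in counts.items() if t == treatment)
    print(f"overall {treatment}: {rec / tot:.0%}")             # Y wins overall
```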
Step-by-step: checking for confounding in practice
- Step 1: Brainstorm plausible confounders. Ask: “What could influence both variables?” (age, baseline severity, income, season, location, prior behavior).
- Step 2: Stratify or adjust. Compare the relationship within levels of the confounder (e.g., separate scatterplots by group, or separate contingency tables by strata).
- Step 3: Compare within-stratum patterns to the overall pattern. If the association weakens, disappears, or reverses, confounding is likely.
- Step 4: Report cautiously. State that the observed association may be explained by the confounder and that causal interpretation is not supported without stronger design (e.g., randomization).
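A simulated sketch of Steps 2–3: the grouping variable acts like a confounder, so the pooled correlation is strongly positive even though there is essentially no X–Y relationship within either group.

```python
import numpy as np

rng = np.random.default_rng(1)
x_a, y_a = rng.normal(0, 1, 200), rng.normal(0, 1, 200)   # group A: low X and low Y
x_b, y_b = rng.normal(4, 1, 200), rng.normal(4, 1, 200)   # group B: high X and high Y

r_pooled = np.corrcoef(np.concatenate([x_a, x_b]),
                       np.concatenate([y_a, y_b]))[0, 1]

print(f"pooled r:  {r_pooled:.2f}")                     # clearly positive
print(f"within A:  {np.corrcoef(x_a, y_a)[0, 1]:.2f}")  # near 0
print(f"within B:  {np.corrcoef(x_b, y_b)[0, 1]:.2f}")  # near 0
```

Because the association weakens to near zero within each stratum, the pooled pattern should be reported as potentially confounded by group.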
Guided Interpretation Templates (Avoiding Overclaiming)
Template: interpreting a scatterplot
Use this fill-in structure to describe what you see without implying causality:
- Context: “Each point represents [unit], with X = [x variable] and Y = [y variable].”
- Direction: “There is a [positive/negative/no clear] association: as X increases, Y tends to [increase/decrease/stay similar].”
- Form: “The pattern appears [roughly linear/curved/U-shaped/threshold].”
- Strength: “The relationship is [weak/moderate/strong] because points are [tightly/loosely] clustered around the pattern.”
- Unusual features: “There are [no/some] outliers/high-leverage points around [describe region], which may affect summaries.”
- Caution: “This is an association in the observed data; it does not by itself show that X causes Y. A possible confounder is [Z].”
Template: interpreting a correlation value
When reporting r, pair it with a scatterplot description:
- Report: “The Pearson correlation between X and Y is r = [value].”
- Direction: “This indicates a [positive/negative] linear association.”
- Strength (linear): “The linear association is [weak/moderate/strong] in this sample.”
- Scope: “This summarizes linear association only; it may miss nonlinear patterns.”
- Robustness: “The scatterplot shows [no/some] outliers/high-leverage points; correlation may be [stable/sensitive] to them.”
- No causality claim: “Correlation does not imply causation; the association could be influenced by [potential confounder].”
Template: interpreting grouped summaries (categorical–quantitative)
- Setup: “We compare Y across groups defined by X.”
- Centers: “Typical Y is [highest/lowest] in group [name]; the difference in [mean/median] between [A] and [B] is about [value] [units].”
- Spread/overlap: “Groups show [similar/different] variability, with [substantial/little] overlap.”
- Caution: “Group differences are descriptive; they may reflect confounding by [Z].”
Template: interpreting a contingency table (categorical–categorical)
- Conditioning: “Within each level of X, we examine the proportion in each category of Y.”
- Key comparison: “The proportion of Y = [category] is [p1] for X = [level1] versus [p2] for X = [level2] (difference: [p1 − p2] percentage points).”
- Association statement: “Because these conditional proportions differ, X and Y are associated in this sample.”
- Caution: “This does not establish causation; a confounder such as [Z] could explain the pattern.”