Clear Data Visualization in R with ggplot2

Chapter 6

The grammar of graphics in ggplot2

ggplot2 builds plots by combining layers. You start with a dataset, map variables to visual properties (aesthetics), choose a geometric representation (geom), and then refine scales, labels, and themes. This approach makes plots consistent and easy to extend.

Core structure: data + aes() + geoms

A typical ggplot has three parts: (1) a data frame, (2) an aesthetic mapping with aes(), and (3) one or more geoms. You can add layers with +.

library(ggplot2)
# Template you will reuse constantly
ggplot(data = df, aes(x = x_var, y = y_var)) + geom_point()
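
As a concrete instance of the template, the sketch below swaps the placeholders for the built-in mtcars data. The object names p_base and p_points are illustrative, and layer_data() is used here only to check what the point layer receives.

```r
library(ggplot2)

# Data and aesthetic mapping only: this draws axes but no marks yet
p_base <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# Adding a geom with + creates the first layer
p_points <- p_base + geom_point()

# One row per car is passed to the point layer
nrow(layer_data(p_points, 1))
```

Storing the base plot in a variable makes it easy to try several geoms on the same data and mapping.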

Aesthetics (aes): what you encode visually

Common aesthetics include x, y, color, fill, size, shape, alpha, and group. Use aesthetics to encode meaning, not decoration.

  • color is usually for points/lines; fill is for bars/areas/boxplots.
  • group controls how lines connect points and how summaries are computed per group.
  • Map aesthetics inside aes() when they depend on data; set aesthetics outside aes() when you want a constant.
# Mapped (varies by data): color depends on species
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point()
# Set (constant): all points are semi-transparent
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(alpha = 0.6)
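
A common slip is to put a constant inside aes(). The sketch below contrasts the two placements for a color value; p_set and p_mapped are illustrative names, and layer_data() is used only to inspect the colors ggplot2 ends up drawing.

```r
library(ggplot2)

# Constant set outside aes(): every point is drawn in blue, no legend
p_set <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "blue")

# Constant placed inside aes(): "blue" is treated as a one-level category,
# so ggplot2 picks a color from its default palette and adds a legend
p_mapped <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = "blue")) +
  geom_point()

unique(layer_data(p_set)$colour)     # the literal value that was set
unique(layer_data(p_mapped)$colour)  # a palette color, not "blue"
```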

Layers: adding geoms and statistical summaries

Layers let you combine raw data with summaries. Many geoms have a default statistical transformation (for example, histograms bin data). You can also add explicit model or summary layers.

# Scatter plot with a smooth trend line
ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(se = FALSE)

Common plot types for analysis

Histogram: distribution of a numeric variable

Use a histogram to understand shape, spread, skewness, and potential outliers. The key choice is the bin width (or number of bins). Too many bins adds noise; too few hides structure.

ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 12, color = "white")
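
To make the bin choice concrete, the sketch below builds the same histogram with a bin count and with a fixed bin width, then inspects the binned counts. The binwidth of 2.5 and the boundary of 10 are arbitrary values picked for illustration.

```r
library(ggplot2)

# Same data, two ways of controlling bins
p_bins <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 12, color = "white")
p_width <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5, boundary = 10, color = "white")

# The histogram layer plots binned counts, not raw values
binned <- layer_data(p_width)
binned[, c("xmin", "xmax", "count")]
sum(binned$count)  # every car falls in exactly one bin
```

A fixed binwidth tends to be easier to explain to readers because bin edges land on round numbers.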

To compare distributions across categories, consider faceting or a density plot. If you overlay histograms, use transparency and consistent binning.

ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) + geom_histogram(bins = 12, alpha = 0.6, position = "identity")
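
The two alternatives mentioned above could look like this. It is a minimal sketch: the single-column facet layout and the alpha value are choices made for illustration.

```r
library(ggplot2)

# Option 1: one histogram panel per cylinder count on a shared x-axis
p_facet <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 12, color = "white") +
  facet_wrap(~ cyl, ncol = 1)

# Option 2: overlaid density curves, one per cylinder count
p_density <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(alpha = 0.4) +
  labs(fill = "Cylinders")
```

Stacking the panels in one column keeps the x-axes aligned, which makes shifts between groups easier to see.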

Boxplot: compare distributions across categories

Boxplots summarize median, quartiles, and potential outliers, making them effective for comparing groups. They work best when categories are not too numerous.

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

To show individual observations along with the boxplot, add jittered points. This helps when sample sizes are small or when you want to see clustering.

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot(outlier.shape = NA) + geom_jitter(width = 0.15, alpha = 0.6)
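
Jitter is random, so the points move each time the plot is redrawn, and when height is not given geom_jitter() also applies a small default vertical offset. One possible variant, sketched below, uses position_jitter() with height = 0 and a fixed seed; the seed value of 1 is arbitrary.

```r
library(ggplot2)

p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(
    # height = 0 keeps each mpg value exact; seed fixes the horizontal offsets
    position = position_jitter(width = 0.15, height = 0, seed = 1),
    alpha = 0.6
  )

# The plotted y positions equal the raw data
jittered <- layer_data(p_box, 2)
all.equal(sort(jittered$y), sort(mtcars$mpg))
```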

Scatter plot: relationship between two numeric variables

Scatter plots are the default for exploring relationships, clusters, and outliers. Add color or shape for categories, and consider transparency when points overlap.

ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
# Encode a category and reduce overplotting
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) + geom_point(alpha = 0.7, size = 2)

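If you have groups that should form separate lines or trends, make sure the grouping is defined. Mapping a discrete variable to color, shape, or linetype sets the grouping automatically; set group explicitly when no such mapping exists or when the mapped variable is continuous.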

# Example: separate smooth per group
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) + geom_point(alpha = 0.7) + geom_smooth(se = FALSE)
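
Where an aesthetic is mapped decides which layers are grouped by it. The sketch below contrasts a trend per cylinder group with a single overall trend. It uses method = "lm" instead of the default smoother, an assumption made here because straight-line fits are more stable on groups this small.

```r
library(ggplot2)

# color mapped in ggplot(): every layer inherits it, so one trend per group
p_by_group <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color mapped only in geom_point(): the smooth layer sees a single group
p_overall <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(color = factor(cyl)), alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "gray30")

length(unique(layer_data(p_by_group, 2)$group))  # number of fitted lines
length(unique(layer_data(p_overall, 2)$group))
```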

Line chart: change over an ordered variable

Line charts are best for time or any ordered x-axis. The most common issue is incorrect grouping: if you have multiple series, you must map group (and usually color) to the series identifier.

# Example dataset for line charts
economics_small <- economics[economics$date >= as.Date("2000-01-01"), ]
ggplot(economics_small, aes(x = date, y = unemploy)) + geom_line()

For multiple series, your data should be in a long format with a column indicating the series. Then map that column to color and group.

# Minimal example of long data for multiple series
df_long <- data.frame(
  date = rep(as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")), 2),
  series = rep(c("A", "B"), each = 3),
  value = c(10, 12, 11, 7, 9, 10)
)
ggplot(df_long, aes(date, value, color = series, group = series)) + geom_line(linewidth = 1)
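
Data often arrives in wide format, with one column per series. The sketch below shows one base R way to reshape it into long format using stack(). The table df_wide is hypothetical, and packages such as tidyr offer alternatives.

```r
library(ggplot2)

# Hypothetical wide table: one column per series
df_wide <- data.frame(
  date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  A = c(10, 12, 11),
  B = c(7, 9, 10)
)

# stack() puts the values in one column and the source column names in another
stacked <- stack(df_wide[c("A", "B")])

df_long <- data.frame(
  date   = rep(df_wide$date, times = 2),
  series = as.character(stacked$ind),
  value  = stacked$values
)

ggplot(df_long, aes(date, value, color = series, group = series)) +
  geom_line(linewidth = 1)
```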

Encoding categories and groups effectively

Choose the right channel: color, shape, linetype, and facets

  • Use position (separate panels or separate x positions) when exact comparisons matter.
  • Use color for a small number of categories; avoid relying on color alone when categories are many.
  • Use shape for points when printing in grayscale, but keep categories limited (shapes are hard to distinguish beyond 5–6).
  • Use linetype for multiple lines when color is not available.
  • Use facets when overlays become cluttered.
# Color + shape for a small number of groups
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species, shape = Species)) + geom_point(size = 2, alpha = 0.8)

Ordering categories for readability

When categories have a natural order (or you want to rank them), reorder the x-axis so the plot reads clearly. A common approach is to reorder by a summary statistic such as the median.

# Reorder cylinders by median mpg (base R approach without extra packages)
mtcars$cyl_f <- factor(mtcars$cyl)
med_by_cyl <- tapply(mtcars$mpg, mtcars$cyl_f, median)
mtcars$cyl_f <- factor(mtcars$cyl_f, levels = names(sort(med_by_cyl)))
ggplot(mtcars, aes(cyl_f, mpg)) + geom_boxplot()
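
A more compact alternative is reorder() from base R, which can be called inside aes() so the data frame is left unchanged. This is a sketch of the same median-based ordering; the axis label text is illustrative.

```r
library(ggplot2)

# reorder() sorts the factor levels by the median mpg within each level
cyl_by_median <- reorder(factor(mtcars$cyl), mtcars$mpg, FUN = median)
levels(cyl_by_median)

# The same call works inside aes(), leaving mtcars unmodified
ggplot(mtcars, aes(x = reorder(factor(cyl), mpg, FUN = median), y = mpg)) +
  geom_boxplot() +
  labs(x = "Cylinders (ordered by median mpg)")
```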

Scales, labels, and themes for clarity

Labels: make the plot self-explanatory

Use labs() to set a clear title, subtitle (optional), axis labels, and legend titles. Prefer specific units and meaningful names.

ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Fuel efficiency decreases as weight increases",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon (mpg)",
    color = "Cylinders"
  )

Scales: control breaks, limits, and transformations

Scales control how data values map to visual values. Use them to set axis limits, tick marks, and transformations. Avoid trimming data with xlim()/ylim() if you want statistics (like smoothing) computed on the full data; instead, use coord_cartesian() to zoom.

# Zoom without dropping data from computations
ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(se = FALSE) + coord_cartesian(xlim = c(1.5, 5.0), ylim = c(10, 35))
# Control axis tick marks
ggplot(mtcars, aes(wt, mpg)) + geom_point() + scale_x_continuous(breaks = seq(1.5, 5.5, by = 1))
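
The difference between scale limits and coordinate limits can be checked numerically. The sketch below fits a straight line in both cases and recovers its slope from the built plot. The helper slope_of() and the limits of 2 to 4 are assumptions made for this illustration.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Scale limits: cars outside 2 to 4 are dropped before the line is fitted
p_trim <- base_plot + scale_x_continuous(limits = c(2, 4))

# Coordinate limits: the line is fitted to all cars, then the view is cropped
p_zoom <- base_plot + coord_cartesian(xlim = c(2, 4))

# Recover the slope of each fitted line from the built smooth layer
slope_of <- function(p) {
  fitted_line <- suppressWarnings(layer_data(p, 2))
  unname(coef(lm(y ~ x, data = fitted_line))["x"])
}

slope_trim <- slope_of(p_trim)
slope_zoom <- slope_of(p_zoom)
slope_full <- unname(coef(lm(mpg ~ wt, data = mtcars))["wt"])
```

Here slope_zoom matches the slope fitted to the full data, while slope_trim reflects only the cars inside the limits.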

Use log scales when values span orders of magnitude, but label clearly so readers understand the transformation.

# Example pattern (use when appropriate for your data)
ggplot(economics_small, aes(date, uempmed)) + geom_line() + scale_y_log10()
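
One way to label the transformation is to keep tick labels in original units and say so in the axis title. This is a sketch under assumptions: the break values and label text are chosen for illustration, and econ_recent repeats the date filter used earlier.

```r
library(ggplot2)

econ_recent <- economics[economics$date >= as.Date("2000-01-01"), ]

p_log <- ggplot(econ_recent, aes(date, uempmed)) +
  geom_line() +
  # Ticks are placed at round values on the original scale
  scale_y_log10(breaks = c(5, 10, 15, 20, 25)) +
  labs(
    x = NULL,
    y = "Median unemployment duration (weeks, log scale)"
  )

# Internally the y values are log10-transformed; only the tick labels
# are shown in original units
range(layer_data(p_log)$y)
```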

Color choices: readable and accessible

Prefer palettes that remain distinguishable for color-vision deficiencies and that print well. For categorical data, use discrete palettes; for continuous data, use perceptually uniform gradients.

# Discrete palette (built-in)
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) + geom_point(size = 2) + scale_color_brewer(palette = "Dark2")
# Continuous palette (built-in)
ggplot(mtcars, aes(wt, mpg, color = hp)) + geom_point(size = 2) + scale_color_viridis_c()
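
If a specific colorblind-friendly palette is wanted, one option is the Okabe-Ito set that ships with base R. The sketch below assumes R 4.0.0 or later for palette.colors(); skipping the first color (black) is a choice made here, not a requirement.

```r
library(ggplot2)

# Okabe-Ito colors from base R (palette.colors() needs R >= 4.0.0).
# Drop the names so ggplot2 assigns colors by level order, and skip black.
okabe_ito <- unname(palette.colors(4, palette = "Okabe-Ito"))[-1]

p_cvd <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  scale_color_manual(values = okabe_ito) +
  labs(color = "Cylinders")
```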

Use alpha to reduce overplotting, and avoid using too many saturated colors at once.

Themes: reduce clutter and standardize style

Themes control non-data ink: background, grid lines, fonts, and spacing. Start with a clean theme and then adjust a few elements consistently across figures.

base_theme <- theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    legend.position = "right"
  )
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) + geom_point(alpha = 0.8) + base_theme

Use theme() to improve readability: rotate crowded axis labels, increase title size, or move the legend.

ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  base_theme +
  theme(
    # Rotate x-axis labels (useful when categories are long or numerous)
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold")
  ) +
  labs(title = "MPG by cylinder count", x = "Cylinders", y = "MPG")

Faceting for comparisons

Faceting creates small multiples: the same plot repeated for subsets of the data. This is often clearer than overlaying many groups in one panel.

Facet by one variable

ggplot(mtcars, aes(wt, mpg)) + geom_point(alpha = 0.7) + facet_wrap(~ cyl) + base_theme

Facet by two variables

Use facet_grid() when you want a grid defined by two variables (rows and columns). This is most useful when both variables have a small number of levels.

# Example with iris: rows by Species, columns by a derived category
iris$SepalWide <- ifelse(iris$Sepal.Width >= median(iris$Sepal.Width), "Wide", "Narrow")
ggplot(iris, aes(Sepal.Length, Petal.Length)) + geom_point(alpha = 0.7) + facet_grid(Species ~ SepalWide) + base_theme

Hands-on: build a visualization set (3 plots) and refine them

In this exercise you will create three plots that answer specific questions. Use the same theme, consistent labels, and export settings so the figures look like a coherent set.

Setup: define a shared style

library(ggplot2)
base_theme <- theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

Plot 1 (distribution question): How is fuel efficiency distributed?

Question: Is mpg roughly symmetric, skewed, or multi-modal?

Step-by-step: start with a histogram, choose bins, then refine labels and styling.

p1 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 12, color = "white", fill = "steelblue") +
  labs(
    title = "Distribution of fuel efficiency",
    x = "Miles per gallon (mpg)",
    y = "Number of cars"
  ) +
  base_theme
p1

Plot 2 (group comparison question): Do cylinder groups differ in mpg?

Question: Which cylinder group tends to have higher mpg, and how variable is it?

Step-by-step: boxplot for group comparison, then add jitter to show sample size and spread.

p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.15, alpha = 0.6, size = 1.8) +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title = "MPG differs by cylinder count",
    x = "Cylinders",
    y = "Miles per gallon (mpg)",
    fill = "Cylinders"
  ) +
  base_theme +
  theme(legend.position = "none")
p2

Plot 3 (relationship question): How does weight relate to mpg, and does it differ by cylinders?

Question: Is there a negative relationship between wt and mpg, and do groups behave differently?

Step-by-step: scatter plot, encode groups by color, add a trend line, then facet if overlays feel crowded.

p3 <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(alpha = 0.75, size = 2) +
  geom_smooth(se = FALSE, linewidth = 0.9) +
  scale_color_brewer(palette = "Dark2") +
  labs(
    title = "Heavier cars tend to have lower MPG",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon (mpg)",
    color = "Cylinders"
  ) +
  base_theme
p3

If the colored overlays are hard to read, facet by cylinder count and remove the legend.

p3_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.75, size = 2, color = "steelblue") +
  geom_smooth(se = FALSE, linewidth = 0.9, color = "gray30") +
  facet_wrap(~ cyl) +
  labs(
    title = "Weight vs MPG by cylinder group",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon (mpg)"
  ) +
  base_theme
p3_facet

Refinement checklist (apply to all three plots)

  • Does the title state the main takeaway rather than restating the axes?
  • Are axis labels specific and do they include units where relevant?
  • Is the legend necessary, and is its title meaningful?
  • Are colors distinguishable and not overly saturated?
  • Is there overplotting (if yes, use alpha, jitter, or faceting)?
  • Are scales appropriate (consider coord_cartesian() to zoom, or a transformation if needed)?
  • Is the theme consistent across figures (fonts, grid lines, spacing)?

Export figures with consistent sizing and resolution

Use ggsave() to export plots. Decide on a consistent width and height (in inches) and a resolution (dpi). For reports and slides, 300 dpi is a common choice for raster formats like PNG. For vector output (PDF/SVG), dpi is less relevant, but sizing still matters.

Export PNG with consistent dimensions

# Create an output folder if needed
dir.create("figures", showWarnings = FALSE)
ggsave(filename = "figures/plot1_mpg_hist.png", plot = p1, width = 7, height = 4.5, units = "in", dpi = 300)
ggsave(filename = "figures/plot2_mpg_by_cyl.png", plot = p2, width = 7, height = 4.5, units = "in", dpi = 300)
ggsave(filename = "figures/plot3_wt_vs_mpg.png", plot = p3, width = 7, height = 4.5, units = "in", dpi = 300)

Export vector formats for crisp text

ggsave(filename = "figures/plot3_wt_vs_mpg.pdf", plot = p3, width = 7, height = 4.5, units = "in")
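
To avoid repeating the size and resolution in every call, the shared settings could be wrapped in a small helper. The function save_figure() below is hypothetical and not part of ggplot2. The sketch uses stand-in plots and a temporary folder so it runs on its own, and it writes PDF files only to show that the same helper covers vector output.

```r
library(ggplot2)

# Hypothetical helper: one place to hold the shared export settings
save_figure <- function(plot, name, dir = "figures", ext = "png") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(dir, paste0(name, ".", ext))
  ggsave(filename = path, plot = plot,
         width = 7, height = 4.5, units = "in", dpi = 300)
  path
}

# Stand-in plots so the sketch runs on its own; use p1, p2, p3 in practice
plots <- list(
  mpg_hist  = ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 12),
  wt_vs_mpg = ggplot(mtcars, aes(wt, mpg)) + geom_point()
)

# Written to a temporary folder here to avoid touching the project directory
out_dir <- file.path(tempdir(), "figures")
saved <- vapply(names(plots), function(nm) {
  save_figure(plots[[nm]], nm, dir = out_dir, ext = "pdf")
}, character(1))
```

Calling save_figure(p1, "plot1_mpg_hist") with the defaults would write a 300 dpi PNG into the figures folder, matching the calls above.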

Keep the same size across plots so they align neatly in documents. If you later change base_size in your theme, re-export all figures to maintain consistent typography.

Now answer the exercise about the content:

In ggplot2, how should you apply an aesthetic like alpha when you want every point to have the same transparency regardless of the data?

Map aesthetics inside aes() only when they depend on data. If you want a constant value (like the same transparency for all points), set it outside aes(), e.g., geom_point(alpha = 0.6).
