The grammar of graphics in ggplot2
ggplot2 builds plots by combining layers. You start with a dataset, map variables to visual properties (aesthetics), choose a geometric representation (geom), and then refine scales, labels, and themes. This approach makes plots consistent and easy to extend.
Core structure: data + aes() + geoms
A typical ggplot has three parts: (1) a data frame, (2) an aesthetic mapping with aes(), and (3) one or more geoms. You can add layers with +.
library(ggplot2)

# Template you will reuse constantly
ggplot(data = df, aes(x = x_var, y = y_var)) +
  geom_point()

Aesthetics (aes): what you encode visually
Common aesthetics include x, y, color, fill, size, shape, alpha, and group. Use aesthetics to encode meaning, not decoration.
- color is usually for points/lines; fill is for bars/areas/boxplots.
- group controls how lines connect points and how summaries are computed per group.
- Map aesthetics inside aes() when they depend on data; set aesthetics outside aes() when you want a constant.
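A common slip with the mapped-versus-set rule is placing a constant inside aes(). The following is a minimal sketch, assuming the built-in mtcars data; the object names p_mapped and p_set are illustrative only.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as a data value with a single level, so
# ggplot2 assigns a colour from its default discrete palette and adds a legend.
p_mapped <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(color = "blue"))

# Outside aes(): "blue" is taken literally as the colour of every point.
p_set <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(color = "blue")

p_mapped
p_set
```

If a plot shows an unexpected colour and a one-entry legend, a constant inside aes() is a likely cause.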
# Mapped (varies by data): color depends on species
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Set (constant): all points are semi-transparent
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(alpha = 0.6)

Layers: adding geoms and statistical summaries
Layers let you combine raw data with summaries. Many geoms have a default statistical transformation (for example, histograms bin data). You can also add explicit model or summary layers.
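One way to add an explicit summary layer is stat_summary(). The sketch below assumes the built-in mtcars data and overlays the group mean on the raw points; the choice of mean and the name p_summary are illustrative.

```r
library(ggplot2)

# Raw observations plus an explicit summary layer: one mean per cylinder group
p_summary <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(alpha = 0.4) +
  stat_summary(fun = mean, geom = "point", color = "firebrick", size = 3)

p_summary
```

The raw points stay visible, so the reader can judge how well each mean represents its group.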
# Scatter plot with a smooth trend line
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(se = FALSE)

Common plot types for analysis
Histogram: distribution of a numeric variable
Use a histogram to understand shape, spread, skewness, and potential outliers. The key choice is the bin width (or number of bins). Too many bins add noise; too few hide structure.
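To see the effect of this choice, it can help to draw the same data with two bin widths. This is a sketch assuming the built-in mtcars data; the widths of 1 and 5 mpg and the boundary value of 10 are arbitrary illustrative choices.

```r
library(ggplot2)

# Narrow bins (1 mpg wide): more detail, but also more noise
p_fine <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1, boundary = 10, color = "white")

# Wide bins (5 mpg wide): smoother, but fine structure can disappear
p_coarse <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, boundary = 10, color = "white")

p_fine
p_coarse
```

Setting binwidth in data units, rather than bins, often makes the choice easier to explain in a caption.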
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 12, color = "white")

To compare distributions across categories, consider faceting or a density plot. If you overlay histograms, use transparency and consistent binning.
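The density option could look roughly like this. It is a sketch assuming the built-in mtcars data; the alpha value and the name p_density are illustrative.

```r
library(ggplot2)

# One semi-transparent density curve per cylinder group
p_density <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(alpha = 0.4) +
  labs(fill = "Cylinders")

p_density
```

With groups this small the curves depend heavily on the smoothing bandwidth, so treat their shape as a rough guide.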
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_histogram(bins = 12, alpha = 0.6, position = "identity")

Boxplot: compare distributions across categories
Boxplots summarize median, quartiles, and potential outliers, making them effective for comparing groups. They work best when categories are not too numerous.
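To check what a boxplot summarizes, the quartiles can be computed directly in base R. This sketch assumes the built-in mtcars data and the default quantile() method, which should correspond to the box edges and middle line; box_stats is an illustrative name.

```r
# Lower quartile, median, and upper quartile of mpg for each cylinder group
box_stats <- sapply(
  split(mtcars$mpg, mtcars$cyl),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

# Rows are the quantiles, columns are the cylinder groups (4, 6, 8)
round(box_stats, 1)
```

Comparing these numbers with the drawn boxes is a quick sanity check before interpreting group differences.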
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

To show individual observations along with the boxplot, add jittered points. This helps when sample sizes are small or when you want to see clustering.
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.6)

Scatter plot: relationship between two numeric variables
Scatter plots are the default for exploring relationships, clusters, and outliers. Add color or shape for categories, and consider transparency when points overlap.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Encode a category and reduce overplotting
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(alpha = 0.7, size = 2)

If you have groups that should form separate lines or trends, specify group (often implied by color, but not always).
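A case where grouping is not implied is a numeric grouping variable with no color mapping. The sketch below assumes the built-in mtcars data and uses a linear fit purely for illustration; p_groups is an illustrative name.

```r
library(ggplot2)

# cyl is numeric here, so it does not create groups on its own.
# Mapping it to group gives one trend line per cylinder count, all in one colour.
p_groups <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(alpha = 0.7) +
  geom_smooth(
    aes(group = cyl),
    method = "lm", formula = y ~ x,
    se = FALSE, color = "gray30"
  )

p_groups
```

Without the group mapping, the same code would fit a single line through all cars.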
# Example: separate smooth per group
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(alpha = 0.7) +
  geom_smooth(se = FALSE)

Line chart: change over an ordered variable
Line charts are best for time or any ordered x-axis. The most common issue is incorrect grouping: if you have multiple series, you must map group (and usually color) to the series identifier.
# Example dataset for line charts
economics_small <- economics[economics$date >= as.Date("2000-01-01"), ]

ggplot(economics_small, aes(x = date, y = unemploy)) +
  geom_line()

For multiple series, your data should be in a long format with a column indicating the series. Then map that column to color and group.
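If your data arrives in wide format (one column per series), it needs reshaping first. The following is one possible base R sketch; df_wide, series_cols, and df_from_wide are hypothetical names, and packages such as tidyr offer dedicated functions for the same job.

```r
# Hypothetical wide data: one column per series
df_wide <- data.frame(
  date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  A = c(10, 12, 11),
  B = c(7, 9, 10)
)

# Stack the series columns into a single value column, with a label column
series_cols <- c("A", "B")
df_from_wide <- data.frame(
  date = rep(df_wide$date, times = length(series_cols)),
  series = rep(series_cols, each = nrow(df_wide)),
  value = unlist(df_wide[series_cols], use.names = FALSE)
)

df_from_wide
```

The result has one row per date and series, which is the shape that the color and group mappings expect.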
# Minimal example of long data for multiple series
df_long <- data.frame(
  date = rep(as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")), 2),
  series = rep(c("A", "B"), each = 3),
  value = c(10, 12, 11, 7, 9, 10)
)

ggplot(df_long, aes(date, value, color = series, group = series)) +
  geom_line(linewidth = 1)

Encoding categories and groups effectively
Choose the right channel: color, shape, linetype, and facets
- Use position (separate panels or separate x positions) when exact comparisons matter.
- Use color for a small number of categories; avoid relying on color alone when categories are many.
- Use shape for points when printing in grayscale, but keep categories limited (shapes are hard to distinguish beyond 5–6).
- Use linetype for multiple lines when color is not available.
- Use facets when overlays become cluttered.
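The linetype option from the list above might be sketched as follows. It assumes the economics_long dataset that ships with ggplot2 and picks two of its series (psavert and uempmed) only because their values are on a similar scale; two_series and p_linetype are illustrative names.

```r
library(ggplot2)

# Keep two series and distinguish them by linetype instead of colour
two_series <- subset(economics_long, variable %in% c("psavert", "uempmed"))

p_linetype <- ggplot(
  two_series,
  aes(date, value, linetype = variable, group = variable)
) +
  geom_line() +
  labs(linetype = "Series")

p_linetype
```

This version stays readable in grayscale, though linetypes become hard to tell apart beyond a few series.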
# Color + shape for a small number of groups
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species, shape = Species)) +
  geom_point(size = 2, alpha = 0.8)

Ordering categories for readability
When categories have a natural order (or you want to rank them), reorder the x-axis so the plot reads clearly. A common approach is to reorder by a summary statistic such as the median.
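A compact way to do this is base R's reorder(), which orders factor levels by a summary of another variable. This sketch assumes the built-in mtcars data and works on a copy named cars (an illustrative name) so the original data frame is untouched.

```r
library(ggplot2)

# Work on a copy so mtcars itself is not modified
cars <- mtcars

# Order the cylinder levels by the median mpg within each level (ascending)
cars$cyl_ordered <- reorder(factor(cars$cyl), cars$mpg, FUN = median)
levels(cars$cyl_ordered)

ggplot(cars, aes(cyl_ordered, mpg)) +
  geom_boxplot() +
  labs(x = "Cylinders (ordered by median mpg)")
```

Wrapping the summary function to return its negative, or sorting in descending order, would reverse the ranking.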
# Reorder cylinders by median mpg (base R approach without extra packages)
mtcars$cyl_f <- factor(mtcars$cyl)
med_by_cyl <- tapply(mtcars$mpg, mtcars$cyl_f, median)
mtcars$cyl_f <- factor(mtcars$cyl_f, levels = names(sort(med_by_cyl)))

ggplot(mtcars, aes(cyl_f, mpg)) +
  geom_boxplot()

Scales, labels, and themes for clarity
Labels: make the plot self-explanatory
Use labs() to set a clear title, subtitle (optional), axis labels, and legend titles. Prefer specific units and meaningful names.
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Fuel efficiency decreases as weight increases",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon (mpg)",
    color = "Cylinders"
  )

Scales: control breaks, limits, and transformations
Scales control how data values map to visual values. Use them to set axis limits, tick marks, and transformations. Avoid trimming data with xlim()/ylim() if you want statistics (like smoothing) computed on the full data; instead, use coord_cartesian() to zoom.
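The difference between the two approaches can be made concrete with a side-by-side sketch. It assumes the built-in mtcars data, an arbitrary window of 2 to 4 on the weight axis, and a linear fit for simplicity; the object names are illustrative.

```r
library(ggplot2)

# Cars whose weight falls outside the window of interest
outside <- mtcars$wt < 2 | mtcars$wt > 4
sum(outside)

# xlim() sets scale limits: the cars counted above are dropped before the fit
p_limited <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  xlim(2, 4)

# coord_cartesian() only zooms the view: the fit still uses every car
p_zoomed <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  coord_cartesian(xlim = c(2, 4))
```

Printing p_limited produces warnings about removed rows, which is a useful signal that the statistics no longer describe the full data.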
# Zoom without dropping data from computations
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  coord_cartesian(xlim = c(1.5, 5.0), ylim = c(10, 35))

# Control axis tick marks
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  scale_x_continuous(breaks = seq(1.5, 5.5, by = 1))

Use log scales when values span orders of magnitude, but label clearly so readers understand the transformation.
# Example pattern (use when appropriate for your data)
ggplot(economics_small, aes(date, uempmed)) +
  geom_line() +
  scale_y_log10()

Color choices: readable and accessible
Prefer palettes that remain distinguishable for color-vision deficiencies and that print well. For categorical data, use discrete palettes; for continuous data, use perceptually uniform gradients.
# Discrete palette (built-in)
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  scale_color_brewer(palette = "Dark2")

# Continuous palette (built-in)
ggplot(mtcars, aes(wt, mpg, color = hp)) +
  geom_point(size = 2) +
  scale_color_viridis_c()

Use alpha to reduce overplotting, and avoid using too many saturated colors at once.
Themes: reduce clutter and standardize style
Themes control non-data ink: background, grid lines, fonts, and spacing. Start with a clean theme and then adjust a few elements consistently across figures.
base_theme <- theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    legend.position = "right"
  )

ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(alpha = 0.8) +
  base_theme

Use theme() to improve readability: rotate crowded axis labels, increase title size, or move the legend.
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  base_theme +
  theme(
    axis.text.x = element_text(angle = 0, vjust = 0.5),
    plot.title = element_text(face = "bold")
  ) +
  labs(title = "MPG by cylinder count", x = "Cylinders", y = "MPG")

Faceting for comparisons
Faceting creates small multiples: the same plot repeated for subsets of the data. This is often clearer than overlaying many groups in one panel.
Facet by one variable
ggplot(mtcars, aes(wt, mpg)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ cyl) +
  base_theme

Facet by two variables
Use facet_grid() when you want a grid defined by two variables (rows and columns). This is most useful when both variables have a small number of levels.
# Example with iris: rows by Species, columns by a derived category
iris$SepalWide <- ifelse(iris$Sepal.Width >= median(iris$Sepal.Width), "Wide", "Narrow")

ggplot(iris, aes(Sepal.Length, Petal.Length)) +
  geom_point(alpha = 0.7) +
  facet_grid(Species ~ SepalWide) +
  base_theme

Hands-on: build a visualization set (3 plots) and refine them
In this exercise you will create three plots that answer specific questions. Use the same theme, consistent labels, and export settings so the figures look like a coherent set.
Setup: define a shared style
library(ggplot2)

base_theme <- theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    plot.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

Plot 1 (distribution question): How is fuel efficiency distributed?
Question: Is mpg roughly symmetric, skewed, or multi-modal?
Step-by-step: start with a histogram, choose bins, then refine labels and styling.
p1 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 12, color = "white", fill = "steelblue") +
  labs(
    title = "Distribution of fuel efficiency",
    x = "Miles per gallon (mpg)",
    y = "Number of cars"
  ) +
  base_theme

p1

Plot 2 (group comparison question): Do cylinder groups differ in mpg?
Question: Which cylinder group tends to have higher mpg, and how variable is it?
Step-by-step: boxplot for group comparison, then add jitter to show sample size and spread.
p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  geom_jitter(width = 0.15, alpha = 0.6, size = 1.8) +
  scale_fill_brewer(palette = "Dark2") +
  labs(
    title = "MPG differs by cylinder count",
    x = "Cylinders",
    y = "Miles per gallon (mpg)",
    fill = "Cylinders"
  ) +
  base_theme +
  theme(legend.position = "none")

p2

Plot 3 (relationship question): How does weight relate to mpg, and does it differ by cylinders?
Question: Is there a negative relationship between wt and mpg, and do groups behave differently?
Step-by-step: scatter plot, encode groups by color, add a trend line, then facet if overlays feel crowded.
p3 <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(alpha = 0.75, size = 2) +
  geom_smooth(se = FALSE, linewidth = 0.9) +
  scale_color_brewer(palette = "Dark2") +
  labs(
    title = "Heavier cars tend to have lower MPG",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon (mpg)",
    color = "Cylinders"
  ) +
  base_theme

p3

If the colored overlays are hard to read, facet by cylinder count and remove the legend.
p3_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.75, size = 2, color = "steelblue") +
  geom_smooth(se = FALSE, linewidth = 0.9, color = "gray30") +
  facet_wrap(~ cyl) +
  labs(
    title = "Weight vs MPG by cylinder group",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon (mpg)"
  ) +
  base_theme

p3_facet

Refinement checklist (apply to all three plots)
- Does the title state the main takeaway rather than restating the axes?
- Are axis labels specific and do they include units where relevant?
- Is the legend necessary, and is its title meaningful?
- Are colors distinguishable and not overly saturated?
- Is there overplotting (if yes, use alpha, jitter, or faceting)?
- Are scales appropriate (consider coord_cartesian() to zoom, or a transformation if needed)?
- Is the theme consistent across figures (fonts, grid lines, spacing)?
Export figures with consistent sizing and resolution
Use ggsave() to export plots. Decide on a consistent width and height (in inches) and a resolution (dpi). For reports and slides, 300 dpi is a common choice for raster formats like PNG. For vector output (PDF/SVG), dpi is less relevant, but sizing still matters.
Export PNG with consistent dimensions
# Create an output folder if needed
dir.create("figures", showWarnings = FALSE)

ggsave(filename = "figures/plot1_mpg_hist.png", plot = p1,
       width = 7, height = 4.5, units = "in", dpi = 300)
ggsave(filename = "figures/plot2_mpg_by_cyl.png", plot = p2,
       width = 7, height = 4.5, units = "in", dpi = 300)
ggsave(filename = "figures/plot3_wt_vs_mpg.png", plot = p3,
       width = 7, height = 4.5, units = "in", dpi = 300)

Export vector formats for crisp text
ggsave(filename = "figures/plot3_wt_vs_mpg.pdf", plot = p3,
       width = 7, height = 4.5, units = "in")

Keep the same size across plots so they align neatly in documents. If you later change base_size in your theme, re-export all figures to maintain consistent typography.
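Re-exporting is easier when the settings live in one place. The helper below is a hypothetical sketch, not part of ggplot2: save_figure_set() and its arguments are names invented here, and the stand-in plots and temporary output folder are used only to keep the example self-contained.

```r
library(ggplot2)

# Hypothetical helper: save every plot in a named list with identical settings.
# File names are taken from the list names.
save_figure_set <- function(plots, dir = "figures",
                            width = 7, height = 4.5, dpi = 300) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  paths <- file.path(dir, paste0(names(plots), ".png"))
  for (i in seq_along(plots)) {
    ggsave(filename = paths[i], plot = plots[[i]],
           width = width, height = height, units = "in", dpi = dpi)
  }
  invisible(paths)
}

# Stand-in plots; in the exercise these would be p1, p2, and p3
demo_plots <- list(
  demo_hist = ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 12),
  demo_scatter = ggplot(mtcars, aes(wt, mpg)) + geom_point()
)

# Written to a temporary folder here; use dir = "figures" in a real project
out_paths <- save_figure_set(demo_plots, dir = file.path(tempdir(), "figures"))
```

Changing width, height, or dpi in one call then updates the whole figure set at once.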