Tidy transformation mindset with dplyr
Data transformation is the step where you turn raw tables into an analysis-ready dataset: the right rows, the right columns, consistent types, and clearly defined metrics. The dplyr package supports this with a small set of verbs that you combine into readable pipelines using the pipe operator |>. Each verb does one job well, and the pipeline reads top-to-bottom like a recipe.
In this chapter you will learn the core verbs through progressively more complex tasks: filter(), select(), mutate(), arrange(), group_by(), and summarise(). You will also practice common patterns: creating new variables, conditional logic with if_else() and case_when(), handling missing values, and computing grouped metrics.
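To see what "reads top-to-bottom like a recipe" means in practice, here is a minimal two-verb pipeline using the built-in mtcars dataset (purely illustrative; the chapter's own examples use a sales table defined below):

```r
library(dplyr)

# Each verb does one job; the pipeline reads top to bottom:
# keep 4-cylinder cars, then reduce them to one summary row.
mtcars |>
  filter(cyl == 4) |>
  summarise(avg_mpg = mean(mpg))
```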
```r
library(dplyr)
```

Core idea: write transformations as a pipeline
A good dplyr pipeline has these traits:
- Each step is small and named by the verb you use.
- Intermediate objects are avoided unless they improve clarity.
- Columns are created with explicit names and consistent units.
- Checks are built in: counts, summaries, and spot validation.
General pattern:
```r
analysis_table <- raw_table |>
  filter(...) |>
  select(...) |>
  mutate(...) |>
  group_by(...) |>
  summarise(...)
```

Task 1: Keep the right rows with filter()
filter() keeps rows that match conditions. Conditions are combined with & (and), | (or), and ! (not). Missing values (NA) require special handling because comparisons with NA return NA, not TRUE/FALSE.
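A small sketch of why NA needs special handling: a comparison involving NA evaluates to NA, and filter() keeps only rows where the condition is TRUE, so NA rows are silently dropped unless you test for them explicitly:

```r
library(dplyr)

NA == "South"   # NA, not FALSE: the comparison result is unknown

df <- tibble::tibble(region = c("North", NA))

df |> filter(region == "North")                   # 1 row; the NA row is dropped
df |> filter(is.na(region) | region == "North")   # 2 rows; NA kept explicitly
```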
Example dataset for practice
```r
sales_raw <- tibble::tibble(
  order_id    = c(1, 2, 3, 4, 5, 6),
  customer_id = c("C01", "C02", "C01", "C03", "C04", "C02"),
  region      = c("North", "South", "North", "West", "South", NA),
  order_date  = as.Date(c("2025-01-02", "2025-01-05", "2025-01-05",
                          "2025-02-01", "2025-02-03", "2025-02-03")),
  product     = c("A", "B", "A", "C", "B", "B"),
  qty         = c(2, 1, 3, 1, NA, 4),
  unit_price  = c(10, 20, 10, 15, 20, 20),
  status      = c("paid", "paid", "refunded", "paid", "paid", "paid")
)
```

Step-by-step filtering patterns
1) Filter by a simple condition:
```r
sales_raw |> filter(status == "paid")
```

2) Filter by multiple conditions:

```r
sales_raw |> filter(status == "paid", region == "South")
```

3) Filter by date range:

```r
sales_raw |>
  filter(order_date >= as.Date("2025-02-01"),
         order_date <= as.Date("2025-02-28"))
```

4) Filter while handling missing values explicitly:

```r
sales_raw |> filter(!is.na(region))
```

5) Keep rows where a value is in a set:

```r
sales_raw |> filter(product %in% c("A", "B"))
```

Task 2: Keep and rename columns with select()
select() chooses columns. It is also commonly used to reorder columns and to rename them for clarity. Selecting early in a pipeline reduces noise and makes later steps easier to read.
Common select patterns
1) Select a specific set of columns:
```r
sales_raw |> select(order_id, customer_id, order_date, product, qty, unit_price)
```

2) Reorder columns (same as selecting in the desired order):

```r
sales_raw |> select(order_date, order_id, customer_id, region, product, qty, unit_price, status)
```

3) Rename while selecting:

```r
sales_raw |> select(order_id, customer_id, date = order_date, area = region)
```

4) Select with helper functions (useful on wide tables):

```r
sales_raw |> select(starts_with("unit"), ends_with("id"))
```

Task 3: Create new variables with mutate()
mutate() adds new columns or modifies existing ones. New columns can reference columns created earlier in the same mutate() call, which helps you build calculations in a readable sequence.
Compute derived measures
Compute revenue and a cleaned quantity:
```r
sales_raw |>
  mutate(
    qty_clean = if_else(is.na(qty), 0, qty),
    revenue = qty_clean * unit_price
  )
```

Notes:

- if_else() is strict about types: both outcomes must be the same type (numeric with numeric, character with character).
- Replacing missing quantities with 0 is a business rule; choose rules that match your context.
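A quick sketch of the type strictness, using a toy vector:

```r
library(dplyr)

x <- c(1, NA, 3)

# This would error: the two outcomes have different types
# (logical TRUE vs the numeric vector x).
# if_else(is.na(x), TRUE, x)

# Keep both outcomes the same type:
if_else(is.na(x), 0, x)                          # numeric with numeric
if_else(is.na(x), "missing", as.character(x))    # character with character
```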
Conditional logic with if_else()
Create a flag for high-value orders:
```r
sales_raw |>
  mutate(
    qty_clean = if_else(is.na(qty), 0, qty),
    revenue = qty_clean * unit_price,
    high_value = if_else(revenue >= 50, TRUE, FALSE)  # equivalently: revenue >= 50
  )
```

Multi-branch logic with case_when()
case_when() is ideal when you have multiple categories. The first matching condition wins, so order your rules carefully.
```r
sales_raw |>
  mutate(
    qty_clean = if_else(is.na(qty), 0, qty),
    revenue = qty_clean * unit_price,
    revenue_band = case_when(
      revenue == 0 ~ "zero",
      revenue < 30 ~ "low",
      revenue < 60 ~ "medium",
      TRUE ~ "high"
    )
  )
```

Handling missing values in mutate()
Two common patterns are: replacing missing values and creating explicit missing categories.
```r
sales_raw |>
  mutate(
    region_clean = if_else(is.na(region), "Unknown", region),
    qty_clean = if_else(is.na(qty), 0, qty)
  )
```

If you need to keep missing as missing but still compute safely, use explicit logic:
```r
sales_raw |>
  mutate(
    # multiplication would propagate NA anyway; the if_else makes the rule explicit
    revenue = if_else(is.na(qty), NA_real_, qty * unit_price)
  )
```

Task 4: Sort rows with arrange()
arrange() sorts rows. Use it for inspection, reporting, and spot checks. Sorting does not change the data values, only their order.
Sort by one or multiple columns
```r
sales_raw |> arrange(order_date, order_id)
```

Descending order uses desc():
```r
sales_raw |>
  mutate(qty_clean = if_else(is.na(qty), 0, qty),
         revenue = qty_clean * unit_price) |>
  arrange(desc(revenue))
```

Task 5: Compute grouped metrics with group_by() and summarise()
group_by() changes the unit of analysis. After grouping, summarise() reduces each group to one row of metrics. This is how you build totals, averages, counts, and rates by region, product, customer, or time period.
Basic grouped summary
```r
sales_raw |>
  mutate(qty_clean = if_else(is.na(qty), 0, qty),
         revenue = qty_clean * unit_price) |>
  group_by(region) |>
  summarise(
    orders = n(),
    total_qty = sum(qty_clean),
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    .groups = "drop"
  )
```

Key functions used in summaries:

- n(): number of rows in the group.
- sum(), mean(): numeric aggregation (watch out for NA).
- n_distinct(x): number of unique values (useful for customers, products).
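The three helpers side by side, on a hypothetical mini-table (the names here are illustrative, not part of the chapter's sales data):

```r
library(dplyr)

orders <- tibble::tibble(
  customer_id = c("C01", "C02", "C01", "C03"),
  amount      = c(10, NA, 30, 15)
)

orders |>
  summarise(
    rows      = n(),                        # 4 rows in total
    customers = n_distinct(customer_id),    # 3 unique customers
    total     = sum(amount, na.rm = TRUE)   # 55, ignoring the NA
  )
```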
Missing values in summaries
Many summary functions accept na.rm = TRUE to ignore missing values:
```r
sales_raw |>
  group_by(product) |>
  summarise(
    avg_qty = mean(qty, na.rm = TRUE),
    missing_qty = sum(is.na(qty)),
    .groups = "drop"
  )
```

Multiple grouping variables
Group by region and product to get a small cube of metrics:
```r
sales_raw |>
  mutate(qty_clean = if_else(is.na(qty), 0, qty),
         revenue = qty_clean * unit_price) |>
  group_by(region, product) |>
  summarise(
    orders = n(),
    total_revenue = sum(revenue),
    customers = n_distinct(customer_id),
    .groups = "drop"
  ) |>
  arrange(desc(total_revenue))
```

Readable sequences: building transformations in layers
As tasks grow, pipelines can become long. A practical approach is to structure them in layers: first clean and standardize, then filter to the analysis scope, then compute derived variables, then summarise. Keep each layer focused.
```r
sales_clean <- sales_raw |>
  mutate(
    region = if_else(is.na(region), "Unknown", region),
    qty = if_else(is.na(qty), 0, qty),
    revenue = qty * unit_price
  ) |>
  filter(status == "paid") |>
  select(order_id, customer_id, order_date, region, product, qty, unit_price, revenue)
```

At this point, sales_clean is a good base table for analysis because it has consistent fields and a clear definition of revenue.
Transformation project: build an analysis-ready table from raw input
Goal: create a customer-by-month table with paid-order metrics, consistent handling of missing values, and clear validation checks.
Step 1: Start from raw and standardize fields
We will define business rules explicitly:
- Only status == "paid" counts toward revenue metrics.
- Missing region becomes "Unknown".
- Missing qty becomes 0 for paid orders (so revenue becomes 0).
- Create a month key for grouping.
```r
sales_base <- sales_raw |>
  filter(status == "paid") |>
  mutate(
    region = if_else(is.na(region), "Unknown", region),
    qty = if_else(is.na(qty), 0, qty),
    revenue = qty * unit_price,
    month = as.Date(format(order_date, "%Y-%m-01"))
  ) |>
  select(order_id, customer_id, region, order_date, month, product, qty, unit_price, revenue)
```

Step 2: Add useful categorical variables with case_when()
Example: tag orders by size based on revenue.
```r
sales_enriched <- sales_base |>
  mutate(
    order_size = case_when(
      revenue == 0 ~ "zero",
      revenue < 30 ~ "small",
      revenue < 60 ~ "medium",
      TRUE ~ "large"
    )
  )
```

Step 3: Build the analysis-ready customer-month table
This table has one row per customer per month, with metrics ready for plotting or modeling.
```r
customer_month <- sales_enriched |>
  group_by(customer_id, month) |>
  summarise(
    orders = n(),
    total_qty = sum(qty),
    total_revenue = sum(revenue),
    avg_order_revenue = mean(revenue),
    regions = n_distinct(region),
    top_size = dplyr::first(order_size[order(revenue, decreasing = TRUE)]),
    .groups = "drop"
  ) |>
  arrange(customer_id, month)
```

Note on top_size: it selects the size label of the highest-revenue order within each customer-month. This is a simple way to create a descriptive feature from row-level data.
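The top_size idiom is easiest to see on toy vectors standing in for one customer-month group:

```r
# Toy vectors standing in for one group's rows
revenue    <- c(20, 60, 35)
order_size <- c("small", "large", "medium")

# order(revenue, decreasing = TRUE) returns row positions sorted by revenue
# (here 2, 3, 1), so indexing order_size by it puts the label of the
# highest-revenue order first, and first() picks it out.
dplyr::first(order_size[order(revenue, decreasing = TRUE)])   # "large"
```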
Step 4: Verify the result with targeted checks
Checks should confirm that the transformation did what you intended and did not silently drop or duplicate data.
Check A: row counts and uniqueness
Confirm that customer_month has unique keys (customer_id, month):
```r
customer_month |>
  summarise(
    rows = n(),
    unique_keys = n_distinct(paste(customer_id, month))
  )
```

If rows and unique_keys differ, you have duplicate keys and should revisit the grouping logic.
Check B: reconcile totals back to the base table
Total revenue in the aggregated table should match total revenue in the cleaned base table (within the same scope):
```r
sales_enriched |> summarise(total_revenue_base = sum(revenue))
customer_month |> summarise(total_revenue_agg = sum(total_revenue))
```

Check C: missingness and value ranges
Look for unexpected missing values or negative numbers:
```r
customer_month |>
  summarise(
    missing_total_revenue = sum(is.na(total_revenue)),
    min_total_revenue = min(total_revenue, na.rm = TRUE),
    max_total_revenue = max(total_revenue, na.rm = TRUE)
  )
```

Check D: spot validation on a single customer-month
Pick one customer-month and compare the raw rows to the aggregated metrics. This is one of the fastest ways to catch logic errors.
```r
one_key <- sales_enriched |>
  filter(customer_id == "C02", month == as.Date("2025-02-01")) |>
  arrange(order_date, order_id)

one_key

one_key |>
  summarise(
    orders = n(),
    total_qty = sum(qty),
    total_revenue = sum(revenue),
    avg_order_revenue = mean(revenue)
  )
```

Compare these values to the corresponding row in customer_month for the same customer and month:
```r
customer_month |>
  filter(customer_id == "C02", month == as.Date("2025-02-01"))
```