Tidy transformation mindset with dplyr
Data transformation is the step where you turn raw tables into an analysis-ready dataset: the right rows, the right columns, consistent types, and clearly defined metrics. The dplyr package supports this with a small set of verbs that you combine into readable pipelines using the pipe operator |>. Each verb does one job well, and the pipeline reads top-to-bottom like a recipe.
In this chapter you will learn the core verbs through progressively more complex tasks: filter(), select(), mutate(), arrange(), group_by(), and summarise(). You will also practice common patterns: creating new variables, conditional logic with if_else() and case_when(), handling missing values, and computing grouped metrics.
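To see what "reads top-to-bottom like a recipe" means in practice, here is a minimal two-verb pipeline using the built-in mtcars dataset (purely illustrative; the chapter's own examples use a sales table defined below):

```r
library(dplyr)

# Each verb does one job; the pipeline reads top to bottom:
# keep 4-cylinder cars, then reduce them to one summary row.
mtcars |>
  filter(cyl == 4) |>
  summarise(avg_mpg = mean(mpg))
```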
```r
library(dplyr)
```

Core idea: write transformations as a pipeline
A good dplyr pipeline has these traits:
- Each step is small and named by the verb you use.
- Intermediate objects are avoided unless they improve clarity.
- Columns are created with explicit names and consistent units.
- Checks are built in: counts, summaries, and spot validation.
General pattern:
```r
analysis_table <- raw_table |>
  filter(...) |>
  select(...) |>
  mutate(...) |>
  group_by(...) |>
  summarise(...)
```

Task 1: Keep the right rows with filter()
filter() keeps rows that match conditions. Conditions are combined with & (and), | (or), and ! (not). Missing values (NA) require special handling because comparisons with NA return NA, not TRUE/FALSE.
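A small sketch of why NA needs special handling: a comparison involving NA evaluates to NA, and filter() keeps only rows where the condition is TRUE, so NA rows are silently dropped unless you test for them explicitly:

```r
library(dplyr)

NA == "South"   # NA, not FALSE: the comparison result is unknown

df <- tibble::tibble(region = c("North", NA))

df |> filter(region == "North")                   # 1 row; the NA row is dropped
df |> filter(is.na(region) | region == "North")   # 2 rows; NA kept explicitly
```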
Example dataset for practice
```r
sales_raw <- tibble::tibble(
  order_id    = c(1, 2, 3, 4, 5, 6),
  customer_id = c("C01", "C02", "C01", "C03", "C04", "C02"),
  region      = c("North", "South", "North", "West", "South", NA),
  order_date  = as.Date(c("2025-01-02", "2025-01-05", "2025-01-05",
                          "2025-02-01", "2025-02-03", "2025-02-03")),
  product     = c("A", "B", "A", "C", "B", "B"),
  qty         = c(2, 1, 3, 1, NA, 4),
  unit_price  = c(10, 20, 10, 15, 20, 20),
  status      = c("paid", "paid", "refunded", "paid", "paid", "paid")
)
```

Step-by-step filtering patterns
1) Filter by a simple condition:
```r
sales_raw |> filter(status == "paid")
```

2) Filter by multiple conditions:

```r
sales_raw |> filter(status == "paid", region == "South")
```

3) Filter by date range:

```r
sales_raw |>
  filter(order_date >= as.Date("2025-02-01"),
         order_date <= as.Date("2025-02-28"))
```

4) Filter while handling missing values explicitly:

```r
sales_raw |> filter(!is.na(region))
```

5) Keep rows where a value is in a set:

```r
sales_raw |> filter(product %in% c("A", "B"))
```

Task 2: Keep and rename columns with select()
select() chooses columns. It is also commonly used to reorder columns and to rename them for clarity. Selecting early in a pipeline reduces noise and makes later steps easier to read.
Common select patterns
1) Select a specific set of columns:
```r
sales_raw |> select(order_id, customer_id, order_date, product, qty, unit_price)
```

2) Reorder columns (same as selecting in the desired order):

```r
sales_raw |> select(order_date, order_id, customer_id, region, product, qty, unit_price, status)
```

3) Rename while selecting:

```r
sales_raw |> select(order_id, customer_id, date = order_date, area = region)
```

4) Select with helper functions (useful on wide tables):

```r
sales_raw |> select(starts_with("unit"), ends_with("id"))
```

Task 3: Create new variables with mutate()
mutate() adds new columns or modifies existing ones. New columns can reference columns created earlier in the same mutate() call, which helps you build calculations in a readable sequence.
Compute derived measures
Compute revenue and a cleaned quantity:
```r
sales_raw |>
  mutate(
    qty_clean = if_else(is.na(qty), 0, qty),
    revenue = qty_clean * unit_price
  )
```

Notes:

- if_else() is strict about types: both outcomes must be the same type (numeric with numeric, character with character).
- Replacing missing quantities with 0 is a business rule; choose rules that match your context.
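A quick sketch of the type strictness, using a toy vector:

```r
library(dplyr)

x <- c(1, NA, 3)

# This would error: the two outcomes have different types
# (logical TRUE vs the numeric vector x).
# if_else(is.na(x), TRUE, x)

# Keep both outcomes the same type:
if_else(is.na(x), 0, x)                          # numeric with numeric
if_else(is.na(x), "missing", as.character(x))    # character with character
```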
Conditional logic with if_else()
Create a flag for high-value orders:
```r
sales_raw |>
  mutate(
    qty_clean = if_else(is.na(qty), 0, qty),
    revenue = qty_clean * unit_price,
    high_value = if_else(revenue >= 50, TRUE, FALSE)  # equivalently: revenue >= 50
  )
```

Multi-branch logic with case_when()
case_when() is ideal when you have multiple categories. The first matching condition wins, so order your rules carefully.
```r
sales_raw |>
  mutate(
    qty_clean = if_else(is.na(qty), 0, qty),
    revenue = qty_clean * unit_price,
    revenue_band = case_when(
      revenue == 0 ~ "zero",
      revenue < 30 ~ "low",
      revenue < 60 ~ "medium",
      TRUE ~ "high"
    )
  )
```

Handling missing values in mutate()
Two common patterns are: replacing missing values and creating explicit missing categories.
```r
sales_raw |>
  mutate(
    region_clean = if_else(is.na(region), "Unknown", region),
    qty_clean = if_else(is.na(qty), 0, qty)
  )
```

If you need to keep missing as missing but still compute safely, use explicit logic:
```r
sales_raw |>
  mutate(
    # multiplication would propagate NA anyway; the if_else makes the rule explicit
    revenue = if_else(is.na(qty), NA_real_, qty * unit_price)
  )
```

Task 4: Sort rows with arrange()
arrange() sorts rows. Use it for inspection, reporting, and spot checks. Sorting does not change the data values, only their order.
Sort by one or multiple columns
```r
sales_raw |> arrange(order_date, order_id)
```

Descending order uses desc():
```r
sales_raw |>
  mutate(qty_clean = if_else(is.na(qty), 0, qty),
         revenue = qty_clean * unit_price) |>
  arrange(desc(revenue))
```

Task 5: Compute grouped metrics with group_by() and summarise()
group_by() changes the unit of analysis. After grouping, summarise() reduces each group to one row of metrics. This is how you build totals, averages, counts, and rates by region, product, customer, or time period.
Basic grouped summary
```r
sales_raw |>
  mutate(qty_clean = if_else(is.na(qty), 0, qty),
         revenue = qty_clean * unit_price) |>
  group_by(region) |>
  summarise(
    orders = n(),
    total_qty = sum(qty_clean),
    total_revenue = sum(revenue),
    avg_revenue = mean(revenue),
    .groups = "drop"
  )
```

Key functions used in summaries:

- n(): number of rows in the group.
- sum(), mean(): numeric aggregation (watch out for NA).
- n_distinct(x): number of unique values (useful for customers, products).
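The three helpers side by side, on a hypothetical mini-table (the names here are illustrative, not part of the chapter's sales data):

```r
library(dplyr)

orders <- tibble::tibble(
  customer_id = c("C01", "C02", "C01", "C03"),
  amount      = c(10, NA, 30, 15)
)

orders |>
  summarise(
    rows      = n(),                        # 4 rows in total
    customers = n_distinct(customer_id),    # 3 unique customers
    total     = sum(amount, na.rm = TRUE)   # 55, ignoring the NA
  )
```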
Missing values in summaries
Many summary functions accept na.rm = TRUE to ignore missing values:
```r
sales_raw |>
  group_by(product) |>
  summarise(
    avg_qty = mean(qty, na.rm = TRUE),
    missing_qty = sum(is.na(qty)),
    .groups = "drop"
  )
```

Multiple grouping variables
Group by region and product to get a small cube of metrics:
```r
sales_raw |>
  mutate(qty_clean = if_else(is.na(qty), 0, qty),
         revenue = qty_clean * unit_price) |>
  group_by(region, product) |>
  summarise(
    orders = n(),
    total_revenue = sum(revenue),
    customers = n_distinct(customer_id),
    .groups = "drop"
  ) |>
  arrange(desc(total_revenue))
```

Readable sequences: building transformations in layers
As tasks grow, pipelines can become long. A practical approach is to structure them in layers: first clean and standardize, then filter to the analysis scope, then compute derived variables, then summarise. Keep each layer focused.
```r
sales_clean <- sales_raw |>
  mutate(
    region = if_else(is.na(region), "Unknown", region),
    qty = if_else(is.na(qty), 0, qty),
    revenue = qty * unit_price
  ) |>
  filter(status == "paid") |>
  select(order_id, customer_id, order_date, region, product, qty, unit_price, revenue)
```

At this point, sales_clean is a good base table for analysis because it has consistent fields and a clear definition of revenue.
Transformation project: build an analysis-ready table from raw input
Goal: create a customer-by-month table with paid-order metrics, consistent handling of missing values, and clear validation checks.
Step 1: Start from raw and standardize fields
We will define business rules explicitly:
- Only status == "paid" counts toward revenue metrics.
- Missing region becomes "Unknown".
- Missing qty becomes 0 for paid orders (so revenue becomes 0).
- Create a month key for grouping.
```r
sales_base <- sales_raw |>
  filter(status == "paid") |>
  mutate(
    region = if_else(is.na(region), "Unknown", region),
    qty = if_else(is.na(qty), 0, qty),
    revenue = qty * unit_price,
    month = as.Date(format(order_date, "%Y-%m-01"))
  ) |>
  select(order_id, customer_id, region, order_date, month, product, qty, unit_price, revenue)
```

Step 2: Add useful categorical variables with case_when()
Example: tag orders by size based on revenue.
```r
sales_enriched <- sales_base |>
  mutate(
    order_size = case_when(
      revenue == 0 ~ "zero",
      revenue < 30 ~ "small",
      revenue < 60 ~ "medium",
      TRUE ~ "large"
    )
  )
```

Step 3: Build the analysis-ready customer-month table
This table has one row per customer per month, with metrics ready for plotting or modeling.
```r
customer_month <- sales_enriched |>
  group_by(customer_id, month) |>
  summarise(
    orders = n(),
    total_qty = sum(qty),
    total_revenue = sum(revenue),
    avg_order_revenue = mean(revenue),
    regions = n_distinct(region),
    top_size = dplyr::first(order_size[order(revenue, decreasing = TRUE)]),
    .groups = "drop"
  ) |>
  arrange(customer_id, month)
```

Note on top_size: it selects the size label of the highest-revenue order within each customer-month. This is a simple way to create a descriptive feature from row-level data.
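The top_size idiom is easiest to see on toy vectors standing in for one customer-month group:

```r
# Toy vectors standing in for one group's rows
revenue    <- c(20, 60, 35)
order_size <- c("small", "large", "medium")

# order(revenue, decreasing = TRUE) returns row positions sorted by revenue
# (here 2, 3, 1), so indexing order_size by it puts the label of the
# highest-revenue order first, and first() picks it out.
dplyr::first(order_size[order(revenue, decreasing = TRUE)])   # "large"
```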
Step 4: Verify the result with targeted checks
Checks should confirm that the transformation did what you intended and did not silently drop or duplicate data.
Check A: row counts and uniqueness
Confirm that customer_month has unique keys (customer_id, month):
```r
customer_month |>
  summarise(
    rows = n(),
    unique_keys = n_distinct(paste(customer_id, month))
  )
```

If rows and unique_keys differ, you have duplicate keys and should revisit the grouping logic.
Check B: reconcile totals back to the base table
Total revenue in the aggregated table should match total revenue in the cleaned base table (within the same scope):
```r
sales_enriched |> summarise(total_revenue_base = sum(revenue))
customer_month |> summarise(total_revenue_agg = sum(total_revenue))
```

Check C: missingness and value ranges
Look for unexpected missing values or negative numbers:
```r
customer_month |>
  summarise(
    missing_total_revenue = sum(is.na(total_revenue)),
    min_total_revenue = min(total_revenue, na.rm = TRUE),
    max_total_revenue = max(total_revenue, na.rm = TRUE)
  )
```

Check D: spot validation on a single customer-month
Pick one customer-month and compare the raw rows to the aggregated metrics. This is one of the fastest ways to catch logic errors.
```r
one_key <- sales_enriched |>
  filter(customer_id == "C02", month == as.Date("2025-02-01")) |>
  arrange(order_date, order_id)

one_key

one_key |>
  summarise(
    orders = n(),
    total_qty = sum(qty),
    total_revenue = sum(revenue),
    avg_order_revenue = mean(revenue)
  )
```

Compare these values to the corresponding row in customer_month for the same customer and month:
```r
customer_month |>
  filter(customer_id == "C02", month == as.Date("2025-02-01"))
```