Why reshaping and combining matter
Real analysis rarely lives in a single table. You might receive one file with customer details, another with orders, and a third with monthly metrics. To analyze or plot, you typically need to (1) stack similar tables (row-binding), (2) attach columns side-by-side (col-binding), (3) merge related tables using keys (joins), and (4) reshape between wide and long formats (pivoting) so each variable has a clear place.
Setup: packages and example data
This chapter uses functions from dplyr and tidyr. The examples below create small tibbles you can run as-is.
library(dplyr) library(tidyr) customers <- tibble( customer_id = c(101, 102, 103, 104), name = c("Asha", "Ben", "Chen", "Dina"), city = c("Leeds", "Leeds", "York", "York") ) orders <- tibble( order_id = c("o1", "o2", "o3", "o4", "o5"), customer_id = c(101, 101, 102, 999, 104), order_date = as.Date(c("2025-01-02","2025-01-10","2025-01-11","2025-01-12","2025-01-20")), amount = c(120, 80, 50, 60, 200) )Binding data: bind_rows() and bind_cols()
bind_rows(): stack tables with the same meaning
Use bind_rows() when you have the same columns (or mostly the same) and want to append observations. Typical case: monthly extracts with identical structure.
jan <- tibble(customer_id = c(101, 102), spend = c(200, 50)) feb <- tibble(customer_id = c(101, 103), spend = c(180, 90)) spend_q1 <- bind_rows(jan, feb) spend_q1If columns differ, bind_rows() will create missing columns and fill with NA. This is useful, but it can also hide schema drift, so inspect the result.
feb_extra <- tibble(customer_id = c(104), spend = c(70), promo = c("Y")) spend_mixed <- bind_rows(jan, feb_extra) spend_mixedTo track the source table, use .id.
- Listen to the audio with the screen off.
- Earn a certificate upon completion.
- Over 5000 courses for you to explore!
Download the app
spend_tagged <- bind_rows(jan = jan, feb = feb, .id = "month") spend_taggedbind_cols(): attach columns by position
bind_cols() glues columns side-by-side based on row order, not on keys. Use it only when you are sure rows correspond exactly (same number of rows and same ordering).
a <- tibble(id = c(1, 2, 3)) b <- tibble(score = c(10, 20, 30)) bind_cols(a, b)Risk: if b is in a different order, you silently misalign data. If you need matching by an identifier, use a join instead.
Joins: combining tables by keys
A join merges tables by matching key columns (identifiers). In most analysis workflows, joins are safer than column-binding because they match explicitly rather than by row position.
Key vocabulary
- Key: column(s) used to match rows (e.g.,
customer_id). - Primary key: uniquely identifies rows in a table (should have no duplicates and no missing values).
- Foreign key: references a primary key in another table (may have duplicates; missing/unmatched values indicate data quality issues).
Join types you will use most
- Left join (
left_join()): keep all rows from the left table; bring matching columns from the right table; unmatched right-side values becomeNA. - Inner join (
inner_join()): keep only rows with matches in both tables. - Full join (
full_join()): keep all rows from both tables; unmatched values becomeNAon the other side.
Step-by-step: a simple left join
Goal: attach customer attributes (name, city) to each order.
orders_enriched <- orders %>% left_join(customers, by = "customer_id") orders_enrichedNotice what happens to customer_id == 999: it has no match in customers, so name and city become NA. That is a key quality signal, not something to ignore.
Inner join to keep only matched records
If you want to analyze only orders with known customers, use inner_join().
orders_matched <- orders %>% inner_join(customers, by = "customer_id") orders_matchedFull join to reconcile two sources
Use full_join() when you want to see everything from both tables and identify gaps.
all_links <- orders %>% full_join(customers, by = "customer_id") all_linksJoining on differently named keys
Sometimes the same identifier has different column names. Use a named vector in by.
cust_lookup <- customers %>% transmute(cust_id = customer_id, segment = c("A","B","A","C")) customers %>% left_join(cust_lookup, by = c("customer_id" = "cust_id"))Key quality checks before and after joins
Most join bugs come from key problems: missing values, duplicates where you expected uniqueness, or mismatched formats (e.g., "001" vs 1). Build quick checks into your workflow.
Check 1: missing keys
customers %>% summarise(missing_customer_id = sum(is.na(customer_id))) orders %>% summarise(missing_customer_id = sum(is.na(customer_id)))Check 2: uniqueness of a supposed primary key
If customer_id should be unique in customers, verify it.
customers %>% count(customer_id) %>% filter(n > 1)Check 3: unmatched keys (anti-joins)
Use anti_join() to find rows that will not match. This is one of the fastest ways to debug joins.
orders %>% anti_join(customers, by = "customer_id") customers %>% anti_join(orders, by = "customer_id")Check 4: join relationship expectations
Many joins should be one-to-many (one customer, many orders). If both sides have duplicates for the join key, you can accidentally create a many-to-many join that multiplies rows. Recent dplyr versions can warn about this; you should still check explicitly.
orders %>% count(customer_id) %>% filter(n > 1) customers %>% count(customer_id) %>% filter(n > 1)Handling duplicates and many-to-many joins
Example: duplicate keys on the right side
Suppose a customer table accidentally has two rows for the same customer_id. A join will duplicate matching orders.
customers_dup <- bind_rows(customers, tibble(customer_id = 101, name = "Asha", city = "Leeds")) orders %>% left_join(customers_dup, by = "customer_id") %>% count(order_id) %>% filter(n > 1)Fix options depend on your business rule:
- If duplicates are truly erroneous, remove them (after deciding which row to keep).
- If duplicates represent multiple valid records (e.g., multiple addresses), join to a more appropriate key or reshape first.
A simple de-duplication approach is to keep the first row per key (only do this if it matches your rule).
customers_dedup <- customers_dup %>% distinct(customer_id, .keep_all = TRUE) orders %>% left_join(customers_dedup, by = "customer_id")Example: duplicates on both sides (many-to-many)
Many-to-many joins can explode row counts. Here is a small illustration using a product tagging table where a product can have multiple tags and an order can contain multiple products; joining without care can multiply rows.
order_items <- tibble(order_id = c("o1","o1","o2"), product_id = c("p1","p2","p1")) product_tags <- tibble(product_id = c("p1","p1","p2"), tag = c("sale","eco","new")) order_items %>% left_join(product_tags, by = "product_id")This multiplication is expected here because each product can have multiple tags. The key is to anticipate it and decide whether you want that expanded structure (often yes, for counting tags) or whether you should aggregate first.
Suffix collisions: when both tables have the same column names
If both tables contain a column with the same name (other than the key), dplyr adds suffixes like .x and .y. You can control suffixes to keep meaning clear.
customers_status <- customers %>% mutate(status = c("active","active","inactive","active")) orders_status <- orders %>% mutate(status = c("paid","paid","refunded","paid","paid")) orders_status %>% left_join(customers_status, by = "customer_id", suffix = c("_order", "_customer"))Pivoting: switching between wide and long formats
Pivoting changes the shape of a dataset without changing its information. Many analyses and plots prefer long format (one measurement per row), while reporting tables often prefer wide format (one column per measurement type).
pivot_longer(): wide to long
Example: monthly spend stored in separate columns. This is wide format.
spend_wide <- tibble( customer_id = c(101, 102, 103), spend_jan = c(200, 50, 0), spend_feb = c(180, 0, 90) )Convert to long format with a month column and a spend column.
spend_long <- spend_wide %>% pivot_longer( cols = starts_with("spend_"), names_to = "month", values_to = "spend" ) spend_longOften you want to clean the names while pivoting (e.g., remove the spend_ prefix).
spend_long2 <- spend_wide %>% pivot_longer( cols = starts_with("spend_"), names_to = "month", names_prefix = "spend_", values_to = "spend" ) spend_long2pivot_wider(): long to wide
Convert back to wide when you need one row per customer with separate month columns.
spend_wide2 <- spend_long2 %>% pivot_wider( names_from = month, values_from = spend ) spend_wide2If there can be multiple values per key combination, you must decide how to combine them. For example, if you have multiple transactions per customer per month, you might sum them while widening.
tx <- tibble( customer_id = c(101,101,101,102), month = c("jan","jan","feb","jan"), spend = c(120, 80, 180, 50) ) tx %>% pivot_wider( names_from = month, values_from = spend, values_fn = sum, values_fill = 0 )Exercises: from simple joins to messy realities
Exercise 1: basic matching keys
Task: join customers onto orders so each order has name and city.
- Create
orders_enrichedusingleft_join(). - Count how many orders have missing
name.
orders_enriched <- orders %>% left_join(customers, by = "customer_id") orders_enriched %>% summarise(missing_name = sum(is.na(name)))Exercise 2: find unmatched records before joining
Task: identify orders whose customer_id is not present in customers.
unmatched_orders <- orders %>% anti_join(customers, by = "customer_id") unmatched_ordersExtension: decide a rule. For example, filter them out for analysis, or keep them but label them as "Unknown".
orders_labeled <- orders_enriched %>% mutate(name = if_else(is.na(name), "Unknown", name), city = if_else(is.na(city), "Unknown", city))Exercise 3: duplicates causing row multiplication
Task: simulate a duplicate customer row and observe the effect on row counts.
- Create
customers_dupwith an extra row forcustomer_id == 101. - Join to orders and compare
nrow()before and after. - Fix by de-duplicating and re-joining.
customers_dup <- bind_rows(customers, tibble(customer_id = 101, name = "Asha", city = "Leeds")) nrow(orders) nrow(orders %>% left_join(customers_dup, by = "customer_id")) customers_dedup <- customers_dup %>% distinct(customer_id, .keep_all = TRUE) nrow(orders %>% left_join(customers_dedup, by = "customer_id"))Exercise 4: suffix collisions
Task: join two tables that both contain a status column and set meaningful suffixes.
customers_status <- customers %>% mutate(status = c("active","active","inactive","active")) orders_status <- orders %>% mutate(status = c("paid","paid","refunded","paid","paid")) joined <- orders_status %>% left_join(customers_status, by = "customer_id", suffix = c("_order", "_customer")) joined %>% select(order_id, customer_id, status_order, status_customer)Exercise 5: pivot for analysis needs
Task: reshape transaction data to support plotting monthly totals per customer.
- Start from
tx(long). - Compute totals per customer per month (if needed).
- Widen to one row per customer with month columns, filling missing with 0.
tx_totals <- tx %>% group_by(customer_id, month) %>% summarise(spend = sum(spend), .groups = "drop") tx_wide <- tx_totals %>% pivot_wider(names_from = month, values_from = spend, values_fill = 0) tx_wideCase task: merge two related tables, reshape, and produce a tidy dataset
You will build a tidy dataset ready for plotting monthly revenue by city. You have two tables: orders (transactions) and customers (attributes). The final dataset should have one row per city and month with total revenue.
Step 1: check key quality and unmatched records
orders %>% anti_join(customers, by = "customer_id")Decide a rule for unmatched customers. Here we keep them but label city as "Unknown" so totals still reconcile to the original orders table.
Step 2: join and create a month field
orders_joined <- orders %>% left_join(customers, by = "customer_id") %>% mutate(city = if_else(is.na(city), "Unknown", city), month = format(order_date, "%Y-%m"))Step 3: aggregate to a tidy long dataset (best for plotting)
revenue_city_month <- orders_joined %>% group_by(city, month) %>% summarise(revenue = sum(amount), orders = n(), .groups = "drop") revenue_city_monthStep 4: optional reshape to wide for reporting
If you want a report-style table with one row per city and one column per month, widen it. This is optional; for most plotting, keep the long table from Step 3.
revenue_wide <- revenue_city_month %>% select(city, month, revenue) %>% pivot_wider(names_from = month, values_from = revenue, values_fill = 0) revenue_wideStep 5: final tidy dataset ready for plotting
Your plotting-ready dataset is revenue_city_month: each row is one city-month observation with numeric revenue and a count of orders.
revenue_city_month %>% arrange(city, month)