All courses > Technology and Programming > Data Science and Business Intelligence ::

Reshaping and Combining Data: Joins, Bindings, and Pivoting

Capítulo 5

Estimated reading time: 11 minutes

Why reshaping and combining matter

Real analysis rarely lives in a single table. You might receive one file with customer details, another with orders, and a third with monthly metrics. To analyze or plot, you typically need to (1) stack similar tables (row-binding), (2) attach columns side-by-side (col-binding), (3) merge related tables using keys (joins), and (4) reshape between wide and long formats (pivoting) so each variable has a clear place.

Setup: packages and example data

This chapter uses functions from dplyr and tidyr. The examples below create small tibbles you can run as-is.

library(dplyr) library(tidyr) customers <- tibble( customer_id = c(101, 102, 103, 104), name = c("Asha", "Ben", "Chen", "Dina"), city = c("Leeds", "Leeds", "York", "York") ) orders <- tibble( order_id = c("o1", "o2", "o3", "o4", "o5"), customer_id = c(101, 101, 102, 999, 104), order_date = as.Date(c("2025-01-02","2025-01-10","2025-01-11","2025-01-12","2025-01-20")), amount = c(120, 80, 50, 60, 200) )

Binding data: `bind_rows()` and `bind_cols()`

`bind_rows()`: stack tables with the same meaning

Use bind_rows() when you have the same columns (or mostly the same) and want to append observations. Typical case: monthly extracts with identical structure.

jan <- tibble(customer_id = c(101, 102), spend = c(200, 50)) feb <- tibble(customer_id = c(101, 103), spend = c(180, 90)) spend_q1 <- bind_rows(jan, feb) spend_q1

If columns differ, bind_rows() will create missing columns and fill with NA. This is useful, but it can also hide schema drift, so inspect the result.

feb_extra <- tibble(customer_id = c(104), spend = c(70), promo = c("Y")) spend_mixed <- bind_rows(jan, feb_extra) spend_mixed

To track the source table, use .id.

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

spend_tagged <- bind_rows(jan = jan, feb = feb, .id = "month") spend_tagged

`bind_cols()`: attach columns by position

bind_cols() glues columns side-by-side based on row order, not on keys. Use it only when you are sure rows correspond exactly (same number of rows and same ordering).

a <- tibble(id = c(1, 2, 3)) b <- tibble(score = c(10, 20, 30)) bind_cols(a, b)

Risk: if b is in a different order, you silently misalign data. If you need matching by an identifier, use a join instead.

Joins: combining tables by keys

A join merges tables by matching key columns (identifiers). In most analysis workflows, joins are safer than column-binding because they match explicitly rather than by row position.

Key vocabulary

Key: column(s) used to match rows (e.g., customer_id).
Primary key: uniquely identifies rows in a table (should have no duplicates and no missing values).
Foreign key: references a primary key in another table (may have duplicates; missing/unmatched values indicate data quality issues).

Join types you will use most

Left join (left_join()): keep all rows from the left table; bring matching columns from the right table; unmatched right-side values become NA.
Inner join (inner_join()): keep only rows with matches in both tables.
Full join (full_join()): keep all rows from both tables; unmatched values become NA on the other side.

Step-by-step: a simple left join

Goal: attach customer attributes (name, city) to each order.

orders_enriched <- orders %>% left_join(customers, by = "customer_id") orders_enriched

Notice what happens to customer_id == 999: it has no match in customers, so name and city become NA. That is a key quality signal, not something to ignore.

Inner join to keep only matched records

If you want to analyze only orders with known customers, use inner_join().

orders_matched <- orders %>% inner_join(customers, by = "customer_id") orders_matched

Full join to reconcile two sources

Use full_join() when you want to see everything from both tables and identify gaps.

all_links <- orders %>% full_join(customers, by = "customer_id") all_links

Joining on differently named keys

Sometimes the same identifier has different column names. Use a named vector in by.

cust_lookup <- customers %>% transmute(cust_id = customer_id, segment = c("A","B","A","C")) customers %>% left_join(cust_lookup, by = c("customer_id" = "cust_id"))

Key quality checks before and after joins

Most join bugs come from key problems: missing values, duplicates where you expected uniqueness, or mismatched formats (e.g., "001" vs 1). Build quick checks into your workflow.

Check 1: missing keys

customers %>% summarise(missing_customer_id = sum(is.na(customer_id))) orders %>% summarise(missing_customer_id = sum(is.na(customer_id)))

Check 2: uniqueness of a supposed primary key

If customer_id should be unique in customers, verify it.

customers %>% count(customer_id) %>% filter(n > 1)

Check 3: unmatched keys (anti-joins)

Use anti_join() to find rows that will not match. This is one of the fastest ways to debug joins.

orders %>% anti_join(customers, by = "customer_id") customers %>% anti_join(orders, by = "customer_id")

Check 4: join relationship expectations

Many joins should be one-to-many (one customer, many orders). If both sides have duplicates for the join key, you can accidentally create a many-to-many join that multiplies rows. Recent dplyr versions can warn about this; you should still check explicitly.

orders %>% count(customer_id) %>% filter(n > 1) customers %>% count(customer_id) %>% filter(n > 1)

Handling duplicates and many-to-many joins

Example: duplicate keys on the right side

Suppose a customer table accidentally has two rows for the same customer_id. A join will duplicate matching orders.

customers_dup <- bind_rows(customers, tibble(customer_id = 101, name = "Asha", city = "Leeds")) orders %>% left_join(customers_dup, by = "customer_id") %>% count(order_id) %>% filter(n > 1)

Fix options depend on your business rule:

If duplicates are truly erroneous, remove them (after deciding which row to keep).
If duplicates represent multiple valid records (e.g., multiple addresses), join to a more appropriate key or reshape first.

A simple de-duplication approach is to keep the first row per key (only do this if it matches your rule).

customers_dedup <- customers_dup %>% distinct(customer_id, .keep_all = TRUE) orders %>% left_join(customers_dedup, by = "customer_id")

Example: duplicates on both sides (many-to-many)

Many-to-many joins can explode row counts. Here is a small illustration using a product tagging table where a product can have multiple tags and an order can contain multiple products; joining without care can multiply rows.

order_items <- tibble(order_id = c("o1","o1","o2"), product_id = c("p1","p2","p1")) product_tags <- tibble(product_id = c("p1","p1","p2"), tag = c("sale","eco","new")) order_items %>% left_join(product_tags, by = "product_id")

This multiplication is expected here because each product can have multiple tags. The key is to anticipate it and decide whether you want that expanded structure (often yes, for counting tags) or whether you should aggregate first.

Suffix collisions: when both tables have the same column names

If both tables contain a column with the same name (other than the key), dplyr adds suffixes like .x and .y. You can control suffixes to keep meaning clear.

customers_status <- customers %>% mutate(status = c("active","active","inactive","active")) orders_status <- orders %>% mutate(status = c("paid","paid","refunded","paid","paid")) orders_status %>% left_join(customers_status, by = "customer_id", suffix = c("_order", "_customer"))

Pivoting: switching between wide and long formats

Pivoting changes the shape of a dataset without changing its information. Many analyses and plots prefer long format (one measurement per row), while reporting tables often prefer wide format (one column per measurement type).

`pivot_longer()`: wide to long

Example: monthly spend stored in separate columns. This is wide format.

spend_wide <- tibble( customer_id = c(101, 102, 103), spend_jan = c(200, 50, 0), spend_feb = c(180, 0, 90) )

Convert to long format with a month column and a spend column.

spend_long <- spend_wide %>% pivot_longer( cols = starts_with("spend_"), names_to = "month", values_to = "spend" ) spend_long

Often you want to clean the names while pivoting (e.g., remove the spend_ prefix).

spend_long2 <- spend_wide %>% pivot_longer( cols = starts_with("spend_"), names_to = "month", names_prefix = "spend_", values_to = "spend" ) spend_long2

`pivot_wider()`: long to wide

Convert back to wide when you need one row per customer with separate month columns.

spend_wide2 <- spend_long2 %>% pivot_wider( names_from = month, values_from = spend ) spend_wide2

If there can be multiple values per key combination, you must decide how to combine them. For example, if you have multiple transactions per customer per month, you might sum them while widening.

tx <- tibble( customer_id = c(101,101,101,102), month = c("jan","jan","feb","jan"), spend = c(120, 80, 180, 50) ) tx %>% pivot_wider( names_from = month, values_from = spend, values_fn = sum, values_fill = 0 )

Exercises: from simple joins to messy realities

Exercise 1: basic matching keys

Task: join customers onto orders so each order has name and city.

Create orders_enriched using left_join().
Count how many orders have missing name.

orders_enriched <- orders %>% left_join(customers, by = "customer_id") orders_enriched %>% summarise(missing_name = sum(is.na(name)))

Exercise 2: find unmatched records before joining

Task: identify orders whose customer_id is not present in customers.

unmatched_orders <- orders %>% anti_join(customers, by = "customer_id") unmatched_orders

Extension: decide a rule. For example, filter them out for analysis, or keep them but label them as "Unknown".

orders_labeled <- orders_enriched %>% mutate(name = if_else(is.na(name), "Unknown", name), city = if_else(is.na(city), "Unknown", city))

Exercise 3: duplicates causing row multiplication

Task: simulate a duplicate customer row and observe the effect on row counts.

Create customers_dup with an extra row for customer_id == 101.
Join to orders and compare nrow() before and after.
Fix by de-duplicating and re-joining.

customers_dup <- bind_rows(customers, tibble(customer_id = 101, name = "Asha", city = "Leeds")) nrow(orders) nrow(orders %>% left_join(customers_dup, by = "customer_id")) customers_dedup <- customers_dup %>% distinct(customer_id, .keep_all = TRUE) nrow(orders %>% left_join(customers_dedup, by = "customer_id"))

Exercise 4: suffix collisions

Task: join two tables that both contain a status column and set meaningful suffixes.

customers_status <- customers %>% mutate(status = c("active","active","inactive","active")) orders_status <- orders %>% mutate(status = c("paid","paid","refunded","paid","paid")) joined <- orders_status %>% left_join(customers_status, by = "customer_id", suffix = c("_order", "_customer")) joined %>% select(order_id, customer_id, status_order, status_customer)

Exercise 5: pivot for analysis needs

Task: reshape transaction data to support plotting monthly totals per customer.

Start from tx (long).
Compute totals per customer per month (if needed).
Widen to one row per customer with month columns, filling missing with 0.

tx_totals <- tx %>% group_by(customer_id, month) %>% summarise(spend = sum(spend), .groups = "drop") tx_wide <- tx_totals %>% pivot_wider(names_from = month, values_from = spend, values_fill = 0) tx_wide

Case task: merge two related tables, reshape, and produce a tidy dataset

You will build a tidy dataset ready for plotting monthly revenue by city. You have two tables: orders (transactions) and customers (attributes). The final dataset should have one row per city and month with total revenue.

Step 1: check key quality and unmatched records

orders %>% anti_join(customers, by = "customer_id")

Decide a rule for unmatched customers. Here we keep them but label city as "Unknown" so totals still reconcile to the original orders table.

Step 2: join and create a month field

orders_joined <- orders %>% left_join(customers, by = "customer_id") %>% mutate(city = if_else(is.na(city), "Unknown", city), month = format(order_date, "%Y-%m"))

Step 3: aggregate to a tidy long dataset (best for plotting)

revenue_city_month <- orders_joined %>% group_by(city, month) %>% summarise(revenue = sum(amount), orders = n(), .groups = "drop") revenue_city_month

Step 4: optional reshape to wide for reporting

If you want a report-style table with one row per city and one column per month, widen it. This is optional; for most plotting, keep the long table from Step 3.

revenue_wide <- revenue_city_month %>% select(city, month, revenue) %>% pivot_wider(names_from = month, values_from = revenue, values_fill = 0) revenue_wide

Step 5: final tidy dataset ready for plotting

Your plotting-ready dataset is revenue_city_month: each row is one city-month observation with numeric revenue and a count of orders.

revenue_city_month %>% arrange(city, month)

Now answer the exercise about the content:

You have an orders table and a customers table. Some orders have customer_id values that do not exist in customers. You want to keep all orders, but clearly flag missing customer attributes after merging. What approach best achieves this?

You are right! Congratulations, now go to the next page

You missed! Try again.

left_join() keeps every row from the orders table and adds customer columns when keys match. Unmatched customers produce NA values, which you can then replace with a label (e.g., "Unknown") to keep totals while flagging missing matches.