A Practical End-to-End R Data Analysis Project

Chapter 8

Estimated reading time: 13 minutes

Mini-project overview: from raw data to a shareable report

In this chapter you will run a complete, realistic analysis workflow on a small dataset. The goal is not to learn new syntax, but to integrate skills into a repeatable process: define inputs/outputs, organize files, write reusable functions, validate assumptions, produce plots, and generate a final report grounded in computed results.

Project scenario and questions

Scenario: You manage a small online store and want to understand what drives revenue and whether discounts are helping or hurting profitability.

Dataset: Order-level data with product category, region, dates, quantities, prices, and discounts.

Questions to answer:

  • How has revenue changed month-to-month overall and by region?
  • Which categories contribute most to revenue and profit?
  • Do higher discounts correlate with lower profit margin?
  • Are there regions or categories where discounting appears beneficial?
  • What are the top 10 products by revenue, and do they have healthy margins?

Scalable project structure (folders, naming, and outputs)

Folder layout

Create a project folder (ideally an RStudio Project) with a predictable structure. This keeps paths stable and makes it easy to rerun or share the work.

r-mini-project/
  data/
    raw/
    processed/
  R/
    00_setup.R
    01_import_clean.R
    02_analysis.R
    03_visualize.R
  reports/
    report.Rmd
  outputs/
    figures/
    tables/
  README.md

Naming conventions that scale

  • Files: prefix scripts with numbers to enforce run order (00_, 01_, 02_).
  • Objects: use snake_case and be explicit: orders_raw, orders_clean, monthly_kpis.
  • Columns: keep consistent units and meaning (e.g., discount_pct as 0–1, not 0–100); a guard for this is sketched after this list.
  • Outputs: save with informative names and dates if needed: monthly_revenue_by_region.png.
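If a raw file might deliver percentages on a 0–100 scale, one option is to normalize once at import so every later step can assume 0–1. This is a minimal sketch: normalize_pct is a hypothetical helper, and its values-above-1 heuristic is an assumption about your data, not a general rule.

# Hypothetical helper: rescale a percent column to 0-1 if it looks like it
# arrived on a 0-100 scale (heuristic: any value above 1). Apply it once at
# import, e.g. mutate(discount_pct = normalize_pct(discount_pct)).
normalize_pct <- function(x) {
  if (any(x > 1, na.rm = TRUE)) x / 100 else x
}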

Dataset: create a realistic CSV (so everyone can run it)

To make this chapter fully runnable without external downloads, you will generate a dataset and write it to data/raw/orders.csv. In a real project, this step would be replaced by placing your actual raw file in the same folder.

Script: R/00_setup.R

# R/00_setup.R

# Packages used across the project
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(stringr)
library(tidyr)

# Create folders if missing
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)
dir.create("outputs/figures", recursive = TRUE, showWarnings = FALSE)
dir.create("outputs/tables", recursive = TRUE, showWarnings = FALSE)

# Reproducible example dataset generator
set.seed(42)
n <- 1200

products <- tibble::tibble(
  product_id = sprintf("P%03d", 1:60),
  category = rep(c("Accessories", "Electronics", "Home", "Office"), length.out = 60),
  base_price = round(runif(60, 8, 300), 2)
)

orders <- tibble::tibble(
  order_id = sprintf("O%05d", 1:n),
  order_date = sample(seq.Date(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"),
                      n, replace = TRUE),
  region = sample(c("North", "South", "East", "West"), n, replace = TRUE,
                  prob = c(0.27, 0.23, 0.28, 0.22)),
  product_id = sample(products$product_id, n, replace = TRUE),
  quantity = sample(1:6, n, replace = TRUE,
                    prob = c(0.35, 0.25, 0.18, 0.12, 0.07, 0.03)),
  discount_pct = pmin(pmax(rnorm(n, mean = 0.12, sd = 0.10), 0), 0.45)
) %>%
  left_join(products, by = "product_id") %>%
  mutate(
    unit_price = base_price * runif(n(), 0.95, 1.10),
    unit_cost = base_price * runif(n(), 0.55, 0.80),
    revenue = quantity * unit_price * (1 - discount_pct),
    cost = quantity * unit_cost,
    profit = revenue - cost
  ) %>%
  select(order_id, order_date, region, product_id, category, quantity,
         unit_price, unit_cost, discount_pct, revenue, cost, profit)

# Introduce a few realistic data issues to practice cleaning
orders_dirty <- orders %>%
  mutate(
    region = ifelse(runif(n()) < 0.01, "", region),                    # blank region
    discount_pct = ifelse(runif(n()) < 0.01, -0.05, discount_pct),     # invalid negative discount
    unit_price = ifelse(runif(n()) < 0.01, NA, unit_price),            # missing price
    order_date = ifelse(runif(n()) < 0.01, "2024-13-40",
                        as.character(order_date))                      # invalid date string
  )

write_csv(orders_dirty, "data/raw/orders.csv")

Inputs/outputs defined: input is data/raw/orders.csv; processed data will be saved to data/processed/orders_clean.csv; plots and tables will be exported to outputs/.

Import, validate, clean, and transform (with reusable functions)

Cleaning should be explicit and testable. A good pattern is to write small functions for repeated steps (e.g., checking required columns, standardizing text, validating ranges).

Script: R/01_import_clean.R

# R/01_import_clean.R

source("R/00_setup.R")

# Reusable checks --------------------------------------------------------

assert_has_cols <- function(df, cols) {
  missing <- setdiff(cols, names(df))
  if (length(missing) > 0) stop("Missing columns: ", paste(missing, collapse = ", "))
  df
}

clamp <- function(x, lower, upper) pmin(pmax(x, lower), upper)

# Import -----------------------------------------------------------------

orders_raw <- read_csv("data/raw/orders.csv", show_col_types = FALSE)

orders_raw <- assert_has_cols(orders_raw, c(
  "order_id", "order_date", "region", "product_id", "category", "quantity",
  "unit_price", "unit_cost", "discount_pct", "revenue", "cost", "profit"
))

# Clean ------------------------------------------------------------------

orders_clean <- orders_raw %>%
  mutate(
    # Parse date safely; invalid strings become NA
    order_date = suppressWarnings(lubridate::ymd(order_date)),

    # Standardize text fields
    region = str_squish(region),
    region = na_if(region, ""),
    category = str_to_title(str_squish(category)),

    # Fix numeric issues
    discount_pct = clamp(discount_pct, 0, 0.60),
    quantity = as.integer(quantity),
    unit_price = as.numeric(unit_price),
    unit_cost = as.numeric(unit_cost)
  ) %>%
  # Remove rows that cannot be analyzed reliably
  filter(!is.na(order_date), !is.na(region), !is.na(unit_price), quantity > 0) %>%
  # Recompute derived fields from cleaned inputs (avoid trusting dirty totals)
  mutate(
    revenue = quantity * unit_price * (1 - discount_pct),
    cost = quantity * unit_cost,
    profit = revenue - cost,
    profit_margin = ifelse(revenue > 0, profit / revenue, NA_real_),
    month = floor_date(order_date, unit = "month")
  )

# Quick data quality summary ----------------------------------------------

dq <- orders_raw %>%
  summarise(
    n_rows = n(),
    n_bad_dates = sum(is.na(suppressWarnings(ymd(order_date)))),
    n_blank_region = sum(is.na(na_if(str_squish(region), ""))),
    n_missing_price = sum(is.na(unit_price)),
    n_negative_discount = sum(discount_pct < 0, na.rm = TRUE)
  )

write_csv(orders_clean, "data/processed/orders_clean.csv")
write_csv(dq, "outputs/tables/data_quality_summary.csv")

Why recompute revenue/profit? In real datasets, totals may be inconsistent with line-item fields. Recomputing ensures your analysis is based on a single, consistent definition.
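To make the point concrete, here is an optional check you could drop into R/01_import_clean.R after the import step. The column names follow this project's schema; the 0.01 tolerance is an arbitrary choice for float comparison, and in this generated dataset any mismatches come from the issues injected by 00_setup.R.

# Optional: quantify how often the imported revenue column disagrees with
# quantity * unit_price * (1 - discount_pct) in the raw file. In real data
# this check often surfaces rounding rules or stale totals.
revenue_mismatch <- orders_raw %>%
  mutate(revenue_recomputed = quantity * unit_price * (1 - discount_pct)) %>%
  summarise(
    n_rows     = n(),
    n_mismatch = sum(abs(revenue - revenue_recomputed) > 0.01, na.rm = TRUE)  # 0.01 = tolerance
  )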

Analysis plan: define metrics and aggregation levels

Before plotting, decide the metrics and the grain (order-level, product-level, monthly, etc.). Here you will compute KPIs at multiple levels so each question has a clear data source.

Script: R/02_analysis.R

# R/02_analysis.R

source("R/00_setup.R")

orders_clean <- read_csv("data/processed/orders_clean.csv", show_col_types = FALSE)

# Core KPIs ----------------------------------------------------------------
# Note: summarise() makes each new column visible to later expressions, so
# the order-level averages are computed before `revenue` is redefined as a
# total (otherwise mean(revenue) would collapse to the total, and
# weighted.mean() would fail on a length-1 weight).

overall_kpis <- orders_clean %>%
  summarise(
    n_orders = n(),
    avg_order_value = mean(revenue),
    avg_margin = weighted.mean(profit_margin, w = revenue, na.rm = TRUE),
    revenue = sum(revenue),
    profit = sum(profit)
  )

# Monthly revenue and profit by region --------------------------------------

monthly_kpis <- orders_clean %>%
  group_by(month, region) %>%
  summarise(
    orders = n(),
    revenue = sum(revenue),
    profit = sum(profit),
    margin = ifelse(revenue > 0, profit / revenue, NA_real_),  # uses the group totals above
    .groups = "drop"
  )

# Category contribution ------------------------------------------------------

category_kpis <- orders_clean %>%
  group_by(category) %>%
  summarise(
    revenue = sum(revenue),
    profit = sum(profit),
    margin = ifelse(revenue > 0, profit / revenue, NA_real_),
    .groups = "drop"
  ) %>%
  arrange(desc(revenue))

# Discount vs margin (order-level) -------------------------------------------

discount_margin <- orders_clean %>%
  select(discount_pct, profit_margin, revenue, region, category) %>%
  filter(!is.na(profit_margin), revenue > 0)

# Top products ----------------------------------------------------------------

top_products <- orders_clean %>%
  group_by(product_id, category) %>%
  summarise(
    revenue = sum(revenue),
    profit = sum(profit),
    margin = ifelse(revenue > 0, profit / revenue, NA_real_),
    orders = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(revenue)) %>%
  slice_head(n = 10)

# Export tables ----------------------------------------------------------------

write_csv(overall_kpis, "outputs/tables/overall_kpis.csv")
write_csv(monthly_kpis, "outputs/tables/monthly_kpis.csv")
write_csv(category_kpis, "outputs/tables/category_kpis.csv")
write_csv(top_products, "outputs/tables/top_products.csv")
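The discount question ("Do higher discounts correlate with lower profit margin?") can also get a single-number summary. A rough sketch you could append to R/02_analysis.R follows; the exported file name is a suggestion, and Pearson correlation is only a linear summary, so treat it as a companion to the scatter plot, not a replacement.

# Optional: correlation between discount and margin, by category.
# A rough linear summary; small groups and outliers can distort it.
cor_by_category <- discount_margin %>%
  group_by(category) %>%
  summarise(
    n = n(),
    cor_discount_margin = cor(discount_pct, profit_margin, use = "complete.obs"),
    .groups = "drop"
  )

write_csv(cor_by_category, "outputs/tables/discount_margin_correlation.csv")  # suggested name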

Assumption checks and sanity checks (practical, not theoretical)

Assumption checks are about avoiding misleading results. For this mini-project, focus on checks that commonly break business analyses: extreme outliers, tiny denominators, and unstable margins when revenue is near zero.

Checks to run (and how to act on them)

  • Revenue near zero: profit margin becomes unstable. Filter or weight by revenue when summarizing.
  • Outliers: a few orders can dominate totals. Inspect high-revenue orders and confirm they are plausible.
  • Discount range: ensure discounts are within expected bounds (e.g., 0–60%).
  • Group sizes: avoid interpreting patterns for groups with very few orders (a sketch of this check follows the snippet below).

Script snippet: add to R/02_analysis.R (or a separate R/02b_checks.R)

# Sanity checks --------------------------------------------------------------

checks <- list(
  revenue_negative = sum(orders_clean$revenue < 0, na.rm = TRUE),
  profit_margin_outside = sum(orders_clean$profit_margin < -1 |
                                orders_clean$profit_margin > 1, na.rm = TRUE),
  discount_outside = sum(orders_clean$discount_pct < 0 |
                           orders_clean$discount_pct > 0.60, na.rm = TRUE)
)

# unlist() first: enframe() on a list would create a list column,
# which write_csv() cannot store.
checks_df <- tibble::enframe(unlist(checks), name = "check", value = "count")
write_csv(checks_df, "outputs/tables/sanity_checks.csv")

# Inspect top 20 orders by revenue (for plausibility review)

top_orders <- orders_clean %>%
  arrange(desc(revenue)) %>%
  select(order_id, order_date, region, product_id, category,
         quantity, unit_price, discount_pct, revenue, profit) %>%
  slice_head(n = 20)

write_csv(top_orders, "outputs/tables/top_orders_by_revenue.csv")
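The snippet above covers the first three checks but not group sizes. A minimal sketch for that, assuming the same orders_clean data; the threshold of 30 orders is an arbitrary cutoff, not a statistical rule.

# Group-size check (sketch): flag region x category cells with too few
# orders to interpret confidently.
small_groups <- orders_clean %>%
  count(region, category, name = "n_orders") %>%
  filter(n_orders < 30) %>%   # 30 is an arbitrary threshold; adjust to taste
  arrange(n_orders)

write_csv(small_groups, "outputs/tables/small_groups.csv")  # suggested name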

Visualize key patterns and export plots

Exporting plots is part of the deliverable. Keep figure sizes consistent and use a single theme so the report looks cohesive.

Script: R/03_visualize.R

# R/03_visualize.R

source("R/00_setup.R")

monthly_kpis <- read_csv("outputs/tables/monthly_kpis.csv", show_col_types = FALSE)
category_kpis <- read_csv("outputs/tables/category_kpis.csv", show_col_types = FALSE)
top_products <- read_csv("outputs/tables/top_products.csv", show_col_types = FALSE)
orders_clean <- read_csv("data/processed/orders_clean.csv", show_col_types = FALSE)

theme_set(theme_minimal(base_size = 12))

# 1) Monthly revenue trend by region ------------------------------------

p_monthly <- ggplot(monthly_kpis, aes(x = month, y = revenue, color = region)) +
  geom_line(linewidth = 1) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Monthly revenue by region", x = NULL, y = "Revenue")

ggsave("outputs/figures/monthly_revenue_by_region.png", p_monthly,
       width = 9, height = 5, dpi = 150)

# 2) Category revenue and profit ----------------------------------------

category_long <- category_kpis %>%
  select(category, revenue, profit) %>%
  pivot_longer(cols = c(revenue, profit), names_to = "metric", values_to = "value")

p_category <- ggplot(category_long, aes(x = reorder(category, value), y = value, fill = metric)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Revenue and profit by category", x = NULL, y = NULL)

ggsave("outputs/figures/category_revenue_profit.png", p_category,
       width = 9, height = 5, dpi = 150)

# 3) Discount vs profit margin (weighted view) --------------------------
# Use point size by revenue to emphasize business impact

p_discount <- ggplot(orders_clean, aes(x = discount_pct, y = profit_margin)) +
  geom_hline(yintercept = 0, linewidth = 0.6, color = "grey50") +
  geom_point(aes(size = revenue), alpha = 0.25) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  guides(size = "none") +
  labs(title = "Discount vs profit margin (point size = revenue)",
       x = "Discount", y = "Profit margin")

ggsave("outputs/figures/discount_vs_margin.png", p_discount,
       width = 9, height = 5, dpi = 150)

# 4) Top products by revenue with margin --------------------------------

p_top <- ggplot(top_products, aes(x = reorder(product_id, revenue), y = revenue)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Top 10 products by revenue", x = NULL, y = "Revenue")

ggsave("outputs/figures/top_products_by_revenue.png", p_top,
       width = 9, height = 5, dpi = 150)
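To keep every figure the same size, you could also funnel exports through one wrapper instead of repeating width/height/dpi on each ggsave() call. save_fig below is a hypothetical helper, shown as one way to do it, not part of any package.

# Hypothetical helper: a single place to set figure dimensions and dpi so
# all exported plots stay consistent. Define it near the top of the script.
save_fig <- function(plot, filename, width = 9, height = 5, dpi = 150) {
  ggsave(file.path("outputs/figures", filename), plot,
         width = width, height = height, dpi = dpi)
}

# Usage: save_fig(p_monthly, "monthly_revenue_by_region.png")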

Build the final report: computed results and interpretation prompts

Your report should read like a short decision memo: what you analyzed, what you found, and what the numbers show. Keep interpretation tied to exported tables and plots so it is reproducible.

Report skeleton: reports/report.Rmd (body sections)

```{r setup, include=FALSE}
library(readr)
library(dplyr)
library(ggplot2)
library(scales)
```

```{r load-data}
overall_kpis <- read_csv("../outputs/tables/overall_kpis.csv", show_col_types = FALSE)
monthly_kpis <- read_csv("../outputs/tables/monthly_kpis.csv", show_col_types = FALSE)
category_kpis <- read_csv("../outputs/tables/category_kpis.csv", show_col_types = FALSE)
top_products <- read_csv("../outputs/tables/top_products.csv", show_col_types = FALSE)
```

## Objective and questions
- How has revenue changed month-to-month overall and by region?
- Which categories contribute most to revenue and profit?
- Do higher discounts correlate with lower profit margin?
- What are the top products by revenue and their margins?

## Data quality notes
```{r dq}
read_csv("../outputs/tables/data_quality_summary.csv", show_col_types = FALSE)
```

## Overall KPIs
```{r overall}
overall_kpis
```

## Trends by month and region (see exported figure)
```{r monthly-table}
monthly_kpis %>% arrange(month, region) %>% head(12)
```

## Category contribution
```{r category}
category_kpis
```

## Top products
```{r top-products}
top_products
```

## Interpretation prompts (fill using computed results)
- Identify the highest-revenue region-month combinations from `monthly_kpis`.
- Compare category margins from `category_kpis` and call out categories with high revenue but low margin.
- Use the discount vs margin plot to describe whether margin tends to decrease as discount increases, focusing on high-revenue points.
- For the top 10 products, note any high-revenue items with weak margins and quantify the margin from `top_products`.
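R Markdown also supports inline code, which is useful for citing computed values directly in the memo prose instead of hard-coding numbers. For example, a sentence in the report body could read (column names are the ones produced by overall_kpis above):

Total revenue was `r dollar(overall_kpis$revenue)` across
`r overall_kpis$n_orders` orders, with a revenue-weighted margin of
`r percent(overall_kpis$avg_margin)`.

This keeps the narrative in sync with the data: rerunning the pipeline updates the sentence automatically.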

Tip: Keep the report focused on answering the original questions. If you add extra analysis, label it clearly as exploratory and ensure it still references computed outputs.

Run order: make the project reproducible

A simple run order ensures anyone can reproduce the same outputs from scratch.

  • Run R/00_setup.R (creates folders and the example raw CSV).
  • Run R/01_import_clean.R (creates processed data and data quality table).
  • Run R/02_analysis.R (creates KPI tables).
  • Run R/03_visualize.R (exports figures).
  • Knit reports/report.Rmd to HTML/PDF (depending on your setup).
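If you prefer a single entry point, a minimal runner script is sketched below. The file name R/run_all.R is a suggestion, not one of the required scripts, and the last line assumes the rmarkdown package is installed.

# R/run_all.R (suggested): reproduce every output from scratch,
# in the documented order.
source("R/00_setup.R")
source("R/01_import_clean.R")
source("R/02_analysis.R")
source("R/03_visualize.R")
rmarkdown::render("reports/report.Rmd")  # assumes rmarkdown is installed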

Deliverable checklist (what to submit)

  • Runnable script set: R/00_setup.R, R/01_import_clean.R, R/02_analysis.R, R/03_visualize.R run in order without manual edits.
  • Processed dataset: data/processed/orders_clean.csv.
  • Exported tables: outputs/tables/overall_kpis.csv, monthly_kpis.csv, category_kpis.csv, top_products.csv, plus data_quality_summary.csv and sanity_checks.csv.
  • Exported plots: PNG files in outputs/figures/ (monthly trend, category comparison, discount vs margin, top products).
  • Knitted report: rendered output from reports/report.Rmd that references the exported tables/figures.
  • Short interpretation section: 5–10 bullet points in the report that cite specific computed values (e.g., top category margin, lowest-margin region, month with peak revenue) rather than opinions.

Now answer the exercise about the content:

Why does the workflow recompute revenue and profit after cleaning the data instead of trusting the imported totals?

Answer: Recomputing derived fields (revenue, cost, profit) from cleaned inputs avoids relying on inconsistent or dirty totals and keeps the analysis grounded in one consistent definition.
