Mini-project overview: from raw data to a shareable report
In this chapter you will run a complete, realistic analysis workflow on a small dataset. The goal is not to learn new syntax, but to integrate skills into a repeatable process: define inputs/outputs, organize files, write reusable functions, validate assumptions, produce plots, and generate a final report grounded in computed results.
Project scenario and questions
Scenario: You manage a small online store and want to understand what drives revenue and whether discounts are helping or hurting profitability.
Dataset: Order-level data with product category, region, dates, quantities, prices, and discounts.
Questions to answer:
- How has revenue changed month-to-month overall and by region?
- Which categories contribute most to revenue and profit?
- Do higher discounts correlate with lower profit margin?
- Are there regions or categories where discounting appears beneficial?
- What are the top 10 products by revenue, and do they have healthy margins?
Scalable project structure (folders, naming, and outputs)
Folder layout
Create a project folder (ideally an RStudio Project) with a predictable structure. This keeps paths stable and makes it easy to rerun or share the work.
    r-mini-project/
      data/
        raw/
        processed/
      R/
        00_setup.R
        01_import_clean.R
        02_analysis.R
        03_visualize.R
      reports/
        report.Rmd
      outputs/
        figures/
        tables/
      README.md
Naming conventions that scale
- Files: prefix scripts with numbers to enforce run order (00_, 01_, 02_).
- Objects: use snake_case and be explicit: `orders_raw`, `orders_clean`, `monthly_kpis`.
- Columns: keep consistent units and meaning (e.g., `discount_pct` as 0–1, not 0–100).
- Outputs: save with informative names and dates if needed: `monthly_revenue_by_region.png`.
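The 0–1 convention for `discount_pct` is easy to violate when data arrives from other systems on a 0–100 scale. A small helper can enforce the convention at import time. This is a minimal sketch (the `normalize_discount` function is hypothetical, not part of the project scripts); it assumes any value above 1 was recorded as a 0–100 percentage:

```r
# Hypothetical helper: coerce discounts to the 0-1 convention.
# Assumption: any value greater than 1 was recorded on a 0-100 scale.
normalize_discount <- function(x) {
  ifelse(x > 1, x / 100, x)
}

normalize_discount(c(0.15, 15, 0.5))  # 0.15 0.15 0.50
```

Values at or below 1 are left unchanged, so a genuine 50% discount stored as 0.5 is not rescaled.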
Dataset: create a realistic CSV (so everyone can run it)
To make this chapter fully runnable without external downloads, you will generate a dataset and write it to data/raw/orders.csv. In a real project, this step would be replaced by placing your actual raw file in the same folder.
Script: R/00_setup.R
```r
# R/00_setup.R
# Packages used across the project
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(stringr)
library(tidyr)

# Create folders if missing
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
dir.create("data/processed", recursive = TRUE, showWarnings = FALSE)
dir.create("outputs/figures", recursive = TRUE, showWarnings = FALSE)
dir.create("outputs/tables", recursive = TRUE, showWarnings = FALSE)

# Reproducible example dataset generator
set.seed(42)
n <- 1200

products <- tibble::tibble(
  product_id = sprintf("P%03d", 1:60),
  category = rep(c("Accessories", "Electronics", "Home", "Office"), length.out = 60),
  base_price = round(runif(60, 8, 300), 2)
)

orders <- tibble::tibble(
  order_id = sprintf("O%05d", 1:n),
  order_date = sample(seq.Date(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"),
                      n, replace = TRUE),
  region = sample(c("North", "South", "East", "West"), n, replace = TRUE,
                  prob = c(0.27, 0.23, 0.28, 0.22)),
  product_id = sample(products$product_id, n, replace = TRUE),
  quantity = sample(1:6, n, replace = TRUE,
                    prob = c(0.35, 0.25, 0.18, 0.12, 0.07, 0.03)),
  discount_pct = pmin(pmax(rnorm(n, mean = 0.12, sd = 0.10), 0), 0.45)
) %>%
  left_join(products, by = "product_id") %>%
  mutate(
    unit_price = base_price * runif(n(), 0.95, 1.10),
    unit_cost = base_price * runif(n(), 0.55, 0.80),
    revenue = quantity * unit_price * (1 - discount_pct),
    cost = quantity * unit_cost,
    profit = revenue - cost
  ) %>%
  select(order_id, order_date, region, product_id, category, quantity,
         unit_price, unit_cost, discount_pct, revenue, cost, profit)

# Introduce a few realistic data issues to practice cleaning
orders_dirty <- orders %>%
  mutate(
    region = ifelse(runif(n()) < 0.01, "", region),                 # blank region
    discount_pct = ifelse(runif(n()) < 0.01, -0.05, discount_pct), # invalid negative discount
    unit_price = ifelse(runif(n()) < 0.01, NA, unit_price),        # missing price
    order_date = ifelse(runif(n()) < 0.01, "2024-13-40",
                        as.character(order_date))                  # invalid date string
  )

write_csv(orders_dirty, "data/raw/orders.csv")
```

Inputs/outputs defined: input is data/raw/orders.csv; processed data will be saved to data/processed/orders_clean.csv; plots and tables will be exported to outputs/.
Import, validate, clean, and transform (with reusable functions)
Cleaning should be explicit and testable. A good pattern is to write small functions for repeated steps (e.g., checking required columns, standardizing text, validating ranges).
Script: R/01_import_clean.R
```r
# R/01_import_clean.R
source("R/00_setup.R")

# Reusable checks -------------------------------------------------------
assert_has_cols <- function(df, cols) {
  missing <- setdiff(cols, names(df))
  if (length(missing) > 0) stop("Missing columns: ", paste(missing, collapse = ", "))
  df
}

clamp <- function(x, lower, upper) pmin(pmax(x, lower), upper)

# Import ----------------------------------------------------------------
orders_raw <- read_csv("data/raw/orders.csv", show_col_types = FALSE)

orders_raw <- assert_has_cols(orders_raw, c(
  "order_id", "order_date", "region", "product_id", "category", "quantity",
  "unit_price", "unit_cost", "discount_pct", "revenue", "cost", "profit"
))

# Clean -----------------------------------------------------------------
orders_clean <- orders_raw %>%
  mutate(
    # Parse date safely; invalid strings become NA
    order_date = suppressWarnings(lubridate::ymd(order_date)),
    # Standardize text fields
    region = str_squish(region),
    region = na_if(region, ""),
    category = str_to_title(str_squish(category)),
    # Fix numeric issues
    discount_pct = clamp(discount_pct, 0, 0.60),
    quantity = as.integer(quantity),
    unit_price = as.numeric(unit_price),
    unit_cost = as.numeric(unit_cost)
  ) %>%
  # Remove rows that cannot be analyzed reliably
  filter(!is.na(order_date), !is.na(region), !is.na(unit_price), quantity > 0) %>%
  # Recompute derived fields from cleaned inputs (avoid trusting dirty totals)
  mutate(
    revenue = quantity * unit_price * (1 - discount_pct),
    cost = quantity * unit_cost,
    profit = revenue - cost,
    profit_margin = ifelse(revenue > 0, profit / revenue, NA_real_),
    month = floor_date(order_date, unit = "month")
  )

# Quick data quality summary ---------------------------------------------
dq <- orders_raw %>%
  summarise(
    n_rows = n(),
    n_bad_dates = sum(is.na(suppressWarnings(ymd(order_date)))),
    n_blank_region = sum(is.na(na_if(str_squish(region), ""))),
    n_missing_price = sum(is.na(unit_price)),
    n_negative_discount = sum(discount_pct < 0, na.rm = TRUE)
  )

write_csv(orders_clean, "data/processed/orders_clean.csv")
write_csv(dq, "outputs/tables/data_quality_summary.csv")
```

Why recompute revenue/profit? In real datasets, totals may be inconsistent with line-item fields. Recomputing ensures your analysis is based on a single, consistent definition.
Analysis plan: define metrics and aggregation levels
Before plotting, decide the metrics and the grain (order-level, product-level, monthly, etc.). Here you will compute KPIs at multiple levels so each question has a clear data source.
Script: R/02_analysis.R
```r
# R/02_analysis.R
source("R/00_setup.R")

orders_clean <- read_csv("data/processed/orders_clean.csv", show_col_types = FALSE)

# Core KPIs --------------------------------------------------------------
overall_kpis <- orders_clean %>%
  summarise(
    n_orders = n(),
    revenue = sum(revenue),
    profit = sum(profit),
    avg_order_value = mean(revenue),
    avg_margin = weighted.mean(profit_margin, w = revenue, na.rm = TRUE)
  )

# Monthly revenue and profit by region -----------------------------------
monthly_kpis <- orders_clean %>%
  group_by(month, region) %>%
  summarise(
    orders = n(),
    revenue = sum(revenue),
    profit = sum(profit),
    margin = ifelse(sum(revenue) > 0, sum(profit) / sum(revenue), NA_real_),
    .groups = "drop"
  )

# Category contribution --------------------------------------------------
category_kpis <- orders_clean %>%
  group_by(category) %>%
  summarise(
    revenue = sum(revenue),
    profit = sum(profit),
    margin = ifelse(sum(revenue) > 0, sum(profit) / sum(revenue), NA_real_),
    .groups = "drop"
  ) %>%
  arrange(desc(revenue))

# Discount vs margin (order-level) ---------------------------------------
discount_margin <- orders_clean %>%
  transmute(
    discount_pct = discount_pct,
    profit_margin = profit_margin,
    revenue = revenue,
    region = region,
    category = category
  ) %>%
  filter(!is.na(profit_margin), revenue > 0)

# Top products -----------------------------------------------------------
top_products <- orders_clean %>%
  group_by(product_id, category) %>%
  summarise(
    revenue = sum(revenue),
    profit = sum(profit),
    margin = ifelse(sum(revenue) > 0, sum(profit) / sum(revenue), NA_real_),
    orders = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(revenue)) %>%
  slice_head(n = 10)

# Export tables ----------------------------------------------------------
write_csv(overall_kpis, "outputs/tables/overall_kpis.csv")
write_csv(monthly_kpis, "outputs/tables/monthly_kpis.csv")
write_csv(category_kpis, "outputs/tables/category_kpis.csv")
write_csv(top_products, "outputs/tables/top_products.csv")
```

Assumption checks and sanity checks (practical, not theoretical)
Assumption checks are about avoiding misleading results. For this mini-project, focus on checks that commonly break business analyses: extreme outliers, tiny denominators, and unstable margins when revenue is near zero.
Checks to run (and how to act on them)
- Revenue near zero: profit margin becomes unstable. Filter or weight by revenue when summarizing.
- Outliers: a few orders can dominate totals. Inspect high-revenue orders and confirm they are plausible.
- Discount range: ensure discounts are within expected bounds (e.g., 0–60%).
- Group sizes: avoid interpreting patterns for groups with very few orders.
Script snippet: add to R/02_analysis.R (or a separate R/02b_checks.R)
```r
# Sanity checks ----------------------------------------------------------
checks <- list(
  revenue_negative = sum(orders_clean$revenue < 0, na.rm = TRUE),
  profit_margin_outside = sum(orders_clean$profit_margin < -1 |
                                orders_clean$profit_margin > 1, na.rm = TRUE),
  discount_outside = sum(orders_clean$discount_pct < 0 |
                           orders_clean$discount_pct > 0.60, na.rm = TRUE)
)

# unlist() first: enframe() on a list produces a list column,
# which write_csv() cannot store in a flat file
checks_df <- tibble::enframe(unlist(checks), name = "check", value = "count")
write_csv(checks_df, "outputs/tables/sanity_checks.csv")

# Inspect top 20 orders by revenue (for plausibility review)
top_orders <- orders_clean %>%
  arrange(desc(revenue)) %>%
  select(order_id, order_date, region, product_id, category, quantity,
         unit_price, discount_pct, revenue, profit) %>%
  slice_head(n = 20)

write_csv(top_orders, "outputs/tables/top_orders_by_revenue.csv")
```

Visualize key patterns and export plots
Exporting plots is part of the deliverable. Keep figure sizes consistent and use a single theme so the report looks cohesive.
Script: R/03_visualize.R
```r
# R/03_visualize.R
source("R/00_setup.R")

monthly_kpis <- read_csv("outputs/tables/monthly_kpis.csv", show_col_types = FALSE)
category_kpis <- read_csv("outputs/tables/category_kpis.csv", show_col_types = FALSE)
top_products <- read_csv("outputs/tables/top_products.csv", show_col_types = FALSE)
orders_clean <- read_csv("data/processed/orders_clean.csv", show_col_types = FALSE)

theme_set(theme_minimal(base_size = 12))

# 1) Monthly revenue trend by region -------------------------------------
p_monthly <- ggplot(monthly_kpis, aes(x = month, y = revenue, color = region)) +
  geom_line(linewidth = 1) +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Monthly revenue by region", x = NULL, y = "Revenue")

ggsave("outputs/figures/monthly_revenue_by_region.png", p_monthly,
       width = 9, height = 5, dpi = 150)

# 2) Category revenue and profit -----------------------------------------
category_long <- category_kpis %>%
  select(category, revenue, profit) %>%
  pivot_longer(cols = c(revenue, profit), names_to = "metric", values_to = "value")

p_category <- ggplot(category_long,
                     aes(x = reorder(category, value), y = value, fill = metric)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Revenue and profit by category", x = NULL, y = NULL)

ggsave("outputs/figures/category_revenue_profit.png", p_category,
       width = 9, height = 5, dpi = 150)

# 3) Discount vs profit margin (weighted view) ---------------------------
# Use point size by revenue to emphasize business impact
p_discount <- ggplot(orders_clean, aes(x = discount_pct, y = profit_margin)) +
  geom_hline(yintercept = 0, linewidth = 0.6, color = "grey50") +
  geom_point(aes(size = revenue), alpha = 0.25) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  guides(size = "none") +
  labs(title = "Discount vs profit margin (point size = revenue)",
       x = "Discount", y = "Profit margin")

ggsave("outputs/figures/discount_vs_margin.png", p_discount,
       width = 9, height = 5, dpi = 150)

# 4) Top products by revenue with margin ---------------------------------
p_top <- ggplot(top_products, aes(x = reorder(product_id, revenue), y = revenue)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Top 10 products by revenue", x = NULL, y = "Revenue")

ggsave("outputs/figures/top_products_by_revenue.png", p_top,
       width = 9, height = 5, dpi = 150)
```

Build the final report: computed results and interpretation prompts
Your report should read like a short decision memo: what you analyzed, what you found, and what the numbers show. Keep interpretation tied to exported tables and plots so it is reproducible.
Report skeleton: reports/report.Rmd (body sections)
```{r setup, include=FALSE}
library(readr)
library(dplyr)
library(ggplot2)
library(scales)
```

```{r load-data}
overall_kpis <- read_csv("../outputs/tables/overall_kpis.csv", show_col_types = FALSE)
monthly_kpis <- read_csv("../outputs/tables/monthly_kpis.csv", show_col_types = FALSE)
category_kpis <- read_csv("../outputs/tables/category_kpis.csv", show_col_types = FALSE)
top_products <- read_csv("../outputs/tables/top_products.csv", show_col_types = FALSE)
```

## Objective and questions

- How has revenue changed month-to-month overall and by region?
- Which categories contribute most to revenue and profit?
- Do higher discounts correlate with lower profit margin?
- What are the top products by revenue and their margins?

## Data quality notes

```{r dq}
read_csv("../outputs/tables/data_quality_summary.csv", show_col_types = FALSE)
```

## Overall KPIs

```{r overall}
overall_kpis
```

## Trends by month and region (see exported figure)

```{r monthly-table}
monthly_kpis %>% arrange(month, region) %>% head(12)
```

## Category contribution

```{r category}
category_kpis
```

## Top products

```{r top-products}
top_products
```

## Interpretation prompts (fill using computed results)

- Identify the highest-revenue region-month combinations from `monthly_kpis`.
- Compare category margins from `category_kpis` and call out categories with high revenue but low margin.
- Use the discount vs margin plot to describe whether margin tends to decrease as discount increases, focusing on high-revenue points.
- For the top 10 products, note any high-revenue items with weak margins and quantify the margin from `top_products`.

Tip: Keep the report focused on answering the original questions. If you add extra analysis, label it clearly as exploratory and ensure it still references computed outputs.
Run order: make the project reproducible
A simple run order ensures anyone can reproduce the same outputs from scratch.
- Run `R/00_setup.R` (creates folders and the example raw CSV).
- Run `R/01_import_clean.R` (creates processed data and data quality table).
- Run `R/02_analysis.R` (creates KPI tables).
- Run `R/03_visualize.R` (exports figures).
- Knit `reports/report.Rmd` to HTML/PDF (depending on your setup).
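The run order above can also be captured in a small driver script, so one command reproduces everything from scratch. This is a sketch (a hypothetical `R/run_all.R`, not part of the original script set); rendering the report this way assumes the rmarkdown package is installed:

```r
# R/run_all.R (hypothetical driver script)
# Runs the pipeline scripts in order, then renders the report.
scripts <- c(
  "R/00_setup.R",
  "R/01_import_clean.R",
  "R/02_analysis.R",
  "R/03_visualize.R"
)

for (s in scripts) {
  message("Running ", s)
  source(s)
}

rmarkdown::render("reports/report.Rmd")
```

Run it from the project root (e.g., `Rscript R/run_all.R`) so all relative paths resolve correctly.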
Deliverable checklist (what to submit)
- Runnable script set: `R/00_setup.R`, `R/01_import_clean.R`, `R/02_analysis.R`, `R/03_visualize.R` run in order without manual edits.
- Processed dataset: `data/processed/orders_clean.csv`.
- Exported tables: `outputs/tables/overall_kpis.csv`, `monthly_kpis.csv`, `category_kpis.csv`, `top_products.csv`, plus `data_quality_summary.csv` and `sanity_checks.csv`.
- Exported plots: PNG files in `outputs/figures/` (monthly trend, category comparison, discount vs margin, top products).
- Knitted report: rendered output from `reports/report.Rmd` that references the exported tables/figures.
- Short interpretation section: 5–10 bullet points in the report that cite specific computed values (e.g., top category margin, lowest-margin region, month with peak revenue) rather than opinions.
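Before submitting, the file-based items on this checklist can be verified mechanically. A minimal sketch (the `check_deliverables` helper is hypothetical) that reports which expected files are missing from the project folder:

```r
# Hypothetical helper: return the subset of expected paths that do not exist.
check_deliverables <- function(paths) {
  paths[!file.exists(paths)]
}

expected <- c(
  "data/processed/orders_clean.csv",
  "outputs/tables/overall_kpis.csv",
  "outputs/tables/monthly_kpis.csv",
  "outputs/tables/category_kpis.csv",
  "outputs/tables/top_products.csv",
  "outputs/tables/data_quality_summary.csv",
  "outputs/tables/sanity_checks.csv",
  "outputs/figures/monthly_revenue_by_region.png"
)

missing <- check_deliverables(expected)
if (length(missing) > 0) {
  message("Missing deliverables:\n", paste("-", missing, collapse = "\n"))
} else {
  message("All listed deliverables found.")
}
```

Run it from the project root after the full pipeline; an empty `missing` vector means every listed file was produced.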