Importing Real-World Files into R (CSV, Excel, and Text Data)

Chapter 3

Estimated reading time: 10 minutes


Why imports fail in practice (and how to make them reliable)

Real-world files rarely arrive in a perfectly clean format. Common issues include: the wrong delimiter (comma vs semicolon vs tab), unexpected header rows, inconsistent missing-value markers ("NA", "", "-", "NULL"), text encoding problems (accented characters), and ambiguous dates (01/02/2024: is it January 2 or February 1?). Reliable importing means you explicitly state your assumptions in code, verify the result with quick checks, and adjust until the imported table matches the source meaning.
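
A quick way to check those assumptions before importing is to peek at the raw lines of the file. This is a minimal sketch using readr's read_lines() on an example path; the first few lines usually reveal the delimiter, any header or title rows, and the missing-value markers in use.

library(readr)

# Peek at the first raw lines to spot the delimiter, header rows,
# and missing-value markers before choosing import settings
read_lines("data/sales_2024_01.csv", n_max = 5)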

Core tools: readr for text files, readxl for Excel

Packages and a repeatable file path setup

Use readr for delimited text (CSV, TSV, pipe-delimited) and readxl for .xls/.xlsx. Prefer project-relative paths so your code runs on another machine without edits.

library(readr)   # text files: csv, tsv, delimited, fixed-width helpers, parsing tools
library(readxl)  # Excel files
library(dplyr)   # for quick checks and combining
library(janitor) # for clean_names()

# Example paths (adjust to your project structure)
path_csv   <- "data/sales_2024_01.csv"
path_excel <- "data/sales_2024_02.xlsx"
path_txt   <- "data/sales_2024_03.txt"
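
If you work inside an RStudio project, the here package is an optional alternative for building project-relative paths; this sketch assumes you have here installed.

# Optional: the here package resolves paths from the project root,
# so the code works regardless of the current working directory
library(here)
path_csv <- here("data", "sales_2024_01.csv")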

Importing CSV and other delimited text with readr

1) Basic CSV import with explicit missing values

Start with read_csv() and declare which strings should become missing (NA). This prevents "-" or "NULL" from being treated as real values.

sales_jan <- read_csv(
  file = path_csv,
  na = c("", "NA", "N/A", "NULL", "-"),
  show_col_types = FALSE
)

2) Handling delimiters: comma, semicolon, tab, pipe

If a file uses semicolons (common in some locales) or tabs, choose the matching function or use read_delim() with delim.

# Semicolon-delimited
sales_sc <- read_delim(
  file = "data/sales_semicolon.csv",
  delim = ";",
  na = c("", "NA", "-"),
  show_col_types = FALSE
)

# Tab-delimited
sales_tsv <- read_tsv(
  file = "data/sales.tsv",
  na = c("", "NA"),
  show_col_types = FALSE
)

# Pipe-delimited
sales_pipe <- read_delim(
  file = "data/sales_pipe.txt",
  delim = "|",
  na = c("", "NA"),
  show_col_types = FALSE
)

3) Headers and skipped lines

Some exports include a title line, notes, or multiple header rows. Use skip to ignore leading lines, and set col_names when the file has no header row.

# Skip the first 2 lines (e.g., report title + generated timestamp)
sales_skipped <- read_csv(
  file = "data/sales_with_notes.csv",
  skip = 2,
  na = c("", "NA"),
  show_col_types = FALSE
)

# File has no header row; provide names
sales_no_header <- read_csv(
  file = "data/sales_no_header.csv",
  col_names = c("order_id", "order_date", "customer_id", "amount"),
  na = c("", "NA"),
  show_col_types = FALSE
)

4) Encoding problems (accented characters, smart quotes)

If you see garbled characters (e.g., "JosÃ©" instead of "José"), specify the file's encoding through locale(). UTF-8 is the modern default; Windows-1252 (a superset of Latin-1) shows up in older exports.

sales_utf8 <- read_csv(
  file = "data/customers.csv",
  locale = locale(encoding = "UTF-8"),
  show_col_types = FALSE
)

sales_win1252 <- read_csv(
  file = "data/customers_legacy.csv",
  locale = locale(encoding = "Windows-1252"),
  show_col_types = FALSE
)
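
If you are unsure which encoding a file uses, readr can estimate it from a sample of the raw bytes:

# Returns candidate encodings with confidence scores
guess_encoding("data/customers_legacy.csv")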

5) Date parsing and locale

Dates are a frequent source of silent errors. If your dates are day-month-year, set date_format in the locale, or parse explicitly after import. Prefer ISO dates (YYYY-MM-DD) whenever you control the export.

# If the file uses D/M/Y like 31/01/2024
sales_dates <- read_csv(
  file = "data/sales_dmy.csv",
  locale = locale(date_format = "%d/%m/%Y"),
  show_col_types = FALSE
)
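
Alternatively, import the date column as plain text and parse it explicitly with parse_date(). A minimal sketch, assuming the column is named order_date; adjust the name to your file.

# Alternative: read the date column as text, then parse it explicitly
sales_raw <- read_csv(
  "data/sales_dmy.csv",
  col_types = cols(order_date = col_character(), .default = col_guess()),
  show_col_types = FALSE
)

sales_dates2 <- sales_raw %>%
  mutate(order_date = parse_date(order_date, format = "%d/%m/%Y"))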

Specifying column types (preventing wrong guesses)

Why you should override type guessing

readr guesses column types from an initial sample of rows (controlled by the guess_max argument). This can misclassify columns when early rows are empty, when IDs look numeric but should stay text (leading zeros), or when currency symbols appear. Use col_types to lock in the schema.
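
To see the problem concretely, here is a small demonstration with hypothetical inline data, wrapped in I() so readr treats the string as literal file contents:

# Without col_types, "id" is guessed as numeric and leading zeros are lost
df_guess <- read_csv(I("id,amount\n00123,10\n00456,20\n"), show_col_types = FALSE)
df_guess$id   # 123, 456 (leading zeros lost)

# With an explicit schema, the IDs survive intact
df_typed <- read_csv(
  I("id,amount\n00123,10\n00456,20\n"),
  col_types = cols(id = col_character(), amount = col_double())
)
df_typed$id   # "00123", "00456"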

Common column type patterns

  • IDs: keep as text to preserve leading zeros (e.g., "00123").
  • Amounts: numeric (double) after removing currency symbols (often requires cleaning).
  • Dates: Date type with the correct format.
  • Categorical fields: character is fine; you can convert later if needed.

sales_typed <- read_csv(
  file = path_csv,
  na = c("", "NA", "-"),
  col_types = cols(
    order_id    = col_character(),
    order_date  = col_date(format = "%Y-%m-%d"),
    customer_id = col_character(),
    amount      = col_double(),
    region      = col_character()
  ),
  show_col_types = FALSE
)

Using problems() to find parsing issues

If parsing fails for some rows, readr records issues. Inspect them early to avoid downstream surprises.

problems(sales_typed)
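
It also helps to print the column specification readr actually used and compare it with what you intended:

# Show the column specification used for the import
spec(sales_typed)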

Importing Excel files with readxl

1) Listing sheets and choosing the right one

Excel workbooks often contain multiple sheets (raw data, pivot tables, notes). Always check sheet names and import the intended sheet.

excel_sheets(path_excel)

sales_feb <- read_excel(
  path = path_excel,
  sheet = "Sales",
  na = c("", "NA", "N/A", "-", "NULL")
)

2) Skipping header rows and selecting a range

Excel exports may have a title block above the table. Use skip to drop those rows, or range to import a specific cell rectangle.

# Skip first 3 rows
sales_feb2 <- read_excel(
  path = path_excel,
  sheet = 1,
  skip = 3,
  na = c("", "NA", "-")
)

# Import a specific range
sales_feb_range <- read_excel(
  path = path_excel,
  sheet = "Sales",
  range = "A1:F5000",
  na = c("", "NA", "-")
)

3) Controlling types in Excel imports

Excel columns can mix numbers and text. readxl's col_types argument forces types, one entry per column. Use "text" for IDs and "date" for true Excel dates. If dates are stored as text, import them as text and parse afterwards, as sketched below.

sales_feb_typed <- read_excel(
  path = path_excel,
  sheet = "Sales",
  col_types = c("text", "date", "text", "numeric", "text", "text"),
  na = c("", "NA", "-")
)
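
For the text-stored-dates case, a minimal sketch, assuming the second column holds order_date as D/M/Y text:

# Dates stored as text: import the column as "text", then parse explicitly
sales_feb_text <- read_excel(
  path = path_excel,
  sheet = "Sales",
  col_types = c("text", "text", "text", "numeric", "text", "text"),
  na = c("", "NA", "-")
)

sales_feb_text <- sales_feb_text %>%
  mutate(order_date = readr::parse_date(order_date, format = "%d/%m/%Y"))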

Quick sanity checks after every import

1) Row counts and obvious emptiness

Confirm you imported the expected number of rows and that key columns are not entirely missing.

nrow(sales_typed)

sales_typed %>% summarise(
  missing_order_id = sum(is.na(order_id)),
  missing_amount   = sum(is.na(amount))
)

2) Unique key checks (detect duplicates)

If order_id should uniquely identify rows, verify it. Duplicates can indicate repeated headers, appended totals, or multiple extracts merged incorrectly.

sales_typed %>% summarise(
  rows = n(),
  distinct_order_id = n_distinct(order_id)
)

# Inspect duplicates
sales_typed %>%
  count(order_id, sort = TRUE) %>%
  filter(n > 1)

3) Value ranges and plausibility checks

Look for impossible values (negative amounts, dates outside the expected period). These checks catch parsing problems (e.g., a thousands separator turning "1,234" into a missing or wrong number, or month and day swapped in dates).

sales_typed %>% summarise(
  min_date = min(order_date, na.rm = TRUE),
  max_date = max(order_date, na.rm = TRUE),
  min_amount = min(amount, na.rm = TRUE),
  max_amount = max(amount, na.rm = TRUE)
)

# Flag suspicious rows
sales_typed %>%
  filter(amount < 0 | amount > 100000)

Structured lab: import multiple files, standardize names, combine, and document assumptions

Lab scenario

You receive three monthly extracts from different sources: January as CSV, February as Excel, March as pipe-delimited text. Column names differ slightly across files (e.g., "Order ID" vs "order_id" vs "ORDERID"), and missing values are marked inconsistently. Your goal is to import all three, standardize column names, align column types, combine into one table, and run sanity checks.

Step 1: Define shared assumptions and a target schema

Write assumptions as code comments so future you (or a teammate) knows why choices were made.

# Assumptions:
# 1) order_id uniquely identifies an order across all months.
# 2) order_date represents the transaction date (not ship date).
# 3) customer_id may contain leading zeros, so keep as text.
# 4) amount is numeric in the source currency; missing markers include "", "-", "N/A", "NULL".
# 5) January dates are YYYY-MM-DD; February is true Excel date; March dates are D/M/Y.

na_markers <- c("", "NA", "N/A", "NULL", "-")

# Target schema, documented once; each import below enforces it explicitly
target_types <- cols(
  order_id    = col_character(),
  order_date  = col_date(),
  customer_id = col_character(),
  amount      = col_double(),
  region      = col_character()
)

Step 2: Import each file with explicit settings

# January (CSV, ISO dates)
sales_jan <- read_csv(
  "data/sales_2024_01.csv",
  na = na_markers,
  col_types = cols(
    order_id    = col_character(),
    order_date  = col_date(format = "%Y-%m-%d"),
    customer_id = col_character(),
    amount      = col_double(),
    region      = col_character()
  ),
  show_col_types = FALSE
)

# February (Excel)
sales_feb <- read_excel(
  "data/sales_2024_02.xlsx",
  sheet = "Sales",
  na = na_markers,
  col_types = c("text", "date", "text", "numeric", "text")
)

# March (pipe-delimited text, D/M/Y dates)
sales_mar <- read_delim(
  "data/sales_2024_03.txt",
  delim = "|",
  na = na_markers,
  locale = locale(date_format = "%d/%m/%Y"),
  col_types = cols(
    order_id    = col_character(),
    order_date  = col_date(format = "%d/%m/%Y"),
    customer_id = col_character(),
    amount      = col_double(),
    region      = col_character()
  ),
  show_col_types = FALSE
)

Step 3: Standardize column names consistently

Use janitor::clean_names() to convert names to a consistent snake_case style. Then, if needed, rename columns to match the target schema (especially if different sources use different labels).

sales_jan <- sales_jan %>% janitor::clean_names()
sales_feb <- sales_feb %>% janitor::clean_names()
sales_mar <- sales_mar %>% janitor::clean_names()

# Example: if February uses "orderid" instead of "order_id", fix it here
sales_feb <- sales_feb %>%
  rename(order_id = orderid)

Step 4: Align columns and enforce the target schema

Before combining, ensure all tables have the same columns. Add missing columns explicitly (as NA) if a source lacks them, and coerce types to match the target.

# Helper to ensure required columns exist
ensure_cols <- function(df) {
  required <- c("order_id", "order_date", "customer_id", "amount", "region")
  missing <- setdiff(required, names(df))
  if (length(missing) > 0) {
    for (m in missing) df[[m]] <- NA
  }
  df %>% select(all_of(required))
}

sales_jan <- ensure_cols(sales_jan)
sales_feb <- ensure_cols(sales_feb)
sales_mar <- ensure_cols(sales_mar)

# Enforce types with a shared helper (coercions are safe no-ops when already correct)
coerce_types <- function(df) {
  df %>% mutate(
    order_id    = as.character(order_id),
    customer_id = as.character(customer_id),
    amount      = as.double(amount),
    order_date  = as.Date(order_date)
  )
}

sales_jan <- coerce_types(sales_jan)
sales_feb <- coerce_types(sales_feb)
sales_mar <- coerce_types(sales_mar)

Step 5: Combine into a single table and add provenance

When combining, add a source column so you can trace issues back to the original file.

sales_all <- bind_rows(
  sales_jan %>% mutate(source_file = "sales_2024_01.csv"),
  sales_feb %>% mutate(source_file = "sales_2024_02.xlsx"),
  sales_mar %>% mutate(source_file = "sales_2024_03.txt")
)
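
As an aside, bind_rows() can add the provenance column for you when you pass a named list via its .id argument:

# Equivalent: pass a named list and let bind_rows() create the source column
sales_all <- bind_rows(
  list(
    "sales_2024_01.csv"  = sales_jan,
    "sales_2024_02.xlsx" = sales_feb,
    "sales_2024_03.txt"  = sales_mar
  ),
  .id = "source_file"
)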

Step 6: Validate the combined import with sanity checks

Run checks that match your assumptions: row counts per file, uniqueness of order_id, missingness, and plausible ranges.

# Row counts by source
sales_all %>% count(source_file)

# Key uniqueness overall
sales_all %>% summarise(
  rows = n(),
  distinct_order_id = n_distinct(order_id)
)

# Duplicates with source context
sales_all %>%
  add_count(order_id) %>%
  filter(n > 1) %>%
  distinct(order_id, n, source_file) %>%
  arrange(desc(n), order_id)

# Missingness summary
sales_all %>% summarise(
  missing_order_id = sum(is.na(order_id) | order_id == ""),
  missing_order_date = sum(is.na(order_date)),
  missing_amount = sum(is.na(amount))
)

# Range checks
sales_all %>% summarise(
  min_date = min(order_date, na.rm = TRUE),
  max_date = max(order_date, na.rm = TRUE),
  min_amount = min(amount, na.rm = TRUE),
  max_amount = max(amount, na.rm = TRUE)
)

# Investigate suspicious amounts
sales_all %>%
  filter(is.na(amount) | amount < 0 | amount > 100000) %>%
  select(source_file, order_id, order_date, customer_id, amount, region)

Step 7: Capture parsing issues early

If any of the text imports produced parsing problems, inspect them immediately and decide whether to adjust col_types, locale, or pre-clean the raw file.

# Only applies to readr imports (CSV/delim). If you used readr for Jan/Mar:
problems(sales_jan)
problems(sales_mar)
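
If you prefer the script to fail fast rather than continue with bad data, a simple guard (a sketch, assuming the January and March imports came through readr) is:

# Optional guard: stop the script if any parsing problems remain
stopifnot(
  nrow(problems(sales_jan)) == 0,
  nrow(problems(sales_mar)) == 0
)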

Now answer the exercise about the content:

When importing a real-world delimited text file into R, which approach best improves reliability and helps prevent silent data issues?

Reliable imports come from stating assumptions in code (e.g., delimiter, na markers, locale/encoding, col_types) and then validating with quick checks like row counts, missingness, uniqueness, and ranges.

Next chapter

Data Transformation with Tidy Principles Using dplyr

Download the app to earn free Certification and listen to the courses in the background, even with the screen off.