Working with Data Frames in R: Tibbles, Indexing, and Inspection

Chapter 2

Estimated reading time: 3 minutes


Data frames as the central structure for analysis

In R, most analysis workflows revolve around a rectangular table: rows represent observations (records) and columns represent variables (fields). This structure is called a data frame. Many operations—filtering, summarizing, joining, plotting—assume your data is in a data frame-like object.

Two common “flavors” you will encounter are base R data.frame and tibble (from the tidyverse). They behave similarly, but there are important differences that affect inspection, printing, and some edge cases.

Tibbles vs. base data frames

How they print and why it matters

Base data frames print all of their rows by default, and in R versions before 4.0 data.frame() converted character columns to factors unless you set stringsAsFactors = FALSE, so older code may still produce factors. Tibbles print a compact preview (the first ten rows) and show column types, which makes it easier to spot schema issues early.

# Example: create a base data.frame and a tibble with the same content
base_df <- data.frame(
  id = 1:3,
  name = c("Ava", "Ben", "Cara"),
  score = c(10, NA, 12)
)

library(tibble)
tbl <- tibble(
  id = 1:3,
  name = c("Ava", "Ben", "Cara"),
  score = c(10, NA, 12)
)

base_df
tbl

When you print tbl, you typically see a type summary like <int>, <chr>, <dbl>. This is a fast way to confirm whether columns were parsed as expected.
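For reference, printing tbl produces output roughly like the following (exact spacing varies by tibble version):

# A tibble: 3 × 3
     id name  score
  <int> <chr> <dbl>
1     1 Ava      10
2     2 Ben      NA
3     3 Cara     12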

Subsetting behavior: drop vs. keep

A common pitfall in base R is that selecting a single column with [, "col"] may return a vector (dropping the data frame structure). Tibbles are stricter and tend to keep data frame structure with [, which helps avoid accidental type changes in pipelines.


# Base data.frame: may drop to a vector
x1 <- base_df[, "name"]
class(x1)

# Keep as data frame explicitly
x2 <- base_df[, "name", drop = FALSE]
class(x2)

# Tibble: keeps a tibble when using [
x3 <- tbl[, "name"]
class(x3)

Rule of thumb: if you need a single column as a vector, use [[ (or $ when appropriate). If you need a one-column data frame, use [ with drop = FALSE in base R.
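For example, extracting a single column as a vector works the same way for both flavors:

# Single column as a plain vector
base_df[["name"]]     # character vector
tbl$name              # character vector
class(tbl[["name"]])  # "character", not a tibble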

Column types (schema) and common pitfalls

Core types you will see

  • numeric: integer and double (decimals)
  • character: text strings
  • logical: TRUE/FALSE
  • factor: categorical values with fixed levels (useful, but can be a source of surprises)
  • Date/POSIXct: dates and date-times

Getting types right is essential because summaries, comparisons, sorting, and modeling depend on them.
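A quick console sketch showing how each core type reports its class:

class(1L)                     # "integer"
class(1.5)                    # "numeric" (double)
class("text")                 # "character"
class(TRUE)                   # "logical"
class(factor("a"))            # "factor"
class(as.Date("2025-01-03"))  # "Date"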

Strings vs. factors

A classic issue is when text is stored as a factor. Factors look like text when printed, but internally they are integer codes with a level set. This can break operations such as concatenation or numeric conversion.

# Example: factor pitfall
x <- factor(c("10", "20", "30"))

as.numeric(x)        # wrong: returns underlying codes
as.numeric(as.character(x))  # correct: convert to character first

In modern R setups, character columns are usually kept as character by default, but you will still encounter factors in older datasets, imported files, or when someone explicitly created them.
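A short sketch of why older code behaves differently: the default for stringsAsFactors changed from TRUE to FALSE in R 4.0.

# Pre-R 4.0 default (or explicit request): text becomes a factor
old_style <- data.frame(name = c("Ava", "Ben"), stringsAsFactors = TRUE)
class(old_style$name)  # "factor"

# Modern default (R >= 4.0): text stays character
new_style <- data.frame(name = c("Ava", "Ben"))
class(new_style$name)  # "character"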

Missing values (NA) and how they affect results

Missing values in R are represented by NA. Many functions propagate NA unless you explicitly tell them to remove missing values.

Also watch for “hidden missingness” such as empty strings "", the literal text "NA", or sentinel values like -99. These are not treated as NA unless you convert them.
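A minimal sketch of both behaviors, using dplyr::na_if() to convert sentinels (the vector below is illustrative; -99 is assumed to be a known placeholder):

# NA propagates unless you remove it explicitly
v <- c(10, NA, 12)
mean(v)                # NA
mean(v, na.rm = TRUE)  # 11

# Hidden missingness: convert sentinels to real NA
library(dplyr)
raw <- c("", "NA", "-99", "7")
cleaned <- na_if(na_if(na_if(raw, ""), "NA"), "-99")
cleaned  # NA NA NA "7"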

Inspecting data frames before analysis

Before selecting columns or computing metrics, confirm what you have: dimensions, column names, types, and obvious quality issues.

Fast structure checks: glimpse() and str()

library(dplyr)

# Compact inspection (tibbles and data frames)
glimpse(tbl)

# Base R structure view
str(base_df)

Use these to quickly answer: How many rows/columns? What are the column names? Are types correct? Do any columns look suspicious (all missing, unexpected character, etc.)?

Summary statistics: summary()

summary(base_df)
summary(tbl)

summary() provides quick per-column summaries. For numeric columns it shows the minimum, quartiles, median, mean, and maximum; for character columns it reports the length and class; for factors it shows counts per level. This is often enough to spot out-of-range values or unexpected categories.

Basic counts and missingness checks

Counts help you validate expectations (e.g., number of unique IDs, distribution of categories) and identify data quality issues (e.g., duplicates, missing values).

library(dplyr)

# Row/column counts
nrow(tbl)
ncol(tbl)

# Missing values per column
tbl %>% summarise(across(everything(), ~ sum(is.na(.))))

# Unique counts per column (useful for IDs and categories)
tbl %>% summarise(across(everything(), ~ n_distinct(.)))

For categorical columns, use frequency tables.

# Base R frequency table
table(tbl$name, useNA = "ifany")

# dplyr count
tbl %>% count(name, sort = TRUE)

Selecting rows and columns reproducibly

Reproducible analysis means your code expresses selection rules explicitly (by name, position, or pattern) rather than manual clicking or ad-hoc copying. Two main approaches are bracket indexing (base R) and tidy selection (dplyr).

Bracket indexing: [ ], [[ ]], and $

Use [rows, cols] to select subsets. Rows and columns can be specified by position, name, or logical conditions.

# By position
base_df[1:2, 1:2]

# By column names
base_df[, c("id", "score")]

# Single column as a vector
base_df[["score"]]
base_df$score

Row filtering with logical conditions:

# Rows where score is not missing
base_df[!is.na(base_df$score), ]

# Rows where score >= 11 (be careful with NA)
base_df[!is.na(base_df$score) & base_df$score >= 11, ]

Common pitfall: comparisons with NA yield NA, not TRUE/FALSE. Combine conditions with !is.na() when filtering numeric columns.
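A quick illustration of the pitfall, using base_df from above:

# Comparisons with NA return NA, not TRUE/FALSE
NA >= 11             # NA
base_df$score >= 11  # FALSE NA TRUE

# Without the guard, base [ turns the NA condition into an all-NA row
base_df[base_df$score >= 11, ]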

Tidy selection patterns with dplyr: select() and filter()

With dplyr, you typically use select() for columns and filter() for rows. Tidy selection is powerful because you can select columns by name patterns, type, or helper functions.

library(dplyr)

# Select specific columns
small <- tbl %>% select(id, score)

# Select columns by name pattern
# (e.g., all columns starting with "q_")
# df %>% select(starts_with("q_"))

# Select columns by type
# (e.g., all numeric columns)
# df %>% select(where(is.numeric))

# Filter rows
passed <- tbl %>% filter(!is.na(score), score >= 11)

These patterns are especially useful when datasets evolve (new columns added) because your selection logic can remain stable.
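As a runnable version of the commented patterns above, applied to tbl (whose numeric columns are id and score):

library(dplyr)

tbl %>% select(where(is.numeric))  # keeps id and score
tbl %>% select(starts_with("s"))   # keeps score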

Safe column referencing in code

When writing functions or programmatic code, prefer explicit name-based extraction (e.g., [["col"]]) over positional extraction (e.g., [[2]]), because positions can change when columns are reordered.

# Name-based extraction is robust
get_score <- function(df) df[["score"]]

get_score(tbl)

Applied task: confirm schema and identify data quality issues

Use the dataset below as your “provided dataset.” Your task is to (1) confirm the schema (column names and types), and (2) identify data quality issues such as missing values, invalid ranges, inconsistent categories, duplicates, and suspicious placeholders.

Step 1: Create the dataset

library(tibble)

sales <- tibble(
  order_id = c(1001, 1002, 1002, 1004, 1005, 1006),
  customer = c("Ava", "Ben", "Ben", "Cara", "", "Drew"),
  region = c("North", "north", "North", "West", "West", NA),
  order_date = c("2025-01-03", "2025-01-05", "2025-01-05", "2025-02-10", "not_a_date", "2025-02-12"),
  units = c(2, 1, 1, NA, -3, 4),
  unit_price = c(9.99, 12.50, 12.50, 8.00, 10.00, NA),
  paid = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
)

Step 2: Confirm schema (names and types)

# Column names
names(sales)

# Types and a preview
library(dplyr)
glimpse(sales)

# Base R structure
str(sales)

Schema checks to perform:

  • Are identifiers (like order_id) numeric/integer and non-missing?
  • Are categorical fields (like region) consistent in spelling/case?
  • Are dates stored as proper date types (not character)?
  • Are numeric fields non-negative where appropriate?
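A minimal sketch of these checks on sales (at this point order_date is still character, so the date-type check returns FALSE):

# Identifier: numeric and non-missing?
is.numeric(sales$order_id)          # TRUE
sum(is.na(sales$order_id))          # 0

# Categorical consistency: list the distinct spellings
unique(sales$region)                # "North" "north" "West" NA

# Date type: a proper Date, or still character?
inherits(sales$order_date, "Date")  # FALSE

# Non-negative where appropriate?
any(sales$units < 0, na.rm = TRUE)  # TRUE: an invalid negative exists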

Step 3: Identify data quality issues with reproducible checks

library(dplyr)

# Missing values per column
sales %>% summarise(across(everything(), ~ sum(is.na(.))))

# Empty strings (often should be treated as missing)
sales %>% summarise(empty_customer = sum(customer == ""))

# Duplicate order IDs (should they be unique?)
sales %>% count(order_id) %>% filter(n > 1)

# Region inconsistencies (case differences)
sales %>% count(region, sort = TRUE)

# Numeric validity checks
sales %>% filter(!is.na(units) & units < 0)

# Suspicious placeholders in dates (still character here)
sales %>% filter(is.na(order_date) | order_date == "" | order_date == "not_a_date")

Step 4: Confirm date parsing and re-check types

A frequent issue is that date columns are imported as character. Convert them and then re-inspect the types. With an explicit format, invalid dates become NA after parsing, which is useful for detection.

library(dplyr)

# Supplying the format explicitly turns unparseable strings into NA
# (without it, as.Date() can error if the first element is unparseable)
sales2 <- sales %>%
  mutate(order_date = as.Date(order_date, format = "%Y-%m-%d"))

glimpse(sales2)

# Which rows failed date parsing?
sales2 %>% filter(is.na(order_date)) %>% select(order_id, order_date, customer, region)

Step 5: Write down the issues you found (as outputs)

Based on the checks above, list the issues as concrete findings tied to columns and rows. For example:

  • order_id has duplicates (1002 appears more than once).
  • customer contains an empty string that should likely be missing.
  • region has inconsistent capitalization (North vs north) and a missing value.
  • order_date contains an invalid value (not_a_date) that becomes NA after parsing.
  • units contains missing values and an invalid negative value.
  • unit_price contains missing values.

Now answer the exercise about the content:

When cleaning the sales tibble, how can you reliably detect invalid date strings in order_date?

Answer: Parsing with as.Date() converts unparseable values (e.g., placeholders) into NA. Filtering for is.na(order_date) after conversion highlights invalid date strings.

Next chapter

Importing Real-World Files into R (CSV, Excel, and Text Data)
