Excel Power Query Playbook: Repeatable Data Prep Without VBA


Power Query Mindset for Repeatable Data Preparation

Chapter 1

Estimated reading time: 13 minutes

What “repeatable” really means in Power Query

A repeatable data preparation process is one you can run again tomorrow (or next month) with new files, new rows, and small structural changes—without rewriting your work. In Power Query, repeatability comes from building transformations that are driven by rules rather than by one-time edits. Instead of manually cleaning a worksheet, you define a sequence of steps that Power Query can reapply whenever the source updates.

The mindset shift is: you are not “fixing this dataset”; you are designing a pipeline. A pipeline has inputs (sources), a set of transformations (steps), and outputs (tables loaded to Excel or the data model). When you adopt this mindset, you start making decisions that favor stability, clarity, and maintainability over quick one-off fixes.

Repeatable vs. reproducible vs. robust

These terms are related but distinct:

  • Repeatable: You can refresh and get the same logic applied to new data.

  • Reproducible: Someone else can open the workbook and refresh successfully because dependencies are clear and steps are understandable.

  • Robust: The query tolerates common variations (extra columns, different ordering, occasional blanks, unexpected casing) without breaking.

Power Query supports all three, but only if you design for them. That design is mostly about how you think before you click.

Think in stages: Source → Shape → Standardize → Validate → Output

A practical way to build repeatable queries is to separate your work into stages. You can do this within one query (with clearly named steps) or across multiple queries (staging queries feeding final queries). The key is to avoid mixing concerns.

1) Source: isolate how data is obtained

Keep source steps minimal and stable. If you are pulling from a folder, a SharePoint site, a database, or a workbook, treat that connection as a boundary. Avoid doing heavy transformations in the same breath as connecting, because it becomes harder to troubleshoot and to reuse.

Mindset rule: source steps should answer “where is the data?” not “how should it look?”

2) Shape: get the data into a usable table structure

Shaping includes actions like promoting headers, removing top rows, splitting a column, unpivoting, or expanding nested tables. This stage is about making the dataset tabular and consistent in structure.

Mindset rule: shape for analysis, not for appearance. Power Query is not about formatting; it is about structure.

3) Standardize: enforce naming, types, and business rules

Standardization includes renaming columns to canonical names, setting data types, trimming/cleaning text, converting date formats, and mapping codes to categories.

Mindset rule: standardize early enough to prevent errors, but late enough to avoid fragile references. For example, if your source sometimes changes column order, renaming by position is fragile; renaming by column name is safer.
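
As a minimal sketch of that contrast (the "Amt" header is illustrative, not from any specific source), renaming by name with MissingField.Ignore keeps working when a file already uses the canonical header, while renaming by position breaks as soon as a column moves:

// Fragile: assumes the amount field is always the third column
// Renamed = Table.RenameColumns(Source, {{Table.ColumnNames(Source){2}, "Amount"}}),

// Safer: rename by name; MissingField.Ignore tolerates files that already say "Amount"
Renamed = Table.RenameColumns(Source, {{"Amt", "Amount"}}, MissingField.Ignore)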

4) Validate: detect bad inputs before they poison outputs

Validation is the habit of adding checks: filtering out null keys, ensuring numeric columns are numeric, removing error rows, or flagging anomalies. In Power Query, validation can be explicit (separate “error rows” query) or implicit (types and merges that fail reveal issues).

Mindset rule: prefer visible, intentional validation over silent failure. If something is wrong, you want to know where and why.

5) Output: load only what you need, in the shape you need

Output is where you decide what to load to Excel, what to keep as connection-only, and what to publish to the data model. This stage is also where you decide whether to create multiple outputs from one staged dataset.

Mindset rule: minimize duplication. If two reports need the same cleaned dataset, create one cleaned query and reference it.

Design for change: assume the source will evolve

Most “broken refresh” problems happen because the query assumed the source would never change. In real work, sources change constantly: new columns appear, columns get renamed, files arrive with slightly different headers, or a vendor adds a note row at the top.

Common changes to anticipate

  • New columns added to the right (or inserted in the middle).

  • Column order changes.

  • Header text changes slightly (extra spaces, different casing, “Amt” vs “Amount”).

  • Mixed data types in a column (numbers stored as text, occasional “N/A”).

  • Extra rows at the top (titles, disclaimers) or bottom (totals).

  • Missing files in a folder refresh, or a file with a different schema.

The mindset is to treat these as normal, not exceptional. Then you build steps that are resilient: selecting columns by name, using “remove other columns” carefully, using try/otherwise for conversions, and validating assumptions.
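
For example, a rule can absorb the "note row at the top" problem: instead of removing a fixed number of rows, skip rows until the header row appears. A sketch, assuming a preceding Source step that returns the raw rows and a header row recognizable by its first cell (the "InvoiceID" literal is illustrative):

let
    Raw = Source,                                    // raw rows, headers not yet promoted
    FirstColumn = Table.ColumnNames(Raw){0},
    // Rule: data starts at the row whose first cell reads "InvoiceID", wherever that row is
    Trimmed = Table.Skip(Raw, each Record.Field(_, FirstColumn) <> "InvoiceID"),
    Promoted = Table.PromoteHeaders(Trimmed, [PromoteAllScalars = true])
in
    Promoted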

Prefer rules over manual edits

Repeatability comes from expressing your intent as rules. If you find yourself thinking “I’ll just fix this one cell,” that’s a signal to step back and ask: what rule would fix this every time?

Examples of rule-based thinking

  • Instead of manually deleting blank rows: filter out rows where a key column is null or empty.

  • Instead of manually splitting a column for this month’s file: split by delimiter (or by position) as a step.

  • Instead of manually correcting casing: apply Text.Proper/Text.Upper consistently.

  • Instead of manually mapping codes: create a mapping table and merge.

Rule-based steps are also easier to explain to others, which improves reproducibility.
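
A short sketch of the same ideas expressed as steps (the column names, the Cleaned step, and the CategoryMap query are illustrative assumptions):

let
    // Rule: drop rows with a blank name instead of deleting them by hand
    NonBlank = Table.SelectRows(Cleaned, each [CustomerName] <> null and Text.Trim([CustomerName]) <> ""),
    // Rule: normalize casing on every refresh instead of retyping values
    Cased = Table.TransformColumns(NonBlank, {{"CustomerName", Text.Proper, type text}}),
    // Rule: map codes to categories through a mapping table instead of manual edits
    Joined = Table.NestedJoin(Cased, {"CategoryCode"}, CategoryMap, {"Code"}, "Map", JoinKind.LeftOuter),
    Mapped = Table.ExpandTableColumn(Joined, "Map", {"CategoryName"})
in
    Mapped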

Build queries like you build spreadsheets: readable, named, and modular

Power Query steps are code, even if you create them through the UI. Treat them with the same care you would give to formulas in a shared workbook.

Make steps readable

Rename steps to reflect intent. “Filtered Rows” is less helpful than “Keep Active Customers” or “Remove Blank Invoice IDs”. When a refresh fails, readable steps make troubleshooting faster.

Use modular queries (staging + final)

A common repeatable pattern is:

  • Staging query: connect and shape to a clean, canonical table. Set to “Connection only”.

  • Final query: reference the staging query and apply report-specific filters, joins, and calculations. Load to worksheet or data model.

This modular approach reduces duplication and makes changes safer: if the source changes, you fix the staging query once and all dependent outputs update.
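
A sketch of the final-query side of the pattern (the Staging_Sales query and the column names are illustrative): the staging query does the cleanup once, and the final query only adds report-specific logic.

// Final query "Sales_Dashboard": references the staging query instead of repeating its steps
let
    Source = Staging_Sales,
    // Report-specific filter: current year only
    CurrentYear = Table.SelectRows(Source, each Date.Year([InvoiceDate]) = Date.Year(DateTime.LocalNow())),
    // Report-specific aggregation
    ByCustomer = Table.Group(CurrentYear, {"CustomerID"}, {{"TotalQuantity", each List.Sum([Quantity]), type number}})
in
    ByCustomer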

Practical step-by-step: turn a one-off cleanup into a repeatable pipeline

This walkthrough focuses on the mindset decisions more than on the specific buttons. Imagine you receive a monthly “Sales Export” file that often has extra columns and occasional text in numeric fields.

Goal

Create a repeatable query that loads a clean Sales table with consistent column names and types, and that flags problematic rows rather than silently failing.

Step 1: Define the contract (what the output must look like)

Before importing, write down the output contract:

  • Required columns: InvoiceID, InvoiceDate, CustomerID, ProductID, Quantity, UnitPrice

  • Optional columns: SalesRep, Region (keep if present)

  • Types: InvoiceDate = date, Quantity = whole number, UnitPrice = decimal number

  • Rule: rows without InvoiceID are invalid and should be excluded or flagged

This contract guides every decision. Without it, you tend to accept whatever the source gives you and patch problems later.

Step 2: Create a staging query and keep source steps minimal

Import the file (or folder) into Power Query. In the staging query, do only what is necessary to get a table: choose the correct sheet/table, remove obvious non-data top rows if needed, and promote headers if appropriate.

Mindset check: if you are already filtering to “this month only” in the staging query, you are mixing output concerns into the source stage. Keep staging general.
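
A minimal staging query might look like the sketch below (the file path and sheet name are placeholders). Note that it stops as soon as a tabular result exists:

// Staging query: connect, navigate, promote headers, and nothing else
let
    Source = Excel.Workbook(File.Contents("C:\Data\Sales Export.xlsx"), null, true),
    SalesSheet = Source{[Item = "Sales", Kind = "Sheet"]}[Data],
    Promoted = Table.PromoteHeaders(SalesSheet, [PromoteAllScalars = true])
in
    Promoted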

Step 3: Standardize column names to canonical names

Rename columns to match your contract. If the source sometimes uses “Invoice Id” and sometimes “InvoiceID”, standardize to one name. If you expect header variations, consider a mapping approach: a small table that lists possible source names and the canonical name, then apply renaming based on that mapping.

Even without a full mapping table, the mindset is: choose canonical names once and stick to them. This makes downstream merges and measures stable.
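
If you do adopt a mapping approach, the rename step can be driven by the mapping table itself. A sketch, assuming a hypothetical two-column query named HeaderMap with SourceName and CanonicalName columns:

let
    // Build {old, new} pairs from the mapping table
    RenamePairs = List.Transform(Table.ToRecords(HeaderMap), each {[SourceName], [CanonicalName]}),
    // Apply only the renames whose source name actually exists in this file
    Renamed = Table.RenameColumns(Source, RenamePairs, MissingField.Ignore)
in
    Renamed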

Step 4: Select required columns defensively

When you select columns, you are encoding assumptions. If you use “Choose Columns” and remove everything else, the query may break when a required column is missing. If you keep everything, downstream steps may become ambiguous.

A robust approach is:

  • Ensure required columns exist (or fail with a clear error).

  • Keep optional columns only if present.

In M, you can express this defensively by intersecting desired columns with existing columns.

let
    Source = ...,
    ExistingCols = Table.ColumnNames(Source),
    Required = {"InvoiceID","InvoiceDate","CustomerID","ProductID","Quantity","UnitPrice"},
    Optional = {"SalesRep","Region"},
    Keep = List.Intersect({ExistingCols, List.Combine({Required, Optional})}),
    Selected = Table.SelectColumns(Source, Keep, MissingField.Ignore)
in
    Selected

This pattern supports change: if SalesRep disappears, the query still refreshes; if a required column disappears, you can add an explicit check to stop the refresh with a meaningful message.

Step 5: Clean and type with “try/otherwise” where needed

Type conversion is a frequent failure point when sources contain “N/A”, blanks, or stray text. Instead of letting the refresh fail, decide how to handle exceptions: convert invalid values to null, or route them to an error table.

Example: safely convert Quantity and UnitPrice.

let
    WithQty = Table.TransformColumns(Selected, {
        {"Quantity", each try Number.From(_) otherwise null, Int64.Type},
        {"UnitPrice", each try Number.From(_) otherwise null, type number}
    })
in
    WithQty

Mindset rule: make failure modes explicit. If you convert invalid numbers to null, you should validate later so you don’t accidentally treat null as zero.
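
One way to keep that failure mode visible is to flag the rows whose conversion fell back to null, so they surface for review instead of quietly dropping out of totals (the flag column name is illustrative):

// Flag rows where numeric conversion produced null
Flagged = Table.AddColumn(WithQty, "HasConversionIssue",
    each [Quantity] = null or [UnitPrice] = null, type logical)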

Step 6: Validate keys and critical fields

Add a step that checks your key fields. For example, InvoiceID must not be null or empty. You can filter them out for the main output and also capture them in a separate “Rejected Rows” query for review.

Main output filter:

let
    Valid = Table.SelectRows(WithQty, each [InvoiceID] <> null and Text.Trim(Text.From([InvoiceID])) <> "")
in
    Valid

Rejected rows query (reference the staging query and keep only invalid rows):

let
    Rejected = Table.SelectRows(WithQty, each [InvoiceID] = null or Text.Trim(Text.From([InvoiceID])) = "")
in
    Rejected

This is a mindset upgrade: you stop thinking “bad rows are annoying” and start thinking “bad rows are a managed output of the pipeline.”

Step 7: Output queries: load clean data, keep staging connection-only

Set the staging query to “Connection only” and load only the final clean Sales table (and optionally the Rejected Rows table). This keeps your workbook lighter and reduces accidental dependencies on intermediate steps.

Prefer reference over duplicate

When you need a second version of a dataset (for example, Sales for the dashboard and Sales for a reconciliation report), avoid duplicating the query and making separate edits. Duplicates drift over time and break repeatability because fixes must be applied in multiple places.

Instead, create one canonical cleaned query and then create references:

  • Clean_Sales (staging or canonical output)

  • Sales_Dashboard (reference Clean_Sales, apply dashboard-specific filters)

  • Sales_Recon (reference Clean_Sales, apply reconciliation logic)

This is the same mindset as using helper tables and named ranges in Excel: one source of truth, multiple views.

Make transformations stable: avoid fragile step patterns

Some transformations are inherently more fragile than others. The mindset is not “never use them,” but “use them knowingly and protect them.”

Fragility pattern: referencing columns by position

If you remove columns by index or rely on “first column is X,” your query will break when the source inserts a new column. Prefer selecting columns by name.

Fragility pattern: expanding nested tables without controlling column names

When expanding a merged table, explicitly choose which columns to expand and consider prefixing to avoid collisions. If you expand “all columns,” a new column in the lookup table can unexpectedly appear in your output and change downstream behavior.
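
A sketch of a controlled expansion (the Merged step and the "CustomerLookup" merge column are illustrative): listing the columns and prefixing them keeps a growing lookup table from leaking new columns into the output.

// Expand only the columns you need, with a prefix to avoid name collisions
Expanded = Table.ExpandTableColumn(Merged, "CustomerLookup",
    {"CustomerName", "Segment"},
    {"Customer.CustomerName", "Customer.Segment"})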

Fragility pattern: automatic type detection on changing sources

Power Query sometimes adds an automatic “Changed Type” step. If the source changes, this step can fail or mis-type columns. A repeatable mindset is to control types intentionally and to place type steps after you have stabilized column names and removed noise.
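
A sketch of intentional typing that follows the contract from earlier (the Standardized step name is an assumption), placed after column names are canonical and noise rows are removed:

// Set types explicitly, after names are stable; Int64.Type is the whole-number type
Typed = Table.TransformColumnTypes(Standardized, {
    {"InvoiceDate", type date},
    {"Quantity", Int64.Type},
    {"UnitPrice", type number}
})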

Practical step-by-step: build a “schema guard” to catch structural changes

When you rely on a consistent schema, it’s useful to add a guard step that checks whether required columns exist and stops with a clear message if not. This prevents confusing downstream errors.

Step 1: Define required columns list

Required = {"InvoiceID","InvoiceDate","CustomerID","ProductID","Quantity","UnitPrice"}

Step 2: Compare to existing columns

Existing = Table.ColumnNames(Selected),
Missing = List.Difference(Required, Existing)

Step 3: Raise a clear error if missing

Checked = if List.Count(Missing) > 0 then error "Missing required columns: " & Text.Combine(Missing, ", ") else Selected

Use this guard early in the query, right after you have promoted headers and standardized names. The mindset benefit is large: when a vendor changes a header, you get a direct message instead of a cryptic “column not found” later.

Document intent inside the query

Repeatability improves when future-you (or a colleague) can understand why a step exists. Power Query supports comments in M, and you can also encode intent in step names.

Example of lightweight documentation in M:

let
    // Canonicalize column names so downstream merges are stable
    Renamed = Table.RenameColumns(Source, {{"Invoice Id","InvoiceID"}}, MissingField.Ignore),
    // Guard against schema drift (required columns must exist)
    Existing = Table.ColumnNames(Renamed),
    Missing = List.Difference({"InvoiceID","InvoiceDate"}, Existing),
    Checked = if List.Count(Missing) > 0 then error "Missing required columns: " & Text.Combine(Missing, ", ") else Renamed
in
    Checked

The goal is not to write essays in comments; it is to capture the “why” behind non-obvious steps.

Adopt a testing habit: refresh with “bad” data on purpose

A repeatable pipeline is one that behaves predictably under stress. Create a small set of test files (or copies of real files) that simulate common issues:

  • A file with an extra column.

  • A file with a missing optional column.

  • A file with “N/A” in numeric fields.

  • A file with a renamed header (“Invoice Date” vs “InvoiceDate”).

Refresh against these tests and observe where the query breaks. Then adjust steps to be more defensive (schema guard, MissingField.Ignore, try/otherwise, explicit selection). This is a mindset shift from reactive to proactive: you are building a process, so you test it like a process.

Choose where to be strict and where to be flexible

Not every part of your pipeline should be tolerant. Repeatability is not the same as “always refresh no matter what.” Sometimes you want the refresh to fail loudly when critical assumptions are violated.

Be strict on

  • Primary keys and required identifiers (InvoiceID, CustomerID).

  • Date fields that drive reporting periods.

  • Grain of the table (one row per invoice line, one row per customer, etc.).

Be flexible on

  • Optional descriptive columns (Region, SalesRep).

  • Extra columns you don’t use.

  • Minor text inconsistencies that can be normalized (case, whitespace).

This strict-vs-flexible decision is the core of the Power Query mindset: you are defining a contract and deciding what constitutes acceptable input.
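
A sketch of being strict where it matters: fail the refresh loudly if the expected grain is violated (treating InvoiceID plus ProductID as the invoice-line grain is an assumption for illustration).

// Strict check: duplicate invoice lines should stop the refresh, not slip into totals
RowCount = Table.RowCount(Valid),
DistinctCount = Table.RowCount(Table.Distinct(Valid, {"InvoiceID", "ProductID"})),
GrainChecked = if RowCount <> DistinctCount
    then error "Duplicate invoice lines detected: expected one row per InvoiceID + ProductID"
    else Valid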

Now answer the exercise about the content:

Which approach best supports a repeatable Power Query pipeline when the source files may add extra columns or slightly change headers?

Repeatable queries are rule-based pipelines designed for change: isolate source steps, standardize names, handle required/optional columns defensively, and validate assumptions (e.g., schema guard) so refreshes stay stable and failures are clear.

Next chapter

Importing Data from Folders and Automating File Intake
