
Software Testing Foundations: From Requirements to Defects


Defects and Failures: Recognizing and Isolating Problems

Chapter 8

Estimated reading time: 12 minutes


Defect vs. Failure vs. Error: Using Precise Language

When a product “does something wrong,” people often use one word for everything. In testing, separating terms helps you communicate clearly, choose the right investigation steps, and avoid blaming the wrong component.

  • Error (mistake): A human action that produces an incorrect result. Example: a developer misunderstands a rounding rule and implements it incorrectly; a tester configures the environment with the wrong currency.
  • Defect (bug, fault): A problem in a work product (code, configuration, data, script, requirement, build pipeline) that can cause incorrect behavior. Example: a function uses integer division instead of decimal division.
  • Failure: The observable incorrect behavior when the software runs. Example: the total price displayed is $19 instead of $19.99, or the app crashes when tapping “Pay.”

A useful mental model: an error introduces a defect, and under certain conditions the defect triggers a failure. Not every defect causes a failure in every situation. Some defects remain dormant until a specific input, timing, locale, permission set, or integration response occurs.
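
To make the chain concrete, here is a minimal Python sketch of the truncation defect behind the “$19 instead of $19.99” failure described above (the function name and amounts are illustrative): the defect stays dormant for whole-dollar prices and only produces a visible failure for other inputs.

    # The developer's error (misreading the rounding rule) introduced this defect:
    # integer division truncates the cents instead of keeping two decimals.
    def display_total(price_cents):
        return f"${price_cents // 100}"   # defect: cents are silently dropped

    # The defect stays dormant for whole-dollar amounts...
    print(display_total(1900))   # "$19"  -- happens to be correct
    # ...and triggers a visible failure only for other inputs.
    print(display_total(1999))   # "$19"  -- failure: expected "$19.99"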

Why the distinction matters in daily testing

  • Reporting: You usually report a failure (what you observed) and help locate the defect (where it likely lives). You rarely report “an error” unless you are analyzing process issues.
  • Reproducibility: A failure can be reproduced; a defect can be fixed; an error can be prevented through reviews, checklists, or training.
  • Ownership: Failures are cross-cutting; defects may be in code, configuration, or data; errors may be in any role. Precise language reduces unproductive debates.

Recognizing Failures: What to Look For

Recognizing failures is not only about “it crashed.” Many failures are subtle and require you to compare the actual behavior to an expected behavior, a business rule, or a consistency principle.

Common failure patterns

  • Incorrect result: wrong totals, wrong status, wrong formatting, wrong rounding, wrong sorting.
  • Missing result: data not saved, notification not sent, record not created, export file empty.
  • Unexpected extra result: duplicate charges, duplicate emails, repeated retries that create duplicates.
  • Timing-related: slow response, timeouts, race conditions, stale data after refresh.
  • State-related: works in a fresh session but fails after navigation, fails after logout/login, fails after switching accounts.
  • Permission and role issues: user sees data they should not, or cannot perform allowed actions.
  • Integration symptoms: partial updates, inconsistent data across screens, errors only when a downstream service is slow.
  • UI/UX functional failures: button disabled when it should be enabled, wrong default selection, validation message incorrect or missing.

Signals that a defect may exist even without an obvious failure

  • Inconsistency: two screens show different totals for the same order.
  • Non-determinism: the same steps sometimes pass and sometimes fail.
  • Suspicious logs/alerts: repeated warnings, retries, or “handled exceptions” that correlate with user actions.
  • Data smells: nulls in fields that should be populated, impossible timestamps, negative quantities where not allowed.
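
If you can export or query the data, a quick scan for these smells can reveal a dormant defect before any user-visible failure. A minimal Python sketch, assuming the records are available as dictionaries (field names and values are illustrative):

    from datetime import datetime, timezone

    records = [
        {"id": 1, "customer_name": "Ada", "quantity": 2, "created_at": "2025-01-10T09:30:00+00:00"},
        {"id": 2, "customer_name": None, "quantity": -1, "created_at": "2031-07-01T00:00:00+00:00"},
    ]

    now = datetime.now(timezone.utc)
    for r in records:
        if r["customer_name"] is None:
            print(f"record {r['id']}: null in a field that should be populated")
        if r["quantity"] < 0:
            print(f"record {r['id']}: negative quantity where not allowed")
        if datetime.fromisoformat(r["created_at"]) > now:
            print(f"record {r['id']}: impossible (future) timestamp")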

From Symptom to Suspect: Isolating the Problem

Isolation is the skill of narrowing a broad symptom (“checkout failed”) into a small, testable hypothesis (“fails only when shipping address contains non-ASCII characters, due to encoding in the address validation call”). The goal is not to guess; it is to reduce uncertainty with controlled experiments.

A practical isolation workflow

Use the following workflow whenever you encounter a failure. It is designed to be repeatable and to produce actionable defect reports.


Step 1: Capture the failure precisely

Write down what you can observe without interpretation. Include the exact message, the exact screen, and the exact time. If possible, capture evidence.

  • What did you do (high-level)?
  • What did you expect?
  • What happened instead?
  • What evidence do you have (screenshot, video, console output, API response, log snippet)?

Example observation: “After clicking ‘Pay’ on Order #10492, the spinner shows for ~20 seconds and then an error banner appears: ‘Payment could not be processed.’ The order remains in ‘Pending Payment’ state.”

Step 2: Reproduce once, then aim for a stable reproduction

Re-run the same steps. If it reproduces, you have a starting point. If it does not, treat it as a non-deterministic issue and immediately start recording variables (timing, environment, data, network conditions).

  • If it reproduces every time: proceed to minimization.
  • If it reproduces sometimes: proceed to stabilization (control variables, add instrumentation, repeat with consistent data).
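
A minimal Python sketch of this re-run discipline, assuming the failing action can be driven through an HTTP endpoint (the URL and payload are placeholders): repeat the identical request and count outcomes before deciding which branch you are in.

    import requests

    ATTEMPTS = 10
    failures = 0
    for i in range(ATTEMPTS):
        # Placeholder endpoint standing in for the action that failed in the UI.
        resp = requests.post("https://staging.example.com/api/payments",
                             json={"orderId": 10492}, timeout=30)
        if resp.status_code != 200:
            failures += 1
            print(f"attempt {i + 1}: HTTP {resp.status_code} -> {resp.text[:120]}")

    print(f"{failures}/{ATTEMPTS} attempts failed")
    # 10/10: stable reproduction, proceed to minimization.
    # 1-9/10: intermittent, start controlling variables and recording timing.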

Step 3: Minimize the reproduction steps

Remove steps that are not necessary to trigger the failure. Minimization makes the defect easier to understand and reduces time wasted during triage.

  • Can you start from a simpler state (fresh session, new user, empty cart)?
  • Can you remove optional inputs?
  • Can you trigger it with one item instead of many?
  • Can you reproduce via API call instead of UI steps?

Example minimization: “The failure occurs even with a single item in the cart and no discount code. It does not require adding/removing items multiple times.”
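
If the failure can be triggered through an API, minimization can be partly scripted: start from the full failing request and drop one optional field at a time, keeping each removal only if the failure persists. A Python sketch with a placeholder endpoint and payload:

    import requests

    URL = "https://staging.example.com/api/checkout"   # placeholder
    full_payload = {
        "items": [{"sku": "SKU-1001", "qty": 1}],
        "discountCode": "WELCOME10",
        "giftWrap": True,
        "note": "Leave at the door",
    }

    def still_fails(payload):
        resp = requests.post(URL, json=payload, timeout=30)
        return resp.status_code != 200

    # Try removing each optional field in turn; keep the removal if the failure persists.
    minimal = dict(full_payload)
    for field in ["discountCode", "giftWrap", "note"]:
        candidate = {k: v for k, v in minimal.items() if k != field}
        if still_fails(candidate):
            minimal = candidate

    print("minimal failing payload:", minimal)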

Step 4: Identify the boundary conditions (A/B comparisons)

Isolation improves dramatically when you compare a failing case (A) to a passing case (B) that is as similar as possible. Change one variable at a time.

  • Different user role: admin vs. standard
  • Different data: short name vs. long name; ASCII vs. emoji; zero vs. non-zero
  • Different locale: currency, decimal separator, time zone
  • Different device/browser: Chrome vs. Safari; mobile vs. desktop
  • Different environment: staging vs. production-like

Write down the single variable you changed and the outcome. This creates a map of conditions that trigger the defect.
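
A lightweight way to keep that map honest is to script the comparisons so each run changes exactly one variable relative to the passing baseline. A Python sketch in which run_case is a placeholder for driving the real UI or API:

    baseline = {"role": "standard", "name": "Ann", "locale": "en-US", "browser": "Chrome"}

    # Each experiment changes exactly one variable relative to the baseline.
    experiments = [
        {"role": "admin"},
        {"name": "José"},       # non-ASCII name
        {"locale": "fr-FR"},    # comma decimal separator
        {"browser": "Safari"},
    ]

    def run_case(case):
        # Placeholder: drive the real UI or API with these settings and
        # return True if the failure is observed.
        return False

    for change in experiments:
        case = {**baseline, **change}
        outcome = "FAIL" if run_case(case) else "PASS"
        print(f"changed {list(change)[0]!r:10} -> {outcome}  ({case})")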

Step 5: Determine the failure scope and impact

Before you dive deeper, quickly assess how broad the issue is. This helps prioritize investigation and informs severity.

  • Is it limited to one user, one account, one tenant, one region?
  • Is it limited to one feature path or multiple?
  • Does it corrupt data or only block an action?
  • Is there a workaround?

Step 6: Localize the likely layer (UI, API, service, data, config)

You often cannot see the defect directly, but you can infer where it likely resides by checking intermediate outputs.

  • UI layer: validation rules, disabled controls, incorrect formatting, client-side calculations.
  • API layer: request/response mismatch, status codes, schema issues, authorization failures.
  • Service/business logic: incorrect rules, state transitions, concurrency issues.
  • Data layer: missing records, constraints, migrations, incorrect defaults.
  • Configuration: feature flags, environment variables, third-party keys, timeouts.

Practical approach: if the UI shows an error, check whether the API call failed. If the API succeeded but the UI shows wrong data, suspect UI rendering or caching. If the API fails with a 500, suspect service logic or a downstream dependency.
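
A Python sketch of that practical approach, assuming you can replay the call the UI makes (the endpoint and fields are placeholders): the status code and body tell you which layer to suspect next.

    import requests

    # Placeholder endpoint standing in for the call the UI makes.
    resp = requests.get("https://staging.example.com/api/orders/10492", timeout=30)

    if resp.status_code >= 500:
        print("Suspect service logic or a downstream dependency:", resp.status_code)
    elif resp.status_code >= 400:
        print("Suspect the request payload, validation rules, or authorization:",
              resp.status_code, resp.text[:200])
    else:
        # The API succeeded: compare its data with what the UI renders.
        data = resp.json()
        print("API says total =", data.get("total"))
        print("If the UI shows a different total, suspect rendering, caching, or mapping.")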

Step 7: Form a testable hypothesis and validate it

Convert your observations into a hypothesis you can test with a small experiment.

Example hypothesis: “Payment fails only when the billing address contains a non-ASCII character because the address validation service rejects UTF-8 characters and the application does not handle that error.”

Validation experiments:

  • Try a billing address with only ASCII characters (pass?)
  • Try a billing address with “é” (fail?)
  • Try an emoji (fail?)
  • Inspect the API response body for validation errors
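
The same experiments can be scripted so each run differs only in the address. A Python sketch with a placeholder endpoint and payload:

    import requests

    URL = "https://staging.example.com/api/payments"   # placeholder
    addresses = {
        "ascii only":   "12 Rue de lEglise",
        "accented 'é'": "12 Rue de l'Église",
        "emoji":        "12 Rue 🏠",
    }

    for label, line1 in addresses.items():
        resp = requests.post(URL, json={"orderId": 10492,
                                        "billingAddress": {"line1": line1}},
                             timeout=30)
        print(f"{label:14} -> HTTP {resp.status_code}  {resp.text[:100]}")
    # If only the non-ASCII cases return the validation error, the hypothesis
    # gains support; inspect the response body for the exact rejection reason.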

Techniques for Isolating Root Causes (Without Access to Code)

Testers often work without direct code access. You can still isolate effectively using observable behavior, tooling, and structured experiments.

Binary search on changes (when a regression is suspected)

If a feature worked yesterday but fails today, the defect may be linked to a recent change. If you have access to builds or deployments, you can narrow down the introduction point by testing earlier versions.

  • Find the last known good build and the first known bad build.
  • Test the midpoint build.
  • Repeat until you narrow to a small set of changes.

This is especially effective when releases are frequent and changes are numerous.
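
The same idea as a minimal Python sketch, assuming builds are ordered from oldest to newest and you have some way (manual here) to decide whether a given build fails:

    builds = ["2.18.0", "2.18.1", "2.18.2", "2.18.3", "2.18.4"]   # oldest -> newest

    def build_fails(version):
        # Placeholder: deploy or install this build and re-run the minimized repro.
        return input(f"Does {version} fail? [y/n] ").strip().lower() == "y"

    low, high = 0, len(builds) - 1   # low = last known good, high = first known bad
    while high - low > 1:
        mid = (low + high) // 2
        if build_fails(builds[mid]):
            high = mid               # defect introduced at or before mid
        else:
            low = mid                # mid is still good
    print(f"defect introduced between {builds[low]} (good) and {builds[high]} (bad)")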

State reset and controlled environment

Many failures are caused by hidden state: cookies, local storage, cached data, feature flags, or server-side session state.

  • Use an incognito/private window to remove client state.
  • Clear cookies/local storage for the app domain.
  • Use a new test user to remove account state.
  • Repeat in a clean environment (fresh container, new emulator).

If the failure disappears after a reset, you have a strong clue that state is involved.

Data slicing: isolate by dataset

When failures involve “some records,” isolate by identifying what is special about failing data.

  • Compare a failing record to a passing record field-by-field.
  • Look for extremes: very long strings, nulls, special characters, boundary dates.
  • Check relationships: missing foreign keys, deleted referenced entities.

Example: “Invoices fail to export only when the customer name exceeds 80 characters.”
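
A field-by-field comparison is easy to script once both records can be exported, for example as dictionaries. A Python sketch with illustrative field names:

    passing = {"id": 101, "customer_name": "Ann Lee", "country": "US", "total": 25.00}
    failing = {"id": 102, "customer_name": "A" * 95,  "country": "US", "total": 25.00}

    for field in sorted(set(passing) | set(failing)):
        a, b = passing.get(field), failing.get(field)
        if a != b:
            print(f"{field}: passing={a!r} vs failing={b!r}")
    # Here the only difference is an unusually long customer_name, which points
    # the investigation at length limits (such as the 80-character export rule).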

Time and concurrency experiments

Intermittent failures often involve timing, retries, or concurrency.

  • Repeat the same action 20 times and record pass/fail counts.
  • Try with slow network throttling to see if timeouts trigger.
  • Perform two actions simultaneously (two browsers) to detect race conditions.

Even without code, you can produce a reproducible pattern: “Fails 30% of the time when two users approve the same request within 2 seconds.”
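
A Python sketch of the “two actions at the same time” experiment, using two threads against a placeholder approval endpoint:

    import threading
    import requests

    URL = "https://staging.example.com/api/requests/555/approve"   # placeholder
    results = []

    def approve(user):
        resp = requests.post(URL, headers={"X-User": user}, timeout=30)
        results.append((user, resp.status_code))

    threads = [threading.Thread(target=approve, args=(u,)) for u in ("alice", "bob")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(results)
    # Two 200s for the same request is a strong race-condition signal;
    # repeat the pair several times and record how often it happens.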

Observability basics: what to capture

When available, use system outputs to narrow the defect location. You do not need deep logging expertise; you need consistent identifiers and timestamps.

  • Correlation IDs / request IDs: capture from response headers or UI error dialogs if shown.
  • Timestamps: note local time and time zone; align with server logs.
  • Network traces: browser devtools HAR export, including request/response status codes.
  • Console logs: JavaScript errors, stack traces, failed resource loads.
  • API responses: error codes and messages; schema validation errors.

When you attach evidence, ensure it is minimal but sufficient: include the failing request and the response, not the entire browsing session.
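
A HAR export is plain JSON, so trimming it down to the failing request can be scripted. A Python sketch (the file name is a placeholder):

    import json

    with open("checkout.har", encoding="utf-8") as f:
        har = json.load(f)

    for entry in har["log"]["entries"]:
        status = entry["response"]["status"]
        if status >= 400:
            req = entry["request"]
            print(entry["startedDateTime"], req["method"], req["url"], "->", status)
            # Attach only these entries (request plus response) to the defect
            # report, not the whole browsing session.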

Writing a Defect Report That Helps Isolation

A good defect report is an isolation artifact: it lets someone else reproduce quickly and see the same evidence you saw. It also prevents “cannot reproduce” outcomes by documenting variables.

Essential fields and what “good” looks like

  • Title: include symptom + condition. Example: “Payment fails with ‘could not be processed’ when billing address contains non-ASCII characters.”
  • Environment: build/version, environment name, browser/device, OS, account/role.
  • Preconditions: required feature flags, test data, account state.
  • Steps to reproduce: minimized, numbered, unambiguous.
  • Expected vs. actual: specific, observable.
  • Evidence: screenshot/video, request/response, logs, correlation ID.
  • Frequency: always/sometimes; include pass/fail ratio if intermittent.
  • Scope notes: what you tried that did not fail (useful A/B comparisons).

Example defect report (condensed)

Title: Checkout payment fails when billing address contains 'é' (non-ASCII)
Environment: Staging v2.18.4, Chrome 121, Windows 11, user role: Standard
Preconditions: Feature flag PAYMENTS_V3 enabled
Steps:
  1) Login as user test_user_17
  2) Add item SKU-1001 to cart
  3) Go to Checkout
  4) Set billing address line1 = "12 Rue de l'Église"
  5) Click Pay
Expected: Payment succeeds; order moves to Paid
Actual: Error banner "Payment could not be processed"; order stays Pending Payment
Evidence: HAR attached; POST /api/payments returns 400 with body {"error":"INVALID_ADDRESS"}; requestId=9f2c...
Frequency: 5/5 repro
Notes: Same steps pass with "12 Rue de lEglise" (ASCII only)

This report does not claim the root cause (“encoding bug”) as a fact. It provides a strong isolation clue (non-ASCII input) and concrete evidence (400 response) that helps the team locate the defect.

Severity, Priority, and Triage: Classifying the Problem Correctly

Isolation is also about classification: ensuring the right people look at the right issue at the right time.

Severity: how bad is the impact?

  • Critical: data loss/corruption, security exposure, system down, payment/checkout blocked for many users.
  • High: major feature blocked with limited workaround, incorrect calculations affecting outcomes.
  • Medium: partial feature degradation, confusing behavior with workaround.
  • Low: minor functional issue with minimal impact.

Severity should be based on impact, not on how hard the fix is.

Priority: when should it be fixed?

Priority depends on release timing, user impact, risk, and business context. A medium-severity defect might be high priority if it affects a demo tomorrow or a compliance deadline.

Triage questions that support isolation

  • Is this a new regression or a long-standing issue?
  • Is it environment-specific?
  • Is it data-specific?
  • Can we reproduce with a minimal case?
  • What logs/IDs do we need to locate the failing component?

Special Case: Intermittent Failures and “Cannot Reproduce”

Intermittent failures are common in distributed systems and UI-heavy applications. Treat them as a different class of problem: your job is to turn “sometimes” into “under these conditions.”

Step-by-step approach for intermittent issues

  • Step 1: Quantify frequency. Run a short loop test and record results (e.g., 3 failures in 20 attempts).
  • Step 2: Capture timing and identifiers for each attempt (timestamp, request ID, order ID).
  • Step 3: Control variables: same account, same data, same browser, same network.
  • Step 4: Introduce one variable at a time: network throttling, parallel actions, different device.
  • Step 5: Look for clustering: does it happen after idle time, after deployment, during peak load?

When you cannot stabilize reproduction, your report should emphasize evidence and patterns: “Observed 3 failures between 10:05–10:12 UTC; all had requestId prefix ‘ab7’; all occurred when response time exceeded 8 seconds.”
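
A Python sketch that combines the first two steps: run a short loop, record timestamp, request ID, latency, and outcome for each attempt, then look for clustering (the endpoint and header name are placeholders):

    import time
    from datetime import datetime, timezone
    import requests

    URL = "https://staging.example.com/api/payments"   # placeholder
    attempts = []
    for i in range(20):
        start = time.monotonic()
        resp = requests.post(URL, json={"orderId": 10492}, timeout=30)
        attempts.append({
            "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "requestId": resp.headers.get("X-Request-Id"),   # header name depends on the system
            "latency_s": round(time.monotonic() - start, 2),
            "ok": resp.status_code == 200,
        })

    failures = [a for a in attempts if not a["ok"]]
    print(f"{len(failures)}/{len(attempts)} attempts failed")
    for a in failures:
        print(a["at"], a["requestId"], f"{a['latency_s']}s")
    # Patterns to look for: do failures cluster in time, share a request ID prefix,
    # or only occur when latency exceeds a threshold (for example, over 8 seconds)?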

Special Case: Failures Caused by Test Environment and Test Data

Not every failure indicates a product defect. Some failures are caused by environment instability, misconfiguration, or invalid test data. Recognizing these quickly prevents wasted time.

Environment-induced failure indicators

  • Multiple unrelated tests failing at once
  • Errors like “service unavailable,” DNS failures, certificate issues
  • Failures that disappear after redeploy/restart
  • Third-party sandbox outages or rate limits

Data-induced failure indicators

  • Only one specific account fails
  • Failures tied to expired tokens, locked users, exhausted quotas
  • Records in unexpected states due to previous tests

Isolation tactic: reproduce with a fresh user and freshly created data. If the issue disappears, document the data dependency: “Fails only for accounts created before migration M2025-01.” That may still be a product defect, but it narrows the cause.

Turning Observations into Actionable Next Steps

After you isolate conditions and likely layers, you can propose targeted next steps that help the team fix faster. Keep these as suggestions, not assertions.

  • If API returns 4xx with validation error: review request payload and validation rules; check encoding/formatting; confirm UI constraints match API constraints.
  • If API returns 5xx: capture request ID and timestamp; check service logs; verify downstream dependency health.
  • If UI shows wrong values but API is correct: suspect client-side formatting, caching, or mapping; capture console errors and state snapshots.
  • If failure is role-specific: compare permissions and feature flags; test with minimal permission set that still reproduces.
  • If failure is data-specific: identify the minimal data attribute that triggers it (length, null, special character, boundary date).

Now answer the exercise about the content:

While isolating a checkout problem, the UI shows an error banner but the API request succeeds and returns correct data. Which layer is the most likely suspect to investigate next?


If the API succeeds and returns correct data but the UI shows incorrect behavior, the likely issue is in the UI layer (rendering, caching, mapping, or client-side formatting), so you should gather UI evidence like console errors and state snapshots.

Next chapter

Defect Lifecycle: From Discovery to Verification
