
Introduction to Large Language Models (LLMs): How They Work and What They Can (and Can’t) Do


Evaluating Quality, Reliability, and Risk

Chapter 7

Estimated reading time: 12 minutes


Why evaluation matters: quality, reliability, and risk are different

When you use an LLM in a product, a workflow, or even personal work, you are not only judging whether the output “sounds good.” You are judging whether it is fit for a purpose under real constraints: time pressure, incomplete information, ambiguous instructions, and downstream consequences. Evaluation is the discipline of measuring how well the system performs and how safely it fails.

Three terms are often mixed together but should be separated:

  • Quality: How good is the output when it is correct? This includes clarity, completeness, formatting, tone, usefulness, and adherence to instructions.
  • Reliability: How consistently does the system produce acceptable output across many cases, including edge cases? Reliability is about variance, repeatability, and robustness to small changes in input.
  • Risk: What is the harm if the system is wrong or misused? Risk depends on the domain (medical vs. marketing), the user population, and the safeguards around the model.

In practice, you evaluate all three. A model can produce high-quality prose but be unreliable (sometimes excellent, sometimes dangerously wrong). Another can be reliable but low quality (safe but bland). Your goal is to choose the right trade-off for the use case and to add controls where the model is weak.

Define “good” before you test: requirements and acceptance criteria

Evaluation starts with a specification. Without it, you will end up measuring what is easy to measure rather than what matters. A useful specification is written in the language of the task and the user, not in model-centric terms.

Step-by-step: write acceptance criteria

  • Step 1: Identify the user and the decision. Example: “Customer support agent needs a draft reply that is accurate and polite.”
  • Step 2: List non-negotiables. Example: “Must not invent order status; must ask for order ID if missing; must not request passwords.”
  • Step 3: Define output constraints. Example: “Max 120 words; include greeting and closing; use friendly tone; no internal policy references.”
  • Step 4: Define correctness sources. Example: “Order status must match CRM; refund policy must match policy document.”
  • Step 5: Define pass/fail thresholds. Example: “At least 95% of responses must be factually consistent with CRM on a weekly sample; 0 tolerance for requesting sensitive data.”

These criteria become the backbone of your evaluation rubric and your automated checks.
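
Criteria like these can often be encoded directly as automated checks. The sketch below is a minimal illustration in Python; the check names and the check_reply helper are hypothetical, not part of any particular framework.

import re

# Hypothetical automated checks for the support-reply criteria above.
# Each check returns a (name, passed) pair so failures are easy to report.

SENSITIVE_PATTERNS = [r"\bpassword\b", r"\bcard number\b", r"\bcvv\b"]

def check_reply(reply: str, order_id_present: bool) -> list[tuple[str, bool]]:
    results = []

    # Output constraint: max 120 words.
    results.append(("max_120_words", len(reply.split()) <= 120))

    # Non-negotiable: never request passwords or other sensitive data.
    asks_sensitive = any(re.search(p, reply, re.IGNORECASE) for p in SENSITIVE_PATTERNS)
    results.append(("no_sensitive_data_request", not asks_sensitive))

    # Non-negotiable: if the order ID is missing, the reply must ask for it.
    if not order_id_present:
        results.append(("asks_for_order_id", "order id" in reply.lower()))

    return results

for name, passed in check_reply(
    "Happy to help! Could you share your order ID so I can check the status?",
    order_id_present=False,
):
    print(name, "PASS" if passed else "FAIL")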


Quality evaluation: what to measure and how

Quality is multi-dimensional. A single “score” often hides important differences. Instead, break quality into components aligned with your acceptance criteria.

Common quality dimensions

  • Instruction adherence: Did it follow the requested format, length, and constraints?
  • Task completion: Did it answer the question or perform the requested transformation?
  • Factuality with respect to provided sources: Are statements supported by the given context or allowed references?
  • Clarity and readability: Is it understandable to the target audience?
  • Appropriate tone: Does it match brand voice or professional expectations?
  • Actionability: Does it provide next steps, options, or structured outputs that can be used downstream?

Rubrics: make subjective judgments consistent

Human evaluation is often necessary for tone, clarity, and usefulness. The key is to reduce evaluator drift by using a rubric with examples.

Example rubric for a “draft email reply” task:

  • Accuracy (0–2): 0 = contains incorrect claim; 1 = uncertain/unsupported claim; 2 = consistent with provided facts.
  • Policy compliance (0–2): 0 = violates policy; 1 = borderline; 2 = compliant.
  • Helpfulness (0–2): 0 = unhelpful; 1 = partially helpful; 2 = resolves user need or gives clear next steps.
  • Tone (0–2): 0 = inappropriate; 1 = acceptable; 2 = on-brand.
  • Formatting (0–2): 0 = wrong format; 1 = minor issues; 2 = correct.

To improve consistency, include “anchor examples” for each score level. For instance, show what a “1” in accuracy looks like: a response that hedges but still implies a fact not in evidence.
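
To make scores easy to record and compare across reviewers, the rubric can also be expressed as data. A minimal sketch follows; the "critical" flag and the aggregation rule are illustrative conventions, not part of the rubric itself.

# The rubric above as data, so scores can be aggregated consistently.
# The "critical" flag and the zero-tolerance rule are illustrative assumptions.

RUBRIC = {
    "accuracy":          {"max": 2, "critical": True},
    "policy_compliance": {"max": 2, "critical": True},
    "helpfulness":       {"max": 2, "critical": False},
    "tone":              {"max": 2, "critical": False},
    "formatting":        {"max": 2, "critical": False},
}

def summarize(scores: dict[str, int]) -> dict:
    total = sum(scores[d] for d in RUBRIC)
    max_total = sum(spec["max"] for spec in RUBRIC.values())
    # A zero on a critical dimension fails the item regardless of the total.
    critical_failure = any(scores[d] == 0 for d, spec in RUBRIC.items() if spec["critical"])
    return {"total": total, "max": max_total, "critical_failure": critical_failure}

print(summarize({"accuracy": 2, "policy_compliance": 1, "helpfulness": 2, "tone": 2, "formatting": 2}))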

LLM-as-judge: useful, but validate it

Using an LLM to grade outputs can speed up evaluation, especially for style and formatting. However, it can be biased toward fluent text and may miss subtle factual errors. If you use LLM-as-judge, treat it as a measurement instrument that needs calibration.

Step-by-step: calibrate an LLM judge

  • Step 1: Create a small “gold” set of outputs graded by trusted human reviewers.
  • Step 2: Ask the judge model to grade the same items using the same rubric.
  • Step 3: Measure agreement (e.g., percent agreement or correlation). Focus on critical categories like policy compliance and factuality.
  • Step 4: Adjust the judge prompt with clearer definitions and counterexamples where it fails.
  • Step 5: Re-test periodically because model updates or prompt changes can shift behavior.
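
A minimal sketch of Step 3, comparing judge scores against the human gold set per rubric dimension; the scores shown are illustrative.

# Compare judge grades with trusted human grades on the same items.
gold =  {"item1": {"accuracy": 2, "policy_compliance": 2},
         "item2": {"accuracy": 1, "policy_compliance": 2},
         "item3": {"accuracy": 0, "policy_compliance": 1}}
judge = {"item1": {"accuracy": 2, "policy_compliance": 2},
         "item2": {"accuracy": 2, "policy_compliance": 2},
         "item3": {"accuracy": 0, "policy_compliance": 2}}

for dim in ["accuracy", "policy_compliance"]:
    pairs = [(gold[i][dim], judge[i][dim]) for i in gold]
    exact = sum(g == j for g, j in pairs) / len(pairs)          # exact agreement rate
    mad = sum(abs(g - j) for g, j in pairs) / len(pairs)        # mean absolute difference
    print(f"{dim}: exact agreement {exact:.0%}, mean abs difference {mad:.2f}")

Focus first on the critical dimensions: a judge that agrees on tone but disagrees on policy compliance is not yet trustworthy for release decisions.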

Reliability evaluation: consistency, robustness, and variance

Reliability is about how often the system meets your acceptance criteria across the full range of real inputs. A system that is “usually good” can still be unacceptable if the failure mode is severe or frequent in certain segments.

Key reliability questions

  • Stability to small input changes: Does rephrasing the user question change the answer materially?
  • Stability across runs: If you run the same prompt multiple times, do you get consistent results?
  • Coverage across segments: Does performance drop for certain product categories, languages, user skill levels, or rare edge cases?
  • Graceful degradation: When the model cannot answer, does it ask clarifying questions or does it guess?

Step-by-step: build a reliability test suite

  • Step 1: Collect representative inputs from logs, support tickets, or synthetic scenarios. Label them by segment (topic, difficulty, language, user type).
  • Step 2: Create perturbations for each input: paraphrases, reordered details, added irrelevant text, typos, and conflicting instructions.
  • Step 3: Run multiple trials per input if your system uses randomness. Record pass/fail against acceptance criteria.
  • Step 4: Compute reliability metrics: overall pass rate, pass rate by segment, worst-segment performance, and variance across trials.
  • Step 5: Investigate failure clusters: group failures by pattern (e.g., “fails when user includes two requests,” “fails when dates are ambiguous”).

Reliability testing is most valuable when it reveals “unknown unknowns”: categories you did not realize were risky until you saw repeated failures.
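
A minimal sketch of the metrics in Step 4: overall pass rate, pass rate by segment, the worst segment, and variance across repeated trials. The trial records are illustrative; in practice they come from running the suite.

from collections import defaultdict
from statistics import pstdev

trials = [
    # (input_id, segment, trial_number, passed)
    ("q1", "shipping", 1, True),  ("q1", "shipping", 2, True),  ("q1", "shipping", 3, False),
    ("q2", "refunds",  1, True),  ("q2", "refunds",  2, True),  ("q2", "refunds",  3, True),
    ("q3", "refunds",  1, False), ("q3", "refunds",  2, True),  ("q3", "refunds",  3, False),
]

overall = sum(passed for *_, passed in trials) / len(trials)

by_segment = defaultdict(list)
for _, segment, _, passed in trials:
    by_segment[segment].append(passed)
segment_rates = {s: sum(v) / len(v) for s, v in by_segment.items()}

by_input = defaultdict(list)
for input_id, _, _, passed in trials:
    by_input[input_id].append(passed)
variance_per_input = {i: pstdev(map(int, v)) for i, v in by_input.items()}  # spread across trials

print(f"overall pass rate: {overall:.0%}")
print("pass rate by segment:", segment_rates)
print("worst segment:", min(segment_rates, key=segment_rates.get))
print("per-input std dev across trials:", variance_per_input)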

Determinism and configuration as reliability levers

Many LLM systems have settings that affect variability. If your use case demands consistent outputs (e.g., compliance summaries), you may prefer configurations that reduce randomness and enforce structure. Reliability also improves when you constrain outputs with templates, schemas, or structured formats that downstream systems can validate.
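
As one illustration, many model APIs expose settings along the following lines; the parameter names vary by provider, so treat this as a hypothetical configuration sketch rather than any specific vendor's interface.

# Hypothetical request settings that reduce output variability.
request = {
    "model": "example-model",                     # placeholder model name
    "temperature": 0,                             # reduce sampling randomness
    "seed": 42,                                   # where supported, makes runs more repeatable
    "response_format": {"type": "json_object"},   # constrain output to parseable JSON
    "max_tokens": 300,                            # bound length so downstream checks are predictable
}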

Risk evaluation: harm, likelihood, and mitigations

Risk is not just about whether the model is wrong; it is about the impact of being wrong and the probability that the wrong output reaches a user or triggers an action. Risk evaluation therefore combines model behavior with product design.

A practical risk framework

  • Severity: How bad is the harm? Examples: minor confusion, financial loss, legal exposure, physical harm.
  • Likelihood: How often does the failure occur in realistic usage?
  • Detectability: How likely is it that the failure will be caught before causing harm?
  • Exposure: How many users are affected and how quickly can it spread?

High-severity failures require stronger controls even if they are rare. Low-severity failures can sometimes be tolerated if they are easy to detect and correct.
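
A minimal sketch of how the four factors can be combined into a single priority score for ranking failure modes. The 1 to 5 scales and the multiplicative formula are illustrative conventions, similar in spirit to an FMEA risk priority number, not something prescribed here.

def risk_score(severity: int, likelihood: int, detectability: int, exposure: int) -> int:
    # Higher detectability should lower risk, so invert it on a 1-5 scale.
    return severity * likelihood * (6 - detectability) * exposure

failure_modes = {
    "wrong refund amount":     risk_score(severity=4, likelihood=2, detectability=3, exposure=3),
    "asks user for password":  risk_score(severity=5, likelihood=1, detectability=4, exposure=2),
    "slightly off-brand tone": risk_score(severity=1, likelihood=4, detectability=2, exposure=4),
}

# Review the highest-scoring failure modes first.
for name, score in sorted(failure_modes.items(), key=lambda kv: -kv[1]):
    print(f"{score:4d}  {name}")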

Common LLM risk categories to test

  • Incorrect high-stakes guidance: legal, medical, financial, safety instructions.
  • Privacy and data leakage: revealing sensitive information, repeating private inputs, or encouraging users to share secrets.
  • Security misuse: generating instructions that facilitate wrongdoing or bypass safeguards in your system.
  • Policy and compliance violations: disallowed content, discrimination, harassment, or regulated advice.
  • Operational risk: the model triggers actions (emails, refunds, code changes) incorrectly.
  • Reputational risk: confident but wrong statements, offensive tone, or brand-unsafe language.

Step-by-step: perform a risk assessment for an LLM feature

  • Step 1: Map the user journey from input to output to downstream action. Identify where the model influences decisions.
  • Step 2: List plausible failure modes for each step (e.g., “misreads policy,” “misclassifies intent,” “outputs personal data”).
  • Step 3: Rate severity and likelihood using a simple scale (e.g., 1–5). Include detectability.
  • Step 4: Choose mitigations: human review, source citations, restricted actions, content filters, or forced clarification questions.
  • Step 5: Define monitoring signals that indicate risk is rising (spikes in user complaints, increased refusals, unusual topics).

Risk assessment should be revisited whenever you change prompts, tools, data sources, or the model version.

Designing evaluation datasets: from “toy examples” to realistic coverage

Your evaluation is only as good as your test data. If your dataset is too clean, too small, or too similar to your best-case scenarios, you will overestimate performance.

Principles for strong evaluation sets

  • Represent real distribution: include common user phrasing, incomplete requests, and messy inputs.
  • Include edge cases: ambiguous questions, conflicting constraints, unusual formats, and long inputs.
  • Include adversarial cases: attempts to override instructions, inject irrelevant content, or request disallowed actions.
  • Label by segment: so you can see where performance drops.
  • Keep a “frozen” benchmark: a stable set you do not change, so you can compare versions over time.

Golden answers vs. flexible scoring

Some tasks have a single correct answer (e.g., extracting a date). Others allow many good answers (e.g., rewriting text). For flexible tasks, prefer rubric-based scoring and check for required elements rather than exact matches.

Example: For a meeting summary, you might require: agenda items, decisions, action items with owners, and open questions. Many phrasings can satisfy that.
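
A minimal sketch of required-element checking for that meeting-summary example; the section names and the keyword matching are illustrative, and borderline cases would still go to rubric-based review.

REQUIRED_SECTIONS = ["agenda", "decisions", "action items", "open questions"]

def check_required_elements(summary: str) -> dict[str, bool]:
    # Crude presence check: does each required section appear in the text?
    lowered = summary.lower()
    return {section: section in lowered for section in REQUIRED_SECTIONS}

summary = """Agenda: Q3 roadmap review.
Decisions: ship feature A in September.
Action items: Maria to draft the rollout plan.
Open questions: budget for contractor support."""

print(check_required_elements(summary))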

Automated checks: catching failures cheaply

Automated evaluation is valuable because it is fast and repeatable. It is especially effective for structural requirements and known constraints.

What you can automate well

  • Schema validation: ensure JSON outputs parse and include required fields.
  • Length and formatting: word count, bullet structure, presence of headings.
  • Forbidden content checks: disallowed phrases, requesting sensitive data, or policy keywords (with careful tuning to avoid false positives).
  • Consistency checks: if the output includes an order number, does it match the input? If it claims a date, is it in the allowed range?
  • Tool-result alignment: if the system uses an external lookup, does the answer match the tool output?

Example: JSON schema validation for a structured output

{  "type": "object",  "required": ["summary", "action_items"],  "properties": {    "summary": {"type": "string", "minLength": 20},    "action_items": {      "type": "array",      "items": {        "type": "object",        "required": ["owner", "task"],        "properties": {          "owner": {"type": "string"},          "task": {"type": "string"},          "due_date": {"type": "string"}        }      }    }  }}

This kind of check does not tell you whether the summary is insightful, but it prevents a large class of integration failures and forces the model into a predictable shape.
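
One way to apply a schema like this in Python is the jsonschema package. A minimal sketch follows; the schema file name and the sample output are assumptions.

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# The schema shown above, assumed here to be saved as summary_schema.json.
with open("summary_schema.json") as f:
    schema = json.load(f)

def validate_output(raw_model_output: str) -> tuple[bool, str]:
    """Return (passed, reason) so failures can be logged and counted."""
    try:
        parsed = json.loads(raw_model_output)      # does it parse at all?
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    try:
        validate(instance=parsed, schema=schema)   # does it match the required shape?
    except ValidationError as exc:
        return False, f"schema violation: {exc.message}"
    return True, "ok"

print(validate_output('{"summary": "Team agreed to ship the beta in early October.", "action_items": []}'))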

Human evaluation: when you need it and how to run it

Human review is essential when the goal is subjective (tone, helpfulness) or when subtle domain knowledge is required. The challenge is cost and consistency.

Step-by-step: run a lightweight human eval

  • Step 1: Choose a sample that includes typical cases and known hard cases.
  • Step 2: Train reviewers on the rubric with examples and counterexamples.
  • Step 3: Use double review on a subset: two reviewers grade the same items to measure agreement.
  • Step 4: Resolve disagreements with a short adjudication process and update rubric wording if needed.
  • Step 5: Track trends over time: if a new model version improves helpfulness but reduces compliance, you need to see that explicitly.

Even small human evaluations can be powerful if they are repeated regularly and tied to release decisions.
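
For the double-review subset in Step 3, simple percent agreement can look inflated when most items receive the same grade; Cohen's kappa corrects for chance agreement. A minimal sketch on pass/fail labels, with illustrative reviewer data:

from collections import Counter

reviewer_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
reviewer_b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

n = len(reviewer_a)
observed = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n

# Chance agreement: probability both reviewers pick the same label at random,
# based on each reviewer's own label frequencies.
counts_a, counts_b = Counter(reviewer_a), Counter(reviewer_b)
expected = sum((counts_a[label] / n) * (counts_b[label] / n)
               for label in set(reviewer_a + reviewer_b))

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement {observed:.2f}, expected by chance {expected:.2f}, kappa {kappa:.2f}")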

Measuring uncertainty and encouraging safe failure

A reliable system does not only produce good answers; it also knows when to slow down. In many applications, the best behavior under uncertainty is to ask a clarifying question, provide options, or explicitly state limits.

What to evaluate for safe failure

  • Clarification behavior: Does the system ask for missing critical details instead of guessing?
  • Boundary behavior: Does it refuse or redirect when asked for disallowed or high-risk content?
  • Confidence signaling: Does it distinguish between verified facts (from provided sources) and general suggestions?
  • Escalation: Does it recommend human review for high-stakes cases?

To test this, include prompts that are intentionally underspecified or contain conflicting requirements. Your rubric should reward appropriate questions and penalize confident guessing.
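
A minimal sketch of such test cases, each tagged with the behavior the rubric should reward; the cases and the crude screening heuristic are illustrative, and ambiguous responses still go to human review.

SAFE_FAILURE_CASES = [
    {"prompt": "Cancel my order.",                                # no order ID given
     "expected": "asks_clarifying_question"},
    {"prompt": "Ship it overnight but use the cheapest option.",  # conflicting constraints
     "expected": "asks_clarifying_question"},
    {"prompt": "What dosage of this medication should I take?",   # high-stakes domain
     "expected": "escalates_or_declines"},
]

def looks_like_clarifying_question(response: str) -> bool:
    # Crude heuristic for automated screening only.
    return "?" in response

response = "Of course! Could you share the order number you'd like to cancel?"
print(looks_like_clarifying_question(response))  # True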

Monitoring in production: evaluation doesn’t stop at launch

Offline benchmarks are necessary but not sufficient. Real users will bring new topics, new phrasing, and new failure modes. Monitoring is how you detect drift and emerging risks.

What to monitor

  • Outcome metrics: user satisfaction ratings, task completion, time saved, escalation rates.
  • Quality proxies: frequency of re-prompts, edits made by users, or “regenerate” clicks.
  • Safety signals: flagged content rates, policy violation reports, sensitive data detection hits.
  • Reliability signals: error rates, schema validation failures, tool-call failures, latency spikes.
  • Segment breakdowns: performance by language, topic, user type, or region.

Step-by-step: set up a feedback loop

  • Step 1: Log inputs and outputs responsibly with privacy controls and redaction where needed.
  • Step 2: Sample for review using both random sampling and targeted sampling (e.g., all flagged items).
  • Step 3: Label failures using the same categories as your offline rubric so trends are comparable.
  • Step 4: Turn failures into tests: add them to your frozen benchmark or a regression suite.
  • Step 5: Gate releases: require that new versions do not regress on critical metrics.

This loop is what turns evaluation into an engineering practice rather than a one-time experiment.
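
A minimal sketch of Step 4: a logged production failure turned into a regression test. The pytest style and the run_support_reply helper are assumptions; wire in your own system call.

import pytest

def run_support_reply(user_message: str) -> str:
    raise NotImplementedError("call your LLM system here")

# Added after a real incident in which the model guessed a delivery date
# when no order ID was provided.
@pytest.mark.parametrize("message", [
    "Where is my package? It was supposed to arrive yesterday.",
    "my parcel is late, when will it come",
])
def test_missing_order_id_triggers_clarification(message):
    reply = run_support_reply(message).lower()
    assert "order" in reply and "?" in reply, "should ask for the order ID"
    assert "will arrive on" not in reply, "should not promise a delivery date"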

Putting it together: an evaluation plan template

To make evaluation actionable, write a short plan that can be executed repeatedly.

Example evaluation plan structure

  • Use case: “Draft customer support replies for shipping issues.”
  • Critical risks: “Incorrect refund policy; requesting sensitive data; promising delivery dates without evidence.”
  • Acceptance criteria: “No sensitive data requests; must ask for order ID if missing; must match policy text; polite tone.”
  • Offline tests: “200-case benchmark with segment labels; 20 adversarial prompts; 3 paraphrases each; 3 trials each.”
  • Automated checks: “Length, forbidden content, schema, policy keyword constraints, tool-output alignment.”
  • Human review: “Monthly 50-case sample; double review on 10 cases; rubric scoring.”
  • Release gates: “No regression on compliance; reliability pass rate ≥ 95% overall and ≥ 90% in worst segment.”
  • Production monitoring: “Flag rate, escalation rate, user edits, schema failures, segment dashboards.”

With a plan like this, you can compare models, prompts, and system designs in a disciplined way, and you can justify decisions to stakeholders using evidence rather than anecdotes.
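
The plan itself can also be kept as machine-readable configuration so release gates are checked automatically. A minimal sketch; the field names mirror the example plan, but the structure is an illustrative convention, not a standard.

EVAL_PLAN = {
    "use_case": "Draft customer support replies for shipping issues",
    "offline_tests": {"benchmark_cases": 200, "adversarial_prompts": 20,
                      "paraphrases_per_case": 3, "trials_per_case": 3},
    "release_gates": {
        "compliance_regression_allowed": False,
        "min_overall_pass_rate": 0.95,
        "min_worst_segment_pass_rate": 0.90,
    },
}

def passes_release_gates(results: dict) -> bool:
    gates = EVAL_PLAN["release_gates"]
    return (results["overall_pass_rate"] >= gates["min_overall_pass_rate"]
            and results["worst_segment_pass_rate"] >= gates["min_worst_segment_pass_rate"]
            and not results["compliance_regressed"])

print(passes_release_gates({"overall_pass_rate": 0.96,
                            "worst_segment_pass_rate": 0.91,
                            "compliance_regressed": False}))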

Now answer the exercise about the content:

In an evaluation plan for an LLM feature, which approach best tests reliability rather than just quality?


Reliability is about consistency across cases, segments, and runs. A reliability test suite uses representative inputs, perturbations, repeated trials, and pass-rate/variance metrics against acceptance criteria.

Next chapter

When LLMs Are the Right Tool
