
Python Data Modeling in Practice: Dataclasses, Pydantic, and Type Hints


Pydantic-Style Validation for Robust Input Boundaries

Chapter 10

Estimated reading time: 11 minutes


Why “Input Boundaries” Need Strong Validation

In a real system, most defects around data modeling do not come from your carefully constructed domain objects; they come from the edges where data enters: HTTP requests, CLI arguments, message queues, partner webhooks, CSV imports, and configuration files. These boundaries are where you must assume data is incomplete, incorrectly typed, inconsistently formatted, or even malicious. “Pydantic-style validation” refers to a practical approach: define explicit input schemas that parse and validate raw data into well-typed, trustworthy objects, producing actionable error messages when the input is wrong.

The goal is not to “make everything a Pydantic model.” The goal is to create a robust boundary layer that converts untrusted input into safe, normalized values before the rest of the application touches it. This chapter focuses on how to design those boundary schemas, how to validate and normalize common patterns, and how to integrate them without leaking boundary concerns into the domain.

Core Idea: Parse, Validate, Normalize, Then Hand Off

Pydantic-style validation typically follows a pipeline:

  • Parse: Convert raw input types (strings, numbers, dicts) into Python types.
  • Validate: Enforce constraints (required fields, ranges, formats, allowed values, cross-field rules).
  • Normalize: Canonicalize values (trim whitespace, lowercase where appropriate, normalize timezones, apply defaults).
  • Hand off: Provide a validated object to the next layer (service/domain), or return structured errors to the caller.

This approach is especially effective when inputs are messy. For example, a client might send "42" for an integer, "TRUE" for a boolean, or a date in multiple formats. A boundary schema can accept reasonable variations, normalize them, and reject the rest with clear errors.
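As a minimal sketch of this behavior (the `OrderInput` model here is hypothetical, not part of the chapter's running example), Pydantic v2's default lax mode accepts such variations while still rejecting genuinely wrong input:

```python
from datetime import date

from pydantic import BaseModel, ValidationError


class OrderInput(BaseModel):
    quantity: int    # accepts 42 or "42"
    rush: bool       # accepts True, "true", "TRUE", 1
    placed_on: date  # accepts date objects or ISO-format strings


# Reasonable variations are parsed and normalized...
order = OrderInput.model_validate(
    {'quantity': '42', 'rush': 'TRUE', 'placed_on': '2024-01-15'}
)
print(order.quantity, order.rush)  # 42 True

# ...while nonsense fails with a structured, field-level error.
try:
    OrderInput.model_validate(
        {'quantity': 'many', 'rush': 'no', 'placed_on': '2024-01-15'}
    )
except ValidationError as e:
    print(e.errors()[0]['loc'])  # ('quantity',)
```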

Choosing the Right Tooling: Pydantic v2 Concepts You’ll Use

Pydantic v2 is a common implementation of this style. The most relevant building blocks for boundary validation are:


  • BaseModel: Defines the input schema and performs parsing/validation.
  • Field constraints: Declarative constraints like min/max, length, regex/pattern, and defaults.
  • field_validator: Per-field validation and normalization.
  • model_validator: Cross-field validation (e.g., “end_date must be after start_date”).
  • ConfigDict: Model configuration (extra fields, strictness, aliasing, etc.).
  • Error reporting: Structured errors that can be returned to API clients.

The patterns below are transferable even if you use another validation library: the key is to keep boundary schemas explicit and separate from domain objects.

Step-by-Step: Building a Boundary Schema for an API Request

Consider an endpoint that creates a user account. The boundary must handle raw JSON input, validate it, normalize it, and then pass a clean command object to the application layer.

Step 1: Define the input model with constraints

from pydantic import BaseModel, Field, ConfigDict, EmailStr, ValidationError, field_validator, model_validator
class CreateUserInput(BaseModel):
    model_config = ConfigDict(extra='forbid')

    email: EmailStr
    display_name: str = Field(min_length=1, max_length=50)
    age: int | None = Field(default=None, ge=13, le=130)
    marketing_opt_in: bool = False

Key points:

  • extra='forbid' rejects unknown fields, which prevents silent acceptance of misspelled keys and reduces attack surface.
  • EmailStr parses and validates email format.
  • Field(min_length=..., max_length=...) enforces basic string constraints.
  • ge/le enforce numeric bounds.

Step 2: Normalize fields with field_validator

Normalization belongs at the boundary because it’s about accepting a variety of input representations and producing a canonical form.

class CreateUserInput(BaseModel):
    model_config = ConfigDict(extra='forbid')

    email: EmailStr
    display_name: str = Field(min_length=1, max_length=50)
    age: int | None = Field(default=None, ge=13, le=130)
    marketing_opt_in: bool = False

    @field_validator('display_name')
    @classmethod
    def normalize_display_name(cls, v: str) -> str:
        v = v.strip()
        if not v:
            raise ValueError('display_name cannot be blank')
        return v

This ensures that " Alice " becomes "Alice", and that whitespace-only names are rejected with a clear message.

Step 3: Add cross-field rules with model_validator (when needed)

Not every model needs cross-field validation, but it’s common for filters, date ranges, and conditional requirements.

class CreateUserInput(BaseModel):
    model_config = ConfigDict(extra='forbid')

    email: EmailStr
    display_name: str = Field(min_length=1, max_length=50)
    age: int | None = Field(default=None, ge=13, le=130)
    marketing_opt_in: bool = False

    @field_validator('display_name')
    @classmethod
    def normalize_display_name(cls, v: str) -> str:
        v = v.strip()
        if not v:
            raise ValueError('display_name cannot be blank')
        return v

    @model_validator(mode='after')
    def check_marketing_requires_age(self):
        if self.marketing_opt_in and self.age is None:
            raise ValueError('age is required when marketing_opt_in is true')
        return self

This demonstrates a conditional requirement: opting into marketing requires an age value.

Step 4: Validate incoming data and return structured errors

At the boundary (e.g., a web handler), you validate raw input and handle errors in a consistent way.

def parse_create_user(payload: dict) -> CreateUserInput:
    try:
        return CreateUserInput.model_validate(payload)
    except ValidationError:
        # In a web API you would map this to a 400 response
        # with e.errors() as the body.
        raise

Pydantic’s ValidationError includes a list of error objects with locations and messages. This is ideal for client-facing APIs because you can point to exactly which field failed and why.

Strict vs Coercive Parsing: Decide Per Boundary

One of the most important design decisions is whether your boundary should coerce types (accept "123" for an int) or be strict (reject anything not already the correct type). Coercion can improve usability for external clients, but strictness can reduce ambiguity and unexpected behavior.

You can tune this per model. For example, to enforce strict types for a sensitive internal boundary (like a message consumed from a queue where producers are controlled), you can use strict types and configuration.

from pydantic import BaseModel, ConfigDict, StrictInt, StrictBool
class InternalEventInput(BaseModel):
    model_config = ConfigDict(extra='forbid')

    event_id: str
    retry_count: StrictInt
    is_replay: StrictBool

With strict types, "1" will not be accepted as an integer. This is useful when you want producers to fix their serialization rather than relying on the consumer to guess.
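A side-by-side sketch (both models are illustrative, not from the chapter's running example) makes the difference concrete:

```python
from pydantic import BaseModel, StrictInt, ValidationError


class LaxCounter(BaseModel):
    retry_count: int        # lax: the string "3" is coerced to 3


class StrictCounter(BaseModel):
    retry_count: StrictInt  # strict: the string "3" is rejected


print(LaxCounter.model_validate({'retry_count': '3'}).retry_count)  # 3

try:
    StrictCounter.model_validate({'retry_count': '3'})
except ValidationError as e:
    # The producer must fix its serialization and send a real integer.
    print(e.errors()[0]['msg'])
```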

Handling “Extra Fields”: Forbid, Ignore, or Allow

Unknown fields are a common source of subtle bugs. Pydantic supports different strategies:

  • Forbid: Reject unknown keys. Best for public APIs and security-sensitive boundaries.
  • Ignore: Drop unknown keys. Useful when you want forward compatibility with clients sending extra data.
  • Allow: Keep unknown keys. Useful for pass-through scenarios, but increases complexity.
from pydantic import BaseModel, ConfigDict

class LenientInput(BaseModel):
    model_config = ConfigDict(extra='ignore')

    q: str

When you choose ignore, document it and ensure you are not accidentally discarding important data. For most create/update commands, forbid is the safer default.
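The forbid and ignore strategies are easiest to compare with a small, illustrative sketch:

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class ForbidInput(BaseModel):
    model_config = ConfigDict(extra='forbid')
    q: str


class IgnoreInput(BaseModel):
    model_config = ConfigDict(extra='ignore')
    q: str


payload = {'q': 'pydantic', 'debug': True}  # 'debug' is not part of the schema

# ignore: the unknown key is silently dropped.
print(IgnoreInput.model_validate(payload).q)  # pydantic

# forbid: the unknown key is reported by name.
try:
    ForbidInput.model_validate(payload)
except ValidationError as e:
    print(e.errors()[0]['loc'])  # ('debug',)
```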

Aliases and Field Names: Accept External Conventions Without Leaking Them

External inputs often use different naming conventions (camelCase) than Python code (snake_case). A boundary schema can accept external names while exposing internal names to your code.

from pydantic import BaseModel, Field, ConfigDict
class PaginationInput(BaseModel):
    model_config = ConfigDict(extra='forbid', populate_by_name=True)

    page: int = Field(ge=1, default=1, alias='pageNumber')
    page_size: int = Field(ge=1, le=200, default=50, alias='pageSize')

With populate_by_name=True, the model can accept either page_size or pageSize. This is useful during migrations or when multiple clients exist.
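A quick usage sketch of such a model, with the class repeated so the snippet runs on its own:

```python
from pydantic import BaseModel, ConfigDict, Field


class PaginationInput(BaseModel):
    model_config = ConfigDict(extra='forbid', populate_by_name=True)

    page: int = Field(ge=1, default=1, alias='pageNumber')
    page_size: int = Field(ge=1, le=200, default=50, alias='pageSize')


# External clients can use the camelCase aliases...
print(PaginationInput.model_validate({'pageNumber': 2, 'pageSize': 25}).page_size)  # 25

# ...and internal callers can use the Python field names.
print(PaginationInput.model_validate({'page': 2, 'page_size': 25}).page)  # 2
```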

Reusable Validation Patterns for Common Boundary Problems

Pattern 1: Trim and collapse whitespace

Inputs from forms and CSV often contain inconsistent whitespace. Normalize it early.

import re

from pydantic import BaseModel, field_validator

class SearchInput(BaseModel):
    q: str

    @field_validator('q')
    @classmethod
    def normalize_query(cls, v: str) -> str:
        v = v.strip()
        v = re.sub(r'\s+', ' ', v)
        if len(v) < 2:
            raise ValueError('q must be at least 2 characters after trimming')
        return v

Pattern 2: Parse flexible date/time formats into a canonical form

Clients may send timestamps in multiple formats. Decide what you accept, parse it, and normalize to a consistent timezone.

from datetime import datetime, timezone

from pydantic import BaseModel, field_validator, model_validator

class ReportRangeInput(BaseModel):
    start: datetime
    end: datetime

    @field_validator('start', 'end')
    @classmethod
    def ensure_timezone(cls, v: datetime) -> datetime:
        # If naive, assume UTC (or reject, depending on your policy)
        if v.tzinfo is None:
            v = v.replace(tzinfo=timezone.utc)
        return v.astimezone(timezone.utc)

    @model_validator(mode='after')
    def check_order(self):
        if self.end <= self.start:
            raise ValueError('end must be after start')
        return self

This boundary model guarantees that downstream code always receives UTC-aware datetimes and a valid range.

Pattern 3: Validate lists with constraints and per-item normalization

Batch operations often accept lists of identifiers. You typically want to enforce size limits and normalize each item.

from pydantic import BaseModel, Field, field_validator
class BatchLookupInput(BaseModel):
    ids: list[str] = Field(min_length=1, max_length=100)

    @field_validator('ids')
    @classmethod
    def normalize_ids(cls, v: list[str]) -> list[str]:
        cleaned = []
        for item in v:
            s = item.strip()
            if not s:
                raise ValueError('ids cannot contain blank values')
            cleaned.append(s)
        # Optional: enforce uniqueness while preserving order
        seen = set()
        unique = []
        for s in cleaned:
            if s not in seen:
                seen.add(s)
                unique.append(s)
        return unique

This prevents empty IDs, enforces a maximum batch size, and optionally deduplicates.

Boundary Models as “Commands” and “Queries”

A practical way to keep boundaries clean is to define separate models for:

  • Commands: create/update actions with required fields and strict constraints.
  • Queries: filters and pagination with optional fields and normalization.

For example, a query boundary often needs to accept optional filters but still enforce that at least one filter is present, or that certain combinations are valid.

from pydantic import BaseModel, Field, model_validator
class UserSearchQuery(BaseModel):
    email: str | None = None
    display_name: str | None = Field(default=None, min_length=1, max_length=50)
    is_active: bool | None = None

    @model_validator(mode='after')
    def require_some_filter(self):
        if self.email is None and self.display_name is None and self.is_active is None:
            raise ValueError('at least one filter must be provided')
        return self

This prevents “unbounded” queries that could accidentally scan large datasets.

Mapping Validated Input to Application/Domain Types

Boundary models should not become the universal data structure across your codebase. A common pattern is:

  • Validate raw input into a boundary model.
  • Transform it into an internal command/query object (could be a dataclass) used by services.
  • Keep the transformation explicit so that boundary concerns (aliases, lenient parsing) do not leak inward.
from dataclasses import dataclass

@dataclass(frozen=True)
class CreateUserCommand:
    email: str
    display_name: str
    age: int | None
    marketing_opt_in: bool

def to_command(inp: CreateUserInput) -> CreateUserCommand:
    return CreateUserCommand(
        email=str(inp.email),
        display_name=inp.display_name,
        age=inp.age,
        marketing_opt_in=inp.marketing_opt_in,
    )

This keeps your application layer independent from Pydantic while still benefiting from robust validation at the boundary.

Error Translation: Turning Validation Errors into Client-Friendly Responses

Pydantic provides structured error details. A boundary layer typically converts these into your API’s error format. The key is to preserve:

  • Location: which field failed (including nested paths).
  • Message: a human-readable explanation.
  • Type/code: a stable identifier for programmatic handling.
from pydantic import ValidationError
def format_validation_error(e: ValidationError) -> dict:
    return {
        'error': 'validation_failed',
        'details': [
            {
                'loc': err['loc'],
                'msg': err['msg'],
                'type': err['type'],
            }
            for err in e.errors()
        ],
    }

This is especially helpful for frontends that want to highlight specific fields.
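For example, pairing that translator with a small hypothetical `SignupInput` model (both redefined here so the snippet runs on its own):

```python
from pydantic import BaseModel, Field, ValidationError


class SignupInput(BaseModel):
    email: str
    age: int = Field(ge=13)


def format_validation_error(e: ValidationError) -> dict:
    # Same shape as above: preserve location, message, and type.
    return {
        'error': 'validation_failed',
        'details': [
            {'loc': err['loc'], 'msg': err['msg'], 'type': err['type']}
            for err in e.errors()
        ],
    }


try:
    SignupInput.model_validate({'email': 'a@example.com', 'age': 9})
except ValidationError as e:
    body = format_validation_error(e)
    print(body['details'][0]['loc'])   # ('age',)
    print(body['details'][0]['type'])  # greater_than_equal
```

A frontend can match on `loc` to highlight the offending form field and on `type` to choose a localized message.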

Security and Robustness Considerations at the Boundary

Limit sizes to prevent resource abuse

Always constrain:

  • String lengths (names, descriptions, free-form text).
  • List sizes (batch endpoints).
  • Numeric ranges (pagination, counts).

Even if your database has constraints, validating early prevents wasted work and reduces the risk of denial-of-service patterns.
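All three kinds of limits can be declared directly on a boundary model; the `CommentInput` below is a hypothetical sketch:

```python
from pydantic import BaseModel, Field, ValidationError


class CommentInput(BaseModel):
    body: str = Field(min_length=1, max_length=2000)                 # bound free-form text
    tag_ids: list[int] = Field(default_factory=list, max_length=20)  # bound batch size
    page_size: int = Field(default=50, ge=1, le=200)                 # bound counts


# An oversized list is rejected before any downstream work happens.
try:
    CommentInput.model_validate({'body': 'hi', 'tag_ids': list(range(1000))})
except ValidationError as e:
    print(e.errors()[0]['loc'])  # ('tag_ids',)
```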

Be deliberate about permissive parsing

Permissive parsing can be convenient, but it can also hide client bugs. A good compromise is:

  • Be permissive for public-facing inputs where you expect variability (e.g., booleans from forms).
  • Be strict for internal events and service-to-service boundaries where producers are controlled.

Prefer explicit normalization rules

Normalization should be deterministic and documented. Examples:

  • Lowercase emails (if your system treats them case-insensitively).
  • Normalize datetimes to UTC.
  • Strip whitespace and collapse internal whitespace for search queries.
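For instance, a deterministic email rule can live in a single validator; the `LoginInput` model here is a hypothetical sketch:

```python
from pydantic import BaseModel, field_validator


class LoginInput(BaseModel):
    email: str

    @field_validator('email')
    @classmethod
    def normalize_email(cls, v: str) -> str:
        # Documented, deterministic rule: trim, then lowercase.
        return v.strip().lower()


print(LoginInput.model_validate({'email': '  Alice@Example.COM '}).email)
# alice@example.com
```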

Testing Boundary Schemas Like Any Other Critical Component

Because boundary schemas define what your system accepts, they deserve focused tests. The most valuable tests are table-driven: a list of payloads that should pass and a list that should fail with specific error locations.

import pytest

from pydantic import ValidationError

def test_create_user_rejects_unknown_fields():
    payload = {
        'email': 'a@example.com',
        'display_name': 'Alice',
        'unknown': 123,
    }
    with pytest.raises(ValidationError) as exc:
        CreateUserInput.model_validate(payload)
    errors = exc.value.errors()
    assert errors[0]['loc'] == ('unknown',)

def test_create_user_trims_display_name():
    payload = {'email': 'a@example.com', 'display_name': '  Alice  '}
    inp = CreateUserInput.model_validate(payload)
    assert inp.display_name == 'Alice'

These tests act as executable documentation for clients and for future maintainers.

Now answer the exercise about the content:

What is the main purpose of using a boundary validation pipeline that parses, validates, normalizes, and then hands off input data?

Answer: The boundary layer is meant to turn raw, unreliable input into safe, normalized, well-typed objects, and to return structured, actionable errors when validation fails, without leaking boundary concerns into the domain.

Next chapter

Error Modeling and Validation Feedback Design
