
Python Data Modeling in Practice: Dataclasses, Pydantic, and Type Hints

Serialization and Deserialization Strategies

Chapter 9

Estimated reading time: 12 minutes


Why serialization strategy matters

Serialization is the act of turning an in-memory Python object into a transferable representation (most commonly JSON, but also dicts, bytes, or database rows). Deserialization is the reverse: taking external data and constructing an in-memory object. In real systems, these operations sit at boundaries: HTTP APIs, message queues, caches, persistence layers, and background jobs. A “strategy” is the set of decisions you make about formats, versioning, validation, error handling, and how much of your internal model you expose.

A good strategy makes data exchange predictable and safe. A weak strategy creates subtle bugs: timezone drift, float rounding, missing fields that silently become defaults, or incompatible payloads after a refactor. This chapter focuses on practical patterns for Python data modeling using dataclasses, Pydantic, and type hints, without rehashing earlier modeling topics.

Core decisions in a serialization/deserialization strategy

1) Choose the wire format and the “shape”

The wire format is the encoding used to transmit or store data: JSON, MessagePack, Avro, Protobuf, or a database schema. The “shape” is the structure of fields and nesting. Even with JSON, you still must decide on conventions such as snake_case vs. camelCase keys, whether to include nulls, and how to represent dates and decimals.

  • JSON: ubiquitous, human-readable, but limited types (no native datetime/decimal/bytes).
  • Binary formats: smaller and faster, but require schema tooling and are harder to debug.
  • Database rows: often require mapping between columns and object fields, plus handling joins and aggregates.
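
To make JSON's type limits concrete, here is a minimal sketch of what happens when you hand json.dumps raw Python types:

import json
from datetime import datetime, timezone
from decimal import Decimal

# json.dumps has no built-in encoder for datetime or Decimal:
try:
    json.dumps({"issued_at": datetime.now(timezone.utc), "total": Decimal("19.99")})
except TypeError as exc:
    print(exc)  # "Object of type datetime is not JSON serializable"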

2) Define a stable external contract

External payloads should be treated as contracts. Internal refactors should not automatically change the serialized output unless you intentionally version or migrate the contract. This typically implies having a dedicated “DTO” (data transfer object) representation, or at least explicit serialization rules, rather than relying on “whatever __dict__ contains”.

3) Decide how to handle unknown, missing, and extra fields

When deserializing, you will encounter:

  • Missing fields: older producers, partial updates, or optional data.
  • Extra fields: newer producers, forward compatibility, or vendor extensions.
  • Unknown field types: strings where numbers are expected, timestamps in different formats, etc.

Your strategy should define whether to reject, ignore, store, or log these cases. Pydantic makes these decisions explicit via configuration.

4) Decide canonical representations for tricky types

Some Python types do not map cleanly to JSON. Choose canonical representations and stick to them:

  • datetime: ISO 8601 strings with timezone offset (prefer UTC with “Z”).
  • date: ISO 8601 date string (YYYY-MM-DD).
  • Decimal: string (to avoid float rounding) or integer minor units (e.g., cents).
  • UUID: string.
  • bytes: base64 string.
  • Enum: string value (stable) rather than integer index (fragile).
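
The dataclass example in the next section covers datetime, Decimal, and UUID; for Enum and bytes, a minimal sketch of the string representations looks like this:

import base64
from enum import Enum

class Status(str, Enum):
    OPEN = "open"
    PAID = "paid"

# Enum: emit the stable string value, not a positional integer.
assert Status.PAID.value == "paid"

# bytes: base64 keeps binary data JSON-safe and round-trippable.
raw = b"\x00\x01binary"
encoded = base64.b64encode(raw).decode("ascii")
assert base64.b64decode(encoded) == raw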

Dataclasses: explicit serialization without magic

Dataclasses are lightweight and do not impose a serialization policy. That is a feature: you can keep domain objects clean and write explicit adapters at the boundary. The most common baseline is dataclasses.asdict, but it is not a full strategy: it eagerly converts nested dataclasses to dicts (which might not match your desired external shape), and it leaves types like datetime, Decimal, and UUID untouched, so the result may still not be JSON-serializable.

Pattern: write boundary mappers (to_dict/from_dict)

A pragmatic approach is to implement serialization functions in a separate module (or on a dedicated DTO), keeping the domain object free of transport concerns.

from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from uuid import UUID

@dataclass(frozen=True)
class Invoice:
    id: UUID
    issued_at: datetime
    total: Decimal


def invoice_to_dict(inv: Invoice) -> dict:
    return {
        "id": str(inv.id),
        "issued_at": inv.issued_at.astimezone(timezone.utc).isoformat().replace("+00:00", "Z"),
        "total": str(inv.total),
    }


def invoice_from_dict(data: dict) -> Invoice:
    return Invoice(
        id=UUID(data["id"]),
        issued_at=_parse_utc_datetime(data["issued_at"]),
        total=Decimal(data["total"]),
    )


def _parse_utc_datetime(value: str) -> datetime:
    # Minimal ISO8601 handling for "Z".
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        raise ValueError("issued_at must include timezone")
    return dt.astimezone(timezone.utc)

This is explicit, testable, and stable. It also forces you to decide the external contract (strings for UUID and Decimal, ISO timestamps in UTC).
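
A quick usage check confirms the round trip is lossless:

from uuid import uuid4

inv = Invoice(id=uuid4(), issued_at=datetime.now(timezone.utc), total=Decimal("19.99"))
payload = invoice_to_dict(inv)
assert invoice_from_dict(payload) == inv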

Step-by-step: adding versioning to a dict payload

Versioning is a serialization concern. One simple strategy is embedding a version field and migrating on read.

def invoice_to_payload(inv: Invoice) -> dict:
    return {
        "_type": "invoice",
        "_version": 2,
        "data": invoice_to_dict(inv),
    }


def invoice_from_payload(payload: dict) -> Invoice:
    if payload.get("_type") != "invoice":
        raise ValueError("unexpected payload type")

    version = int(payload.get("_version", 1))
    data = payload["data"] if "data" in payload else payload

    if version == 1:
        # v1 used "issuedAt" and "amount" keys
        migrated = {
            "id": data["id"],
            "issued_at": data["issuedAt"],
            "total": data["amount"],
        }
        return invoice_from_dict(migrated)

    if version == 2:
        return invoice_from_dict(data)

    raise ValueError(f"unsupported version: {version}")

Step-by-step, the strategy is: (1) write a versioned envelope, (2) keep old readers working by migrating input, (3) keep the domain constructor unchanged, (4) test migrations with fixtures.
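
For example, a fixture with the legacy v1 keys should still parse through the migration path:

legacy = {
    "_type": "invoice",
    "_version": 1,
    "data": {
        "id": "b3b2b7b2-2f2a-4a6d-9b7b-9d7a8c2d2c0a",
        "issuedAt": "2025-01-05T10:00:00Z",
        "amount": "19.99",
    },
}

inv = invoice_from_payload(legacy)
assert inv.total == Decimal("19.99")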

Common pitfalls with dataclass serialization

  • Accidental leakage of internal fields: if you rely on __dict__ or asdict, you might expose caches, computed fields, or internal flags.
  • Datetime without timezone: naive datetimes serialize but become ambiguous across systems.
  • Decimal to float: converting to float can lose cents; prefer string or integer minor units.
  • Renaming fields: refactors break consumers unless you alias or version.

Pydantic: structured parsing and controlled dumping

Pydantic is often used at boundaries because it provides robust parsing (deserialization) and controlled dumping (serialization). It can accept messy input (strings for numbers, different datetime formats) and produce a normalized output. This makes it a strong choice for API schemas, message payloads, and configuration files.

The key idea is to separate “input acceptance” from “output emission”. You can be liberal in what you accept (within reason) and strict in what you emit.

Defining a boundary schema with Pydantic

from datetime import datetime
from decimal import Decimal
from uuid import UUID

from pydantic import BaseModel, ConfigDict, Field

class InvoiceDTO(BaseModel):
    model_config = ConfigDict(
        extra="forbid",            # reject unknown fields
        populate_by_name=True,     # accept field names in addition to aliases
        str_strip_whitespace=True,
    )

    id: UUID
    issued_at: datetime = Field(alias="issuedAt")
    total: Decimal = Field(alias="amount")

This DTO accepts input keys issuedAt and amount but stores them as issued_at and total internally. With extra="forbid", unexpected fields become explicit errors, which is often desirable for public APIs.

Step-by-step: parsing input, validating, and emitting canonical JSON

import json

raw = '{"id":"b3b2b7b2-2f2a-4a6d-9b7b-9d7a8c2d2c0a","issuedAt":"2025-01-05T10:00:00Z","amount":"19.99"}'

# 1) Parse JSON to Python primitives
payload = json.loads(raw)

# 2) Validate and coerce types
invoice_dto = InvoiceDTO.model_validate(payload)

# 3) Emit canonical JSON-ready dict
out_dict = invoice_dto.model_dump(
    by_alias=False,   # use snake_case keys in output
    mode="json",     # ensure JSON-compatible types
    exclude_none=True,
)

# 4) Encode to JSON string
out_json = json.dumps(out_dict, separators=(",", ":"), sort_keys=True)

print(out_json)

Important details: mode="json" ensures types like UUID, datetime, and Decimal are converted to JSON-friendly representations. exclude_none=True avoids emitting nulls unless you want them.

Controlling extra fields: forbid, ignore, or allow

Different boundaries require different behavior:

  • Public API: often extra="forbid" to detect client mistakes early.
  • Event ingestion: sometimes extra="ignore" for forward compatibility.
  • Pass-through gateways: occasionally extra="allow" to preserve unknown fields.

class LenientEvent(BaseModel):
    model_config = ConfigDict(extra="ignore")
    event_id: str
    occurred_at: datetime

With extra="ignore", producers can add fields without breaking older consumers, but you must ensure you are not silently ignoring critical data changes. A common compromise is to ignore extras but log them.

Bridging domain objects and DTOs

A common architecture is: domain objects remain focused on business meaning, while DTOs handle serialization. The mapping layer becomes the single place where field names, formats, and versioning are handled.

Step-by-step: mapping a Pydantic DTO to a dataclass domain object

from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from uuid import UUID

from pydantic import BaseModel, ConfigDict, Field

@dataclass(frozen=True)
class Invoice:
    id: UUID
    issued_at: datetime
    total: Decimal

class InvoiceDTO(BaseModel):
    model_config = ConfigDict(extra="forbid", populate_by_name=True)
    id: UUID
    issued_at: datetime = Field(alias="issuedAt")
    total: Decimal = Field(alias="amount")

    def to_domain(self) -> Invoice:
        return Invoice(id=self.id, issued_at=self.issued_at, total=self.total)

    @classmethod
    def from_domain(cls, inv: Invoice) -> "InvoiceDTO":
        return cls(id=inv.id, issuedAt=inv.issued_at, amount=inv.total)

This keeps the domain object free of JSON concerns while still providing a convenient conversion path. Note that from_domain constructs the DTO with the alias names (issuedAt, amount), which Pydantic accepts by default for aliased fields; populate_by_name=True additionally lets callers use the snake_case field names.

Alternative: Pydantic dataclasses

If you want dataclass ergonomics with Pydantic validation, Pydantic supports dataclasses. This can reduce duplication, but it also couples your domain model to a boundary library. Use it when that coupling is acceptable for your project.
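
A minimal sketch, assuming Pydantic v2's pydantic.dataclasses module:

from datetime import datetime
from decimal import Decimal
from uuid import UUID

from pydantic.dataclasses import dataclass

@dataclass(frozen=True)
class ValidatedInvoice:
    id: UUID
    issued_at: datetime
    total: Decimal

# Unlike a plain dataclass, construction validates and coerces input:
inv = ValidatedInvoice(
    id="b3b2b7b2-2f2a-4a6d-9b7b-9d7a8c2d2c0a",
    issued_at="2025-01-05T10:00:00Z",
    total="19.99",
)
assert isinstance(inv.total, Decimal)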

Serialization policies: canonicalization, determinism, and security

Canonicalization and deterministic output

Deterministic serialization is valuable for caching, signatures, and idempotency keys. For JSON, determinism usually means: stable key ordering, consistent whitespace, and consistent formatting for timestamps and decimals.

import json

def canonical_json(data: dict) -> str:
    return json.dumps(
        data,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )

Pair this with a canonical dump from Pydantic (mode="json") or your own conversion functions to avoid non-JSON types.
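
For example, hashing the canonical form of a JSON-mode dump yields a stable idempotency or cache key:

import hashlib

payload = invoice_dto.model_dump(mode="json")
digest = hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()
# The same logical object always produces the same digest.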

Security: avoid unsafe deserialization

Some Python serialization mechanisms can execute code or reconstruct arbitrary objects. Avoid using pickle for untrusted input. For boundary data, prefer JSON or schema-driven formats, and validate with Pydantic or explicit parsing.
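
For instance, malformed untrusted input fails validation cleanly instead of executing anything:

import json
from pydantic import ValidationError

untrusted = '{"id": "not-a-uuid", "issuedAt": "2025-01-05T10:00:00Z", "amount": "19.99"}'

try:
    InvoiceDTO.model_validate(json.loads(untrusted))
except ValidationError as exc:
    print(exc.error_count(), "validation error(s)")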

Redaction and field-level control

Serialization is also where you decide what not to expose: secrets, internal notes, or personally identifiable information. Make redaction explicit rather than relying on “private” naming conventions.

class UserPublicDTO(BaseModel):
    model_config = ConfigDict(extra="forbid")
    id: str
    display_name: str

# Internal model might have email, roles, etc., but the public DTO does not.

Handling partial updates and patch semantics

Deserialization is not always “create a full object”. For HTTP PATCH or update commands, you often want to accept partial payloads and distinguish between “field missing” and “field explicitly set to null”. Your strategy should define how to represent these states.

Pattern: separate Patch DTO with optional fields

from typing import Optional

class InvoicePatchDTO(BaseModel):
    model_config = ConfigDict(extra="forbid")
    issued_at: Optional[datetime] = None
    total: Optional[Decimal] = None

patch = InvoicePatchDTO.model_validate({"total": "20.00"})
changes = patch.model_dump(exclude_unset=True)
# changes == {"total": Decimal('20.00')}

exclude_unset=True is crucial: it tells you which fields were actually provided. This enables correct patch semantics without guessing.
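
A sketch of applying those changes to the frozen Invoice dataclass from earlier:

from dataclasses import replace

def apply_invoice_patch(inv: Invoice, patch: InvoicePatchDTO) -> Invoice:
    changes = patch.model_dump(exclude_unset=True)  # only fields the client sent
    return replace(inv, **changes)  # works on frozen dataclasses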

Error handling and observability

Deserialization failures are inevitable. A strategy should define:

  • Error shape: how you report validation errors to callers (especially in APIs).
  • Logging: what you log (avoid logging secrets) and how you correlate failures with requests/events.
  • Fallback behavior: whether to drop messages, send to a dead-letter queue, or retry.

Step-by-step: turning Pydantic errors into API-friendly responses

from pydantic import ValidationError

def parse_invoice(payload: dict) -> InvoiceDTO:
    try:
        return InvoiceDTO.model_validate(payload)
    except ValidationError as e:
        # Transform into a stable error format
        details = [
            {
                "loc": list(err["loc"]),
                "msg": err["msg"],
                "type": err["type"],
            }
            for err in e.errors()
        ]
        raise ValueError({"error": "invalid_payload", "details": details})

This keeps your outward error contract stable even if internal validation libraries change their wording. In production, you would likely raise a domain-specific exception type rather than ValueError.

Strategies for backward and forward compatibility

Field aliases and deprecations

When you rename a field, you can accept both names during a transition period. Pydantic aliases help for input; for output, choose one canonical name and stick to it.

class EventDTO(BaseModel):
    model_config = ConfigDict(populate_by_name=True, extra="ignore")
    occurred_at: datetime = Field(alias="occurredAt")

    # You can also accept legacy keys by pre-processing input
    @classmethod
    def model_validate_compat(cls, data: dict) -> "EventDTO":
        if "timestamp" in data and "occurredAt" not in data:
            data = {**data, "occurredAt": data["timestamp"]}
        return cls.model_validate(data)

Schema evolution with envelopes

For events and messages, an envelope with type, version, and data makes routing and migration explicit. It also helps when multiple event types share a transport.

class Envelope(BaseModel):
    model_config = ConfigDict(extra="forbid")
    type: str
    version: int
    data: dict

Then dispatch by type and migrate by version before validating the inner DTO.
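
A dispatch sketch, assuming a hypothetical legacy v1 key named "ts":

def dispatch(raw: dict) -> EventDTO:
    env = Envelope.model_validate(raw)
    if env.type != "event":
        raise ValueError(f"unroutable type: {env.type!r}")

    data = env.data
    if env.version == 1:
        # Hypothetical v1 shape: rename "ts" to the current "occurredAt" key.
        data = dict(data)
        data["occurredAt"] = data.pop("ts")
    return EventDTO.model_validate(data)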

Testing serialization and deserialization

Because serialization is a contract, tests should lock down both directions:

  • Golden files: known JSON fixtures that must continue to parse.
  • Round-trip tests: object → JSON → object, ensuring equality where appropriate.
  • Compatibility tests: old payload versions still parse into the current domain model.
  • Determinism tests: canonical JSON output is stable for the same object.

def test_invoice_round_trip(invoice: Invoice):
    d = invoice_to_dict(invoice)
    parsed = invoice_from_dict(d)
    assert parsed == invoice


def test_invoice_dto_dump_is_json_safe():
    dto = InvoiceDTO.model_validate({
        "id": "b3b2b7b2-2f2a-4a6d-9b7b-9d7a8c2d2c0a",
        "issuedAt": "2025-01-05T10:00:00Z",
        "amount": "19.99",
    })
    dumped = dto.model_dump(mode="json")
    assert isinstance(dumped["id"], str)
    assert isinstance(dumped["issued_at"], str)
    assert isinstance(dumped["total"], str)

These tests encode your strategy: UUID/datetime/Decimal become strings in JSON mode, and round-trips preserve meaning.
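
A golden-file test can lock down the contract as well; the fixture path here is hypothetical:

import json
from pathlib import Path

def test_invoice_golden_fixture():
    # Committed payload that must keep parsing across refactors.
    payload = json.loads(Path("tests/fixtures/invoice_v1.json").read_text())
    inv = invoice_from_payload(payload)
    assert inv.total == Decimal("19.99")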

Now answer the exercise about the content:

When accepting a partial update payload, which approach best preserves correct patch semantics by distinguishing between fields not provided and fields explicitly set to null?

Answer: for PATCH-like updates, a separate Patch DTO with optional fields plus exclude_unset=True keeps only the fields the client actually sent, avoiding confusion between missing fields and explicit nulls.

Next chapter

Pydantic-Style Validation for Robust Input Boundaries
