Data Models for AI Applications: Pydantic vs Python Built-ins

When you’re building AI applications, data flows everywhere—API responses from Bedrock, metadata for vector stores, tool inputs for agents, structured outputs from LLMs. Without proper data models, you end up with a mess of dictionaries, mysterious KeyError exceptions, and code that’s impossible to refactor.

This tutorial compares Python’s data modeling options and shows you when to use each one in AI/ML applications.

The Problem: Dictionary Hell

Here’s code I see in most AI prototypes:

# The "it works" approach
def process_embedding_result(result):
    vector = result["embedding"]
    metadata = result.get("metadata", {})
    score = metadata.get("similarity_score", 0.0)

    # Wait, is score a float or string?
    # What keys does metadata have?
    # What happens if embedding is missing?

    return {
        "id": metadata["id"],  # KeyError if missing
        "vector": vector,
        "score": float(score)  # TypeError if score is None
    }

This code has three problems:

No validation - Bad data causes cryptic errors downstream
No documentation - What shape is result? What’s in metadata?
No IDE support - No autocomplete, no type checking

Let’s fix this systematically.

Option 1: Plain Dictionaries

Use when: Quick scripts, throwaway code, or when schema is truly dynamic.

# Simple but fragile
embedding_result = {
    "id": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "metadata": {"source": "arxiv", "year": 2024}
}

# Access with .get() for safety
source = embedding_result.get("metadata", {}).get("source", "unknown")

Pros:

Zero overhead
Maximum flexibility
Works with any JSON

Cons:

No validation
No autocomplete
No documentation
Runtime errors from typos (metdata vs metadata)
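
The typo failure mode is easy to reproduce. Here is a minimal sketch that reuses the embedding_result dictionary from above: with .get() the misspelled key fails silently, and with [] it fails only when that line actually runs.

```python
embedding_result = {
    "id": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "metadata": {"source": "arxiv", "year": 2024},
}

# Misspelled key with .get(): no error, just the fallback value
source = embedding_result.get("metdata", {}).get("source", "unknown")
print(source)  # unknown

# Misspelled key with []: raises, but only at runtime
try:
    embedding_result["metdata"]
except KeyError as exc:
    error = f"KeyError: {exc}"
    print(error)  # KeyError: 'metdata'
```

The silent case is the more dangerous one: the pipeline keeps running with "unknown" as the source.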

Option 2: dataclasses (Python 3.7+)

Use when: Internal data structures, configuration objects, or when you want type hints without validation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EmbeddingMetadata:
    source: str
    year: int
    author: Optional[str] = None

@dataclass
class EmbeddingResult:
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = 0.0

Now you get autocomplete and type checking:

result = EmbeddingResult(
    id="doc-123",
    vector=[0.1, 0.2, 0.3],
    metadata=EmbeddingMetadata(source="arxiv", year=2024)
)

# IDE knows these exist
print(result.metadata.source)  # "arxiv"
print(result.score)  # 0.0

Pros:

Type hints for IDE support
Default values
Auto-generated __init__, __repr__, __eq__
Built into Python (no dependencies)

Cons:

No runtime validation (wrong types silently accepted)
Manual serialization to/from JSON
No coercion ("2024" won’t become 2024)

# This "works" but is wrong - no validation!
bad_result = EmbeddingResult(
    id=123,  # Should be str, but Python accepts it
    vector="not a list",  # Completely wrong type
    metadata={"source": "arxiv"}  # Should be EmbeddingMetadata
)
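
To make the "manual serialization" point concrete, here is a rough sketch of a JSON round trip using only the standard library. The model definitions repeat the ones above; the variable names payload, raw, and restored are illustrative.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class EmbeddingMetadata:
    source: str
    year: int
    author: Optional[str] = None

@dataclass
class EmbeddingResult:
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = 0.0

result = EmbeddingResult(
    id="doc-123",
    vector=[0.1, 0.2, 0.3],
    metadata=EmbeddingMetadata(source="arxiv", year=2024),
)

# To JSON: asdict() recursively converts nested dataclasses to dicts
payload = json.dumps(asdict(result))

# From JSON: nested objects have to be rebuilt by hand
raw = json.loads(payload)
restored = EmbeddingResult(
    id=raw["id"],
    vector=raw["vector"],
    metadata=EmbeddingMetadata(**raw["metadata"]),
    score=raw["score"],
)
print(restored == result)  # True
```

Every new nested type adds another line to the rebuild step, which is the bookkeeping a validation library takes over.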

Option 3: TypedDict (Python 3.8+)

Use when: You need type hints for dictionary access patterns, especially with JSON APIs.

# NotRequired needs Python 3.11+; on older versions, import it from typing_extensions
from typing import TypedDict, NotRequired

class EmbeddingMetadata(TypedDict):
    source: str
    year: int
    author: NotRequired[str]  # Optional key

class EmbeddingResult(TypedDict):
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: NotRequired[float]

TypedDict gives you autocomplete while keeping dictionary syntax:

result: EmbeddingResult = {
    "id": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "metadata": {"source": "arxiv", "year": 2024}
}

# IDE knows the keys
print(result["metadata"]["source"])  # Autocomplete works!

Pros:

Type hints for dictionaries
Works with existing dict-based code
No runtime overhead (it’s just a dict)
Great for API response typing

Cons:

No runtime validation (type hints only)
Still uses [] access (can raise KeyError)
No methods or computed properties

# Type checker catches this, but runtime doesn't
bad: EmbeddingResult = {
    "id": 123,  # Type error in IDE, but runs fine
    "vector": [0.1, 0.2]
}
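
If you want a cheap runtime check without leaving TypedDict, the class records which keys are required. The sketch below uses a trimmed-down model (no metadata field) and a hypothetical helper named missing_keys; it checks key presence only, not value types.

```python
from typing import TypedDict

class EmbeddingResultBase(TypedDict):
    id: str
    vector: list[float]

class EmbeddingResult(EmbeddingResultBase, total=False):
    score: float  # optional key, expressed without NotRequired

def missing_keys(data: dict) -> set[str]:
    """Return the required keys that are absent from data (hypothetical helper)."""
    return set(EmbeddingResult.__required_keys__) - set(data)

bad: EmbeddingResult = {"id": 123}  # a type checker complains; the runtime does not
print(type(bad) is dict)            # True
print(missing_keys(bad))            # {'vector'}
```

Note that the wrong type for "id" still passes this check, which is the gap Pydantic closes.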

Option 4: Pydantic (The AI/ML Standard)

Use when: API boundaries, external data, configuration, or anywhere you need validation.

from pydantic import BaseModel, Field
from typing import Optional

class EmbeddingMetadata(BaseModel):
    source: str
    year: int
    author: Optional[str] = None

class EmbeddingResult(BaseModel):
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = Field(default=0.0, ge=0.0, le=1.0)

Pydantic validates at runtime and coerces types:

# This works - Pydantic coerces types
result = EmbeddingResult(
    id="doc-123",
    vector=[0.1, 0.2, 0.3],
    metadata={"source": "arxiv", "year": "2024"}  # str -> int
)

print(result.metadata.year)  # 2024 (int, not str)
print(type(result.metadata.year))  # <class 'int'>

Validation catches errors immediately:

from pydantic import ValidationError

try:
    bad_result = EmbeddingResult(
        id="doc-123",
        vector="not a list",  # Wrong type
        metadata={"source": "arxiv"}  # Missing required field
    )
except ValidationError as e:
    print(e)

Output

2 validation errors for EmbeddingResult
vector
  Input should be a valid list [type=list_type, input_value='not a list', input_type=str]
metadata.year
  Field required [type=missing, input_value={'source': 'arxiv'}, input_type=dict]
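
When you need to act on failures rather than print them (for example, to log them or send them back to an LLM for a retry), the same information is available as structured data. This sketch assumes Pydantic v2 and repeats the models from above; the problems list is an illustrative name.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class EmbeddingMetadata(BaseModel):
    source: str
    year: int
    author: Optional[str] = None

class EmbeddingResult(BaseModel):
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = Field(default=0.0, ge=0.0, le=1.0)

try:
    EmbeddingResult(
        id="doc-123",
        vector="not a list",
        metadata={"source": "arxiv"},
    )
except ValidationError as e:
    # Each entry is a dict with keys such as "loc", "type", and "msg"
    problems = [(err["loc"], err["type"]) for err in e.errors()]

for loc, kind in problems:
    print(".".join(str(part) for part in loc), "->", kind)
```

The loc tuple is the path to the failing field, so nested errors such as metadata.year can be traced without parsing the message text.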

Pros:

Runtime validation with clear error messages
Automatic type coercion
JSON serialization built-in
Field constraints (ge/le bounds, min_length/max_length, pattern, etc.)
Nested model support
Integration with FastAPI, LangChain, Instructor

Cons:

External dependency
Slower than raw dicts and dataclasses (though v2, with its Rust core, keeps the overhead small)
Learning curve for advanced features

Pydantic for AI Applications

Let’s see Pydantic in real AI/ML scenarios.

Example 1: Bedrock API Responses

from pydantic import BaseModel
from typing import Literal

class BedrockMessage(BaseModel):
    role: Literal["user", "assistant"]
    content: str

class BedrockUsage(BaseModel):
    input_tokens: int
    output_tokens: int

class BedrockResponse(BaseModel):
    id: str
    model: str
    # Add "tool_use" here if the model is allowed to call tools
    stop_reason: Literal["end_turn", "max_tokens", "stop_sequence"]
    usage: BedrockUsage
    content: list[dict]

    @property
    def text(self) -> str:
        """Extract text from content blocks."""
        for block in self.content:
            if block.get("type") == "text":
                return block.get("text", "")
        return ""


# Parse API response safely
import json

def call_bedrock(prompt: str) -> BedrockResponse:
    # ... boto3 call that assigns `response` ...
    response_body = json.loads(response["body"].read())
    return BedrockResponse(**response_body)

result = call_bedrock("Explain embeddings")
print(result.text)  # Clean access
print(result.usage.output_tokens)  # Type-safe

Example 2: Vector Store Metadata

from pydantic import BaseModel, Field
from datetime import datetime
from typing import Optional

class DocumentMetadata(BaseModel):
    """Metadata stored with each vector in S3 Vectors."""
    doc_id: str = Field(..., min_length=1)
    title: str
    source: str
    chunk_index: int = Field(ge=0)
    total_chunks: int = Field(ge=1)
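    # Note: datetime.utcnow is deprecated since Python 3.12;
    # datetime.now(timezone.utc) is the timezone-aware replacement
    created_at: datetime = Field(default_factory=datetime.utcnow)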
    url: Optional[str] = None

    # Computed property
    @property
    def is_first_chunk(self) -> bool:
        return self.chunk_index == 0

# Serialize for S3 Vectors (which needs flat dicts)
metadata = DocumentMetadata(
    doc_id="paper-123",
    title="Attention Is All You Need",
    source="arxiv",
    chunk_index=0,
    total_chunks=5
)

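# Convert to dict for storage
storage_dict = metadata.model_dump(mode="json")
# Print as JSON (printing the dict itself would show Python's repr, e.g. None instead of null)
print(metadata.model_dump_json(indent=2))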

Output

{
  "doc_id": "paper-123",
  "title": "Attention Is All You Need",
  "source": "arxiv",
  "chunk_index": 0,
  "total_chunks": 5,
  "created_at": "2026-01-01T12:00:00",
  "url": null
}

Example 3: Agent Tool Contracts

Define clear input/output contracts for agent tools:

from pydantic import BaseModel, Field
from typing import Literal

class WeatherRequest(BaseModel):
    """Input for weather tool."""
    location: str = Field(..., description="City name or coordinates")
    units: Literal["celsius", "fahrenheit"] = "celsius"

class WeatherResponse(BaseModel):
    """Output from weather tool."""
    location: str
    temperature: float
    conditions: str
    humidity: int = Field(ge=0, le=100)

def get_weather(request: WeatherRequest) -> WeatherResponse:
    """Tool with typed contract."""
    # ... API call ...
    return WeatherResponse(
        location=request.location,
        temperature=72.5,
        conditions="sunny",
        humidity=45
    )

# Agent can inspect the schema
import json
print(json.dumps(WeatherRequest.model_json_schema(), indent=2))

Output

{
  "description": "Input for weather tool.",
  "properties": {
    "location": {
      "description": "City name or coordinates",
      "title": "Location",
      "type": "string"
    },
    "units": {
      "default": "celsius",
      "enum": [
        "celsius",
        "fahrenheit"
      ],
      "title": "Units",
      "type": "string"
    }
  },
  "required": [
    "location"
  ],
  "title": "WeatherRequest",
  "type": "object"
}
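
The contract also works in the other direction: arguments produced by an LLM can be validated before the tool runs. This is a sketch under the assumption that the agent hands you the arguments as a JSON string; parse_tool_args is a hypothetical helper, not part of any framework.

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class WeatherRequest(BaseModel):
    """Input for weather tool."""
    location: str = Field(..., description="City name or coordinates")
    units: Literal["celsius", "fahrenheit"] = "celsius"

def parse_tool_args(raw_json: str):
    """Hypothetical helper: return (request, None) or (None, error text)."""
    try:
        return WeatherRequest.model_validate_json(raw_json), None
    except ValidationError as e:
        return None, str(e)

ok, _ = parse_tool_args('{"location": "Paris"}')
print(ok.units)  # celsius (default applied)

bad, error = parse_tool_args('{"location": "Paris", "units": "kelvin"}')
print(bad is None)  # True; error names the offending field
```

Returning the error text instead of raising makes it straightforward to feed the message back to the model and ask it to correct its arguments.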

Example 4: Configuration Management

from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field

class AIConfig(BaseSettings):
    """Configuration from environment variables."""

    # Settings config in Pydantic v2 style (the inner `class Config` is deprecated)
    model_config = SettingsConfigDict(env_prefix="AI_")  # AI_MODEL_ID, AI_MAX_TOKENS, etc.

    # Required; env_prefix does not apply to aliased fields, so this reads AWS_REGION
    aws_region: str = Field(alias="AWS_REGION")

    # With defaults
    model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"
    max_tokens: int = Field(default=1024, ge=1, le=4096)
    temperature: float = Field(default=0.7, ge=0.0, le=1.0)

# Loads from environment automatically
config = AIConfig()
print(config.model_id)
print(config.max_tokens)

Performance Comparison

How much overhead does each option add?

import timeit
from dataclasses import dataclass
from pydantic import BaseModel

# Test data
data = {"id": "123", "value": 42.0, "tags": ["a", "b"]}

# Plain dict
def dict_access():
    return data["id"], data["value"], data["tags"]

# dataclass
@dataclass
class DataclassModel:
    id: str
    value: float
    tags: list[str]

def dataclass_create():
    return DataclassModel(**data)

# Pydantic
class PydanticModel(BaseModel):
    id: str
    value: float
    tags: list[str]

def pydantic_create():
    return PydanticModel(**data)

# Benchmark
print("Dict access:", timeit.timeit(dict_access, number=100000))
print("Dataclass:", timeit.timeit(dataclass_create, number=100000))
print("Pydantic:", timeit.timeit(pydantic_create, number=100000))

Output

Dict access: 0.012s
Dataclass: 0.089s
Pydantic: 0.342s

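Pydantic is ~4x slower than dataclass for object creation. (The dict figure measures only key lookups on an existing dict, so treat it as a floor rather than a like-for-like comparison.) But consider: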

0.342s for 100,000 objects = 3.4 microseconds each
A single Bedrock API call takes 500-2000ms
Network I/O dominates AI application performance
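
The arithmetic behind those bullets can be checked directly. This sketch plugs in the timing reported above and the lower end of the quoted API latency range; the variable names are illustrative.

```python
total_seconds = 0.342     # Pydantic timing reported above
objects = 100_000         # number of objects created in the benchmark
api_call_seconds = 0.5    # fastest Bedrock call in the quoted 500-2000ms range

# Cost of validating one object, in microseconds
per_object_us = total_seconds / objects * 1e6

# How many validations fit inside a single API call
validations_per_call = api_call_seconds / (total_seconds / objects)

print(f"{per_object_us:.2f} microseconds per object")
print(f"~{validations_per_call:,.0f} validations per 500 ms API call")
```

Even against the fastest call in that range, validation cost is about five orders of magnitude below the network round trip.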

Decision Framework

Use this flowchart to choose:

Is data from external source (API, file, user)?
├── Yes → Pydantic (validation required)
└── No → Is it configuration?
    ├── Yes → Pydantic Settings
    └── No → Do you need methods/properties?
        ├── Yes → dataclass or Pydantic
        └── No → Is schema fixed?
            ├── Yes → TypedDict
            └── No → Plain dict
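
The same flowchart can be written as a small function, which is handy as a checklist in code review. This is only a sketch of the tree above; the function name and parameter names are made up for illustration.

```python
def choose_data_model(
    external: bool,
    is_config: bool = False,
    needs_methods: bool = False,
    fixed_schema: bool = False,
) -> str:
    """Walk the decision tree top to bottom and return the recommendation."""
    if external:
        return "Pydantic"            # validation required
    if is_config:
        return "Pydantic Settings"
    if needs_methods:
        return "dataclass or Pydantic"
    if fixed_schema:
        return "TypedDict"
    return "Plain dict"

print(choose_data_model(external=True))                       # Pydantic
print(choose_data_model(external=False, fixed_schema=True))   # TypedDict
print(choose_data_model(external=False))                      # Plain dict
```

The order of the checks matters: external data wins over every other consideration, mirroring the top branch of the flowchart.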

Quick Reference:

Scenario	Recommendation
Bedrock/OpenAI responses	Pydantic
Vector metadata	Pydantic
Agent tool I/O	Pydantic
Internal data structures	dataclass
Config from env vars	pydantic-settings
JSON API typing	TypedDict
Dynamic/unknown schema	dict
Performance-critical inner loop	dataclass or dict

Migration Path

Already have dict-heavy code? Here’s how to migrate incrementally:

Step 1: Add TypedDict for Documentation

# Before
def process_result(result: dict) -> dict:
    return {"id": result["id"], "score": result["score"]}

# After - just add types, no behavior change
from typing import TypedDict

class InputResult(TypedDict):
    id: str
    score: float
    metadata: dict

class OutputResult(TypedDict):
    id: str
    score: float

def process_result(result: InputResult) -> OutputResult:
    return {"id": result["id"], "score": result["score"]}

Step 2: Convert Critical Paths to Pydantic

# Validate at API boundaries
from pydantic import BaseModel, Field

class InputResult(BaseModel):
    id: str
    score: float = Field(ge=0.0, le=1.0)
    metadata: dict = Field(default_factory=dict)

def process_result(result: InputResult) -> OutputResult:
    # Now result is validated
    return {"id": result.id, "score": result.score}
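
One consequence of this step is that callers still holding plain dicts need to convert them at the boundary. A minimal sketch of that call site, assuming Pydantic v2; the raw payloads are made-up examples and the return type is simplified to dict.

```python
from pydantic import BaseModel, Field, ValidationError

class InputResult(BaseModel):
    id: str
    score: float = Field(ge=0.0, le=1.0)
    metadata: dict = Field(default_factory=dict)

def process_result(result: InputResult) -> dict:
    return {"id": result.id, "score": result.score}

# Existing callers hold plain dicts, so validate once at the boundary
raw = {"id": "doc-123", "score": "0.87"}  # score arrives as a string
output = process_result(InputResult.model_validate(raw))
print(output)  # {'id': 'doc-123', 'score': 0.87}

# Out-of-range scores are rejected before process_result ever runs
try:
    InputResult.model_validate({"id": "doc-456", "score": 1.7})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # True
```

Keeping the conversion at the call site means process_result itself never sees unvalidated data.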

Step 3: Replace Output Dicts

class OutputResult(BaseModel):
    id: str
    score: float

def process_result(result: InputResult) -> OutputResult:
    return OutputResult(id=result.id, score=result.score)

Full Example: Embedding Pipeline

Here’s a complete example combining everything:

"""
Embedding pipeline with proper data models.
"""
from pydantic import BaseModel, Field
from typing import Optional
from datetime import datetime
import json

# --- Models ---

class Document(BaseModel):
    """Input document to embed."""
    id: str
    content: str
    source: str
    url: Optional[str] = None
    # Note: datetime.utcnow is deprecated since Python 3.12 (this also applies to generated_at below)
    created_at: datetime = Field(default_factory=datetime.utcnow)

class EmbeddingConfig(BaseModel):
    """Configuration for embedding generation."""
    model_id: str = "amazon.titan-embed-text-v2:0"
    dimension: int = 1024
    normalize: bool = True

class DocumentEmbedding(BaseModel):
    """Document with its embedding."""
    document: Document
    vector: list[float]
    model_id: str
    generated_at: datetime = Field(default_factory=datetime.utcnow)

    @property
    def storage_metadata(self) -> dict:
        """Flatten for vector store."""
        return {
            "doc_id": self.document.id,
            "source": self.document.source,
            "url": self.document.url or "",
            "created_at": self.document.created_at.isoformat(),
        }

class SearchResult(BaseModel):
    """Result from similarity search."""
    document: Document
    score: float = Field(ge=0.0, le=1.0)
    rank: int = Field(ge=1)

class SearchResponse(BaseModel):
    """Response from search endpoint."""
    query: str
    results: list[SearchResult]
    total_results: int
    search_time_ms: float

# --- Usage ---

def embed_document(doc: Document, config: EmbeddingConfig) -> DocumentEmbedding:
    """Generate embedding for a document."""
    # ... call Bedrock ...
    vector = [0.1] * config.dimension  # Placeholder

    return DocumentEmbedding(
        document=doc,
        vector=vector,
        model_id=config.model_id
    )

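def search(query: str, results: list[dict]) -> SearchResponse:
    """Parse search results into typed response."""
    # Assumes each r["metadata"] carries the Document fields (id, content, source, ...),
    # not the flattened storage_metadata dict, and that r["distance"] lies in [0, 1]
    # so the computed score passes the ge/le constraint on SearchResult.
    search_results = [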
        SearchResult(
            document=Document(**r["metadata"]),
            score=1 - r["distance"],
            rank=i + 1
        )
        for i, r in enumerate(results)
    ]

    return SearchResponse(
        query=query,
        results=search_results,
        total_results=len(search_results),
        search_time_ms=42.5
    )

# Example usage
doc = Document(
    id="paper-001",
    content="Attention mechanisms allow models to focus on relevant parts...",
    source="arxiv",
    url="https://arxiv.org/abs/1706.03762"
)

config = EmbeddingConfig()
embedding = embed_document(doc, config)

print(f"Embedded {embedding.document.id}")
print(f"Vector dimension: {len(embedding.vector)}")
print(f"Storage metadata: {embedding.storage_metadata}")

Output

Embedded paper-001
Vector dimension: 1024
Storage metadata: {'doc_id': 'paper-001', 'source': 'arxiv', 'url': 'https://arxiv.org/abs/1706.03762', 'created_at': '2026-01-01T12:00:00'}

What’s Next

You’ve learned when to use each data modeling approach in AI applications. Key takeaways:

Use Pydantic at boundaries - API inputs, external data, configuration
Use dataclasses internally - When you control the data source
Use TypedDict for gradual typing - Add types to existing dict code
Don’t over-engineer - Plain dicts are fine for simple, local use
