Data Models for AI Applications: Pydantic vs Python Built-ins

Open Seas · January 1, 2026

Compare Python's data modeling options for AI/ML applications. Learn when to use dataclasses, TypedDict, or Pydantic for API responses, embeddings metadata, and agent tool contracts.

When you’re building AI applications, data flows everywhere—API responses from Bedrock, metadata for vector stores, tool inputs for agents, structured outputs from LLMs. Without proper data models, you end up with a mess of dictionaries, mysterious KeyError exceptions, and code that’s impossible to refactor.

This tutorial compares Python’s data modeling options and shows you when to use each one in AI/ML applications.

The Problem: Dictionary Hell

Here’s code I see in most AI prototypes:

# The "it works" approach
def process_embedding_result(result):
    vector = result["embedding"]
    metadata = result.get("metadata", {})
    score = metadata.get("similarity_score", 0.0)

    # Wait, is score a float or string?
    # What keys does metadata have?
    # What happens if embedding is missing?

    return {
        "id": metadata["id"],  # KeyError if missing
        "vector": vector,
        "score": float(score)  # TypeError if score is None
    }

This code has three problems:

  1. No validation - Bad data causes cryptic errors downstream
  2. No documentation - What shape is result? What’s in metadata?
  3. No IDE support - No autocomplete, no type checking

Let’s fix this systematically.

Option 1: Plain Dictionaries

Use when: Quick scripts, throwaway code, or when schema is truly dynamic.

# Simple but fragile
embedding_result = {
    "id": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "metadata": {"source": "arxiv", "year": 2024}
}

# Access with .get() for safety
source = embedding_result.get("metadata", {}).get("source", "unknown")

Pros:

  • Zero overhead
  • Maximum flexibility
  • Works with any JSON

Cons:

  • No validation
  • No autocomplete
  • No documentation
  • Runtime errors from typos (metdata vs metadata)
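
That last point is the most insidious: with .get(), a misspelled key doesn't raise, it just returns the default. A quick demonstration using the embedding_result dict above:

# The typo ("metdata") fails silently -- .get() hands back the default instead of raising
metadata = embedding_result.get("metdata", {})
print(metadata.get("source", "unknown"))  # prints "unknown"; the bug stays hidden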

Option 2: dataclasses (Python 3.7+)

Use when: Internal data structures, configuration objects, or when you want type hints without validation.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EmbeddingMetadata:
    source: str
    year: int
    author: Optional[str] = None

@dataclass
class EmbeddingResult:
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = 0.0

Now you get autocomplete and type checking:

result = EmbeddingResult(
    id="doc-123",
    vector=[0.1, 0.2, 0.3],
    metadata=EmbeddingMetadata(source="arxiv", year=2024)
)

# IDE knows these exist
print(result.metadata.source)  # "arxiv"
print(result.score)  # 0.0

Pros:

  • Type hints for IDE support
  • Default values
  • Auto-generated __init__, __repr__, __eq__
  • Built into Python (no dependencies)

Cons:

  • No runtime validation (wrong types silently accepted)
  • Manual serialization to/from JSON (see the sketch below)
  • No coercion ("2024" won’t become 2024)

# This "works" but is wrong - no validation!
bad_result = EmbeddingResult(
    id=123,  # Should be str, but Python accepts it
    vector="not a list",  # Completely wrong type
    metadata={"source": "arxiv"}  # Should be EmbeddingMetadata
)
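
And the serialization gap in practice: there is no built-in JSON round-trip, so you combine dataclasses.asdict with the json module yourself. A minimal sketch, reusing the result object created earlier:

from dataclasses import asdict
import json

# asdict() recursively converts nested dataclasses into plain dicts
payload = json.dumps(asdict(result))

# Rebuilding is manual too -- and note that metadata comes back as a plain
# dict, not an EmbeddingMetadata, because dataclasses don't validate or convert
restored = EmbeddingResult(**json.loads(payload))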

Option 3: TypedDict (Python 3.8+)

Use when: You need type hints for dictionary access patterns, especially with JSON APIs.

from typing import TypedDict, NotRequired  # NotRequired needs Python 3.11+ (use typing_extensions on 3.8-3.10)

class EmbeddingMetadata(TypedDict):
    source: str
    year: int
    author: NotRequired[str]  # Optional key

class EmbeddingResult(TypedDict):
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: NotRequired[float]

TypedDict gives you autocomplete while keeping dictionary syntax:

result: EmbeddingResult = {
    "id": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "metadata": {"source": "arxiv", "year": 2024}
}

# IDE knows the keys
print(result["metadata"]["source"])  # Autocomplete works!

Pros:

  • Type hints for dictionaries
  • Works with existing dict-based code
  • No runtime overhead (it’s just a dict)
  • Great for API response typing

Cons:

  • No runtime validation (type hints only)
  • Still uses [] access (can raise KeyError)
  • No methods or computed properties

# Type checker catches this, but runtime doesn't
bad: EmbeddingResult = {
    "id": 123,  # Type error in IDE, but runs fine
    "vector": [0.1, 0.2]
}
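
The "API response typing" advantage in practice: annotate JSON you already parse as a dict, and the IDE and type checker learn the keys while the runtime object stays a plain dict. A minimal sketch with a hypothetical payload:

import json

payload = '{"id": "doc-456", "vector": [0.4, 0.5, 0.6], "metadata": {"source": "web", "year": 2023}}'

# The annotation is purely for the type checker -- json.loads still returns a plain dict
result: EmbeddingResult = json.loads(payload)
print(result["metadata"]["year"])  # autocomplete on keys; typos flagged by mypy/pyright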

Option 4: Pydantic (The AI/ML Standard)

Use when: API boundaries, external data, configuration, or anywhere you need validation.

from pydantic import BaseModel, Field
from typing import Optional

class EmbeddingMetadata(BaseModel):
    source: str
    year: int
    author: Optional[str] = None

class EmbeddingResult(BaseModel):
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = Field(default=0.0, ge=0.0, le=1.0)

Pydantic validates at runtime and coerces types:

# This works - Pydantic coerces types
result = EmbeddingResult(
    id="doc-123",
    vector=[0.1, 0.2, 0.3],
    metadata={"source": "arxiv", "year": "2024"}  # str -> int
)

print(result.metadata.year)  # 2024 (int, not str)
print(type(result.metadata.year))  # <class 'int'>

Validation catches errors immediately:

from pydantic import ValidationError

try:
    bad_result = EmbeddingResult(
        id="doc-123",
        vector="not a list",  # Wrong type
        metadata={"source": "arxiv"}  # Missing required field
    )
except ValidationError as e:
    print(e)
Output
2 validation errors for EmbeddingResult
vector
Input should be a valid list [type=list_type, input_value='not a list', input_type=str]
metadata.year
Field required [type=missing, input_value={'source': 'arxiv'}, input_type=dict]

Pros:

  • Runtime validation with clear error messages
  • Automatic type coercion
  • JSON serialization built-in (sketch below)
  • Field constraints (min, max, regex, etc.)
  • Nested model support
  • Integration with FastAPI, LangChain, Instructor

Cons:

  • External dependency
  • Slightly slower than raw dicts (but v2 is fast)
  • Learning curve for advanced features
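
Two of the pros above in one quick sketch: the built-in JSON round-trip and field constraints, reusing the result object and ValidationError import from earlier:

# JSON round-trip is built in (Pydantic v2 API)
json_str = result.model_dump_json()
restored = EmbeddingResult.model_validate_json(json_str)
print(restored == result)  # True

# Field constraints are enforced at construction time: score must be in [0, 1]
try:
    EmbeddingResult(
        id="doc-1",
        vector=[0.1],
        metadata={"source": "arxiv", "year": 2024},
        score=1.5,
    )
except ValidationError as e:
    print(e)  # "Input should be less than or equal to 1"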

Pydantic for AI Applications

Let’s see Pydantic in real AI/ML scenarios.

Example 1: Bedrock API Responses

from pydantic import BaseModel
from typing import Literal

class BedrockMessage(BaseModel):
    role: Literal["user", "assistant"]
    content: str

class BedrockUsage(BaseModel):
    input_tokens: int
    output_tokens: int

class BedrockResponse(BaseModel):
    id: str
    model: str
    stop_reason: Literal["end_turn", "max_tokens", "stop_sequence"]
    usage: BedrockUsage
    content: list[dict]

    @property
    def text(self) -> str:
        """Extract text from content blocks."""
        for block in self.content:
            if block.get("type") == "text":
                return block.get("text", "")
        return ""


# Parse API response safely
import json

def call_bedrock(prompt: str) -> BedrockResponse:
    # ... boto3 call ...
    response_body = json.loads(response["body"].read())
    return BedrockResponse(**response_body)

result = call_bedrock("Explain embeddings")
print(result.text)  # Clean access
print(result.usage.output_tokens)  # Type-safe
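
The payoff of parsing at the boundary is that schema drift surfaces immediately as a ValidationError instead of a mystery deep in your pipeline. A minimal sketch with a hypothetical response payload:

from pydantic import ValidationError

# Hypothetical payload shaped like the model above
raw = {
    "id": "msg-001",
    "model": "anthropic.claude-3-sonnet-20240229-v1:0",
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 12, "output_tokens": 34},
    "content": [{"type": "text", "text": "Embeddings map text to vectors."}],
}

try:
    parsed = BedrockResponse(**raw)
    print(parsed.text)  # "Embeddings map text to vectors."
except ValidationError as e:
    print(f"Unexpected response shape:\n{e}")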

Example 2: Vector Store Metadata

from pydantic import BaseModel, Field
from datetime import datetime
from typing import Optional

class DocumentMetadata(BaseModel):
    """Metadata stored with each vector in S3 Vectors."""
    doc_id: str = Field(..., min_length=1)
    title: str
    source: str
    chunk_index: int = Field(ge=0)
    total_chunks: int = Field(ge=1)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    url: Optional[str] = None

    # Computed property
    @property
    def is_first_chunk(self) -> bool:
        return self.chunk_index == 0

# Serialize for S3 Vectors (which needs flat dicts)
metadata = DocumentMetadata(
    doc_id="paper-123",
    title="Attention Is All You Need",
    source="arxiv",
    chunk_index=0,
    total_chunks=5
)

# Convert to dict for storage
storage_dict = metadata.model_dump(mode="json")
print(storage_dict)
Output
{
"doc_id": "paper-123",
"title": "Attention Is All You Need",
"source": "arxiv",
"chunk_index": 0,
"total_chunks": 5,
"created_at": "2026-01-01T12:00:00",
"url": null
}
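
Reading it back is just as easy: model_validate() rebuilds the model from the stored dict, parsing the ISO timestamp back into a datetime along the way:

# Round-trip: rebuild the model from what the vector store hands back
restored = DocumentMetadata.model_validate(storage_dict)
print(restored.created_at)      # a datetime object again, not a string
print(restored.is_first_chunk)  # True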

Example 3: Agent Tool Contracts

Define clear input/output contracts for agent tools:

from pydantic import BaseModel, Field
from typing import Literal

class WeatherRequest(BaseModel):
    """Input for weather tool."""
    location: str = Field(..., description="City name or coordinates")
    units: Literal["celsius", "fahrenheit"] = "celsius"

class WeatherResponse(BaseModel):
    """Output from weather tool."""
    location: str
    temperature: float
    conditions: str
    humidity: int = Field(ge=0, le=100)

def get_weather(request: WeatherRequest) -> WeatherResponse:
    """Tool with typed contract."""
    # ... API call ...
    return WeatherResponse(
        location=request.location,
        temperature=72.5,
        conditions="sunny",
        humidity=45
    )

# Agent can inspect the schema
print(WeatherRequest.model_json_schema())
Output
{
"properties": {
  "location": {
    "description": "City name or coordinates",
    "title": "Location",
    "type": "string"
  },
  "units": {
    "default": "celsius",
    "enum": ["celsius", "fahrenheit"],
    "title": "Units",
    "type": "string"
  }
},
"required": ["location"],
"title": "WeatherRequest",
"type": "object"
}
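
In an agent loop, the same contract validates whatever arguments the LLM produced before the tool runs, so malformed calls fail loudly. A sketch with hypothetical arguments:

# Hypothetical tool-call arguments emitted by the model
raw_args = {"location": "Seattle", "units": "fahrenheit"}

request = WeatherRequest.model_validate(raw_args)  # raises ValidationError if malformed
print(get_weather(request).model_dump_json())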

Example 4: Configuration Management

from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field

class AIConfig(BaseSettings):
    """Configuration from environment variables."""

    # Required
    aws_region: str = Field(alias="AWS_REGION")

    # With defaults
    model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"
    max_tokens: int = Field(default=1024, ge=1, le=4096)
    temperature: float = Field(default=0.7, ge=0.0, le=1.0)

    # Settings behavior: read env vars with an AI_ prefix (AI_MODEL_ID, AI_MAX_TOKENS, etc.)
    model_config = SettingsConfigDict(env_prefix="AI_")

# Loads from environment automatically
config = AIConfig()
print(config.model_id)
print(config.max_tokens)
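
For example, with a couple of (hypothetical) environment variables set, the prefix and type coercion kick in automatically:

import os

# Hypothetical values, purely for the sketch
os.environ["AWS_REGION"] = "us-east-1"
os.environ["AI_MAX_TOKENS"] = "2048"  # env vars are strings; pydantic coerces to int

config = AIConfig()
print(config.aws_region)  # "us-east-1"
print(config.max_tokens)  # 2048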

Performance Comparison

How much overhead does each option add?

import timeit
from dataclasses import dataclass
from pydantic import BaseModel

# Test data
data = {"id": "123", "value": 42.0, "tags": ["a", "b"]}

# Plain dict
def dict_access():
    return data["id"], data["value"], data["tags"]

# dataclass
@dataclass
class DataclassModel:
    id: str
    value: float
    tags: list[str]

def dataclass_create():
    return DataclassModel(**data)

# Pydantic
class PydanticModel(BaseModel):
    id: str
    value: float
    tags: list[str]

def pydantic_create():
    return PydanticModel(**data)

# Benchmark
print("Dict access:", timeit.timeit(dict_access, number=100000))
print("Dataclass:", timeit.timeit(dataclass_create, number=100000))
print("Pydantic:", timeit.timeit(pydantic_create, number=100000))
Output

Dict access: 0.012s
Dataclass: 0.089s
Pydantic: 0.342s

Pydantic is ~4x slower than dataclass for object creation. But consider:

  • 0.342s for 100,000 objects = 3.4 microseconds each
  • A single Bedrock API call takes 500-2000ms
  • Network I/O dominates AI application performance
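
If object creation ever does show up in a profile, Pydantic v2 offers model_construct() to skip validation for data you already trust. A minimal sketch reusing the benchmark model above (use it sparingly, since nothing is checked):

# Skips validation entirely -- only safe for data you have already validated
def pydantic_construct():
    return PydanticModel.model_construct(**data)

print("Pydantic (no validation):", timeit.timeit(pydantic_construct, number=100000))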

Decision Framework

Use this flowchart to choose:

Is data from external source (API, file, user)?
├── Yes → Pydantic (validation required)
└── No → Is it configuration?
    ├── Yes → Pydantic Settings
    └── No → Do you need methods/properties?
        ├── Yes → dataclass or Pydantic
        └── No → Is schema fixed?
            ├── Yes → TypedDict
            └── No → Plain dict

Quick Reference:

Scenario                           Recommendation
Bedrock/OpenAI responses           Pydantic
Vector metadata                    Pydantic
Agent tool I/O                     Pydantic
Internal data structures           dataclass
Config from env vars               pydantic-settings
JSON API typing                    TypedDict
Dynamic/unknown schema             dict
Performance-critical inner loop    dataclass or dict

Migration Path

Already have dict-heavy code? Here’s how to migrate incrementally:

Step 1: Add TypedDict for Documentation

# Before
def process_result(result: dict) -> dict:
    return {"id": result["id"], "score": result["score"]}

# After - just add types, no behavior change
class InputResult(TypedDict):
    id: str
    score: float
    metadata: dict

class OutputResult(TypedDict):
    id: str
    score: float

def process_result(result: InputResult) -> OutputResult:
    return {"id": result["id"], "score": result["score"]}

Step 2: Convert Critical Paths to Pydantic

# Validate at API boundaries
class InputResult(BaseModel):
    id: str
    score: float = Field(ge=0.0, le=1.0)
    metadata: dict = Field(default_factory=dict)

def process_result(result: InputResult) -> OutputResult:
    # Now result is validated
    return {"id": result.id, "score": result.score}
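
The caller now does the validation once, at the boundary, typically via model_validate() on the raw dict. A hypothetical example:

# Hypothetical raw payload from an API
raw = {"id": "doc-1", "score": 0.87, "metadata": {"source": "arxiv"}}

validated = InputResult.model_validate(raw)  # raises ValidationError on bad data
output = process_result(validated)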

Step 3: Replace Output Dicts

class OutputResult(BaseModel):
    id: str
    score: float

def process_result(result: InputResult) -> OutputResult:
    return OutputResult(id=result.id, score=result.score)

Full Example: Embedding Pipeline

Here’s a complete example combining everything:

"""
Embedding pipeline with proper data models.
"""
from pydantic import BaseModel, Field
from typing import Optional
from datetime import datetime
import json

# --- Models ---

class Document(BaseModel):
    """Input document to embed."""
    id: str
    content: str
    source: str
    url: Optional[str] = None
    created_at: datetime = Field(default_factory=datetime.utcnow)

class EmbeddingConfig(BaseModel):
    """Configuration for embedding generation."""
    model_id: str = "amazon.titan-embed-text-v2:0"
    dimension: int = 1024
    normalize: bool = True

class DocumentEmbedding(BaseModel):
    """Document with its embedding."""
    document: Document
    vector: list[float]
    model_id: str
    generated_at: datetime = Field(default_factory=datetime.utcnow)

    @property
    def storage_metadata(self) -> dict:
        """Flatten for vector store."""
        return {
            "doc_id": self.document.id,
            "source": self.document.source,
            "url": self.document.url or "",
            "created_at": self.document.created_at.isoformat(),
        }

class SearchResult(BaseModel):
    """Result from similarity search."""
    document: Document
    score: float = Field(ge=0.0, le=1.0)
    rank: int = Field(ge=1)

class SearchResponse(BaseModel):
    """Response from search endpoint."""
    query: str
    results: list[SearchResult]
    total_results: int
    search_time_ms: float

# --- Usage ---

def embed_document(doc: Document, config: EmbeddingConfig) -> DocumentEmbedding:
    """Generate embedding for a document."""
    # ... call Bedrock ...
    vector = [0.1] * config.dimension  # Placeholder

    return DocumentEmbedding(
        document=doc,
        vector=vector,
        model_id=config.model_id
    )

def search(query: str, results: list[dict]) -> SearchResponse:
    """Parse search results into typed response."""
    search_results = [
        SearchResult(
            document=Document(**r["metadata"]),
            score=1 - r["distance"],
            rank=i + 1
        )
        for i, r in enumerate(results)
    ]

    return SearchResponse(
        query=query,
        results=search_results,
        total_results=len(search_results),
        search_time_ms=42.5
    )

# Example usage
doc = Document(
    id="paper-001",
    content="Attention mechanisms allow models to focus on relevant parts...",
    source="arxiv",
    url="https://arxiv.org/abs/1706.03762"
)

config = EmbeddingConfig()
embedding = embed_document(doc, config)

print(f"Embedded {embedding.document.id}")
print(f"Vector dimension: {len(embedding.vector)}")
print(f"Storage metadata: {embedding.storage_metadata}")
Output
Embedded paper-001
Vector dimension: 1024
Storage metadata: {'doc_id': 'paper-001', 'source': 'arxiv', 'url': 'https://arxiv.org/abs/1706.03762', 'created_at': '2026-01-01T12:00:00'}

What’s Next

You’ve learned when to use each data modeling approach in AI applications. Key takeaways:

  1. Use Pydantic at boundaries - API inputs, external data, configuration
  2. Use dataclasses internally - When you control the data source
  3. Use TypedDict for gradual typing - Add types to existing dict code
  4. Don’t over-engineer - Plain dicts are fine for simple, local use
