Data Models for AI Applications: Pydantic vs Python Built-ins
Compare Python's data modeling options for AI/ML applications. Learn when to use dataclasses, TypedDict, or Pydantic for API responses, embeddings metadata, and agent tool contracts.
When you’re building AI applications, data flows everywhere—API responses from Bedrock, metadata for vector stores, tool inputs for agents, structured outputs from LLMs. Without proper data models, you end up with a mess of dictionaries, mysterious KeyError exceptions, and code that’s impossible to refactor.
This tutorial compares Python’s data modeling options and shows you when to use each one in AI/ML applications.
The Problem: Dictionary Hell
Here’s code I see in most AI prototypes:
# The "it works" approach
def process_embedding_result(result):
    vector = result["embedding"]
    metadata = result.get("metadata", {})
    score = metadata.get("similarity_score", 0.0)
    # Wait, is score a float or string?
    # What keys does metadata have?
    # What happens if embedding is missing?
    return {
        "id": metadata["id"],  # KeyError if missing
        "vector": vector,
        "score": float(score)  # TypeError if score is None
    }
This code has three problems:
- No validation - Bad data causes cryptic errors downstream
- No documentation - What shape is result? What’s in metadata?
- No IDE support - No autocomplete, no type checking
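To make these failure modes concrete, the sketch below feeds two malformed payloads to a copy of the function above. Both payloads are invented for illustration.

```python
def process_embedding_result(result):
    vector = result["embedding"]
    metadata = result.get("metadata", {})
    score = metadata.get("similarity_score", 0.0)
    return {
        "id": metadata["id"],
        "vector": vector,
        "score": float(score),
    }

# Hypothetical payloads, each broken in a different way
missing_id = {"embedding": [0.1, 0.2], "metadata": {"similarity_score": 0.9}}
null_score = {
    "embedding": [0.1, 0.2],
    "metadata": {"id": "doc-1", "similarity_score": None},
}

errors = []
for payload in (missing_id, null_score):
    try:
        process_embedding_result(payload)
    except (KeyError, TypeError) as exc:
        # Record only the exception type; the message says little about the cause
        errors.append(type(exc).__name__)

print(errors)  # ['KeyError', 'TypeError']
```

Note that the `.get(..., 0.0)` default does not help in the second case: the key exists, so `None` is returned and `float(None)` fails.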
Let’s fix this systematically.
Option 1: Plain Dictionaries
Use when: Quick scripts, throwaway code, or when schema is truly dynamic.
# Simple but fragile
embedding_result = {
    "id": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "metadata": {"source": "arxiv", "year": 2024}
}
# Access with .get() for safety
source = embedding_result.get("metadata", {}).get("source", "unknown")
Pros:
- Zero overhead
- Maximum flexibility
- Works with any JSON
Cons:
- No validation
- No autocomplete
- No documentation
- Runtime errors from typos (metdata vs metadata)
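The typo risk in the last bullet is easy to miss because `.get()` hides it. A minimal sketch, reusing the dictionary from above:

```python
embedding_result = {
    "id": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "metadata": {"source": "arxiv", "year": 2024},
}

# Typo: "metdata" instead of "metadata". No exception is raised;
# .get() quietly falls back to the empty dict, then to the default.
source = embedding_result.get("metdata", {}).get("source", "unknown")
print(source)  # unknown
```

The defensive access pattern turns a loud bug into a silent one, which is often worse in a pipeline that runs unattended.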
Option 2: dataclasses (Python 3.7+)
Use when: Internal data structures, configuration objects, or when you want type hints without validation.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EmbeddingMetadata:
    source: str
    year: int
    author: Optional[str] = None

@dataclass
class EmbeddingResult:
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = 0.0
Now you get autocomplete and type checking:
result = EmbeddingResult(
    id="doc-123",
    vector=[0.1, 0.2, 0.3],
    metadata=EmbeddingMetadata(source="arxiv", year=2024)
)
# IDE knows these exist
print(result.metadata.source) # "arxiv"
print(result.score) # 0.0
Pros:
- Type hints for IDE support
- Default values
- Auto-generated __init__, __repr__, __eq__
- Built into Python (no dependencies)
Cons:
- No runtime validation (wrong types silently accepted)
- Manual serialization to/from JSON
- No coercion ("2024" won’t become 2024)
# This "works" but is wrong - no validation!
bad_result = EmbeddingResult(
    id=123,  # Should be str, but Python accepts it
    vector="not a list",  # Completely wrong type
    metadata={"source": "arxiv"}  # Should be EmbeddingMetadata
)
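The "manual serialization" drawback listed above looks roughly like this in practice. The sketch uses only the standard library; `result_from_dict` is a hypothetical helper written for this illustration, not something dataclasses provide.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class EmbeddingMetadata:
    source: str
    year: int
    author: Optional[str] = None

@dataclass
class EmbeddingResult:
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = 0.0

def result_from_dict(data: dict) -> EmbeddingResult:
    """Rebuild the nested dataclass by hand (hypothetical helper)."""
    fields = dict(data)  # shallow copy so the caller's dict is untouched
    fields["metadata"] = EmbeddingMetadata(**fields["metadata"])
    return EmbeddingResult(**fields)

original = EmbeddingResult(
    id="doc-123",
    vector=[0.1, 0.2, 0.3],
    metadata=EmbeddingMetadata(source="arxiv", year=2024),
)

# asdict() recurses into nested dataclasses, so serializing is one call
payload = json.dumps(asdict(original))
# Deserializing is not symmetric: nested objects must be rebuilt explicitly
restored = result_from_dict(json.loads(payload))
print(restored == original)  # True
```

Every nested model needs its own reconstruction step, and none of this checks types, which is the gap Pydantic fills.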
Option 3: TypedDict (Python 3.8+)
Use when: You need type hints for dictionary access patterns, especially with JSON APIs.
# NotRequired is in typing from Python 3.11; on older versions import it from typing_extensions
from typing import TypedDict, NotRequired

class EmbeddingMetadata(TypedDict):
    source: str
    year: int
    author: NotRequired[str]  # Optional key

class EmbeddingResult(TypedDict):
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: NotRequired[float]
TypedDict gives you autocomplete while keeping dictionary syntax:
result: EmbeddingResult = {
    "id": "doc-123",
    "vector": [0.1, 0.2, 0.3],
    "metadata": {"source": "arxiv", "year": 2024}
}
# IDE knows the keys
print(result["metadata"]["source"]) # Autocomplete works!
Pros:
- Type hints for dictionaries
- Works with existing dict-based code
- No runtime overhead (it’s just a dict)
- Great for API response typing
Cons:
- No runtime validation (type hints only)
- Still uses [] access (can raise KeyError)
- No methods or computed properties
# Type checker catches this, but runtime doesn't
bad: EmbeddingResult = {
    "id": 123,  # Type error in IDE, but runs fine
    "vector": [0.1, 0.2]
}
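Because a TypedDict is an ordinary dict at runtime, any checking has to be written by hand. The sketch below shows one possible approach: a presence-only check built on the `__required_keys__` attribute that TypedDict classes expose. `EmbeddingRecord` and `missing_keys` are hypothetical names used only here, and the check does not validate value types.

```python
from typing import TypedDict

class EmbeddingRecord(TypedDict):
    id: str
    vector: list[float]

def missing_keys(payload: dict) -> set[str]:
    """Return required keys absent from payload (presence only, no type checks)."""
    return set(EmbeddingRecord.__required_keys__) - set(payload)

record: EmbeddingRecord = {"id": "doc-123", "vector": [0.1, 0.2]}

print(type(record) is dict)              # True
print(missing_keys({"id": "doc-123"}))   # {'vector'}
```

This is enough to catch a missing key early, but it quickly grows into a hand-rolled validator, at which point Pydantic is usually the simpler choice.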
Option 4: Pydantic (The AI/ML Standard)
Use when: API boundaries, external data, configuration, or anywhere you need validation.
from pydantic import BaseModel, Field
from typing import Optional

class EmbeddingMetadata(BaseModel):
    source: str
    year: int
    author: Optional[str] = None

class EmbeddingResult(BaseModel):
    id: str
    vector: list[float]
    metadata: EmbeddingMetadata
    score: float = Field(default=0.0, ge=0.0, le=1.0)
Pydantic validates at runtime and coerces types:
# This works - Pydantic coerces types
result = EmbeddingResult(
    id="doc-123",
    vector=[0.1, 0.2, 0.3],
    metadata={"source": "arxiv", "year": "2024"}  # str -> int
)
print(result.metadata.year) # 2024 (int, not str)
print(type(result.metadata.year)) # <class 'int'>
Validation catches errors immediately:
from pydantic import ValidationError

try:
    bad_result = EmbeddingResult(
        id="doc-123",
        vector="not a list",  # Wrong type
        metadata={"source": "arxiv"}  # Missing required field
    )
except ValidationError as e:
    print(e)
2 validation errors for EmbeddingResult
vector
  Input should be a valid list [type=list_type, input_value='not a list', input_type=str]
metadata.year
  Field required [type=missing, input_value={'source': 'arxiv'}, input_type=dict]
Pros:
- Runtime validation with clear error messages
- Automatic type coercion
- JSON serialization built-in
- Field constraints (min, max, regex, etc.)
- Nested model support
- Integration with FastAPI, LangChain, Instructor
Cons:
- External dependency
- Slightly slower than raw dicts (but v2 is fast)
- Learning curve for advanced features
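Two of the advantages listed above, built-in JSON serialization and field constraints, can be sketched in a few lines. `ScoredChunk` is a hypothetical model invented for this illustration, and the snippet assumes Pydantic v2.

```python
from pydantic import BaseModel, Field, ValidationError

class ScoredChunk(BaseModel):
    """Hypothetical model, not part of the examples above."""
    id: str
    score: float = Field(default=0.0, ge=0.0, le=1.0)

chunk = ScoredChunk(id="doc-123", score=0.87)

# Built-in JSON round trip: model -> JSON string -> model
as_json = chunk.model_dump_json()
restored = ScoredChunk.model_validate_json(as_json)
print(restored == chunk)

# Field constraints: a score above 1.0 is rejected at construction time
try:
    ScoredChunk(id="doc-123", score=1.5)
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc.error_count(), "validation error")
```

Compared with the dataclass version, there is no hand-written reconstruction step and the range check lives in the field definition.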
Pydantic for AI Applications
Let’s see Pydantic in real AI/ML scenarios.
Example 1: Bedrock API Responses
from pydantic import BaseModel
from typing import Literal

class BedrockMessage(BaseModel):
    role: Literal["user", "assistant"]
    content: str

class BedrockUsage(BaseModel):
    input_tokens: int
    output_tokens: int

class BedrockResponse(BaseModel):
    id: str
    model: str
    stop_reason: Literal["end_turn", "max_tokens", "stop_sequence"]
    usage: BedrockUsage
    content: list[dict]

    @property
    def text(self) -> str:
        """Extract text from content blocks."""
        for block in self.content:
            if block.get("type") == "text":
                return block.get("text", "")
        return ""

# Parse API response safely
import json

def call_bedrock(prompt: str) -> BedrockResponse:
    # ... boto3 call ...
    response_body = json.loads(response["body"].read())
    return BedrockResponse(**response_body)

result = call_bedrock("Explain embeddings")
print(result.text)  # Clean access
print(result.usage.output_tokens)  # Type-safe
Example 2: Vector Store Metadata
from pydantic import BaseModel, Field
from datetime import datetime
from typing import Optional

class DocumentMetadata(BaseModel):
    """Metadata stored with each vector in S3 Vectors."""
    doc_id: str = Field(..., min_length=1)
    title: str
    source: str
    chunk_index: int = Field(ge=0)
    total_chunks: int = Field(ge=1)
    # Note: datetime.utcnow is deprecated since Python 3.12;
    # datetime.now(timezone.utc) is the timezone-aware alternative
    created_at: datetime = Field(default_factory=datetime.utcnow)
    url: Optional[str] = None

    # Computed property
    @property
    def is_first_chunk(self) -> bool:
        return self.chunk_index == 0

# Serialize for S3 Vectors (which needs flat dicts)
metadata = DocumentMetadata(
    doc_id="paper-123",
    title="Attention Is All You Need",
    source="arxiv",
    chunk_index=0,
    total_chunks=5
)
# Convert to a JSON-compatible dict for storage
storage_dict = metadata.model_dump(mode="json")
# Print the same data as formatted JSON
print(metadata.model_dump_json(indent=2))
{
  "doc_id": "paper-123",
  "title": "Attention Is All You Need",
  "source": "arxiv",
  "chunk_index": 0,
  "total_chunks": 5,
  "created_at": "2026-01-01T12:00:00",
  "url": null
}
Example 3: Agent Tool Contracts
Define clear input/output contracts for agent tools:
from pydantic import BaseModel, Field
from typing import Literal

class WeatherRequest(BaseModel):
    """Input for weather tool."""
    location: str = Field(..., description="City name or coordinates")
    units: Literal["celsius", "fahrenheit"] = "celsius"

class WeatherResponse(BaseModel):
    """Output from weather tool."""
    location: str
    temperature: float
    conditions: str
    humidity: int = Field(ge=0, le=100)

def get_weather(request: WeatherRequest) -> WeatherResponse:
    """Tool with typed contract."""
    # ... API call ...
    return WeatherResponse(
        location=request.location,
        temperature=72.5,
        conditions="sunny",
        humidity=45
    )
# Agent can inspect the schema (printed as formatted JSON)
import json
print(json.dumps(WeatherRequest.model_json_schema(), indent=2))
{
  "description": "Input for weather tool.",
  "properties": {
    "location": {
      "description": "City name or coordinates",
      "title": "Location",
      "type": "string"
    },
    "units": {
      "default": "celsius",
      "enum": [
        "celsius",
        "fahrenheit"
      ],
      "title": "Units",
      "type": "string"
    }
  },
  "required": [
    "location"
  ],
  "title": "WeatherRequest",
  "type": "object"
}
Example 4: Configuration Management
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field

class AIConfig(BaseSettings):
    """Configuration from environment variables."""
    # Settings options (Pydantic v2 style; the nested `class Config` is deprecated)
    model_config = SettingsConfigDict(env_prefix="AI_")  # AI_MODEL_ID, AI_MAX_TOKENS, etc.

    # Required; an aliased field ignores env_prefix, so this reads AWS_REGION
    aws_region: str = Field(alias="AWS_REGION")

    # With defaults
    model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"
    max_tokens: int = Field(default=1024, ge=1, le=4096)
    temperature: float = Field(default=0.7, ge=0.0, le=1.0)

# Loads from environment automatically
config = AIConfig()
print(config.model_id)
print(config.max_tokens)
Performance Comparison
How much overhead does each option add?
import timeit
from dataclasses import dataclass
from pydantic import BaseModel

# Test data
data = {"id": "123", "value": 42.0, "tags": ["a", "b"]}

# Plain dict
def dict_access():
    return data["id"], data["value"], data["tags"]

# dataclass
@dataclass
class DataclassModel:
    id: str
    value: float
    tags: list[str]

def dataclass_create():
    return DataclassModel(**data)

# Pydantic
class PydanticModel(BaseModel):
    id: str
    value: float
    tags: list[str]

def pydantic_create():
    return PydanticModel(**data)

# Benchmark
print("Dict access:", timeit.timeit(dict_access, number=100000))
print("Dataclass:", timeit.timeit(dataclass_create, number=100000))
print("Pydantic:", timeit.timeit(pydantic_create, number=100000))
Dict access: 0.012s
Dataclass: 0.089s
Pydantic: 0.342s
Pydantic is ~4x slower than dataclass for object creation. But consider:
- 0.342s for 100,000 objects = 3.4 microseconds each
- A single Bedrock API call takes 500-2000ms
- Network I/O dominates AI application performance
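The arithmetic behind these bullets can be checked directly. The sketch takes the 0.342 s benchmark figure and the 500 ms lower bound from the text as given; neither is re-measured here.

```python
pydantic_total_s = 0.342   # benchmark total quoted above for 100,000 objects
n_objects = 100_000
api_call_s = 0.5           # fastest Bedrock call in the range quoted above

per_object_s = pydantic_total_s / n_objects   # cost of building one model
share = per_object_s / api_call_s             # fraction of one API call

print(f"{per_object_s * 1e6:.2f} microseconds per object")
print(f"{share:.6%} of a 500 ms call")
```

Even against the fastest call in that range, validating one response costs well under a thousandth of a percent of the request time.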
Decision Framework
Use this flowchart to choose:
Is data from external source (API, file, user)?
├── Yes → Pydantic (validation required)
└── No → Is it configuration?
    ├── Yes → Pydantic Settings
    └── No → Do you need methods/properties?
        ├── Yes → dataclass or Pydantic
        └── No → Is schema fixed?
            ├── Yes → TypedDict
            └── No → Plain dict
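The same decision tree can be written as a small function, which makes the order of the questions explicit. This is only a sketch of the flowchart above; `choose_model` and its parameter names are invented for illustration.

```python
def choose_model(
    external: bool = False,
    is_config: bool = False,
    needs_methods: bool = False,
    fixed_schema: bool = False,
) -> str:
    """Walk the flowchart top to bottom; the first matching branch wins."""
    if external:
        return "Pydantic"
    if is_config:
        return "Pydantic Settings"
    if needs_methods:
        return "dataclass or Pydantic"
    if fixed_schema:
        return "TypedDict"
    return "Plain dict"

print(choose_model(external=True))       # Pydantic
print(choose_model(fixed_schema=True))   # TypedDict
print(choose_model())                    # Plain dict
```

Because the checks run in order, external data always lands on Pydantic regardless of the other answers.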
Quick Reference:
| Scenario | Recommendation |
|---|---|
| Bedrock/OpenAI responses | Pydantic |
| Vector metadata | Pydantic |
| Agent tool I/O | Pydantic |
| Internal data structures | dataclass |
| Config from env vars | pydantic-settings |
| JSON API typing | TypedDict |
| Dynamic/unknown schema | dict |
| Performance-critical inner loop | dataclass or dict |
Migration Path
Already have dict-heavy code? Here’s how to migrate incrementally:
Step 1: Add TypedDict for Documentation
from typing import TypedDict

# Before
def process_result(result: dict) -> dict:
    return {"id": result["id"], "score": result["score"]}

# After - just add types, no behavior change
class InputResult(TypedDict):
    id: str
    score: float
    metadata: dict

class OutputResult(TypedDict):
    id: str
    score: float

def process_result(result: InputResult) -> OutputResult:
    return {"id": result["id"], "score": result["score"]}
Step 2: Convert Critical Paths to Pydantic
from pydantic import BaseModel, Field

# Validate at API boundaries
class InputResult(BaseModel):
    id: str
    score: float = Field(ge=0.0, le=1.0)
    metadata: dict = Field(default_factory=dict)

def process_result(result: InputResult) -> OutputResult:
    # Now result is validated; OutputResult is still the TypedDict from Step 1
    return {"id": result.id, "score": result.score}
Step 3: Replace Output Dicts
class OutputResult(BaseModel):
    id: str
    score: float

def process_result(result: InputResult) -> OutputResult:
    return OutputResult(id=result.id, score=result.score)
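During a migration, some callers will still pass and expect plain dicts. One way to bridge that gap, sketched here under the assumption of Pydantic v2, is a thin adapter around the typed function. `process_raw` is a hypothetical name introduced for this example.

```python
from pydantic import BaseModel, Field, ValidationError

class InputResult(BaseModel):
    id: str
    score: float = Field(ge=0.0, le=1.0)
    metadata: dict = Field(default_factory=dict)

class OutputResult(BaseModel):
    id: str
    score: float

def process_result(result: InputResult) -> OutputResult:
    return OutputResult(id=result.id, score=result.score)

def process_raw(raw: dict) -> dict:
    """Hypothetical adapter for callers that still work with dicts."""
    validated = InputResult.model_validate(raw)   # raises ValidationError on bad input
    return process_result(validated).model_dump()

print(process_raw({"id": "doc-1", "score": "0.9"}))

try:
    process_raw({"id": "doc-1", "score": 2})      # outside the 0..1 range
    out_of_range_rejected = False
except ValidationError:
    out_of_range_rejected = True
```

Old call sites keep working through the adapter while new code calls `process_result` with models directly, so the adapter can be deleted once the last dict-based caller is gone.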
Full Example: Embedding Pipeline
Here’s a complete example combining everything:
"""
Embedding pipeline with proper data models.
"""
from pydantic import BaseModel, Field
from typing import Optional
from datetime import datetime
import json

# --- Models ---
class Document(BaseModel):
    """Input document to embed."""
    id: str
    content: str
    source: str
    url: Optional[str] = None
    created_at: datetime = Field(default_factory=datetime.utcnow)

class EmbeddingConfig(BaseModel):
    """Configuration for embedding generation."""
    model_id: str = "amazon.titan-embed-text-v2:0"
    dimension: int = 1024
    normalize: bool = True

class DocumentEmbedding(BaseModel):
    """Document with its embedding."""
    document: Document
    vector: list[float]
    model_id: str
    generated_at: datetime = Field(default_factory=datetime.utcnow)

    @property
    def storage_metadata(self) -> dict:
        """Flatten for vector store."""
        return {
            "doc_id": self.document.id,
            "source": self.document.source,
            "url": self.document.url or "",
            "created_at": self.document.created_at.isoformat(),
        }

class SearchResult(BaseModel):
    """Result from similarity search."""
    document: Document
    score: float = Field(ge=0.0, le=1.0)
    rank: int = Field(ge=1)

class SearchResponse(BaseModel):
    """Response from search endpoint."""
    query: str
    results: list[SearchResult]
    total_results: int
    search_time_ms: float
# --- Usage ---
def embed_document(doc: Document, config: EmbeddingConfig) -> DocumentEmbedding:
    """Generate embedding for a document."""
    # ... call Bedrock ...
    vector = [0.1] * config.dimension  # Placeholder
    return DocumentEmbedding(
        document=doc,
        vector=vector,
        model_id=config.model_id
    )

def search(query: str, results: list[dict]) -> SearchResponse:
    """Parse search results into typed response."""
    search_results = [
        SearchResult(
            document=Document(**r["metadata"]),
            score=1 - r["distance"],
            rank=i + 1
        )
        for i, r in enumerate(results)
    ]
    return SearchResponse(
        query=query,
        results=search_results,
        total_results=len(search_results),
        search_time_ms=42.5
    )

# Example usage
doc = Document(
    id="paper-001",
    content="Attention mechanisms allow models to focus on relevant parts...",
    source="arxiv",
    url="https://arxiv.org/abs/1706.03762"
)
config = EmbeddingConfig()
embedding = embed_document(doc, config)
print(f"Embedded {embedding.document.id}")
print(f"Vector dimension: {len(embedding.vector)}")
print(f"Storage metadata: {embedding.storage_metadata}")
Embedded paper-001
Vector dimension: 1024
Storage metadata: {'doc_id': 'paper-001', 'source': 'arxiv', 'url': 'https://arxiv.org/abs/1706.03762', 'created_at': '2026-01-01T12:00:00'}
What’s Next
You’ve learned when to use each data modeling approach in AI applications. Key takeaways:
- Use Pydantic at boundaries - API inputs, external data, configuration
- Use dataclasses internally - When you control the data source
- Use TypedDict for gradual typing - Add types to existing dict code
- Don’t over-engineer - Plain dicts are fine for simple, local use