CFP Oracle: Semantic Search for College Football History
Build a semantic search system to find historically similar College Football Playoff games using Amazon S3 Vectors and Bedrock embeddings.
With the College Football Playoff in full swing, I built a semantic search system to answer questions like: “What historical games are similar to today’s Texas vs Ohio State matchup?” The CFP Oracle uses S3 Vectors to find games with similar narratives, not just keyword matches.
What We’re Building
A semantic search system that:
- Stores 36 historical CFP and bowl games (2006-2025) as embeddings
- Accepts natural language queries like “overtime thriller” or “quarterback comeback”
- Returns the most similar games based on narrative, not keywords
Example queries:
- “underdog upset against a top-2 team” → Miami 10-seed beats Ohio State (2025)
- “quarterback leads dramatic comeback to win championship” → Vince Young’s Texas (2006)
- “defensive battle with a shutout” → Alabama 21 - LSU 0 (2012)
Architecture
```text
Query: "overtime thriller in a semifinal"
    │
    ├── Generate embedding (Titan Text V2)
    │
    ├── Query S3 Vectors index
    │   └── Cosine similarity search
    │
    └── Return top matches with metadata
        ├── Georgia 54 - Oklahoma 48 (2018) - 26.4% similar
        ├── Michigan 27 - Alabama 20 (2024) - 25.2% similar
        └── Alabama 45 - Oklahoma 34 (2019) - 23.9% similar
```
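To make the "cosine similarity search" step concrete, here is a minimal brute-force sketch of what is being computed. It is not how S3 Vectors works internally (the service manages its own index); the toy 3-dimensional vectors and the names `cosine_similarity`, `top_matches`, and `toy_index` are assumptions made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query: list[float], index: dict[str, list[float]], k: int = 3):
    """Score every stored vector against the query and keep the best k."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; the real Titan vectors have 1024 dimensions.
toy_index = {
    "game-a": [1.0, 0.0, 0.0],
    "game-b": [0.6, 0.8, 0.0],
    "game-c": [0.0, 0.0, 1.0],
}
query_vector = [1.0, 0.1, 0.0]

# game-a points in nearly the same direction as the query, so it ranks first.
ranked = top_matches(query_vector, toy_index, k=2)
```

Because cosine similarity depends only on direction, not vector length, two game descriptions of very different lengths can still score as close matches.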
Data Model with Pydantic
We use Pydantic for type-safe data handling:
```python
from typing import Optional

from pydantic import BaseModel, Field


class Teams(BaseModel):
    winner: str
    loser: str


class Score(BaseModel):
    winner: int
    loser: int


class KeyStats(BaseModel):
    overtime: Optional[bool] = None
    comeback: Optional[int] = None  # size of the comeback, in points
    shutout: Optional[bool] = None
    upset: Optional[bool] = None


class Game(BaseModel):
    """Historical CFP/Bowl game."""

    id: str
    game: str
    year: int
    teams: Teams
    score: Score
    narrative: str
    venue: Optional[str] = None
    tags: list[str] = Field(default_factory=list)
    key_stats: Optional[KeyStats] = None
```
Creating Rich Game Embeddings
The key to good semantic search is creating rich text representations:
```python
def create_game_text(game: Game) -> str:
    """Create text representation for embedding."""
    parts = [
        f"{game.game} ({game.year})",
        f"{game.teams.winner} defeated {game.teams.loser} "
        f"{game.score.winner}-{game.score.loser}",
        game.narrative,
    ]

    # Add context from key stats
    if game.key_stats:
        if game.key_stats.overtime:
            parts.append("The game went to overtime.")
        if game.key_stats.comeback:
            parts.append(f"Featured a {game.key_stats.comeback}-point comeback.")
        if game.key_stats.shutout:
            parts.append("This was a shutout victory.")
        if game.key_stats.upset:
            parts.append("This was a major upset.")

    # Add tags as readable context
    if game.tags:
        readable_tags = [t.replace("_", " ") for t in game.tags]
        parts.append(f"Key themes: {', '.join(readable_tags)}")

    return " ".join(parts)
```
For example, the 2025 Miami upset becomes:
Cotton Bowl - CFP Quarterfinal (2025) Miami defeated Ohio State 24-14 In a stunning upset, 10-seed Miami dominated 2-seed Ohio State from start to finish. The Hurricanes built a 14-0 halftime lead and never looked back, becoming the first double-digit seed to reach a CFP semifinal. This was a major upset. Key themes: upset, defensive battle, cfp quarterfinal, double digit seed
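Text like this is what gets sent to the embedding model. The post does not show `generate_embedding`, so the following is only a sketch of how it might look, assuming the Bedrock runtime `invoke_model` call with the Titan Text Embeddings V2 request fields `inputText`, `dimensions`, and `normalize`. The optional `client` parameter and the `StubBedrockClient` class are additions for illustration, so the function can be exercised without AWS credentials.

```python
import io
import json

MODEL_ID = "amazon.titan-embed-text-v2:0"  # same model ID the post's config uses


def generate_embedding(text: str, client=None, dimensions: int = 1024) -> list[float]:
    """Return the embedding vector for `text` (sketch, not the post's code)."""
    if client is None:
        import boto3  # imported lazily so a stub can be used offline

        client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.invoke_model(
        modelId=MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=json.dumps(
            {"inputText": text, "dimensions": dimensions, "normalize": True}
        ),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]


class StubBedrockClient:
    """Hypothetical stand-in for the real client, for offline checks only."""

    def invoke_model(self, **request):
        self.request = request  # remember what was sent
        body = json.dumps({"embedding": [0.1, 0.2, 0.3]}).encode()
        return {"body": io.BytesIO(body)}


stub = StubBedrockClient()
vector = generate_embedding("overtime thriller in a semifinal", client=stub)
```

The same function is used for both game texts and user queries, which matters: vectors are only comparable when they come from the same model with the same dimension.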
Storing in S3 Vectors
```python
import boto3
from pydantic import BaseModel


class S3VectorConfig(BaseModel):
    bucket_name: str = "cfp-oracle-vectors"
    index_name: str = "historical-games"
    embedding_model: str = "amazon.titan-embed-text-v2:0"
    embedding_dimension: int = 1024
    region: str = "us-east-1"


config = S3VectorConfig()
s3vectors = boto3.client("s3vectors", region_name=config.region)

# Create infrastructure
s3vectors.create_vector_bucket(vectorBucketName=config.bucket_name)
s3vectors.create_index(
    vectorBucketName=config.bucket_name,
    indexName=config.index_name,
    dataType="float32",
    dimension=config.embedding_dimension,  # must match the embedding size
    distanceMetric="cosine",
)

# Load games. `games` is the list of Game objects; generate_embedding and
# game_to_metadata are helpers from the full demo code (not shown here).
for game in games:
    game_text = create_game_text(game)
    embedding = generate_embedding(game_text)
    metadata = game_to_metadata(game)
    s3vectors.put_vectors(
        vectorBucketName=config.bucket_name,
        indexName=config.index_name,
        vectors=[{
            "key": game.id,
            "data": {"float32": embedding},
            "metadata": metadata.model_dump(exclude_none=True),
        }],
    )
```
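The loop above makes one `put_vectors` call per game, which is fine for 36 games. Since `vectors` is a list, a larger dataset could be sent in batches instead. The helper names `batched` and `upload_in_batches` below are hypothetical, and the default batch size of 100 is an assumption rather than a documented limit, so check the service quotas before relying on it.

```python
from typing import Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_in_batches(
    client, bucket: str, index: str, vectors: list[dict], batch_size: int = 100
) -> int:
    """Send vectors with one put_vectors call per batch; return the call count."""
    calls = 0
    for batch in batched(vectors, batch_size):
        client.put_vectors(vectorBucketName=bucket, indexName=index, vectors=batch)
        calls += 1
    return calls
```

With this in place, the loading loop would first build the full list of `{"key", "data", "metadata"}` dictionaries and then hand it to `upload_in_batches` together with the `s3vectors` client.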
Querying for Similar Games
```python
# SearchResult and VectorMetadata are Pydantic models defined elsewhere in the demo.
def find_similar_games(query: str, top_k: int = 5) -> list[SearchResult]:
    """Find games similar to a query description."""
    # Embed the query with the same model used for the stored games
    query_embedding = generate_embedding(query)

    response = s3vectors.query_vectors(
        vectorBucketName=config.bucket_name,
        indexName=config.index_name,
        queryVector={"float32": query_embedding},
        topK=top_k,
        returnDistance=True,
        returnMetadata=True,
    )

    results = []
    for v in response["vectors"]:
        result = SearchResult(
            key=v["key"],
            distance=v.get("distance", 0),
            metadata=VectorMetadata(**v["metadata"]),
        )
        results.append(result)
    return results
```
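The query returns a distance, while the sample output shows a similarity percentage. The post does not show the conversion; one plausible version, assuming cosine distance is defined as 1 minus cosine similarity, is sketched below. The function names are hypothetical, and the distance of 0.736 is back-computed from the 26.4% figure in the architecture diagram, not a value taken from the service.

```python
def similarity_percent(distance: float) -> float:
    """Convert a cosine distance into a similarity percentage (assumed formula)."""
    return round((1.0 - distance) * 100, 1)


def format_match(title: str, distance: float) -> str:
    """Render one result line in the style of the architecture diagram."""
    return f"{title} - {similarity_percent(distance)}% similar"


# Illustrative distance only, chosen to reproduce the 26.4% shown earlier.
line = format_match("Georgia 54 - Oklahoma 48 (2018)", 0.736)
```

Under this assumption a distance of 0 means identical direction (100% similar), and larger distances mean weaker matches.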
Try It: Demo Queries
Here’s what the oracle returns for topical queries:
Query: “Texas vs Ohio State Cotton Bowl semifinal”
```text
🏈 Cotton Bowl - CFP Quarterfinal (2025)
Miami 24 - Ohio State 14
Similarity: 57.9%
In a stunning upset, 10-seed Miami dominated 2-seed Ohio State...
#upset #defensive-battle #cfp-quarterfinal

🏈 Cotton Bowl - CFP Semifinal (2022)
Alabama 27 - Cincinnati 6
Similarity: 49.2%

🏈 Cotton Bowl - CFP Semifinal (2016)
Alabama 38 - Michigan State 0
Similarity: 45.7%
```
Query: “quarterback leads dramatic comeback to win championship”
```text
🏈 Rose Bowl - BCS Championship (2006)
Texas 41 - USC 38
Similarity: 33.8%
Vince Young delivered one of the greatest performances in championship history, rushing for 200 yards and the game-winning touchdown with 19 seconds left...
#championship #greatest-game #qb-heroics #comeback

🏈 CFP National Championship (2018)
Alabama 26 - Georgia 23
Similarity: 33.7%
Tua Tagovailoa came off the bench to lead Alabama to an overtime victory...
```
Historical Games Dataset
The oracle includes 36 games spanning 2006-2025:
| Era | Games | Highlights |
|---|---|---|
| CFP Era (2015-2025) | 30 | All semifinals, championships, 2025 first round |
| BCS Era (2006-2014) | 6 | Vince Young’s Texas, Alabama-LSU rematch |
Tags for filtering:
- Game type: championship, semifinal, cfp_quarterfinal
- Drama: overtime, comeback, upset, shutout
- Style: shootout, defensive_battle, qb_heroics
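One simple way to use these tags is to filter results on the client after the similarity search. The sketch below assumes each result is a dictionary shaped like an entry of the `query_vectors` response, with the game's tags stored as a list under `metadata["tags"]`; the post does not show the metadata layout, so treat that as a guess. S3 Vectors also accepts a `filter` argument on `query_vectors` for server-side metadata filtering, which is not covered here.

```python
def filter_by_tags(results: list[dict], required: set[str]) -> list[dict]:
    """Keep only results whose metadata tags include every required tag."""
    kept = []
    for result in results:
        tags = set(result.get("metadata", {}).get("tags", []))
        if required <= tags:  # subset test: all required tags are present
            kept.append(result)
    return kept


# Hypothetical results; keys and distances are made up for illustration.
sample_results = [
    {"key": "game-1", "distance": 0.41, "metadata": {"tags": ["semifinal", "overtime"]}},
    {"key": "game-2", "distance": 0.45, "metadata": {"tags": ["championship"]}},
    {"key": "game-3", "distance": 0.52, "metadata": {}},
]
overtime_semis = filter_by_tags(sample_results, {"semifinal", "overtime"})
```

Filtering after the search keeps the ranking order intact, but it can return fewer than `top_k` results, so it may be worth requesting a larger `top_k` when a tag filter is active.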
Full Code
The complete implementation is in the largo demos:
```text
demos/cfp-oracle/
├── cfp_oracle.py      # Main oracle with S3 Vectors
├── games_data.py      # 36 historical games
├── models.py          # Pydantic data models
└── requirements.txt   # boto3>=1.42, pydantic>=2.0
```
Usage:
```shell
# Setup infrastructure and load games
python cfp_oracle.py --setup

# Run demo queries
python cfp_oracle.py --demo

# Interactive mode
python cfp_oracle.py --interactive

# Single query
python cfp_oracle.py --query "high-scoring shootout"

# Cleanup
python cfp_oracle.py --cleanup
```
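For readers wiring up a similar command line, the flags above could be parsed with `argparse` roughly as follows. This is a sketch, not the demo's actual parser: the help strings and the choice to make the modes mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the five modes listed in the usage section."""
    parser = argparse.ArgumentParser(
        prog="cfp_oracle.py",
        description="Semantic search over historical CFP and bowl games.",
    )
    # Assumption: exactly one mode is chosen per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--setup", action="store_true",
                      help="create the vector bucket and index, then load games")
    mode.add_argument("--demo", action="store_true", help="run the demo queries")
    mode.add_argument("--interactive", action="store_true",
                      help="prompt for queries in a loop")
    mode.add_argument("--query", metavar="TEXT", help="run a single query")
    mode.add_argument("--cleanup", action="store_true",
                      help="delete the index and bucket")
    return parser


# Parsing an explicit argument list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--query", "high-scoring shootout"])
```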
What’s Next
You’ve built a semantic search system for sports history. The same pattern works for:
- Document search - Find similar contracts, papers, or articles
- Product recommendations - “Show me products like this one”
- Support tickets - Find similar past issues and resolutions
- Code search - Find similar implementations across a codebase
Key takeaways:
- Rich text representations improve search quality
- Pydantic ensures consistent data handling
- S3 Vectors is cost-effective for infrequent queries
- Narrative similarity beats keyword matching for exploratory search
Related tutorials:
- S3 Vectors Getting Started - Foundation for this demo
- Data Models for AI Applications - Pydantic patterns we used