CFP Oracle: Semantic Search for College Football History

Open Seas | 15 min read | January 1, 2026

Build a semantic search system to find historically similar College Football Playoff games using Amazon S3 Vectors and Bedrock embeddings.

With the College Football Playoff in full swing, I built a semantic search system to answer questions like: “What historical games are similar to today’s Texas vs Ohio State matchup?” The CFP Oracle uses S3 Vectors to find games with similar narratives, not just keyword matches.

What We’re Building

A semantic search system that:

  1. Stores 36 historical CFP and bowl games (2006-2025) as embeddings
  2. Accepts natural language queries like “overtime thriller” or “quarterback comeback”
  3. Returns the most similar games based on narrative, not keywords

Example queries:

  • “underdog upset against a top-2 team” → Miami 10-seed beats Ohio State (2025)
  • “quarterback leads dramatic comeback to win championship” → Vince Young’s Texas (2006)
  • “defensive battle with a shutout” → Alabama 21 - LSU 0 (2012)

Architecture

Query: "overtime thriller in a semifinal"
    │
    ├── Generate embedding (Titan Text V2)
    │
    ├── Query S3 Vectors index
    │   └── Cosine similarity search
    │
    └── Return top matches with metadata
        ├── Georgia 54 - Oklahoma 48 (2018) - 26.4% similar
        ├── Michigan 27 - Alabama 20 (2024) - 25.2% similar
        └── Alabama 45 - Oklahoma 34 (2019) - 23.9% similar
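
To make the "cosine similarity search" step concrete, here is a minimal pure-Python sketch of the underlying math. It is illustrative only: S3 Vectors performs the search server-side, and the 3-dimensional vectors below are made up for the example (the embeddings in this project have 1024 dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance form of the same measure: 0.0 means identical direction."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors, invented for illustration
query = [1.0, 0.0, 1.0]
game_a = [2.0, 0.0, 2.0]  # points the same way as the query
game_b = [0.0, 3.0, 0.0]  # orthogonal to the query

# Rank the toy "games" from most to least similar to the query
ranked = sorted(
    [("game_a", game_a), ("game_b", game_b)],
    key=lambda item: cosine_distance(query, item[1]),
)
```

Because cosine similarity ignores vector length, `game_a` ranks first even though its magnitude differs from the query's.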

Data Model with Pydantic

We use Pydantic for type-safe data handling:

from pydantic import BaseModel, Field
from typing import Optional

class Teams(BaseModel):
    winner: str
    loser: str

class Score(BaseModel):
    winner: int
    loser: int

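class KeyStats(BaseModel):
    overtime: Optional[bool] = None  # True if the game went to overtime
    comeback: Optional[int] = None   # Size of the comeback, in points
    shutout: Optional[bool] = None   # True if the loser was held scoreless
    upset: Optional[bool] = None     # True if the result was a major upset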

class Game(BaseModel):
    """Historical CFP/Bowl game."""
    id: str
    game: str
    year: int
    teams: Teams
    score: Score
    narrative: str
    venue: Optional[str] = None
    tags: list[str] = Field(default_factory=list)
    key_stats: Optional[KeyStats] = None

Creating Rich Game Embeddings

Search quality depends on what you embed. Each game is flattened into a single text string that combines the matchup, score, narrative, key stats, and tags:

def create_game_text(game: Game) -> str:
    """Create text representation for embedding."""
    parts = [
        f"{game.game} ({game.year})",
        f"{game.teams.winner} defeated {game.teams.loser} "
        f"{game.score.winner}-{game.score.loser}",
        game.narrative,
    ]

    # Add context from key stats
    if game.key_stats:
        if game.key_stats.overtime:
            parts.append("The game went to overtime.")
        if game.key_stats.comeback:
            parts.append(f"Featured a {game.key_stats.comeback}-point comeback.")
        if game.key_stats.shutout:
            parts.append("This was a shutout victory.")
        if game.key_stats.upset:
            parts.append("This was a major upset.")

    # Add tags as readable context
    if game.tags:
        readable_tags = [t.replace("_", " ") for t in game.tags]
        parts.append(f"Key themes: {', '.join(readable_tags)}")

    return " ".join(parts)

For example, the 2025 Miami upset becomes:

Output
Cotton Bowl - CFP Quarterfinal (2025) Miami defeated Ohio State 24-14
In a stunning upset, 10-seed Miami dominated 2-seed Ohio State from
start to finish. The Hurricanes built a 14-0 halftime lead and never
looked back, becoming the first double-digit seed to reach a CFP
semifinal. This was a major upset. Key themes: upset, defensive battle,
cfp quarterfinal, double digit seed
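
The code in the following sections calls a `generate_embedding` helper that this article does not list. Here is one way it could look, as a sketch rather than the project's actual implementation. It assumes the Bedrock runtime `invoke_model` call with the Titan Text Embeddings V2 request fields (`inputText`, `dimensions`, `normalize`). The `client` parameter and the `build_titan_request` helper are hypothetical additions so the function can be exercised without AWS credentials.

```python
import json

MODEL_ID = "amazon.titan-embed-text-v2:0"  # model ID used in this article's config
EMBEDDING_DIMENSION = 1024                 # matches the index dimension


def build_titan_request(text: str, dimensions: int = EMBEDDING_DIMENSION) -> str:
    """Serialize the JSON request body for Titan Text Embeddings V2."""
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": True}
    )


def generate_embedding(text: str, client=None) -> list[float]:
    """Return the embedding vector for `text`.

    `client` is a hypothetical injection point for testing; when omitted,
    a real bedrock-runtime client is created.
    """
    if client is None:
        import boto3  # imported lazily so the sketch loads without boto3

        client = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_titan_request(text),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing a JSON document
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

Typical use would be `generate_embedding(create_game_text(game))`. The requested dimension has to match the dimension of the vector index.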

Storing in S3 Vectors

import boto3
from pydantic import BaseModel

class S3VectorConfig(BaseModel):
    bucket_name: str = "cfp-oracle-vectors"
    index_name: str = "historical-games"
    embedding_model: str = "amazon.titan-embed-text-v2:0"
    embedding_dimension: int = 1024
    region: str = "us-east-1"

config = S3VectorConfig()

# S3 Vectors client used by every call below
s3vectors = boto3.client("s3vectors", region_name=config.region)

# Create infrastructure
s3vectors.create_vector_bucket(vectorBucketName=config.bucket_name)
s3vectors.create_index(
    vectorBucketName=config.bucket_name,
    indexName=config.index_name,
    dataType="float32",
    dimension=config.embedding_dimension,
    distanceMetric="cosine"
)

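# Load games: embed each game's text and store it alongside its metadata.
# generate_embedding() and game_to_metadata() are project helpers defined
# outside this snippet; games is the list of Game objects.
for game in games:
    game_text = create_game_text(game)
    embedding = generate_embedding(game_text)
    metadata = game_to_metadata(game)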

    s3vectors.put_vectors(
        vectorBucketName=config.bucket_name,
        indexName=config.index_name,
        vectors=[{
            "key": game.id,
            "data": {"float32": embedding},
            "metadata": metadata.model_dump(exclude_none=True)
        }]
    )
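
The loop above makes one `put_vectors` call per game. Since `vectors` is a list, several games can be sent per request. The sketch below shows one way to batch the uploads. The `chunked` and `put_vectors_batched` helpers are hypothetical, and the batch cap of 500 is an assumption to check against the current S3 Vectors limits.

```python
from typing import Any, Iterator

MAX_VECTORS_PER_CALL = 500  # assumed per-request cap; verify against service limits


def chunked(items: list[Any], size: int) -> Iterator[list[Any]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def put_vectors_batched(
    client,
    bucket_name: str,
    index_name: str,
    vectors: list[dict],
    batch_size: int = MAX_VECTORS_PER_CALL,
) -> int:
    """Upload `vectors` in batches and return the number of API calls made.

    Each entry in `vectors` uses the same shape as the loop above:
    {"key": ..., "data": {"float32": [...]}, "metadata": {...}}
    """
    calls = 0
    for batch in chunked(vectors, batch_size):
        client.put_vectors(
            vectorBucketName=bucket_name,
            indexName=index_name,
            vectors=batch,
        )
        calls += 1
    return calls
```

With 36 games the whole dataset fits in a single request, so batching matters mostly when the same pattern is reused for larger collections.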

Querying for Similar Games

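def find_similar_games(query: str, top_k: int = 5) -> list[SearchResult]:
    """Find games similar to a query description.

    SearchResult and VectorMetadata are Pydantic models defined outside
    this snippet.
    """
    # Embed the query with the same helper used for the stored games
    query_embedding = generate_embedding(query)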

    response = s3vectors.query_vectors(
        vectorBucketName=config.bucket_name,
        indexName=config.index_name,
        queryVector={"float32": query_embedding},
        topK=top_k,
        returnDistance=True,
        returnMetadata=True
    )

    results = []
    for v in response["vectors"]:
        result = SearchResult(
            key=v["key"],
            distance=v.get("distance", 0),
            metadata=VectorMetadata(**v["metadata"])
        )
        results.append(result)

    return results
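
The query returns a distance, while the demo output below reports a similarity percentage. The article does not show that conversion, so the following is a sketch under the assumption that similarity is simply `1 - cosine distance`. The function names are hypothetical.

```python
def similarity_percent(distance: float) -> float:
    """Convert a cosine distance to a similarity percentage.

    Assumes similarity = 1 - distance. The value is clamped to the
    0-100 range so unusually large distances do not print as negative.
    """
    similarity = 1.0 - distance
    clamped = max(0.0, min(1.0, similarity))
    return round(clamped * 100, 1)


def format_similarity(distance: float) -> str:
    """Render the similarity line for one search result."""
    return f"Similarity: {similarity_percent(distance):.1f}%"
```

For example, a result with `distance=0.421` would be displayed as `Similarity: 57.9%`.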

Try It: Demo Queries

Here’s what the oracle returns for topical queries:

Query: “Texas vs Ohio State Cotton Bowl semifinal”

Output
🏈 Cotton Bowl - CFP Quarterfinal (2025)
 Miami 24 - Ohio State 14
 Similarity: 57.9%

 In a stunning upset, 10-seed Miami dominated 2-seed Ohio State...
 #upset #defensive-battle #cfp-quarterfinal

🏈 Cotton Bowl - CFP Semifinal (2022)
 Alabama 27 - Cincinnati 6
 Similarity: 49.2%

🏈 Cotton Bowl - CFP Semifinal (2016)
 Alabama 38 - Michigan State 0
 Similarity: 45.7%

Query: “quarterback leads dramatic comeback to win championship”

Output
🏈 Rose Bowl - BCS Championship (2006)
 Texas 41 - USC 38
 Similarity: 33.8%

 Vince Young delivered one of the greatest performances in
 championship history, rushing for 200 yards and the game-winning
 touchdown with 19 seconds left...
 #championship #greatest-game #qb-heroics #comeback

🏈 CFP National Championship (2018)
 Alabama 26 - Georgia 23
 Similarity: 33.7%

 Tua Tagovailoa came off the bench to lead Alabama to an
 overtime victory...

Historical Games Dataset

The oracle includes 36 games spanning 2006-2025:

Era | Games | Highlights
CFP Era (2015-2025) | 30 | All semifinals, championships, 2025 first round
BCS Era (2006-2014) | 6 | Vince Young’s Texas, Alabama-LSU rematch

Tags for filtering:

  • Game type: championship, semifinal, cfp_quarterfinal
  • Drama: overtime, comeback, upset, shutout
  • Style: shootout, defensive_battle, qb_heroics
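
One way to use these tags is to pass a metadata filter to `query_vectors` through its `filter` argument. The helper below is a hypothetical sketch. It assumes each vector's metadata stores `tags` as a list of strings and `year` as a number, and it uses the `$in`, `$gte`, `$lte`, and `$and` operators as I understand them from the S3 Vectors metadata filtering documentation. Verify the operator semantics before relying on it.

```python
from typing import Optional


def build_game_filter(
    tags: Optional[list[str]] = None,
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
) -> Optional[dict]:
    """Build a metadata filter dict, or None when no constraint is given.

    Assumes metadata keys named "tags" and "year" (not confirmed by the
    article's metadata model).
    """
    clauses: list[dict] = []
    if tags:
        # Match games carrying any of the requested tags
        clauses.append({"tags": {"$in": list(tags)}})
    if min_year is not None:
        clauses.append({"year": {"$gte": min_year}})
    if max_year is not None:
        clauses.append({"year": {"$lte": max_year}})

    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


# Example: overtime games or upsets from the CFP era
cfp_era_drama = build_game_filter(tags=["overtime", "upset"], min_year=2015)
```

When the helper returns `None`, omit the `filter` argument from the query instead of passing it.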

Full Code

The complete implementation is in the largo demos:

demos/cfp-oracle/
├── cfp_oracle.py      # Main oracle with S3 Vectors
├── games_data.py      # 36 historical games
├── models.py          # Pydantic data models
└── requirements.txt   # boto3>=1.42, pydantic>=2.0

Usage:

# Setup infrastructure and load games
python cfp_oracle.py --setup

# Run demo queries
python cfp_oracle.py --demo

# Interactive mode
python cfp_oracle.py --interactive

# Single query
python cfp_oracle.py --query "high-scoring shootout"

# Cleanup
python cfp_oracle.py --cleanup
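
For readers wiring up a similar command line, the flags above could be parsed with `argparse` roughly as follows. This is a sketch, not the contents of cfp_oracle.py. Treating the five modes as mutually exclusive is my assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser for the five modes shown in the usage listing."""
    parser = argparse.ArgumentParser(
        prog="cfp_oracle.py",
        description="Semantic search over historical CFP and bowl games.",
    )
    # Assumption: exactly one mode is chosen per invocation
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--setup", action="store_true",
                      help="create the vector bucket and index, then load games")
    mode.add_argument("--demo", action="store_true",
                      help="run the built-in demo queries")
    mode.add_argument("--interactive", action="store_true",
                      help="prompt for queries in a loop")
    mode.add_argument("--query", metavar="TEXT",
                      help="run a single natural language query")
    mode.add_argument("--cleanup", action="store_true",
                      help="delete the index and vector bucket")
    return parser


# Parsing the single-query invocation from the usage listing
args = build_parser().parse_args(["--query", "high-scoring shootout"])
```

After parsing, the script would dispatch on whichever attribute is set, for example calling the search function when `args.query` is not `None`.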

What’s Next

You’ve built a semantic search system for sports history. The same pattern works for:

  • Document search - Find similar contracts, papers, or articles
  • Product recommendations - “Show me products like this one”
  • Support tickets - Find similar past issues and resolutions
  • Code search - Find similar implementations across a codebase

Key takeaways:

  1. Rich text representations improve search quality
  2. Pydantic ensures consistent data handling
  3. S3 Vectors is cost-effective for infrequent queries
  4. Narrative similarity beats keyword matching for exploratory search
