Cross-Encoders: Precision Reranking for Search
When bi-encoders aren't accurate enough, cross-encoders dramatically improve search relevance. Build a two-stage retrieval system with MS MARCO rerankers and sentence-transformers.
Your bi-encoder retrieved the top 100 documents. The right answer is in there—probably. But it’s ranked #47, buried below dozens of marginally relevant results.
Bi-encoders are fast because they encode queries and documents independently. But that independence is also their weakness: they can’t see how query words interact with document words. A bi-encoder encodes “python” the same way whether the query is about snakes or programming.
Cross-encoders solve this by processing query and document together, allowing full attention between every query token and every document token. The result: dramatically better relevance judgments.
The tradeoff? Speed. Cross-encoders must run inference for every query-document pair. That’s why we use them for reranking, not initial retrieval.
Bi-Encoder vs Cross-Encoder: The Key Difference
Bi-Encoder Architecture
Query: "python tutorial" → Encoder → [0.23, -0.41, ...]
↓
Cosine Similarity = 0.72
↑
Document: "Learn Python..." → Encoder → [0.19, -0.38, ...]
Query and document are encoded separately. They only interact at the final similarity computation.
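In code, the bi-encoder path is two independent encode calls plus a dot product. A minimal sketch using all-MiniLM-L6-v2, the same model the pipeline below uses:
from sentence_transformers import SentenceTransformer
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Encode query and document independently of each other
q = bi_encoder.encode("python tutorial", normalize_embeddings=True)
d = bi_encoder.encode("Learn Python programming from scratch", normalize_embeddings=True)
# With normalized embeddings, the dot product equals the cosine similarity
print(float(q @ d))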
Cross-Encoder Architecture
┌─────────────────────────────────────┐
│ [CLS] python tutorial [SEP] Learn │
│ Python programming from scratch... │
└─────────────────────────────────────┘
↓
Transformer
(Full Attention)
↓
Relevance Score: 0.94
Query and document are concatenated and processed together. Every query token attends to every document token through the transformer’s self-attention layers.
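To make the concatenation concrete, here is a minimal sketch (using the Hugging Face tokenizer bundled with the MS MARCO model used throughout this post) that prints the single sequence the transformer actually sees:
from transformers import AutoTokenizer
# The same tokenizer the cross-encoder uses internally
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Passing the two texts as a pair produces one sequence:
# [CLS] query tokens [SEP] document tokens [SEP]
encoded = tokenizer("python tutorial", "Learn Python programming from scratch")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# e.g. ['[CLS]', 'python', 'tutorial', '[SEP]', 'learn', ...] (exact tokens depend on the vocabulary)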
Why Cross-Encoders Are More Accurate
Cross-encoders can capture:
- Token-level interactions: “python” in the query can attend to “programming” vs “snake” in the document
- Negation handling: “not recommended” actually means the opposite of “recommended”
- Complex reasoning: “best laptop under $1000” requires understanding price constraints
- Query-document overlap: Exact matches can be weighted appropriately
Cross-encoders typically outperform bi-encoders on retrieval benchmarks, with gains varying by dataset and query complexity. The tradeoff is computational cost: cross-encoders must process each query-document pair individually.
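A quick illustration of the first two points. This is a minimal sketch using the MS MARCO model introduced in the next section; exact scores vary, but in practice the negated document scores noticeably lower:
from sentence_transformers import CrossEncoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "is this laptop recommended for students"
docs = [
    "This laptop is highly recommended for students",
    "This laptop is not recommended for students",
]
scores = model.predict([[query, doc] for doc in docs])
for doc, score in zip(docs, scores):
    # The negation flips relevance even though token overlap is nearly identical
    print(f"[{score:.2f}] {doc}")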
Building a Cross-Encoder Reranker
Let’s build a reranker that takes bi-encoder candidates and returns precisely ranked results.
Step 1: Load a Pre-trained Cross-Encoder
from sentence_transformers import CrossEncoder
# Load a cross-encoder trained for relevance scoring
# ms-marco models are trained on real search queries
# Model card: https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print(f"Model loaded: {model.model.__class__.__name__}")
Model loaded: BertForSequenceClassification
Step 2: Score Query-Document Pairs
# Single pair scoring
query = "how to learn python programming"
documents = [
"Python is a popular programming language for beginners",
"The python snake is found in tropical regions",
"Complete Python tutorial for web development",
"Java vs Python: which should you learn first?",
]
# Score each pair
pairs = [[query, doc] for doc in documents]
scores = model.predict(pairs)
# Display results
print(f"Query: '{query}'\n")
for doc, score in sorted(zip(documents, scores), key=lambda x: x[1], reverse=True):
print(f"[{score:.4f}] {doc}")
Query: 'how to learn python programming'
[8.4521] Complete Python tutorial for web development
[7.2134] Python is a popular programming language for beginners
[5.8967] Java vs Python: which should you learn first?
[-3.2145] The python snake is found in tropical regions
Notice how the cross-encoder:
- Correctly ranks the tutorial highest
- Gives a strongly negative score to the snake document
- Understands that comparing languages is somewhat relevant
Note that these scores are raw logits, not probabilities: they are unbounded, and irrelevant pairs get strongly negative values. Apply a sigmoid if you need scores in [0, 1] (we do exactly that in the fusion section below).
Step 3: Build the Reranking Pipeline
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
import faiss
class TwoStageSearch:
def __init__(
self,
bi_encoder_model='all-MiniLM-L6-v2',
cross_encoder_model='cross-encoder/ms-marco-MiniLM-L-6-v2'
):
self.bi_encoder = SentenceTransformer(bi_encoder_model)
self.cross_encoder = CrossEncoder(cross_encoder_model)
self.index = None
self.documents = None
def index_documents(self, documents):
"""Index documents with bi-encoder for fast retrieval."""
self.documents = documents
embeddings = self.bi_encoder.encode(
documents,
convert_to_numpy=True,
normalize_embeddings=True,
show_progress_bar=True
).astype('float32')
dimension = embeddings.shape[1]
self.index = faiss.IndexFlatIP(dimension)
self.index.add(embeddings)
return len(documents)
def search(self, query, top_k=10, rerank_top_n=100):
"""
Two-stage search:
        1. Bi-encoder retrieves rerank_top_n candidates
        2. Cross-encoder reranks them and returns the top_k results
"""
# Stage 1: Fast bi-encoder retrieval
query_embedding = self.bi_encoder.encode(
query,
convert_to_numpy=True,
normalize_embeddings=True
).reshape(1, -1).astype('float32')
# Get more candidates than we need for reranking
n_candidates = min(rerank_top_n, len(self.documents))
bi_scores, indices = self.index.search(query_embedding, n_candidates)
candidates = [self.documents[idx] for idx in indices[0]]
# Stage 2: Cross-encoder reranking
pairs = [[query, doc] for doc in candidates]
cross_scores = self.cross_encoder.predict(pairs)
# Combine indices with scores and sort
results = list(zip(indices[0], candidates, bi_scores[0], cross_scores))
results.sort(key=lambda x: x[3], reverse=True) # Sort by cross-encoder score
# Return top_k with both scores
return [
{
'document': doc,
'bi_encoder_score': float(bi_score),
'cross_encoder_score': float(cross_score),
'original_index': int(idx)
}
for idx, doc, bi_score, cross_score in results[:top_k]
]
# Example usage
documents = [
"How to change a flat tire on the highway",
"Python programming tutorial for beginners",
"The best chocolate chip cookie recipe",
"Understanding machine learning algorithms",
"Guide to fixing bicycle punctures",
"Introduction to natural language processing",
"Car maintenance tips and tricks",
"Learn Python for data science",
"Baking sourdough bread at home",
"Deep learning with PyTorch tutorial",
]
engine = TwoStageSearch()
engine.index_documents(documents)
results = engine.search("python machine learning tutorial", top_k=5)
print("Query: 'python machine learning tutorial'\n")
print(f"{'Document':<50} {'Bi-Enc':>8} {'Cross-Enc':>10}")
print("-" * 70)
for r in results:
doc = r['document'][:47] + "..." if len(r['document']) > 50 else r['document']
print(f"{doc:<50} {r['bi_encoder_score']:>8.3f} {r['cross_encoder_score']:>10.3f}")
Query: 'python machine learning tutorial'
Document                                             Bi-Enc  Cross-Enc
----------------------------------------------------------------------
Deep learning with PyTorch tutorial                   0.612      8.234
Learn Python for data science                         0.589      7.891
Python programming tutorial for beginners             0.634      6.543
Understanding machine learning algorithms             0.521      5.234
Introduction to natural language processing           0.456      3.876
Look at the reranking in action:
- “Deep learning with PyTorch tutorial” moved from rank 2 (bi-encoder, 0.612) to rank 1, overtaking “Python programming tutorial for beginners” (0.634), the bi-encoder’s top pick
- The cross-encoder understood that PyTorch + deep learning is more relevant to a “python machine learning tutorial” query
Batch Processing for Speed
Cross-encoders are slow compared to bi-encoders. Batch processing helps (CrossEncoder.predict already accepts a batch_size argument; the explicit loop below just makes the batching visible):
def rerank_batch(query, candidates, cross_encoder, batch_size=32):
"""
Efficient batch reranking.
"""
pairs = [[query, doc] for doc in candidates]
# Process in batches
all_scores = []
for i in range(0, len(pairs), batch_size):
batch = pairs[i:i + batch_size]
scores = cross_encoder.predict(batch, show_progress_bar=False)
all_scores.extend(scores)
return all_scores
# Timing comparison
import time
query = "machine learning best practices"
candidates = documents * 10 # 100 candidates
start = time.time()
scores = rerank_batch(query, candidates, engine.cross_encoder, batch_size=32)
elapsed = time.time() - start
print(f"Reranked {len(candidates)} documents in {elapsed:.3f}s")
print(f"Throughput: {len(candidates) / elapsed:.1f} docs/sec")
Reranked 100 documents in 0.342s
Throughput: 292.4 docs/sec
Note: The output above includes Python overhead and first-run model warmup. Sustained batch inference is significantly faster—see benchmarks below.
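If you benchmark yourself, issue one throwaway prediction first so those one-time costs stay out of the measured window. A minimal sketch, reusing query, candidates, and engine from the timing example above:
import time
# Warm up: the first predict() call pays one-time costs
# (moving weights onto the device, CUDA context initialization)
engine.cross_encoder.predict([["warmup query", "warmup document"]])
pairs = [[query, doc] for doc in candidates]
start = time.time()
engine.cross_encoder.predict(pairs)
print(f"Warm latency: {(time.time() - start) * 1000:.1f} ms for {len(pairs)} pairs")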
Speed Benchmarks
On an NVIDIA T4 GPU with ms-marco-MiniLM-L-6-v2 (warm model, batch inference only):
| Candidates | Latency | Throughput |
|---|---|---|
| 10 | ~12ms | ~830/sec |
| 100 | ~95ms | ~1050/sec |
| 1000 | ~890ms | ~1120/sec |
Rule of thumb: Rerank 50-100 candidates for a good accuracy/latency tradeoff.
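One way to apply this in code is to derive the candidate count from a latency budget and the throughput you measure on your own hardware. A hypothetical helper (candidates_for_budget is not part of any library; the numbers are illustrative):
def candidates_for_budget(budget_ms, docs_per_sec, floor=10, ceiling=200):
    """Pick a rerank candidate count that fits a latency budget."""
    affordable = int(budget_ms / 1000 * docs_per_sec)
    return max(floor, min(affordable, ceiling))
# A 100 ms rerank budget at ~1000 docs/sec affords ~100 candidates
print(candidates_for_budget(budget_ms=100, docs_per_sec=1000))  # 100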
Training a Custom Cross-Encoder
For domain-specific search, fine-tuning dramatically improves results.
Training Data Format
Cross-encoders need query-document pairs with relevance labels:
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader
# Training examples: (query, document, label)
# Labels can be binary (0/1) or continuous (0.0-1.0)
train_examples = [
InputExample(texts=["python tutorial", "Learn Python programming"], label=1.0),
InputExample(texts=["python tutorial", "Snake species guide"], label=0.0),
InputExample(texts=["fix flat tire", "How to change a tire"], label=1.0),
InputExample(texts=["fix flat tire", "Cake recipe"], label=0.0),
InputExample(texts=["machine learning", "Deep learning basics"], label=0.8), # Partial relevance
# ... more examples
]
# Create dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
Fine-tuning
from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator
# Initialize from pre-trained model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', num_labels=1)
# Prepare evaluation data
eval_examples = [
InputExample(texts=["test query", "relevant doc"], label=1),
InputExample(texts=["test query", "irrelevant doc"], label=0),
]
evaluator = CEBinaryClassificationEvaluator.from_input_examples(eval_examples)
# Fine-tune
model.fit(
train_dataloader=train_dataloader,
evaluator=evaluator,
epochs=3,
warmup_steps=100,
output_path='./fine-tuned-cross-encoder',
evaluation_steps=500,
)
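Once training finishes, the directory saved at output_path loads like any other cross-encoder checkpoint:
# Load the fine-tuned model from output_path and score a pair
tuned = CrossEncoder('./fine-tuned-cross-encoder')
print(tuned.predict([["python tutorial", "Learn Python programming"]]))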
When to Use Cross-Encoders
Good Use Cases
| Scenario | Why Cross-Encoder Helps |
|---|---|
| Complex queries | "laptop under $1000 with good battery" requires constraint understanding |
| High-stakes search | Legal, medical—accuracy matters more than latency |
| Negative reasoning | "not python" should exclude Python results |
| Exact match matters | Product codes, legal citations |
When to Skip
| Scenario | Better Alternative |
|---|---|
| Autocomplete | Bi-encoder only (must be < 50ms) |
| Billions of docs | Multi-stage: bi-encoder → ANN → cross-encoder |
| Real-time feeds | Pre-computed bi-encoder scores |
Combining Scores: Weighted Fusion
Sometimes you want to blend bi-encoder and cross-encoder scores:
def hybrid_score(bi_score, cross_score, alpha=0.3):
"""
Weighted combination of bi-encoder and cross-encoder scores.
alpha: weight for bi-encoder (0 = cross-encoder only, 1 = bi-encoder only)
"""
# Cross-encoder outputs logits (unbounded, typically -10 to +10 range)
# Sigmoid maps these to [0, 1] probability-like scores for fair weighting
    # Bi-encoder scores are cosine similarities (range [-1, 1], typically positive here)
import math
cross_normalized = 1 / (1 + math.exp(-cross_score))
return alpha * bi_score + (1 - alpha) * cross_normalized
def search_with_fusion(engine, query, top_k=5, rerank_top_n=100, alpha=0.2):
"""Search with score fusion."""
results = engine.search(query, top_k=rerank_top_n, rerank_top_n=rerank_top_n)
for r in results:
r['hybrid_score'] = hybrid_score(
r['bi_encoder_score'],
r['cross_encoder_score'],
alpha
)
# Re-sort by hybrid score
results.sort(key=lambda x: x['hybrid_score'], reverse=True)
return results[:top_k]
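Wiring it up with the engine indexed earlier (scores are illustrative and depend on the models):
results = search_with_fusion(engine, "python machine learning tutorial", top_k=3, alpha=0.2)
for r in results:
    print(f"[{r['hybrid_score']:.3f}] {r['document']}")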
Complete Reranking Pipeline
Here’s a production-ready implementation:
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
import faiss
from typing import List, Dict
import time
class ProductionReranker:
"""
Production-ready two-stage search with monitoring.
"""
def __init__(
self,
bi_encoder: str = 'all-MiniLM-L6-v2',
cross_encoder: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2',
rerank_top_n: int = 100
):
self.bi_encoder = SentenceTransformer(bi_encoder)
self.cross_encoder = CrossEncoder(cross_encoder)
self.rerank_top_n = rerank_top_n
self.index = None
self.documents = None
def index_documents(self, documents: List[str]) -> int:
self.documents = documents
embeddings = self.bi_encoder.encode(
documents,
convert_to_numpy=True,
normalize_embeddings=True,
show_progress_bar=True
).astype('float32')
self.index = faiss.IndexFlatIP(embeddings.shape[1])
self.index.add(embeddings)
return len(documents)
def search(self, query: str, top_k: int = 10) -> Dict:
"""
Search with timing and diagnostics.
"""
timings = {}
# Stage 1: Bi-encoder retrieval
start = time.time()
query_emb = self.bi_encoder.encode(
query,
convert_to_numpy=True,
normalize_embeddings=True
).reshape(1, -1).astype('float32')
n_candidates = min(self.rerank_top_n, len(self.documents))
bi_scores, indices = self.index.search(query_emb, n_candidates)
timings['retrieval_ms'] = (time.time() - start) * 1000
candidates = [self.documents[idx] for idx in indices[0]]
# Stage 2: Cross-encoder reranking
start = time.time()
pairs = [[query, doc] for doc in candidates]
cross_scores = self.cross_encoder.predict(pairs)
timings['rerank_ms'] = (time.time() - start) * 1000
# Sort by cross-encoder score
combined = list(zip(indices[0], candidates, bi_scores[0], cross_scores))
combined.sort(key=lambda x: x[3], reverse=True)
results = [
{
'document': doc,
'bi_score': float(bi),
'cross_score': float(cross),
'index': int(idx)
}
for idx, doc, bi, cross in combined[:top_k]
]
timings['total_ms'] = timings['retrieval_ms'] + timings['rerank_ms']
return {
'query': query,
'results': results,
'timings': timings,
'candidates_reranked': n_candidates
}
# Demo
reranker = ProductionReranker()
reranker.index_documents(documents)
response = reranker.search("learn python for ML")
print(f"Query: '{response['query']}'")
print(f"Timings: retrieval={response['timings']['retrieval_ms']:.1f}ms, "
f"rerank={response['timings']['rerank_ms']:.1f}ms")
print(f"\nTop Results:")
for r in response['results'][:3]:
print(f" [{r['cross_score']:.2f}] {r['document']}")
Query: 'learn python for ML'
Timings: retrieval=2.3ms, rerank=45.2ms
Top Results:
  [7.89] Learn Python for data science
  [7.23] Deep learning with PyTorch tutorial
  [5.67] Python programming tutorial for beginners
What’s Next
You now have both pieces of production search:
- Bi-encoders for fast retrieval over millions of documents
- Cross-encoders for precise reranking of top candidates
In the next tutorial, we’ll combine these into a complete Two-Stage Retrieval system with:
- Optimal candidate selection
- Latency budgeting
- Fallback strategies
- Production deployment patterns
Key Takeaways
- Cross-encoders process query + document together—enabling rich token-level interactions
- They’re consistently more accurate than bi-encoders on complex queries, with gains that vary by dataset and query type
- Use them for reranking, not retrieval—latency scales with candidate count
- Rerank 50-100 candidates for good accuracy/speed tradeoff
- Fine-tune on your domain for best results
- Monitor both retrieval and rerank latency in production
Further Reading
- Cross-Encoders documentation — Usage and models
- MS MARCO leaderboard — Benchmark results
- ColBERT paper — Late interaction, a middle ground between bi-encoder speed and cross-encoder accuracy
- Pretrained Cross-Encoders — Model hub