Cross-Encoders: Precision Reranking for Search
When bi-encoders aren't accurate enough, cross-encoders dramatically improve search relevance. Build a two-stage retrieval system with MS MARCO rerankers and sentence-transformers.
Your bi-encoder retrieved the top 100 documents. The right answer is in there—probably. But it’s ranked #47, buried below dozens of marginally relevant results.
Bi-encoders are fast because they encode queries and documents independently. But that independence is also their weakness: they can’t see how query words interact with document words. A bi-encoder encodes “python” the same way whether the query is about snakes or programming.
Cross-encoders solve this by processing query and document together, allowing full attention between every query token and every document token. The result: dramatically better relevance judgments.
The tradeoff? Speed. Cross-encoders must run inference for every query-document pair. That’s why we use them for reranking, not initial retrieval.
Bi-Encoder vs Cross-Encoder: The Key Difference
Bi-Encoder Architecture
Query: "python tutorial" → Encoder → [0.23, -0.41, ...]
↓
Cosine Similarity = 0.72
↑
Document: "Learn Python..." → Encoder → [0.19, -0.38, ...]
Query and document are encoded separately. They only interact at the final similarity computation.
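In code, the bi-encoder path is two independent encode calls plus a dot product. A minimal sketch using all-MiniLM-L6-v2, the same model the pipeline below uses:
from sentence_transformers import SentenceTransformer
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Encode query and document independently of each other
q = bi_encoder.encode("python tutorial", normalize_embeddings=True)
d = bi_encoder.encode("Learn Python programming from scratch", normalize_embeddings=True)
# With normalized embeddings, the dot product equals the cosine similarity
print(float(q @ d))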
Cross-Encoder Architecture
┌─────────────────────────────────────┐
│ [CLS] python tutorial [SEP] Learn │
│ Python programming from scratch... │
└─────────────────────────────────────┘
↓
Transformer
(Full Attention)
↓
Relevance Score: 0.94
Query and document are concatenated and processed together. Every query token attends to every document token through the transformer’s self-attention layers.
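To make the concatenation concrete, here is a minimal sketch (using the Hugging Face tokenizer bundled with the MS MARCO model used throughout this post) that prints the single sequence the transformer actually sees:
from transformers import AutoTokenizer
# The same tokenizer the cross-encoder uses internally
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Passing the two texts as a pair produces one sequence:
# [CLS] query tokens [SEP] document tokens [SEP]
encoded = tokenizer("python tutorial", "Learn Python programming from scratch")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# e.g. ['[CLS]', 'python', 'tutorial', '[SEP]', 'learn', ...] (exact tokens depend on the vocabulary)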
Why Cross-Encoders Are More Accurate
Cross-encoders can capture:
- Token-level interactions: “python” in the query can attend to “programming” vs “snake” in the document
- Negation handling: “not recommended” actually means the opposite of “recommended”
- Complex reasoning: “best laptop under $1000” requires understanding price constraints
- Query-document overlap: Exact matches can be weighted appropriately
Cross-encoders typically outperform bi-encoders on retrieval benchmarks, with gains varying by dataset and query complexity. The tradeoff is computational cost: cross-encoders must process each query-document pair individually.
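A quick illustration of the first two points. This is a minimal sketch using the MS MARCO model introduced in the next section; exact scores vary, but in practice the negated document scores noticeably lower:
from sentence_transformers import CrossEncoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "is this laptop recommended for students"
docs = [
    "This laptop is highly recommended for students",
    "This laptop is not recommended for students",
]
scores = model.predict([[query, doc] for doc in docs])
for doc, score in zip(docs, scores):
    # The negation flips relevance even though token overlap is nearly identical
    print(f"[{score:.2f}] {doc}")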
Building a Cross-Encoder Reranker
Let’s build a reranker that takes bi-encoder candidates and returns precisely ranked results.
Step 1: Load a Pre-trained Cross-Encoder
from sentence_transformers import CrossEncoder
# Load a cross-encoder trained for relevance scoring
# ms-marco models are trained on real search queries
# Model card: https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print(f"Model loaded: {model.model.__class__.__name__}")
Model loaded: BertForSequenceClassification
Step 2: Score Query-Document Pairs
# Single pair scoring
query = "how to learn python programming"
documents = [
"Python is a popular programming language for beginners",
"The python snake is found in tropical regions",
"Complete Python tutorial for web development",
"Java vs Python: which should you learn first?",
]
# Score each pair
pairs = [[query, doc] for doc in documents]
scores = model.predict(pairs)
# Display results
print(f"Query: '{query}'\n")
for doc, score in sorted(zip(documents, scores), key=lambda x: x[1], reverse=True):
print(f"[{score:.4f}] {doc}")
Query: 'how to learn python programming'
[8.4521] Complete Python tutorial for web development
[7.2134] Python is a popular programming language for beginners
[5.8967] Java vs Python: which should you learn first?
[-3.2145] The python snake is found in tropical regions
Notice how the cross-encoder:
- Correctly ranks the tutorial highest
- Gives a strongly negative score to the snake document
- Understands that comparing languages is somewhat relevant
Note that these scores are raw logits, not probabilities: they are unbounded, and irrelevant pairs get strongly negative values. Apply a sigmoid if you need scores in [0, 1] (we do exactly that in the fusion section below).
Step 3: Build the Reranking Pipeline
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
import faiss
class TwoStageSearch:
def __init__(
self,
bi_encoder_model='all-MiniLM-L6-v2',
cross_encoder_model='cross-encoder/ms-marco-MiniLM-L-6-v2'
):
self.bi_encoder = SentenceTransformer(bi_encoder_model)
self.cross_encoder = CrossEncoder(cross_encoder_model)
self.index = None
self.documents = None
def index_documents(self, documents):
"""Index documents with bi-encoder for fast retrieval."""
self.documents = documents
embeddings = self.bi_encoder.encode(
documents,
convert_to_numpy=True,
normalize_embeddings=True,
show_progress_bar=True
).astype('float32')
dimension = embeddings.shape[1]
self.index = faiss.IndexFlatIP(dimension)
self.index.add(embeddings)
return len(documents)
def search(self, query, top_k=10, rerank_top_n=100):
"""
Two-stage search:
        1. Bi-encoder retrieves rerank_top_n candidates
        2. Cross-encoder reranks them and returns the top_k results
"""
# Stage 1: Fast bi-encoder retrieval
query_embedding = self.bi_encoder.encode(
query,
convert_to_numpy=True,
normalize_embeddings=True
).reshape(1, -1).astype('float32')
# Get more candidates than we need for reranking
n_candidates = min(rerank_top_n, len(self.documents))
bi_scores, indices = self.index.search(query_embedding, n_candidates)
candidates = [self.documents[idx] for idx in indices[0]]
# Stage 2: Cross-encoder reranking
pairs = [[query, doc] for doc in candidates]
cross_scores = self.cross_encoder.predict(pairs)
# Combine indices with scores and sort
results = list(zip(indices[0], candidates, bi_scores[0], cross_scores))
results.sort(key=lambda x: x[3], reverse=True) # Sort by cross-encoder score
# Return top_k with both scores
return [
{
'document': doc,
'bi_encoder_score': float(bi_score),
'cross_encoder_score': float(cross_score),
'original_index': int(idx)
}
for idx, doc, bi_score, cross_score in results[:top_k]
]
# Example usage
documents = [
"How to change a flat tire on the highway",
"Python programming tutorial for beginners",
"The best chocolate chip cookie recipe",
"Understanding machine learning algorithms",
"Guide to fixing bicycle punctures",
"Introduction to natural language processing",
"Car maintenance tips and tricks",
"Learn Python for data science",
"Baking sourdough bread at home",
"Deep learning with PyTorch tutorial",
]
engine = TwoStageSearch()
engine.index_documents(documents)
results = engine.search("python machine learning tutorial", top_k=5)
print("Query: 'python machine learning tutorial'\n")
print(f"{'Document':<50} {'Bi-Enc':>8} {'Cross-Enc':>10}")
print("-" * 70)
for r in results:
doc = r['document'][:47] + "..." if len(r['document']) > 50 else r['document']
print(f"{doc:<50} {r['bi_encoder_score']:>8.3f} {r['cross_encoder_score']:>10.3f}")
Query: 'python machine learning tutorial'
Document                                             Bi-Enc  Cross-Enc
----------------------------------------------------------------------
Deep learning with PyTorch tutorial                   0.612      8.234
Learn Python for data science                         0.589      7.891
Python programming tutorial for beginners             0.634      6.543
Understanding machine learning algorithms             0.521      5.234
Introduction to natural language processing           0.456      3.876
Look at the reranking in action:
- “Deep learning with PyTorch tutorial” moved from rank 2 (bi-encoder, 0.612) to rank 1, overtaking “Python programming tutorial for beginners” (0.634), the bi-encoder’s top pick
- The cross-encoder understood that PyTorch + deep learning is more relevant to a “python machine learning tutorial” query
Batch Processing for Speed
Cross-encoders are slow compared to bi-encoders. Batch processing helps (CrossEncoder.predict already accepts a batch_size argument; the explicit loop below just makes the batching visible):
def rerank_batch(query, candidates, cross_encoder, batch_size=32):
"""
Efficient batch reranking.
"""
pairs = [[query, doc] for doc in candidates]
# Process in batches
all_scores = []
for i in range(0, len(pairs), batch_size):
batch = pairs[i:i + batch_size]
scores = cross_encoder.predict(batch, show_progress_bar=False)
all_scores.extend(scores)
return all_scores
# Timing comparison
import time
query = "machine learning best practices"
candidates = documents * 10 # 100 candidates
start = time.time()
scores = rerank_batch(query, candidates, engine.cross_encoder, batch_size=32)
elapsed = time.time() - start
print(f"Reranked {len(candidates)} documents in {elapsed:.3f}s")
print(f"Throughput: {len(candidates) / elapsed:.1f} docs/sec")
Reranked 100 documents in 0.342s
Throughput: 292.4 docs/sec
Note: The output above includes Python overhead and first-run model warmup. Sustained batch inference is significantly faster—see benchmarks below.
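If you benchmark yourself, issue one throwaway prediction first so those one-time costs stay out of the measured window. A minimal sketch, reusing query, candidates, and engine from the timing example above:
import time
# Warm up: the first predict() call pays one-time costs
# (moving weights onto the device, CUDA context initialization)
engine.cross_encoder.predict([["warmup query", "warmup document"]])
pairs = [[query, doc] for doc in candidates]
start = time.time()
engine.cross_encoder.predict(pairs)
print(f"Warm latency: {(time.time() - start) * 1000:.1f} ms for {len(pairs)} pairs")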
Speed Benchmarks
On an NVIDIA T4 GPU with ms-marco-MiniLM-L-6-v2 (warm model, batch inference only):
| Candidates | Latency | Throughput |
|---|---|---|
| 10 | ~12ms | ~830/sec |
| 100 | ~95ms | ~1050/sec |
| 1000 | ~890ms | ~1120/sec |
Rule of thumb: Rerank 50-100 candidates for a good accuracy/latency tradeoff.
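One way to apply this in code is to derive the candidate count from a latency budget and the throughput you measure on your own hardware. A hypothetical helper (candidates_for_budget is not part of any library; the numbers are illustrative):
def candidates_for_budget(budget_ms, docs_per_sec, floor=10, ceiling=200):
    """Pick a rerank candidate count that fits a latency budget."""
    affordable = int(budget_ms / 1000 * docs_per_sec)
    return max(floor, min(affordable, ceiling))
# A 100 ms rerank budget at ~1000 docs/sec affords ~100 candidates
print(candidates_for_budget(budget_ms=100, docs_per_sec=1000))  # 100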
Training a Custom Cross-Encoder
For domain-specific search, fine-tuning dramatically improves results.
Training Data Format
Cross-encoders need query-document pairs with relevance labels:
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader
# Training examples: (query, document, label)
# Labels can be binary (0/1) or continuous (0.0-1.0)
train_examples = [
InputExample(texts=["python tutorial", "Learn Python programming"], label=1.0),
InputExample(texts=["python tutorial", "Snake species guide"], label=0.0),
InputExample(texts=["fix flat tire", "How to change a tire"], label=1.0),
InputExample(texts=["fix flat tire", "Cake recipe"], label=0.0),
InputExample(texts=["machine learning", "Deep learning basics"], label=0.8), # Partial relevance
# ... more examples
]
# Create dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
Fine-tuning
from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator
# Initialize from pre-trained model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', num_labels=1)
# Prepare evaluation data
eval_examples = [
InputExample(texts=["test query", "relevant doc"], label=1),
InputExample(texts=["test query", "irrelevant doc"], label=0),
]
evaluator = CEBinaryClassificationEvaluator.from_input_examples(eval_examples)
# Fine-tune
model.fit(
train_dataloader=train_dataloader,
evaluator=evaluator,
epochs=3,
warmup_steps=100,
output_path='./fine-tuned-cross-encoder',
evaluation_steps=500,
)
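Once training finishes, the directory saved at output_path loads like any other cross-encoder checkpoint:
# Load the fine-tuned model from output_path and score a pair
tuned = CrossEncoder('./fine-tuned-cross-encoder')
print(tuned.predict([["python tutorial", "Learn Python programming"]]))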
When to Use Cross-Encoders
Good Use Cases
| Scenario | Why Cross-Encoder Helps |
|---|---|
| Complex queries | "laptop under $1000 with good battery" requires constraint understanding |
| High-stakes search | Legal, medical—accuracy matters more than latency |
| Negative reasoning | "not python" should exclude Python results |
| Exact match matters | Product codes, legal citations |
When to Skip
| Scenario | Better Alternative |
|---|---|
| Autocomplete | Bi-encoder only (must be < 50ms) |
| Billions of docs | Multi-stage: bi-encoder → ANN → cross-encoder |
| Real-time feeds | Pre-computed bi-encoder scores |
Combining Scores: Weighted Fusion
Sometimes you want to blend bi-encoder and cross-encoder scores:
def hybrid_score(bi_score, cross_score, alpha=0.3):
"""
Weighted combination of bi-encoder and cross-encoder scores.
alpha: weight for bi-encoder (0 = cross-encoder only, 1 = bi-encoder only)
"""
# Cross-encoder outputs logits (unbounded, typically -10 to +10 range)
# Sigmoid maps these to [0, 1] probability-like scores for fair weighting
    # Bi-encoder scores are cosine similarities (range [-1, 1], typically positive here)
import math
cross_normalized = 1 / (1 + math.exp(-cross_score))
return alpha * bi_score + (1 - alpha) * cross_normalized
def search_with_fusion(engine, query, top_k=5, rerank_top_n=100, alpha=0.2):
"""Search with score fusion."""
results = engine.search(query, top_k=rerank_top_n, rerank_top_n=rerank_top_n)
for r in results:
r['hybrid_score'] = hybrid_score(
r['bi_encoder_score'],
r['cross_encoder_score'],
alpha
)
# Re-sort by hybrid score
results.sort(key=lambda x: x['hybrid_score'], reverse=True)
return results[:top_k]
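Wiring it up with the engine indexed earlier (scores are illustrative and depend on the models):
results = search_with_fusion(engine, "python machine learning tutorial", top_k=3, alpha=0.2)
for r in results:
    print(f"[{r['hybrid_score']:.3f}] {r['document']}")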
Complete Reranking Pipeline
Here’s a production-ready implementation:
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
import faiss
from typing import List, Dict
import time
class ProductionReranker:
"""
Production-ready two-stage search with monitoring.
"""
def __init__(
self,
bi_encoder: str = 'all-MiniLM-L6-v2',
cross_encoder: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2',
rerank_top_n: int = 100
):
self.bi_encoder = SentenceTransformer(bi_encoder)
self.cross_encoder = CrossEncoder(cross_encoder)
self.rerank_top_n = rerank_top_n
self.index = None
self.documents = None
def index_documents(self, documents: List[str]) -> int:
self.documents = documents
embeddings = self.bi_encoder.encode(
documents,
convert_to_numpy=True,
normalize_embeddings=True,
show_progress_bar=True
).astype('float32')
self.index = faiss.IndexFlatIP(embeddings.shape[1])
self.index.add(embeddings)
return len(documents)
def search(self, query: str, top_k: int = 10) -> Dict:
"""
Search with timing and diagnostics.
"""
timings = {}
# Stage 1: Bi-encoder retrieval
start = time.time()
query_emb = self.bi_encoder.encode(
query,
convert_to_numpy=True,
normalize_embeddings=True
).reshape(1, -1).astype('float32')
n_candidates = min(self.rerank_top_n, len(self.documents))
bi_scores, indices = self.index.search(query_emb, n_candidates)
timings['retrieval_ms'] = (time.time() - start) * 1000
candidates = [self.documents[idx] for idx in indices[0]]
# Stage 2: Cross-encoder reranking
start = time.time()
pairs = [[query, doc] for doc in candidates]
cross_scores = self.cross_encoder.predict(pairs)
timings['rerank_ms'] = (time.time() - start) * 1000
# Sort by cross-encoder score
combined = list(zip(indices[0], candidates, bi_scores[0], cross_scores))
combined.sort(key=lambda x: x[3], reverse=True)
results = [
{
'document': doc,
'bi_score': float(bi),
'cross_score': float(cross),
'index': int(idx)
}
for idx, doc, bi, cross in combined[:top_k]
]
timings['total_ms'] = timings['retrieval_ms'] + timings['rerank_ms']
return {
'query': query,
'results': results,
'timings': timings,
'candidates_reranked': n_candidates
}
# Demo
reranker = ProductionReranker()
reranker.index_documents(documents)
response = reranker.search("learn python for ML")
print(f"Query: '{response['query']}'")
print(f"Timings: retrieval={response['timings']['retrieval_ms']:.1f}ms, "
f"rerank={response['timings']['rerank_ms']:.1f}ms")
print(f"\nTop Results:")
for r in response['results'][:3]:
print(f" [{r['cross_score']:.2f}] {r['document']}")
Query: 'learn python for ML'
Timings: retrieval=2.3ms, rerank=45.2ms
Top Results:
  [7.89] Learn Python for data science
  [7.23] Deep learning with PyTorch tutorial
  [5.67] Python programming tutorial for beginners
What’s Next
You now have both pieces of production search:
- Bi-encoders for fast retrieval over millions of documents
- Cross-encoders for precise reranking of top candidates
In the next tutorial, we’ll combine these into a complete Two-Stage Retrieval system with:
- Optimal candidate selection
- Latency budgeting
- Fallback strategies
- Production deployment patterns
Key Takeaways
- Cross-encoders process query + document together—enabling rich token-level interactions
- They’re consistently more accurate than bi-encoders on complex queries, with gains that vary by dataset and query type
- Use them for reranking, not retrieval—latency scales with candidate count
- Rerank 50-100 candidates for good accuracy/speed tradeoff
- Fine-tune on your domain for best results
- Monitor both retrieval and rerank latency in production
Further Reading
- Cross-Encoders documentation — Usage and models
- MS MARCO leaderboard — Benchmark results
- ColBERT paper — Late interaction, a middle ground between bi-encoder speed and cross-encoder accuracy
- Pretrained Cross-Encoders — Model hub