Cross-Encoders: Precision Reranking for Search

Open Seas · 25 min read · December 21, 2025

When bi-encoders aren't accurate enough, cross-encoders dramatically improve search relevance. Build a two-stage retrieval system with MS MARCO rerankers and sentence-transformers.

Your bi-encoder retrieved the top 100 documents. The right answer is in there—probably. But it’s ranked #47, buried below dozens of marginally relevant results.

Bi-encoders are fast because they encode queries and documents independently. But that independence is also their weakness: they can’t see how query words interact with document words. A bi-encoder encodes “python” the same way whether the query is about snakes or programming.

Cross-encoders solve this by processing query and document together, allowing full attention between every query token and every document token. The result: dramatically better relevance judgments.

The tradeoff? Speed. Cross-encoders must run inference for every query-document pair. That’s why we use them for reranking, not initial retrieval.
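
To see why, compare the rough cost of each approach at corpus scale. The sketch below is a back-of-envelope model, not a benchmark; the throughput figure is an assumption roughly in line with the GPU numbers later in this post.

# Rough cost model: why cross-encoders rerank instead of retrieve.
# Bi-encoder: each document is embedded once at index time; a query needs
# one forward pass plus an ANN lookup.
# Cross-encoder: every (query, document) pair needs its own forward pass.
n_docs = 1_000_000
pairs_per_sec = 1_000  # assumed cross-encoder throughput on a GPU

seconds_per_query = n_docs / pairs_per_sec
print(f"Cross-encoding the full corpus per query: ~{seconds_per_query / 60:.0f} minutes")
# ~17 minutes per query, versus milliseconds for bi-encoder + ANN retrieval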

Bi-Encoder vs Cross-Encoder: The Key Difference

Bi-Encoder Architecture

Query: "python tutorial"      →  Encoder  →  [0.23, -0.41, ...]

                                            Cosine Similarity = 0.72

Document: "Learn Python..."   →  Encoder  →  [0.19, -0.38, ...]

Query and document are encoded separately. They only interact at the final similarity computation.
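
Here is that path as a minimal sketch, using the same bi-encoder model that appears later in this post. With normalized embeddings, the dot product is the cosine similarity.

from sentence_transformers import SentenceTransformer

# Bi-encoder path: query and document are encoded independently and only
# meet at the final dot product.
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')

q = bi_encoder.encode("python tutorial", normalize_embeddings=True)
d = bi_encoder.encode("Learn Python programming from scratch", normalize_embeddings=True)
print(f"Cosine similarity: {float(q @ d):.2f}")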

Cross-Encoder Architecture

      ┌─────────────────────────────────────┐
      │  [CLS] python tutorial [SEP] Learn  │
      │  Python programming from scratch... │
      └─────────────────────────────────────┘

                   Transformer
                   (Full Attention)

                  Relevance Score: 0.94

Query and document are concatenated and processed together. Every query token attends to every document token through the transformer’s self-attention layers.
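
You can inspect this input format directly by running the model's tokenizer on a (query, document) pair. A minimal sketch, using the BERT-style special tokens these MS MARCO MiniLM models rely on:

from transformers import AutoTokenizer

# The cross-encoder packs query and document into one sequence, so
# self-attention spans both: [CLS] query [SEP] document [SEP]
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')

encoded = tokenizer("python tutorial", "Learn Python programming from scratch")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# ['[CLS]', 'python', 'tutorial', '[SEP]', 'learn', 'python', ...]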

Why Cross-Encoders Are More Accurate

Cross-encoders can capture:

  1. Token-level interactions: “python” in the query can attend to “programming” vs “snake” in the document
  2. Negation handling: “not recommended” actually means the opposite of “recommended”
  3. Complex reasoning: “best laptop under $1000” requires understanding price constraints
  4. Query-document overlap: Exact matches can be weighted appropriately

Cross-encoders typically outperform bi-encoders on retrieval benchmarks, with gains varying by dataset and query complexity. The tradeoff is computational cost: cross-encoders must process each query-document pair individually.
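
Item 1 is easy to probe yourself: score the same two documents against two queries that disambiguate "python" differently. A minimal sketch (scores depend on the exact model weights, so run it and compare the orderings rather than the absolute values):

from sentence_transformers import CrossEncoder

# Word-sense disambiguation probe: the same documents, two queries.
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

docs = [
    "Python is a popular programming language",
    "The python is a large non-venomous snake",
]
for query in ["python coding basics", "python habitat and diet"]:
    scores = model.predict([[query, doc] for doc in docs])
    print(f"{query!r} -> best match: {docs[scores.argmax()]!r}")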

Building a Cross-Encoder Reranker

Let’s build a reranker that takes bi-encoder candidates and returns precisely ranked results.

Step 1: Load a Pre-trained Cross-Encoder

from sentence_transformers import CrossEncoder

# Load a cross-encoder trained for relevance scoring
# ms-marco models are trained on real search queries
# Model card: https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

print(f"Model loaded: {model.model.__class__.__name__}")
Output

Model loaded: BertForSequenceClassification

Step 2: Score Query-Document Pairs

# Single pair scoring
query = "how to learn python programming"
documents = [
    "Python is a popular programming language for beginners",
    "The python snake is found in tropical regions",
    "Complete Python tutorial for web development",
    "Java vs Python: which should you learn first?",
]

# Score each pair
pairs = [[query, doc] for doc in documents]
scores = model.predict(pairs)

# Display results
print(f"Query: '{query}'\n")
for doc, score in sorted(zip(documents, scores), key=lambda x: x[1], reverse=True):
    print(f"[{score:.4f}] {doc}")
Output

Query: 'how to learn python programming'

[8.4521] Complete Python tutorial for web development
[7.2134] Python is a popular programming language for beginners
[5.8967] Java vs Python: which should you learn first?
[-3.2145] The python snake is found in tropical regions

Notice how the cross-encoder:

  • Correctly ranks the tutorial highest
  • Gives a strongly negative score to the snake document
  • Understands that comparing languages is somewhat relevant

Step 3: Build the Reranking Pipeline

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
import faiss

class TwoStageSearch:
    def __init__(
        self,
        bi_encoder_model='all-MiniLM-L6-v2',
        cross_encoder_model='cross-encoder/ms-marco-MiniLM-L-6-v2'
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_model)
        self.cross_encoder = CrossEncoder(cross_encoder_model)
        self.index = None
        self.documents = None

    def index_documents(self, documents):
        """Index documents with bi-encoder for fast retrieval."""
        self.documents = documents

        embeddings = self.bi_encoder.encode(
            documents,
            convert_to_numpy=True,
            normalize_embeddings=True,
            show_progress_bar=True
        ).astype('float32')

        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(embeddings)

        return len(documents)

    def search(self, query, top_k=10, rerank_top_n=100):
        """
        Two-stage search:
        1. Bi-encoder retrieves top_n candidates
        2. Cross-encoder reranks to get top_k results
        """
        # Stage 1: Fast bi-encoder retrieval
        query_embedding = self.bi_encoder.encode(
            query,
            convert_to_numpy=True,
            normalize_embeddings=True
        ).reshape(1, -1).astype('float32')

        # Get more candidates than we need for reranking
        n_candidates = min(rerank_top_n, len(self.documents))
        bi_scores, indices = self.index.search(query_embedding, n_candidates)

        candidates = [self.documents[idx] for idx in indices[0]]

        # Stage 2: Cross-encoder reranking
        pairs = [[query, doc] for doc in candidates]
        cross_scores = self.cross_encoder.predict(pairs)

        # Combine indices with scores and sort
        results = list(zip(indices[0], candidates, bi_scores[0], cross_scores))
        results.sort(key=lambda x: x[3], reverse=True)  # Sort by cross-encoder score

        # Return top_k with both scores
        return [
            {
                'document': doc,
                'bi_encoder_score': float(bi_score),
                'cross_encoder_score': float(cross_score),
                'original_index': int(idx)
            }
            for idx, doc, bi_score, cross_score in results[:top_k]
        ]


# Example usage
documents = [
    "How to change a flat tire on the highway",
    "Python programming tutorial for beginners",
    "The best chocolate chip cookie recipe",
    "Understanding machine learning algorithms",
    "Guide to fixing bicycle punctures",
    "Introduction to natural language processing",
    "Car maintenance tips and tricks",
    "Learn Python for data science",
    "Baking sourdough bread at home",
    "Deep learning with PyTorch tutorial",
]

engine = TwoStageSearch()
engine.index_documents(documents)

results = engine.search("python machine learning tutorial", top_k=5)

print("Query: 'python machine learning tutorial'\n")
print(f"{'Document':<50} {'Bi-Enc':>8} {'Cross-Enc':>10}")
print("-" * 70)
for r in results:
    doc = r['document'][:47] + "..." if len(r['document']) > 50 else r['document']
    print(f"{doc:<50} {r['bi_encoder_score']:>8.3f} {r['cross_encoder_score']:>10.3f}")
Output

Query: 'python machine learning tutorial'

Document                                             Bi-Enc  Cross-Enc
----------------------------------------------------------------------
Deep learning with PyTorch tutorial                   0.612      8.234
Learn Python for data science                         0.589      7.891
Python programming tutorial for beginners             0.634      6.543
Understanding machine learning algorithms             0.521      5.234
Introduction to natural language processing           0.456      3.876

Look at the reranking in action:

  • “Deep learning with PyTorch tutorial” wasn’t the bi-encoder’s top pick (0.612 vs 0.634 for the beginners tutorial), but rose to rank 1 after reranking
  • The cross-encoder understood that PyTorch + deep learning is more relevant to a “python machine learning tutorial” query

Batch Processing for Speed

Cross-encoders are slow compared to bi-encoders. Batch processing helps:

def rerank_batch(query, candidates, cross_encoder, batch_size=32):
    """
    Efficient batch reranking.
    """
    pairs = [[query, doc] for doc in candidates]

    # Process in batches
    all_scores = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        scores = cross_encoder.predict(batch, show_progress_bar=False)
        all_scores.extend(scores)

    return all_scores

# Timing comparison
import time

query = "machine learning best practices"
candidates = documents * 10  # 100 candidates

start = time.time()
scores = rerank_batch(query, candidates, engine.cross_encoder, batch_size=32)
elapsed = time.time() - start

print(f"Reranked {len(candidates)} documents in {elapsed:.3f}s")
print(f"Throughput: {len(candidates) / elapsed:.1f} docs/sec")
Output

Reranked 100 documents in 0.342s
Throughput: 292.4 docs/sec

Note: The output above includes Python overhead and first-run model warmup. Sustained batch inference is significantly faster—see benchmarks below.

Speed Benchmarks

On an NVIDIA T4 GPU with ms-marco-MiniLM-L-6-v2 (warm model, batch inference only):

Candidates    Latency    Throughput
10            ~12ms      ~830/sec
100           ~95ms      ~1050/sec
1000          ~890ms     ~1120/sec

Rule of thumb: Rerank 50-100 candidates for a good accuracy/latency tradeoff.
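
That rule of thumb falls out of a simple latency budget. A back-of-envelope helper, using the throughput figures from the table above as assumptions:

# How many candidates fit in a latency budget? Assumes a fixed retrieval
# overhead plus the sustained rerank throughput benchmarked above.
def max_candidates(budget_ms, retrieval_ms=10, docs_per_sec=1050):
    rerank_budget_s = max(budget_ms - retrieval_ms, 0) / 1000
    return int(rerank_budget_s * docs_per_sec)

print(max_candidates(100))  # ~94 candidates fit in a 100ms budget
print(max_candidates(250))  # ~252 candidates in a 250ms budget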

Training a Custom Cross-Encoder

For domain-specific search, fine-tuning dramatically improves results.

Training Data Format

Cross-encoders need query-document pairs with relevance labels:

from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# Training examples: (query, document, label)
# Labels can be binary (0/1) or continuous (0.0-1.0)
train_examples = [
    InputExample(texts=["python tutorial", "Learn Python programming"], label=1.0),
    InputExample(texts=["python tutorial", "Snake species guide"], label=0.0),
    InputExample(texts=["fix flat tire", "How to change a tire"], label=1.0),
    InputExample(texts=["fix flat tire", "Cake recipe"], label=0.0),
    InputExample(texts=["machine learning", "Deep learning basics"], label=0.8),  # Partial relevance
    # ... more examples
]

# Create dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

Fine-tuning

from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator

# Initialize from pre-trained model
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', num_labels=1)

# Prepare evaluation data
eval_examples = [
    InputExample(texts=["test query", "relevant doc"], label=1),
    InputExample(texts=["test query", "irrelevant doc"], label=0),
]
evaluator = CEBinaryClassificationEvaluator.from_input_examples(eval_examples)

# Fine-tune
model.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    epochs=3,
    warmup_steps=100,
    output_path='./fine-tuned-cross-encoder',
    evaluation_steps=500,
)
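
Once training finishes, the fine-tuned model reloads from output_path like any other cross-encoder. A minimal sketch, using the path from the fit() call above:

from sentence_transformers import CrossEncoder

# Load the fine-tuned checkpoint saved by model.fit()
tuned = CrossEncoder('./fine-tuned-cross-encoder')
print(tuned.predict([["python tutorial", "Learn Python programming"]]))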

When to Use Cross-Encoders

Good Use Cases

Scenario             Why Cross-Encoder Helps
Complex queries      “laptop under $1000 with good battery” requires constraint understanding
High-stakes search   Legal, medical: accuracy matters more than latency
Negative reasoning   “not python” should exclude Python results
Exact match matters  Product codes, legal citations

When to Skip

Scenario          Better Alternative
Autocomplete      Bi-encoder only (must be < 50ms)
Billions of docs  Multi-stage: bi-encoder → ANN → cross-encoder
Real-time feeds   Pre-computed bi-encoder scores

Combining Scores: Weighted Fusion

Sometimes you want to blend bi-encoder and cross-encoder scores:

import math

def hybrid_score(bi_score, cross_score, alpha=0.3):
    """
    Weighted combination of bi-encoder and cross-encoder scores.

    alpha: weight for bi-encoder (0 = cross-encoder only, 1 = bi-encoder only)
    """
    # Cross-encoder outputs are logits (unbounded, typically in the -10 to +10
    # range); sigmoid maps them to [0, 1] probability-like scores for fair
    # weighting. Bi-encoder scores are cosine similarities in [-1, 1], though
    # in practice usually positive for sentence embeddings.
    cross_normalized = 1 / (1 + math.exp(-cross_score))

    return alpha * bi_score + (1 - alpha) * cross_normalized


def search_with_fusion(engine, query, top_k=5, rerank_top_n=100, alpha=0.2):
    """Search with score fusion."""
    results = engine.search(query, top_k=rerank_top_n, rerank_top_n=rerank_top_n)

    for r in results:
        r['hybrid_score'] = hybrid_score(
            r['bi_encoder_score'],
            r['cross_encoder_score'],
            alpha
        )

    # Re-sort by hybrid score
    results.sort(key=lambda x: x['hybrid_score'], reverse=True)

    return results[:top_k]
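
To get a feel for the blend, trace the arithmetic on a couple of hypothetical score pairs. A large cross-encoder logit saturates the sigmoid, so the bi-encoder term mostly breaks ties between confident candidates:

# Hypothetical scores, default alpha = 0.3
print(hybrid_score(bi_score=0.60, cross_score=8.0))   # 0.3*0.60 + 0.7*sigmoid(8.0)  ≈ 0.880
print(hybrid_score(bi_score=0.60, cross_score=-3.0))  # 0.3*0.60 + 0.7*sigmoid(-3.0) ≈ 0.213

# End-to-end, reusing the engine indexed earlier
results = search_with_fusion(engine, "python machine learning tutorial", top_k=5)
for r in results:
    print(f"[{r['hybrid_score']:.3f}] {r['document']}")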

Complete Reranking Pipeline

Here’s a production-ready implementation:

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
import faiss
from typing import List, Dict
import time

class ProductionReranker:
    """
    Production-ready two-stage search with monitoring.
    """

    def __init__(
        self,
        bi_encoder: str = 'all-MiniLM-L6-v2',
        cross_encoder: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2',
        rerank_top_n: int = 100
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder)
        self.cross_encoder = CrossEncoder(cross_encoder)
        self.rerank_top_n = rerank_top_n
        self.index = None
        self.documents = None

    def index_documents(self, documents: List[str]) -> int:
        self.documents = documents
        embeddings = self.bi_encoder.encode(
            documents,
            convert_to_numpy=True,
            normalize_embeddings=True,
            show_progress_bar=True
        ).astype('float32')

        self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(embeddings)
        return len(documents)

    def search(self, query: str, top_k: int = 10) -> Dict:
        """
        Search with timing and diagnostics.
        """
        timings = {}

        # Stage 1: Bi-encoder retrieval
        start = time.time()
        query_emb = self.bi_encoder.encode(
            query,
            convert_to_numpy=True,
            normalize_embeddings=True
        ).reshape(1, -1).astype('float32')

        n_candidates = min(self.rerank_top_n, len(self.documents))
        bi_scores, indices = self.index.search(query_emb, n_candidates)
        timings['retrieval_ms'] = (time.time() - start) * 1000

        candidates = [self.documents[idx] for idx in indices[0]]

        # Stage 2: Cross-encoder reranking
        start = time.time()
        pairs = [[query, doc] for doc in candidates]
        cross_scores = self.cross_encoder.predict(pairs)
        timings['rerank_ms'] = (time.time() - start) * 1000

        # Sort by cross-encoder score
        combined = list(zip(indices[0], candidates, bi_scores[0], cross_scores))
        combined.sort(key=lambda x: x[3], reverse=True)

        results = [
            {
                'document': doc,
                'bi_score': float(bi),
                'cross_score': float(cross),
                'index': int(idx)
            }
            for idx, doc, bi, cross in combined[:top_k]
        ]

        timings['total_ms'] = timings['retrieval_ms'] + timings['rerank_ms']

        return {
            'query': query,
            'results': results,
            'timings': timings,
            'candidates_reranked': n_candidates
        }


# Demo
reranker = ProductionReranker()
reranker.index_documents(documents)

response = reranker.search("learn python for ML")

print(f"Query: '{response['query']}'")
print(f"Timings: retrieval={response['timings']['retrieval_ms']:.1f}ms, "
      f"rerank={response['timings']['rerank_ms']:.1f}ms")
print(f"\nTop Results:")
for r in response['results'][:3]:
    print(f"  [{r['cross_score']:.2f}] {r['document']}")
Output

Query: 'learn python for ML'
Timings: retrieval=2.3ms, rerank=45.2ms

Top Results:
  [7.89] Learn Python for data science
  [7.23] Deep learning with PyTorch tutorial
  [5.67] Python programming tutorial for beginners

What’s Next

You now have both pieces of production search:

  • Bi-encoders for fast retrieval over millions of documents
  • Cross-encoders for precise reranking of top candidates

In the next tutorial, we’ll combine these into a complete Two-Stage Retrieval system with:

  • Optimal candidate selection
  • Latency budgeting
  • Fallback strategies
  • Production deployment patterns

Key Takeaways

  1. Cross-encoders process query + document together—enabling rich token-level interactions
  2. They’re consistently more accurate than bi-encoders on complex queries, with gains varying by dataset and query type
  3. Use them for reranking, not retrieval—latency scales with candidate count
  4. Rerank 50-100 candidates for good accuracy/speed tradeoff
  5. Fine-tune on your domain for best results
  6. Monitor both retrieval and rerank latency in production
