Bi-Encoders: Fast Semantic Search at Scale

Open Seas · 30 min read · December 21, 2025

Learn how bi-encoders enable millisecond-scale semantic search over millions of documents. Build a complete search system with sentence-transformers, FAISS indexing, and production-ready Python code.

You have a million documents. A user types a query. How do you find the most relevant results in milliseconds?

Keyword search fails when users phrase things differently than your documents. “How do I fix a flat tire?” should match “Changing a punctured wheel” even though they share no words. You need semantic search—finding documents by meaning, not just keywords.

Bi-encoders make this possible at scale. They encode queries and documents into the same vector space, where similar meanings land close together. Search becomes a nearest-neighbor lookup, which modern vector databases handle in milliseconds over billions of vectors.

By the end of this tutorial, you’ll understand why bi-encoders dominate production search systems and have a working implementation you can scale.

What Makes Bi-Encoders Special

A bi-encoder uses the same neural network to encode both queries and documents independently. This independence is the key to their speed.

Query: "flat tire repair"     →  Encoder  →  [0.23, -0.41, 0.87, ...]

                                         Cosine Similarity

Document: "How to change..."  →  Encoder  →  [0.19, -0.38, 0.91, ...]

Because documents are encoded independently, you can:

  1. Pre-compute document embeddings once, store them in a vector database
  2. Encode the query at search time (one forward pass, ~10ms)
  3. Find nearest neighbors in the vector space (sub-millisecond with FAISS)

Compare this to evaluating every query-document pair through a neural network—that’s what cross-encoders do, and why they can’t scale to millions of documents directly.
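
To make that asymmetry concrete, here is a rough back-of-the-envelope sketch; the per-pass timing is an illustrative assumption, not a measurement:

# Illustrative cost comparison (timings are assumptions, not measurements)
n_docs = 1_000_000
forward_pass_ms = 10  # assume ~10 ms per transformer forward pass

# Bi-encoder: documents encoded offline; one query encoding + one ANN lookup online
bi_encoder_ms = forward_pass_ms + 1          # ~1 ms for the ANN search
# Cross-encoder: every (query, document) pair needs its own forward pass
cross_encoder_ms = n_docs * forward_pass_ms

print(f"Bi-encoder per query:    ~{bi_encoder_ms} ms")
print(f"Cross-encoder per query: ~{cross_encoder_ms / 60_000:.0f} minutes")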

Building a Semantic Search System

Let’s build a complete semantic search system over a corpus of documents. We’ll use the sentence-transformers library, which provides state-of-the-art bi-encoder models.

Step 1: Load a Pre-trained Model

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a bi-encoder model trained for semantic search
# all-MiniLM-L6-v2 is fast and effective (384 dimensions)
# Model card: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model = SentenceTransformer('all-MiniLM-L6-v2')

print(f"Model loaded: {model}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
Output

Model loaded: SentenceTransformer(…)
Embedding dimension: 384
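
Before encoding a whole corpus, it can be worth a quick sanity check that the model places related sentences closer together; a minimal sketch using the library's util.cos_sim helper:

from sentence_transformers import util

# Quick sanity check: semantically similar sentences should score higher
embeddings = model.encode([
    "How do I fix a flat tire?",
    "Changing a punctured wheel",
    "Recipe for chocolate cake",
], convert_to_numpy=True)

print(util.cos_sim(embeddings[0], embeddings[1]))  # similar pair: higher score
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated pair: lower score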

Step 2: Encode Your Corpus

# Sample document corpus
documents = [
    "How to change a flat tire on the highway",
    "Recipe for homemade chocolate chip cookies",
    "Understanding neural network backpropagation",
    "Best practices for Python code reviews",
    "Guide to replacing bicycle inner tubes",
    "Machine learning model deployment strategies",
    "Troubleshooting car engine problems",
    "Introduction to natural language processing",
    "How to bake sourdough bread from scratch",
    "Deep learning optimization techniques",
]

# Encode all documents (do this once, store the embeddings)
document_embeddings = model.encode(
    documents,
    convert_to_numpy=True,
    show_progress_bar=True,
    normalize_embeddings=True  # For cosine similarity via dot product
)

print(f"Encoded {len(documents)} documents")
print(f"Embedding shape: {document_embeddings.shape}")
Output

Encoded 10 documents
Embedding shape: (10, 384)
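
Since the corpus only needs to be encoded once, it's worth persisting the matrix; a minimal sketch with NumPy, using a hypothetical embeddings.npy path:

import numpy as np

# Persist the document embeddings so re-indexing doesn't repeat the encoding work
np.save("embeddings.npy", document_embeddings)

# Later (e.g., at service startup), load them back instead of re-encoding
document_embeddings = np.load("embeddings.npy")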

Step 3: Search with a Query

def semantic_search(query, document_embeddings, documents, top_k=3):
    """
    Find the most similar documents to a query.
    """
    # Encode the query
    query_embedding = model.encode(
        query,
        convert_to_numpy=True,
        normalize_embeddings=True
    )

    # Compute similarities (dot product = cosine sim for normalized vectors)
    similarities = np.dot(document_embeddings, query_embedding)

    # Get top-k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]

    results = []
    for idx in top_indices:
        results.append({
            'document': documents[idx],
            'score': float(similarities[idx]),
            'index': int(idx)
        })

    return results

# Test it
query = "fixing a punctured wheel"
results = semantic_search(query, document_embeddings, documents)

print(f"Query: '{query}'\n")
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['score']:.3f}] {r['document']}")
Output

Query: 'fixing a punctured wheel'

  1. [0.672] How to change a flat tire on the highway
  2. [0.543] Guide to replacing bicycle inner tubes
  3. [0.401] Troubleshooting car engine problems

The query “fixing a punctured wheel” matches documents about tires and tubes—even though none of those exact words appear in the query. That’s semantic search.

Step 4: Scale with FAISS

NumPy dot products work fine for thousands of documents. For millions, you need approximate nearest neighbor (ANN) search. FAISS is the industry standard.

import faiss

def build_faiss_index(embeddings):
    """
    Build a FAISS index for fast similarity search.

    For normalized embeddings, IndexFlatIP (inner product)
    gives us cosine similarity.
    """
    dimension = embeddings.shape[1]

    # Exact search index (use IndexIVFFlat for millions of docs)
    index = faiss.IndexFlatIP(dimension)

    # Add embeddings to index
    index.add(embeddings.astype('float32'))

    return index

def faiss_search(query, index, documents, top_k=3):
    """Search using FAISS index."""
    query_embedding = model.encode(
        query,
        convert_to_numpy=True,
        normalize_embeddings=True
    ).reshape(1, -1).astype('float32')

    # Search returns distances and indices
    scores, indices = index.search(query_embedding, top_k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        results.append({
            'document': documents[idx],
            'score': float(score),
            'index': int(idx)
        })

    return results

# Build index and search
index = build_faiss_index(document_embeddings)
results = faiss_search("machine learning best practices", index, documents)

print("Query: 'machine learning best practices'\n")
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['score']:.3f}] {r['document']}")
Output

Query: 'machine learning best practices'

  1. [0.584] Machine learning model deployment strategies
  2. [0.523] Deep learning optimization techniques
  3. [0.489] Best practices for Python code reviews
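
The flat index above performs exact search. For the millions-of-documents case the comment mentions, an IVF index is one option; a sketch under the assumption that your corpus has far more vectors than nlist, since FAISS needs at least nlist training points (nlist and nprobe are illustrative values to tune):

import faiss

def build_ivf_index(embeddings, nlist=100):
    """Approximate index: vectors are partitioned into nlist clusters."""
    dimension = embeddings.shape[1]
    quantizer = faiss.IndexFlatIP(dimension)  # assigns vectors to clusters
    index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)

    # Requires at least nlist training vectors, so use a real corpus, not the toy one
    index.train(embeddings.astype('float32'))  # learn cluster centroids
    index.add(embeddings.astype('float32'))
    index.nprobe = 10  # clusters probed per query (recall vs. speed trade-off)
    return index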

Encoding at Scale

When encoding millions of documents, batch processing and GPU utilization matter:

import torch

def encode_corpus_batched(documents, model, batch_size=64):
    """
    Encode a large corpus efficiently, using a GPU when available.
    """
    all_embeddings = model.encode(
        documents,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True,
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )
    return all_embeddings

# For very large corpora, process in chunks to manage memory
def encode_corpus_chunked(documents, model, chunk_size=10000, batch_size=64):
    """
    Encode corpus in chunks for memory efficiency.
    """
    embeddings_list = []

    for i in range(0, len(documents), chunk_size):
        chunk = documents[i:i + chunk_size]
        chunk_embeddings = model.encode(
            chunk,
            batch_size=batch_size,
            show_progress_bar=True,
            convert_to_numpy=True,
            normalize_embeddings=True
        )
        embeddings_list.append(chunk_embeddings)

        # Report progress (persist chunk_embeddings here if you need checkpoints)
        print(f"Encoded {min(i + chunk_size, len(documents))}/{len(documents)}")

    return np.vstack(embeddings_list)
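
As written, the chunks are still concatenated in RAM at the end. If even that is too large, one option is to write each chunk to disk and add it to the index incrementally; a hedged sketch, with hypothetical chunk_<i>.npy filenames:

import numpy as np

def encode_corpus_to_disk(documents, model, out_dir=".", chunk_size=10000, batch_size=64):
    """Encode in chunks and write each chunk to its own .npy file."""
    paths = []
    for i in range(0, len(documents), chunk_size):
        chunk = documents[i:i + chunk_size]
        emb = model.encode(chunk, batch_size=batch_size,
                           convert_to_numpy=True, normalize_embeddings=True)
        path = f"{out_dir}/chunk_{i // chunk_size}.npy"
        np.save(path, emb.astype('float32'))
        paths.append(path)
    return paths

# Later: stream the chunks into a FAISS index without holding everything in RAM
# for path in paths:
#     index.add(np.load(path))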

Encoding Speed Benchmarks

On an NVIDIA T4 GPU with all-MiniLM-L6-v2 (tested with batch encoding, no special optimizations):

Batch Size    Documents/Second    Time for 1M docs
32            ~2,500              ~7 minutes
64            ~3,800              ~4.5 minutes
128           ~4,200              ~4 minutes

Measured on single GPU with sentence-transformers defaults. Your results may vary based on document length and hardware.

CPU encoding is significantly slower—expect 5-15x longer depending on your processor. For one-time batch indexing, this is acceptable. For real-time encoding at scale, GPU acceleration is recommended.
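
Throughput depends heavily on document length and hardware, so measure on your own data before picking a batch size; a quick timing sketch that reuses the toy corpus from earlier:

import time

# Measure throughput on a representative sample before committing to a batch size
sample = documents * 100  # hypothetical: tile the toy corpus to ~1,000 texts

start = time.perf_counter()
model.encode(sample, batch_size=64, convert_to_numpy=True, normalize_embeddings=True)
elapsed = time.perf_counter() - start

print(f"~{len(sample) / elapsed:.0f} documents/second")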

Training Your Own Bi-Encoder

Pre-trained models work well for general text. For domain-specific search (legal, medical, code), fine-tuning dramatically improves quality.

Contrastive Learning with Triplets

Bi-encoders are typically trained with contrastive loss. Given a query, the model learns to score relevant documents higher than irrelevant ones. The choice of negatives matters: random negatives are easy to tell apart, so mining hard negatives (documents that look similar to the query but aren't relevant) usually improves quality more than adding extra easy examples.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Training data: (query, positive_doc, negative_doc) triplets
train_examples = [
    InputExample(texts=["flat tire help", "How to change a flat tire", "Chocolate cake recipe"]),
    InputExample(texts=["python tips", "Best practices for Python", "Car repair guide"]),
    InputExample(texts=["neural networks", "Understanding backpropagation", "Baking sourdough"]),
    # ... more triplets
]

# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Use triplet loss
train_loss = losses.TripletLoss(model=model)

# Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./fine-tuned-bi-encoder'
)
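
Triplets require explicit negatives. If you only have (query, relevant document) pairs, in-batch negatives via MultipleNegativesRankingLoss are a common alternative; a minimal sketch, with made-up example pairs:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# (query, positive_doc) pairs; the other positives in each batch act as negatives
pair_examples = [
    InputExample(texts=["flat tire help", "How to change a flat tire"]),
    InputExample(texts=["python tips", "Best practices for Python"]),
    # ... more pairs
]

model = SentenceTransformer('all-MiniLM-L6-v2')
pair_dataloader = DataLoader(pair_examples, shuffle=True, batch_size=16)
pair_loss = losses.MultipleNegativesRankingLoss(model=model)

model.fit(
    train_objectives=[(pair_dataloader, pair_loss)],
    epochs=1,
    warmup_steps=100,
)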

Evaluation: Measuring Search Quality

How do you know if your bi-encoder is any good? Standard metrics:

def evaluate_retrieval(queries, relevant_docs, model, index, documents, k=10):
    """
    Evaluate retrieval quality with standard metrics.

    queries: list of query strings
    relevant_docs: list of sets, relevant doc indices for each query
    """
    total_recall = 0
    total_mrr = 0

    for query, relevant in zip(queries, relevant_docs):
        results = faiss_search(query, index, documents, top_k=k)
        retrieved_indices = set(r['index'] for r in results)

        # Recall@K: fraction of relevant docs retrieved
        recall = len(retrieved_indices & relevant) / len(relevant)
        total_recall += recall

        # MRR: reciprocal rank of first relevant result
        for rank, r in enumerate(results, 1):
            if r['index'] in relevant:
                total_mrr += 1 / rank
                break

    n = len(queries)
    return {
        'recall@k': total_recall / n,
        'mrr': total_mrr / n
    }

# Example evaluation
test_queries = ["tire repair", "python coding"]
test_relevant = [{0, 4, 6}, {3, 5}]  # Relevant doc indices for each query

metrics = evaluate_retrieval(
    test_queries, test_relevant,
    model, index, documents, k=5
)
print(f"Recall@5: {metrics['recall@k']:.3f}")
print(f"MRR: {metrics['mrr']:.3f}")
Output

Recall@5: 0.700
MRR: 0.750

Key Metrics Explained

Metric      What It Measures                          Good Score
Recall@K    % of relevant docs in top-K               > 0.8
MRR         How high is the first relevant result?    > 0.7
NDCG@K      Ranking quality (position matters)        > 0.6
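
The evaluation function above covers Recall@K and MRR but not NDCG@K; a minimal sketch of binary-relevance NDCG, assuming the same retrieved-index format (the helper name is ours, not part of any library):

import numpy as np

def ndcg_at_k(retrieved_indices, relevant, k=10):
    """Binary-relevance NDCG@K: gain is 1 if a retrieved doc is relevant, else 0."""
    gains = [1.0 if idx in relevant else 0.0 for idx in retrieved_indices[:k]]
    dcg = sum(g / np.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    ideal_gains = [1.0] * min(len(relevant), k)
    idcg = sum(g / np.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: first and third retrieved docs are relevant
print(ndcg_at_k([0, 7, 4], {0, 4, 6}, k=3))  # ≈ 0.70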

The Complete Pipeline

Here’s everything together—a production-ready semantic search system:

from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import pickle

class SemanticSearchEngine:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.index = None
        self.documents = None

    def index_documents(self, documents):
        """Encode and index a corpus."""
        self.documents = documents

        # Encode all documents
        embeddings = self.model.encode(
            documents,
            convert_to_numpy=True,
            normalize_embeddings=True,
            show_progress_bar=True
        ).astype('float32')

        # Build FAISS index
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(embeddings)

        return len(documents)

    def search(self, query, top_k=5):
        """Search for similar documents."""
        query_embedding = self.model.encode(
            query,
            convert_to_numpy=True,
            normalize_embeddings=True
        ).reshape(1, -1).astype('float32')

        scores, indices = self.index.search(query_embedding, top_k)

        return [
            {'document': self.documents[idx], 'score': float(score)}
            for score, idx in zip(scores[0], indices[0])
        ]

    def save(self, path):
        """Save index and documents."""
        faiss.write_index(self.index, f"{path}/index.faiss")
        with open(f"{path}/documents.pkl", 'wb') as f:
            pickle.dump(self.documents, f)

    def load(self, path):
        """Load index and documents."""
        self.index = faiss.read_index(f"{path}/index.faiss")
        with open(f"{path}/documents.pkl", 'rb') as f:
            self.documents = pickle.load(f)


# Usage
engine = SemanticSearchEngine()
engine.index_documents(documents)

results = engine.search("how do I fix my bike?")
for r in results:
    print(f"[{r['score']:.3f}] {r['document']}")
Output

[0.612] Guide to replacing bicycle inner tubes
[0.487] How to change a flat tire on the highway
[0.356] Troubleshooting car engine problems
[0.289] Best practices for Python code reviews
[0.201] Machine learning model deployment strategies
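
The save and load methods aren't exercised above; a short usage sketch, assuming a local search_index/ directory:

import os

# Persist the index and documents, then restore them in a fresh engine
os.makedirs("search_index", exist_ok=True)
engine.save("search_index")

restored = SemanticSearchEngine()
restored.load("search_index")
print(restored.search("how do I fix my bike?", top_k=1))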

What’s Next

Bi-encoders give you speed. But they sacrifice accuracy—sometimes the best match isn’t in the top results.

The solution: two-stage retrieval. Use bi-encoders to quickly find the top 100-1000 candidates, then use a more powerful cross-encoder to precisely rerank them.
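
As a hedged preview of that pipeline, here is roughly what it looks like with a pre-trained reranker (the model name is just one commonly used checkpoint; building and tuning the cross-encoder is the subject of the next tutorial):

from sentence_transformers import CrossEncoder

query = "how do I fix my bike?"

# Stage 1: cheap bi-encoder retrieval over the whole corpus
candidates = engine.search(query, top_k=10)  # top 100-1000 in a real corpus

# Stage 2: expensive cross-encoder reranking over just those candidates
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([[query, c['document']] for c in candidates])

reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for c, score in reranked[:3]:
    print(f"[{score:.3f}] {c['document']}")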

In the next tutorial, we’ll build a cross-encoder and see how combining both approaches gives you the best of both worlds: speed AND accuracy.


Key Takeaways

  1. Bi-encoders encode queries and documents independently—enabling pre-computation and millisecond search
  2. Use normalized embeddings with dot product for cosine similarity
  3. FAISS scales to millions of documents with approximate nearest neighbor search
  4. Hard negatives matter for training quality bi-encoders
  5. Measure with Recall@K and MRR to understand retrieval quality
  6. Combine with cross-encoders for production systems (covered next)
