Sentence Embeddings from Scratch with PyTorch
Build a complete sentence encoder from the ground up. Learn tokenization, embedding layers, pooling strategies, and benchmark on semantic similarity.
What We’re Building
Sentence embeddings are dense vector representations that capture the semantic meaning of text. Unlike word embeddings (like Word2Vec), sentence embeddings represent entire sentences or paragraphs as single vectors. This makes them powerful for:
- Semantic search: Find documents similar in meaning, not just keywords
- Clustering: Group similar texts together
- Classification: Use embeddings as features for downstream tasks
- Similarity comparison: Measure how related two pieces of text are
By the end of this tutorial, you’ll have a working sentence encoder in under 100 lines of PyTorch code, and you’ll understand exactly what’s happening at each step.
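To make "similarity comparison" concrete before we build anything: once two sentences are mapped to vectors, relatedness usually boils down to cosine similarity between those vectors. A tiny sketch with made-up values (the real embeddings come later in the tutorial):

import torch
import torch.nn.functional as F

# Two made-up 4-dimensional "sentence embeddings" (illustrative values only)
emb_a = torch.tensor([[0.2, 0.8, 0.1, 0.5]])
emb_b = torch.tensor([[0.3, 0.7, 0.0, 0.6]])

# Cosine similarity: near 1.0 means similar direction, near 0.0 means unrelated
print(F.cosine_similarity(emb_a, emb_b))  # tensor([0.9787])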
The Pipeline
Every sentence encoder follows the same fundamental pipeline:
Text → Tokenization → Token Embeddings → Pooling → Sentence Embedding
Let’s build each piece.
Step 1: Tokenization
Tokenization converts text into a sequence of token IDs that the model can process. For simplicity, we’ll use a character-level tokenizer, but the principles apply to any tokenization scheme (BPE, WordPiece, etc.).
class SimpleTokenizer:
    def __init__(self):
        # Create vocabulary from lowercase letters and the space character
        self.chars = list("abcdefghijklmnopqrstuvwxyz ")
        self.char_to_idx = {c: i + 1 for i, c in enumerate(self.chars)}
        self.char_to_idx['<PAD>'] = 0
        self.char_to_idx['<UNK>'] = len(self.chars) + 1
        self.vocab_size = len(self.char_to_idx)

    def encode(self, text, max_len=64):
        text = text.lower()
        tokens = [self.char_to_idx.get(c, self.char_to_idx['<UNK>'])
                  for c in text[:max_len]]
        # Pad to max_len with the <PAD> index (0)
        tokens += [0] * (max_len - len(tokens))
        return tokens

    def batch_encode(self, texts, max_len=64):
        return [self.encode(t, max_len) for t in texts]
Let’s test our tokenizer:
tokenizer = SimpleTokenizer()
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Encoded 'hello world': {tokenizer.encode('hello world', max_len=16)}")
Vocabulary size: 29
Encoded 'hello world': [8, 5, 12, 12, 15, 27, 23, 15, 18, 12, 4, 0, 0, 0, 0, 0]
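The rest of the tutorial sticks with this character-level tokenizer, but the interface stays the same for subword schemes: text in, padded IDs plus a mask out. As an optional aside, here is a minimal sketch using the Hugging Face transformers library (an extra dependency, not required for anything below):

from transformers import AutoTokenizer  # assumes `pip install transformers`

hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = hf_tokenizer(["hello world"], padding="max_length", max_length=16,
                     truncation=True, return_tensors="pt")
print(batch["input_ids"])       # WordPiece token IDs, padded to length 16
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding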
Step 2: The Embedding Layer
The embedding layer converts token IDs into dense vectors. Each token ID maps to a learnable vector of dimension embedding_dim.
import torch
import torch.nn as nn
class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.output_dim = hidden_dim * 2  # bidirectional doubles the size

    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq, embed_dim)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out: (batch, seq, hidden*2)
        return lstm_out
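At this stage the encoder returns one vector per token, not per sentence. A quick shape check, reusing the tokenizer from Step 1:

encoder = SentenceEncoder(tokenizer.vocab_size)
ids = torch.tensor(tokenizer.batch_encode(["hello world"], max_len=16))
token_vectors = encoder(ids)
print(token_vectors.shape)  # torch.Size([1, 16, 512]) -> (batch, seq_len, hidden_dim * 2)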
Step 3: Pooling Strategies
After encoding, we have a sequence of vectors (one per token). We need to collapse this into a single sentence vector. There are several strategies:
Mean Pooling
Average all token embeddings, ignoring padding tokens:
def mean_pooling(token_embeddings, attention_mask):
    """
    token_embeddings: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len) - 1 for real tokens, 0 for padding
    """
    # Expand mask for broadcasting
    mask = attention_mask.unsqueeze(-1).float()  # (batch, seq, 1)
    # Sum embeddings where mask is 1, then divide by the token count
    sum_embeddings = (token_embeddings * mask).sum(dim=1)
    sum_mask = mask.sum(dim=1).clamp(min=1e-9)  # Avoid division by zero
    return sum_embeddings / sum_mask
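A worked micro-example makes the masking concrete: with one real token and one padding token, the result is the real token's vector, not the average of both rows:

emb = torch.tensor([[[1.0, 2.0],
                     [5.0, 6.0]]])   # (batch=1, seq=2, dim=2)
mask = torch.tensor([[1, 0]])        # second position is padding
print(mean_pooling(emb, mask))       # tensor([[1., 2.]])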
CLS Token Pooling
Use the embedding of a special [CLS] token (common in BERT-style models). Our character-level tokenizer doesn't prepend one, so in this setup it simply means taking the first position's vector:
def cls_pooling(token_embeddings):
    """Take the first token's embedding as the sentence representation."""
    return token_embeddings[:, 0, :]  # (batch, hidden_dim)
Max Pooling
Take the maximum value across the sequence for each dimension:
def max_pooling(token_embeddings, attention_mask):
    """Take the max value for each dimension across the sequence."""
    mask = attention_mask.unsqueeze(-1).float()
    # Set padding positions to a very negative value so they never win the max
    token_embeddings = token_embeddings.masked_fill(mask == 0, -1e9)
    return token_embeddings.max(dim=1)[0]
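All three strategies map (batch, seq_len, hidden_dim) down to (batch, hidden_dim); a quick sanity check on random data:

torch.manual_seed(0)
token_emb = torch.randn(2, 4, 6)                    # (batch=2, seq=4, dim=6)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])   # different amounts of padding
print(mean_pooling(token_emb, mask).shape)  # torch.Size([2, 6])
print(cls_pooling(token_emb).shape)         # torch.Size([2, 6])
print(max_pooling(token_emb, mask).shape)   # torch.Size([2, 6])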
Step 4: Complete Encoder
Let’s put it all together with mean pooling as our default strategy:
import torch
import torch.nn as nn
import torch.nn.functional as F
class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256,
                 pooling='mean'):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                            bidirectional=True, num_layers=2, dropout=0.1)
        self.output_dim = hidden_dim * 2
        self.pooling = pooling

    def forward(self, token_ids, attention_mask=None):
        # Create attention mask from token_ids if not provided
        if attention_mask is None:
            attention_mask = (token_ids != 0).long()
        # Embed tokens
        embedded = self.embedding(token_ids)
        # Encode with LSTM
        lstm_out, _ = self.lstm(embedded)
        # Pool to sentence embedding
        if self.pooling == 'mean':
            sentence_emb = self._mean_pool(lstm_out, attention_mask)
        elif self.pooling == 'max':
            sentence_emb = self._max_pool(lstm_out, attention_mask)
        else:  # cls
            sentence_emb = lstm_out[:, 0, :]
        # L2 normalize for cosine similarity
        sentence_emb = F.normalize(sentence_emb, p=2, dim=1)
        return sentence_emb

    def _mean_pool(self, token_embeddings, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        sum_embeddings = (token_embeddings * mask).sum(dim=1)
        sum_mask = mask.sum(dim=1).clamp(min=1e-9)
        return sum_embeddings / sum_mask

    def _max_pool(self, token_embeddings, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        token_embeddings = token_embeddings.masked_fill(mask == 0, -1e9)
        return token_embeddings.max(dim=1)[0]
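The pooling strategy is just a constructor argument, so comparing strategies later only takes a different instantiation (reusing the tokenizer from Step 1 for the vocabulary size):

encoder_max = SentenceEncoder(tokenizer.vocab_size, pooling='max')
encoder_cls = SentenceEncoder(tokenizer.vocab_size, pooling='cls')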
Step 5: Testing on Semantic Similarity
Let’s see how our (untrained) encoder handles semantic similarity:
# Initialize
tokenizer = SimpleTokenizer()
encoder = SentenceEncoder(tokenizer.vocab_size)
encoder.eval()

# Test sentences
sentences = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "the dog ran in the park",
    "machine learning is fascinating"
]

# Encode
with torch.no_grad():
    token_ids = torch.tensor(tokenizer.batch_encode(sentences))
    embeddings = encoder(token_ids)

# Compute similarity matrix
similarity = torch.mm(embeddings, embeddings.T)
print("Similarity Matrix:")
print(similarity.numpy().round(3))
Similarity Matrix:
[[1.    0.234 0.156 0.089]
 [0.234 1.    0.198 0.112]
 [0.156 0.198 1.    0.145]
 [0.089 0.112 0.145 1.   ]]
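Because forward() L2-normalizes its output, the matrix product above is exactly the pairwise cosine similarity. A quick cross-check against PyTorch's built-in:

# Should print (approximately) the same value as similarity[0, 1]
print(F.cosine_similarity(embeddings[0:1], embeddings[1:2]))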
Step 6: Visualizing Embeddings
Let’s visualize our embeddings using dimensionality reduction:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# More test sentences
sentences = [
    # Cluster 1: Animals
    "the cat sleeps on the couch",
    "a dog plays in the yard",
    "my cat loves to nap",
    "the dog chases its tail",
    # Cluster 2: Tech
    "python is a programming language",
    "javascript runs in browsers",
    "coding is problem solving",
    "software engineers write code",
    # Cluster 3: Food
    "pizza is my favorite food",
    "i love eating pasta",
    "cooking dinner is relaxing",
    "the recipe needs more salt"
]
labels = ['animals'] * 4 + ['tech'] * 4 + ['food'] * 4

# Encode
with torch.no_grad():
    token_ids = torch.tensor(tokenizer.batch_encode(sentences))
    embeddings = encoder(token_ids).numpy()

# Reduce to 2D
tsne = TSNE(n_components=2, random_state=42, perplexity=4)
embeddings_2d = tsne.fit_transform(embeddings)

# Plot
colors = {'animals': 'blue', 'tech': 'green', 'food': 'red'}
plt.figure(figsize=(10, 8))
for i, (x, y) in enumerate(embeddings_2d):
    plt.scatter(x, y, c=colors[labels[i]], s=100)
    plt.annotate(sentences[i][:20] + '...', (x, y), fontsize=8)
plt.title('Sentence Embeddings (Untrained Model)')
plt.savefig('embeddings_viz.png', dpi=150, bbox_inches='tight')
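The same embeddings also give you a minimal semantic search loop: embed a query, score it against the corpus, and sort. With an untrained encoder the ranking mostly reflects character overlap, but the mechanics are identical after training (the query string below is just an example):

query = "my kitten naps all day"  # example query, not in the corpus above
with torch.no_grad():
    q_emb = encoder(torch.tensor(tokenizer.batch_encode([query])))

# Score the query against every corpus embedding and print the top 3 matches
scores = torch.mm(q_emb, torch.from_numpy(embeddings).T).squeeze(0)
for idx in scores.argsort(descending=True)[:3].tolist():
    print(f"{scores[idx].item():.3f}  {sentences[idx]}")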
The Complete Code
Here’s everything in one place (under 100 lines):
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleTokenizer:
    def __init__(self):
        self.chars = list("abcdefghijklmnopqrstuvwxyz ")
        self.char_to_idx = {c: i + 1 for i, c in enumerate(self.chars)}
        self.char_to_idx['<PAD>'] = 0
        self.char_to_idx['<UNK>'] = len(self.chars) + 1
        self.vocab_size = len(self.char_to_idx)

    def encode(self, text, max_len=64):
        text = text.lower()
        tokens = [self.char_to_idx.get(c, self.char_to_idx['<UNK>'])
                  for c in text[:max_len]]
        tokens += [0] * (max_len - len(tokens))
        return tokens

    def batch_encode(self, texts, max_len=64):
        return [self.encode(t, max_len) for t in texts]


class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                            bidirectional=True, num_layers=2, dropout=0.1)
        self.output_dim = hidden_dim * 2

    def forward(self, token_ids, attention_mask=None):
        if attention_mask is None:
            attention_mask = (token_ids != 0).long()
        embedded = self.embedding(token_ids)
        lstm_out, _ = self.lstm(embedded)
        # Mean pooling over non-padding positions
        mask = attention_mask.unsqueeze(-1).float()
        sum_emb = (lstm_out * mask).sum(dim=1)
        sum_mask = mask.sum(dim=1).clamp(min=1e-9)
        sentence_emb = sum_emb / sum_mask
        return F.normalize(sentence_emb, p=2, dim=1)


def cosine_similarity(a, b):
    """Compute cosine similarity between two batches of L2-normalized embeddings."""
    return torch.mm(a, b.T)


if __name__ == "__main__":
    # Initialize
    tokenizer = SimpleTokenizer()
    encoder = SentenceEncoder(tokenizer.vocab_size)
    encoder.eval()

    # Test
    sentences = [
        "the cat sat on the mat",
        "a cat was sitting on a mat",
        "the dog ran in the park"
    ]
    with torch.no_grad():
        ids = torch.tensor(tokenizer.batch_encode(sentences))
        embs = encoder(ids)
        sims = cosine_similarity(embs, embs)

    print("Sentences:")
    for i, s in enumerate(sentences):
        print(f" {i}: {s}")
    print(f"\nSimilarity matrix:\n{sims.numpy().round(3)}")
Sentences:
 0: the cat sat on the mat
 1: a cat was sitting on a mat
 2: the dog ran in the park

Similarity matrix:
[[1.    0.245 0.167]
 [0.245 1.    0.201]
 [0.167 0.201 1.   ]]
What’s Next
This encoder works, but it’s untrained. The embeddings are based on random weights. To make it useful for real semantic similarity:
- Fine-tune with contrastive learning: Train the model to push similar sentences together and dissimilar ones apart (a minimal sketch of such a loss appears after this list). See the next tutorial: Fine-Tuning a Bi-Encoder for Semantic Search.
- Use a pretrained backbone: Replace our simple LSTM with a pretrained transformer like BERT. The sentence-transformers library makes this easy.
- Add hard negatives: During training, use similar-but-different sentences as negative examples to help the model learn fine-grained distinctions.
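To give a flavor of the first item, here is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss, assuming anchor_emb and positive_emb are already L2-normalized (batch, dim) tensors of paired sentences. This is a simplified illustration, not necessarily the exact objective used in the next tutorial:

import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """In-batch negatives: sentence i's positive is row i; every other row is a negative."""
    logits = torch.mm(anchor_emb, positive_emb.T) / temperature  # (batch, batch) cosine scores
    labels = torch.arange(anchor_emb.size(0), device=anchor_emb.device)
    return F.cross_entropy(logits, labels)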
Further Reading
- Sentence-BERT paper - The foundational work on efficient sentence embeddings
- Understanding LSTMs - Chris Olah’s excellent visual guide
- The Illustrated Transformer - For when you’re ready to upgrade from LSTMs
- sentence-transformers library - Production-ready sentence embeddings
Fair winds on your embedding journey. The next tutorial covers training these encoders with contrastive learning.