Sentence Embeddings from Scratch with PyTorch

Calm Waters · 25 min read · December 20, 2025

Build a complete sentence encoder from the ground up. Learn tokenization, embedding layers, and pooling strategies, then test the encoder on semantic similarity.

What We’re Building

Sentence embeddings are dense vector representations that capture the semantic meaning of text. Unlike word embeddings (like Word2Vec), sentence embeddings represent entire sentences or paragraphs as single vectors. This makes them powerful for:

  • Semantic search: Find documents similar in meaning, not just keywords
  • Clustering: Group similar texts together
  • Classification: Use embeddings as features for downstream tasks
  • Similarity comparison: Measure how related two pieces of text are

By the end of this tutorial, you’ll have a working sentence encoder in under 100 lines of PyTorch code, and you’ll understand exactly what’s happening at each step.

The Pipeline

Every sentence encoder follows the same fundamental pipeline:

Text → Tokenization → Token Embeddings → Pooling → Sentence Embedding

Let’s build each piece.

Step 1: Tokenization

Tokenization converts text into a sequence of token IDs that the model can process. For simplicity, we’ll use a character-level tokenizer, but the principles apply to any tokenization scheme (BPE, WordPiece, etc.).

class SimpleTokenizer:
    def __init__(self):
        # Create vocabulary from lowercase letters plus the space character
        self.chars = list("abcdefghijklmnopqrstuvwxyz ")
        self.char_to_idx = {c: i + 1 for i, c in enumerate(self.chars)}
        self.char_to_idx['<PAD>'] = 0
        self.char_to_idx['<UNK>'] = len(self.chars) + 1
        self.vocab_size = len(self.char_to_idx)

    def encode(self, text, max_len=64):
        text = text.lower()
        tokens = [self.char_to_idx.get(c, self.char_to_idx['<UNK>'])
                  for c in text[:max_len]]
        # Pad to max_len
        tokens += [0] * (max_len - len(tokens))
        return tokens

    def batch_encode(self, texts, max_len=64):
        return [self.encode(t, max_len) for t in texts]

Let’s test our tokenizer:

tokenizer = SimpleTokenizer()
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Encoded 'hello world': {tokenizer.encode('hello world', max_len=16)}")
Output

Vocabulary size: 29
Encoded 'hello world': [8, 5, 12, 12, 15, 27, 23, 15, 18, 12, 4, 0, 0, 0, 0, 0]
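For comparison, a subword tokenizer splits the same text into far fewer, larger units. This optional snippet is purely illustrative and assumes the Hugging Face transformers package (and a download of the bert-base-uncased vocabulary) is available:

from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("hello world"))  # ['hello', 'world'] -- two tokens instead of eleven
print(bert_tok.encode("hello world"))    # subword IDs, wrapped in [CLS] ... [SEP]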

Step 2: The Embedding Layer

The embedding layer converts token IDs into dense vectors: each token ID maps to a learnable vector of dimension embedding_dim. We pair it with a bidirectional LSTM so that each token's vector also reflects its surrounding context.

import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                           bidirectional=True)
        self.output_dim = hidden_dim * 2  # bidirectional doubles the size

    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq, embed_dim)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out: (batch, seq, hidden*2)
        return lstm_out
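
As a quick sanity check (reusing the tokenizer from Step 1), the encoder returns one 512-dimensional vector per token: hidden_dim=256, doubled by the bidirectional LSTM.

encoder = SentenceEncoder(tokenizer.vocab_size)
ids = torch.tensor(tokenizer.batch_encode(["hello world"], max_len=16))
print(encoder(ids).shape)  # torch.Size([1, 16, 512]) -- (batch, seq_len, hidden_dim * 2)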

Step 3: Pooling Strategies

After encoding, we have a sequence of vectors (one per token). We need to collapse this into a single sentence vector. There are several strategies:

Mean Pooling

Average all token embeddings, ignoring padding tokens:

def mean_pooling(token_embeddings, attention_mask):
    """
    token_embeddings: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len) - 1 for real tokens, 0 for padding
    """
    # Expand mask for broadcasting
    mask = attention_mask.unsqueeze(-1).float()  # (batch, seq, 1)

    # Sum embeddings where mask is 1, then divide by count
    sum_embeddings = (token_embeddings * mask).sum(dim=1)
    sum_mask = mask.sum(dim=1).clamp(min=1e-9)  # Avoid division by zero

    return sum_embeddings / sum_mask
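
A tiny sanity check (using the torch import from above) confirms that padded positions are excluded from the average:

token_embeddings = torch.tensor([[[1.0, 2.0], [9.0, 9.0]]])  # (batch=1, seq=2, dim=2)
attention_mask = torch.tensor([[1, 0]])                      # second position is padding
print(mean_pooling(token_embeddings, attention_mask))        # tensor([[1., 2.]])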

CLS Token Pooling

Use the embedding of a special [CLS] token (common in BERT-style models). Our character tokenizer doesn't prepend one, so in this setup it simply returns the first token's output:

def cls_pooling(token_embeddings):
    """Take the first token's embedding as the sentence representation."""
    return token_embeddings[:, 0, :]  # (batch, hidden_dim)

Max Pooling

Take the maximum value across the sequence for each dimension:

def max_pooling(token_embeddings, attention_mask):
    """Take the max value for each dimension across the sequence."""
    mask = attention_mask.unsqueeze(-1).float()
    # Set padding positions to very negative so they don't affect max
    token_embeddings = token_embeddings.masked_fill(mask == 0, -1e9)
    return token_embeddings.max(dim=1)[0]
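
The same kind of check shows that a padded position can never win the max:

token_embeddings = torch.tensor([[[1.0, -2.0], [5.0, 7.0]]])  # (batch=1, seq=2, dim=2)
attention_mask = torch.tensor([[1, 0]])                       # second position is padding
print(max_pooling(token_embeddings, attention_mask))          # tensor([[ 1., -2.]])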

Step 4: Complete Encoder

Let’s put it all together with mean pooling as our default strategy:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256,
                 pooling='mean'):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                           bidirectional=True, num_layers=2, dropout=0.1)
        self.output_dim = hidden_dim * 2
        self.pooling = pooling

    def forward(self, token_ids, attention_mask=None):
        # Create attention mask from token_ids if not provided
        if attention_mask is None:
            attention_mask = (token_ids != 0).long()

        # Embed tokens
        embedded = self.embedding(token_ids)

        # Encode with LSTM
        lstm_out, _ = self.lstm(embedded)

        # Pool to sentence embedding
        if self.pooling == 'mean':
            sentence_emb = self._mean_pool(lstm_out, attention_mask)
        elif self.pooling == 'max':
            sentence_emb = self._max_pool(lstm_out, attention_mask)
        else:  # cls
            sentence_emb = lstm_out[:, 0, :]

        # L2 normalize for cosine similarity
        sentence_emb = F.normalize(sentence_emb, p=2, dim=1)
        return sentence_emb

    def _mean_pool(self, token_embeddings, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        sum_embeddings = (token_embeddings * mask).sum(dim=1)
        sum_mask = mask.sum(dim=1).clamp(min=1e-9)
        return sum_embeddings / sum_mask

    def _max_pool(self, token_embeddings, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        token_embeddings = token_embeddings.masked_fill(mask == 0, -1e9)
        return token_embeddings.max(dim=1)[0]
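
The pooling strategy is chosen at construction time; whichever one you pick, the result is a single 512-dimensional vector per sentence:

ids = torch.tensor(tokenizer.batch_encode(["the cat sat on the mat"], max_len=16))
for strategy in ("mean", "max", "cls"):
    enc = SentenceEncoder(tokenizer.vocab_size, pooling=strategy).eval()
    with torch.no_grad():
        print(strategy, enc(ids).shape)  # each prints torch.Size([1, 512])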

Step 5: Testing on Semantic Similarity

Let’s see how our (untrained) encoder handles semantic similarity:

# Initialize
tokenizer = SimpleTokenizer()
encoder = SentenceEncoder(tokenizer.vocab_size)
encoder.eval()

# Test sentences
sentences = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "the dog ran in the park",
    "machine learning is fascinating"
]

# Encode
with torch.no_grad():
    token_ids = torch.tensor(tokenizer.batch_encode(sentences))
    embeddings = encoder(token_ids)

# Compute similarity matrix
similarity = torch.mm(embeddings, embeddings.T)
print("Similarity Matrix:")
print(similarity.numpy().round(3))
Output (untrained model, random weights)

Similarity Matrix:
[[1.    0.234 0.156 0.089]
 [0.234 1.    0.198 0.112]
 [0.156 0.198 1.    0.145]
 [0.089 0.112 0.145 1.   ]]

The diagonal is exactly 1 because the embeddings are L2-normalized, so these are cosine similarities; the off-diagonal values are essentially arbitrary since the weights are random.

Step 6: Visualizing Embeddings

Let’s visualize our embeddings using dimensionality reduction:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# More test sentences
sentences = [
    # Cluster 1: Animals
    "the cat sleeps on the couch",
    "a dog plays in the yard",
    "my cat loves to nap",
    "the dog chases its tail",
    # Cluster 2: Tech
    "python is a programming language",
    "javascript runs in browsers",
    "coding is problem solving",
    "software engineers write code",
    # Cluster 3: Food
    "pizza is my favorite food",
    "i love eating pasta",
    "cooking dinner is relaxing",
    "the recipe needs more salt"
]

labels = ['animals'] * 4 + ['tech'] * 4 + ['food'] * 4

# Encode
with torch.no_grad():
    token_ids = torch.tensor(tokenizer.batch_encode(sentences))
    embeddings = encoder(token_ids).numpy()

# Reduce to 2D
tsne = TSNE(n_components=2, random_state=42, perplexity=4)
embeddings_2d = tsne.fit_transform(embeddings)

# Plot
colors = {'animals': 'blue', 'tech': 'green', 'food': 'red'}
plt.figure(figsize=(10, 8))
for i, (x, y) in enumerate(embeddings_2d):
    plt.scatter(x, y, c=colors[labels[i]], s=100)
    plt.annotate(sentences[i][:20] + '...', (x, y), fontsize=8)
plt.title('Sentence Embeddings (Untrained Model)')
plt.savefig('embeddings_viz.png', dpi=150, bbox_inches='tight')

The Complete Code

Here’s everything in one place (under 100 lines):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTokenizer:
    def __init__(self):
        self.chars = list("abcdefghijklmnopqrstuvwxyz ")
        self.char_to_idx = {c: i + 1 for i, c in enumerate(self.chars)}
        self.char_to_idx['<PAD>'] = 0
        self.char_to_idx['<UNK>'] = len(self.chars) + 1
        self.vocab_size = len(self.char_to_idx)

    def encode(self, text, max_len=64):
        text = text.lower()
        tokens = [self.char_to_idx.get(c, self.char_to_idx['<UNK>'])
                  for c in text[:max_len]]
        tokens += [0] * (max_len - len(tokens))
        return tokens

    def batch_encode(self, texts, max_len=64):
        return [self.encode(t, max_len) for t in texts]


class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True,
                           bidirectional=True, num_layers=2, dropout=0.1)
        self.output_dim = hidden_dim * 2

    def forward(self, token_ids, attention_mask=None):
        if attention_mask is None:
            attention_mask = (token_ids != 0).long()

        embedded = self.embedding(token_ids)
        lstm_out, _ = self.lstm(embedded)

        # Mean pooling
        mask = attention_mask.unsqueeze(-1).float()
        sum_emb = (lstm_out * mask).sum(dim=1)
        sum_mask = mask.sum(dim=1).clamp(min=1e-9)
        sentence_emb = sum_emb / sum_mask

        return F.normalize(sentence_emb, p=2, dim=1)


def cosine_similarity(a, b):
    """Compute cosine similarity between two embedding tensors."""
    return torch.mm(a, b.T)


if __name__ == "__main__":
    # Initialize
    tokenizer = SimpleTokenizer()
    encoder = SentenceEncoder(tokenizer.vocab_size)
    encoder.eval()

    # Test
    sentences = [
        "the cat sat on the mat",
        "a cat was sitting on a mat",
        "the dog ran in the park"
    ]

    with torch.no_grad():
        ids = torch.tensor(tokenizer.batch_encode(sentences))
        embs = encoder(ids)
        sims = cosine_similarity(embs, embs)

    print("Sentences:")
    for i, s in enumerate(sentences):
        print(f"  {i}: {s}")
    print(f"\nSimilarity matrix:\n{sims.numpy().round(3)}")
Output

Sentences:
  0: the cat sat on the mat
  1: a cat was sitting on a mat
  2: the dog ran in the park

Similarity matrix:
[[1.    0.245 0.167]
 [0.245 1.    0.201]
 [0.167 0.201 1.   ]]

What’s Next

This encoder works, but it’s untrained. The embeddings are based on random weights. To make it useful for real semantic similarity:

  1. Fine-tune with contrastive learning: Train the model to push similar sentences together and dissimilar ones apart; a minimal sketch of such a training step follows this list. See the next tutorial: Fine-Tuning a Bi-Encoder for Semantic Search.

  2. Use a pretrained backbone: Replace our simple LSTM with a pretrained transformer like BERT. The sentence-transformers library makes this easy.

  3. Add hard negatives: During training, use similar-but-different sentences as negative examples to help the model learn fine-grained distinctions.
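
Here's a minimal sketch of what that contrastive training step could look like, reusing the tokenizer and encoder built above. The contrastive_loss function, the temperature value, and the paraphrase pairs are illustrative, not a prescribed recipe:

import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, temperature=0.05):
    """In-batch contrastive loss (InfoNCE-style).

    emb_a[i] and emb_b[i] form a positive pair; every other row in the
    batch acts as a negative for row i.
    """
    scores = emb_a @ emb_b.T / temperature   # cosine similarities (inputs are L2-normalized)
    labels = torch.arange(scores.size(0))    # the positive pair sits on the diagonal
    return F.cross_entropy(scores, labels)

# One illustrative training step with two paraphrase pairs
encoder.train()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
pairs = [
    ("the cat sat on the mat", "a cat was sitting on a mat"),
    ("the dog ran in the park", "a dog was running at the park"),
]
ids_a = torch.tensor(tokenizer.batch_encode([a for a, _ in pairs]))
ids_b = torch.tensor(tokenizer.batch_encode([b for _, b in pairs]))

loss = contrastive_loss(encoder(ids_a), encoder(ids_b))
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")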

Fair winds on your embedding journey. The next tutorial covers training these encoders with contrastive learning.
