Cross-Attention Fusion: Combining Text Embeddings with Structured Features

Why This Matters to Me

I’ve been building NFL DFS prediction models as a personal hobby for the past month. The structured data is excellent—snap counts, target shares, Vegas lines, weather. You can build solid models on numbers alone.

But reviewing my model’s results from the past few weeks revealed the gaps. Week 18 was brutal: teams with nothing to play for rested starters, while teams fighting for playoff positioning went all out. The numbers couldn’t capture “this game matters to them.” Wild Card weekend was similar—the narratives, coaching schemes, and surrounding stories all influenced outcomes in ways my features missed.

When a beat reporter tweets “He looked explosive in practice today,” there’s no number for “explosive.” When a coach says “We’ll see how he feels Sunday,” the hesitation matters more than the official injury designation. When someone writes “Interesting that the QB wasn’t available for comment”—that implication can’t be reduced to a column in a spreadsheet.

Here’s what makes it tricky: the same words mean different things depending on context. “Limited in practice” for a star RB in Week 1 is probably load management. “Limited in practice” for that same player in the playoffs? That’s concerning. The text embedding alone captures “limited in practice.” But you need the structured features—player importance, game stakes, time of season—to know what it means.

This architecture helped me improve my predictions. But it’s not just for sports—the same pattern applies to security compliance mapping (where control descriptions need organizational context), clinical decision support (where notes need patient history), and anywhere text meaning depends on structured context.

You’ve embedded your text. You’ve got structured features—categories, scores, metadata. Now you need to combine them for prediction.

The default approach in most tutorials: take your 768-dimensional text embedding, take your 10 numerical features, concatenate them into a 778-dimensional vector, and pass it through a feedforward network. This is simple and works reasonably well. But it treats text and features as independent signals that only interact in the final layers—the model learns “text says X” and “features say Y” separately, then combines them at the end.

When your structured features encode domain knowledge that should guide how the model interprets text, concatenation leaves performance on the table.

This tutorial implements cross-attention fusion—letting structured features attend to relevant parts of the text. When the model sees a “high-risk” category feature, it learns to attend to words like “critical,” “breach,” and “failure.” When it sees “routine,” it attends to “standard,” “regular,” and “maintenance.” The features don’t just add to the text—they interpret it.

The Problem: When Text + Features ≠ Understanding

Consider a customer support system classifying ticket urgency:

Ticket A: “Server is completely down, all production systems affected, customers cannot access their accounts.”

Ticket B: “Server maintenance scheduled for next weekend, users will experience brief downtime.”

Text embedding similarity: 0.82 (both mention servers, downtime, and customer impact).

But the structured features tell a different story:

Feature	Ticket A	Ticket B
Category	Incident	Change Request
Affected Systems	12	0
Customer Tier	Enterprise	Free
Time Submitted	2:00 AM Sunday	10:00 AM Tuesday

Any support engineer knows Ticket A is critical and Ticket B is routine. The text similarity is misleading—the structured features provide crucial context the embeddings miss.

Where This Pattern Applies

Text + structured data appears across domains:

Domain	Text	Structured Features	Task
Finance	SEC filings, earnings calls	P/E ratio, market cap, sector	Risk assessment
Healthcare	Clinical notes	Lab values, vitals, demographics	Diagnosis support
E-commerce	Product descriptions	Price, category, ratings	Recommendation
Security	Threat reports	Severity scores, CVSS, asset value	Incident prioritization
Legal	Contract clauses	Party type, jurisdiction, value	Risk scoring
Support	Ticket descriptions	Priority, SLA, customer tier	Routing and escalation

The pattern is the same: rich text semantics + structured domain knowledge → better predictions.

The Solution: Cross-Attention Fusion

We’ll build a model that combines:

Text embeddings: Rich semantic understanding from language models
Structured features: Domain knowledge encoded as categories and scores
Cross-attention: Features query the text to find relevant evidence

┌─────────────────────────┐     ┌─────────────────────────┐
│     Item Text           │     │   Structured Features   │
│  "Server is down..."    │     │  [Incident, Enterprise, │
│                         │     │   12 systems, 2AM, ...] │
└───────────┬─────────────┘     └───────────┬─────────────┘
            │                               │
            ▼                               ▼
     ┌─────────────┐               ┌─────────────────┐
     │Text Encoder │               │ Feature Encoder │
     │ (768 dim)   │               │ (768 dim)       │
     └──────┬──────┘               └────────┬────────┘
            │                               │
            │    ┌──────────────────┐       │
            └───►│ Cross-Attention  │◄──────┘
                 │ Features → Text  │
                 └────────┬─────────┘
                          │
                          ▼
                 ┌─────────────────┐
                 │ Classification  │
                 │   Head          │
                 └────────┬────────┘
                          │
                          ▼
                     Prediction

Cross-attention for multimodal fusion isn’t new—our implementation draws from established patterns:

Perceiver (Jaegle et al., 2021): Uses cross-attention to let learned queries attend to any input modality
Multimodal Bottleneck Transformer (Nagrani et al., 2021): Fuses audio and video via cross-attention bottleneck
TabTransformer (Huang et al., 2020): Applies attention to tabular data specifically
Tabular-Text Transformers (TTT): Various architectures combining text with structured data

Our contribution is a practical, minimal implementation focused on the text+features case—no bottleneck tokens, no complex architectures, just direct cross-attention fusion that you can adapt to your domain.

Step 1: Data Structures

Let’s define a general structure for items with text and features. I’ll use a prioritization example, but you can adapt this to any domain:

from dataclasses import dataclass
from typing import List, Optional
from enum import Enum
import torch

class Category(Enum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3

class ItemType(Enum):
    INCIDENT = 0
    REQUEST = 1
    CHANGE = 2
    QUESTION = 3
    FEEDBACK = 4

class SourceChannel(Enum):
    EMAIL = 0
    CHAT = 1
    PHONE = 2
    PORTAL = 3

class CustomerTier(Enum):
    ENTERPRISE = 0
    PROFESSIONAL = 1
    STARTER = 2
    FREE = 3

@dataclass
class StructuredFeatures:
    """Structured features for an item with text."""
    category: Category
    item_type: ItemType
    source_channel: SourceChannel
    customer_tier: CustomerTier

    # Numeric features
    affected_count: float = 0.0  # Normalized 0-1
    time_sensitivity: float = 0.5  # 0-1, how time-sensitive

    def to_tensor(self) -> torch.Tensor:
        """Convert features to tensor via one-hot encoding."""
        category_onehot = torch.zeros(4)
        category_onehot[self.category.value] = 1.0

        type_onehot = torch.zeros(5)
        type_onehot[self.item_type.value] = 1.0

        channel_onehot = torch.zeros(4)
        channel_onehot[self.source_channel.value] = 1.0

        tier_onehot = torch.zeros(4)
        tier_onehot[self.customer_tier.value] = 1.0

        return torch.cat([
            category_onehot,      # 4
            type_onehot,          # 5
            channel_onehot,       # 4
            tier_onehot,          # 4
            torch.tensor([self.affected_count, self.time_sensitivity])  # 2
        ])  # Total: 19 dimensions

@dataclass
class TextItem:
    """An item with text and structured features."""
    item_id: str
    text: str
    features: StructuredFeatures
    domain: Optional[str] = None

@dataclass
class ItemPair:
    """A pair of items for similarity/matching tasks."""
    source: TextItem
    target: TextItem
    label: int  # 1 = match/similar, 0 = no match

Step 2: The Feature Encoder

The feature encoder projects low-dimensional structured features into the same embedding space as the text:

import torch.nn as nn

class FeatureEncoder(nn.Module):
    """
    Encodes structured features into embedding space.

    Projects low-dimensional categorical/numeric features into
    high-dimensional space compatible with text embeddings.
    """

    def __init__(
        self,
        input_dim: int = 19,      # Feature vector size
        hidden_dim: int = 256,
        output_dim: int = 768,    # Match text encoder dimension
        dropout: float = 0.1
    ):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, output_dim),
            nn.LayerNorm(output_dim)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        """
        Args:
            features: [batch_size, input_dim] structured features

        Returns:
            [batch_size, 1, output_dim] encoded features (1 for sequence dim)
        """
        encoded = self.encoder(features)
        # Add sequence dimension for attention compatibility
        return encoded.unsqueeze(1)

Step 3: Cross-Attention Fusion

The core innovation—features attend to text to find relevant evidence:

class CrossAttentionFusion(nn.Module):
    """
    Fuses structured features with text embeddings via cross-attention.

    The feature embedding queries the text embeddings, learning to
    attend to text tokens relevant to each structured feature.
    """

    def __init__(
        self,
        embed_dim: int = 768,
        num_heads: int = 8,
        ff_dim: int = 2048,
        dropout: float = 0.1
    ):
        super().__init__()

        # Cross-attention: features (query) attend to text (key/value)
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            dropout=dropout,
            batch_first=True
        )

        # Layer norms for residual connections
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(
        self,
        feature_embedding: torch.Tensor,  # [batch, 1, embed_dim]
        text_embeddings: torch.Tensor,    # [batch, seq_len, embed_dim]
        return_attention: bool = False
    ):
        """
        Fuse features with text via cross-attention.

        Returns:
            fused: [batch, 1, embed_dim] - fused representation
            attention_weights: [batch, 1, seq_len] - which text tokens mattered
        """
        # Cross-attention with residual connection
        attended, attention_weights = self.cross_attention(
            query=feature_embedding,
            key=text_embeddings,
            value=text_embeddings,
            need_weights=True
        )
        feature_embedding = self.norm1(feature_embedding + attended)

        # Feed-forward with residual
        fused = self.norm2(feature_embedding + self.ffn(feature_embedding))

        if return_attention:
            return fused, attention_weights
        return fused

Step 4: The Complete Model

Now we assemble the full cross-attention fusion model:

from transformers import AutoModel, AutoTokenizer

class CrossAttentionFusionModel(nn.Module):
    """
    Multimodal model combining text embeddings with structured features
    via cross-attention for classification or similarity tasks.
    """

    def __init__(
        self,
        text_model_name: str = "sentence-transformers/all-mpnet-base-v2",
        feature_dim: int = 19,
        hidden_dim: int = 768,
        num_attention_heads: int = 8,
        dropout: float = 0.1,
        freeze_text_encoder: bool = True
    ):
        super().__init__()

        # Text encoder (pre-trained transformer)
        self.tokenizer = AutoTokenizer.from_pretrained(text_model_name)
        self.text_encoder = AutoModel.from_pretrained(text_model_name)

        if freeze_text_encoder:
            for param in self.text_encoder.parameters():
                param.requires_grad = False

        # Feature encoder
        self.feature_encoder = FeatureEncoder(
            input_dim=feature_dim,
            hidden_dim=256,
            output_dim=hidden_dim,
            dropout=dropout
        )

        # Cross-attention fusion (one for each item in a pair)
        self.fusion_source = CrossAttentionFusion(
            embed_dim=hidden_dim,
            num_heads=num_attention_heads,
            dropout=dropout
        )
        self.fusion_target = CrossAttentionFusion(
            embed_dim=hidden_dim,
            num_heads=num_attention_heads,
            dropout=dropout
        )

        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 4, hidden_dim),  # 4 = concat, diff, product
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1)
        )

    def encode_text(self, texts: List[str]) -> torch.Tensor:
        """Encode texts to sequence of token embeddings."""
        inputs = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        ).to(next(self.parameters()).device)

        with torch.no_grad() if not self.text_encoder.training else torch.enable_grad():
            outputs = self.text_encoder(**inputs)

        # Return token-level embeddings (not just [CLS])
        return outputs.last_hidden_state  # [batch, seq_len, hidden]

    def forward(
        self,
        source_texts: List[str],
        source_features: torch.Tensor,  # [batch, feature_dim]
        target_texts: List[str],
        target_features: torch.Tensor,  # [batch, feature_dim]
        return_attention: bool = False
    ):
        """
        Predict similarity/match between source and target items.

        Returns:
            logits: [batch, 1] raw prediction scores
            attention_info: dict with attention weights (if requested)
        """
        # Encode texts
        source_text_emb = self.encode_text(source_texts)  # [batch, seq, hidden]
        target_text_emb = self.encode_text(target_texts)

        # Encode features
        source_feat_emb = self.feature_encoder(source_features)  # [batch, 1, hidden]
        target_feat_emb = self.feature_encoder(target_features)

        # Cross-attention fusion
        if return_attention:
            source_fused, source_attn = self.fusion_source(
                source_feat_emb, source_text_emb, return_attention=True
            )
            target_fused, target_attn = self.fusion_target(
                target_feat_emb, target_text_emb, return_attention=True
            )
        else:
            source_fused = self.fusion_source(source_feat_emb, source_text_emb)
            target_fused = self.fusion_target(target_feat_emb, target_text_emb)

        # Squeeze sequence dimension
        source_fused = source_fused.squeeze(1)  # [batch, hidden]
        target_fused = target_fused.squeeze(1)

        # Combine representations for classification
        # Multiple interaction types help capture different relationships
        combined = torch.cat([
            source_fused,
            target_fused,
            source_fused - target_fused,  # Difference
            source_fused * target_fused,  # Element-wise product
        ], dim=-1)

        logits = self.classifier(combined)

        if return_attention:
            return logits, {
                'source_attention': source_attn,
                'target_attention': target_attn
            }
        return logits

Step 5: Training Loop

A training approach with binary classification loss:

import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from typing import Dict, Tuple
import numpy as np

class ItemPairDataset(Dataset):
    """Dataset of item pairs with labels."""

    def __init__(self, pairs: List[ItemPair]):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx) -> Dict:
        pair = self.pairs[idx]
        return {
            'source_text': pair.source.text,
            'source_features': pair.source.features.to_tensor(),
            'target_text': pair.target.text,
            'target_features': pair.target.features.to_tensor(),
            'label': torch.tensor(pair.label, dtype=torch.float32)
        }

def collate_fn(batch: List[Dict]) -> Dict:
    """Custom collate for variable-length texts."""
    return {
        'source_texts': [b['source_text'] for b in batch],
        'source_features': torch.stack([b['source_features'] for b in batch]),
        'target_texts': [b['target_text'] for b in batch],
        'target_features': torch.stack([b['target_features'] for b in batch]),
        'labels': torch.stack([b['label'] for b in batch])
    }

class FusionTrainer:
    """Trainer for cross-attention fusion model."""

    def __init__(
        self,
        model: CrossAttentionFusionModel,
        learning_rate: float = 1e-4,
        weight_decay: float = 0.01,
        pos_weight: float = 2.0  # Weight positive examples higher (usually fewer)
    ):
        self.model = model
        self.optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay
        )
        self.scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            self.optimizer, mode='max', factor=0.5, patience=3
        )
        self.pos_weight = torch.tensor([pos_weight])

    def train_epoch(self, dataloader: DataLoader) -> Dict[str, float]:
        """Train for one epoch."""
        self.model.train()
        total_loss = 0
        correct = 0
        total = 0

        for batch in dataloader:
            self.optimizer.zero_grad()

            # Move to device
            device = next(self.model.parameters()).device
            source_features = batch['source_features'].to(device)
            target_features = batch['target_features'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            logits = self.model(
                batch['source_texts'],
                source_features,
                batch['target_texts'],
                target_features
            )

            # Binary cross-entropy with logits
            loss = F.binary_cross_entropy_with_logits(
                logits.squeeze(-1),
                labels,
                pos_weight=self.pos_weight.to(device)
            )

            # Backward pass
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            self.optimizer.step()

            # Metrics
            total_loss += loss.item()
            predictions = (torch.sigmoid(logits.squeeze(-1)) > 0.5).float()
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

        return {
            'loss': total_loss / len(dataloader),
            'accuracy': correct / total
        }

    def evaluate(self, dataloader: DataLoader) -> Dict[str, float]:
        """Evaluate on validation/test set."""
        self.model.eval()
        all_labels = []
        all_predictions = []
        all_scores = []

        with torch.no_grad():
            for batch in dataloader:
                device = next(self.model.parameters()).device
                source_features = batch['source_features'].to(device)
                target_features = batch['target_features'].to(device)
                labels = batch['labels'].to(device)

                logits = self.model(
                    batch['source_texts'],
                    source_features,
                    batch['target_texts'],
                    target_features
                )

                scores = torch.sigmoid(logits.squeeze(-1))
                predictions = (scores > 0.5).float()

                all_labels.extend(labels.cpu().numpy())
                all_predictions.extend(predictions.cpu().numpy())
                all_scores.extend(scores.cpu().numpy())

        # Calculate metrics
        labels = np.array(all_labels)
        predictions = np.array(all_predictions)

        tp = ((predictions == 1) & (labels == 1)).sum()
        fp = ((predictions == 1) & (labels == 0)).sum()
        fn = ((predictions == 0) & (labels == 1)).sum()

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        return {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'accuracy': (predictions == labels).mean()
        }

Step 6: Attention Visualization

The cross-attention weights show why the model made its prediction—which text tokens the features attended to:

import matplotlib.pyplot as plt

def visualize_attention(
    model: CrossAttentionFusionModel,
    source_item: TextItem,
    target_item: TextItem,
    figsize: Tuple[int, int] = (14, 5)
):
    """
    Visualize which text tokens the structured features attend to.
    """
    model.eval()
    device = next(model.parameters()).device

    with torch.no_grad():
        logits, attention_info = model(
            [source_item.text],
            source_item.features.to_tensor().unsqueeze(0).to(device),
            [target_item.text],
            target_item.features.to_tensor().unsqueeze(0).to(device),
            return_attention=True
        )

    prediction = torch.sigmoid(logits).item()

    # Get tokens
    source_tokens = model.tokenizer.tokenize(source_item.text)[:50]
    target_tokens = model.tokenizer.tokenize(target_item.text)[:50]

    # Get attention weights
    source_attn = attention_info['source_attention'][0, 0, :len(source_tokens)].cpu().numpy()
    target_attn = attention_info['target_attention'][0, 0, :len(target_tokens)].cpu().numpy()

    # Normalize for visualization
    source_attn = source_attn / source_attn.max() if source_attn.max() > 0 else source_attn
    target_attn = target_attn / target_attn.max() if target_attn.max() > 0 else target_attn

    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Source attention
    axes[0].barh(range(len(source_tokens)), source_attn)
    axes[0].set_yticks(range(len(source_tokens)))
    axes[0].set_yticklabels(source_tokens, fontsize=8)
    axes[0].invert_yaxis()
    axes[0].set_xlabel('Attention Weight')
    axes[0].set_title(f'Source: {source_item.item_id}')

    # Target attention
    axes[1].barh(range(len(target_tokens)), target_attn)
    axes[1].set_yticks(range(len(target_tokens)))
    axes[1].set_yticklabels(target_tokens, fontsize=8)
    axes[1].invert_yaxis()
    axes[1].set_xlabel('Attention Weight')
    axes[1].set_title(f'Target: {target_item.item_id}')

    plt.suptitle(
        f'Prediction: {"MATCH" if prediction > 0.5 else "NO MATCH"} '
        f'(score: {prediction:.3f})',
        fontsize=12,
        fontweight='bold'
    )
    plt.tight_layout()
    return fig

def get_top_attended_tokens(
    model: CrossAttentionFusionModel,
    item: TextItem,
    top_k: int = 10
) -> List[Tuple[str, float]]:
    """Get the tokens with highest attention weights."""
    model.eval()
    device = next(model.parameters()).device

    with torch.no_grad():
        text_emb = model.encode_text([item.text])
        feat_emb = model.feature_encoder(
            item.features.to_tensor().unsqueeze(0).to(device)
        )
        _, attn_weights = model.fusion_source(
            feat_emb, text_emb, return_attention=True
        )

    tokens = model.tokenizer.tokenize(item.text)
    weights = attn_weights[0, 0, :len(tokens)].cpu().numpy()

    # Sort by attention weight
    token_weights = list(zip(tokens, weights))
    token_weights.sort(key=lambda x: x[1], reverse=True)

    return token_weights[:top_k]

Example Attention Output

Top attended tokens for CRITICAL incident:

“down” - 0.234
“production” - 0.198
“cannot” - 0.156
“affected” - 0.142
“all” - 0.089

Top attended tokens for ROUTINE change request:

“scheduled” - 0.267
“maintenance” - 0.223
“brief” - 0.178
“weekend” - 0.134
“planned” - 0.098

This interpretability is valuable for debugging and building trust—stakeholders can see why the model made its prediction.

Step 7: Putting It All Together

Here’s a complete working example:

# Create sample items
source_item = TextItem(
    item_id="TICKET-001",
    text="""Server is completely down, all production systems affected.
    Customers cannot access their accounts. Database connections failing.
    This is a critical outage requiring immediate attention.""",
    features=StructuredFeatures(
        category=Category.CRITICAL,
        item_type=ItemType.INCIDENT,
        source_channel=SourceChannel.PHONE,
        customer_tier=CustomerTier.ENTERPRISE,
        affected_count=0.9,  # Many affected
        time_sensitivity=1.0  # Urgent
    )
)

target_similar = TextItem(
    item_id="TICKET-002",
    text="""Major service disruption impacting production environment.
    Multiple customers reporting inability to login. All services down.
    Escalating to engineering team immediately.""",
    features=StructuredFeatures(
        category=Category.CRITICAL,
        item_type=ItemType.INCIDENT,
        source_channel=SourceChannel.EMAIL,
        customer_tier=CustomerTier.ENTERPRISE,
        affected_count=0.8,
        time_sensitivity=0.9
    )
)

target_different = TextItem(
    item_id="TICKET-003",
    text="""Server maintenance scheduled for next weekend.
    Users will experience brief downtime during the maintenance window.
    This is a planned upgrade to improve system performance.""",
    features=StructuredFeatures(
        category=Category.LOW,  # Different!
        item_type=ItemType.CHANGE,  # Different!
        source_channel=SourceChannel.PORTAL,
        customer_tier=CustomerTier.PROFESSIONAL,
        affected_count=0.1,  # Low impact
        time_sensitivity=0.2  # Not urgent
    )
)

# Initialize model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CrossAttentionFusionModel(
    text_model_name="sentence-transformers/all-mpnet-base-v2",
    feature_dim=19,
    freeze_text_encoder=True
).to(device)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

# Test inference
model.eval()
with torch.no_grad():
    # Similar pair
    logits_similar = model(
        [source_item.text],
        source_item.features.to_tensor().unsqueeze(0).to(device),
        [target_similar.text],
        target_similar.features.to_tensor().unsqueeze(0).to(device)
    )

    # Different pair
    logits_different = model(
        [source_item.text],
        source_item.features.to_tensor().unsqueeze(0).to(device),
        [target_different.text],
        target_different.features.to_tensor().unsqueeze(0).to(device)
    )

print(f"\nSimilar pair score: {torch.sigmoid(logits_similar).item():.4f}")
print(f"Different pair score: {torch.sigmoid(logits_different).item():.4f}")

Output

Model parameters: 123,442,561 Trainable parameters: 13,956,097

Similar pair score: 0.4667 Different pair score: 0.4503

Concatenation vs Cross-Attention: When to Use Which

Cross-attention fusion isn’t always necessary. Here’s when each approach works best:

Approach	When to Use	Typical Improvement
Concatenation	Features are independent of text semantics	Baseline
Cross-Attention	Features should guide text interpretation	+10-30%

Use concatenation when:

Features are metadata (timestamp, user ID, source)
Text and features are orthogonal signals
You need maximum inference speed

Use cross-attention when:

Features encode domain knowledge (category, severity, type)
Same text should be interpreted differently based on features
Interpretability matters (attention weights explain decisions)

Results and Expectations

What can you expect from cross-attention fusion?

Approach	Relative Performance	Notes
Text embeddings only	Baseline	Misses structured signals
Features only	60-80% of text	Depends on feature quality
Concatenation	10-20% above text	No modality interaction
Cross-attention fusion	20-40% above text	Features interpret text

The improvement varies by domain. Cross-attention helps most when:

Features carry strong domain semantics (categories, types)
Text is ambiguous without context
The relationship between features and text is learnable

Full Code

Here’s the complete implementation:

"""
Cross-Attention Fusion Model
============================
Combines text embeddings with structured features via cross-attention
for classification and similarity tasks.

Usage:
    model = CrossAttentionFusionModel()
    trainer = FusionTrainer(model)

    # Train on your dataset
    for epoch in range(num_epochs):
        metrics = trainer.train_epoch(train_loader)
        val_metrics = trainer.evaluate(val_loader)
        print(f"Epoch {epoch}: F1={val_metrics['f1']:.4f}")

    # Inference with attention visualization
    logits, attention = model(
        source_texts, source_features,
        target_texts, target_features,
        return_attention=True
    )
"""

from dataclasses import dataclass
from typing import List, Optional, Dict, Tuple
from enum import Enum
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModel, AutoTokenizer
import numpy as np

# === Data Structures ===

class Category(Enum):
    CRITICAL = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3

class ItemType(Enum):
    INCIDENT = 0
    REQUEST = 1
    CHANGE = 2
    QUESTION = 3
    FEEDBACK = 4

class SourceChannel(Enum):
    EMAIL = 0
    CHAT = 1
    PHONE = 2
    PORTAL = 3

class CustomerTier(Enum):
    ENTERPRISE = 0
    PROFESSIONAL = 1
    STARTER = 2
    FREE = 3

@dataclass
class StructuredFeatures:
    category: Category
    item_type: ItemType
    source_channel: SourceChannel
    customer_tier: CustomerTier
    affected_count: float = 0.0
    time_sensitivity: float = 0.5

    def to_tensor(self) -> torch.Tensor:
        category_onehot = torch.zeros(4)
        category_onehot[self.category.value] = 1.0
        type_onehot = torch.zeros(5)
        type_onehot[self.item_type.value] = 1.0
        channel_onehot = torch.zeros(4)
        channel_onehot[self.source_channel.value] = 1.0
        tier_onehot = torch.zeros(4)
        tier_onehot[self.customer_tier.value] = 1.0
        return torch.cat([
            category_onehot, type_onehot, channel_onehot, tier_onehot,
            torch.tensor([self.affected_count, self.time_sensitivity])
        ])

@dataclass
class TextItem:
    item_id: str
    text: str
    features: StructuredFeatures
    domain: Optional[str] = None

@dataclass
class ItemPair:
    source: TextItem
    target: TextItem
    label: int

# === Model Components ===

class FeatureEncoder(nn.Module):
    def __init__(self, input_dim=19, hidden_dim=256, output_dim=768, dropout=0.1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, output_dim),
            nn.LayerNorm(output_dim)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.encoder(features).unsqueeze(1)

class CrossAttentionFusion(nn.Module):
    def __init__(self, embed_dim=768, num_heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(
            embed_dim=embed_dim, num_heads=num_heads,
            dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, feature_embedding, text_embeddings, return_attention=False):
        attended, attention_weights = self.cross_attention(
            query=feature_embedding, key=text_embeddings, value=text_embeddings,
            need_weights=True
        )
        feature_embedding = self.norm1(feature_embedding + attended)
        fused = self.norm2(feature_embedding + self.ffn(feature_embedding))
        if return_attention:
            return fused, attention_weights
        return fused

class CrossAttentionFusionModel(nn.Module):
    def __init__(
        self,
        text_model_name="sentence-transformers/all-mpnet-base-v2",
        feature_dim=19,
        hidden_dim=768,
        num_attention_heads=8,
        dropout=0.1,
        freeze_text_encoder=True
    ):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(text_model_name)
        self.text_encoder = AutoModel.from_pretrained(text_model_name)

        if freeze_text_encoder:
            for param in self.text_encoder.parameters():
                param.requires_grad = False

        self.feature_encoder = FeatureEncoder(feature_dim, 256, hidden_dim, dropout)
        self.fusion_source = CrossAttentionFusion(hidden_dim, num_attention_heads, 2048, dropout)
        self.fusion_target = CrossAttentionFusion(hidden_dim, num_attention_heads, 2048, dropout)

        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 4, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1)
        )

    def encode_text(self, texts: List[str]) -> torch.Tensor:
        inputs = self.tokenizer(
            texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
        ).to(next(self.parameters()).device)
        with torch.no_grad() if not self.text_encoder.training else torch.enable_grad():
            outputs = self.text_encoder(**inputs)
        return outputs.last_hidden_state

    def forward(self, source_texts, source_features, target_texts, target_features, return_attention=False):
        source_text_emb = self.encode_text(source_texts)
        target_text_emb = self.encode_text(target_texts)
        source_feat_emb = self.feature_encoder(source_features)
        target_feat_emb = self.feature_encoder(target_features)

        if return_attention:
            source_fused, source_attn = self.fusion_source(source_feat_emb, source_text_emb, True)
            target_fused, target_attn = self.fusion_target(target_feat_emb, target_text_emb, True)
        else:
            source_fused = self.fusion_source(source_feat_emb, source_text_emb)
            target_fused = self.fusion_target(target_feat_emb, target_text_emb)

        source_fused = source_fused.squeeze(1)
        target_fused = target_fused.squeeze(1)

        combined = torch.cat([
            source_fused, target_fused,
            source_fused - target_fused,
            source_fused * target_fused,
        ], dim=-1)

        logits = self.classifier(combined)

        if return_attention:
            return logits, {'source_attention': source_attn, 'target_attention': target_attn}
        return logits

# === Training ===

class ItemPairDataset(Dataset):
    def __init__(self, pairs: List[ItemPair]):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        pair = self.pairs[idx]
        return {
            'source_text': pair.source.text,
            'source_features': pair.source.features.to_tensor(),
            'target_text': pair.target.text,
            'target_features': pair.target.features.to_tensor(),
            'label': torch.tensor(pair.label, dtype=torch.float32)
        }

def collate_fn(batch):
    return {
        'source_texts': [b['source_text'] for b in batch],
        'source_features': torch.stack([b['source_features'] for b in batch]),
        'target_texts': [b['target_text'] for b in batch],
        'target_features': torch.stack([b['target_features'] for b in batch]),
        'labels': torch.stack([b['label'] for b in batch])
    }

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = CrossAttentionFusionModel().to(device)
    print(f"Model ready on {device}")
    print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

What’s Next

You’ve built a cross-attention fusion model that combines text semantics with structured features. From here:

Adapt the features: Replace the example schema with your domain’s structured data
Scale up: Train on your full dataset with domain-specific labels
Two-stage retrieval: Use bi-encoders for candidate retrieval, this model for reranking
Single-item classification: Simplify to one item (remove target) for classification tasks

For more on the underlying techniques:

Bi-Encoders for Semantic Search — Fast retrieval fundamentals
Cross-Encoders for Reranking — Pair-wise scoring
Visual Semantic Search with CLIP — Image-text multimodal fusion

References

Attention Is All You Need (Vaswani et al., 2017) — The transformer architecture that powers cross-attention
Perceiver: General Perception with Iterative Attention (Jaegle et al., 2021) — Cross-attention for arbitrary modalities
Attention Bottlenecks for Multimodal Fusion (Nagrani et al., 2021) — Multimodal Bottleneck Transformer
TabTransformer: Tabular Data Modeling Using Contextual Embeddings (Huang et al., 2020) — Attention for tabular data
Sentence-BERT (Reimers & Gurevych, 2019) — Efficient text embeddings