2026 Frontier LLM Architectures: MLA, iRoPE, mHC, and the Race for Efficiency
Technical comparison of DeepSeek V3.2, Llama 4, Gemini 3, and Qwen3 architectures—plus DeepSeek's mHC innovation expected in V4.
Every frontier LLM released in the past year uses Mixture-of-Experts. On that, the major labs agree. But the paths to efficiency diverge sharply from there—DeepSeek compresses attention into latent space, Meta interleaves position embeddings, Google bets on native multimodality, and Alibaba lets you toggle reasoning depth on demand.
Then on New Year’s Eve, DeepSeek dropped a paper that may reshape the next generation entirely.
The Architectures
Here’s the current landscape of frontier models and their architectural choices:
| Model | Total Params | Active Params | Attention | MoE Config |
|---|---|---|---|---|
| DeepSeek V3.2 | 685B | 37B | MLA + DSA | 256 experts, 8+1 active |
| Llama 4 Behemoth | 2T | 288B | iRoPE | 16 experts (unreleased) |
| Llama 4 Maverick | 400B | 17B | iRoPE | 128 experts |
| Llama 4 Scout | 109B | 17B | iRoPE | 16 experts |
| Gemini 3 Pro | ~2-4T* | ~150-200B* | Standard MHA | Sparse MoE |
| Qwen3-235B | 235B | 22B | GQA | 128 experts, 8 active |
*Gemini 3 architecture not fully disclosed by Google. Parameter estimates based on inference-latency analysis and industry reports.
Claude Opus 4.5 remains closed-source, so we can’t compare its architecture directly. But Anthropic’s focus on agentic stability and tool-use reliability suggests different optimization priorities than raw parameter efficiency.
Multi-head Latent Attention (MLA)
DeepSeek’s signature innovation compresses the key-value cache into a lower-dimensional latent space.
Standard attention stores full-dimensional keys and values for every token in the context. For a 128K context window, that’s enormous memory pressure. MLA projects KV pairs down to a compact latent representation before caching, then projects back up at inference time.
Think of it like LoRA for attention—down-project, store, up-project. The quality loss is minimal because the latent space captures the essential relationships. The memory savings are substantial because you’re caching compressed representations instead of full vectors.
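A minimal PyTorch sketch of that down-project, store, up-project pattern. The dimensions are illustrative, and real MLA also handles RoPE and per-head structure that this omits:

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy MLA-style compression: cache a small latent per token, expand at read time."""
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress before caching
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to values

    def write(self, x):
        # Only d_latent floats per token are cached, instead of
        # 2 * n_heads * d_head for full keys plus values (here 512 vs 8192).
        return self.down(x)

    def read(self, latent_cache):
        return self.up_k(latent_cache), self.up_v(latent_cache)

cache = LatentKVCache()
x = torch.randn(1, 128, 4096)        # a short stand-in for a long context
latents = cache.write(x)             # what actually lives in the KV cache
k, v = cache.read(latents)           # reconstructed just in time for attention
```

With these illustrative sizes the cache shrinks by roughly 16x per token, which is where the long-context memory savings come from.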
Combined with DeepSeek Sparse Attention (DSA), which reduces attention complexity from O(L²) to O(Lk) by selecting only the most relevant tokens, V3.2 handles long contexts efficiently without the typical memory explosion.
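The token-selection side can be sketched as plain top-k attention. This is not DeepSeek's actual indexer, just the O(Lk) shape of the computation, with the relevance scoring left deliberately naive:

```python
import torch

def topk_sparse_attention(q, k, v, k_keep=64):
    """For each query, attend only to the k_keep highest-scoring keys.

    q, k, v: (L, d). The softmax and value mix cost O(L * k_keep); the scoring
    pass here is still O(L^2), which a real implementation would replace with
    a much cheaper relevance indexer. Causal masking is omitted for brevity.
    """
    scores = q @ k.T / k.shape[-1] ** 0.5               # (L, L) relevance scores
    idx = scores.topk(k_keep, dim=-1).indices           # (L, k_keep) kept keys per query
    kept = scores.gather(-1, idx)                       # scores of the kept tokens
    weights = kept.softmax(dim=-1)                      # attention over kept tokens only
    return torch.einsum("lk,lkd->ld", weights, v[idx])  # weighted sum of kept values

L, d = 1024, 128
out = topk_sparse_attention(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d))
```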
iRoPE: Infinite Context Through Interleaving
Llama 4 takes a different approach to long-context handling with iRoPE—interleaved Rotary Position Embeddings.
Standard transformers apply position embeddings uniformly across all layers. Llama 4 alternates between RoPE layers (which encode position) and NoPE layers—“No Position Embedding” layers that operate without positional bias. The insight: some layers benefit from knowing where tokens are in the sequence, while others work better attending to semantic relationships regardless of position.
RoPE layers capture local context and relative positions. NoPE layers maintain global understanding without positional bias creating artifacts. Temperature scaling during inference helps the model generalize to sequence lengths beyond training.
The “i” stands for “interleaved” but hints at the long-term goal: infinite context length support. Meta’s 10 million token context claim comes from this architecture.
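A minimal sketch of the alternation, assuming a standard rotary embedding and a hypothetical every-fourth-layer NoPE schedule; Meta has not published the exact ratio or the temperature-scaling details, so treat the specifics as placeholders:

```python
import torch

def rope(x, base=10_000.0):
    """Apply rotary position embedding to x of shape (seq, n_heads, d_head)."""
    seq, _, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs   # (seq, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def attention(q, k, v, use_rope: bool):
    """One attention layer; iRoPE alternates use_rope across the stack.

    Causal masking omitted for brevity. q, k, v: (seq, n_heads, d_head).
    """
    if use_rope:                      # positional layers: rotate q and k by position
        q, k = rope(q), rope(k)
    # NoPE layers attend purely on content, with no positional bias at all
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", scores.softmax(-1), v)

# Hypothetical interleaving schedule: every 4th layer drops positional encoding.
layer_uses_rope = [(i + 1) % 4 != 0 for i in range(32)]
```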
Native Multimodality: Early Fusion
Both Llama 4 and Gemini 3 moved to “early fusion” for multimodal processing—a significant departure from the adapter approach.
Previous multimodal models processed text and vision separately, then combined representations later (late fusion). Llama 4 immediately concatenates text tokens and visual tokens into a unified sequence before any transformer processing begins.
This creates cross-modal attention from layer one. The model learns joint text-vision representations from the earliest stages rather than trying to align pre-trained unimodal representations after the fact. The result is more natural cross-modal reasoning—the kind needed for questions like “what’s wrong with this code screenshot?” or “describe the relationship between these diagram elements.”
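In code, the core move is just projecting image patches into the text embedding space and concatenating before the first transformer layer. The sizes below are illustrative, not any model's actual config:

```python
import torch
import torch.nn as nn

class EarlyFusionInput(nn.Module):
    """Project image patches into the text embedding space, then concatenate."""
    def __init__(self, vocab=32_000, d_model=1024, patch_dim=768):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab, d_model)
        self.patch_proj = nn.Linear(patch_dim, d_model)   # vision tokens -> same width as text

    def forward(self, text_ids, image_patches):
        text_tokens = self.tok_embed(text_ids)            # (B, T_text, d_model)
        image_tokens = self.patch_proj(image_patches)     # (B, T_img, d_model)
        # One unified sequence: every transformer layer sees both modalities,
        # so cross-modal attention happens from layer one.
        return torch.cat([image_tokens, text_tokens], dim=1)

fused = EarlyFusionInput()(torch.randint(0, 32_000, (1, 16)), torch.randn(1, 256, 768))
```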
Gemini 3 extends this to video and audio, creating what Google describes as “the first truly multimodal foundation model”—a claim that ignores GPT-4o and other prior work.
Expert Routing Strategies
The MoE implementations diverge significantly:
DeepSeek V3.2: 256 experts with 8 routed + 1 shared expert active per token. The shared expert handles common patterns across all inputs, providing stability. The 8 routed experts specialize. This 8+1 configuration is more expensive than alternatives but provides better coverage.
Llama 4: Maverick uses 128 routed experts plus a shared expert; Scout uses 16 experts. Both activate only 17B parameters regardless of total size, which is aggressive sparsity: Maverick activates just 4.3% of its parameters per token. Meta's release notes describe each token going to the shared expert plus a single routed expert, the lowest activation count among these models.
Qwen3: 128 experts with 8 active, but notably dropped the shared expert that Qwen2.5 used. Alibaba hasn’t explained why, but eliminating the shared expert saves compute when routing already works well.
Gemini 3: Google hasn’t disclosed specifics, but describes “sparse MoE” with selective activation. The trillion-parameter scale suggests very low activation ratios.
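As a rough illustration of the shared-plus-routed pattern DeepSeek uses, here is a toy layer with one always-on shared expert and top-8 routed experts. The expert networks are collapsed to single linear layers and the routing loop is deliberately naive; real implementations batch tokens by expert:

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Toy 8+1 MoE layer: one always-on shared expert plus top-k routed experts."""
    def __init__(self, d_model=64, n_experts=256, top_k=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.shared = nn.Linear(d_model, d_model)        # processes every token
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)            # (tokens, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)     # keep 8 routed experts per token
        outputs = []
        for t in range(x.shape[0]):                      # per-token loop for clarity only
            y = self.shared(x[t])                        # the "+1" shared expert, always active
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.experts[int(e)](x[t])   # weighted sum of routed experts
            outputs.append(y)
        return torch.stack(outputs)

out = SharedPlusRoutedMoE()(torch.randn(4, 64))          # only 9 of 257 experts touch each token
```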
Training Stability: Dense Layers First
One pattern emerged across multiple architectures: starting with dense layers before MoE.
DeepSeek V3 uses 3 dense layers before MoE begins—an approach other labs have since adopted. The reasoning: early layers need to extract basic syntactic and semantic features before expert specialization makes sense. Introducing sparse routing immediately causes instability—the router can’t make good decisions before basic representations form.
This is a training optimization, not an inference one. But it hints at a broader challenge: MoE is powerful but fragile. Getting expert routing to work well at scale requires careful initialization and curriculum.
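The layer layout itself is simple to express. This sketch shows only the ordering, with a placeholder where a real MoE feed-forward block (like the routing sketch above) would go:

```python
import torch.nn as nn

def build_ffn_stack(n_layers=61, n_dense_first=3, d_model=64):
    """Dense feed-forward blocks for the first few layers, MoE afterwards."""
    def dense_ffn():
        return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
    moe_ffn = nn.Identity          # placeholder for a real MoE feed-forward block
    return nn.ModuleList(dense_ffn() if i < n_dense_first else moe_ffn()
                         for i in range(n_layers))

layers = build_ffn_stack()         # layers[0..2] dense, layers[3..] sparse
```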
The mHC Paper: What’s Coming in V4
On December 31, 2025, DeepSeek published Manifold-Constrained Hyper-Connections (mHC)—co-authored by DeepSeek founder Liang Wenfeng. The timing and authorship signal importance: this likely forms the backbone of V4.
The Problem
Hyper-Connections (HC), introduced by ByteDance in September 2025, generalize residual connections by allowing information to mix across multiple parallel residual streams. Instead of a single x + f(x) path, HC creates multiple pathways that can learn to route information dynamically.
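A rough sketch of the multi-stream idea. The stream count and the mixing formulation here are simplifications, not a faithful reproduction of ByteDance's design:

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """n parallel residual streams with a learned mixing matrix applied between layers."""
    def __init__(self, n_streams=4):
        super().__init__()
        # Unconstrained mixing: nothing stops this matrix from growing and
        # amplifying the residual signal layer after layer.
        self.mix = nn.Parameter(torch.eye(n_streams) + 0.01 * torch.randn(n_streams, n_streams))

    def forward(self, streams, layer_out):
        # streams: (n_streams, seq, d_model); layer_out: (seq, d_model)
        mixed = torch.einsum("ij,jsd->isd", self.mix, streams)   # route info across streams
        return mixed + layer_out                                 # add f(x) back, as in x + f(x)

hc = HyperConnection()
out = hc(torch.randn(4, 16, 64), torch.randn(16, 64))            # updated streams: (4, 16, 64)
```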
The problem: unconstrained mixing amplifies or suppresses signals across layers. DeepSeek reports that in 27B parameter experiments, HC caused signal gains exceeding 3000×—leading to catastrophic training divergence.
The Solution
mHC constrains residual mixing to lie on the Birkhoff polytope, the set of doubly stochastic matrices: nonnegative matrices whose rows and columns each sum to 1. Such matrices redistribute information across streams without amplifying or suppressing the total signal.
The implementation uses the Sinkhorn-Knopp algorithm to project mixing matrices onto this manifold. The result: signal amplification drops from 3000× to 1.6×—bounded propagation regardless of model depth.
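The projection step is the classic Sinkhorn-Knopp iteration: alternately normalize rows and columns until the matrix is approximately doubly stochastic. How exactly mHC parameterizes and applies the mixing matrix is detailed in the paper; this is just the generic algorithm:

```python
import torch

def sinkhorn_knopp(logits, n_iters=20):
    """Project a square score matrix onto (approximately) the Birkhoff polytope.

    Exponentiate to get positive entries, then alternate row and column
    normalization. The result mixes streams without changing total signal.
    """
    m = logits.exp()
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)   # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)   # columns sum to 1
    return m

mix = sinkhorn_knopp(torch.randn(4, 4))
print(mix.sum(dim=0), mix.sum(dim=1))        # both close to all-ones
```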
The Results
Testing on DeepSeek’s V3 architecture base at 3B parameters:
| Benchmark | Baseline | HC | mHC |
|---|---|---|---|
| BIG-Bench Hard | 43.8% | 48.9% | 51.0% |
| DROP (F1) | 47.0 | 51.6 | 53.9 |
Test conditions: 3B parameter models, 5-shot evaluation, single attempt. Baseline is the same architecture with standard residual connections and no HC. See arXiv:2512.24880 for methodology.
Standard Hyper-Connections already improved performance over baseline—mHC adds incremental gains on top. But the real contribution isn’t benchmark numbers. It’s that HC failed catastrophically at 27B parameters while mHC trained stably.
The overhead is minimal—6.7% additional training compute. At 27B scale, unconstrained HC diverged entirely; mHC continued training without issue.
V4 Timeline
DeepSeek follows a pattern: publish research introducing core techniques, then ship models built on them soon after. GRPO, the training method behind R1, appeared in the DeepSeekMath paper months before R1 launched, and DSA debuted in the experimental V3.2-Exp release before V3.2 proper.
mHC arrived several weeks before China's Spring Festival (mid-February 2026). If DeepSeek holds to its publish-then-ship pattern, V4 may launch in that window, likely incorporating mHC alongside potential native multimodality and more aggressive latent compression.
Architectural Convergence
Several patterns have become standard:
MoE is the default for frontier models. Dense architectures can’t compete on capability-per-FLOP. The debate is now about routing strategies, not whether to use experts.
Sparse attention is emerging. DeepSeek's DSA and various "efficient attention" mechanisms aim to break the O(L²) compute barrier; FlashAttention-style kernels, by contrast, keep exact quadratic attention but make it memory- and IO-efficient. Long context is table stakes; efficient long context is the differentiator.
Native multimodality is replacing adapters. Early fusion produces better cross-modal reasoning than bolting vision encoders onto language models. Expect text-only models to become niche.
Training stability innovations matter. Meta’s parameter scaling (Llama 4), mHC (DeepSeek), and various initialization techniques separate models that scale from those that don’t. The engineering is as important as the architecture.
Architectural Divergence
The open questions:
Attention mechanism: MLA (DeepSeek) vs GQA (Qwen3) vs standard MHA (Gemini). MLA offers the best memory efficiency but requires custom kernels. GQA is a safer middle ground with broad framework support.
Routed experts per token: 1 (Llama 4) vs 8 (Qwen3, DeepSeek) vs undisclosed (Gemini). Activating fewer experts lets total parameter count grow for the same compute, but each routing error hurts more.
Position encoding: iRoPE (Llama 4) vs standard RoPE (most others). Interleaved position embeddings show promise for extreme context lengths, but the approach is newer and less tested.
What This Means for Practitioners
Self-hosting: Llama 4 Scout fits on a single H100 (with INT4 quantization). Qwen3 models are fully open-weight (Apache 2.0). DeepSeek is MIT-licensed. For on-premises deployment, you have real options now.
Fine-tuning: Architecture matters. MLA and DSA require specialized kernels not all frameworks support. Llama 4’s standard architecture has broader tooling compatibility. Check your stack before committing to a base model.
Cost: DeepSeek’s API remains significantly cheaper than alternatives for equivalent capability—roughly 6x less than GPT-5 for similar reasoning tasks. If you don’t need self-hosting, the cost advantage is hard to ignore.
Multimodal: If your use case involves vision, prefer models with early fusion (Llama 4, Gemini 3) over adapter-based approaches. The quality difference in cross-modal reasoning is noticeable.
What’s Next
The mHC paper signals DeepSeek’s next move. V4 will likely push efficiency further while maintaining frontier capability. If mHC enables stable training at even larger scales with minimal overhead, the cost curve continues bending.
Meta’s unreleased Llama 4 Behemoth represents the opposite bet—scale up the MoE configuration to 2T parameters. The iRoPE architecture should handle extreme context lengths, testing whether 10M+ token windows have practical value.
Google’s Gemini 3 Flash suggests the “small but capable” tier is expanding. Pro-level reasoning at Flash-level speed changes the economics of high-volume applications.
And Anthropic, characteristically quiet about architecture, may surprise everyone. Their focus on reliability over raw benchmarks has produced Claude Opus 4.5’s 80.9% SWE-bench score—the highest among frontier models. Whatever they’re doing architecturally, it works for code.
The convergence on MoE and sparse attention means the next differentiation will come from training techniques, data curation, and post-training optimization. The architecture war is maturing into an engineering war.
Sources:
- Sebastian Raschka: Technical Tour of DeepSeek V3 to V3.2
- Sebastian Raschka: The Big LLM Architecture Comparison
- DeepSeek mHC Paper (arXiv:2512.24880)
- Meta: The Llama 4 Herd
- Qwen3 Technical Report (arXiv:2505.09388)
- Google: Introducing Gemini 3
- Cameron Wolfe: Llama 4 Challenges
- SCMP: DeepSeek mHC Analysis