On April 5, Meta released Llama 4 — and buried inside the announcement was a number that should make every engineering team pause: 10 million tokens of context window.
That's roughly 7.5 million words. The entire Harry Potter series fits about seven times over. Meta's Llama 4 Scout model, with its 17 billion active parameters drawn from a 109-billion-parameter mixture-of-experts architecture, can theoretically hold all of it in memory at once. On a single H100 GPU.
The immediate reaction across the AI community was predictable: more context means better AI. But for those of us building production systems, a different question surfaces — one that the benchmarks don't answer. What happens when your AI has 10 million tokens of context and still can't remember what it was supposed to do?
The Llama 4 Family: Impressive Engineering, Familiar Gaps
Meta's Llama 4 launch was genuinely significant. Three models, each with distinct capabilities. Scout — the one with the headline-grabbing 10 million token context — uses a mixture-of-experts (MoE) architecture with 16 experts to keep its active parameter count at 17 billion while drawing from a total pool of 109 billion. That's the trick that lets it fit on a single GPU despite its massive context capacity.
Maverick steps up to 128 experts and 400 billion total parameters (still 17 billion active), with a 1-million-token context window. And Behemoth, with 288 billion active parameters from roughly 2 trillion total, was still training at launch — a preview of Meta's ambitions for frontier-scale open models.
These are the first Llama models built on an MoE architecture. They're natively multimodal. On Meta's reported benchmarks, Scout outperforms Gemma 3 and Gemini 2.0 Flash-Lite, and Maverick beats GPT-4o and Gemini 2.0 Flash. It's a strong showing.
But two details from the launch tell a more complicated story. First, controversy around benchmark manipulation surfaced almost immediately — questions about whether the version of Llama 4 put in front of public leaderboards was tuned for benchmark performance rather than real-world capability. Second, and perhaps more telling, Meta's license bars individuals and companies domiciled in the EU from using the models, citing the region's AI and data privacy regulations.
The benchmark controversy and the EU exclusion are connected by a thread that matters more than either issue alone: the gap between what models can theoretically do and what they reliably do in production under regulatory constraints is not shrinking. It's growing.
The Context Window Paradox
Here's the assumption embedded in every context window headline: more context equals better understanding. If your model can see more of the conversation, the codebase, the document — it should produce better output.
In controlled benchmarks, this holds. In production, it frequently doesn't. And the reason is something that anyone who has managed a long-running AI coding session already knows intuitively: models don't use context windows the way humans use memory.
When you give a model 10 million tokens of context, you're not giving it 10 million tokens of understanding. You're giving it 10 million tokens of input that it processes with attention mechanisms that have well-documented limitations. Information at the beginning and end of the context window tends to receive more attention than information in the middle. As context grows, the model's ability to retrieve specific relevant details from that context degrades — not because it can't access them, but because the signal-to-noise ratio worsens.
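To make that degradation concrete, here's a minimal sketch of the kind of needle-in-a-haystack probe used to measure it: plant one known fact at varying depths in a long filler context and check whether the model can retrieve it. The ask_model function is a stub standing in for a real API call, so the harness runs as written but only becomes meaningful once you wire in the model under test.

```python
# Minimal needle-in-a-haystack probe: plant a known fact at different depths
# in a long filler context, ask for it back, and record recall by position.
# `ask_model` is a stub; swap in a real model call to get meaningful numbers.

def build_context(needle: str, depth: float, n_filler: int = 2000) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    filler = [f"Log entry {i}: routine event, nothing notable." for i in range(n_filler)]
    filler.insert(int(depth * n_filler), needle)
    return "\n".join(filler)

def ask_model(context: str, question: str) -> str:
    # Placeholder: call the model under evaluation here.
    return ""

def probe(depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials: int = 3) -> dict:
    needle = "The deployment password is AZURE-HORIZON-42."
    question = "What is the deployment password?"
    results = {}
    for depth in depths:
        hits = sum(
            "AZURE-HORIZON-42" in ask_model(build_context(needle, depth), question)
            for _ in range(trials)
        )
        results[depth] = hits / trials
    return results

if __name__ == "__main__":
    # With a real model attached, mid-depth recall typically lags the edges.
    print(probe())
```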
This is the context loss problem at scale. In our experience building AI-assisted development systems, we've observed this pattern repeatedly: the more context an AI has available, the more critical it becomes to have systems that verify whether the AI is actually using that context correctly. A model with 4,000 tokens of context that drifts from its instructions is a nuisance. A model with 10 million tokens of context that drifts is a liability — because the developer's assumption that "it has everything it needs" makes them less likely to catch the drift early.
Think of it this way. A pilot with a small instrument panel might miss something because they don't have the data. A pilot with a 747's full cockpit might miss something because they're overwhelmed by the data. The instruments don't help if nobody is checking whether the pilot is reading them correctly.
Why Bigger Context Windows Create Bigger Verification Challenges
The engineering community's response to context loss has traditionally been a workaround culture. Developers working with AI coding assistants learn to clear context regularly, restructure prompts to front-load critical information, or split tasks into smaller chunks that fit comfortably within the model's effective attention range.
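As a rough illustration of the front-loading workaround, here's a sketch of a prompt assembler that pins critical instructions at both the start and the end of the context and trims older history to fit a budget. Counting tokens by splitting on whitespace is a deliberate simplification; a production version would use the model's actual tokenizer.

```python
# Front-load-and-re-pin sketch: instructions appear at both ends of the
# context, the newest history is kept, and older turns fall off when the
# budget is exceeded. Whitespace token counting is illustration only.

def assemble_prompt(instructions: str, history: list[str], task: str,
                    budget_tokens: int = 8000) -> str:
    def n_tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    used = 2 * n_tokens(instructions) + n_tokens(task)  # instructions appear twice
    kept: list[str] = []
    for turn in reversed(history):           # walk backwards: newest turns first
        if used + n_tokens(turn) > budget_tokens:
            break                            # older turns are dropped
        kept.insert(0, turn)
        used += n_tokens(turn)

    return "\n\n".join([
        instructions,                        # front-loaded at the start
        "--- recent history ---",
        *kept,
        "--- current task ---",
        task,
        instructions,                        # re-pinned at the end
    ])
```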
Llama 4's 10-million-token window doesn't eliminate the need for these workarounds. In some ways, it makes them harder to implement, because the temptation to just dump everything into context and let the model figure it out is much stronger when the technical limitation appears to be gone.
There are three specific challenges that scale with context window size:
The first is instruction decay. When a model receives instructions at the beginning of a session and then processes millions of tokens of subsequent information, those initial instructions compete with a growing volume of in-context examples, code patterns, and conversational history. We've documented this extensively — AI models working on large codebases will begin conforming to patterns they find in the code rather than the specifications they were given, especially after extended sessions. With a 10-million-token context, the surface area for this kind of drift multiplies enormously.
The second is compaction loss. Even 10 million tokens isn't infinite, and real-world sessions can exceed it. When models compress or summarize earlier context to make room for new information — a process typically called compaction — critical details get lost. Architectural decisions made early in a session, constraints that seemed obvious at the time, edge cases that were explicitly discussed — all of these can vanish during compaction, and the model will proceed confidently without them. One mitigation, sketched after the third challenge below, is to make compaction constraint-aware so that pinned decisions survive every pass.
The third is verification complexity. When a model's output is informed by a 10-million-token context, tracing why it made a particular decision becomes proportionally harder. If the model generates code that contradicts a specification, was it because it never saw the specification? Because it saw it but weighted other context more heavily? Because compaction removed it? With shorter contexts, developers can usually trace the reasoning. With contexts this large, you're essentially debugging a system that processed more information than any human could review.
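Here is the constraint-preserving compaction sketch mentioned above. It's a simplified illustration, not any particular framework's API: items explicitly pinned as decisions, constraints, or edge cases are carried across every summarization pass verbatim, while ordinary turns remain eligible for compression. The summarize function is a placeholder for whatever summarizer a real system would call.

```python
# Constraint-preserving compaction sketch: pinned items survive every pass
# verbatim; only unpinned turns are compressed into a summary.
from dataclasses import dataclass, field

@dataclass
class ContextItem:
    text: str
    pinned: bool = False   # architectural decisions, explicit constraints, edge cases

def summarize(texts: list[str]) -> str:
    # Placeholder: a real system would call a summarization model here.
    return f"[summary of {len(texts)} earlier turns]"

@dataclass
class Session:
    items: list[ContextItem] = field(default_factory=list)

    def compact(self, keep_recent: int = 20) -> None:
        if len(self.items) <= keep_recent:
            return  # nothing old enough to compress
        old, recent = self.items[:-keep_recent], self.items[-keep_recent:]
        pinned = [item for item in old if item.pinned]
        compressible = [item.text for item in old if not item.pinned]
        # Pinned items are carried forward verbatim, never summarized away.
        self.items = pinned + [ContextItem(summarize(compressible))] + recent
```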
What Actually Helps: Independent Verification at the Infrastructure Level
The solution to the context loss problem has never been bigger context windows. That's like solving forgetfulness by carrying a bigger notebook — it helps, but only if you also have a system for checking whether you read the right page at the right time.
What production AI systems need is verification that operates independently of the model's own context processing. This means machine-readable specifications that can be checked against output regardless of what the model claims to have considered. It means session management systems that preserve critical state across context boundaries, not within them. It means audit mechanisms that compare what was specified with what was produced, without relying on the model's self-reported understanding.
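What does a machine-readable specification look like in practice? At its simplest, it's a set of rules evaluated directly against the output, with no reliance on what the model says it considered. The rule names and patterns below are invented for illustration; a real spec would lean on AST checks, schema validation, or a policy engine, but the principle is the same.

```python
# Minimal spec-as-code audit: each rule is a machine-checkable predicate run
# directly against the generated output, independent of the model's self-report.
import re

SPEC_RULES = {
    "no_fstring_sql": lambda code: 'execute(f"' not in code,
    "uses_logger_not_print": lambda code: "print(" not in code,
    "timeout_configured": lambda code: re.search(r"timeout\s*=\s*\d+", code) is not None,
}

def audit(generated_code: str) -> list[str]:
    """Return the names of every spec rule the generated code violates."""
    return [name for name, passes in SPEC_RULES.items() if not passes(generated_code)]

sample = 'response = requests.get(url)\nprint(response.status_code)'
print(audit(sample))  # ['uses_logger_not_print', 'timeout_configured']
```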
At CleanAim®, this is what our Session Handoff System does — we've processed over 1,000 handoffs with 92% automation, maintaining context continuity not by trusting the model to remember, but by independently restoring verified state at each session boundary. When context is lost — and it will be, regardless of window size — the system catches it before the output is affected.
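For readers who want a feel for the general pattern, here is a deliberately simplified, generic illustration (not CleanAim's implementation): verified state is serialized with an integrity digest at the session boundary, and the next session refuses to resume if restoration fails verification.

```python
# Generic session-handoff sketch: state is written with an integrity digest
# and re-verified on restore, so continuity never depends on model memory.
import hashlib
import json
import pathlib

def write_handoff(path: str, spec: dict, decisions: list[str]) -> None:
    payload = {"spec": spec, "decisions": decisions}
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(body.encode()).hexdigest()
    pathlib.Path(path).write_text(json.dumps({"digest": digest, "payload": payload}))

def restore_handoff(path: str) -> dict:
    record = json.loads(pathlib.Path(path).read_text())
    body = json.dumps(record["payload"], sort_keys=True)
    if hashlib.sha256(body.encode()).hexdigest() != record["digest"]:
        raise ValueError("handoff state failed verification; do not resume blindly")
    return record["payload"]

write_handoff("handoff.json", spec={"max_retries": 3}, decisions=["use Postgres, not SQLite"])
print(restore_handoff("handoff.json"))
```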
The Broader Pattern: Capability Announcements vs. Reliability Infrastructure
Llama 4's launch follows a pattern we've seen throughout 2025. Each month brings models with impressive new capabilities — deeper reasoning in February, massive context windows in April — and each month, the infrastructure to make those capabilities reliable in production falls further behind.
The MoE architecture in Llama 4 is genuinely innovative. The ability to run a model with 109 billion parameters on a single GPU is an engineering achievement. But the architecture itself introduces its own verification challenges: different experts activate for different inputs, meaning the same model can behave quite differently depending on which subset of its knowledge is engaged. For teams deploying these models in production, this adds another dimension of variability that existing testing and monitoring approaches weren't designed to handle.
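A toy router makes the variability point concrete. The example below is not Llama 4's routing code; real MoE routing is learned per layer and per token, while this stands in with a fixed random projection over a bag-of-bytes feature. It simply shows how a top-k gate sends different inputs to different expert subsets, which is the property that makes behavior input-dependent in ways dense models aren't.

```python
# Toy top-k expert router: which experts fire depends on the input, so two
# prompts exercise different slices of the network.
import numpy as np

rng = np.random.default_rng(seed=0)
N_EXPERTS, TOP_K = 16, 2                         # Scout-sized expert pool, small top-k
ROUTER = rng.standard_normal((256, N_EXPERTS))   # stand-in for learned router weights

def route(text: str) -> list[int]:
    features = np.zeros(256)
    for byte in text.encode("utf-8"):            # crude bag-of-bytes featurization
        features[byte] += 1.0
    scores = features @ ROUTER                   # one score per expert
    return sorted(np.argsort(scores)[-TOP_K:].tolist())

print(route("Refactor the payment retry logic"))
print(route("Summarize this quarterly report"))  # often a different expert subset
```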
And the EU exclusion is a reminder that regulatory reality constrains deployment in ways that technical capability doesn't. A model you can't deploy in the EU is, for European enterprises, irrelevant — no matter how large its context window. The governance gap isn't theoretical. It's already limiting what organizations can use and where they can use it.
Looking Ahead
Meta's Llama 4 Behemoth — with 288 billion active parameters from roughly 2 trillion total — was still training when Scout and Maverick launched. When it arrives, it will likely push the boundaries of what open-weight models can do. The competition between Meta, OpenAI, Google, Anthropic, and the growing field of Chinese AI labs will continue driving rapid capability improvements.
But the question for engineering teams isn't whether next quarter's model will be more capable. It almost certainly will be. The question is whether the systems verifying that capability will keep pace.
A 10-million-token context window is an engineering marvel. Using it reliably in production is an engineering problem that context windows alone don't solve. The teams that recognize this distinction early will be the ones who can actually deploy these models at scale — not just benchmark them.
