Lost in the Middle: Why Your LLM Ignores What You Give It

Article · 7 min read
One million tokens of context. That’s the number everyone throws around for the big models now. It’s a reassuring number: load dozens of files, entire conversations, a full project’s worth of docs. The model will have “seen everything.”

The problem is that “seen” and “used” are not the same thing.

Long context is a comfortable illusion

Concrete scenario. You give an LLM an 80,000-token context: application logs, a few config files, the relevant API documentation. The critical piece of information (the timeout value that explains everything) sits roughly in the middle. The model responds. The answer is confident, well-structured, technically coherent. It’s wrong.

No hesitation, no “I couldn’t find that information in the provided context.” The model filled the gaps with what it already knew, and it didn’t flag the difference.

That’s the real problem. It’s not that the model fails; it’s that it fails without an error signal.

An LLM’s context window isn’t a RAM buffer where every byte carries equal weight. It’s an attention distribution over tokens, and that distribution is not uniform. It never was. Models read differently depending on where information sits in the context, and we can now measure how much.

Measuring the problem: two studies, same conclusion

Liu et al. (2023): the U-shaped curve

The “Lost in the Middle” paper by Liu et al. (TACL, 2023) has become the reference on this topic. The setup: multi-document Q&A and key-value retrieval tasks, where the position of the relevant document in the context varies systematically.

The result is a U-shaped curve. Performance peaks when the information is at the beginning or the end of the context. When it’s in the middle, performance collapses, and the collapse scales with context length.

graph LR
    D["📍 Start"] -->|"~75-85%"| P1(( ))
    M["📍 Middle"] -->|"~42-55%"| P2(( ))
    F["📍 End"] -->|"~72-80%"| P3(( ))

    P1 --> PERF["Performance"]
    P2 --> PERF
    P3 --> PERF

    style P1 fill:#22c55e,stroke:#22c55e
    style P2 fill:#ef4444,stroke:#ef4444
    style P3 fill:#22c55e,stroke:#22c55e

This isn’t a bug in one particular model. The study covers multiple architectures. The U-shape appears every time.
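The protocol is easy to reproduce on your own stack. Here is a minimal sketch, assuming a hypothetical `ask_model(context, question)` callable that wraps whatever LLM client you use (it is not part of any real library): it moves one relevant document through every position among filler documents and records whether the answer still contains the expected value.

```python
from typing import Callable, Dict, List


def build_context(fillers: List[str], needle: str, position: int) -> str:
    """Insert the relevant document at index `position` among the fillers."""
    docs = fillers[:position] + [needle] + fillers[position:]
    return "\n\n".join(docs)


def position_sweep(
    ask_model: Callable[[str, str], str],  # hypothetical wrapper around your LLM
    fillers: List[str],
    needle: str,
    question: str,
    expected: str,
) -> Dict[int, bool]:
    """Map each insertion index to whether the answer contains `expected`."""
    results: Dict[int, bool] = {}
    for position in range(len(fillers) + 1):
        context = build_context(fillers, needle, position)
        answer = ask_model(context, question)
        results[position] = expected in answer
    return results
```

Substring matching is a crude correctness check, and a single pass per position is noisy. In practice you would repeat the sweep over many questions and average per position before drawing a curve.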

Boytsov et al.: FirstP, the embarrassing baseline

Boytsov et al. (arXiv:2207.01262) test over 20 models (including GPT-4o-mini and Claude Haiku-3) on a simple task: find the relevant information in a long document.

They compare these models against a baseline they call FirstP: cut the document after the first 512 tokens and ignore the rest. No long-context window, no extended attention, just the first 512 tokens.

Result: no long-context model beats FirstP by any meaningful margin. Maximum gap: 5%.

Why? Because in the training data for these models, the relevant information almost always appears in the first tokens of documents. The models learned to focus on the beginning, and that habit sticks even when you hand them 100,000 tokens.
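The baseline itself fits in a few lines. This is a rough sketch, not the authors' code: it assumes whitespace-separated words as a stand-in for tokens, where a real run would use the tokenizer of the model being evaluated.

```python
def first_p(document: str, p: int = 512) -> str:
    """Keep only the first `p` tokens of a document and drop the rest.

    Assumption: tokens are approximated by whitespace-separated words.
    """
    tokens = document.split()
    return " ".join(tokens[:p])
```

Running your own retrieval or ranking task once on the full documents and once on `first_p(document)` gives you the same comparison on your data.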

Warning

A model with 1M tokens of available context can be statistically equivalent to a model reading only the first 512 tokens. That’s the Boytsov et al. finding on ranking benchmarks.

Why this is structural

Positional bias isn’t an implementation bug. It has distinct architectural causes, identified and formalized in recent literature.

Causal attention accumulates weight on early tokens. In an LLM, each token can only see the tokens before it. The token at position 1 is visible to every subsequent token: it accumulates attention across the full context length. A token in the middle of a 500,000-token context is only visible to half of them. That’s not a model choice. It’s the mechanics of causal attention, by design.
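A toy calculation makes the asymmetry visible. The sketch below assumes every token spreads its attention uniformly over the tokens it can see, which real models do not do; it only isolates the effect of the causal mask.

```python
from typing import List


def visible_to(position: int, n: int) -> int:
    """Number of tokens (itself included) that can attend to `position`."""
    return n - position


def attention_received(n: int) -> List[float]:
    """Total attention each position collects under uniform causal attention.

    The query at index j gives 1 / (j + 1) to each of positions 0..j, so
    position i receives the sum of 1 / (j + 1) over all j >= i.
    """
    received = [0.0] * n
    running = 0.0
    for query in reversed(range(n)):
        running += 1.0 / (query + 1)
        received[query] = running
    return received
```

Under this assumption the first position collects the most attention, and the total falls steadily as you move toward the end of the context.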

RoPE creates a dead zone in the middle. Most modern models use Rotary Position Embedding (RoPE) to encode token positions. RoPE introduces an attention score decay that grows with the distance between tokens: the further apart, the less they influence each other. A token near the end benefits from proximity to adjacent tokens (recency), and tokens at the very beginning retain special salience (primacy). A token in the middle? Too far from the start for primacy, too far from the end for recency. Dead zone.
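The decay can be illustrated with a simplified calculation. This sketch assumes an identical query and key whose 2-D pairs are all the unit vector (1, 0), with the common default base of 10000. Real queries and keys are learned, so treat this as the shape of the tendency, not a measurement of any model.

```python
import math


def rope_score(distance: int, dim: int = 128, base: float = 10000.0) -> float:
    """Dot product of a RoPE-rotated query and key at a given relative distance.

    Assumption: every 2-D pair of both vectors is (1, 0), so each pair
    contributes cos(distance * theta_i) to the score.
    """
    score = 0.0
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)  # rotation frequency of pair i
        score += math.cos(distance * theta)
    return score


def mean_score(start: int, stop: int) -> float:
    """Average score over relative distances in [start, stop)."""
    return sum(rope_score(d) for d in range(start, stop)) / (stop - start)
```

The score oscillates from one distance to the next, so compare averages over windows, for example `mean_score(1, 101)` against `mean_score(1000, 1100)`, rather than individual values.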

There’s a third dimension, more formal. Softmax (the function that normalizes attention scores into probabilities) always distributes probability mass across all visible tokens. Vasylenko et al. (2026) prove that the entropy of this distribution grows on the order of log n as the number of visible tokens n grows: weights converge toward uniform. In plain terms, the longer the context, the more attention dilutes. Weak signals drown in statistical noise.
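The dilution is easy to see numerically. The sketch below is a toy setup, not the construction from the paper: one token gets a fixed score (5.0, an arbitrary choice) and every other token gets 0, then the context grows.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def needle_weight(n: int, needle_score: float = 5.0) -> float:
    """Attention weight of one high-scoring token among n - 1 tokens at 0."""
    return softmax([needle_score] + [0.0] * (n - 1))[0]
```

With the score held fixed, `needle_weight(n)` shrinks as `n` grows, because the denominator gains one term per extra token while the numerator stays the same.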

graph LR
    subgraph CTX["Context window"]
        D["📍 Start"]
        M["📍 Middle"]
        F["📍 End"]
    end

    subgraph EFF["Positional effects"]
        E1["① Causal attention\nVisible to all subsequent tokens"]
        E2["② RoPE decay\nDead zone - too far\nfrom start AND end"]
        E3["③ Softmax dilution\nRecency proximity\n→ score preserved"]
    end

    D --> E1
    M --> E2
    F --> E3

    E1 --> OK1(["✓ Strong signal"])
    E2 --> KO(["✗ Signal lost"])
    E3 --> OK2(["✓ Strong signal"])

    style D fill:#1e1e2e,stroke:#6b7280,color:#e2e8f0
    style M fill:#1e1e2e,stroke:#6b7280,color:#e2e8f0
    style F fill:#1e1e2e,stroke:#6b7280,color:#e2e8f0
    style E1 fill:#22c55e,stroke:#22c55e,color:#000
    style E2 fill:#ef4444,stroke:#ef4444,color:#fff
    style E3 fill:#22c55e,stroke:#22c55e,color:#000
    style OK1 fill:#22c55e,stroke:#22c55e,color:#000
    style KO fill:#ef4444,stroke:#ef4444,color:#fff
    style OK2 fill:#22c55e,stroke:#22c55e,color:#000

These three effects compound. This isn’t about model capability. 1M tokens don’t eliminate the bias, they amplify it. The longer the context, the higher the probability that critical information lands in a poorly covered zone.

And the core problem stays the same: you don’t know what the model used to build its answer. There’s no attention log, no trace of which tokens actually mattered. The response arrives, confident, with no way for you to tell whether it’s grounded in the information you provided or in training priors.

That’s a fundamental property of how LLMs generate answers. An overloaded context isn’t transparent. It’s opaque.

What this breaks in practice

Two patterns that seem reasonable but become fragile once you factor in this bias.

Loading an agent with lots of information to make it autonomous. The idea: if the agent has access to everything, it can decide what’s relevant on its own. In practice, the agent will overweight whatever comes first in the context, ignore what’s in the middle, and fill gaps with training priors. It’ll be autonomous, but not reliable.

Cramming multiple capabilities into a single system. The idea: give one agent the ability to read code, analyze logs, call APIs, and interpret specs, then feed it the full available context every time. The result: a system whose behavior depends on the order information arrives in the context, which is almost never fully under your control.

In both cases, you’re delegating information prioritization to the model. And the model does that prioritization based on positional criteria you don’t control.

The answer: smaller contexts, not bigger ones

The natural intuition from the studies: if the model performs better on information at the start of the context, put important information at the start.

It helps. It doesn’t fix the underlying problem.
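For reference, the reordering trick looks like this. It is a small sketch that assumes you already have chunks ranked by relevance (how you rank them is out of scope): the best chunks alternate between the front and the back, so the weakest ones land in the middle.

```python
from typing import List


def place_at_edges(ranked_chunks: List[str]) -> List[str]:
    """Reorder chunks so the most relevant sit at the start and the end.

    `ranked_chunks` must be sorted most relevant first. Even ranks fill the
    front in order; odd ranks fill the back in reverse order.
    """
    front: List[str] = []
    back: List[str] = []
    for index, chunk in enumerate(ranked_chunks):
        if index % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]
```

With five chunks ranked `a` to `e`, the result is `a, c, e, d, b`: the two best chunks occupy the two ends and the least relevant one sits in the center.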

The real answer is to reduce cognitive load per agent, not to reorganize what goes inside. Each agent should know exactly what it needs, and nothing else.

graph TB
    subgraph "Fat context - single agent"
        FC_IN["All files<br/>All logs<br/>All docs<br/>All specs"] --> FC_A["Single agent"]
        FC_A --> FC_OUT["Response (opaque)"]
    end

    subgraph "Targeted contexts - orchestration"
        ORCH["Orchestrator"] --> A1["Agent A<br/>(context: logs only)"]
        ORCH --> A2["Agent B<br/>(context: code only)"]
        ORCH --> A3["Agent C<br/>(context: specs only)"]
        A1 --> ORCH
        A2 --> ORCH
        A3 --> ORCH
        ORCH --> OUT["Synthesis"]
    end

In the orchestrated model, each agent has a small, precise context whose content you control. The position of information in that context is deterministic. The orchestrator synthesizes results: it receives conclusions, not raw data dumps.

This isn’t just an architecture best practice. It’s a direct response to positional bias. If the context is short and focused enough that all useful information sits in the high-attention zone, the problem largely disappears.
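The orchestrated flow can be sketched in a few lines. `run_agent` and `synthesize` are hypothetical callables standing in for your own LLM calls; the point of the sketch is only the data flow, where each agent sees one scoped context and the synthesis step sees conclusions, never raw data.

```python
from typing import Callable, Dict

# Hypothetical signatures, not taken from any framework:
RunAgent = Callable[[str, str], str]               # (scoped_context, task) -> conclusion
Synthesize = Callable[[str, Dict[str, str]], str]  # (task, conclusions) -> final answer


def orchestrate(
    task: str,
    scoped_contexts: Dict[str, str],
    run_agent: RunAgent,
    synthesize: Synthesize,
) -> str:
    """Send each agent only its own context, then synthesize the conclusions."""
    conclusions: Dict[str, str] = {}
    for name, context in scoped_contexts.items():
        # The agent never sees the other scopes.
        conclusions[name] = run_agent(context, task)
    # The synthesis step receives short conclusions, not the raw contexts.
    return synthesize(task, conclusions)
```

Because `scoped_contexts` is built by your code, you decide what each agent sees and in what order, instead of leaving that to whatever happened to be loaded.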

Tip

The question to ask yourself: do I know exactly what context my agent has? If the answer is “yes, and it’s short,” positional bias is under control. If the answer is “it has access to a lot of things,” you have a latent reliability problem.

A concrete implementation: the team-lead pattern

The team-lead pattern is a direct illustration of this principle. The orchestrator, the “team lead,” never reads code directly. It receives short summaries from specialized agents, synthesizes them, and issues directives.

The specialized agents work with surgical contexts: one file, one function, one well-defined scope. They don’t need to know the global state of the system to do their job correctly. And reviewers arrive without production context. They didn’t write the code, they have no investment in the choices made. That eliminates a different bias: familiarity.

The architectural benefit: the orchestrator’s scratchpad can be deterministic (a list of factual conclusions) where an auto-generated model summary would be probabilistic. You control what the orchestrator “knows” at each step.
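One way to make that scratchpad deterministic is to store conclusions as data and render them with plain code. This is a sketch under that assumption; the `Finding` and `Scratchpad` names are illustrative and do not come from the pattern's reference article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Finding:
    source: str      # which specialized agent reported it
    statement: str   # one short factual conclusion


@dataclass
class Scratchpad:
    findings: List[Finding] = field(default_factory=list)

    def add(self, source: str, statement: str) -> None:
        """Record a conclusion once, ignoring exact duplicates."""
        finding = Finding(source, statement.strip())
        if finding not in self.findings:
            self.findings.append(finding)

    def render(self) -> str:
        """Render findings in sorted order, independent of arrival order."""
        ordered = sorted(self.findings, key=lambda f: (f.source, f.statement))
        return "\n".join(f"- [{f.source}] {f.statement}" for f in ordered)
```

The same set of findings always renders to the same text, whatever order the agents reported in, so what the orchestrator reads at each step is reproducible.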

This pattern is documented in detail in this dedicated article.

What this means for how you design your systems

The question to keep in mind when building systems that involve LLMs: do I know what my model used to answer?

If the answer is no, if you can’t trace which parts of the context influenced the response, you have a black box. That’s fine for a conversational assistant. It’s a problem for any system where reliability matters.

Context size is not a proxy for quality. 2,000 carefully selected tokens beat 200,000 tokens dumped in bulk. The studies confirm it. FirstP confirms it with a certain irony: in many cases, reading only the beginning is enough.

This isn’t a limitation that the next model release will fix. It’s a structural property of attention over long contexts. Building reliable systems with LLMs means building systems that don’t depend on the model’s ability to navigate a mass of information, because that ability is more limited than it looks.