
LLMs Corrupt Your Documents: A Hands-On Survival Guide

New Microsoft Research paper shows LLMs corrupt documents when you delegate. Here's what the DELEGATE-52 results mean and how to actually work around them.

8 min read · Intermediate

As of April 2026, a Microsoft Research paper is making the rounds with a title that lands like a system alert: LLMs Corrupt Your Documents When You Delegate. The finding is blunt – when you hand a document to a frontier model and let it edit across many turns, the file quietly rots. This isn’t a news recap. It’s a practical guide to changing how you actually use ChatGPT, Claude, and Gemini so your documents stop getting silently mangled.

A quick comparison of two ways people use LLMs right now, because one of them is exactly what the paper warns about.

Two ways to use an LLM (one of them corrupts your work)

Approach A – Persistent delegation: You paste your document into a chat, then iterate. “Tighten the intro. Now add a section on pricing. Now reword paragraph three.” Twenty messages in, you copy the final version back out. Feels productive.

Approach B – Surgical, single-shot edits: You ask for one specific change, take the diff, apply it yourself, start a fresh chat for the next edit. Slower. Annoying. Less “agentic.”

Approach A is what the paper tested with its DELEGATE-52 benchmark, and it’s the one that breaks. Across 19 models and 52 professional domains, frontier models – Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4 – corrupted an average of 25% of document content by the end of 20 interactions (arXiv, April 2026). The average across all 19 tested models was ~50% degradation. Approach B isn’t immune, but starting fresh means errors can’t compound across turns. That single design choice – fresh context vs. accumulating context – is what separates a working LLM session from a slowly poisoning one.

What DELEGATE-52 actually measured

The clever bit of the benchmark is how it scores damage without needing a human reviewer. The dataset – 234 work environments across 52 domains, open on GitHub and Hugging Face – works like this: given a seed document, the LLM performs a structural edit (the “forward” edit), then a second task that undoes it (the “backward” edit). Chain ten of those round-trips and you get 20 LLM interactions. Performance is scored by comparing the recovered document against the original using domain-specific parsers.

If the model were faithful, the round-trip should return the file unchanged. It doesn’t.
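
To make the mechanism concrete, here is a toy sketch of that round-trip loop in Python. It is not the paper's harness: llm_edit() is a hypothetical placeholder for whatever model call you use, and difflib stands in for the paper's domain-specific parsers.

# Toy round-trip loop: chain forward/backward edits, score drift after each pair.
# llm_edit() is a hypothetical placeholder, not a real API.
import difflib

def llm_edit(document: str, instruction: str) -> str:
    """Hypothetical stand-in for a chat-completion call; wire up your own client."""
    raise NotImplementedError

def run_round_trips(original: str, forward: str, backward: str, pairs: int = 10) -> str:
    doc = original
    for i in range(1, pairs + 1):
        doc = llm_edit(doc, forward)   # the "forward" structural edit
        doc = llm_edit(doc, backward)  # the "backward" edit that should undo it
        ratio = difflib.SequenceMatcher(
            None, original.splitlines(), doc.splitlines()
        ).ratio()
        print(f"after {2 * i} interactions: {1 - ratio:.1%} drift from the original")
    return doc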

Self-test: You can feel this in five minutes. Take a document, ask the model to restructure it (“reorder by date”), then ask it to restructure it back. Diff the result against your original. The deltas you see in one round-trip are what’s compounding silently across every long session you’ve ever had.

Why the distribution is uglier than the headline

The numbers everyone quotes are averages. The distribution is worse. Catastrophic corruption – a benchmark score of 80 or below – occurs in more than 80% of model-domain combinations (paper Table 2). So if you’re randomly picking a model and a document type, severe failure is the base rate, not the exception.

The one bright spot is also a trap. Python is the only domain where a majority of models reach “ready” status (17 of 19 models achieve lossless manipulation). The best model overall, Gemini 3.1 Pro, is ready for delegated workflows in only 11 of 52 domains. “Works for code” doesn’t mean “works for your YAML config, your Jupyter notebook, your CSV.” It means works for plain Python source.

And the errors don’t look broken. They look correct. A clause subtly reworded, a regex’s anchor dropped, a row reordered. The document still reads fine. You ship it.

The practical setup: a delegation-safe workflow

Here’s the workflow I’ve moved to since reading the paper. It’s not glamorous. It works.

  1. One edit per chat. Start a new conversation for each meaningful change. The moment the model has “seen” the document twice in the same session, you’re on the compounding curve.
  2. Always ask for a diff, never a rewrite. Prompt: “Return only the lines you would change, in unified diff format. Do not output the full document.” This forces the model to localize its edit and gives you a tiny surface to review (see the sketch after this list).
  3. Keep distractor files out. The paper’s environments include 8-12k tokens of distractor context by default to improve simulation realism – roughly the size of three PDFs you might “attach for context.” Don’t. Attach only the file the edit touches.
  4. Round-trip check before you trust. For any edit on a document that matters, ask the model to undo its own change in a fresh chat. Diff the result against your original. Anything that doesn’t restore cleanly is the bug surface.
  5. Cap the session. Treat ~10 turns as a hard ceiling on a single document. The researchers extended the experiment to 100 interactions and found no plateau – “monotonic decline” is how the paper describes it. There is no point at which it gets safe again.
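
To make the diff step mechanical, here is a minimal sketch, assuming you save the model's unified diff verbatim as edit.patch and keep the document in a git working tree. The file name and wrapper function are illustrative; git apply does the real work, and its --check pass rejects any hunk whose context no longer matches your file.

# Minimal sketch for the diff-not-rewrite step. Assumes the model's reply
# was saved verbatim as edit.patch and the document lives in a git repo.
import subprocess

def apply_reviewed_patch(patch_path: str = "edit.patch") -> None:
    # Dry run first: git refuses the patch if it does not apply cleanly.
    subprocess.run(["git", "apply", "--check", patch_path], check=True)
    # Read the patch yourself before this line, then apply it for real.
    subprocess.run(["git", "apply", patch_path], check=True)

apply_reviewed_patch()

A rejected patch is a feature here: it means the model invented context, and you found out before the edit touched your file.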

A self-test you can run in 10 minutes

Pick something with structure – a Markdown file with headings, a YAML config, a contract template.

# Step 1 - in a fresh chat, paste your document, then ask:
"Rewrite this document with all sections in reverse order.
Output the full document only, no commentary."

# Step 2 - in ANOTHER fresh chat (new conversation), paste
# the reversed output and ask:
"Rewrite this document with all sections in their original
top-to-bottom order. Output the full document only."

# Step 3 - diff the result against your original:
diff original.md round_tripped.md

Links, footnotes, code fences, tables, numbered lists – any of those will drift. A trailing space here. A backtick swapped for a quote. A list re-numbered. None of it breaks the file. All of it is the same class of error the paper measured, just at one round-trip instead of ten.
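
If you want that drift as a number rather than a wall of diff output, a few lines of Python will count it. The file names are the ones from the steps above; this is plain difflib, nothing model-specific.

# Count the unrequested changes from the self-test above.
import difflib

with open("original.md") as f, open("round_tripped.md") as g:
    original, recovered = f.read().splitlines(), g.read().splitlines()

drift = [
    line for line in difflib.unified_diff(original, recovered, lineterm="")
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(f"{len(drift)} changed lines")
print("\n".join(drift))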

Two findings most coverage skips

Tool use made things worse, not better. A basic file-and-code agentic setup added an average of 6% additional degradation across the four tested systems. The model can now read files and run code – and that created new ways to drop state or carry a bad edit into the next step.

Bigger context: also worse. Larger documents, longer interactions, and distractor files all pushed degradation higher. The mental model of “more context = better answers” is specifically wrong for editing tasks. More tokens in the window means more surface area for the model to quietly overwrite something it shouldn’t have touched.

Honest limitations of the paper (and of this advice)

The benchmark is a stress test, not a user study. Real humans push back, re-prompt, notice things. The simulation chains edits automatically with no human review between turns – which is worst-case behavior, not typical behavior. The GitHub repo explicitly flags this: the dataset is research infrastructure, not a production recommendation system.

One other thing worth saying plainly: this is one paper, submitted to arXiv on April 17, 2026, with results that may shift as others replicate. The dataset is open – GitHub and Hugging Face – if you want to test your own domains directly.

The Hacker News reaction has been less “shocked” and more “finally, numbers.” Frequent LLM users already know that round-tripping long content through a model causes corruption. What’s useful is that there’s now a measurable name for the bad workflow – and a reason to stop using it.

FAQ

Does this mean I should stop using ChatGPT / Claude / Gemini for editing?

No. Stop using the same chat for the same document across many turns. Single-shot, scoped edits in a fresh session are still fine – the compounding failure the paper measured only shows up when edits accumulate in a single session.

Will GPT-5.4 or Claude 4.6 Opus fix this with the next release?

The benchmark already tested those frontier models and they still hit 25% corruption on average over 20 interactions (as of April 2026). The degradation is also monotonic – it doesn’t plateau at some “safe” depth, it just keeps going. A model that benchmarks well on single-turn edits tells you nothing about what happens at turn 15 of a real session. Until vendors start publishing long-horizon scores, assume every current model degrades at depth.

What if I really do need 20+ edits on the same document?

Break the document into smaller files. Edit each in isolation. Re-assemble manually. Keep an original frozen outside the LLM’s reach so you have ground truth to diff against – because by turn 20, you won’t remember what the model quietly changed at turn 7.
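
For a Markdown document, a minimal split-and-reassemble sketch might look like this. It assumes second-level "## " headings mark the section boundaries you care about; the file names are placeholders, and a heading-like line inside a code fence would fool it.

# Split a Markdown document into per-section files, edit each in isolation,
# then stitch them back together. The frozen original stays untouched.
from pathlib import Path

def split_sections(doc_path: str, out_dir: str = "sections") -> list[Path]:
    Path(out_dir).mkdir(exist_ok=True)
    sections, current = [], []
    for line in Path(doc_path).read_text().splitlines(keepends=True):
        if line.startswith("## ") and current:
            sections.append(current)
            current = []
        current.append(line)
    sections.append(current)
    paths = []
    for i, chunk in enumerate(sections):
        path = Path(out_dir) / f"section_{i:02d}.md"
        path.write_text("".join(chunk))
        paths.append(path)
    return paths

def reassemble(paths: list[Path], out_path: str = "reassembled.md") -> None:
    Path(out_path).write_text("".join(p.read_text() for p in paths))

paths = split_sections("original.md")
# ...run your one-edit-per-chat workflow on the individual section files...
reassemble(paths, "edited.md")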

Try this right now

Open the last long ChatGPT or Claude conversation where you edited a document across many turns. Find the original. Diff it against what you ended up with. Count what changed that you didn’t ask for. That number is your personal DELEGATE-52 score – and it’s the only one that decides whether to change how you work tomorrow.