Study Shows LLMs Corrupt 25% of Document Content in Delegated Tasks
Original: LLMs Corrupt Your Documents When You Delegate
Why This Matters
Highlights critical reliability issues that stand in the way of broader AI adoption in professional workflows.
Research reveals that large language models, including GPT-5.4, Claude 4.6 Opus, and Gemini 3.1 Pro, corrupt an average of 25% of document content during long delegated workflows across 52 professional domains, introducing sparse but severe errors.
Researchers introduced DELEGATE-52, a benchmark that tests AI systems' reliability in delegated workflows across 52 professional domains, including coding, crystallography, and music notation. The study evaluated 19 LLMs on long document-editing tasks that require sustained accuracy. Even frontier models such as GPT-5.4, Claude 4.6 Opus, and Gemini 3.1 Pro corrupt an average of 25% of document content, and other models perform worse. The researchers also found that agentic tool use does not improve performance, and that degradation worsens with larger documents, longer interactions, or distractor files. The errors are described as 'sparse but severe' and compound over extended interactions, leaving current LLMs unreliable for delegated knowledge work.
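To make the "25% of document content" figure concrete, here is a minimal sketch of one way such a corruption rate could be scored: diff the document a model returns against the expected result and report the fraction of lines that were altered or dropped. This is an illustrative assumption, not the benchmark's actual scoring code; the corruption_rate function and the example data below are hypothetical.

```python
# Hypothetical line-level corruption metric (not the DELEGATE-52 scoring code):
# compare the document a model returns against the expected result and report
# the fraction of expected lines that were altered or dropped.
import difflib


def corruption_rate(expected: str, returned: str) -> float:
    """Fraction of expected lines missing or changed in the returned document."""
    expected_lines = expected.splitlines()
    returned_lines = returned.splitlines()
    matcher = difflib.SequenceMatcher(a=expected_lines, b=returned_lines)
    # Lines covered by matching blocks were preserved; everything else counts as corrupted.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / max(len(expected_lines), 1)


if __name__ == "__main__":
    expected = "alpha\nbeta\ngamma\ndelta\n"
    returned = "alpha\nbeta\ngamma (reworded by the model)\n"  # one line changed, one dropped
    print(f"corruption rate: {corruption_rate(expected, returned):.0%}")  # 50%
```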