Study Shows LLMs Corrupt 25% of Document Content in Delegated Tasks
Original: LLMs Corrupt Your Documents When You Delegate
Why This Matters
Highlights critical reliability issues that stand in the way of broader AI adoption in professional workflows.
Research reveals that large language models, including GPT-5.4, Claude 4.6 Opus, and Gemini 3.1 Pro, corrupt an average of 25% of document content during long delegated workflows across 52 professional domains, introducing sparse but severe errors.
Researchers introduced DELEGATE-52, a benchmark that tests AI systems' reliability in delegated workflows across 52 professional domains, including coding, crystallography, and music notation. The study evaluated 19 LLMs on long document-editing tasks that require sustained accuracy. Even frontier models such as GPT-5.4, Claude 4.6 Opus, and Gemini 3.1 Pro corrupt an average of 25% of document content, and other models perform worse. The researchers also found that agentic tool use does not improve performance, and that degradation worsens with larger documents, longer interactions, or distractor files. The errors are described as 'sparse but severe' and compound over extended interactions, leaving current LLMs unreliable for delegated knowledge work.
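To make the "25% of document content" figure concrete, here is a minimal sketch of one way such a corruption rate could be scored: diff the document a model returns against the expected result and report the fraction of lines that were altered or dropped. This is an illustrative assumption, not the benchmark's actual scoring code; the corruption_rate function and the example data below are hypothetical.

```python
# Hypothetical line-level corruption metric (not the DELEGATE-52 scoring code):
# compare the document a model returns against the expected result and report
# the fraction of expected lines that were altered or dropped.
import difflib


def corruption_rate(expected: str, returned: str) -> float:
    """Fraction of expected lines missing or changed in the returned document."""
    expected_lines = expected.splitlines()
    returned_lines = returned.splitlines()
    matcher = difflib.SequenceMatcher(a=expected_lines, b=returned_lines)
    # Lines covered by matching blocks were preserved; everything else counts as corrupted.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / max(len(expected_lines), 1)


if __name__ == "__main__":
    expected = "alpha\nbeta\ngamma\ndelta\n"
    returned = "alpha\nbeta\ngamma (reworded by the model)\n"  # one line changed, one dropped
    print(f"corruption rate: {corruption_rate(expected, returned):.0%}")  # 50%
```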