- Microsoft researchers find that current LLMs aren't good at long-running tasks
- More interactions and less structure significantly reduce benchmark performance
- "Python is the only domain where most models are ready"
New research from three Microsoft employees has identified a fundamental problem that could be blocking effective agentic AI: most AI models cannot reliably handle long-running workflows.
To quantify their findings, the researchers introduced a new DELEGATE-52 benchmark that provides metrics across 52 sectors, including coding, accounting, science and more.
Ultimately, the paper concluded that current LLMs "introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."
AI isn't that good at long-running tasks, yet
The research covers some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even they "corrupt an average of 25% of document content by the end of long workflows," with lesser models even more likely to get things wrong.
The DELEGATE-52 benchmark uses real documents of around 15K tokens in length and introduces 5-10 complex editing tasks with a "round-trip relay simulation" that asks the AI to perform a change and then reverse it. This lets the researchers measure how effectively each model reconstructs documents back to their original forms.
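The core of a round-trip relay evaluation can be sketched in a few lines: apply a transform to a document, reverse it, and score how much of the original survives. This is a minimal illustration of the idea, not the benchmark's actual harness; the function names, the use of `difflib` similarity as the corruption metric, and the toy stand-in edits are all assumptions.

```python
import difflib


def round_trip_score(original: str, reconstructed: str) -> float:
    """Fraction of the document preserved after a transform-then-reverse
    round trip (1.0 = fully intact, lower = more corruption).
    Uses difflib's sequence similarity as a simple proxy metric."""
    return difflib.SequenceMatcher(None, original, reconstructed).ratio()


def relay(document: str, apply_edit, reverse_edit, rounds: int = 5) -> float:
    """Run several transform/reverse cycles and score the final result
    against the original. apply_edit / reverse_edit stand in for
    model calls in a real evaluation."""
    current = document
    for _ in range(rounds):
        current = reverse_edit(apply_edit(current))
    return round_trip_score(document, current)


# Toy stand-ins for a model: uppercase the document, then lowercase it
# back, with a deliberate "silent corruption" (case and trailing
# whitespace are not restored exactly).
doc = "Quarterly revenue rose 12% on strong cloud demand.  "
score = relay(doc, str.upper, lambda s: s.lower().rstrip())
print(f"round-trip score: {score:.3f}")
```

Because the errors here are deterministic, the score stabilizes after one cycle; with a real model, each additional interaction can introduce fresh errors, which is how small per-step corruption compounds over long workflows.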
Highly structured and programmatic areas were where the models performed best, with the Microsoft researchers concluding that "Python is the only domain where most models are ready." Conversely, natural-language workflows, creative areas and semi-structured documents saw models struggle.
The paper also finds that the longer the token length, the more likely an AI model is to struggle.
Where frontier models differed was not in their ability to eliminate errors, only in their ability to delay them. Other models tested by Microsoft's researchers included several GPT-5 and GPT-4 generations, Claude offerings, Gemini models and one each from Mistral, xAI and Moonshot, totaling 19 different models from six families.
Gemini 3.1 Pro took first place with a DELEGATE-52 benchmark score of 80.9% after 20 interactions; Claude 4.6 Opus (73.1%) and GPT-5.4 (71.5%) round out the top three, while GPT-5 Nano (10.0%) falls into last place.
In short, the paper concludes that today's AI models are not reliable enough to be trusted with long-running, autonomous workflows, highlighting key areas where model developers must focus in future and offering up yet another benchmark for determining model capability.
Via The Register


