- Microsoft researchers find that current LLMs aren't good at long-running tasks
- More interactions and less structure significantly reduce benchmark performance
- "Python is the only domain where most models are ready"
New research from three Microsoft employees has identified a fundamental problem that could be blocking effective agentic AI: most AI models cannot reliably handle long-running workflows.
To quantify their findings, the researchers introduced a new DELEGATE-52 benchmark that provides metrics across 52 sectors, including coding, accounting, science and more.
Ultimately, the paper concluded that current LLMs "introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."
AI isn't that good at long-running tasks, yet
The research covers some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even they "corrupt an average of 25% of document content by the end of long workflows," with lesser models even more likely to get things wrong.
The DELEGATE-52 benchmark uses real documents of around 15K tokens in length and introduces 5-10 complex editing tasks with a "round-trip relay simulation" that asks the AI to perform a change and then reverse it. This lets the researchers measure how effectively each model reconstructs documents back to their original forms.
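The core of a round-trip relay evaluation can be sketched in a few lines: apply a transform to a document, reverse it, and score how much of the original survives. This is a minimal illustration of the idea, not the benchmark's actual harness; the function names, the use of `difflib` similarity as the corruption metric, and the toy stand-in edits are all assumptions.

```python
import difflib


def round_trip_score(original: str, reconstructed: str) -> float:
    """Fraction of the document preserved after a transform-then-reverse
    round trip (1.0 = fully intact, lower = more corruption).
    Uses difflib's sequence similarity as a simple proxy metric."""
    return difflib.SequenceMatcher(None, original, reconstructed).ratio()


def relay(document: str, apply_edit, reverse_edit, rounds: int = 5) -> float:
    """Run several transform/reverse cycles and score the final result
    against the original. apply_edit / reverse_edit stand in for
    model calls in a real evaluation."""
    current = document
    for _ in range(rounds):
        current = reverse_edit(apply_edit(current))
    return round_trip_score(document, current)


# Toy stand-ins for a model: uppercase the document, then lowercase it
# back, with a deliberate "silent corruption" (case and trailing
# whitespace are not restored exactly).
doc = "Quarterly revenue rose 12% on strong cloud demand.  "
score = relay(doc, str.upper, lambda s: s.lower().rstrip())
print(f"round-trip score: {score:.3f}")
```

Because the errors here are deterministic, the score stabilizes after one cycle; with a real model, each additional interaction can introduce fresh errors, which is how small per-step corruption compounds over long workflows.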
Highly structured and programmatic areas were where the models performed best, with the Microsoft researchers concluding that "Python is the only domain where most models are ready." Conversely, natural-language workflows, creative areas and semi-structured documents saw models struggle.
The paper also finds that the longer the token length, the more likely an AI model is to struggle.
Where frontier models differed was not in their ability to eliminate errors, only in their ability to delay them. Other models tested by Microsoft's researchers included several GPT-5 and GPT-4 generations, Claude offerings, Gemini models and one each from Mistral, xAI and Moonshot, totaling 19 different models from six families.
Gemini 3.1 Pro took first place with a DELEGATE-52 benchmark score of 80.9% after 20 interactions; Claude 4.6 Opus (73.1%) and GPT-5.4 (71.5%) round out the top three, while GPT-5 Nano (10.0%) falls into last place.
In short, the paper concludes that today's AI models are not reliable enough to be trusted with long-running, autonomous workflows, highlighting key areas where model developers must focus in future and offering up yet another benchmark for determining model capability.
Via The Register


