AgentPerf from Artificial Analysis, the industry's first agentic AI benchmark, gives developers, enterprises and infrastructure providers a transparent way to evaluate systems for agentic AI. In the first round of published results, the NVIDIA Blackwell Ultra NVL72 platform delivers leading performance across the agentic AI workloads tested, running 20x more agents per megawatt than NVIDIA Hopper.
Agentic AI is a fundamentally different workload than conversational AI. A single chat completion is a sprint: one large language model (LLM) call, one response. An agent functions more like a relay: It breaks a goal into many steps and keeps going until the task is done.

That results in dozens to hundreds of LLM calls chained together, each passing growing context to the next, with tool calls like code compilation and execution, database search and web browsing at every handoff. The complexity isn't additive; it's multiplicative.
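To make the growing-context point concrete, here is a minimal Python sketch. It assumes each step re-sends the full accumulated context and that no prefix or KV-cache reuse occurs; the function name `total_input_tokens` and all numbers are hypothetical and are not taken from AgentPerf.

```python
def total_input_tokens(steps: int, base_context: int, added_per_step: int) -> int:
    """Sum the input tokens presented to the model over one agent run.

    Assumption: step i sees the base context plus everything appended by
    the previous i steps, and no prefix/KV-cache reuse is modeled.
    """
    return sum(base_context + i * added_per_step for i in range(steps))


# Hypothetical numbers, chosen only for illustration.
single_chat = total_input_tokens(steps=1, base_context=2_000, added_per_step=0)
agent_run = total_input_tokens(steps=50, base_context=2_000, added_per_step=1_000)

print(single_chat)              # 2000
print(agent_run)                # 1325000
print(agent_run / single_chat)  # 662.5
```

Under these assumed numbers, a 50-step run presents several hundred times more input tokens than one chat call, even though it makes only 50 times as many calls. Real serving stacks reduce this cost with caching, so the sketch shows the shape of the growth rather than a measured figure.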
The distinction matters enormously for performance measurement. Existing AI inference benchmarks measure one LLM call: how fast an LLM responds to a single request and how many simultaneous requests a system can handle. They weren't designed for agentic workloads, where chained LLM calls, tool call delays and growing context stress accelerated computing systems in fundamentally different ways than a single LLM call ever could.
For companies building and deploying agents at scale, it's critical to know how responsive agents are, how many can be deployed concurrently and how much useful work AI infrastructure can deliver for every dollar and watt invested.
NVIDIA GB300 NVL72 Runs 20x More Agents per Megawatt
In this first round, AgentPerf measures agentic performance with DeepSeek V4 Pro, a large mixture-of-experts (MoE) model that represents the class of frontier models powering today's most capable agents. On this workload, NVIDIA GB300 NVL72 delivers the highest performance in the benchmark, running up to 20x more agents per megawatt than the NVIDIA HGX H200 system.

The performance advantage comes from extreme codesign across the full stack. GB300 NVL72 connects 72 GPUs into a single rack-scale system, enabling large MoE models like DeepSeek V4 Pro to distribute model execution efficiently at scale.
CUDA kernels accelerate this further by overlapping communication and compute, so the cost of coordinating across experts is absorbed rather than added to latency.
NVIDIA TensorRT LLM sustains efficiency as concurrent agent sessions scale. For example, it separates the processing of inputs (prefill) from the generation of outputs (decode) so each can be optimized independently.
These results are grounded in a benchmark methodology built from the ground up to reflect how agentic AI actually works in production.
Artificial Analysis AgentPerf: Built on Real-World Agentic Workloads
AgentPerf is built on real coding agent trajectories: an agent receives a task, reads files, writes and edits code, executes commands and iterates based on the results, all drawn from real public code repositories across 12+ programming languages. The long sequence lengths, tool call patterns and delays are all representative of real-world coding workflows.
AgentPerf then measures how many of these agentic tasks a platform can support concurrently while meeting defined performance thresholds for responsiveness and output token rate. Tool calls aren't executed but simulated using representative CPU processing time, so differences in results reflect accelerated computing performance only.
The results translate directly into infrastructure decisions: how many concurrent agentic tasks can be run per accelerator and per megawatt of power. For enterprises deploying AI agents at scale, these numbers determine how much productive work a given infrastructure investment can actually deliver.
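The arithmetic behind an "agents per megawatt" figure can be sketched in Python as follows. This is an illustration under stated assumptions, not AgentPerf's actual scoring code: the `Measurement` fields, the threshold values, the concurrency sweep and the 100 kW system power are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    concurrent_agents: int      # agentic tasks running at once
    response_time_s: float      # responsiveness metric (hypothetical definition)
    output_tokens_per_s: float  # per-agent output token rate


def max_agents_within_thresholds(measurements, max_response_time_s,
                                 min_output_tokens_per_s):
    """Return the highest tested concurrency that meets both thresholds."""
    passing = [
        m.concurrent_agents
        for m in measurements
        if m.response_time_s <= max_response_time_s
        and m.output_tokens_per_s >= min_output_tokens_per_s
    ]
    return max(passing, default=0)


def agents_per_megawatt(agents, system_power_kw):
    """Normalize a concurrency figure by system power (1 MW = 1000 kW)."""
    return agents * 1000.0 / system_power_kw


# Hypothetical concurrency sweep for one system; not benchmark data.
sweep = [
    Measurement(64, 1.2, 45.0),
    Measurement(128, 1.8, 38.0),
    Measurement(256, 2.9, 31.0),
    Measurement(512, 5.4, 18.0),  # fails both assumed thresholds
]

best = max_agents_within_thresholds(sweep, max_response_time_s=3.0,
                                    min_output_tokens_per_s=30.0)
print(best)                              # 256
print(agents_per_megawatt(best, 100.0))  # 2560.0
```

Comparing two platforms then amounts to dividing one normalized figure by the other, which is how a ratio such as "20x more agents per megawatt" would be expressed.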
NVIDIA Ecosystem Partners Harness Blackwell's Leading Performance
Leading inference providers including Baseten, DeepInfra and Together AI are already serving agentic workloads on frontier models such as DeepSeek V4 Pro on NVIDIA Blackwell and powering production agentic applications today.
Together AI powers real-time inference for Cursor, an AI-powered agentic coding platform, on NVIDIA Blackwell. Cursor's agents debug issues, generate features and execute refactors while developers continue working.
DeepInfra powers Pam.ai, an AI workforce platform for automotive dealerships, which deploys agents to book service appointments, handle calls and run outbound sales campaigns, entirely on NVIDIA Blackwell.
As NVIDIA and the open source ecosystem continue to optimize inference software, performance and efficiency on agentic workloads will only improve. The NVIDIA Vera Rubin architecture is now in full production, bringing the next generation of infrastructure capacity to meet the growing demands of agentic AI at scale.
Dive deeper into AgentPerf's methodology and NVIDIA's full-stack optimizations for agentic AI in this technical blog.