Popular AI models fail to accurately transcribe Indic languages, mishearing one in three words or dropping English words altogether in mixed speech, according to a study by physical and voice AI data infrastructure company Humyn Labs.
Founded by gaming veteran Manish Agarwal, the startup aims to create a Benchmark of Regional & International Data for Global Evaluation (BRIDGE) to assess commercial AI speech-recognition tools. The study looked at tools such as ElevenLabs Scribe v2, Deepgram Nova-3, Gemini 2.5 Flash, OpenAI GPT-4o, and the Indian providers Sarvam saaras v3 and Gnani vachana v3 on real Indian language data.
The study showed that even the most widely deployed tools have a fundamental problem of mishearing words in Indian language audio. Worse still, when speakers naturally mix Hindi or another Indic language with English mid-sentence, most AI tools either drop the English words or convert them into transliterated script, breaking the meaning for anyone reading the transcript.
“The models are grading their own work. ASR providers published their own accuracy scores using benchmarks built on English-first, internet-trained datasets, with little independent validation. Meanwhile, enterprises are making million-dollar deployment decisions on numbers that rarely reflect how their users in the Global South actually speak,” said Manish Agarwal, Co-founder, Humyn Labs, adding that theirs is the first independent benchmark for real-world conversational audio across non-English markets.
The scores reveal that Deepgram Nova-3 leads on the semantic gap at 0.906. Amazon Transcribe scores 0.199. OpenAI’s models fall below 0.4. Most enterprises using these tools were unaware of the errors because the standard industry measure, Word Error Rate (WER), was never designed to catch the failures that define real Indian speech.
Comparing global models against Indian providers, the study showed that Sarvam AI’s saaras v3 ranks third overall on WER at 20.2 per cent, ahead of Google Gemini, Microsoft Azure, and AWS Transcribe, a strong result for a model built specifically for Indian languages. However, on mixed speech, Sarvam scores 0.588, placing it in the partial-reliability class where performance varies by language and English density. This means the gap between headline accuracy and code-switch reliability applies to domestic and international providers alike.
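WER, the standard measure cited above, is simply word-level edit distance divided by reference length. A minimal sketch (not Humyn's or any vendor's implementation) shows why it understates code-switching failures: a transliterated English word counts as just one substitution, the same weight as any minor slip, even though it can break the sentence's meaning:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over whole words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical code-switched sentence: "meeting" gets transliterated
# into Devanagari. WER sees one substitution in five words (0.2),
# though the English word is lost for anyone reading the transcript.
print(wer("kal meeting cancel ho gayi", "kal मीटिंग cancel ho gayi"))
```

The example sentence is illustrative, not taken from the BRIDGE dataset.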
Humyn applies a seven-metric stack to test whether AI models preserve the meaning of what was said, whether English words embedded in Indian language speech are correctly tracked, how Indic phonology is transcribed, as well as Word Information Lost in cases of under- or over-transcription.
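The code-switch tracking idea above can be illustrated with a hypothetical retention metric; the function name and logic here are my own assumptions for illustration, not Humyn's published methodology. It measures what fraction of the English (Latin-script) words in the reference survive in Latin script in the transcript, so a transliterated word counts as lost:

```python
import re

def english_retention(reference: str, hypothesis: str) -> float:
    """Fraction of Latin-script words in the reference that appear
    in Latin script in the hypothesis transcript.

    Hypothetical sketch of a code-switch metric: ASCII letters stand
    in for 'English word' detection, which is crude but sufficient
    to show transliteration being penalised.
    """
    latin = re.compile(r"^[A-Za-z]+$")
    eng_ref = [w for w in reference.split() if latin.match(w)]
    if not eng_ref:
        return 1.0  # no English words to retain
    hyp_latin = {w for w in hypothesis.split() if latin.match(w)}
    kept = sum(1 for w in eng_ref if w in hyp_latin)
    return kept / len(eng_ref)

# "meeting" was transliterated to Devanagari, "cancel" survived:
print(english_retention("कल meeting cancel हो गयी",
                        "कल मीटिंग cancel हो गयी"))
```

A WER-style score would barely move on this example, while a retention-style score drops to 0.5, which is the kind of divergence the study's separate mixed-speech scores appear to capture.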
“The models aren’t the only problem; the metrics are. You cannot evaluate non-English speech with a scoring system designed for English phonology and call it rigorous. The performance leaderboard for Hindi is not the leaderboard for Tamil, Bengali and Marathi. A single aggregate benchmark score cannot support cross-regional deployment decisions,” said Ishank Gupta, Co-founder, Humyn Labs.
The study highlights how a model that leads on Spanish may not lead on Vietnamese. Similarly, the model that leads on code-switching does not lead on word accuracy, underscoring the need for enterprises to evaluate against the language, dialect, and speech pattern that match their actual users.
Published on May 11, 2026