June 5, 2026
15
mins read

Why Speech Recognition Fails on Hinglish: The Code-Switching Problem No One Explains

Chris Wilson
Content Creator
Be Updated
Get weekly update from Gnani
Thank You! Your submission has been received.
Oops! Something went wrong while submitting the form.

A collections agent in Lucknow is on a call with a customer who has missed three EMI payments. The conversation lasts four minutes. In that time, the agent switches between Hindi and English at least thirty times, sometimes mid-sentence. "Sir, aapka outstanding balance abhi bhi pending hai, so please confirm karo ki payment kab karoge." The customer responds in a mix of Bhojpuri-inflected Hindi and English, dropping account numbers, dates, and rupee amounts across both languages.

The STT system tasked with transcribing this call was built by one of the largest cloud providers in the world. It has a published accuracy rate that looks impressive on the vendor's benchmark page. By the time it finishes with this call, it has produced a transcript with a word error rate above 20%. Critical words are missing. Amounts are wrong. The consent confirmation the agent gave is transcribed as something that sounds legally ambiguous. The call cannot be used for compliance review.

This is not an edge case. It is the default audio profile of a contact center call in North India. And it exposes a problem in speech recognition that most vendors acknowledge in footnotes but never actually explain: the code-switching problem.

What Code-Switching Actually Is

Code-switching is the practice of alternating between two or more languages within a single conversation. The term comes from linguistics, where it has been studied for decades as a natural feature of bilingual and multilingual speech, not a deficiency or a mistake.

In India, code-switching is not an occasional aberration. It is how educated, urban, and semi-urban Indians communicate in professional and semi-formal contexts. The blending of Hindi and English, Tamil and English, Kannada and English, and dozens of other regional language pairings reflects a specific kind of fluency, the fluency of someone who has internalised two linguistic systems well enough to draw on both simultaneously.

For speech recognition, this creates a problem that is fundamentally different from the challenge of supporting multiple languages. Supporting Hindi and English as separate languages is a solved problem. Every major STT vendor offers both. The hard problem is handling a speaker who uses both languages at once, in the same sentence, without any signal that a language switch has occurred.

There are two types of code-switching, and the distinction matters enormously for how STT systems fail.

Inter-sentential code-switching happens between sentences. A speaker finishes a complete thought in Hindi, then starts the next sentence in English. "Aapka account verify ho gaya. Your application has been processed." This is relatively manageable for STT systems because the language boundary coincides with a natural acoustic break. The model can reset its language assumption at the sentence boundary.

Intra-sentential code-switching happens within a sentence. The speaker inserts words, phrases, or entire clauses from one language into the grammatical structure of another. "Aapka loan amount disburse ho gaya hai, so please check karo apna account." The syntax is Hindi. The vocabulary is mixed. There is no sentence boundary where the model can switch modes. This is where virtually every standard STT system breaks.

For a plain-English definition of code-switching and related linguistic terms, see our STT Glossary: Every Term You'll Encounter When Evaluating a Speech-to-Text API.

Why Intra-Sentential Code-Switching Breaks Standard ASR Models

To understand why code-switching is technically hard, you need to understand how a standard speech recognition model processes audio.

At the core of every STT system is an acoustic model, which maps raw audio signals to phonemes or sub-word units, and a language model, which uses statistical patterns to figure out which word sequence is most likely given those phonemes. Both components are trained on a specific language or set of languages, and their training data determines what they handle well.

When a speaker switches languages mid-sentence, three things happen simultaneously that the model was not designed to handle.

The phoneme inventory changes. Hindi uses phonemes that do not exist in English, including retroflex consonants and aspirated stops. English uses phonemes that do not appear in standard Hindi phonology. When a speaker moves from one language to the other within a sentence, the acoustic model must switch which phoneme inventory it is drawing from, often within a fraction of a second, without any explicit signal that this switch has occurred. Most models either default to one phoneme inventory for the entire utterance or make probabilistic guesses that produce substitution errors at language boundaries.

The vocabulary distribution changes. Language models assign probabilities to word sequences based on what words tend to appear together. A language model trained on Hindi text expects Hindi words to follow other Hindi words. When an English word appears mid-sentence in a Hindi grammatical structure, the language model has no statistical basis for assigning it a high probability. It will often substitute the closest-sounding Hindi word, or in some architectures, flag it as unrecognisable and produce a deletion error.

The language identification assumption breaks. Most STT APIs determine the language of an utterance at the start of processing and maintain that assumption for the entire segment. Utterance-level language identification was designed for monolingual speech. On intra-sentential code-switched speech, it means the model commits to one language for the entire sentence and interprets every phoneme through that lens, including the phonemes from the other language.

The Scale of the Problem in Indian Enterprise Deployments

This is not a fringe use case in India. It is the dominant communication style in the most commercially important voice AI deployments.

Consider the language profile of a BFSI contact center with operations across Hindi-belt states. Collections calls from UP, Bihar, Rajasthan, and MP involve agents who switch between Hindi and English for financial terminology, regulatory language, and system-generated prompts. The customers they are calling respond in regional Hindi dialects that themselves contain layers of English borrowing for modern vocabulary.

In insurance, claim settlement calls often move between the regional language of the customer, the formal Hindi of the agent's script, and the English terminology of the policy document being referenced. All three layers can appear in a single sentence.

In telecom, outbound IVR calls to customers in Hindi-speaking regions frequently involve a customer who responds in a mix of Bhojpuri or Awadhi-inflected Hindi and English. If the voice bot's STT layer cannot handle that response, the conversation fails before it has begun.

The practical consequence is that enterprises which deploy globally trained STT systems, or even India-trained systems that were not specifically built for code-mixed audio, face a gap between their accuracy in vendor demos and their accuracy in production that can be 15 to 25 percentage points wide. That gap shows up in compliance failures, incorrect entity extraction, and voice bot conversations that break mid-flow.

How Code-Switching Interacts With Noise in Indian Contact Centers

Code-switching does not happen in isolation. In Indian contact center environments, it occurs alongside the audio quality challenges that already degrade STT accuracy: 8kHz telephony audio, GSM codec compression, background office noise, and the acoustic profile of speakers on basic Android phones in tier-2 and tier-3 cities.

Noise and code-switching compound each other in a specific way. Noise degrades the acoustic signal, which makes the acoustic model's phoneme identification less reliable. When phoneme identification is already uncertain, the model leans more heavily on the language model for disambiguation. But the language model was designed for monolingual sequences, so it pushes the transcription toward the dominant language, overriding the acoustic signal even when that signal was pointing correctly to a word in the second language.

The result is that on noisy, code-switched telephony audio, the STT system is systematically biased toward the dominant language it was trained on, deleting or substituting words from the secondary language at a higher rate than chance would predict. In practical terms, this means that English technical terms embedded in Hindi sentences get dropped or mangled more than Hindi content, because the language model's prior is Hindi-dominant.

For a full analysis of how background noise and low-bandwidth audio affect STT accuracy in Indian deployments, see our guide on Speech Recognition in Noisy and Rural India: Why Your STT Fails Where It Matters Most.

What the WER Numbers Actually Look Like on Code-Mixed Audio

Most vendors do not publish WER on code-mixed audio. They publish WER on clean, monolingual audio in each supported language separately. These numbers look good precisely because they avoid the hardest test condition.

When you run a benchmark specifically on Hindi-English code-mixed telephony audio in real contact center conditions, the picture changes substantially.

Global models from major cloud providers, which were trained predominantly on English and European language data with Hindi added as a supported language, typically show WER between 14 and 16% on Hindi-English code-mixed telephony audio. These are systems that publish impressive numbers on clean English benchmarks, but the performance degrades significantly when the audio contains mid-sentence language switches on 8kHz telephony.

Homegrown Indic models, which were built specifically for Indian languages and have more representative training data for Hindi and other Indic languages, perform better but still face structural limitations on intra-sentential switching. On the same audio profile, WER for leading homegrown Indic models falls between 11 and 14%.

Gnani STT API achieves 9% WER on Hindi-English code-mixed audio in real telephony and noisy environment conditions. The difference comes from training data composition. Gnani's models are trained on production telephony audio from Indian contact centers at scale, including audio that reflects the actual code-switching patterns of agents and customers in BFSI, insurance, and telecom deployments across Hindi-belt states. The model was not adapted from a monolingual architecture; it was built from the ground up on the audio profile it is being asked to handle.

For a direct comparison of WER performance across Indian languages and audio conditions, see Gnani STT API vs Google Cloud Speech vs Azure: Accuracy on Indian Languages Compared.

The Architecture Difference: How Good Code-Switching Support Is Built

Understanding why some systems handle code-switching better than others requires a brief look at the architectural choices that matter.

The critical variable is whether language identification happens at the utterance level or the frame level.

Utterance-level language identification makes a single language decision for an entire audio segment before transcription begins. It is computationally cheaper and works well on monolingual speech. On intra-sentential code-switched speech, it is wrong by design: it commits to a language before the switch has occurred and processes subsequent frames through the wrong phoneme and vocabulary priors.

Frame-level language identification, or joint multilingual modeling, processes audio in very short windows and can update its language assumption continuously as the audio unfolds. This is architecturally closer to how a bilingual human listener processes code-switched speech. The trade-off is computational cost and the need for sufficient training data of both languages in mixed configurations, not just as separate monolingual corpora.

The training data requirement is the harder problem. A model trained on a large Hindi corpus and a large English corpus separately will not automatically perform well on code-mixed speech, because the mixing patterns, the specific words that tend to be borrowed, the grammatical structures in which mixing occurs, the phonological adaptations that happen when English words are pronounced within a Hindi sentence, are not represented in either monolingual corpus. Effective code-switching support requires training data that reflects actual mixed-language speech from the target deployment environment.

This is why contact center-specific training data is not just a nice-to-have for Indian enterprise STT. It is the core technical requirement.

Beyond Hinglish: Other Code-Switching Pairs That Matter in India

Hinglish is the most commercially prominent code-switching pair in India, but it is not the only one that matters for enterprise deployments.

Tanglish, the mixing of Tamil and English, is the dominant communication style in contact center operations in Tamil Nadu and among Tamil-speaking communities across India. Tamil's agglutinative morphology creates specific challenges: English words are often inflected with Tamil suffixes, creating hybrid forms that appear in neither an English vocabulary nor a standard Tamil vocabulary.

Kanglish (Kannada-English) is the equivalent in Karnataka, particularly in Bengaluru-based contact centers serving the technology and financial services sector. Telglish (Telugu-English) is prevalent in Andhra Pradesh and Telangana. In each case, the specific mixing patterns, which words get borrowed, how borrowed words are phonologically adapted, and which grammatical structures the mixing follows, are different. A model that handles Hindi-English code-switching well will not automatically handle Tamil-English code-switching correctly.

Malayalam, Bengali, Marathi, Gujarati, Punjabi, Odia, and Assamese all show similar code-switching patterns with English in professional contexts, each with its own phonological and lexical characteristics. An enterprise deploying voice AI across multiple Indian language regions needs STT infrastructure that handles each of these pairs, not just Hindi-English.

Gnani STT API supports all 12 major Indian languages: Hindi, Tamil, Telugu, Malayalam, Kannada, Odia, Marathi, Punjabi, Gujarati, Bengali, Assamese, and English, with training data drawn from production deployments across all of these language contexts.

How to Test an STT API for Code-Switching Before You Deploy

Most STT vendors will not proactively offer to benchmark their system on code-mixed audio. The test set you use in a vendor evaluation will determine whether you see the code-switching problem before or after you go live.

Here is how to structure a code-switching evaluation.

Build a test set that reflects your actual audio. Pull 100 to 200 calls from your production environment that involve agents and customers who code-switch. If you are evaluating for a BFSI deployment in UP or Bihar, your test set should contain calls from those regions with the actual language mixing patterns those agents and customers use. Do not use studio recordings or scripted demos.

Segment your test set by switching type. Separate calls that involve predominantly inter-sentential switching from those with frequent intra-sentential switching. WER on inter-sentential code-switched audio will be significantly lower than on intra-sentential. If a vendor shows you only inter-sentential test results, they are not showing you the hard problem.

Test the specific vocabulary that matters for your use case. In collections, that means amounts in words and numerals, account identifiers, date formats, and consent language. In insurance, it means policy terms, claim amounts, and product names. Code-switching errors often cluster on the technical vocabulary that matters most, precisely because that vocabulary is the most likely to be borrowed from English into a Hindi-structure sentence.

Test under realistic audio conditions. Run your code-switching evaluation on telephony audio at 8kHz, not on clean recordings. The interaction between noise and code-switching is where the real performance gap emerges. A system that shows acceptable WER on clean code-switched audio may degrade sharply when noise is added.

Ask for error type breakdown. A WER of 12% composed mostly of substitutions on English vocabulary within Hindi sentences tells you something specific: the model is hearing the English words but mapping them to Hindi vocabulary. A WER of 12% composed mostly of deletions tells you the model is missing the English segments entirely. Both require different fixes.

For a complete benchmarking methodology including jiwer-based WER calculation across Indian languages, see How to Benchmark a Speech-to-Text API on Indian Languages Before You Sign Anything.

The Business Consequence of Getting This Wrong

The code-switching problem is not an abstract accuracy metric. In enterprise deployments, it has specific, traceable business consequences.

In collections, a transcript that drops or mangles the amount field in a customer promise-to-pay creates a disputed record. The agent said "teen hazaar rupay ka payment Friday tak," the transcript says "teen rupay ka payment," and the CRM is populated with an amount three orders of magnitude lower than what was agreed. That is not a transcription error. It is a data integrity failure.

In compliance, an agent's disclosure statement that code-switches between Hindi and English, as most agents' scripts do in practice, may be partially or fully misTranscribed. If the regulatory disclosure was not captured accurately, the call cannot be used as a compliance record. The enterprise either needs a human reviewer on every call or accepts compliance exposure.

In voice bots, a customer response that contains English words in a Hindi-structure sentence will fail intent detection if the STT layer has dropped or substituted the English words. The bot does not understand the response, falls back to a clarification prompt, and the customer abandons the call. In outbound collection flows where completion rate is a revenue metric, this failure is directly quantifiable.

For a deeper look at the accuracy and compliance requirements specific to BFSI voice AI deployments, see Speech Recognition for BFSI: What Indian Banks and NBFCs Must Verify Before Going Live.

What Good Looks Like: Evaluating Code-Switching Capability in an STT Vendor

When evaluating STT vendors for a deployment that involves code-switched audio, here are the specific questions that separate vendors who have solved this problem from those who are still working on it.

What is your WER specifically on Hindi-English code-mixed telephony audio? If a vendor cannot give you a number on this specific condition, they have not tested for it. A general Hindi WER or a general English WER is not an answer to this question.

What is your language identification architecture? Ask whether the system uses utterance-level or frame-level language identification. This is a technical question with a concrete answer. If the vendor says utterance-level, ask how they handle intra-sentential switching. If they cannot answer that, the system does not handle it.

What code-mixed training data do you use? A vendor claiming strong code-switching support should be able to describe the composition of their training data. "We train on Indian telephony audio" is not sufficient. The relevant question is whether that training data includes code-mixed utterances, and in what proportion.

Can I benchmark your system on my own audio before committing? Any vendor confident in their code-switching performance will agree to an evaluation on your audio. One that deflects to published benchmarks is telling you that their published benchmarks are not representative of your use case.

For a full vendor evaluation framework including an RFP template with these questions formalised, see our guide on How to Write an STT API RFP: The Questions Your Speech Recognition Vendor Does Not Want You to Ask.

Frequently Asked Questions About Code-Switching and Hinglish ASR

What is code-switching in speech recognition?
Code-switching is when a speaker alternates between two or more languages within a conversation or within a single sentence. In Indian enterprise environments, the most common pairs are Hindi-English (Hinglish), Tamil-English (Tanglish), and Kannada-English (Kanglish). Standard STT systems break on intra-sentential code-switching, where the language switch happens mid-sentence without a natural boundary.

Why do most STT APIs fail on Hinglish?
Most STT systems use utterance-level language identification, which commits to a single language for an entire audio segment before processing begins. On intra-sentential code-switched speech, this means the system processes English phonemes through a Hindi model or vice versa, producing substitution and deletion errors on the switched-language vocabulary. The problem is architectural, not just a training data gap.

What WER should I expect on Hinglish contact center audio?
On real Hindi-English code-mixed telephony audio in noisy conditions, global models typically achieve 14 to 16% WER. Leading homegrown Indic models achieve 11 to 14%. Gnani STT API achieves 9% WER on the same audio profile, trained on production contact center audio that reflects real Indian code-switching patterns.

Is Hinglish the only code-switching problem in India?
No. Tamil-English, Kannada-English, Telugu-English, Malayalam-English, and other regional language pairings all present similar challenges. Each pair has distinct phonological adaptation patterns and vocabulary borrowing conventions. An enterprise deploying voice AI across multiple Indian language regions needs code-switching support for each relevant pair.

Does noise make code-switching worse?
Significantly. Noise degrades acoustic model confidence, causing the system to lean more heavily on language model priors. Language model priors push toward the dominant training language. The result is that on noisy, code-switched audio, the STT system systematically underperforms on the secondary language vocabulary, producing higher error rates on exactly the words most likely to be borrowed, such as English technical terms within Hindi sentences. For a full analysis of noise effects on Indian STT, see Speech Recognition in Noisy and Rural India.

How do I test a vendor's code-switching capability before deploying?
Build a test set from your actual production audio, segment it by inter-sentential and intra-sentential switching, and run the benchmark on telephony-quality audio. Ask for WER specifically on code-mixed audio, not general language WER. Ask about language identification architecture. Any vendor that cannot provide code-mixed WER on telephony audio has not tested for the problem you are trying to solve. For the full benchmarking methodology, see How to Benchmark a Speech-to-Text API on Indian Languages.

What downstream systems are affected by poor code-switching accuracy?
Intent detection, entity extraction, compliance monitoring, and voice bot conversation flows all depend on accurate transcription of code-switched utterances. In collections, dropped or substituted English vocabulary within Hindi sentences causes errors in amount capture and consent recording. In compliance, partially transcribed disclosure statements create regulatory exposure. In voice bots, failed intent detection on code-switched responses causes conversation breakdowns.

The Bottom Line

The code-switching problem is not going to go away. India's contact center workforce speaks in mixed languages because mixed-language communication is natural, efficient, and normal for bilingual speakers. The question is not whether your customers and agents will code-switch. They will. The question is whether your STT infrastructure was built to handle it.

Most global STT systems were not. Most were built for monolingual speech, with Indian language support added as an extension. That architectural history shows up in production WER numbers on code-mixed audio, consistently 14 to 16% for global providers and 11 to 14% for homegrown Indic models that are monolingual at their core.

The gap between 9% and 16% WER on code-mixed telephony audio is not a marginal difference in a performance chart. It is the difference between a compliance-grade transcript and a disputed one. Between a collections call that populates your CRM correctly and one that creates a data integrity problem. Between a voice bot conversation that completes and one that abandons.

Evaluate your STT vendor on code-mixed audio from your actual deployment environment, under real noise conditions, before you sign anything.

This post is part of The India STT Handbook by Gnani, a practitioner's guide to speech-to-text for Indian enterprises. Gnani STT API supports 12 Indian languages: Hindi, Tamil, Telugu, Malayalam, Kannada, Odia, Marathi, Punjabi, Gujarati, Bengali, Assamese, and English.

More for You

HR

Why AI Agents Leave Traditional CRMs Behind in the Quest for Superior Customer Engagement

Healthcare
Banking & NBFC
BPOs
Telecom
Consumer Durables

Developer Spotlight: Building Custom AI Skill-Blocks for Inya.ai

EdTech
Govt.

Voice AI to with Market Dominators: Small Finance Banks’ Edge

Enhance Your Customer Experience Now

Gnani Chip