How to Benchmark an STT API on Indian Languages Before You Go Live
- Introduction
- Why Vendor Benchmarks Fail
- Step 1: Build a Test Set
- Step 2: Ground Truth
- Step 3: Calculate WER
- Step 4: Language Evaluation
- Step 5: Code-Switching
- Step 6: Noise Testing
- Step 7: Diarization & Latency
- Step 8: Vendor Comparison
- Step 9: Fine-Tuning
- Evaluation Checklist
- FAQ
Most enterprise STT evaluations follow the same pattern. A vendor sends a benchmark sheet. The numbers look strong. A demo is scheduled on clean audio. The procurement team is impressed. The contract is signed. Three months after go-live, production accuracy is 15 percentage points below what the demo showed, and nobody is sure why.
The gap is almost always explained by the same set of factors: the vendor’s benchmark was on clean audio, the demo was on scripted speech, and the production calls are noisy, spontaneous, code-switched, 8kHz telephony from tier-2 cities. The evaluation never tested for the conditions that matter.
This guide gives you a complete methodology for benchmarking an STT API on Indian languages before you commit. It covers how to build a representative test set, how to calculate WER using jiwer, how to structure language-specific evaluations across all 12 major Indian languages, how to test for noise and code-switching, and how to compare vendors on metrics that predict production performance rather than demo performance.
Why Standard Vendor Benchmarks Are Not Enough
Before getting into methodology, it is worth being precise about why vendor-supplied benchmarks cannot substitute for your own evaluation.
Vendor benchmarks are accurate in the narrow sense: they reflect performance on the specific dataset under the specific conditions described. The problem is selection. Vendors benchmark on datasets that show their systems at their best. For Indian language STT, the most commonly cited benchmark is Kathbath Clean, a 1,684-hour dataset across 12 Indian languages collected in controlled conditions at 16kHz. It is a rigorous dataset and a legitimate research benchmark.
It is not your audio.
Your contact center calls are 8kHz telephony. Your agents and customers code-switch between Hindi and English, or Tamil and English, within sentences. Your audio includes background noise from open-plan offices, GSM codec compression from mobile connections, and spontaneous speech from rural callers with regional dialectal features. None of these conditions are represented in Kathbath Clean, and WER on Kathbath Clean does not predict WER on your audio with any reliability.
The table below illustrates why this matters. These are real benchmark results across five major STT providers on both Kathbath Clean and Gramvaani, a rural Hindi telephony dataset that more closely represents real Indian contact center conditions.
| Provider | Kathbath Clean WER | Gramvaani WER | Degradation |
|---|---|---|---|
| Gnani STT API | 8.38% | 26.24% | +17.9pp |
| Microsoft | 11.01% | 29.88% | +18.9pp |
| Sarvam | 9.41% | 26.71% | +17.3pp |
| ElevenLabs Scribe v1 | 7.67% | 37.46% | +29.8pp |
| Deepgram Nova | 21.50% | 35.20% | +13.7pp |
ElevenLabs Scribe achieves the best WER on Kathbath Clean. It shows the worst WER on Gramvaani. A procurement decision based on the clean benchmark alone would select the worst-performing system for real Indian telephony conditions. Your own benchmark evaluation, on your own audio, is not optional. It is the only way to make a defensible decision.
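The rank reversal is easy to verify from the table itself. The sketch below only re-derives the degradation column and the two rankings from the figures quoted above; the variable and function names are illustrative.

```python
# Figures copied from the table above: (Kathbath Clean WER %, Gramvaani WER %)
results = {
    "Gnani STT API": (8.38, 26.24),
    "Microsoft": (11.01, 29.88),
    "Sarvam": (9.41, 26.71),
    "ElevenLabs Scribe v1": (7.67, 37.46),
    "Deepgram Nova": (21.50, 35.20),
}


def rank_by(index):
    """Return provider names ordered from lowest to highest WER on one dataset."""
    return sorted(results, key=lambda name: results[name][index])


clean_rank = rank_by(0)       # ranking on Kathbath Clean
telephony_rank = rank_by(1)   # ranking on Gramvaani

for name, (clean, noisy) in results.items():
    print(
        f"{name:22s} degradation: +{noisy - clean:.1f}pp "
        f"(clean rank {clean_rank.index(name) + 1}, "
        f"telephony rank {telephony_rank.index(name) + 1})"
    )
```

Running the same two rankings on your own clean and degraded tiers shows whether the ordering of vendors changes between conditions.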
Build a Representative Test Set
The quality of your evaluation is entirely determined by the quality of your test set. A test set that does not reflect your production audio will produce evaluation results that do not predict production performance.
Define your deployment scope first
Before pulling audio, write down the answers to four questions: Which languages will your deployment handle? What is the audio channel (telephony, VoIP, in-person recording)? What domains and vocabulary are most critical (financial terms, product names, regulatory language)? What geographic regions will your callers come from? These answers define the dimensions your test set must cover.
Pull from production audio, not demos
Your test set should be drawn from real calls in your environment. If you are evaluating for a BFSI contact center, pull from actual customer calls, not scripted quality assurance recordings. If your deployment will handle collections calls from UP and Bihar, your test set should include audio from those regions. Do not weight toward your best audio. The clean calls will perform fine on every system. You are evaluating for the hard cases.
Target 200 to 500 utterances per language
For each language in scope, aim for 200 to 500 audio segments with corresponding ground truth transcripts. Fewer than 200 produces WER estimates with too much variance to be statistically reliable. More than 500 adds evaluation cost without proportional precision gain for an initial vendor comparison.
Segment by audio condition
Divide your test set into at minimum two tiers: clean or reasonable quality audio (clear speaker, minimal background noise) and degraded audio (background noise, mobile connections, low bandwidth). Calculate WER separately for each tier. A single average WER across both tiers hides the degradation behaviour that matters most.
Include domain-critical vocabulary
Identify the 50 to 100 words and phrases that are most consequential in your use case: product names, regulatory terms, amount formats, account identifiers. Manually check transcription accuracy on these terms in addition to running WER calculations. A system with 8% overall WER that consistently misrecognises “NPA” or “DPD” is not usable for BFSI regardless of what the headline number says.
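One way to make the manual check repeatable is a per-term recall count: of the reference transcripts that contain a critical term, how many hypotheses also contain it. This is a minimal sketch for Roman-script terms such as acronyms. The function name, the sample utterances, and the whole-word matching rule are assumptions for illustration, not part of any vendor tooling.

```python
import re


def term_recall(pairs, critical_terms):
    """For each term, return (recovered, total): how many reference utterances
    contain the term, and how many of their hypotheses also contain it.
    Matching is case-insensitive and whole-word."""
    stats = {}
    for term in critical_terms:
        pattern = re.compile(rf"(?<!\w){re.escape(term)}(?!\w)", re.IGNORECASE)
        with_term = [(ref, hyp) for ref, hyp in pairs if pattern.search(ref)]
        recovered = sum(1 for _, hyp in with_term if pattern.search(hyp))
        stats[term] = (recovered, len(with_term))
    return stats


# Hypothetical (reference, hypothesis) pairs
pairs = [
    ("aapka account NPA ho gaya hai", "aapka account NPA ho gaya hai"),
    ("customer ka DPD thirty hai", "customer ka DVD thirty hai"),
    ("NPA status check karo", "MPA status check karo"),
]

stats = term_recall(pairs, ["NPA", "DPD"])
for term, (recovered, total) in stats.items():
    print(f"{term}: {recovered}/{total} recovered")
```

A term that is recovered in only a fraction of its occurrences deserves a manual look even when overall WER is low.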
Create Ground Truth Transcripts
Ground truth quality determines evaluation reliability. A poorly annotated ground truth produces unreliable WER numbers regardless of how good or bad the STT system is.
Use native speaker annotators for each language
This is non-negotiable. For Tamil evaluation, your annotators must be fluent Tamil speakers. For Odia or Assamese, find annotators who are native speakers of those specific languages, not just multilingual annotators who have some familiarity. Non-native annotation introduces systematic phoneme-level errors that distort WER results.
Establish a consistent transcription convention before annotation begins
Decide upfront how to handle: numbers (digits vs words), punctuation (include or exclude), English words in Indian-language sentences (transliterate or keep in Roman script), disfluencies (transcribe verbatim or normalise), and overlapping speech. Inconsistent conventions across annotators produce noisy ground truth that makes WER comparisons unreliable.
For Hindi-English code-switched audio
Decide whether English words spoken in a Hindi sentence are transcribed in Devanagari script (phonetically) or Roman script. This decision affects WER calculation and should match how your downstream systems will use the transcript. Document the convention and apply it consistently.
Double-annotate a sample for quality control
Have two annotators independently transcribe 10% of your test set and calculate inter-annotator agreement at the word level (for example, the share of words on which the two transcripts match). If agreement is below 95%, your annotation guidelines need tightening before you proceed.
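A rough word-level agreement score can be computed with the Python standard library. The sketch below uses `difflib.SequenceMatcher`, whose ratio is a similarity score (twice the number of matching words divided by the total word count), not one minus WER. Treat it as one possible definition of agreement; the function name and sample transcripts are illustrative.

```python
from difflib import SequenceMatcher


def word_agreement(transcript_a, transcript_b):
    """Share of words on which two transcripts match, from 0.0 to 1.0.

    Computed as 2 * matching_words / (len(a) + len(b)) over word lists.
    """
    words_a = transcript_a.split()
    words_b = transcript_b.split()
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


# Two annotators disagree on the spelling of one word out of seven
annotator_1 = "aapka loan amount disburse ho gaya hai"
annotator_2 = "aapka loan amount disburs ho gaya hai"

agreement = word_agreement(annotator_1, annotator_2)
print(f"Agreement: {agreement:.1%}")
```

Apply the same text normalisation to both transcripts first, so that punctuation or casing differences do not count as disagreement.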
Calculate WER Using jiwer
jiwer is the standard Python library for WER calculation. It is lightweight, well-maintained, and handles the normalisation steps that raw string comparison misses.
Installation
pip install "jiwer>=3.0"
Verify your version before running any of the code below:
import jiwer; print(jiwer.__version__)
Basic WER calculation
from jiwer import wer, cer

# Romanised Hindi-English example; the hypothesis drops the word "bhi"
reference = "aapka outstanding balance abhi bhi pending hai"
hypothesis = "aapka outstanding balance abhi pending hai"

word_error_rate = wer(reference, hypothesis)
char_error_rate = cer(reference, hypothesis)
print(f"WER: {word_error_rate:.4f} ({word_error_rate*100:.2f}%)")
print(f"CER: {char_error_rate:.4f} ({char_error_rate*100:.2f}%)")

Batch evaluation across a test set
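import pandas as pd
from jiwer import wer, cer
from jiwer import (
    Compose, ToLowerCase, RemovePunctuation, RemoveMultipleSpaces, Strip,
    ReduceToListOfListOfWords, ReduceToListOfListOfChars,
)

# Expected columns: reference, hypothesis, language, audio_condition
df = pd.read_csv('test_set_results.csv')

# Same clean-up for both metrics, so WER and CER are scored on identical text;
# only the final tokenisation step (words vs characters) differs
cleanup = [ToLowerCase(), RemovePunctuation(), RemoveMultipleSpaces(), Strip()]
word_normaliser = Compose(cleanup + [ReduceToListOfListOfWords()])
char_normaliser = Compose(cleanup + [ReduceToListOfListOfChars()])

results = []
for _, row in df.iterrows():
    w = wer(
        row['reference'],
        row['hypothesis'],
        reference_transform=word_normaliser,
        hypothesis_transform=word_normaliser
    )
    c = cer(
        row['reference'],
        row['hypothesis'],
        reference_transform=char_normaliser,
        hypothesis_transform=char_normaliser
    )
    results.append({
        'language': row['language'],
        'audio_condition': row['audio_condition'],
        'wer': w,
        'cer': c
    })

# Mean per-utterance WER and CER for each language and audio condition
results_df = pd.DataFrame(results)
summary = results_df.groupby(['language', 'audio_condition']).agg(
    mean_wer=('wer', 'mean'),
    mean_cer=('cer', 'mean'),
    sample_count=('wer', 'count')
).reset_index()
print(summary.to_string(index=False))

Getting the error breakdown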
from jiwer import process_words

# reference and hypothesis are the strings from the basic example above
measures = process_words(reference, hypothesis)
print(f"WER: {measures.wer*100:.2f}%")
print(f"Substitutions: {measures.substitutions}")
print(f"Deletions: {measures.deletions}")
print(f"Insertions: {measures.insertions}")
print(f"Hits: {measures.hits}")

Always request the error breakdown, not just the headline WER number. A WER composed primarily of deletions on noisy audio points to a different root cause than a WER composed primarily of substitutions on domain-specific vocabulary. The fix for one is not the fix for the other.
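One caveat on the batch summary: it averages per-utterance WER, so a two-word utterance with one error weighs as much as a twenty-word one. WER is more commonly reported as a pooled figure, total word errors divided by total reference words. The sketch below is dependency-free so the arithmetic is visible; the helper name and sample utterances are illustrative. To the best of my knowledge, passing lists of references and hypotheses to `jiwer.wer` pools in the same way.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,          # deletion
                current[j - 1] + 1,       # insertion
                previous[j - 1] + cost,   # match or substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical pairs: a short utterance with one error, a long one with none
pairs = [
    ("haan ji", "haan"),
    ("aapka outstanding balance abhi bhi pending hai kripya payment kar dijiye",
     "aapka outstanding balance abhi bhi pending hai kripya payment kar dijiye"),
]

errors = [word_errors(ref, hyp) for ref, hyp in pairs]
ref_lengths = [len(ref.split()) for ref, _ in pairs]

mean_utterance_wer = sum(e / n for e, n in zip(errors, ref_lengths)) / len(pairs)
pooled_wer = sum(errors) / sum(ref_lengths)

print(f"Mean of per-utterance WER: {mean_utterance_wer:.2%}")
print(f"Pooled (corpus-level) WER: {pooled_wer:.2%}")
```

Whichever definition you choose, use the same one for every vendor and state it in the report.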
Normalisation for Indian languages
Standard jiwer normalisation handles English text well. For Indian language transcripts, you may need custom normalisation to handle number formats, script variants, and punctuation in Devanagari and other Indian scripts.
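import re
import unicodedata
from jiwer import wer

def normalise_indian_text(text):
    # NFC makes visually identical Unicode sequences compare as equal
    text = unicodedata.normalize('NFC', text)
    # Remove danda (।), double danda (॥), the ASCII pipe, and common Latin punctuation
    text = re.sub(r'[।|॥\.\,\!\?\:\;\-\"\']', '', text)
    # Collapse runs of whitespace left behind by the removals
    text = re.sub(r'\s+', ' ', text).strip()
    # lower() only affects Roman-script words; Indic scripts have no letter case
    return text.lower()

reference_clean = normalise_indian_text(reference)
hypothesis_clean = normalise_indian_text(hypothesis)
word_error_rate = wer(reference_clean, hypothesis_clean)

Structure Your Language-Specific Evaluation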
Indian language STT evaluation is not a single benchmark. It is twelve separate evaluations with different challenges, different training data availability profiles, and different downstream use cases.
Tier your languages by deployment priority
Not all twelve languages will be equally critical for your specific deployment. Identify your primary languages, the ones that carry 80%+ of your call volume, and allocate most of your test set there. Secondary languages should still be evaluated but can have smaller test sets.
Adjust your WER acceptance threshold by language
Languages with large training data availability, primarily Hindi and English, will show lower WER across all providers. Languages with limited training data, Odia, Assamese, Punjabi, will show higher WER across all providers. Do not apply a single WER threshold across all languages. Establish language-specific thresholds based on what is achievable, not what is ideal.
| Language tier | Languages | Target WER (telephony) |
|---|---|---|
| High resource | Hindi, English | Below 12% |
| Medium resource | Tamil, Telugu, Kannada, Malayalam, Marathi, Bengali, Gujarati | Below 18% |
| Lower resource | Odia, Punjabi, Assamese | Below 25% |
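The tiered targets can be encoded once and applied to every result, so no language is judged against the wrong threshold. This is a minimal sketch: the targets and language groupings are taken from the table above, while the dictionary and function names are illustrative.

```python
# Telephony WER targets from the table above, expressed as fractions
TARGET_WER = {"high": 0.12, "medium": 0.18, "lower": 0.25}

LANGUAGE_TIER = {
    "Hindi": "high", "English": "high",
    "Tamil": "medium", "Telugu": "medium", "Kannada": "medium",
    "Malayalam": "medium", "Marathi": "medium", "Bengali": "medium",
    "Gujarati": "medium",
    "Odia": "lower", "Punjabi": "lower", "Assamese": "lower",
}


def meets_target(language, measured_wer):
    """True if the measured telephony WER is below the target for that language's tier."""
    return measured_wer < TARGET_WER[LANGUAGE_TIER[language]]


# The same 20% WER passes for Odia but fails for Hindi
print(meets_target("Odia", 0.20))
print(meets_target("Hindi", 0.20))
```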
Evaluate CER alongside WER for Dravidian languages
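For Tamil, Telugu, Kannada, and Malayalam, character error rate is a more stable signal than WER because of agglutinative morphology. Words in these languages carry long chains of suffixes, so a system that gets the root right but one suffix wrong is charged a full word error for a few wrong characters: WER looks high while CER stays comparatively low. Report both. CER shows how close the output is to the reference, and WER shows how often a downstream NLP pipeline that depends on exact word forms will break.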
Test for Code-Switching
A vendor’s Hindi WER and English WER tested separately do not tell you how the system handles Hindi-English code-switched speech. This is a separate test with a distinct audio profile.
Build a code-switching specific test set
Pull audio that contains intra-sentential code-switching: sentences where the speaker switches language within a single utterance. “Sir, aapka loan amount disburse ho gaya hai, so please check karo apna account.” These sentences are the hard test. Inter-sentential switching, where the speaker completes a full sentence in one language before switching, is significantly easier for most systems.
Calculate WER separately on code-mixed audio
Do not average code-switched WER into your general Hindi WER. It will dilute the signal. Report it as a separate number.
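If your transcription convention keeps English words in Roman script, code-mixed Hindi utterances can be routed into their own bucket by checking which scripts appear in the reference transcript. The sketch below assumes that convention and covers Devanagari only. It flags any utterance with both scripts and does not distinguish intra-sentential from inter-sentential switching. The function names are illustrative.

```python
def scripts_used(text):
    """Return the set of scripts ('devanagari', 'latin') that appear in the text."""
    found = set()
    for ch in text:
        if "\u0900" <= ch <= "\u097f":       # Devanagari Unicode block
            found.add("devanagari")
        elif ch.isascii() and ch.isalpha():  # unaccented Roman letters
            found.add("latin")
    return found


def is_code_mixed(text):
    """True when a transcript contains both Devanagari and Roman-script words."""
    return scripts_used(text) == {"devanagari", "latin"}


print(is_code_mixed("आपका loan amount disburse हो गया है"))
print(is_code_mixed("आपका बैलेंस पेंडिंग है"))
print(is_code_mixed("please check your account"))
```

If your convention instead transliterates English into Devanagari, script detection will not work and the code-mixed subset has to be tagged by annotators.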
Benchmark WER expectations on code-mixed telephony audio
On Hindi-English code-mixed audio in real, noisy telephony conditions, the performance range across providers is: global cloud providers at 14 to 16% WER, homegrown Indic models at 11 to 14% WER, and Gnani STT API at 9% WER. These numbers come from evaluation on production contact center audio.
Test for the vocabulary that switches most
In BFSI calls, the English vocabulary most likely to appear in Hindi sentences includes: amount terms, date formats, product names, regulatory acronyms (NPA, EMI, DPD, KYC, OTP), and system-generated phrases. Manually verify transcription accuracy on these terms in addition to running WER.
Stress-Test for Noise and Audio Degradation
Clean audio WER is the ceiling. Production WER is what you will actually get. Stress-testing for noise tells you how far below that ceiling your system will operate.
Test at multiple SNR levels
Use your segmented test set to calculate WER separately at different signal-to-noise ratios: above 20dB (clean), 15 to 20dB (moderate noise), and below 15dB (high noise, typical of rural mobile telephony). The degradation curve across these tiers tells you more about production fitness than any single number.
Test on 8kHz audio specifically
If your calls arrive at 8kHz, test on 8kHz audio. Do not allow vendors to test on upsampled audio and present that as telephony performance. Upsampling does not recover lost acoustic information.
import librosa
import soundfile as sf

def downsample_to_8k(input_path, output_path):
    """Downsample audio to 8kHz to approximate telephony bandwidth.

    This reproduces the sample rate only, not GSM or other codec compression.
    """
    # sr=None keeps the file's native sample rate; librosa loads as mono by default
    audio, sr = librosa.load(input_path, sr=None)
    audio_8k = librosa.resample(audio, orig_sr=sr, target_sr=8000)
    sf.write(output_path, audio_8k, 8000)
    return output_path

Add synthetic noise for controlled degradation testing
If your test set does not have enough naturally noisy audio at specific SNR levels, you can add calibrated noise synthetically to create test conditions at known SNR values.
import numpy as np
import librosa
import soundfile as sf

def add_noise_at_snr(clean_audio, target_snr_db):
    """Add white noise to achieve a target SNR in dB."""
    # Signal power is averaged over the whole clip, silences included
    signal_power = np.mean(clean_audio ** 2)
    snr_linear = 10 ** (target_snr_db / 10)
    noise_power = signal_power / snr_linear
    noise = np.random.normal(0, np.sqrt(noise_power), len(clean_audio))
    return clean_audio + noise

audio, sr = librosa.load('clean_call.wav', sr=8000)
noisy_audio = add_noise_at_snr(audio, target_snr_db=10)
sf.write('noisy_call_10db.wav', noisy_audio, sr)

Use synthetic noise testing to understand the degradation curve, not as a substitute for real production audio in your final evaluation.
Evaluate Diarization, Latency, and Confidence Scores
WER is the primary accuracy metric but not the only deployment-relevant dimension. Three additional metrics should be part of any enterprise STT evaluation.
Speaker diarization accuracy
For two-party contact center calls, diarization separates agent speech from customer speech. Diarization accuracy is measured separately from WER. Test it by running the vendor’s diarization output against manually labelled speaker segments and calculating the percentage of words correctly attributed to the right speaker.
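The word-level attribution metric described above reduces to a simple comparison once each word carries a speaker label. This sketch assumes the vendor's words have already been aligned one-to-one with the manually labelled words, and that vendor speaker IDs have been mapped to "agent" and "customer". Both steps are outside the sketch, and the function name is illustrative.

```python
def speaker_attribution_accuracy(reference_labels, predicted_labels):
    """Fraction of words whose predicted speaker matches the manual label.

    Both lists hold one speaker label per word, in the same word order.
    """
    if len(reference_labels) != len(predicted_labels):
        raise ValueError("label lists must be aligned word-for-word")
    if not reference_labels:
        raise ValueError("no words to score")
    correct = sum(ref == pred for ref, pred in zip(reference_labels, predicted_labels))
    return correct / len(reference_labels)


# Hypothetical 10-word exchange; the vendor mislabels two words at the turn boundary
manual = ["agent"] * 5 + ["customer"] * 5
vendor = ["agent"] * 7 + ["customer"] * 3

accuracy = speaker_attribution_accuracy(manual, vendor)
print(f"Words attributed to the right speaker: {accuracy:.0%}")
```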
Latency for real-time use cases
If your deployment includes real-time transcription, measure P95 latency under production load conditions, not in a single-request test. Send concurrent requests that reflect your peak call volume and measure the latency distribution. A system that achieves 180ms at low load but 600ms under peak concurrency is not fit for live agent assist.
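A concurrency test can be sketched with the standard library alone. In the example below, `fake_transcribe` is a hypothetical stand-in that simply sleeps. Replace it with a real request to the vendor's API, and set the concurrency and request count to reflect your peak call volume. The percentile uses the nearest-rank definition.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def fake_transcribe(audio_chunk):
    """Hypothetical stand-in for a vendor API call; replace with a real request."""
    time.sleep(0.01)
    return "transcript"


def timed_call(audio_chunk):
    """Return the wall-clock seconds taken by one transcription request."""
    start = time.perf_counter()
    fake_transcribe(audio_chunk)
    return time.perf_counter() - start


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies(chunks, concurrency):
    """Send all chunks with a fixed number of requests in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, chunks))


latencies = measure_latencies([b"audio"] * 40, concurrency=8)
print(f"P50: {percentile(latencies, 50) * 1000:.0f} ms")
print(f"P95: {percentile(latencies, 95) * 1000:.0f} ms")
```

Repeat the run at several concurrency levels and compare the P95 values, since the growth in tail latency under load is what the test is meant to expose.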
Confidence score calibration
Ask each vendor for a sample of transcripts with confidence scores and manually verify accuracy on low-confidence segments. A well-calibrated confidence score should predict accuracy: segments with scores below 0.5 should show meaningfully higher WER than segments above 0.8. If confidence scores are uncalibrated, they cannot be used to trigger human review workflows reliably.
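The calibration check can be made quantitative by pooling word errors within confidence buckets. This sketch assumes you already have, for each segment, the vendor's confidence score plus the word error count and reference word count (for example from the error breakdown step). The bucket edges follow the 0.5 and 0.8 thresholds mentioned above; the names and sample data are illustrative.

```python
def wer_by_confidence(segments, low_edge=0.5, high_edge=0.8):
    """Pool word errors into three confidence buckets.

    Each segment is a tuple: (confidence, word_errors, reference_word_count).
    Returns pooled WER per bucket, or None for an empty bucket.
    """
    totals = {"below_0.5": [0, 0], "0.5_to_0.8": [0, 0], "0.8_and_above": [0, 0]}
    for confidence, errors, ref_words in segments:
        if confidence < low_edge:
            bucket = "below_0.5"
        elif confidence < high_edge:
            bucket = "0.5_to_0.8"
        else:
            bucket = "0.8_and_above"
        totals[bucket][0] += errors
        totals[bucket][1] += ref_words
    return {name: (e / n if n else None) for name, (e, n) in totals.items()}


# Hypothetical segments: (confidence, word errors, reference words)
segments = [(0.35, 4, 10), (0.45, 3, 10), (0.65, 2, 10), (0.90, 1, 20), (0.95, 0, 20)]

buckets = wer_by_confidence(segments)
for name, value in buckets.items():
    print(f"{name}: {value:.1%}" if value is not None else f"{name}: no segments")
```

A calibrated system shows WER falling steadily from the lowest bucket to the highest; flat or inverted buckets suggest the scores are not usable as a review trigger.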
Structure Your Vendor Comparison
Once you have WER results across languages, audio conditions, and use cases for each vendor, structure the comparison to surface the dimensions that matter for your specific deployment.
Weight your scoring by deployment priority
If your primary use case is Hindi collections calls from rural UP, WER on Hindi noisy telephony should carry the most weight. WER on clean English is nearly irrelevant. Define your weighting before you see the results to avoid post-hoc rationalisation.
Build a comparison matrix
| Vendor | Language | Clean WER | Noisy WER | Code-mixed WER | Diarization Acc | P95 Latency | Confidence Calibrated |
|---|---|---|---|---|---|---|---|
Do not compress this into a single score. Different use cases will weight dimensions differently, and a single composite score hides the trade-offs that matter for your decision.
Ask vendors to sign off on benchmark conditions
Before using benchmark results in a procurement decision, share your methodology with each vendor and ask them to confirm that the results are representative of their system under those conditions. This protects you from vendors later claiming the test was unfair, and it surfaces any legitimate concerns about test set construction before the decision is made.
Domain-Specific Fine-Tuning Assessment
For BFSI, insurance, and other regulated deployments, general benchmark WER is necessary but not sufficient. You also need to evaluate domain-specific accuracy.
Build a domain vocabulary test set
Identify the 100 to 200 terms most critical to your deployment: product names, regulatory terms, amount formats, acronyms, and proper nouns. Run a targeted evaluation specifically on utterances containing these terms and report accuracy separately.
Ask vendors about custom vocabulary injection
Every enterprise STT vendor should support adding domain-specific vocabulary to improve accuracy on OOV terms. Ask for the mechanism (word list, phonetic spelling, weighted boosting), the latency of applying vocabulary updates, and whether custom vocabulary affects the general language model or only the lexicon.
Ask about fine-tuning availability
Custom vocabulary injection improves OOV accuracy. Fine-tuning on your domain data improves broader domain accuracy. For high-volume, high-stakes deployments, full fine-tuning on a sample of your annotated audio is worth evaluating. Ask vendors for their fine-tuning process, the amount of data required, and the expected WER improvement.
Putting It All Together: Your Evaluation Checklist
Before signing a contract with any STT vendor for an Indian language deployment, you should have answers to all of the following.
Test set: Built from production audio? Covers all languages in scope? Segmented by audio condition? Contains domain-critical vocabulary?
Ground truth: Annotated by native speakers? Consistent transcription convention documented? Inter-annotator agreement above 95%?
WER results: Calculated per language? Calculated per audio condition tier? Error breakdown (substitutions, deletions, insertions) available? CER calculated for Dravidian languages?
Code-switching: Separate evaluation on intra-sentential code-mixed audio? WER on Hindi-English code-mixed telephony specifically?
Noise robustness: WER at multiple SNR levels? Tested on 8kHz audio specifically? Gramvaani or equivalent rural telephony benchmark requested?
Additional metrics: Diarization accuracy tested? Latency measured under concurrency? Confidence score calibration verified?
Domain accuracy: Domain vocabulary test completed? Custom vocabulary injection tested? Fine-tuning availability confirmed?
Vendor sign-off: Methodology shared with vendor? Results acknowledged as representative?
A vendor that cannot provide answers to these questions, or that declines to be evaluated on your audio, is telling you something important about the confidence they have in their production performance.
Frequently Asked Questions
How long does a proper STT benchmark evaluation take?
Building the test set and creating ground truth transcripts is the most time-consuming part: typically two to four weeks depending on language count and test set size. Running the actual evaluation once the test set is ready takes two to three days per vendor. A full evaluation of three to four vendors for an Indian language deployment can be completed in four to six weeks if the test set is built in parallel with vendor shortlisting.
Can I use publicly available datasets instead of my own audio?
You can use public datasets like Gramvaani and Kathbath as baseline references to compare vendors on a common standard. But they should not replace evaluation on your own audio. Public datasets represent average conditions. Your deployment has specific audio conditions, vocabulary, and use cases that only your own test set can capture.
How many languages do I need to test if I am deploying across multiple Indian states?
Test every language that will carry more than 5% of your call volume. For the others, the publicly available Kathbath results are a reasonable proxy. Gnani.ai supports all 12 major Indian languages: Hindi, Tamil, Telugu, Malayalam, Kannada, Odia, Marathi, Punjabi, Gujarati, Bengali, Assamese, and English, and can be evaluated on all of them.
What sample size do I need for statistically reliable WER results?
For a 95% confidence interval with a margin of error of plus or minus 1 percentage point on WER, you need approximately 400 utterances per condition. For initial vendor shortlisting, 200 utterances per language per condition is sufficient to identify major performance differences. For a final procurement decision, go to 400 or higher.
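The 400-utterance figure can be sanity-checked with a binomial approximation that treats each reference word as an independent trial. The expected WER of 15% and the average of 12 words per utterance used below are assumptions chosen for illustration, not figures from this guide. Word errors within an utterance are correlated in practice, so read the result as a lower bound.

```python
import math


def words_needed(expected_wer, margin, z=1.96):
    """Reference words needed for a +/- margin interval (z=1.96 for 95% confidence)."""
    return math.ceil(z ** 2 * expected_wer * (1 - expected_wer) / margin ** 2)


def utterances_needed(expected_wer, margin, words_per_utterance, z=1.96):
    """Convert the word count into utterances of a given average length."""
    return math.ceil(words_needed(expected_wer, margin, z) / words_per_utterance)


# Assumed: 15% expected WER, +/-1 percentage point margin, 12 words per utterance
print(words_needed(0.15, 0.01))
print(utterances_needed(0.15, 0.01, words_per_utterance=12))
```

Under those assumptions the estimate is about 4,900 reference words, or roughly 400 utterances. Shorter utterances or a higher expected WER push the requirement up.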
Should I run this evaluation myself or bring in a third party?
Running it yourself gives you better control over test set construction and deeper understanding of the results. If you do not have internal ML engineering capacity, a third-party evaluation is preferable to relying on vendor-supplied benchmarks. Whoever runs it, the methodology should be documented and shared with all vendors being evaluated.


