Abstract

Across enterprise voice AI deployments, a silent but consistent failure mode appears at the moment of production: models trained on clean, studio-grade speech collapse when the audio arrives through a telephone network. We document this phenomenon with precision, trace its root causes to acoustic domain mismatch, and describe the training methodology Gnani.ai has developed over seven years to build models that treat telephonic noise as signal.

This paper examines the physics of telephonic audio degradation, the mathematical cost of training on mismatched data, and the specific pipeline decisions that allowed us to build acoustic models trained across 14 million hours of real telephonic speech spanning 12 Indian languages. We describe our approach to corpus construction, noise-aware training architectures, codec simulation strategies, and evaluation methodology on the Kathbath Noisy benchmark, where Gnani.ai achieves 17.5% WER against 19.1% (ElevenLabs), 19.9% (Sarvam 3.0), and 23.3% (Microsoft Azure).

Our central argument is not that studio data is useless. It is that in telephonic AI deployments, it is actively harmful when allowed to dominate the training distribution. The acoustic assumptions baked into clean data create priors that telephonic audio violates at every level of the signal chain.


Section 1

1. Introduction: The Deployment Gap

Most ASR research is benchmarked on LibriSpeech. The dataset consists of read-aloud audiobook recordings sampled at 16 kHz, made by speakers in quiet surroundings, positioned close to a high-quality microphone, and reading text they can see in front of them. LibriSpeech is an excellent research tool. It is an almost entirely misleading proxy for enterprise telephonic AI.

The real deployment surface for enterprise voice AI in India looks nothing like LibriSpeech. It looks like a 58-year-old retail borrower calling a collections agent from a feature phone in Coimbatore, speaking Hindi mixed with Tamil, over a 2G connection, with a ceiling fan running in the background. The audio arrives at the ASR engine at 8 kHz, compressed by the G.711 mu-law codec, carrying an SNR of approximately -6 dB, with formant smearing caused by the telephone channel filter.

The gap between these two acoustic realities is not a fine-tuning problem. It is a data problem, and behind that, a philosophy problem: the assumption that clean data, augmented with artificial noise, can approximate what real telephonic data encodes. Research validates this concern with uncomfortable clarity.

A 2022 study published in Frontiers in Signal Processing demonstrated that ASR models trained on clean studio speech suffered a 58% performance drop when evaluated on cellular network audio, while models trained directly on network-distorted speech improved recognition accuracy by 82% over their clean-trained counterparts.

[1] Frontiers in Signal Processing, ‘Performance Evaluation of ASR Systems on Integrated Noise-Network Distorted Speech,’ 2022. DOI: 10.3389/frsip.2022.999457

Benchmark data from AssemblyAI confirms this at scale: Whisper Large-v3, trained on 5 million hours of web audio, achieves 2.7% WER on clean English speech but degrades to 17.7% on call center telephonic recordings. A 6.5x degradation factor, despite training at a scale most research teams will never attempt.

[2] AssemblyAI Benchmark Report on Whisper Large-v3, 2025. Compiled in Whisper Statistics Analysis, Quantumrun Foresight, December 2025.

Figure 1: Clean vs Telephonic WER for major ASR systems. Bubble size represents training data volume. Models above the dashed line degrade significantly under telephonic conditions. Gnani.ai (lower-left) achieves the best balance of clean accuracy and telephonic robustness.


Section 2

2. The Physics of Telephonic Degradation

To understand why studio data fails in telephonic deployment, it helps to work through the acoustic transformation a speech signal undergoes between a speaker’s mouth and an ASR engine’s input layer. Every stage in the telephone channel removes or distorts information that a clean-speech-trained model depends on.

Figure 2: The telephonic signal degradation chain. Each stage strips or distorts acoustic information. By the time audio reaches the ASR engine, the signal occupies only 0-4 kHz, carries codec quantization noise, and has been further degraded by network transmission artifacts.

2.1 Bandwidth Compression and Spectral Truncation

Standard telephone networks operate at an 8 kHz sampling rate, imposing a Nyquist limit of 4 kHz on representable frequencies. Human speech carries meaningful acoustic information up to approximately 8 kHz; fricatives like /s/, /f/, /sh/, and aspiration in languages like Hindi and Tamil carry energy concentrated between 4 and 8 kHz. In telephonic audio, this entire band is discarded.

The consequence for ASR is not merely a reduction in signal quality. It is the loss of phoneme-discriminative cues that a model trained on wideband audio has learned to rely on. A model that has learned to distinguish /s/ from /sh/ using spectral energy above 4 kHz will encounter a world in which that cue does not exist, and must fall back on lower-frequency formant structure that it has not been trained to prioritize.
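
The band loss described above can be reproduced on a toy signal. The following is a minimal sketch, not the production resampler: it assumes a synthetic 16 kHz signal made of a 1 kHz tone (a stand-in for low-frequency formant energy) and a 6 kHz tone (a stand-in for fricative energy), and a hypothetical 101-tap Hamming-windowed low-pass filter with a 3.6 kHz cutoff. It shows that proper conversion to 8 kHz removes the 6 kHz component, while naive decimation folds it down to 2 kHz as aliasing.

```python
import math

FS_WIDE = 16000   # wideband rate assumed for the "studio" signal
FS_TEL = 8000     # telephonic rate
DECIM = FS_WIDE // FS_TEL

def lowpass_taps(num_taps, cutoff_hz, fs):
    """Hamming-windowed sinc low-pass FIR, normalised to unit gain at DC."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / fs  # cutoff in cycles per sample
    taps = []
    for k in range(num_taps):
        t = k - mid
        sinc = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [h / total for h in taps]

def fir_valid(x, taps):
    """Convolve, keeping only outputs where the filter fully overlaps x."""
    n = len(taps)
    return [sum(taps[j] * x[i + n - 1 - j] for j in range(n))
            for i in range(len(x) - n + 1)]

def tone_amplitude(x, freq_hz, fs):
    """Amplitude of one sinusoidal component, measured with a single-bin DFT."""
    re = sum(v * math.cos(2 * math.pi * freq_hz * i / fs) for i, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq_hz * i / fs) for i, v in enumerate(x))
    return 2 * math.hypot(re, im) / len(x)

# Synthetic wideband signal: 1 kHz tone plus a weaker 6 kHz tone.
wide = [math.sin(2 * math.pi * 1000 * n / FS_WIDE)
        + 0.5 * math.sin(2 * math.pi * 6000 * n / FS_WIDE)
        for n in range(1700)]

taps = lowpass_taps(101, 3600, FS_WIDE)
narrow = fir_valid(wide, taps)[::DECIM]   # filtered, then decimated to 8 kHz
naive = wide[:1600:DECIM]                 # decimated with no anti-alias filter

amp_1k = tone_amplitude(narrow, 1000, FS_TEL)          # low band survives
alias_filtered = tone_amplitude(narrow, 2000, FS_TEL)  # 6 kHz energy is gone
alias_naive = tone_amplitude(naive, 2000, FS_TEL)      # 6 kHz folds to 2 kHz

print(f"1 kHz amplitude after conversion: {amp_1k:.3f}")
print(f"2 kHz alias, filtered: {alias_filtered:.4f}, naive: {alias_naive:.3f}")
```

Either way, the information that lived at 6 kHz is not available to the model: it is either deleted or turned into a spurious low-frequency component.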

A 2025 study on clinical telephonic ASR quantified this as an ‘acoustic bottleneck’: models trained on 16 kHz speech applied to 8 kHz telephonic data hit a floor at approximately 34% WER that parameter-efficient fine-tuning alone could not breach, because the architectural mismatch was fundamental.

[3] ‘Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony,’ arXiv:2512.16401, January 2026.

Figure 3: Spectrogram comparison across acoustic conditions. Left: full-bandwidth studio recording with energy across 0-8 kHz. Center: G.711 codec telephonic audio with hard 4 kHz cutoff and codec noise floor. Right: VoIP + background noise representing production worst-case. Note the near-complete loss of spectral content above the 4 kHz cutoff (cyan dashed line).

2.2 Codec-Induced Distortion

Telephone networks apply lossy compression codecs before transmission. G.711 (mu-law in North America and Japan, A-law in Europe and most of the rest of the world) operates at 64 kbps and introduces quantization noise and a characteristic tonal coloring of the speech signal. This codec was designed for intelligibility in human listening, not for acoustic feature extraction by a neural network.

The impact on log-mel spectrogram features, which form the input to most modern acoustic models, is a systematic distortion of formant trajectories and a non-stationary noise floor imposed by quantization error. Models trained on uncompressed audio learn feature distributions that do not match the codec-distorted distributions they encounter at inference time. VoIP deployments add further degradation: packet loss, jitter, and resampling artifacts compound the codec distortion.

[4] ibid, Frontiers in Signal Processing, 2022.
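
To make the quantization effect concrete, here is a minimal sketch of mu-law companding with 8-bit quantization. It uses the continuous mu-law formula with mu = 255 as an approximation; the actual G.711 encoder uses a segmented, piecewise-linear version of this curve, so the numbers are indicative only. The 440 Hz test tone and its two amplitudes are illustrative assumptions.

```python
import math

MU = 255  # mu-law parameter for the 8-bit mu-law variant

def mulaw_compress(x):
    """Map a sample in [-1, 1] to the companded domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def mulaw_roundtrip(x, bits=8):
    """Compress, quantise to 2**bits uniform cells, and expand again."""
    levels = 2 ** bits
    y = mulaw_compress(max(-1.0, min(1.0, x)))
    code = min(levels - 1, int((y + 1) / 2 * levels))  # integer code word
    y_hat = (code + 0.5) / levels * 2 - 1               # mid-point of the cell
    return mulaw_expand(y_hat)

def roundtrip_noise(amplitude, fs=8000, freq_hz=440, n=800):
    """Return (SNR in dB, RMS of the quantisation error) for a test tone."""
    signal = [amplitude * math.sin(2 * math.pi * freq_hz * i / fs) for i in range(n)]
    error = [mulaw_roundtrip(s) - s for s in signal]
    p_signal = sum(s * s for s in signal) / n
    p_error = sum(e * e for e in error) / n
    return 10 * math.log10(p_signal / p_error), math.sqrt(p_error)

snr_loud_db, noise_rms_loud = roundtrip_noise(0.5)
snr_quiet_db, noise_rms_quiet = roundtrip_noise(0.01)

print(f"loud tone:  SNR {snr_loud_db:.1f} dB, noise RMS {noise_rms_loud:.2e}")
print(f"quiet tone: SNR {snr_quiet_db:.1f} dB, noise RMS {noise_rms_quiet:.2e}")
```

Because the quantization cells are wider at high amplitudes, the error level rises and falls with the speech envelope. That is one way to read the non-stationary noise floor described above: it is tied to the signal rather than being a fixed background hiss.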

2.3 Channel Mismatch as a Structural Problem

A 2025 IEEE paper examined channel-induced ASR degradation and made an important distinction: the performance gap between recording conditions is not merely a training-test distribution mismatch, but a reflection of intrinsic signal properties that remain consistent regardless of how domain adaptation is performed. Their controlled experiments showed a consistent performance hierarchy among channels irrespective of which channel data was used for fine-tuning. The implication is that surface-level domain adaptation on top of clean-speech-pretrained models has a ceiling that cannot be breached through adaptation alone.

[5] ‘Revealing the Role of Audio Channels in ASR Performance Degradation,’ arXiv:2508.08967, August 2025.


Section 3

3. Why Speech Enhancement Does Not Solve This

The intuitive response to telephonic degradation is to apply a speech enhancement frontend before ASR inference. Denoise the audio, reconstruct the missing frequencies, and feed clean-ish audio to a standard acoustic model. The research literature on this is surprisingly consistent: speech enhancement preprocessing frequently degrades ASR performance on models that have already learned from diverse, noisy data.

A 2025 systematic evaluation of MetricGAN-plus denoising applied to four state-of-the-art ASR systems across 500 medical recordings found that enhanced audio produced a higher word error rate in every one of forty tested configurations. The mechanism is worth understanding: speech enhancement algorithms are trained to maximize perceptual quality metrics like PESQ or STOI. These measure human intelligibility, not ASR feature quality. An enhancement model may reconstruct a spectrogram that sounds cleaner to a human listener while systematically distorting the acoustic features a trained ASR encoder uses for phoneme discrimination.

[6] ‘When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems,’ arXiv:2512.17562, December 2025.

There is a second failure mode specific to Indic language telephony: bandwidth expansion (BWE) models trained on English or high-resource language clean speech do not generalize to the phonemic structures of Hindi, Tamil, Telugu, or Kannada. Formant patterns differ. Agglutinative morphology creates a different phoneme duration distribution. A BWE model trained to reconstruct English fricatives cannot reconstruct the retroflex consonants of Indian languages.

The correct engineering choice is to train acoustic models on data that already reflects the acoustic conditions of deployment, not to pre-process inference audio into a distribution the model was not trained on.

Section 4

4. Our Training Pipeline: A Walk Through the Core Decisions

The following architecture represents seven years of iteration on what it actually takes to build an acoustic model that survives telephonic deployment at enterprise scale. Every design decision maps back to a specific acoustic failure mode observed in production.

Figure 4: Gnani.ai end-to-end acoustic model training pipeline. Raw telephonic corpus enters at the top-left. Three parallel tracks, corpus curation, augmentation, and SSL pretraining for low-resource languages, feed into a shared Conformer-CTC architecture trained with GAN-based adversarial noise invariance. The output is a single production model covering 12 Indian languages.

4.1 Corpus Philosophy: Only What Production Looks Like

The foundational decision in our training pipeline was to source data that matches what our models encounter in production, not what research datasets look like. Over seven years, we have accumulated and curated a training corpus of 14 million hours of real telephonic speech across 12 Indian languages: Hindi, Tamil, Telugu, Kannada, Malayalam, Marathi, Bengali, Gujarati, Punjabi, Odia, Assamese, and Indian-accented English.

4.2 Architecture: Conformer-Based Encoder with CTC Objective

Our acoustic model backbone follows the Conformer architecture, introduced in Gulati et al. (2020), which combines multi-headed self-attention for global sequence modeling with convolutional layers for local acoustic feature extraction. The architecture has become the standard for ASR because the convolutional component captures co-articulation and phoneme boundary effects that pure Transformer encoders handle less efficiently.

We train using the Connectionist Temporal Classification (CTC) objective, which removes the need for forced alignment between audio frames and transcript tokens. CTC is particularly well-suited to telephonic speech where spontaneous disfluencies, variable speaking rates, and overlapping background events create alignment ambiguities that HMM-based forced alignment cannot resolve gracefully. Our encoder operates on 40-dimensional log-mel filterbank features computed with a 25 ms window and a 10 ms frame shift, spanning the full telephonic bandwidth of 0 to 4 kHz without assuming energy above 4 kHz.
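
As an illustration of the front-end parameters above, the following sketch computes 40 log-mel features per frame from 8 kHz audio with a 25 ms window and a 10 ms shift, with filters spanning 0 to 4 kHz. The FFT size (256), the Hamming window, the HTK-style mel formula, and the log floor are assumptions made for the example, not details taken from the production pipeline. A naive DFT is used to keep the example dependency-free.

```python
import cmath
import math

FS = 8000                 # telephonic sampling rate
WIN = FS * 25 // 1000     # 25 ms window -> 200 samples
HOP = FS * 10 // 1000     # 10 ms shift  -> 80 samples
N_FFT = 256               # assumed FFT size
N_MELS = 40

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_edges(n_mels, fmin, fmax):
    """n_mels + 2 band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + i * (hi - lo) / (n_mels + 1)) for i in range(n_mels + 2)]

def mel_filterbank(edges):
    """Triangular filters; filter m peaks at edges[m + 1]."""
    bins = [k * FS / N_FFT for k in range(N_FFT // 2 + 1)]
    bank = []
    for m in range(1, len(edges) - 1):
        lo, ctr, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (ctr - lo), (hi - f) / (hi - ctr)))
                     for f in bins])
    return bank

def power_spectrum(frame):
    """Power spectrum of a Hamming-windowed frame, zero-padded to N_FFT."""
    win = [v * (0.54 - 0.46 * math.cos(2 * math.pi * n / (WIN - 1)))
           for n, v in enumerate(frame)]
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * n / N_FFT)
                    for n, v in enumerate(win))) ** 2
            for k in range(N_FFT // 2 + 1)]

def log_mel(signal):
    """Return a list of frames, each a list of N_MELS log filterbank energies."""
    bank = mel_filterbank(mel_edges(N_MELS, 0.0, FS / 2))
    feats = []
    for start in range(0, len(signal) - WIN + 1, HOP):
        spec = power_spectrum(signal[start:start + WIN])
        feats.append([math.log(max(1e-10, sum(w * p for w, p in zip(row, spec))))
                      for row in bank])
    return feats

# 0.1 s of a 1 kHz tone as a stand-in for real audio.
signal = [math.sin(2 * math.pi * 1000 * n / FS) for n in range(FS // 10)]
features = log_mel(signal)
edges = mel_edges(N_MELS, 0.0, FS / 2)
peak_band = max(range(N_MELS), key=lambda m: features[0][m])

print(len(features), "frames x", len(features[0]), "mel bands")
print(f"strongest band is centred near {edges[peak_band + 1]:.0f} Hz")
```

The top filter edge sits at the 4 kHz Nyquist limit, so no filter depends on energy that a telephone channel cannot carry.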

Figure 5: Conformer block architecture used in Gnani.ai’s acoustic encoder. Each block comprises two half-weighted Feed-Forward modules sandwiching a Multi-Head Self-Attention module and a Convolution module, followed by LayerNorm. Residual connections at each sub-module allow gradients to flow cleanly through stacked blocks (N=12 to 24 depending on model tier). The architecture is repeated N times before the CTC output layer.

4.3 Data Augmentation: SpecAugment and Codec Simulation

For augmentation, we apply SpecAugment as introduced by Park et al. (2019) at Google Brain. SpecAugment operates directly on the log-mel spectrogram, applying three policies: time warping, frequency masking, and time masking. It prevents over-fitting by corrupting the training signal in ways that force the model to learn robust representations from partial information.

[8] Park, D.S., Chan, W., Zhang, Y., et al., ‘SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,’ Interspeech 2019. DOI: 10.21437/Interspeech.2019-2680
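
The two masking policies can be sketched in a few lines. This is a simplified illustration, assuming a spectrogram stored as a list of frames, each a list of mel-bin values; time warping is omitted, and the maximum mask widths (8 mel bins, 20 frames) are hypothetical values chosen for the example rather than the settings used in training.

```python
import random

def frequency_mask(spec, max_width, rng):
    """Zero a random run of up to max_width consecutive mel bins in every frame."""
    n_mels = len(spec[0])
    width = rng.randint(0, max_width)
    start = rng.randint(0, n_mels - width)
    masked = [[0.0 if start <= m < start + width else v
               for m, v in enumerate(frame)]
              for frame in spec]
    return masked, (start, width)

def time_mask(spec, max_width, rng):
    """Zero a random run of up to max_width consecutive frames."""
    n_frames = len(spec)
    width = rng.randint(0, min(max_width, n_frames))
    start = rng.randint(0, n_frames - width)
    masked = [[0.0] * len(frame) if start <= t < start + width else list(frame)
              for t, frame in enumerate(spec)]
    return masked, (start, width)

rng = random.Random(0)
spec = [[1.0] * 40 for _ in range(100)]   # dummy 100-frame, 40-bin spectrogram

freq_masked, (f_start, f_width) = frequency_mask(spec, 8, rng)
both_masked, (t_start, t_width) = time_mask(freq_masked, 20, rng)

print(f"frequency mask: bins {f_start}..{f_start + f_width - 1}")
print(f"time mask: frames {t_start}..{t_start + t_width - 1}")
```

Both functions return new lists, so the same clean spectrogram can be re-masked with fresh random draws on every epoch.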

Figure 6: SpecAugment applied to a telephonic log-mel spectrogram. Left: original. Center-left: frequency masking (18 consecutive mel bins zeroed). Center-right: time masking (two time windows zeroed). Right: combined policy LD applied during training. The model must learn to reconstruct phoneme identity from incomplete spectral evidence, a proxy for the information loss that occurs across real telephonic channels.

We extend SpecAugment with a codec simulation layer applied at the waveform level before feature extraction. This layer probabilistically applies G.711 mu-law and A-law codec transformations, jitter and packet-loss simulation (5%-25% packet loss), and additive background noise drawn from a telephonic noise corpus: call center floor noise, ambient road noise captured on mobile devices, and fan/HVAC noise consistent with small office and home environments where callers typically originate.
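
A waveform-level simulation layer of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the order of operations (noise, then codec, then packet loss), the 20 ms packet size, the 50% codec probability, zero-filling of lost packets, and the use of the continuous mu-law formula in place of a bit-exact G.711 encoder. A-law and jitter are not modelled here.

```python
import math
import random

FS = 8000
PACKET = FS * 20 // 1000   # assumed 20 ms packets -> 160 samples
MU = 255

def mulaw_roundtrip(x):
    """Continuous mu-law compand, 8-bit quantise, expand (approximates G.711)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    code = min(255, int((y + 1) / 2 * 256))
    y_hat = (code + 0.5) / 256 * 2 - 1
    return math.copysign(math.expm1(abs(y_hat) * math.log1p(MU)) / MU, y_hat)

def power(x):
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def drop_packets(x, loss_rate, rng):
    """Zero whole packets independently with probability loss_rate."""
    out = list(x)
    for start in range(0, len(out), PACKET):
        if rng.random() < loss_rate:
            end = min(start + PACKET, len(out))
            out[start:end] = [0.0] * (end - start)
    return out

def simulate_channel(speech, noise, rng, codec_prob=0.5):
    """Apply one random draw of the simulated telephone channel."""
    snr_db = rng.uniform(-5.0, 20.0)      # SNR range quoted in the text
    loss_rate = rng.uniform(0.05, 0.25)   # packet-loss range quoted in the text
    x = mix_at_snr(speech, noise, snr_db)
    if rng.random() < codec_prob:
        x = [mulaw_roundtrip(v) for v in x]
    return drop_packets(x, loss_rate, rng)

rng = random.Random(7)
speech = [0.3 * math.sin(2 * math.pi * 300 * n / FS) for n in range(FS)]
noise = [rng.gauss(0.0, 1.0) for _ in range(FS)]

mixed = mix_at_snr(speech, noise, 5.0)
residual = [m - s for m, s in zip(mixed, speech)]
measured_snr_db = 10 * math.log10(power(speech) / power(residual))
degraded = simulate_channel(speech, noise, rng)

print(f"requested 5.0 dB, measured {measured_snr_db:.3f} dB")
```

Applying the simulation per utterance with fresh random draws exposes the model to a different channel realization each time an utterance is seen.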

The noise corpus is stratified by SNR: we maintain roughly equal training exposure across SNR ranges of -5 dB to 20 dB, with deliberate oversampling of the -5 dB to 5 dB range, which represents the acoustic conditions most likely to cause failures in production. For adversarial robustness, we employ a discriminator-based auxiliary objective inspired by Liu et al. (2018), which forces the acoustic encoder to produce representations that cannot be distinguished between clean and noisy versions of the same utterance.

[9] Liu, B., et al., ‘Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training,’ arXiv:1805.01357, ICASSP 2018.
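
One simple way to realize SNR stratification with oversampling of the hardest range is to sample an SNR bin with unequal weights and then draw uniformly inside it. The 5 dB bin width and the 2:1 weighting below are hypothetical; the text only states that the -5 dB to 5 dB range is oversampled.

```python
import random

# 5 dB strata covering -5 dB to 20 dB (bin width is an assumption).
SNR_BINS = [(-5, 0), (0, 5), (5, 10), (10, 15), (15, 20)]
# Hypothetical weights: the two lowest-SNR bins are drawn twice as often.
WEIGHTS = [2.0, 2.0, 1.0, 1.0, 1.0]

def sample_snr(rng):
    """Pick a stratum by weight, then a uniform SNR value inside it."""
    lo, hi = rng.choices(SNR_BINS, weights=WEIGHTS, k=1)[0]
    return rng.uniform(lo, hi)

rng = random.Random(0)
draws = [sample_snr(rng) for _ in range(20000)]
low_fraction = sum(d < 5 for d in draws) / len(draws)

print(f"share of draws below 5 dB: {low_fraction:.2f}")
```

With these weights roughly four sevenths of training examples fall below 5 dB, against two fifths under uniform sampling across the same bins.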

4.4 Self-Supervised Pretraining for Low-Resource Languages

For languages where labeled telephonic data falls below approximately 1,000 hours, we use a self-supervised pretraining phase following the wav2vec 2.0 framework. The architecture, introduced by Baevski et al. (2020), learns speech representations from unlabeled audio by solving a contrastive task over quantized latent representations at masked time steps. This produces a powerful general acoustic encoder that can be fine-tuned with as little as one hour of labeled data.

[10] Baevski, A., Zhou, Y., Mohamed, A., Auli, M., ‘wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,’ NeurIPS 2020.
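
The contrastive task can be written down compactly. For a masked time step, the context vector must identify the true quantized target among a set of distractors, scored by temperature-scaled cosine similarity. The sketch below computes that loss for a single time step on toy three-dimensional vectors; the vectors are made up for illustration, and the diversity loss and the quantizer itself are left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """-log softmax score of the true target among target + distractors."""
    logits = [cosine(context, q) / temperature for q in [target] + distractors]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

context = [1.0, 0.0, 0.0]

# Context points at the true target: low loss.
loss_good = contrastive_loss(context, [0.9, 0.1, 0.0],
                             [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# Context points at a distractor instead: high loss.
loss_bad = contrastive_loss(context, [0.0, 1.0, 0.0],
                            [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])

print(f"aligned with target: {loss_good:.4f}, aligned with distractor: {loss_bad:.2f}")
```

Since both targets and distractors come from quantizing the input audio, the codebook is shaped by whatever audio the model is pretrained on.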

Our self-supervised pretraining operates on unlabeled telephonic audio, not clean speech. This is critical: by pretraining on unlabeled telephonic data, we ensure that the quantized codebook the model learns during pretraining reflects the acoustic space of 8 kHz codec-distorted speech. Fine-tuning on labeled telephonic data then teaches the model to map these telephonic representations to text. The WavLM framework (Chen et al., 2022), which extends HuBERT with a denoising pretext task, provides additional motivation: denoising pseudo-label prediction on simulated noisy speech implicitly learns speaker identity and noise separation as side effects of the masked prediction objective.

[11] Chen, S., et al., ‘WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,’ IEEE/ACM TASLP, 2022.


Section 5

5. Evaluation: Kathbath Noisy and Real-World Telephonic Benchmarks

5.1 Why We Evaluate on Kathbath Noisy

The Kathbath benchmark suite, developed by AI4Bharat (Javed et al., 2023), is the closest India has to a rigorous standardized ASR evaluation for Indic languages. Kathbath includes both a clean (studio) variant and a noisy variant, Kathbath Noisy. The noisy variant introduces codec-realistic degradation and represents the acoustic conditions closest to real telephonic deployment.

We chose to report performance on Kathbath Noisy as our primary benchmark because clean-condition WER is a poor predictor of enterprise deployment performance. A model that achieves 8% WER on Kathbath clean and 25% on Kathbath Noisy is a worse production system than one that achieves 14% on clean and 17% on noisy, even though its headline number looks better.
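
The comparison in the previous paragraph can be made explicit with the ratio of noisy to clean WER, which is the degradation factor used in Table 1. The snippet below applies it to the two hypothetical models from that paragraph; the model names "A" and "B" are labels introduced here for illustration.

```python
# Hypothetical models from the paragraph above (WER in percent).
models = {
    "A": {"clean": 8.0, "noisy": 25.0},    # better headline number
    "B": {"clean": 14.0, "noisy": 17.0},   # better under deployment conditions
}

def degradation_factor(clean_wer, noisy_wer):
    """Noisy WER divided by clean WER; values near 1.0 mean little degradation."""
    return noisy_wer / clean_wer

factors = {name: degradation_factor(m["clean"], m["noisy"])
           for name, m in models.items()}
best_by_clean = min(models, key=lambda name: models[name]["clean"])
best_by_noisy = min(models, key=lambda name: models[name]["noisy"])

for name in models:
    print(f"model {name}: degradation factor {factors[name]:.2f}x")
print(f"best on clean: {best_by_clean}, best on noisy: {best_by_noisy}")
```

Ranking by clean WER picks model A, while ranking by noisy WER picks model B.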

5.2 Benchmark Results

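On Kathbath Noisy, averaging across all supported languages, Gnani.ai achieves 17.5% WER, against 19.1% (ElevenLabs), 19.9% (Sarvam 3.0), and 23.3% (Microsoft Azure) evaluated under identical conditions. Table 1 places these results in context by comparing clean and telephonic WER across training conditions: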

| Training Condition | Clean WER (%) | 8 kHz Telephonic WER (%) | Degradation Factor |
| --- | --- | --- | --- |
| Studio-only training (LibriSpeech baseline) | 3.0 | 17.7 – 40.9+ | 5x – 13x |
| Web-scale weak supervision (Whisper Large-v3) | 5.6 | 17.7 | 3.1x |
| Noisy telephonic in-domain training | 8 – 12 | 12.5 – 19.5 | ~1.5x |
| Telephonic + SpecAugment + codec simulation | 9 – 13 | ~13 – 17 | ~1.3x |
| Native telephonic corpus (matched train/test) | 11 – 14 | 10 – 15 | ~1.0x |

Table 1: WER comparison under clean and telephonic (8kHz) conditions. Degradation factor = telephonic WER / clean WER. Values approaching 1.0 indicate models built for telephonic deployment. Values above 3.0 indicate studio-trained models deployed into a mismatched acoustic environment.

Figure 7: Side-by-side WER comparison. Left: clean speech benchmark where studio-trained models look competitive. Right: Kathbath Noisy (telephonic proxy) where the performance hierarchy inverts. Gnani.ai’s telephonic-native training yields the lowest noisy WER (17.5%) while remaining competitive on clean speech. Annotations show relative improvement versus ElevenLabs and Microsoft Azure.

5.3 Performance Hierarchy Across Languages

Within our 12-language corpus, we observe consistent performance patterns. Languages with larger telephonic training corpora (Hindi, Tamil, Telugu, Kannada) achieve WER in the 12%-17% range on Kathbath Noisy. Languages where our corpus is thinner and self-supervised pretraining carries more weight (Assamese, Odia) achieve 22%-28% WER on the same benchmark. This performance hierarchy validates that telephonic data volume is the primary driver of noisy-condition performance, even when self-supervised pretraining provides a strong acoustic foundation. There is no architectural substitute for matched training data at sufficient scale.


Section 6

6. Implications for Teams Building or Evaluating Voice AI

6.1 Evaluate on the Acoustic Conditions of Your Deployment

The most common mistake in voice AI procurement and internal model development is evaluating on clean-speech benchmarks and assuming the results transfer to telephonic deployment. They do not. Require telephonic-condition WER from any ASR vendor. Specifically: require WER numbers on data at 8 kHz, after G.711 or Opus codec, at SNR conditions below 10 dB. These represent the acoustic conditions of the worst 20% of call volume, and the worst 20% is where customer experience and outcomes diverge most sharply.
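
When running such an evaluation in-house, WER is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words and aggregated over the test set. The sketch below is a minimal implementation under simple assumptions: whitespace tokenization, no text normalization, and made-up transcripts. Real evaluations usually also normalize casing, punctuation, and numerals before scoring.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            curr.append(min(prev[j] + 1,              # delete a reference word
                            curr[j - 1] + 1,          # insert a hypothesis word
                            prev[j - 1] + (r != h)))  # match or substitute
        prev = curr
    return prev[-1]

def corpus_wer(pairs):
    """WER over (reference, hypothesis) pairs: total edits / total reference words."""
    edits = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return edits / words

# Made-up transcripts standing in for an 8 kHz, low-SNR test set.
pairs = [
    ("the cat sat on the mat", "the cat sit on mat"),   # 1 substitution, 1 deletion
    ("hello world", "hello world"),                      # exact match
]
wer = corpus_wer(pairs)
print(f"corpus WER: {wer:.1%}")
```

Scoring the same utterances once at the original recording quality and once after the telephone channel gives the clean and telephonic WER figures needed to compare vendors on deployment conditions.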

6.2 The Augmentation Trap

Teams that build acoustic models internally frequently fall into what we call the augmentation trap: adding noise augmentation to a studio-trained model and expecting it to match a natively telephonic model. Augmentation improves robustness at the margins. It does not recover the acoustic priors baked into a model that learned speech representations from wideband clean data.

Narayanan and Wang (2013) showed that matched training and testing conditions are the primary driver of DNN acoustic model robustness, not post-hoc augmentation. Eickhoff et al. (2023) demonstrated systematic WER reduction from noise robustness pretraining, but noted that even noise-pretrained models still lagged native noisy-trained models on telephonic conditions.

[12] Narayanan, A., Wang, D., ‘Ideal Ratio Mask Estimation Using Deep Neural Networks for Robust Speech Recognition,’ ICASSP 2013.

[13] Eickhoff, P., et al., ‘Bring the Noise: Introducing Noise Robustness to Pretrained ASR,’ ICANN 2023. Springer LNCS 14260.

6.3 Data Is the Moat

For any team training ASR for Indic telephonic deployment, the strategic question is not which architecture to choose. Conformer, wav2vec 2.0 fine-tuning, and CTC-based end-to-end models are all capable of reaching competitive WER on well-matched data. The architectural choices are well-understood and largely converging. The competitive moat is data: how much real telephonic speech, in what languages, at what acoustic diversity, with what annotation quality.

Gnani.ai’s 14-million-hour telephonic corpus was not purchased or licensed. It was accumulated through 200+ enterprise client deployments over seven years, with each deployment adding to a continuously growing and evolving training distribution. The models trained on that corpus have been exposed to acoustic conditions that cover a representative sample of the Indian telephonic deployment surface.


Section 7

7. Open Research Questions

Several problems in this space remain genuinely unsolved.

Bandwidth expansion for Indic languages at inference time

Frequency super-resolution models show promise for reconstructing high-frequency content from 8 kHz audio before ASR processing. However, the domain mismatch problem applies here too: BWE models trained on English speech do not reconstruct Indic phonemes correctly. A BWE model trained on Indic telephonic speech at scale is an open engineering problem with significant potential upside for models trained on mixed-bandwidth data.

Cross-lingual transfer in low-resource Indic languages

For languages at the low-resource end of the Indic spectrum, the question of how to efficiently transfer acoustic representations from high-resource languages without importing the acoustic priors of those languages is unresolved. Multilingual pretraining (XLS-R, mSLAM) helps, but the interaction between cross-lingual transfer and telephonic domain specificity is not well characterized in the literature.

Continual learning under production distribution shift

Enterprise telephonic deployments evolve. Codec standards change. New device types introduce new channel characteristics. The question of how to update acoustic models continuously without catastrophic forgetting of prior telephonic conditions is an active research problem. Multi-domain experience replay and LoRA-based parameter-efficient adaptation offer partial solutions, but a principled framework for this in the telephonic domain does not yet exist.

[14] Chaudhry, A., et al., ‘Efficient Lifelong Learning with A-GEM,’ ICLR 2019.


Section 8

8. Conclusion

The gap between studio AI and production voice AI is not a fine-tuning problem. It is a data philosophy problem. Models trained on clean speech learn acoustic priors that telephone networks violate at every level: spectral bandwidth, channel coloring, codec quantization, transmission noise. No amount of augmentation, enhancement, or domain adaptation fully corrects for the wrong foundational distribution.

Building acoustic models for telephonic AI at enterprise scale requires building them on telephonic data. This means sourcing, curating, and annotating real call audio across the languages and acoustic conditions of the deployment surface. It means evaluating on telephonic benchmarks, not clean-speech leaderboards. And it means recognizing that the architecture choices, while important, are secondary to the data decisions.

The question is not which model achieves the lowest WER on LibriSpeech. The question is which model still works at 3am in a tier-3 city when the caller is on a 2G connection and has a ceiling fan running.

That is the problem Gnani.ai has been training for. Fourteen million hours at a time.


References

[1] Frontiers in Signal Processing, ‘Performance Evaluation of ASR Systems on Integrated Noise-Network Distorted Speech,’ 2022. https://doi.org/10.3389/frsip.2022.999457
[2] AssemblyAI / Quantumrun Foresight, Whisper Statistics Analysis, December 2025.
[3] ‘Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony,’ arXiv:2512.16401, January 2026.
[4] ibid., Frontiers in Signal Processing, 2022.
[5] ‘Revealing the Role of Audio Channels in ASR Performance Degradation,’ arXiv:2508.08967, August 2025.
[6] ‘When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems,’ arXiv:2512.17562, December 2025.
[7] IndicWav2Vec, AI4Bharat. Javed, T. et al., Interspeech 2021; Chadha, A. et al., 2022.
[8] Park, D.S., et al., ‘SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,’ Interspeech 2019. DOI: 10.21437/Interspeech.2019-2680
[9] Liu, B., et al., ‘Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training,’ arXiv:1805.01357, ICASSP 2018.
[10] Baevski, A., et al., ‘wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,’ NeurIPS 2020.
[11] Chen, S., et al., ‘WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,’ IEEE/ACM TASLP, 2022.
[12] Narayanan, A., Wang, D., ‘Ideal Ratio Mask Estimation Using Deep Neural Networks for Robust Speech Recognition,’ ICASSP 2013.
[13] Eickhoff, P., et al., ‘Bring the Noise: Introducing Noise Robustness to Pretrained ASR,’ ICANN 2023. Springer LNCS 14260.
[14] Chaudhry, A., et al., ‘Efficient Lifelong Learning with A-GEM,’ ICLR 2019.
[15] Gulati, A., et al., ‘Conformer: Convolution-Augmented Transformer for Speech Recognition,’ Interspeech 2020.
[16] Hsu, W.N., et al., ‘Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training,’ Interspeech 2021.
[17] Li, J., Deng, L., ‘Recent Progresses in Deep Learning Based Acoustic Models,’ IEEE/CAA JAS, 2018.
[18] Bhanushali, A., et al., ‘Gram Vaani ASR Challenge on Spontaneous Telephone Speech,’ Interspeech 2022.

Gnani.ai Research Lab
