If you have spent any time evaluating speech-to-text APIs, you have run into a wall of acronyms. WER. CER. VAD. LM. Diarization. Acoustic model. Confidence scores. Beam search. Vendors use these terms as if they are self-explanatory, and the effect is that procurement conversations get tangled in jargon before anyone has agreed on what the system actually needs to do.

This glossary covers every major term you will encounter when evaluating, benchmarking, or deploying a speech recognition API for enterprise use. Terms are grouped thematically rather than alphabetically so that related concepts sit together and the logic between them becomes clear. Each entry includes a plain-English definition and, where relevant, a note on why it matters specifically for Indian language deployments or enterprise contact center use cases.

Jump to Section

Accuracy and Evaluation Metrics

Audio and Signal Processing

Model Architecture and Training

Language and Linguistics

API and Deployment

Enterprise and Contact Center Features

Compliance and Data

Accuracy and Evaluation Metrics

These are the terms you will use most frequently during vendor evaluation and benchmarking. Understanding them precisely is the difference between making a defensible procurement decision and buying on faith.

Word Error Rate (WER)

Word error rate is the primary metric for measuring speech recognition accuracy. It calculates the percentage of words in a transcript that were wrong compared to what was actually spoken, counting three types of errors: substitutions (wrong word), deletions (missing word), and insertions (extra word that was not spoken).

The formula is: WER = (Substitutions + Deletions + Insertions) / Total Words in Reference Transcript.
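As a rough illustration of the formula, the sketch below counts word-level errors with a standard edit-distance alignment and divides by the reference length. It assumes plain whitespace tokenisation and no text normalisation (casing, punctuation, numerals), which real evaluations usually apply first. The function names and sample sentences are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, deletions, and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / number of words in the reference transcript."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return word_errors(reference, hypothesis) / ref_len


reference = "please pay the emi by friday"
hypothesis = "please pay the emi on friday today"
# One substitution (by -> on) and one insertion (today) over six reference words.
print(f"WER = {wer(reference, hypothesis):.1%}")
```

Because insertions count as errors but do not add to the denominator, WER can exceed 100% on very poor transcripts.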

A WER of 8% means the system made errors on 8 out of every 100 words. Lower is better. For English on clean audio, top-tier systems now achieve WER in the 3 to 6% range. For Indian languages on production telephony audio, the numbers vary significantly by language and vendor. Gnani STT API achieves under 4% WER for English on noisy telephony audio and delivers 10 to 20% better accuracy than other ASR providers across Indian languages.

WER is the starting point for any STT evaluation, but it should not be read in isolation. A full breakdown of how to use WER correctly in enterprise evaluations is in our guide What Is Word Error Rate? The Only STT Accuracy Metric That Actually Matters.

Character Error Rate (CER)

Character error rate applies the same substitution-deletion-insertion logic as WER, but at the character level rather than the word level. CER is particularly relevant for agglutinative languages, where words carry meaning through suffixes and inflections rather than through separate words.

For Indian languages like Tamil, Kannada, and Telugu, where verb forms and noun cases are expressed as morphological variations of a root word, CER can be a more stable accuracy signal than WER. A system that produces the right root word but the wrong inflection is charged a full word error under WER, yet only one or two character errors under CER. Reading the two metrics together shows whether a system is missing words entirely or getting roots right and suffixes wrong. Those suffix errors still matter for downstream NLP tasks like intent classification and entity extraction, so a low CER does not excuse them.

When evaluating STT APIs for Dravidian language use cases, always ask for both WER and CER.
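To make the difference between the two metrics concrete, here is a minimal sketch that runs the same edit-distance routine over words and over characters. The sample strings are a rough ASCII romanisation of a Kannada verb with one vowel changed in the suffix, used purely for illustration. A real evaluation would run on native-script text after Unicode normalisation, and conventions differ on whether spaces count as characters (here they do).

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    # Characters include spaces in this sketch; some toolkits strip them first.
    return edit_distance(reference, hypothesis) / len(reference)


reference = "naanu hoguttene"   # illustrative romanisation only
hypothesis = "naanu hoguttane"  # same root, one wrong vowel in the suffix

print(f"WER = {wer(reference, hypothesis):.1%}")
print(f"CER = {cer(reference, hypothesis):.1%}")
```

One wrong suffix character costs one of two words under WER but only one of fifteen characters under CER, which is why the two numbers diverge on morphologically rich languages.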

Sentence Error Rate (SER)

Sentence error rate measures the percentage of sentences in a transcript that contain at least one error. A transcript can have low WER but high SER if errors are distributed evenly rather than clustered. For compliance and quality monitoring use cases where the accuracy of entire utterances matters, not just individual words, SER is a more useful metric than WER alone.
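The sketch below illustrates that point with two hypothetical sets of transcripts that contain the same number of word errors. The sentences are invented for the example, and the helper names are assumptions rather than any vendor's API.

```python
def edit_distance(ref, hyp) -> int:
    """Word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def wer(pairs) -> float:
    """Corpus-level WER over (reference, hypothesis) sentence pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def ser(pairs) -> float:
    """Fraction of sentences containing at least one word error."""
    wrong = sum(1 for r, h in pairs if r.split() != h.split())
    return wrong / len(pairs)


references = [
    "pay the emi by friday",
    "your loan is now overdue",
    "please confirm your account number",
    "thank you for calling us",
]
# Four errors spread across four sentences.
spread = ["pay the emi on friday", "your lone is now overdue",
          "please confirm your account numbers", "thank you for calling as"]
# Four errors concentrated in the first sentence.
clustered = ["play a emi on monday"] + references[1:]

for name, hyps in (("spread", spread), ("clustered", clustered)):
    pairs = list(zip(references, hyps))
    print(f"{name}: WER = {wer(pairs):.0%}, SER = {ser(pairs):.0%}")
```

Both sets have the same WER, but every sentence is affected in the first set and only one in four in the second.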

Match Error Rate (MER)

A variant of WER that normalises against the total number of aligned word events (correct matches plus substitutions, deletions, and insertions) rather than the number of reference words. Unlike WER, MER can never exceed 100%. MER produces slightly different numbers than WER on the same audio, particularly when there are many insertions. Used in some research contexts but rarely in enterprise procurement. WER remains the standard for vendor comparisons.
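A small sketch of how the two metrics diverge, assuming the commonly used definition of MER as errors divided by correct matches plus errors. The counts are hypothetical and would normally come from an alignment tool.

```python
def wer_from_counts(hits: int, subs: int, dels: int, ins: int) -> float:
    # Reference length = words matched, substituted, or deleted.
    return (subs + dels + ins) / (hits + subs + dels)


def mer_from_counts(hits: int, subs: int, dels: int, ins: int) -> float:
    # Denominator also includes insertions, so the result stays within 0..1.
    return (subs + dels + ins) / (hits + subs + dels + ins)


# Hypothetical alignment: 4 correct words, 1 substitution, 3 inserted words.
counts = dict(hits=4, subs=1, dels=0, ins=3)
print(f"WER = {wer_from_counts(**counts):.0%}")
print(f"MER = {mer_from_counts(**counts):.0%}")
```

With no insertions the two formulas give the same value, which is why the gap only becomes visible on insertion-heavy output.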

Accuracy

Accuracy in speech recognition is typically expressed as 100% minus WER. A system with 8% WER has 92% accuracy. Vendors often quote accuracy rather than WER because the number looks larger and more reassuring. Always ask for WER directly. Accuracy percentages obscure the error breakdown and make it harder to compare systems on equal terms.

Ground Truth Transcript

The reference transcript used to calculate WER. It is a manual transcription of what was actually said, produced by a human annotator. The quality of your ground truth determines the reliability of your WER measurement. Poor annotation, missed words, or inconsistent transcription conventions in the ground truth will make WER calculations unreliable regardless of how well the STT system performs.

For Indian language evaluations, ground truth transcription should always be done by native speakers of each language. Non-native or bilingual annotators who are not fluent in the relevant language will introduce systematic errors that distort the benchmark.

Benchmark Dataset

A curated set of audio recordings with verified ground truth transcripts, used to measure and compare STT system performance. Published benchmarks like LibriSpeech (English audiobooks) and Common Voice (crowdsourced multilingual) are widely cited in vendor materials but reflect clean audio conditions that rarely match enterprise production environments.

For Indian enterprise deployments, published benchmarks are a poor proxy for real performance. The audio profile of Indian contact center calls (8kHz telephony, GSM codec compression, code-switching, background noise) is not represented in any major public benchmark dataset. For a step-by-step guide to building your own benchmark on Indian telephony audio, see How to Benchmark a Speech-to-Text API on Indian Languages Before You Sign Anything.

Confidence Score

A number between 0 and 1 that an STT system assigns to each word or phrase in a transcript, indicating how certain the model is about that transcription. A confidence score of 0.95 means the model is highly certain. A score of 0.4 means it is guessing.

Confidence scores are useful for flagging low-certainty segments for human review, for filtering transcripts before feeding them into downstream automation, and for building quality monitoring workflows that escalate uncertain transcriptions. In BFSI deployments where amounts, account numbers, and consent confirmations must be accurate, confidence scoring is not optional. For more on how confidence scores fit into enterprise STT deployments, see our guide on Speaker Diarization, Confidence Scores, Latency: The STT Features Enterprises Overlook Until It's Too Late.
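A minimal sketch of the review-flagging idea. The `Word` structure and the 0.6 threshold are assumptions for illustration; actual response formats and sensible thresholds vary by vendor and should be calibrated against your own audio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    confidence: float  # 0.0 (no certainty) to 1.0 (full certainty)


def flag_for_review(words: List[Word], threshold: float = 0.6) -> List[Word]:
    """Return the words whose confidence falls below the review threshold."""
    return [w for w in words if w.confidence < threshold]


# Hypothetical word-level output for one utterance.
utterance = [
    Word("transfer", 0.97),
    Word("fifty", 0.41),
    Word("thousand", 0.93),
    Word("rupees", 0.58),
]

flagged = flag_for_review(utterance)
print([w.text for w in flagged])
```

In a real workflow the flagged words would typically trigger human review of the whole utterance, since a low-confidence amount or account number makes the surrounding sentence unreliable too.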

Hallucination

When a speech recognition system generates words or phrases that were not present in the audio at all. Hallucinations are different from substitution errors: a substitution means the system heard something and got it wrong, a hallucination means the system invented content. Hallucinations tend to occur when audio quality is very poor, when the model encounters unfamiliar phoneme sequences, or when silence is misinterpreted as speech. In regulated deployments, hallucinations in call transcripts are a compliance risk, not just an accuracy problem.

Audio and Signal Processing

These terms describe what happens to audio before and during the transcription process. They matter for understanding why a system performs differently under different recording conditions.

Sample Rate (kHz)

The number of audio samples captured per second, expressed in kilohertz. Higher sample rates capture more acoustic detail. Standard telephony audio in India is typically 8kHz (narrowband), while wideband VoIP calls can reach 16kHz. Most published STT benchmarks are run on 16kHz or higher audio. This is one reason vendor benchmark numbers almost never transfer directly to Indian contact center environments, which predominantly operate on 8kHz telephony. A system that achieves 5% WER on 16kHz audio may show 12 to 18% WER on the same content at 8kHz.

Signal-to-Noise Ratio (SNR)

The ratio of the level of the desired audio signal (the speaker's voice) to the level of background noise, measured in decibels (dB). Higher SNR means cleaner audio. A call recorded in a quiet environment might have an SNR of 25 to 30dB. A field collections call from a crowded street in Kanpur might have an SNR below 10dB. WER degrades non-linearly as SNR drops. Most STT systems perform well above 20dB SNR and begin to degrade significantly below 15dB. For a full analysis of how audio environment affects transcription quality in Indian deployments, see Speech Recognition in Noisy and Rural India: Why Your STT Fails Where It Matters Most.
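For reference, SNR in decibels is ten times the base-10 logarithm of the ratio of signal power to noise power. The sketch below computes it from two lists of samples. It assumes you have separate speech-only and noise-only segments, which in practice usually have to be estimated (for example from pauses in the call).

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of the samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in dB = 10 * log10(signal power / noise power)."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


# Synthetic example: the signal amplitude is ten times the noise amplitude,
# so the power ratio is 100.
signal = [1.0, -1.0] * 100
noise = [0.1, -0.1] * 100
print(f"SNR = {snr_db(signal, noise):.1f} dB")
```

Because the scale is logarithmic, every 10dB drop means the noise power has grown tenfold relative to the speech.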

Codec

The compression algorithm used to encode and decode audio for transmission. Telephony codecs like GSM, G.711, and G.729 reduce file size for transmission but introduce audio artifacts. GSM codec in particular, which is used heavily on Indian mobile networks, introduces a characteristic distortion that many STT models trained on clean audio handle poorly. If your calls go over mobile networks, codec resilience should be a specific evaluation criterion when comparing STT APIs.

Voice Activity Detection (VAD)

A component that identifies which parts of an audio stream contain speech and which contain silence or background noise. VAD is used to skip non-speech segments before feeding audio to the transcription model, which improves both accuracy and processing efficiency. In multi-party call recordings, VAD also helps segment speech by identifying pauses between speakers. Poor VAD in noisy environments can result in noise being passed to the transcription model, increasing insertions and hallucinations.
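The simplest form of VAD is an energy threshold applied frame by frame, sketched below. This is an illustrative baseline only: the frame length (160 samples, or 20ms at 8kHz) and the threshold are assumptions, and production systems generally use trained models because a fixed energy threshold cannot tell loud background noise from speech.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_frames(samples: Sequence[float], frame_len: int = 160,
                  threshold: float = 0.01) -> List[bool]:
    """Mark a frame as speech when its energy exceeds the threshold."""
    return [energy > threshold for energy in frame_energies(samples, frame_len)]


# Synthetic audio: two silent frames followed by two loud frames.
silence = [0.0] * 320
loud = [0.5, -0.5] * 160
flags = speech_frames(silence + loud)
print(flags)
```

The failure mode described above is easy to see here: street noise with energy above the threshold would be flagged as speech and passed on to the transcription model.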

Audio Segmentation

The process of dividing a continuous audio stream into shorter segments for processing. Long recordings are typically segmented before transcription, either by time (fixed-length chunks) or by speech boundaries (silence-based splits). Poor segmentation can cause the model to split words across segments, producing errors at boundaries. For real-time streaming use cases, segmentation strategy directly affects both latency and accuracy.
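A sketch of fixed-length chunking with a small overlap, one common way to reduce errors at segment boundaries because a word cut at the end of one chunk appears whole at the start of the next. The chunk and overlap sizes are arbitrary illustration values, and reconciling the duplicated text in the overlap is left out.

```python
from typing import List, Tuple


def chunk_bounds(total_samples: int, chunk_len: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for overlapping fixed-length chunks."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and smaller than chunk_len")
    step = chunk_len - overlap
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + chunk_len, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the final chunk may be shorter than chunk_len
        start += step
    return bounds


# 25 seconds of 8kHz audio, 10-second chunks, 1-second overlap.
sample_rate = 8000
bounds = chunk_bounds(25 * sample_rate, 10 * sample_rate, 1 * sample_rate)
for start, end in bounds:
    print(f"{start / sample_rate:.0f}s to {end / sample_rate:.0f}s")
```

Shorter chunks lower latency in streaming use cases but give the model less context per segment, which is the latency and accuracy trade-off mentioned above.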

Noise Robustness

The ability of a speech recognition system to maintain acceptable accuracy when audio contains background noise, reverberation, or other distortions. Noise robustness is typically measured by comparing WER on clean audio against WER on degraded audio at various SNR levels. Systems trained exclusively on clean or studio-quality data tend to degrade sharply in real production environments. Noise robustness is particularly critical for Indian enterprise deployments where call audio quality is variable and often poor.

Beamforming

A signal processing technique used in multi-microphone setups that focuses audio capture in a specific direction, reducing noise from other directions. Relevant in meeting room transcription and device-side speech capture but not typically applicable to single-channel telephony recordings.

8kHz vs 16kHz Audio

The two most common sample rates in enterprise speech processing. 8kHz is narrowband telephony, the standard for traditional PSTN and most mobile calls in India. 16kHz is wideband, used in modern VoIP systems and many recorded audio formats. STT systems optimised for 16kHz audio will often upsample 8kHz input, but upsampling does not recover lost acoustic information. A system that explicitly trains on 8kHz telephony audio will outperform one that upsamples on Indian contact center deployments.
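One way to see why upsampling cannot restore detail: a sample rate can only represent frequencies up to half its value (the Nyquist limit), so 8kHz audio carries nothing above 4kHz. The sketch below samples a 3kHz tone and a 5kHz tone at both rates. It is a synthetic demonstration under the assumption of no anti-aliasing filter; real telephony removes the higher frequencies before sampling rather than folding them, but the information is gone either way.

```python
import math
from typing import List


def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def sample_tone(freq_hz: float, sample_rate_hz: float, n_samples: int) -> List[float]:
    """Sample a unit-amplitude sine tone at the given rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(n_samples)]


def same_up_to_sign(a: List[float], b: List[float], tol: float = 1e-9) -> bool:
    """True if the sample lists are identical or exact phase inversions of each other."""
    identical = all(abs(x - y) < tol for x, y in zip(a, b))
    inverted = all(abs(x + y) < tol for x, y in zip(a, b))
    return identical or inverted


collide_at_8k = same_up_to_sign(sample_tone(3000, 8000, 64), sample_tone(5000, 8000, 64))
collide_at_16k = same_up_to_sign(sample_tone(3000, 16000, 64), sample_tone(5000, 16000, 64))

print(f"Nyquist limit at 8kHz: {nyquist_hz(8000):.0f} Hz")
print(f"3kHz and 5kHz tones indistinguishable at 8kHz: {collide_at_8k}")
print(f"3kHz and 5kHz tones indistinguishable at 16kHz: {collide_at_16k}")
```

At 8kHz the two tones produce the same sample values apart from sign, so no amount of later upsampling can tell them apart.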

Model Architecture and Training

These terms describe how speech recognition models are built. You do not need deep technical knowledge of them to make a procurement decision, but understanding them helps you ask better questions about why a vendor's system performs the way it does.

Automatic Speech Recognition (ASR)

The broader technical term for the technology that converts spoken audio into text. ASR encompasses the full pipeline from raw audio input to text output, including signal processing, acoustic modeling, language modeling, and decoding. Speech-to-text (STT) is often used interchangeably, though technically STT refers specifically to the output function. When a vendor's documentation says ASR, they mean the same underlying capability as STT.

Acoustic Model

The component of a speech recognition system that maps audio signals to phonetic units. It learns the relationship between the acoustic features of speech (how sounds look as frequency patterns) and the phonemes or sub-word units that make up words. The acoustic model is where most of the heavy lifting in speech recognition happens, and it is the component most affected by audio quality, accents, and recording conditions. A model trained primarily on American English acoustic data will perform poorly on Hindi, Tamil, or any language with a different phoneme inventory.

Language Model (LM)

The component that assigns probabilities to word sequences, helping the system choose between acoustically similar options. If the acoustic model hears a sound that could be "loan" or "lone," the language model uses context to pick the more likely word. Language models are domain-specific: a general language model will be less accurate than one fine-tuned on financial services vocabulary, because it assigns lower probability to terms like "NPA," "EMI," or "DPD" that appear frequently in BFSI calls.
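The "loan" versus "lone" choice can be sketched as combining an acoustic score with a language model score in log space. Everything numeric below is a made-up toy value chosen to illustrate the mechanism, not a measurement, and the bigram table stands in for whatever language model a real system uses.

```python
import math
from typing import Dict, Tuple

# Toy bigram probabilities P(word | previous word). Hypothetical values.
BIGRAM_PROB: Dict[Tuple[str, str], float] = {
    ("home", "loan"): 0.05,
    ("home", "lone"): 0.0001,
    ("a", "loan"): 0.001,
    ("a", "lone"): 0.002,
}


def pick_word(prev_word: str, acoustic_probs: Dict[str, float],
              lm_weight: float = 1.0, floor: float = 1e-6) -> str:
    """Choose the candidate with the best combined acoustic + LM log score."""
    def score(word: str) -> float:
        lm_prob = BIGRAM_PROB.get((prev_word, word), floor)  # floor for unseen pairs
        return math.log(acoustic_probs[word]) + lm_weight * math.log(lm_prob)
    return max(acoustic_probs, key=score)


# The acoustic model slightly prefers "lone", but context decides.
acoustic = {"loan": 0.48, "lone": 0.52}
print(pick_word("home", acoustic))
print(pick_word("a", acoustic))
```

A domain-tuned language model effectively raises the probabilities of terms like "EMI" or "NPA" in this table, which is how it tips close acoustic decisions toward the right vocabulary.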

End-to-End Model

A speech recognition architecture that combines acoustic modeling, language modeling, and decoding into a single neural network trained jointly, rather than as separate components. End-to-end models have largely replaced hybrid architectures in production systems. They are generally more accurate and easier to fine-tune for new domains, but require large amounts of labeled training data.

Fine-Tuning

The process of taking a pre-trained speech recognition model and further training it on domain-specific data to improve accuracy on that domain. A general-purpose STT model fine-tuned on financial services call data will have significantly lower WER on BFSI vocabulary than the same model without fine-tuning. Fine-tuning is the mechanism that gets WER down on high-stakes vocabulary like amounts, product names, and regulatory terminology.

Training Data

The audio recordings and corresponding transcripts used to train a speech recognition model. The composition of training data determines what the model will and will not handle well. A model trained primarily on read speech (news broadcasts, audiobooks) will struggle with spontaneous conversational speech. A model trained on American or British English will not handle Indian accents, code-switching, or regional language phonemes without significant degradation. For Indian enterprise deployments, training data provenance is one of the most important questions to ask a vendor.

Transfer Learning

A technique where a model trained on one task or dataset is adapted to a different but related task. In speech recognition, transfer learning is used to build models for low-resource languages (languages with limited training data) by starting from a model trained on a higher-resource language. Transfer learning is one reason some vendors can claim support for languages like Odia or Assamese without having built those models from scratch, but the quality of transferred models varies significantly.

Beam Search

The decoding algorithm that searches for the most likely word sequence given the acoustic and language model scores. Beam search explores multiple candidate transcriptions simultaneously (the "beam width") and returns the highest-scoring result. Wider beams are more accurate but slower. For real-time applications, beam width is one of the parameters that trades off latency against accuracy.
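A toy sketch of the idea, using a hand-written probability table in place of real acoustic and language model scores. The table and token names are hypothetical and were chosen so that the greedy choice (beam width 1) misses the best overall sequence.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]

# Toy model: next-token probabilities given the tokens decoded so far.
TOY_MODEL: Dict[Sequence, Dict[str, float]] = {
    (): {"lone": 0.55, "loan": 0.45},
    ("lone",): {"ranger": 0.4, "amount": 0.3, "wolf": 0.3},
    ("loan",): {"amount": 0.9, "shark": 0.1},
}


def toy_next_probs(prefix: Sequence) -> Dict[str, float]:
    return TOY_MODEL[prefix]


def beam_search(next_probs: Callable[[Sequence], Dict[str, float]],
                steps: int, beam_width: int) -> List[Tuple[Sequence, float]]:
    """Keep the beam_width best partial sequences at every step."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]  # (tokens, log probability)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, prob in next_probs(seq).items():
                candidates.append((seq + (token,), logp + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune everything outside the beam
    return beams


for width in (1, 2):
    best_seq, best_logp = beam_search(toy_next_probs, steps=2, beam_width=width)[0]
    print(f"beam width {width}: {' '.join(best_seq)} (p = {math.exp(best_logp):.3f})")
```

With width 1 the decoder commits to the locally best first word and cannot recover. With width 2 it keeps the runner-up alive long enough to find the higher-scoring sequence, at the cost of scoring more candidates per step.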

Vocabulary / Lexicon

The set of words a speech recognition system knows. Out-of-vocabulary (OOV) words, terms not in the system's lexicon, are a common source of substitution errors. In enterprise deployments, OOV errors often fall on product names, acronyms, regulatory terms, and proper nouns that were not in the training data. Custom vocabulary injection, the ability to add domain-specific terms to the model's lexicon, is a critical feature for BFSI, insurance, and healthcare deployments.

Language and Linguistics

These terms describe the linguistic challenges that make Indian language speech recognition harder than the vendor brochure suggests.

Code-Switching

The practice of alternating between two or more languages within a single conversation, or even within a single sentence. In Indian enterprise environments, code-switching between Hindi and English (Hinglish), Tamil and English (Tanglish), and other language pairs is not an edge case. It is how agents and customers naturally communicate.

Code-switching takes two forms, and they are not equally hard for speech recognition. Inter-sentential switching, where the speaker changes language between sentences, gives the model a natural boundary to reset its language assumption. Intra-sentential switching, where a speaker inserts English vocabulary into a Hindi grammatical structure mid-sentence, is harder because the model must handle two phoneme systems and two vocabularies simultaneously without a clear boundary. For a full treatment of why this breaks most STT systems and what good code-switching support looks like, see Why Hinglish Breaks Most STT APIs: The Code-Switching Problem in Indian Voice AI.

Language Identification (LID)

The component of a speech recognition system that detects which language is being spoken, either at the utterance level or at the frame level. Frame-level LID is necessary for handling intra-sentential code-switching: the system needs to identify language switches at the sub-word level to correctly apply the right acoustic and language model context. Most off-the-shelf STT systems use utterance-level LID, which fails completely on mid-sentence switches.

Phoneme

The smallest unit of sound that distinguishes one word from another in a given language. English has approximately 44 phonemes. Hindi has around 58. Tamil has distinct phonemes that do not exist in any Indo-European language. This phonemic diversity is one reason English-trained ASR systems perform poorly on Indian languages: the acoustic model has never learned to distinguish sounds that are phonemically irrelevant in English but phonemically critical in Tamil or Telugu.

Agglutinative Language

A language type where words are formed by combining multiple morphemes (meaning units) into a single word, with each morpheme carrying a distinct grammatical or semantic function. Tamil, Telugu, Kannada, Malayalam, and Odia are agglutinative to varying degrees. In agglutinative languages, a single word can carry information that English would express across several words. This creates challenges for WER-based evaluation (a single morphological error counts as one word error but may change the meaning of an entire clause) and for vocabulary coverage (the number of unique word forms is much larger than in English).

Accent and Dialect Variation

The variation in pronunciation patterns across speakers from different regions, educational backgrounds, or age groups. Indian English spoken by a Tamilian and Indian English spoken by someone from Delhi have distinct phonemic differences. Within Hindi, the pronunciation of Bhojpuri-influenced speakers from eastern UP differs from that of speakers from Delhi or Rajasthan. STT systems that do not explicitly train on regional accent variation will show higher WER for speakers whose accent deviates from the training distribution.

Spontaneous Speech

Natural, unscripted speech, as opposed to read speech (where a speaker reads from a text). Spontaneous speech contains disfluencies: filled pauses ("um," "uh," "acha"), false starts, incomplete sentences, and conversational backchannels. Contact center calls are spontaneous speech. Most public benchmark datasets are read speech. This is another reason benchmark WER numbers do not transfer to production contact center environments.

Disfluency

Non-standard speech events that occur in natural conversation: filled pauses ("um," "uh," "haan"), repetitions, false starts, and self-corrections. Disfluencies do not carry semantic content but affect WER measurement and can confuse language models. Some STT systems are configured to filter common disfluencies from transcripts; others transcribe them verbatim. For compliance use cases where everything said on a call must be recorded, verbatim transcription of disfluencies matters.

Out-of-Vocabulary (OOV) Word

A word that appears in the audio but is not in the STT system's vocabulary or language model. OOV words are typically transcribed as the closest-sounding in-vocabulary word (substitution error) or omitted entirely (deletion error). For enterprise deployments, the most consequential OOV errors fall on product names, brand names, regulatory acronyms, and domain-specific terminology. Custom vocabulary injection is the standard solution.
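Before a pilot, it can be useful to estimate how exposed you are to OOV errors by checking your own ground truth transcripts against a known word list. The sketch below assumes a closed, word-level vocabulary; the vocabulary, the product name "suraksha", and the function names are all hypothetical.

```python
from typing import List, Set


def oov_words(transcript_words: List[str], vocabulary: Set[str]) -> List[str]:
    """Words in the transcript that the vocabulary does not contain."""
    return [w for w in transcript_words if w.lower() not in vocabulary]


def oov_rate(transcript_words: List[str], vocabulary: Set[str]) -> float:
    """Share of transcript words that are out of vocabulary."""
    return len(oov_words(transcript_words, vocabulary)) / len(transcript_words)


vocabulary = {"your", "emi", "on", "the", "loan", "is", "due"}
words = "your emi on the suraksha loan is due".split()

print(oov_words(words, vocabulary))
print(f"OOV rate = {oov_rate(words, vocabulary):.1%}")
```

The words this surfaces are natural candidates for a custom vocabulary list.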

API and Deployment

These terms describe how STT systems are delivered and integrated into enterprise infrastructure.

Speech-to-Text API

A cloud service that accepts audio input and returns text transcriptions via an application programming interface. STT APIs handle all model inference on the vendor's infrastructure, requiring no local model deployment. They are the standard delivery mechanism for enterprise STT. Key evaluation dimensions include accuracy (WER by language), latency, supported audio formats, language coverage, and additional features like diarization and confidence scoring.

Real-Time Transcription (Streaming)

A transcription mode where audio is processed and text is returned as the speaker talks, with minimal delay. Real-time transcription uses a streaming protocol (typically WebSocket) to send audio chunks to the model and receive partial transcripts as output. It is used in live agent assist, real-time compliance monitoring, voice bots, and any application where the transcript must be available before the call ends. Latency is the critical metric: the time between a word being spoken and that word appearing in the transcript. For live voice applications, P95 latency under 200ms is the production-ready threshold. For more on choosing between real-time and batch modes, see Real-Time vs Batch Transcription: Which STT Mode Does Your Contact Center Actually Need.

Batch Transcription

A transcription mode where completed audio recordings are submitted to the STT API and transcripts are returned after processing. Batch transcription is used for post-call analytics, quality monitoring, compliance review, and any use case where the transcript is needed after the conversation rather than during it. Batch processing typically achieves higher accuracy than real-time streaming because the model has access to the full audio context before producing output.

WebSocket

The network protocol most commonly used for real-time STT streaming. WebSocket maintains a persistent connection between the client and the STT API, allowing continuous audio data to flow to the server and continuous partial transcripts to flow back. If your real-time transcription use case involves voice bots or live agent assist, your STT API must support WebSocket streaming.

Latency

The delay between an audio event and the corresponding output. For STT, latency is measured as the time between a word being spoken and that word appearing in the transcript. Latency has two components: network latency (the time for audio to travel to the API server and the transcript to return) and processing latency (the time the model takes to transcribe). For real-time voice applications, processing latency is the critical variable. P95 latency is the standard enterprise measurement: the latency value below which 95% of all requests fall.
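A short sketch of computing P95 from measured latencies, using the nearest-rank method (one of several percentile conventions; others interpolate between values). The latency figures are invented for the example.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Smallest value such that at least pct percent of values are <= it."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical per-request latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [120] * 18 + [450, 900]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(f"mean = {mean_ms:.1f} ms, P50 = {p50_ms} ms, P95 = {p95_ms} ms")
```

The mean and median both look comfortable here while P95 does not, which is why averages alone are a weak basis for a latency commitment.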

On-Premises Deployment

A deployment model where the STT model runs on the customer's own infrastructure rather than in the vendor's cloud. On-premises deployment is required by some BFSI and government customers for data residency and security reasons. It introduces additional operational complexity and typically requires dedicated engineering to set up and maintain, but eliminates the data sovereignty concerns associated with sending sensitive call audio to a third-party cloud. For BFSI compliance considerations around deployment models, see Speech Recognition for BFSI: What Indian Banks and NBFCs Must Verify Before Going Live.

Custom Vocabulary Injection

The ability to add domain-specific words, phrases, product names, or acronyms to the STT system's vocabulary so they are transcribed correctly. Custom vocabulary injection improves accuracy on OOV terms without requiring full model fine-tuning. Most enterprise STT APIs support it. The method varies: some vendors accept simple word lists, others accept phonetic spellings, and some support weighted boosting of specific terms.

Webhook

A callback mechanism that notifies your system when an asynchronous transcription job is complete, rather than requiring your application to poll the API repeatedly. Used in batch transcription workflows to reduce infrastructure overhead.

Speaker Diarization

The process of segmenting a transcript by speaker, answering the question "who spoke when?" rather than just "what was said." In a two-party contact center call, diarization separates agent speech from customer speech. In multi-party calls or meetings, it identifies each individual speaker. Diarization is critical for quality monitoring, compliance review, and agent coaching, where it matters not just what was said but which party said it. Diarization accuracy is measured separately from WER. A system can have excellent WER but poor diarization, correctly transcribing words while misattributing them to the wrong speaker. For a full breakdown of diarization as an enterprise evaluation criterion, see Speaker Diarization, Confidence Scores, Latency: The STT Features Enterprises Overlook Until It's Too Late.
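To show how transcription and diarization combine, the sketch below attaches a speaker label to each word by checking which speaker turn contains the word's midpoint. The tuple layouts, timestamps, and labels are assumptions for illustration; vendors return this information in different shapes, and many do the attribution for you.

```python
from typing import List, Tuple

TimedWord = Tuple[str, float, float]    # (text, start_sec, end_sec)
SpeakerTurn = Tuple[str, float, float]  # (speaker, start_sec, end_sec)


def assign_speakers(words: List[TimedWord],
                    turns: List[SpeakerTurn]) -> List[Tuple[str, str]]:
    """Label each word with the speaker whose turn contains the word's midpoint."""
    labelled = []
    for text, start, end in words:
        midpoint = (start + end) / 2
        speaker = next(
            (name for name, t0, t1 in turns if t0 <= midpoint < t1),
            "unknown",  # no turn covers this word
        )
        labelled.append((speaker, text))
    return labelled


words = [("hello", 0.0, 0.4), ("sir", 0.4, 0.7), ("yes", 1.2, 1.5)]
turns = [("agent", 0.0, 1.0), ("customer", 1.0, 2.0)]
print(assign_speakers(words, turns))
```

If the turn boundary were misplaced at 1.4 seconds instead of 1.0, every word would still be transcribed correctly but "yes" would be attributed to the agent, which is exactly the failure WER cannot detect.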

Enterprise and Contact Center Features

These terms describe the capabilities that matter specifically for enterprise deployments in contact centers, BFSI, and regulated industries.

Speech Analytics

The application of speech recognition and NLP to call recordings to extract business intelligence: topics discussed, sentiment, compliance adherence, agent performance, and customer intent. Speech analytics is the downstream application that STT enables. The quality of speech analytics output is directly constrained by the quality of the underlying STT transcripts. WER of 15% or higher in a contact center environment typically makes speech analytics unreliable because too many errors cascade into incorrect intent classification and sentiment labeling.

Call Transcription

The conversion of recorded call audio into a text transcript. Call transcription is the most common enterprise STT use case. It is used as the input for speech analytics, compliance monitoring, agent coaching, and CRM data enrichment. For Indian contact centers, accurate call transcription requires STT systems trained specifically on Indian telephony audio, regional language support, and code-switching handling.

Natural Language Understanding (NLU)

The layer of processing that interprets the meaning, intent, and entities in a transcribed text. NLU is downstream of STT: ASR converts audio to text, NLU extracts meaning from that text. The distinction matters for enterprise buyers because STT and NLU are separate systems with separate accuracy metrics. A highly accurate STT transcript does not guarantee accurate NLU output if the NLU model was not trained on your domain vocabulary. Conversely, even a moderately accurate transcript can produce acceptable NLU output if errors do not fall on intent-bearing words.

Intent Detection

An NLU task that classifies what a speaker is trying to accomplish. In a collections contact center, intent detection might classify a customer statement as "dispute," "payment promise," "payment made," or "escalation request." Intent detection accuracy depends on STT accuracy: if the STT transcript misses or substitutes the words that carry intent, the classification will fail.

Entity Extraction

An NLU task that identifies and extracts structured information from speech: amounts, dates, account numbers, names, locations. In BFSI, entity extraction from call transcripts is used to populate CRM records, verify compliance with disclosure requirements, and flag discrepancies between what agents said and what policies require. Entity extraction is highly sensitive to STT accuracy on domain-specific vocabulary.

Post-Call Analytics

Analysis performed on call recordings after the call has ended, as opposed to real-time analysis during the call. Post-call analytics typically uses batch transcription for higher accuracy and involves speech analytics, quality scoring, compliance flagging, and agent performance review. Most contact center deployments start with post-call analytics before moving to real-time use cases, because the accuracy and latency requirements are more forgiving.

Real-Time Agent Assist

A use case where STT transcription and NLU analysis happen during a live call, providing the agent with real-time prompts, suggested responses, or compliance alerts. Real-time agent assist requires streaming STT with P95 latency under 200ms, plus NLU inference time. It is technically more demanding than post-call analytics and requires both high accuracy and consistent low latency under production load.

Quality Monitoring

The use of call transcription and speech analytics to evaluate agent performance, compliance adherence, and customer experience across a sample of calls. Historically done by human reviewers on a small percentage of calls (typically 2 to 5%). Automated quality monitoring using STT and NLP can cover 100% of calls. Accuracy requirements for quality monitoring are high because false positives (flagging compliant calls as non-compliant) and false negatives (missing actual violations) both have real costs.

Compliance and Data

These terms matter for enterprise buyers in regulated sectors, particularly BFSI and insurance.

Data Residency

The requirement that data (in this case, call audio and transcripts) be stored and processed within a specific geographic boundary, typically a country or regulatory jurisdiction. For Indian enterprises operating under DPDP Act requirements or RBI guidelines, data residency may require that call audio not be sent to offshore cloud infrastructure. On-premises or India-region cloud deployment of STT APIs is the standard solution. Any STT vendor evaluation for a regulated Indian enterprise should confirm where audio data is processed and stored.

PII Redaction

The automatic identification and removal or masking of personally identifiable information from call transcripts before storage or downstream processing. PII in call transcripts includes customer names, Aadhaar numbers, PAN numbers, account numbers, phone numbers, and financial details. Most enterprise STT APIs offer PII redaction as a feature. The accuracy of PII redaction is a separate evaluation criterion from WER: a system can transcribe accurately but miss PII redaction, creating a compliance exposure.
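A minimal pattern-based sketch of masking, covering only identifiers with a fixed written format: PAN (five letters, four digits, one letter), Aadhaar (twelve digits, often grouped in fours), and ten-digit mobile numbers. It is not how any particular vendor implements redaction. It assumes the transcript already renders numbers as digits, and it will not catch names, addresses, or numbers transcribed as words, which need a trained entity model.

```python
import re

PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")
PHONE_RE = re.compile(r"\b[6-9]\d{9}\b")  # assumes Indian mobile numbers without country code


def redact(text: str) -> str:
    """Replace recognisable identifiers with placeholder tags."""
    text = PAN_RE.sub("[PAN]", text)
    text = AADHAAR_RE.sub("[AADHAAR]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


transcript = "my pan is ABCDE1234F and aadhaar 1234 5678 9012 call 9876543210"
print(redact(transcript))
```

Redaction quality should be measured like any other detector, with recall (how much PII was caught) mattering most, since a single missed identifier is the compliance exposure described above.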

DPDP Act

India's Digital Personal Data Protection Act, 2023, which governs how the digital personal data of individuals in India is collected, processed, and stored. For enterprises using STT APIs that process call recordings, DPDP Act compliance requires clear consent frameworks, data minimization, and defined retention policies for audio and transcript data. Voice biometrics (speaker recognition and voiceprint data) are personal data under the Act, and because they can identify an individual, they should be collected only with explicit, recorded consent.

RBI Guidelines on Call Recording

The Reserve Bank of India requires regulated entities including banks and NBFCs to maintain records of customer interactions. For contact centers using automated transcription, this creates requirements around transcript accuracy, retention, and tamper-evidence. STT deployments in BFSI must account for the retention and auditability requirements of call transcript records as well as the underlying audio. For a full breakdown of what BFSI enterprises need to validate before going live with speech recognition, see Speech Recognition for BFSI: What Indian Banks and NBFCs Must Verify Before Going Live.
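One common way to make stored transcripts tamper-evident is a hash chain, where each record's hash covers the previous record's hash. The sketch below is an illustration of that general technique using Python's standard `hashlib` and `json` modules; the record fields are hypothetical, and nothing here is prescribed by RBI or drawn from a specific vendor's implementation.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash for the first record in a chain


def _digest(prev_hash, record):
    # Serialize with sorted keys so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain_records(records):
    """Wrap each transcript record with the previous hash and its own hash."""
    prev_hash = GENESIS_HASH
    chained = []
    for record in records:
        current = _digest(prev_hash, record)
        chained.append({"record": record, "prev_hash": prev_hash, "hash": current})
        prev_hash = current
    return chained


def verify_chain(chained):
    """Return True only if no record or link has been altered."""
    prev_hash = GENESIS_HASH
    for entry in chained:
        expected = _digest(prev_hash, entry["record"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = expected
    return True


# Hypothetical transcript records.
log = chain_records([
    {"call_id": "c-001", "text": "This call is being recorded."},
    {"call_id": "c-002", "text": "I agree to the terms."},
])
```

Editing any stored transcript changes its hash and breaks every later link, so an auditor can detect the change by re-running `verify_chain`. A production system would also need to protect the latest hash itself, for example by storing it separately.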

Voiceprint / Speaker Embedding

A mathematical representation of a speaker's voice characteristics, used for speaker verification and identification. Voiceprints are biometric data under most regulatory frameworks, including India's DPDP Act. Enterprises using speaker identification features in their STT deployments need explicit consent frameworks and data handling policies for voiceprint data. This is distinct from speaker diarization, which identifies speaker turns within a call without necessarily identifying who the speaker is.

Consent Management

The process of obtaining, recording, and managing customer consent for call recording and transcription. In outbound voice AI and automated collection calls, TRAI regulations require disclosing that the call is automated and obtaining consent for recording. The consent disclosure and the customer's response must themselves be accurately transcribed if consent records are to be legally defensible.

Frequently Asked Questions

What is the difference between ASR and STT?
ASR (Automatic Speech Recognition) is the broader technical term for the technology that converts spoken audio into text. STT (Speech-to-Text) refers specifically to the output function. In practice, vendors and enterprise buyers use the terms interchangeably. When you see ASR in a vendor's technical documentation, it means the same capability as STT in their marketing materials.

What is a good word error rate for enterprise STT?
For English on production telephony audio, below 8% WER is strong. For Indian languages on the same audio profile, the acceptable threshold depends on the use case. Quality monitoring can tolerate higher WER than compliance transcription or entity extraction. Gnani STT API achieves under 4% WER for English on noisy telephony and delivers 10 to 20% better accuracy than comparable providers across Indian languages. For the full WER benchmarking framework, see What Is Word Error Rate? The Only STT Accuracy Metric That Actually Matters.

What is speaker diarization and why does it matter?
Speaker diarization segments a transcript by speaker, identifying who said what. In a contact center call, it separates agent speech from customer speech. Without diarization, post-call analytics cannot distinguish agent language from customer language, making quality monitoring and compliance review significantly harder.
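To show what diarized output makes possible, here is a minimal sketch that splits a call into per-speaker text and talk time. The segment format (speaker label, start and end time in seconds, text) is an assumption for illustration; actual API responses differ by vendor.

```python
from collections import defaultdict


def split_by_speaker(segments):
    """Group diarized segments into per-speaker text and total talk time."""
    texts = defaultdict(list)
    talk_time = defaultdict(float)
    for seg in segments:
        texts[seg["speaker"]].append(seg["text"])
        talk_time[seg["speaker"]] += seg["end"] - seg["start"]
    return {
        speaker: {"text": " ".join(texts[speaker]), "talk_time_s": talk_time[speaker]}
        for speaker in texts
    }


# Hypothetical diarized segments from one call.
call = [
    {"speaker": "agent", "start": 0.0, "end": 2.5, "text": "Thank you for calling."},
    {"speaker": "customer", "start": 2.5, "end": 6.0, "text": "I want to check my balance."},
    {"speaker": "agent", "start": 6.0, "end": 8.0, "text": "Sure, one moment."},
]
by_speaker = split_by_speaker(call)
```

With the agent's text isolated, a compliance check for a mandatory disclosure can run on agent speech alone rather than on the whole call.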

What Indian languages does Gnani STT API support?
Gnani STT API supports 12 languages: Hindi, Tamil, Telugu, Malayalam, Kannada, Odia, Marathi, Punjabi, Gujarati, Bengali, Assamese, and English.

What should I look for beyond WER when evaluating an STT API?
Latency (P95 under 200 ms for real-time use cases), diarization accuracy, confidence scoring, custom vocabulary support, on-premises deployment options, PII redaction, and domain-specific WER on your actual vocabulary. For a full evaluation framework, see How to Benchmark a Speech-to-Text API on Indian Languages Before You Sign Anything.
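P95 latency is the value that 95% of requests come in at or under. The sketch below computes it with the nearest-rank method from a list of measured latencies in milliseconds; the sample values are made up, and other percentile definitions (for example, interpolated ones) can give slightly different numbers.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a list of latency measurements."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical measurements: three fast responses and one slow outlier.
samples_ms = [120, 150, 180, 900]
mean_ms = sum(samples_ms) / len(samples_ms)   # 337.5
tail_ms = p95(samples_ms)                     # 900
```

The mean hides the slow request that a live caller would actually notice, which is why real-time evaluations usually quote a tail percentile rather than an average.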

What is the difference between real-time and batch transcription?
Real-time transcription processes audio as it is spoken and returns transcripts with minimal delay. Batch transcription processes completed recordings after the call ends. Real-time is used for live agent assist and voice bots. Batch is used for post-call analytics and quality monitoring. Batch typically achieves higher accuracy because the model processes the full audio context. For a detailed comparison, see Real-Time vs Batch Transcription: Which STT Mode Does Your Contact Center Actually Need.

This glossary is part of The India STT Handbook by Gnani, a practitioner's guide to speech-to-text for Indian enterprises. Gnani STT API supports 12 languages: Hindi, Tamil, Telugu, Malayalam, Kannada, Odia, Marathi, Punjabi, Gujarati, Bengali, Assamese, and English.


STT Glossary: Every Term You'll Encounter When Evaluating a Speech-to-Text API