Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text, is a capability that enables a program to process human speech into a written format.

While speech recognition is commonly confused with voice recognition, speech recognition focuses on the translation of speech from a verbal format to a text one whereas voice recognition just seeks to identify an individual user’s voice.

IBM has had a prominent role within speech recognition since its inception, releasing “Shoebox” in 1962. This machine had the ability to recognize 16 different words, advancing the initial work from Bell Labs from the 1950s. However, IBM didn’t stop there, but continued to innovate over the years, launching the VoiceType Simply Speaking application in 1996. This speech recognition software had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.

While speech technology had a limited vocabulary in the early days, it is utilized in a wide number of industries today, such as automotive, technology, and healthcare. Its adoption has only continued to accelerate in recent years due to advancements in deep learning and big data.  Research  (link resides outside ibm.com) shows that this market is expected to be worth USD 24.9 billion by 2025.

Many speech recognition applications and devices are available, but the more advanced solutions use AI and machine learning . They integrate grammar, syntax, structure, and composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go — evolving responses with each interaction.

The best kind of systems also allow organizations to customize and adapt the technology to their specific requirements — everything from language and nuances of speech to brand recognition. For example:

  • Language weighting: Improve precision by weighting specific words that are spoken frequently (such as product names or industry jargon), beyond terms already in the base vocabulary.
  • Speaker labeling: Output a transcription that cites or tags each speaker’s contributions to a multi-participant conversation.
  • Acoustics training: Attend to the acoustical side of the business. Train the system to adapt to an acoustic environment (like the ambient noise in a call center) and speaker styles (like voice pitch, volume and pace).
  • Profanity filtering: Use filters to identify certain words or phrases and sanitize speech output.

Meanwhile, speech recognition continues to advance. Companies like IBM are making inroads in several areas to improve human and machine interaction.

The vagaries of human speech have made development challenging. It’s considered to be one of the most complex areas of computer science – involving linguistics, mathematics and statistics. Speech recognizers are made up of a few components, such as the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.

Speech recognition technology is evaluated on its accuracy rate, i.e. word error rate (WER), and speed. A number of factors can impact word error rate, such as pronunciation, accent, pitch, volume, and background noise. Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the goal of speech recognition systems. Research from Lippmann (link resides outside ibm.com) estimates the word error rate to be around 4 percent, but it’s been difficult to replicate the results from this paper.

Various algorithms and computation techniques are used to recognize speech into text and improve the accuracy of transcription. Below are brief explanations of some of the most commonly used methods:

  • Natural language processing (NLP): While NLP isn’t necessarily a specific algorithm used in speech recognition, it is the area of artificial intelligence that focuses on the interaction between humans and machines through language, both spoken and written. Many mobile devices incorporate speech recognition into their systems to conduct voice search—e.g. Siri—or provide more accessibility around texting.
  • Hidden Markov models (HMM): Hidden Markov models build on the Markov chain model, which stipulates that the probability of a given state hinges on the current state, not its prior states. While a Markov chain model is useful for observable events, such as text inputs, hidden Markov models allow us to incorporate hidden events, such as part-of-speech tags, into a probabilistic model. They are utilized as sequence models within speech recognition, assigning labels to each unit—i.e. words, syllables, sentences, etc.—in the sequence. These labels create a mapping with the provided input, allowing it to determine the most appropriate label sequence.
  • N-grams: This is the simplest type of language model (LM), which assigns probabilities to sentences or phrases. An N-gram is a sequence of N words. For example, “order the pizza” is a trigram or 3-gram and “please order the pizza” is a 4-gram. Grammar and the probability of certain word sequences are used to improve recognition and accuracy (a short counting sketch follows this list).
  • Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent.  While neural networks tend to be more accurate and can accept more data, this comes at a performance efficiency cost as they tend to be slower to train compared to traditional language models.
  • Speaker Diarization (SD): Speaker diarization algorithms identify and segment speech by speaker identity. This helps programs better distinguish individuals in a conversation and is frequently applied in call centers to distinguish customers from sales agents.
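
To make the N-gram idea above concrete, here is a minimal Python sketch, assuming a tiny toy corpus: it extracts bigrams with a sliding window and turns their counts into a maximum-likelihood probability estimate. The corpus and the probability it prints are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus standing in for real training transcripts.
corpus = "please order the pizza please order the salad".split()

unigram_counts = Counter(ngrams(corpus, 1))
bigram_counts = Counter(ngrams(corpus, 2))

# Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1).
p = bigram_counts[("order", "the")] / unigram_counts[("order",)]
print(p)  # 1.0 in this toy corpus: "order" is always followed by "the"
```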

A wide number of industries are utilizing different applications of speech technology today, helping businesses and consumers save time and even lives. Some examples include:

Automotive: Speech recognizers improve driver safety by enabling voice-activated navigation systems and search capabilities in car radios.

Technology: Virtual agents are increasingly becoming integrated within our daily lives, particularly on our mobile devices. We use voice commands to access them through our smartphones, such as through Google Assistant or Apple’s Siri, for tasks, such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only continue to integrate into the everyday products that we use, fueling the “Internet of Things” movement.

Healthcare: Doctors and nurses leverage dictation applications to capture and log patient diagnoses and treatment notes.

Sales: Speech recognition technology has a couple of applications in sales. It can help a call center transcribe thousands of phone calls between customers and agents to identify common call patterns and issues. AI chatbots can also talk to people via a webpage, answering common queries and solving basic requests without needing to wait for a contact center agent to be available. In both instances, speech recognition systems help reduce the time to resolution for consumer issues.

Security: As technology integrates into our daily lives, security protocols are an increasing priority. Voice-based authentication adds a viable level of security.

Speech Recognition: Everything You Need to Know in 2024

Speech recognition, also known as automatic speech recognition (ASR), enables seamless communication between humans and machines. This technology empowers organizations to transform human speech into written text. Speech recognition technology can revolutionize many business applications, including customer service, healthcare, finance and sales.

In this comprehensive guide, we will explain speech recognition, exploring how it works, the algorithms involved, and the use cases of various industries.

If you require training data for your speech recognition system, here is a guide to finding the right speech data collection services.

What is speech recognition?

Speech recognition, also known as automatic speech recognition (ASR), speech-to-text (STT), and computer speech recognition, is a technology that enables a computer to recognize and convert spoken language into text.

Speech recognition technology uses AI and machine learning models to accurately identify and transcribe different accents, dialects, and speech patterns.

What are the features of speech recognition systems?

Speech recognition systems have several components that work together to understand and process human speech. Key features of effective speech recognition are:

  • Audio preprocessing: After you have obtained the raw audio signal from an input device, you need to preprocess it to improve the quality of the speech input. The main goal of audio preprocessing is to capture relevant speech data by removing any unwanted artifacts and reducing noise.
  • Feature extraction: This stage converts the preprocessed audio signal into a more informative representation. This makes raw audio data more manageable for machine learning models in speech recognition systems.
  • Language model weighting: Language weighting gives more weight to certain words and phrases, such as product references, in audio and voice signals. This makes those keywords more likely to be recognized in subsequent speech by speech recognition systems.
  • Acoustic modeling: It enables speech recognizers to capture and distinguish phonetic units within a speech signal. Acoustic models are trained on large datasets containing speech samples from a diverse set of speakers with different accents, speaking styles, and backgrounds.
  • Speaker labeling: It enables speech recognition applications to determine the identities of multiple speakers in an audio recording. It assigns unique labels to each speaker in an audio recording, allowing the identification of which speaker was speaking at any given time.
  • Profanity filtering: The process of removing offensive, inappropriate, or explicit words or phrases from audio data.

What are the different speech recognition algorithms?

Speech recognition uses various algorithms and computation techniques to convert spoken language into written language. The following are some of the most commonly used speech recognition methods:

  • Hidden Markov Models (HMMs): The hidden Markov model is a statistical Markov model commonly used in traditional speech recognition systems. HMMs capture the relationship between the acoustic features and model the temporal dynamics of speech signals. In a full recognizer, the acoustic model is combined with language and pronunciation models that:
      • Estimate the probability of word sequences in the recognized text
      • Convert colloquial expressions and abbreviations in a spoken language into a standard written form
      • Map phonetic units obtained from acoustic models to their corresponding words in the target language.
  • Speaker Diarization (SD): Speaker diarization, or speaker labeling, is the process of identifying and attributing speech segments to their respective speakers (Figure 1). It allows for speaker-specific voice recognition and the identification of individuals in a conversation.

Figure 1: A flowchart illustrating the speaker diarization process, in which multiple speakers in an audio recording are segmented and identified.

  • Dynamic Time Warping (DTW): Speech recognition algorithms use the Dynamic Time Warping (DTW) algorithm to find an optimal alignment between two sequences (Figure 2).

Figure 2: A speech recognizer using dynamic time warping to determine the optimal distance between the elements of two sequences.

  • Deep neural networks: Neural networks process and transform input data by simulating the non-linear frequency perception of the human auditory system.
  • Connectionist Temporal Classification (CTC): CTC is a training objective introduced by Alex Graves in 2006. It is especially useful for sequence labeling tasks and end-to-end speech recognition systems. It allows the neural network to discover the relationship between input frames and align input frames with output labels.
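
As a concrete illustration of the CTC objective described above, here is a minimal sketch using PyTorch's torch.nn.CTCLoss. The tensor shapes, vocabulary size, and random values are illustrative placeholders, not part of any specific model.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: T acoustic frames, a batch of N utterances,
# and C output symbols (characters plus the CTC blank at index 0).
T, N, C = 50, 4, 29

# Stand-in acoustic-model outputs: log-probabilities of shape (T, N, C).
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

# Target character sequences and the lengths CTC needs for alignment.
targets = torch.randint(low=1, high=C, size=(N, 12), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back toward the acoustic model parameters
print(float(loss))
```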

Speech recognition vs voice recognition

Speech recognition is commonly confused with voice recognition, yet they refer to distinct concepts. Speech recognition converts spoken words into written text, focusing on identifying the words and sentences spoken by a user, regardless of the speaker’s identity.

On the other hand, voice recognition is concerned with recognizing or verifying a speaker’s voice, aiming to determine the identity of an unknown speaker rather than focusing on understanding the content of the speech.

What are the challenges of speech recognition with solutions?

While speech recognition technology offers many benefits, it still faces a number of challenges that need to be addressed. Some of the main limitations of speech recognition include:

Acoustic Challenges:

  • Assume a speech recognition model has been primarily trained on American English accents. If a speaker with a strong Scottish accent uses the system, they may encounter difficulties due to pronunciation differences. For example, the word “water” is pronounced differently in both accents. If the system is not familiar with this pronunciation, it may struggle to recognize the word “water.”

Solution: Addressing these challenges is crucial to enhancing  speech recognition applications’ accuracy. To overcome pronunciation variations, it is essential to expand the training data to include samples from speakers with diverse accents. This approach helps the system recognize and understand a broader range of speech patterns.

  • For instance, you can use data augmentation techniques to reduce the impact of noise on audio data. Data augmentation helps train speech recognition models with noisy data to improve model accuracy in real-world environments.

Figure 3: Examples of a target sentence (“The clown had a funny face”) in the background noise of babble, car and rain. Background noise makes it difficult for speech recognition software to distinguish speech from the noise around it.
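
To illustrate the noise-augmentation idea mentioned above, here is a minimal numpy sketch that mixes a noise clip into a clean recording at a chosen signal-to-noise ratio. The synthetic signals and the 10 dB target are illustrative stand-ins for real recordings.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise signal into a speech signal at the requested SNR (in dB)."""
    # Loop or trim the noise so it covers the whole speech signal.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy usage with synthetic signals standing in for real audio files.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # one second at 16 kHz
babble = rng.normal(size=8000)                               # stand-in noise clip
noisy = add_noise(clean, babble, snr_db=10.0)
```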

Linguistic Challenges:

  • Out-of-vocabulary (OOV) words: Since the speech recognition model has not been trained on OOV words, it may incorrectly recognize them as different words or fail to transcribe them when encountering them.

Figure 4: An example of detecting an OOV word

Solution: One way to quantify the impact of OOV words is the Word Error Rate (WER), a common metric used to measure the accuracy of a speech recognition or machine translation system. The word error rate can be computed as the number of substitutions, insertions and deletions divided by the number of words in the reference transcript:

Figure 5: Demonstrating how to calculate word error rate (WER), a metric used to evaluate the performance and accuracy of speech recognition systems.

  • Homophones: Homophones are words that are pronounced identically but have different meanings, such as “to,” “too,” and “two”. Solution: Semantic analysis allows speech recognition programs to select the appropriate homophone based on its intended meaning in a given context. Addressing homophones improves the ability of the speech recognition process to understand and transcribe spoken words accurately.

Technical/System Challenges:

  • Data privacy and security: Speech recognition systems involve processing and storing sensitive and personal information, such as financial information. An unauthorized party could use the captured information, leading to privacy breaches.

Solution: You can encrypt sensitive and personal audio information transmitted between the user’s device and the speech recognition software. Another technique for addressing data privacy and security in speech recognition systems is data masking. Data masking algorithms mask and replace sensitive speech data with structurally identical but acoustically different data.

Figure 6: An example of how data masking works. Data masking protects sensitive or confidential audio information in speech recognition applications by replacing or encrypting the original audio data.
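
Masking can also be applied at the transcript level. Below is a minimal sketch, assuming simple regular-expression patterns for card-like and phone-like numbers; a production system would cover far more categories of sensitive data.

```python
import re

# Illustrative patterns only; real PII detection needs many more rules or an NER model.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")
PHONE_PATTERN = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")

def mask_transcript(text: str) -> str:
    """Replace card-like and phone-like number sequences with placeholder tags."""
    text = CARD_PATTERN.sub("[CARD]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text

print(mask_transcript("My card number is 4111 1111 1111 1111, call me on 555-123-4567."))
# -> "My card number is [CARD], call me on [PHONE]."
```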

  • Limited training data: Limited training data directly impacts the performance of speech recognition software. With insufficient training data, the speech recognition model may struggle to generalize to different accents or recognize less common words.

Solution: To improve the quality and quantity of training data, you can expand the existing dataset using data augmentation and synthetic data generation technologies.

13 speech recognition use cases and applications

In this section, we will explain how speech recognition revolutionizes the communication landscape across industries and changes the way businesses interact with machines.

Customer Service and Support

  • Interactive Voice Response (IVR) systems: Interactive voice response (IVR) is a technology that automates the process of routing callers to the appropriate department. It understands customer queries and routes calls to the relevant departments. This reduces the call volume for contact centers and minimizes wait times. IVR systems address simple customer questions without human intervention by employing pre-recorded messages or text-to-speech technology . Automatic Speech Recognition (ASR) allows IVR systems to comprehend and respond to customer inquiries and complaints in real time.
  • Customer support automation and chatbots: Speech recognition lets voice-enabled chatbots handle routine queries automatically, although the experience still matters: according to a survey, 78% of consumers interacted with a chatbot in 2022, but 80% of respondents said using chatbots increased their frustration level.
  • Sentiment analysis and call monitoring: Speech recognition technology converts spoken content from a call into text. After  speech-to-text processing, natural language processing (NLP) techniques analyze the text and assign a sentiment score to the conversation, such as positive, negative, or neutral. By integrating speech recognition with sentiment analysis, organizations can address issues early on and gain valuable insights into customer preferences.
  • Multilingual support: Speech recognition software can be trained in various languages to recognize and transcribe the language spoken by a user accurately. By integrating speech recognition technology into chatbots and Interactive Voice Response (IVR) systems, organizations can overcome language barriers and reach a global audience (Figure 7). Multilingual chatbots and IVR automatically detect the language spoken by a user and switch to the appropriate language model.

Figure 7: Showing how a multilingual chatbot recognizes words in another language

  • Customer authentication with voice biometrics: Voice biometrics use speech recognition technologies to analyze a speaker’s voice and extract features such as accent and speed to verify their identity.

Sales and Marketing:

  • Virtual sales assistants: Virtual sales assistants are AI-powered chatbots that assist customers with purchasing and communicate with them through voice interactions. Speech recognition allows virtual sales assistants to understand the intent behind spoken language and tailor their responses based on customer preferences.
  • Transcription services : Speech recognition software records audio from sales calls and meetings and then converts the spoken words into written text using speech-to-text algorithms.

Automotive:

  • Voice-activated controls: Voice-activated controls allow users to interact with devices and applications using voice commands. Drivers can operate features like climate control, phone calls, or navigation systems.
  • Voice-assisted navigation: Voice-assisted navigation provides real-time voice-guided directions by utilizing the driver’s voice input for the destination. Drivers can request real-time traffic updates or search for nearby points of interest using voice commands without physical controls.

Healthcare:

  • Medical dictation and transcription: Speech recognition streamlines clinical documentation, a workflow that typically involves:
      • Recording the physician’s dictation
      • Transcribing the audio recording into written text using speech recognition technology
      • Editing the transcribed text for better accuracy and correcting errors as needed
      • Formatting the document in accordance with legal and medical requirements.
  • Virtual medical assistants: Virtual medical assistants (VMAs) use speech recognition, natural language processing, and machine learning algorithms to communicate with patients through voice or text. Speech recognition software allows VMAs to respond to voice commands, retrieve information from electronic health records (EHRs) and automate the medical transcription process.
  • Electronic Health Records (EHR) integration: Healthcare professionals can use voice commands to navigate the EHR system , access patient data, and enter data into specific fields.

Technology:

  • Virtual agents: Virtual agents utilize natural language processing (NLP) and speech recognition technologies to understand spoken language and convert it into text. Speech recognition enables virtual agents to process spoken language in real-time and respond promptly and accurately to user voice commands.

Essential Guide to Automatic Speech Recognition Technology

Over the past decade, AI-powered speech recognition systems have slowly become part of our everyday lives, from voice search to virtual assistants in contact centers, cars, hospitals, and restaurants. These speech recognition developments are made possible by deep learning advancements.

Developers across many industries now use automatic speech recognition (ASR) to increase business productivity, application efficiency, and even digital accessibility. This post discusses ASR, how it works, use cases, advancements, and more.

What is automatic speech recognition?

Speech recognition technology is capable of converting spoken language (an audio signal) into written text that is often used as a command.

Today’s most advanced software can accurately process varying language dialects and accents. For example, ASR is commonly seen in user-facing applications such as virtual agents, live captioning, and clinical note-taking. Accurate speech transcription is essential for these use cases.

Developers in the speech AI space also use  alternative terminologies  to describe speech recognition such as ASR, speech-to-text (STT), and voice recognition.

ASR is a critical component of  speech AI , which is a suite of technologies designed to help humans converse with computers through voice.

Why natural language processing is used in speech recognition

Developers are often unclear about the role of natural language processing (NLP) models in the ASR pipeline. Aside from being applied in language models, NLP is also used to augment generated transcripts with punctuation and capitalization at the end of the ASR pipeline.

After the transcript is post-processed with NLP, the text is used for downstream language modeling tasks:

  • Sentiment analysis
  • Text analytics
  • Text summarization
  • Question answering

Speech recognition algorithms

Speech recognition algorithms can be implemented in a traditional way using statistical algorithms or by using deep learning techniques such as neural networks to convert speech into text.

Traditional ASR algorithms

Hidden Markov models (HMM) and dynamic time warping (DTW) are two such examples of traditional statistical techniques for performing speech recognition.

Using a set of transcribed audio samples, an HMM is trained to predict word sequences by varying the model parameters to maximize the likelihood of the observed audio sequence.

DTW is a dynamic programming algorithm that finds the best possible word sequence by calculating the distance between time series: one representing the unknown speech and others representing the known words.
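
To make the DTW idea above concrete, here is a compact numpy sketch of the classic recursion; it returns the cumulative alignment cost between two feature sequences (a smaller cost means the sequences are more similar). The toy sequences stand in for real acoustic feature vectors.

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Dynamic time warping cost between two sequences of feature vectors.

    x has shape (n, d) and y has shape (m, d); the local cost is the
    Euclidean distance between frames.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(x[i - 1] - y[j - 1])
            # Standard DTW step: extend the cheapest of the three neighbors.
            cost[i, j] = local + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# Toy usage: the same "word" spoken at two different speeds.
template = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
utterance = np.array([[0.0], [0.5], [1.0], [2.0], [2.0], [1.0], [0.0]])
print(dtw_distance(template, utterance))
```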

Deep learning ASR algorithms

For the last few years, developers have been interested in deep learning for speech recognition because statistical algorithms are less accurate. In fact, deep learning algorithms work better at understanding dialects, accents, context, and multiple languages, and they transcribe accurately even in noisy environments.

Some of the most popular state-of-the-art speech recognition acoustic models are QuartzNet, CitriNet, and Conformer. In a typical speech recognition pipeline, you can choose and switch any acoustic model that you want based on your use case and performance.

Implementation tools for deep learning models

Several tools are available for developing deep learning speech recognition models and pipelines, including Kaldi, Mozilla DeepSpeech, NVIDIA NeMo, NVIDIA Riva, NVIDIA TAO Toolkit, and services from Google, Amazon, and Microsoft.

Kaldi, DeepSpeech, and NeMo are open-source toolkits that help you build speech recognition models. TAO Toolkit and Riva are closed-source SDKs that help you develop customizable pipelines that can be deployed in production.

Cloud service providers like Google, AWS, and Microsoft offer generic services that you can easily plug and play with.

Deep learning speech recognition pipeline

An ASR pipeline consists of the following components:

  • Spectrogram generator that converts raw audio to spectrograms.
  • Acoustic model that takes the spectrograms as input and outputs a matrix of probabilities over characters over time.
  • Decoder (optionally coupled with a language model) that generates possible sentences from the probability matrix.
  • Punctuation and capitalization model that formats the generated text for easier human consumption.

A typical deep learning pipeline for speech recognition includes the following components:

  • Data preprocessing
  • Neural acoustic model
  • Decoder (optionally coupled with an n-gram language model)
  • Punctuation and capitalization model

Figure 1 shows an example of a deep learning speech recognition pipeline.

Diagram showing the ASR pipeline
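
The following Python sketch shows how the four stages listed above fit together. Every component name here is a placeholder callable, not a reference to any specific toolkit.

```python
import numpy as np

def asr_pipeline(waveform: np.ndarray,
                 spectrogram_generator,
                 acoustic_model,
                 decoder,
                 punctuation_model) -> str:
    """Chain the pipeline stages described above (all arguments are placeholders)."""
    spectrogram = spectrogram_generator(waveform)   # raw audio -> (frames, mel bins)
    char_probs = acoustic_model(spectrogram)        # -> (frames, vocabulary) probabilities
    raw_text = decoder(char_probs)                  # e.g. greedy or beam search decoding
    return punctuation_model(raw_text)              # restore punctuation and casing
```

In practice, each of these callables would be a trained model or decoding routine from whichever toolkit you choose.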

Datasets are essential in any deep learning application. Neural networks function similarly to the human brain. The more data you use to teach the model, the more it learns. The same is true for the speech recognition pipeline.

A few popular speech recognition datasets are

  • LibriSpeech
  • Fisher English Training Speech
  • Mozilla Common Voice (MCV)
  • 2000 HUB 5 English Evaluation Speech
  • AN4 (includes recordings of people spelling out addresses and names)
  • Aishell-1/AIshell-2 Mandarin speech corpus

Data processing is the first step. It includes data preprocessing and augmentation techniques such as speed/time/noise/impulse perturbation and time stretch augmentation, fast Fourier Transformations (FFT) using windowing, and normalization techniques.

For example, in Figure 2, the mel spectrogram is generated from a raw audio waveform after applying FFT using the windowing technique.

Diagram showing two forms of an audio recording: waveform (left) and mel spectrogram (right).
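
As an illustration of this preprocessing step, the sketch below computes a log-mel spectrogram with the librosa library; the file path, 25 ms window (n_fft=400), 10 ms hop, and 80 mel bands are typical but assumed values.

```python
import librosa
import numpy as np

# "speech.wav" is a placeholder path; any mono recording works.
waveform, sample_rate = librosa.load("speech.wav", sr=16000)

# Short-time Fourier transform with windowing, then a mel filterbank.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=400, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames), in dB
print(log_mel.shape)
```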

We can also use perturbation techniques to augment the training dataset. Figures 3 and 4 represent techniques like noise perturbation and masking being used to increase the size of the training dataset in order to avoid problems like overfitting.

Diagram showing two forms of a noise augmented audio recording: waveform (left) and mel spectrogram (right).

The output of the data preprocessing stage is a spectrogram/mel spectrogram, which is a visual representation of the strength of the audio signal over time. 

Mel spectrograms are then fed into the next stage: a neural acoustic model . QuartzNet, CitriNet, ContextNet, Conformer-CTC, and Conformer-Transducer are examples of cutting-edge neural acoustic models. Multiple ASR models exist for several reasons, such as the need for real-time performance, higher accuracy, memory size, and compute cost for your use case.

However, Conformer-based models are becoming more popular due to their improved accuracy and their ability to capture both local and global context in the audio. The acoustic model returns the probability of characters/words at each time stamp.

Figure 5 shows the output of the acoustic model, with time stamps. 

Diagram showing the output of acoustic model which includes probabilistic distribution over vocabulary characters per each time step.

The acoustic model’s output is fed into the decoder along with the language model. Decoders include greedy and beam search decoders, and language models include n-gram language models, KenLM, and neural scoring models. The decoder generates candidate words, which are then passed to the language model to predict the most likely sentence.

In Figure 6, the decoder selects the next best word based on the probability score. Based on the final highest score, the correct word or sentence is selected and sent to the punctuation and capitalization model.

Diagram showing how a decoder picks the next word based on the probability scores to generate a final transcript.
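
Of the two decoder families mentioned above, the greedy one is the simplest: take the most probable symbol at every frame, collapse repeats, and drop CTC blanks. Here is a minimal sketch; the vocabulary string and random probabilities are illustrative only.

```python
import numpy as np

def greedy_ctc_decode(char_probs: np.ndarray, vocabulary: str, blank: int = 0) -> str:
    """Greedy decoding of an acoustic model's (time, vocabulary) probability matrix."""
    best_path = np.argmax(char_probs, axis=1)  # most likely symbol per frame
    decoded = []
    previous = blank
    for idx in best_path:
        # Collapse repeated symbols and skip the blank symbol.
        if idx != previous and idx != blank:
            decoded.append(vocabulary[idx])
        previous = idx
    return "".join(decoded)

# Toy usage: index 0 is the CTC blank, the rest are characters.
vocab = "_ abcdefgh"  # illustrative; a real model uses a full alphabet
probs = np.random.default_rng(0).random((20, len(vocab)))
print(greedy_ctc_decode(probs, vocab))
```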

The ASR pipeline generates text with no punctuation or capitalization.

Finally, a punctuation and capitalization model is used to improve the text quality for better readability. Bidirectional Encoder Representations from Transformers (BERT) models are commonly used to generate punctuated text.

Figure 7 shows a simple example of a before-and-after punctuation and capitalization model.

Diagram showing how a punctuation and capitalization model adds punctuations & capitalizations to a generated transcript.

Speech recognition industry impact

There are many unique applications for ASR . For example, speech recognition could help industries such as finance, telecommunications, and unified communications as a service (UCaaS) to improve customer experience, operational efficiency, and return on investment (ROI).

Speech recognition is applied in the finance industry for applications such as call center agent assist and trade floor transcripts. ASR is used to transcribe conversations between customers and call center agents or trade floor agents. The generated transcriptions can then be analyzed and used to provide real-time recommendations to agents. This can contribute to an 80% reduction in post-call time.

Furthermore, the generated transcripts are used for downstream tasks:

  • Intent and entity recognition

Telecommunications

Contact centers are critical components of the telecommunications industry. With contact center technology, you can reimagine the telecommunications customer center, and speech recognition helps with that.

As previously discussed in the finance call center use case, ASR is used in telecom contact centers to transcribe conversations between customers and contact center agents, analyze them, and provide real-time recommendations to agents. T-Mobile uses ASR for quick customer resolution, for example.

Unified communications as a service

COVID-19 increased demand for UCaaS solutions, and vendors in the space began focusing on the use of speech AI technologies such as ASR to create more engaging meeting experiences.

For example, ASR can be used to generate live captions in video conferencing meetings. Captions generated can then be used for downstream tasks such as meeting summaries and identifying action items in notes.

Future of ASR technology

Speech recognition is not as easy as it sounds. Developing speech recognition is full of challenges, ranging from accuracy to customization for your use case to real-time performance. On the other hand, businesses and academic institutions are racing to overcome some of these challenges and advance the use of speech recognition capabilities.

ASR challenges

Some of the challenges in developing and deploying speech recognition pipelines in production include the following:

  • Lack of tools and SDKs that offer state-of-the-art (SOTA) ASR models, which makes it difficult for developers to take advantage of the best speech recognition technology.
  • Limited customization capabilities for fine-tuning on domain-specific and context-specific jargon, multiple languages, dialects, and accents, so that applications understand and speak like their users.
  • Restricted deployment support; for example, depending on the use case, the software should be deployable in any cloud, on-premises, at the edge, or embedded.
  • Real-time performance; for instance, in a call center agent assist use case, transcription cannot lag by several seconds before the results are used to empower agents.

For more information about the major pain points that developers face when adding speech-to-text capabilities to applications, see Solving Automatic Speech Recognition Deployment Challenges .

ASR advancements

Numerous advancements in speech recognition are occurring on both the research and software development fronts. To begin, research has resulted in the development of several new cutting-edge ASR architectures, E2E speech recognition models, and self-supervised or unsupervised training techniques.

On the software side, there are a few tools that enable quick access to SOTA models, and then there are different sets of tools that enable the deployment of models as services in production. 

Key takeaways

Speech recognition continues to grow in adoption due to advancements in deep learning-based algorithms that have made ASR as accurate as human recognition. Also, breakthroughs like multilingual ASR help companies make their apps available worldwide, and moving algorithms from cloud to on-device saves money, protects privacy, and speeds up inference.

NVIDIA offers Riva , a speech AI SDK, to address several of the challenges discussed above. With Riva, you can quickly access the latest SOTA research models tailored for production purposes. You can customize these models to your domain and use case, deploy on any cloud, on-premises, edge, or embedded, and run them in real-time for engaging natural interactions.

Learn how your organization can benefit from speech recognition skills with the free ebook, Building Speech AI Applications .

Introduction to Automatic Speech Recognition (ASR)

Maël Fabien

This article provides a summary of the course “Automatic speech recognition” by Gwénolé Lecorvé from the Research in Computer Science (SIF) master, to which I added notes of the Statistical Sequence Processing course of EPFL, and from some tutorials/personal notes. All references are presented at the end.

Introduction to ASR

What is ASR?

Automatic Speech Recognition (ASR), or Speech-to-text (STT) is a field of study that aims to transform raw audio into a sequence of corresponding words.

Some of the speech-related tasks involve:

  • speaker diarization: which speaker spoke when?
  • speaker recognition: who spoke?
  • spoken language understanding: what’s the meaning?
  • sentiment analysis: how does the speaker feel?

The classical pipeline in an ASR-powered application involves the Speech-to-text, Natural Language Processing and Text-to-speech.

image

ASR is not easy, since there are many sources of variability:

  • variability between speakers (inter-speaker)
  • variability for the same speaker (intra-speaker)
  • noise, reverberation in the room, environment…
  • articulation
  • elisions (grouping some words, not pronouncing them)
  • words with similar pronunciation
  • size of vocabulary
  • word variations

From a Machine Learning perspective, ASR is also really hard:

  • very high dimensional output space, and a complex sequence to sequence problem
  • little annotated training data
  • data is noisy

How is speech produced?

Let us first focus on how speech is produced. An excitation \(e\) is produced by the lungs. It takes the form of an initial waveform, described as an airflow over time.

Then, vibrations are produced by the vocal cords, and filters \(f\) are applied by the pharynx, the tongue…

image

The output signal produced can be written as \(s = f * e\), a convolution between the excitation and the filters. Hence, assuming \(f\) is linear and time-invariant, the convolution becomes a product in the spectral domain:

\[ S(\omega) = F(\omega) \, E(\omega) \]

From the initial waveform, we generate the glottal spectrum, right out of the vocal cords. A bit higher in the vocal tract, at the level of the pharynx, pitches are formed and produce the formants of the vocal tract. Finally, the output spectrum gives us the intensity over the range of frequencies produced.

image

Breaking down words

In automatic speech recognition, you do not train an Artificial Neural Network to make predictions on a set of 50’000 classes, each of them representing a word.

In fact, you take an input sequence, and produce an output sequence. And each word is represented as a sequence of phonemes, the elementary sounds in a language based on the International Phonetic Alphabet (IPA). To learn more about linguistics and phonetics, feel free to check this course from Harvard. There are around 40 to 50 different phonemes in English.

Phones are speech sounds defined by their acoustics and are potentially unlimited in number, whereas phonemes are the abstract sound units that a given language distinguishes.

For example, the word “French” is written in the IPA as /f ɹ ɛ n t ʃ/. Each phoneme describes the voicing (voiced or unvoiced) as well as the position of the articulators.

Phonemes are language-dependent, since the sounds produced in languages are not the same. We define a minimal pair as two words that differ by only one phoneme. For example, “kill” and “kiss”.

For the sake of completeness, here are the consonant and vowel phonemes in standard french:

image

There are several ways to see a word:

  • as a sequence of phonemes
  • as a sequence of graphemes (mostly a written symbol representing phonemes)
  • as a sequence of morphemes (meaningful morphological unit of a language that cannot be further divided) (e.g “re” + “cogni” + “tion”)
  • as a part-of-speech (POS) in morpho-syntax: grammatical class, e.g noun, verb, … and flexional information, e.g singular, plural, gender…
  • as a syntax describing the function of the word (subject, object…)
  • as a meaning

Therefore, labeling speech can be done at several levels: phonemes, graphemes, morphemes, words, syntax, or meaning.

And the labels may be time-aligned if we know when they occur in speech.

The vocabulary is defined as the set of words in a specific task, a language or several languages based on the ASR system we want to build. If we have a large vocabulary, we talk about Large vocabulary continuous speech recognition (LVCSR) . If some words we encounter in production have never been seen in training, we talk about Out Of Vocabulary words (OOV).

We distinguish 2 types of speech recognition tasks:

  • isolated word recognition
  • continuous speech recognition, which we will focus on

Evaluation metrics

We usually evaluate the performance of an ASR system using Word Error Rate (WER). We take as a reference a manual transcript. We then compute the number of mistakes made by the ASR system. Mistakes might include:

  • Substitutions, \(N_{SUB}\), a word gets replaced
  • Insertions, \(N_{INS}\), a word which was not pronounced is added
  • Deletions, \(N_{DEL}\), a word is omitted from the transcript

The WER is computed as:

\[ \text{WER} = \frac{N_{SUB} + N_{INS} + N_{DEL}}{N} \]

where \(N\) is the number of words in the reference transcript. The WER should be as close to 0 as possible. The number of substitutions, insertions and deletions is computed using the Wagner-Fischer dynamic programming algorithm for word alignment.
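
Here is a minimal Python sketch of that computation, assuming whitespace tokenization: the Wagner-Fischer recursion fills an edit-distance table over words and the WER is the final distance divided by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER via the Wagner-Fischer edit-distance recursion over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits needed to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            insertion = d[i][j - 1] + 1
            deletion = d[i - 1][j] + 1
            d[i][j] = min(substitution, insertion, deletion)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```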

Statistical historical approach to ASR

Let us denote the optimal word sequence \(W^{\star}\) from the vocabulary. Let the input sequence of acoustic features be \(X\). Statistically, our aim is to identify the optimal sequence such that:

\[ W^{\star} = \arg\max_W P(W \mid X) \]

This is known as the “Fundamental Equation of Statistical Speech Processing”. Using Bayes’ rule, we can rewrite it as:

\[ W^{\star} = \arg\max_W \frac{P(X \mid W) \, P(W)}{P(X)} \]

Finally, since \(P(X)\) does not depend on \(W\), we can drop it. Hence, we can re-formulate our problem as:

\[ W^{\star} = \arg\max_W P(X \mid W) \, P(W) \]

where:

  • \(argmax_W\) is the search space, a function of the vocabulary
  • \(P(X \mid W)\) is called the acoustic model
  • \(P(W)\) is called the language model

The steps are presented in the following diagram:

image

Feature extraction \(X\)

From the speech analysis, we should extract features \(X\) which are:

  • robust across speakers
  • robust against noise and channel effects
  • low dimension, at equal accuracy
  • non-redundant among features

Features we typically extract include:

  • Mel-Frequency Cepstral Coefficients (MFCC), as described here
  • Perceptual Linear Prediction (PLP)

We should then normalize the features extracted to avoid mismatches across samples with mean and variance normalization.

Acoustic model

1. HMM-GMM acoustic model

The acoustic model is a complex model, usually based on Hidden Markov Models and Artificial Neural Networks, modeling the relationship between the audio signal and the phonetic units in the language.

In isolated word/pattern recognition, the acoustic features (here \(Y\)) are used as an input to a classifier whose role is to output the correct word. However, when it comes to continuous speech recognition, we take an input sequence and must output a sequence as well.

image

The acoustic model goes further than a simple classifier. It outputs a sequence of phonemes.

image

Hidden Markov Models are natural candidates for Acoustic Models since they are great at modeling sequences. If you want to read more on HMMs and HMM-GMM training, you can read this article . The HMM has underlying states \(s_i\), and at each state, observations \(o_i\) are generated.

image

In HMMs, 1 phoneme is typically represented by a 3 or 5 state linear HMM (generally the beginning, middle and end of the phoneme).

image

The topology of HMMs is flexible by nature, and we can choose to have each phoneme being represented by a single state, or 3 states for example:

image

The HMM supposes observation independence, in the sense that:

\[ P(o_1, \dots, o_T \mid s_1, \dots, s_T) = \prod_{t=1}^{T} P(o_t \mid s_t) \]

The HMM can also output context-dependent phonemes, called triphones. Triphones are simply a group of 3 phonemes, the left one being the left context, and the right one, the right context.

The HMM is trained using the Baum-Welch algorithm. The HMM learns to give the probability of each end of phoneme at time t. We usually suppose the observations are generated at each state \(j\) by a mixture of Gaussians (Gaussian Mixture Models, GMMs), i.e.:

\[ P(o_t \mid s_j) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(o_t ; \mu_{jm}, \Sigma_{jm}) \]

The training of the HMM-GMM is solved by Expectation Maximization (EM). In the EM training, the GMM likelihoods \(P(X \mid W)\) and the HMM alignments are refined iteratively: the Viterbi or Baum-Welch algorithm trains the HMM (i.e. identifies the transition matrices) to produce the best state sequence, and the GMM parameters are then re-estimated from that alignment.
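
As a small illustration of HMM-GMM training by EM, here is a sketch assuming the third-party hmmlearn package; the random features, 3 states, and 4 mixture components are placeholders standing in for real MFCC frames and a real phoneme model.

```python
import numpy as np
from hmmlearn import hmm  # assumed third-party package for HMM-GMM training

# Stand-in acoustic features: two "utterances" of 13-dimensional MFCC-like frames,
# concatenated along axis 0, with their individual lengths listed separately.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 13))
lengths = [180, 120]

# A 3-state model with 4 Gaussians per state, trained by EM (Baum-Welch);
# this stands in for one phoneme model inside a full recognizer.
model = hmm.GMMHMM(n_components=3, n_mix=4, covariance_type="diag", n_iter=20)
model.fit(features, lengths)

print(model.score(features[:180], [180]))  # log-likelihood of the first utterance
```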

The full pipeline is presented below:

image

2. HMM-DNN acoustic model

The latest models focus on hybrid HMM-DNN architectures and approach the acoustic model in another way. In such an approach, we do not model the acoustic likelihood \(P(X \mid W)\) directly; instead, we tackle the posterior probability of the state sequence given \(X\).

Hence, back to the first acoustic modeling equation, we now target the posterior over HMM state sequences \(Q\) given the acoustic features, \(P(Q \mid X)\).

The aim of the DNN is to model the posterior probabilities over HMM states.

image

Some considerations on the HMM-DNN framework:

  • we usually take a large number of hidden layers
  • the input features are typically extracted from large windows (up to 1-2 seconds) to have a large context
  • early stopping can be used

You might have noticed that the training of the DNN produces posteriors, whereas the Viterbi or Forward-Backward algorithm requires the likelihoods \(P(X \mid W)\) to identify the optimal sequence when training the HMM. Therefore, we use Bayes’ rule to obtain scaled likelihoods:

\[ P(x_t \mid q_t) = \frac{P(q_t \mid x_t) \, P(x_t)}{P(q_t)} \]

The probability of the acoustic feature \(P(X)\) is not known, but it just scales all the likelihoods by the same factor, and therefore does not modify the alignment.

The training of HMM-DNN architectures is based on an EM-like alternation:

  • the E-step keeps the DNN and HMM parameters constant and estimates the DNN outputs to produce scaled likelihoods
  • the M-step re-trains the DNN parameters on the new targets from the E-step
  • alternatively, REMAP can be used, with a similar architecture, except that the state priors are also given as inputs to the DNN

3. HMM-DNN vs. HMM-GMM

Here is a brief summary of the pros and cons of HMM/DNN and HMM/GMM:

4. End-to-end models

In End-to-end models, the steps of feature extraction and phoneme prediction are combined:

image

This concludes the part on acoustic modeling.

Pronunciation

In small vocabulary sizes, it is quite easy to collect a lot of utterances for each word, and the HMM-GMM or HMM-DNN training is efficient. However, “statistical modeling requires a sufficient number of examples to get a good estimate of the relationship between speech input and the parts of words”. In large-vocabulary tasks, we might collect only 1 or even 0 training examples for a given word. Thus, it is not feasible to train a model for each word, and we need to share information across words, based on the pronunciation.

We consider words as being sequences of states \(Q\), so that:

\[ P(X \mid W) = \sum_{Q} P(X \mid Q) \, P(Q \mid W) \]

where \(P(Q \mid W)\) is the pronunciation model.

The pronunciation dictionary is written by human experts, and defined in the IPA. The pronunciation of words is typically stored in a lexical tree, a data structure that allows us to share histories between words in the lexicon.

image

When decoding a sequence in prediction, we must identify the most likely path in the tree based on the HMM-DNN output.

In ASR, most recent approaches are:

  • either end to end
  • or at the character level

In both approaches, we do not care about the full pronunciation of the words. Grapheme-to-phoneme (G2P) models try to learn automatically the pronunciation of new words.

Language Modeling

Let’s get back to our ASR base equation:

\[ W^{\star} = \arg\max_W P(X \mid W) \, P(W) \]

The language model is defined as \(P(W)\). It assigns a probability estimate to word sequences, and defines:

  • what the speaker may say
  • the vocabulary
  • the probability over possible sequences, by training on some texts

The constraint on \(P(W)\) is that \(\sum_W P(W) = 1\).

In statistical language modeling, we aim to disambiguate sequences such as:

“recognize speech”, “wreck a nice beach”

The maximum likelihood estimation of a sequence is given by:

\[ P(w_i \mid w_1, \dots, w_{i-1}) = \frac{C(w_1, \dots, w_{i-1}, w_i)}{C(w_1, \dots, w_{i-1})} \]

Where \(C(w_1, ..., w_i)\) is the observed count in the training data. For example:

image

We call this ratio the relative frequency. The probability of a whole sequence is given by the chain rule of probabilities:

\[ P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \]

This approach seems logical, but the longer the sequence, the more likely it is that some counts will be zero, bringing the probability of the whole sequence to 0.

What solutions can we apply?

  • smoothing: redistribute the probability mass from observed to unobserved events (e.g Laplace smoothing, Add-k smoothing)
  • backoff: explained below

1. N-gram language model

But one of the most popular solutions is the n-gram model. The idea behind the n-gram model is to truncate the word history to the last 2, 3, 4 or 5 words, and therefore approximate the history of the word:

\[ P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) \]

We take \(n\) as being 1 (unigram), 2 (bigram), 3 (trigram)…

Let us now discuss some practical implementation tricks:

  • we compute the log of the probabilities, rather than the probabilities themselves (to avoid floating point approximation to 0)
  • for the first word of a sequence, we need to define pseudo-words as being the first 2 missing words for the trigram: \(P(I \mid <s><s>)\)

With N-grams, it is possible that we encounter unseen N-grams in prediction. There is a technique called backoff that states that if we miss the trigram evidence, we use the bigram instead, and if we miss the bigram evidence, we use the unigram instead…

Another approach is linear interpolation, where we combine different order n-grams by linearly interpolating all the models:

\[ \hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}), \quad \sum_k \lambda_k = 1 \]
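
Here is a tiny Python sketch of this interpolation for a bigram model, together with the perplexity measure discussed in the next section. The toy corpus, the weight \(\lambda = 0.7\), and the test sentence are illustrative; smoothing for unseen words is omitted.

```python
import math
from collections import Counter

corpus = "recognize speech with a speech recognizer".split()
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))
total = sum(unigram.values())

def p_interp(prev: str, word: str, lam: float = 0.7) -> float:
    """Linearly interpolate a bigram MLE with a unigram MLE."""
    p_uni = unigram[word] / total
    p_bi = bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

def perplexity(sentence):
    """exp of the average negative log-probability per predicted word."""
    log_prob = sum(math.log(p_interp(prev, word))
                   for prev, word in zip(sentence, sentence[1:]))
    return math.exp(-log_prob / (len(sentence) - 1))

print(perplexity("recognize speech".split()))  # 1.25 on this toy corpus
```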

2. Language models evaluation metrics

There are 2 types of evaluation metrics for language models:

  • extrinsic evaluation , for which we embed the language model in an application and see by which factor the performance is improved
  • intrinsic evaluation that measures the quality of a model independent of any application

Extrinsic evaluations are often heavy to implement. Hence, when focusing on intrinsic evaluations, we:

  • split the dataset/corpus into train and test (and development set if needed)
  • learn transition probabilities from the training set
  • use the perplexity metric to evaluate the language model on the test set

We could also use the raw probabilities to evaluate the language model, but the perplexity is defined as the inverse probability of the test set, normalized by the number of words. For example, for a bi-gram model, the perplexity (noted PP) is defined as:

\[ PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \]

The lower the perplexity, the better

3. Limits of language models

Language models are trained on a closed vocabulary. Hence, when a new unknown word is met, it is said to be Out of Vocabulary (OOV).

4. Deep learning language models

More recently in Natural Language Processing, neural network-based language models have become more and more popular. Word embeddings project words into a continuous space \(R^d\), and respect topological properties (semantic and morpho-syntactic).

Recurrent neural networks and LSTMs are natural candidates when learning such language models.

The training is now done. The final step to cover is the decoding, i.e. the predictions to make when we collect audio features and want to produce transcript.

We need to find:

\[ W^{\star} = \arg\max_W P(X \mid W) \, P(W) \]

However, exploring the whole search space, especially since the language model \(P(W)\) has a really large scale factor, can be incredibly slow.

One of the solutions is to explore the Beam Search . The Beam Search algorithm greatly reduces the scale factor within a language model (whether N-gram based or Neural-network-based). In Beam Search, we:

  • identify the probability of each word in the vocabulary for the first position, and keep the top K ones (K is called the Beam width)
  • for each of the K words, we compute the conditional probability of observing each of the second words of the vocabulary
  • among all produced probabilities, we keep only the top K ones
  • and we move on to the third word…

Let us illustrate this process the following way. We want to evaluate the sequence that is the most likely. We first compute the probability of the different words of the vocabulary to be the starting word of the sentence:

image

Here, we fix the beam width to 2, meaning that we only select the 2 most likely words to start with. Then, we move on to the next word, and compute the probability of observing it using the conditional probability given by the language model: \(P(w_1, w_2 \mid X) = P(w_1 \mid X) \, P(w_2 \mid w_1, X)\). We might see that a potential candidate, e.g. “The”, when selecting the top 2 candidate second words among all possible words, is not a possible path anymore. In that case, we narrow the search, since we know that the first word must be “a”.

image

And so on… Another approach to decoding is the Weighted Finite State Transducers (I’ll make an article on that).
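
The sketch below implements this pruning logic for fixed-length, 3-word sequences. The toy vocabulary and the step_probs stand-in (which plays the role of the combined acoustic and language model scores) are assumptions for illustration.

```python
import math

def beam_search(step_probs, vocabulary, beam_width=2, length=3):
    """Keep the `beam_width` most probable partial sequences at every step.

    `step_probs(prefix)` must return a dict mapping each word in `vocabulary`
    to P(word | prefix); here it is a toy stand-in for the real model scores.
    """
    beams = [([], 0.0)]  # (word sequence, log probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            probs = step_probs(prefix)
            for word in vocabulary:
                candidates.append((prefix + [word], score + math.log(probs[word])))
        # Prune: keep only the top `beam_width` hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

vocab = ["a", "the", "dog", "barks"]

def toy_model(prefix):
    # Near-uniform toy distribution that favors "the" at the start of a sentence.
    base = {w: 0.2 for w in vocab}
    if not prefix:
        base["the"] = 0.4
    total = sum(base.values())
    return {w: p / total for w, p in base.items()}

for words, log_p in beam_search(toy_model, vocab):
    print(" ".join(words), round(log_p, 3))
```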

Summary of the ASR pipeline

In their paper “Word Embeddings for Speech Recognition” , Samy Bengio and Georg Heigold present a good summary of a modern ASR architecture:

  • Words are represented through lexicons as phonemes
  • Typically, for context, we cluster triphones
  • We then assume that these triphones states were in fact HMM states
  • And the observations each HMM state generates are produced by DNNs or GMMs

image

End-to-end approach

Alright, this article is already long, but we’re almost done. So far, we mostly covered historical statistical approaches. These approaches work very well. However, most recent papers and implementations focus on end-to-end approaches, where:

  • we encode \(X\) as a sequence of contexts \(C\)
  • we decode \(C\) into a sequence of words \(W\)

These approaches, also called encoder-decoder, are part of sequence-to-sequence models. Sequence to sequence models learn to map a sequence of inputs to a sequence of outputs, even though their length might differ. This is widely used in Machine Translation for example.

As illustrated below, the encoder reduces the input sequence to an encoder vector through a stack of RNNs, and the decoder uses this vector as an input.

image

I will write more about End-to-end models in another article.

This is all for this quite long introduction to automatic speech recognition. After a brief introduction to speech production, we covered historical approaches to speech recognition with HMM-GMM and HMM-DNN approaches. We also mentioned the more recent end-to-end approaches. If you want to improve this article or have a question, feel free to leave a comment below :)

References:

  • “Automatic speech recognition” by Gwénolé Lecorvé from the Research in Computer Science (SIF) master
  • EPFL Statistical Sequence Processing course
  • Stanford CS224S
  • Rasmus Robert HMM-DNN
  • A Tutorial on Pronunciation Modeling for Large Vocabulary Speech Recognition
  • N-gram Language Models, Stanford
  • Andrew Ng’s Beam Search explanation
  • Encoder Decoder model
  • Automatic Speech Recognition Introduction, University of Edimburgh


An introduction to speech recognition

A review of the development of speech recognition from its early inception, the increasing role of artificial intelligence and how it is impacting upon the day-to-day operations of today’s businesses.

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability which enables a program to identify human speech and convert it into readable text.

Whilst the more basic speech recognition software has a limited vocabulary, we are now seeing the emergence of more sophisticated software that can handle natural speech, different accents and various languages, whilst also achieving much higher accuracy rates. We are also using speech recognition technology much more in our everyday lives, with an increasing number of people taking advantage of digital assistants like Google Home, Siri, and Amazon Alexa.

So, how has the technology evolved, how does it work and what are the opportunities for businesses and professionals across numerous industries and sectors to exploit speech recognition in their everyday work?

Here’s a quick overview of how speech recognition has developed from the early prototypes:

  • 1952 – The first-ever speech recognition system, known as “Audrey”, was built by Bell Laboratories. It was capable of recognising the sound of a spoken digit – zero to nine – with more than 90% accuracy when uttered by a single voice (its developer, HK Davis).
  • 1962 – IBM created the “Shoebox”, a device that could recognise and differentiate between 16 spoken English words.
  • 1970s - As part of a US Department of Defence-funded program, Carnegie Mellon University developed the “Harpy” system that could recognise entire sentences and had a vocabulary of 1,011 words.
  • 1980s – IBM developed a voice-activated typewriter called Tangora which used a statistical prediction model for word identification with a vocabulary of 20,000 words.
  • 1996 – IBM were involved again, this time with VoiceType Simply Speaking, a speech recognition application that had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.
  • 2000s – With speech recognition now achieving close to an 80% accuracy rate, voice assistants (also commonly referred to as digital assistants) came to the fore: first Google Voice, followed a few years later by Apple’s Siri and then Amazon’s Alexa.

How it works

A wide range of speech recognition applications and devices are available, with the more advanced solutions now using Artificial Intelligence (AI) and machine learning. They are typically based on the following models:

  • Acoustic models – making it possible to distinguish between the voice signal and the phonemes (the units of sound).
  • Pronunciation models – defining how the phonemes can be combined to make words.
  • Language models – matching sounds with word sequences in order to distinguish between words that sound the same.

Initially, the Hidden Markov Model (HMM) was widely adopted as an acoustic modelling approach. However, it has largely been replaced by deep neural networks. The use of deep learning in speech recognition has had the effect of significantly lowering the word error rate.

Word error rate

A key factor in speech recognition technology is its accuracy rate, commonly referred to as the word error rate (WER). A number of factors can impact upon the WER, for example different speech patterns, speaking styles, languages, dialects, accents and phrasings. The challenge for the software algorithms that process and organise audio into text is to address these effectively, whilst also being able to separate the spoken audio from background noise that often accompanies the signal.
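
For reference, WER is computed as the number of word substitutions, deletions and insertions needed to turn the recognised output into the reference transcript, divided by the number of words in the reference. Here is a minimal sketch of that calculation using a standard edit-distance computation (the sample sentences are made up):

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("please dim the lights", "please then the light"))  # 0.5 (2 errors / 4 words)
```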

The application of speech recognition

Thanks to laptops, tablets and smartphones, together with the rapid development of AI, speech recognition software has entered all aspects of our everyday life. Examples include:

Virtual assistants

These integrate with a range of different platforms and enable us to command our devices just by talking. At the personal level examples include Siri, Alexa and Google Assistant. In the office they can be used to complement the work of human employees by taking responsibility for repetitive, time-consuming tasks and allowing employees to focus their energy on more high-priority activities.

Voice search

Speech recognition technology is not only impacting the way businesses perform daily tasks but also how their customers are able to reach them. Voice search is typically used on devices such as smartphones, laptops and tablets, allowing users to input a voice-based search query instead of typing their query into a search engine. The differences between spoken and typed queries can cause different SERP (search engine results page) results since the way we speak creates new voice search keywords that are more conversational than typed keywords.

Speech to text solutions

And finally, the most significant area as far as business users are concerned is speech to text software. This area is growing rapidly, due in no small part to the availability of cloud-based solutions that are enabling users to access fully featured versions of speech to text apps from their smartphones or tablets irrespective of their location. Furthermore, speech recognition technology can reduce repetitive tasks and free up professionals to use their time more productively, whilst also allowing businesses to save money by automating processes and doing administrative tasks more quickly.


The Oxford Handbook of Computational Linguistics (2nd edn)

33 Speech Recognition

Lori Lamel is a Senior Research Scientist at LIMSI CNRS, which she joined as a permanent researcher in October 1991. She holds a PhD degree in EECS from MIT, and an HDR in CS from the University of Paris XI. She also has over 330 peer-reviewed publications. Her research covers a range of topics in the field of spoken language processing and corpus-based linguistics. She is a Fellow of ISCA and IEEE and serves on the ISCA board.

Jean-Luc Gauvain is a Senior Research Scientist at the CNRS and Spoken Language Processing Group Head at LISN. He received a doctorate in electronics and a computer science HDR degree from Paris-Sud University. His research centres on speech technologies, including speech recognition, audio indexing, and language and speaker recognition. He has contributed over 300 publications to this field and was awarded a CNRS silver medal in 2007. He served as co-editor-in-chief for Speech Communication Journal in 2006–2008, and as scientific coordinator for the Quaero research programme in 2008–2013. He is an ISCA Fellow.

Published: 10 December 2015

Speech recognition is concerned with converting the speech waveform, an acoustic signal, into a sequence of words. Today’s best-performing approaches are based on a statistical modelization of the speech signal. This chapter provides an overview of the main topics addressed in speech recognition: that is, acoustic-phonetic modelling, lexical representation, language modelling, decoding, and model adaptation. The focus is on methods used in state-of-the-art, speaker-independent, large-vocabulary continuous speech recognition (LVCSR). Some of the technology advances over the last decade are highlighted. Primary application areas for such technology initially addressed dictation tasks and interactive systems for limited domain information access (usually referred to as spoken language dialogue systems). The last decade has witnessed a wider coverage of languages, as well as growing interest in transcription systems for information archival and retrieval, media monitoring, automatic subtitling and speech analytics. Some outstanding issues and directions of future research are discussed.

33.1 Introduction

Speech recognition is principally concerned with the problem of transcribing the speech signal as a sequence of words. Today’s best-performing systems use statistical models (Chapter 12) of speech. From this point of view, speech generation is described by a language model which provides estimates of Pr(w) for all word strings w independently of the observed signal, and an acoustic model that represents, by means of a probability density function f(x|w), the likelihood of the signal x given the message w. The goal of speech recognition is to find the most likely word sequence given the observed acoustic signal. The speech-decoding problem thus consists of maximizing the probability of w given the speech signal x, or equivalently, maximizing the product Pr(w)f(x|w).
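
Written out, this is the familiar maximum a posteriori decoding rule, restating the paragraph above in equation form:

\[ \hat{w} \;=\; \operatorname*{arg\,max}_{w} \Pr(w \mid x) \;=\; \operatorname*{arg\,max}_{w} \Pr(w)\, f(x \mid w) \]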

The principles on which these systems are based have been known for many years now, and include the application of information theory to speech recognition (Bahl et al. 1976; Jelinek 1976), the use of a spectral representation of the speech signal (Dreyfus-Graf 1949; Dudley and Balashek 1958), the use of dynamic programming for decoding (Vintsyuk 1968), and the use of context-dependent acoustic models (Schwartz et al. 1984). Despite the fact that some of these techniques were proposed well over two decades ago, considerable progress has been made in recent years in part due to the availability of large speech and text corpora (Chapters 19 and 20) and improved processing power, which have allowed more complex models and algorithms to be implemented. Compared with the state-of-the-art technology a decade ago, advances in acoustic modelling have enabled reasonable transcription performance for various data types and acoustic conditions.

The main components of a generic speech recognition system are shown in Figure 33.1 . The elements shown are the main knowledge sources (speech and textual training materials and the pronunciation lexicon), the feature analysis (or parameterization), the acoustic and language models which are estimated in a training phase, and the decoder. The next four sections are devoted to discussing these main components. The last two sections provide some indicative measures of state-of-the-art performance on some common tasks as well as some perspectives for future research.

System diagram of a generic speech recognizer based on statistical models, including training and decoding processes

33.2 Acoustic Parameterization and Modelling

Acoustic parameterization is concerned with the choice and optimization of acoustic features in order to reduce model complexity while trying to maintain the linguistic information relevant for speech recognition. Acoustic modelling must take into account different sources of variability present in the speech signal: those arising from the linguistic context and those associated with the non-linguistic context, such as the speaker (e.g. gender, age, emotional state, human non-speech sounds, etc.) and the acoustic environment (e.g. background noise, music) and recording channel (e.g. direct microphone, telephone). Most state-of-the-art systems make use of hidden Markov models (HMMs) for acoustic modelling, which consists of modelling the probability density function of a sequence of acoustic feature vectors. In this section, common parameterizations are described, followed by a discussion of acoustic model estimation and adaptation.

33.2.1 Acoustic Feature Analysis

The first step of the acoustic feature analysis is digitization, where the continuous speech signal is converted into discrete samples. The most commonly used sampling rates are 16 kHz and 10 kHz for direct microphone input and 8 kHz for telephone signals. The next step is feature extraction (also called parameterization or front-end analysis), which has the goal of representing the audio signal in a more compact manner by trying to remove redundancy and reduce variability, while keeping the important linguistic information ( Hunt 1996 ).

A widely accepted assumption is that although the speech signal is continually changing, due to physical constraints on the rate at which the articulators can move, the signal can be considered quasi-stationary for short periods (on the order of 10 ms to 20 ms). Therefore most recognition systems use short-time spectrum-related features based either on a Fourier transform or a linear prediction model. Among these features, cepstral parameters are popular because they are a compact representation, and are less correlated than direct spectral components. This simplifies estimation of the HMM parameters by reducing the need for modelling the feature dependency.

The two most popular sets of features are cepstrum coefficients obtained with a Mel Frequency Cepstral (MFC) analysis (Davis and Mermelstein 1980) or with a Perceptual Linear Prediction (PLP) analysis (Hermansky 1990). In both cases, a Mel scale short-term power spectrum is estimated on a fixed window (usually in the range of 20 to 30 ms). In order to avoid spurious high-frequency components in the spectrum due to discontinuities caused by windowing the signal, it is common to use a tapered window such as a Hamming window. The window is then shifted (usually a third or a half the window size) and the next feature vector computed. The most commonly used offset is 10 ms. The Mel scale approximates the frequency resolution of the human auditory system, being linear in the low-frequency range (below 1,000 Hz) and logarithmic above 1,000 Hz. The cepstral parameters are obtained by taking an inverse transform of the log of the filterbank parameters. In the case of the MFC coefficients, a cosine transform is applied to the log power spectrum, whereas a root Linear Predictive Coding (LPC) analysis is used to obtain the PLP cepstrum coefficients. Both sets of features have been used with success for large-vocabulary continuous speech recognition (LVCSR), but PLP analysis has been found for some systems to be more robust in the presence of background noise. The set of cepstral coefficients associated with a windowed portion of the signal is referred to as a frame or a parameter vector.

Cepstral mean removal (subtraction of the mean from all input frames) is commonly used to reduce the dependency on the acoustic recording conditions. Computing the cepstral mean requires that all of the signal is available prior to processing, which is not the case for certain applications where processing needs to be synchronous with recording. In this case, a modified form of cepstral subtraction can be carried out where a running mean is computed from the N last frames (N is often on the order of 100, corresponding to 1 s of speech).

In order to capture the dynamic nature of the speech signal, it is common to augment the feature vector with ‘delta’ parameters. The delta parameters are computed by taking the first and second differences of the parameters in successive frames.

Over the last decade there has been a growing interest in capturing longer-term dynamics of speech than the standard cepstral features. A variety of techniques have been proposed, from simple concatenation of sequential frames to the use of TempoRAl Patterns (TRAPs) (Hermansky and Sharma 1998). In all cases the wider context results in a larger number of parameters that consequently need to be reduced. Discriminative classifiers such as Multi-Layer Perceptrons (MLPs), a type of neural network, are efficient methods for discriminative feature estimation. Over the years, several groups have developed mature techniques for extracting probabilistic MLP features and incorporating them in speech-to-text systems (Zhu et al. 2005; Stolcke et al. 2006). While probabilistic features have not been shown to consistently outperform cepstral features in LVCSR, being complementary they have been shown to significantly improve performance when used together (Fousek et al. 2008).
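
As a rough illustration of such a front end, the sketch below computes MFCC features with cepstral mean removal and delta/delta-delta parameters using the librosa library; the file name, sampling rate, and analysis settings are illustrative choices rather than a prescribed recipe.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)      # 16 kHz microphone speech

# 25 ms Hamming-windowed frames with a 10 ms shift, 13 cepstral coefficients.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), window="hamming",
)

# Cepstral mean removal: subtract the per-coefficient mean over the utterance.
mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)

# First and second differences ('delta' and 'delta-delta' parameters).
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, d1, d2])                 # 39-dimensional vectors
print(features.shape)                                # (39, number_of_frames)
```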

33.2.2 Acoustic Models

Hidden Markov models are widely used to model the sequences of acoustic feature vectors (Rabiner and Juang 1986). These models are popular as they are performant and their parameters can be efficiently estimated using well-established techniques. They are used to model the production of speech feature vectors in two steps. First, a Markov chain is used to generate a sequence of states, and then speech vectors are drawn using a probability density function (PDF) associated with each state. The Markov chain is described by the number of states and the transition probabilities between states.

The most widely used elementary acoustic units in LVCSR systems are phone-based, where each phone is represented by a Markov chain with a small number of states and phones usually correspond to phonemes. Phone-based models offer the advantage that recognition lexicons can be described using the elementary units of the given language, and thus benefit from many linguistic studies. It is of course possible to perform speech recognition without using a phonemic lexicon, either by use of ‘word models’ (as was the more commonly used approach 20 years ago) or a different mapping such as the fenones (Bahl et al. 1988). Compared with larger units (such as words, syllables, demisyllables), small subword units reduce the number of parameters, enable cross-word modelling, facilitate porting to new vocabularies, and most importantly, can be associated with back-off mechanisms to model rare contexts. Fenones offer the additional advantage of automatic training, but lack the ability to include a priori linguistic models. For some languages, most notably tonal languages such as Chinese, longer units corresponding to syllables or demisyllables (also called onsets and offsets or initials and finals) have been explored. While the use of larger units remains relatively limited compared to phone units, they may better capture tone information and may be well suited to casual speaking styles.

While different topologies have been proposed, all make use of left-to-right state sequences in order to capture the spectral change across time. The most commonly used configurations have between three and five emitting states per model, where the number of states imposes a minimal time duration for the unit. Some configurations allow certain states to be skipped, so as to reduce the required minimal duration. The probability of an observation (i.e. a speech vector) is assumed to be dependent only on the state, which is known as a first-order Markov assumption.

Strictly speaking, given an n-state HMM with parameter vector \(\lambda\), the HMM stochastic process is described by the following joint probability density function \(f(x, s \mid \lambda)\) of the observed signal \(x = (x_1, \ldots, x_T)\) and the unobserved state sequence \(s = (s_0, \ldots, s_T)\):

\[ f(x, s \mid \lambda) \;=\; \pi_{s_0} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, f(x_t \mid s_t) \]

where \(\pi_i\) is the initial probability of state \(i\), \(a_{ij}\) is the transition probability from state \(i\) to state \(j\), and \(f(\cdot \mid s)\) is the emitting PDF associated with each state \(s\). Figure 33.2 shows a three-state HMM with the associated transition probabilities and observation PDFs.

A typical three-state phone HMM with no skip state (top) which generates feature vectors \((x_1, \ldots, x_n)\) representing speech segments

A given HMM can represent a phone without consideration of its neighbours (context-independent or monophone model) or a phone in a particular context (context-dependent model). The context may or may not include the position of the phone within the word (word-position dependent), and word-internal and cross-word contexts may be merged or treated as separate models. The use of cross-word contexts complicates decoding (see section 33.5). Different approaches are used to select the contextual units, based on frequency, clustering techniques, or decision trees, and different context types have been investigated: single-phone contexts, triphones, generalized triphones, quadphones and quinphones, with and without position dependency (within-word or cross-word). The model states are often clustered so as to reduce the model size, resulting in what are referred to as ‘tied-state’ models.

Acoustic model training consists of estimating the parameters of each HMM. For continuous density Gaussian mixture HMMs, this requires estimating the means and covariance matrices, the mixture weights and the transition probabilities. The most popular approaches make use of the Maximum Likelihood (ML) criterion, ensuring the best match between the model and the training data (assuming that the size of the training data is sufficient to provide robust estimates).

Estimation of the model parameters is usually done with the Expectation Maximization (EM) algorithm (Dempster et al. 1977), which is an iterative procedure starting with an initial set of model parameters. The model states are then aligned to the training data sequences and the parameters are re-estimated based on this new alignment using the Baum–Welch re-estimation formulas (Baum et al. 1970; Liporace 1982; Juang 1985). This algorithm guarantees that the likelihood of the training data given the model increases at each iteration. In the alignment step a given speech frame can be assigned to multiple states (with probabilities summing to 1) using the forward-backward algorithm, or to a single state (with probability 1) using the Viterbi algorithm. This second approach yields a slightly lower likelihood, but in practice there is very little difference in accuracy, especially when large amounts of data are available. It is important to note that the EM algorithm does not guarantee finding the true ML parameter values, and even when the true ML estimates are obtained they may not be the best ones for speech recognition. Therefore, some implementation details such as a proper initialization procedure and the use of constraints on the parameter values can be quite important.
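
As a toy illustration of this EM training loop (far from a full LVCSR recipe), the third-party hmmlearn library exposes Baum–Welch training of Gaussian-mixture HMMs; the random features, number of states, and mixture sizes below are placeholders.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 39))     # 500 frames of 39-dimensional features
lengths = [250, 250]                   # two training "utterances"

# A 3-state model with 4 diagonal-covariance Gaussians per state,
# trained with 10 EM (Baum-Welch) iterations.
model = hmm.GMMHMM(n_components=3, n_mix=4, covariance_type="diag", n_iter=10)
model.fit(X, lengths)

print(model.monitor_.converged, model.score(X, lengths))
```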

Since the goal of training is to find the best model to account for the observed data, the performance of the recognizer is critically dependent upon the representativity of the training data. Some methods to reduce this dependency are discussed in the next subsection. Speaker independence is obtained by estimating the parameters of the acoustic models on large speech corpora containing data from a large speaker population. There are substantial differences in speech from male and female talkers arising from anatomical differences (on average females have a shorter vocal tract length resulting in higher formant frequencies, as well as a higher fundamental frequency) and social ones (female voice is often ‘breathier’ caused by incomplete closure of the vocal folds). It is thus common practice to use separate models for male and female speech in order to improve recognition performance, which requires automatic identification of the gender.

Previously only used for small-vocabulary tasks (Bahl et al. 1986), discriminative training of acoustic models for large-vocabulary speech recognition using Gaussian mixture hidden Markov models was introduced in Povey and Woodland (2000). Different criteria have been proposed, such as maximum mutual information estimation (MMIE), minimum classification error (MCE), minimum word error (MWE), and minimum phone error (MPE). Such methods can be combined with the model adaptation techniques described in the next section.

33.2.3 Adaptation

The performance of speech recognizers drops substantially when there is a mismatch between training and testing conditions. Several approaches can be used to minimize the effects of such a mismatch, so as to obtain a recognition accuracy as close as possible to that obtained under matched conditions. Acoustic model adaptation can be used to compensate for mismatches between the training and testing conditions, such as differences in acoustic environment, microphones and transmission channels, or particular speaker characteristics. The techniques are commonly referred to as noise compensation, channel adaptation, and speaker adaptation, respectively. Since in general no prior knowledge of the channel type, the background noise characteristics, or the speaker is available, adaptation is performed using only the test data in an unsupervised mode.

The same tools can be used in acoustic model training in order to compensate for sparse data, as in many cases only limited representative data are available. The basic idea is to use a small amount of representative data to adapt models trained on other large sources of data. Some typical uses are to build gender-specific, speaker-specific, or task-specific models, and to use speaker adaptive training (SAT) to improve performance. When used for model adaptation during training, it is common to use the true transcription of the data, known as supervised adaptation.

Three commonly used schemes to adapt the parameters of an HMM can be distinguished: Bayesian adaptation ( Gauvain and Lee 1994 ); adaptation based on linear transformations ( Leggetter and Woodland 1995 ); and model composition techniques ( Gales and Young 1995 ). Bayesian estimation can be seen as a way to incorporate prior knowledge into the training procedure by adding probabilistic constraints on the model parameters. The HMM parameters are still estimated with the EM algorithm but using maximum a posteriori (MAP) re-estimation formulas ( Gauvain and Lee 1994 ). This leads to the so-called MAP adaptation technique where constraints on the HMM parameters are estimated based on parameters of an existing model. Speaker-independent acoustic models can serve as seed models for gender adaptation using the gender-specific data. MAP adaptation can be used to adapt to any desired condition for which sufficient labelled training data are available. Linear transforms are powerful tools to perform unsupervised speaker and environmental adaptation. Usually these transformations are ML-trained and are applied to the HMM Gaussian means, but can also be applied to the Gaussian variance parameters. This ML linear regression (MLLR) technique is very appropriate to unsupervised adaptation because the number of adaptation parameters can be very small. MLLR adaptation can be applied to both the test data and training data. Model composition is mostly used to compensate for additive noise by explicitly modelling the background noise (usually with a single Gaussian) and combining this model with the clean speech model. This approach has the advantage of directly modelling the noisy channel as opposed to the blind adaptation performed by the MLLR technique when applied to the same problem.
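
As a concrete illustration of the linear-transform approach, MLLR typically re-estimates the Gaussian means through an affine transform shared across many Gaussians (a sketch of the standard formulation):

\[ \hat{\mu}_k \;=\; A\,\mu_k + b \]

where \(A\) and \(b\) are estimated by maximum likelihood on the adaptation data, so that only a small number of parameters need to be learned even when the model contains thousands of Gaussians.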

The chosen adaptation method depends on the type of mismatch and on the amount of available adaptation data. The adaptation data may be part of the training data, as in adaptation of acoustic seed models to a new corpus or a subset of the training material (specific to gender, dialect, speaker, or acoustic condition) or can be the test data (i.e. the data to be transcribed). In the former case, supervised adaptation techniques can be applied, as the reference transcription of the adaptation data can be readily available. In the latter case, only unsupervised adaptation techniques can be applied.

33.2.4 Deep Neural Networks

In addition to using MLPs for feature extraction, neural networks (NNs) can also be used to estimate the HMM state likelihoods in place of using Gaussian mixtures. This approach relying on very large MLPs (the so-called deep neural networks or DNNs) has been very successful in recent years, leading to some significant reduction of the error rates ( Hinton et al. 2012 ). In this case, the neural network outputs correspond to the states of the acoustic model and they are used to predict the state posterior probabilities. The NN output probabilities are divided by the state prior probabilities to get likelihoods that can be used to replace the GMM likelihoods. Given the large number of context-dependent HMM states used in state-of-the-art systems, the number of targets can be over 10,000, which leads to an MLP with more than 10 million weights.
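
The conversion from network outputs to scores usable by the HMM decoder is just this division by the state priors, usually done in the log domain; here is a small sketch with illustrative numbers.

```python
import numpy as np

log_posteriors = np.log(np.array([[0.7, 0.2, 0.1],    # P(state | frame), one row per frame
                                  [0.1, 0.6, 0.3]]))
log_priors = np.log(np.array([0.5, 0.3, 0.2]))        # P(state) estimated from alignments

# The scaled likelihood f(frame | state) is proportional to
# P(state | frame) / P(state).
scaled_log_likelihoods = log_posteriors - log_priors
print(scaled_log_likelihoods)
```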

33.3 Lexical and Pronunciation Modelling

The lexicon is the link between the acoustic-level representation and the word sequence output by the speech recognizer. Lexical design entails two main parts: definition and selection of the vocabulary items and representation of each pronunciation entry using the basic acoustic units of the recognizer. Recognition performance is obviously related to lexical coverage, and the accuracy of the acoustic models is linked to the consistency of the pronunciations associated with each lexical entry.

The recognition vocabulary is usually selected to maximize lexical coverage for a given size lexicon. Since on average, each out-of-vocabulary (OOV) word causes more than a single error (usually between 1.5 and two errors), it is important to judiciously select the recognition vocabulary. Word list selection is discussed in section 33.4 . Associated with each lexical entry are one or more pronunciations, described using the chosen elementary units (usually phonemes or phone-like units). This set of units is evidently language-dependent. For example, some commonly used phone set sizes are about 45 for English, 49 for German, 35 for French, and 26 for Spanish. In generating pronunciation baseforms, most lexicons include standard pronunciations and do not explicitly represent allophones. This representation is chosen as most allophonic variants can be predicted by rules, and their use is optional. More importantly, there is often a continuum between different allophones of a given phoneme and the decision as to which occurred in any given utterance is subjective. By using a phonemic representation, no hard decision is imposed, and it is left to the acoustic models to represent the observed variants in the training data. While pronunciation lexicons are usually (at least partially) created manually, several approaches to automatically learn and generate word pronunciations have been investigated ( Cohen 1989 ; Riley and Ljojle 1996 ).

There are a variety of words for which frequent alternative pronunciation variants are observed that are not allophonic differences. An example is the suffix -ization, which can be pronounced with a diphthong (/aɪ/) or a schwa (/ə/). Alternate pronunciations are also needed for homographs (words spelled the same, but pronounced differently) which reflect different parts of speech (verb or noun) such as excuse, record, produce. Some common three-syllable words such as interest and company are often pronounced with only two syllables. Figure 33.3 shows two examples of the word interest by different speakers reading the same text prompt: ‘In reaction to the news, interest rates plunged …’. The pronunciations are those chosen by the recognizer during segmentation using forced alignment. In the example on the left, the /t/ is deleted, and the /n/ is produced as a nasal flap. In the example on the right, the speaker said the word with two syllables, the second starting with a /tr/ cluster. Segmenting the training data without pronunciation variants is illustrated in the middle. Whereas no /t/ is observed in the first example, two /t/ segments were aligned. An optimal alignment with a pronunciation dictionary including all required variants is shown on the bottom. Better alignment results in more accurate acoustic phone models. Careful lexical design improves speech recognition performance.

In speech from fast speakers or speakers with relaxed speaking styles it is common to observe poorly articulated (or skipped) unstressed syllables, particularly in long words with sequences of unstressed syllables. Although such long words are typically well recognized, often a nearby function word is deleted. To reduce these kinds of errors, alternate pronunciations for long words such as positioning (/pǝzIʃǝnɨŋ/ or /pǝzIʃnɨŋ/), can be included in the lexicon allowing schwa deletion or syllabic consonants in unstressed syllables. Compound words have also been used as a way to represent reduced forms for common word sequences such as ‘did you’ pronounced as ‘dija’ or ‘going to’ pronounced as ‘gonna’. Alternatively, such fluent speech effects can be modelled using phonological rules ( Oshika et al. 1975 ). The principle behind the phonological rules is to modify the allowable phone sequences to take into account such variations. These rules are optionally applied during training and recognition. Using phonological rules during training results in better acoustic models, as they are less ‘polluted’ by wrong transcriptions. Their use during recognition reduces the number of mismatches. The same mechanism has been used to handle liaisons, mute-e, and final consonant cluster reduction for French. Most of today’s state-of-the-art systems include pronunciation variants in the dictionary, associating pronunciation probabilities with the variants ( Bourlard et al. 1999 ; Fosler-Lussier et al. 2005 ).

Spectrograms of the word interest with pronunciation variants: /InɝIs/ (left) and /IntrIs/ (right) taken from the WSJ corpus (sentences 20tc0106, 40lc0206). The grid is 100 ms by 1 kHz. Segmentation of these utterances with a single pronunciation of interest /IntrIst/ (middle) and with multiple variants /IntrIst/ /IntrIs/ /InɝIs/ (bottom).
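
To make the idea of pronunciation variants concrete, here is a tiny, purely illustrative lexicon fragment; the ARPAbet-style symbols are invented for the example and are not taken from any actual system dictionary.

```python
# Each word maps to one or more pronunciation variants.
lexicon = {
    "interest": [
        "IH N T R IH S T",   # canonical form
        "IH N T R IH S",     # final /t/ deleted
        "IH N ER IH S",      # /t/ deleted, /n/ realised as a nasal flap
    ],
    "gonna": ["G AH N AH"],  # reduced compound form of "going to"
}

# During forced alignment the recogniser is free to pick whichever variant
# best matches the acoustics of each spoken token.
for word, variants in lexicon.items():
    print(word, "->", " | ".join(variants))
```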

As speech recognition research has moved from read speech to spontaneous and conversational speech styles, the phone set has been expanded to include non-speech events. These can correspond to noises produced by the speaker (breath noise, coughing, sneezing, laughter, etc.) or can correspond to external sources (music, motor, tapping, etc.). There has also been growing interest in exploring multilingual modelling at the acoustic level, with IPA or Unicode representations of the underlying units (see Gales et al. 2015 ; Dalmia et al. 2018 ).

33.4 Language Modelling

Language models (LMs) are used in speech recognition to estimate the probability of word sequences. Grammatical constraints can be described using a context-free grammar (for small to medium-size vocabulary tasks these are usually manually elaborated) or can be modelled stochastically, as is common for LVCSR. The most popular statistical methods are n-gram models, which attempt to capture the syntactic and semantic constraints by estimating the frequencies of sequences of n words. The assumption is made that the probability of a given word string \((w_1, w_2, \ldots, w_k)\) can be approximated by \(\prod_{i=1}^{k} \Pr(w_i \mid w_{i-n+1}, \ldots, w_{i-2}, w_{i-1})\), therefore reducing the word history to the preceding \(n-1\) words.

A back-off mechanism is generally used to smooth the estimates of the probabilities of rare n-grams by relying on a lower-order n-gram when there is insufficient training data, and to provide a means of modelling unobserved word sequences (Katz 1987). For example, if there are not enough observations for a reliable ML estimate of a 3-gram probability, it is approximated as follows: $\Pr(w_i \mid w_{i-1}, w_{i-2}) \simeq \Pr(w_i \mid w_{i-1})\,B(w_{i-1}, w_{i-2})$, where $B(w_{i-1}, w_{i-2})$ is a back-off coefficient needed to ensure that the total probability mass is still 1 for the given context. Based on this equation, many methods have been proposed to implement this smoothing.
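
To make the n-gram and back-off ideas concrete, the following minimal Python sketch estimates bigram probabilities from a toy corpus and backs off to smoothed unigrams for unseen word pairs. The fixed absolute discount and the add-one unigram smoothing are illustrative simplifications chosen for readability; they are not Katz’s exact estimator, and a real LM would be trained on millions of words.

```python
from collections import Counter

# Toy corpus; real language models are trained on millions of words.
corpus = ("in reaction to the news interest rates plunged "
          "and interest in the news grew").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N, V = len(corpus), len(unigrams)
DISCOUNT = 0.5          # fixed absolute discount (illustrative simplification)

def p_unigram(w):
    # Add-one smoothing so unseen words still get non-zero mass.
    return (unigrams[w] + 1) / (N + V + 1)

def backoff_weight(prev):
    # Probability mass freed by discounting the observed bigrams for this
    # history, renormalized over the unigram mass of the unseen continuations.
    seen = [w for (p, w) in bigrams if p == prev]
    freed = DISCOUNT * len(seen) / unigrams[prev]
    unseen_mass = 1.0 - sum(p_unigram(w) for w in seen)
    return freed / unseen_mass if unseen_mass > 0 else 0.0

def p_bigram(w, prev):
    if bigrams[(prev, w)] > 0:                    # enough data: use the bigram
        return (bigrams[(prev, w)] - DISCOUNT) / unigrams[prev]
    return backoff_weight(prev) * p_unigram(w)    # otherwise back off

print(p_bigram("rates", "interest"))   # observed bigram
print(p_bigram("grew", "interest"))    # unseen bigram, backed off to unigram
```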

While trigram LMs are the most widely used, higher-order (n > 3) and word-class-based n-grams (in which counts are based on sets of words rather than individual lexical items), as well as adapted LMs, are active research areas aiming to improve LM accuracy. Neural network language models have been used to address the data sparseness problem by performing the estimation in a continuous space (Bengio et al. 2001).
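
As an illustration of continuous-space estimation, the sketch below follows the general shape of a feed-forward neural network LM in the spirit of Bengio et al. (2001): the n−1 context words are embedded, concatenated, and passed through a hidden layer to predict a distribution over the vocabulary. It assumes PyTorch is available, and the vocabulary size and layer dimensions are arbitrary illustrative values rather than settings from any published system.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Minimal feed-forward neural LM sketch (illustrative dimensions)."""
    def __init__(self, vocab_size=5000, context=3, emb_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):                 # (batch, context)
        e = self.embed(context_ids)                  # (batch, context, emb_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))    # concatenate embeddings
        return self.out(h)                           # unnormalized scores

model = FeedForwardLM()
ctx = torch.randint(0, 5000, (8, 3))                 # a batch of word-id contexts
log_probs = torch.log_softmax(model(ctx), dim=-1)    # continuous-space estimates
```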

Given a large text corpus it may seem relatively straightforward to construct n -gram language models. Most of the steps are pretty standard and make use of tools that count word and word sequence occurrences. The main differences arise in the choice of the vocabulary and in the definition of words, such as the treatment of compound words or acronyms, and the choice of the back-off strategy. There is, however, a significant amount of effort needed to process the texts before they can be used.

A common motivation for normalization in all languages is to reduce lexical variability so as to increase the coverage for a fixed-size-task vocabulary. Normalization decisions are generally language-specific. Much of the speech recognition research for American English has been supported by ARPA and has been based on text materials which were processed to remove upper/lower-case distinction and compounds. Thus, for instance, no lexical distinction is made between Gates, gates or Green, green . In the French Le Monde corpus, capitalization of proper names is distinctive with different lexical entries for Pierre, pierre or Roman, roman .

The main conditioning steps are text mark-up and conversion. Text mark-up consists of tagging the texts (article, paragraph, and sentence markers) and garbage bracketing (which includes not only corrupted text materials, but all text material unsuitable for sentence-based language modelling, such as tables and lists). Numerical expressions are typically expanded to approximate the spoken form ($150 → one hundred and fifty dollars). Further semi-automatic processing is necessary to correct frequent errors inherent in the texts (such as the obvious misspellings milllion and officals) or arising from processing with the distributed text processing tools. Some normalizations can be considered ‘decompounding’ rules in that they modify the word boundaries and the total number of words. These concern the processing of ambiguous punctuation markers (such as the hyphen and apostrophe), the processing of digit strings, and the treatment of abbreviations and acronyms (ABCD → A. B. C. D.). Other normalizations (such as sentence-initial capitalization and case distinction) keep the total number of words unchanged, but reduce graphemic variability. In general, the choice is a compromise between producing an output close to the correct standard written form of the language and lexical coverage, with the final choice of normalization being largely application-driven.
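
The following toy normalizer illustrates two of the conversions mentioned above: expanding small dollar amounts to words and splitting acronyms into spelled letters. The rules and coverage are deliberately minimal and hypothetical; production systems rely on large, language-specific rule sets and exception lists.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out 0 <= n < 1000 (toy coverage, British-style 'and')."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("" if n % 10 == 0 else "-" + ONES[n % 10])
    rest = n % 100
    words = ONES[n // 100] + " hundred"
    return words if rest == 0 else words + " and " + number_to_words(rest)

def normalize(text):
    # $150 -> one hundred and fifty dollars
    text = re.sub(r"\$(\d{1,3})\b",
                  lambda m: number_to_words(int(m.group(1))) + " dollars", text)
    # ABCD -> A. B. C. D.  (runs of 2+ capital letters treated as acronyms)
    text = re.sub(r"\b([A-Z]{2,})\b",
                  lambda m: " ".join(c + "." for c in m.group(1)), text)
    return text

print(normalize("The ABCD deal was worth $150."))
# -> "The A. B. C. D. deal was worth one hundred and fifty dollars."
```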

Better language models can be obtained using texts transformed to be closer to the observed reading style, where the transformation rules and corresponding probabilities are automatically derived by aligning prompt texts with the transcriptions of the acoustic data. For example, the word hundred followed by a number can be replaced by hundred and 50% of the time; 50% of the occurrences of one eighth are replaced by an eighth, and 15% of million dollars are replaced with simply million.

In practice, the selection of words is done so as to minimize the system’s OOV rate by including the most useful words. By useful we mean that the words are expected as an input to the recognizer, but also that the LM can be trained given the available text corpora. In order to meet the latter condition, it is common to choose the N most frequent words in the training data. This criterion does not, however, guarantee the usefulness of the lexicon, since no consideration of the expected input is made. Therefore, it is common practice to use a set of additional development data to select a word list adapted to the expected test conditions.

There is sometimes a conflict between the need for sufficient amounts of text data to estimate LM parameters and the need to ensure that the data are representative of the task. It is also common that different types of LM training material are available in differing quantities. One easy way to combine training material from different sources is to train a language model for each source and to interpolate them. The interpolation weights can be directly estimated on some development data with the EM algorithm. An alternative is to simply merge the n-gram counts and train a single language model on these counts. If some data sources are more representative than others for the task, the n-gram counts can be empirically weighted to minimize the perplexity on a set of development data. While this can be effective, it has to be done by trial and error and cannot easily be optimized. In addition, weighting the n-gram counts can pose problems in properly estimating the back-off coefficients. For these reasons, the language models in most of today’s state-of-the-art systems are obtained via interpolation, which also allows for task adaptation by simply modifying the interpolation coefficients (Chen et al. 2004; Liu et al. 2008).
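
A minimal sketch of the interpolation approach is given below: the mixture weights of several component LMs are estimated on development data with EM. The component models are assumed to be given as functions returning Pr(w | history); the toy models in the usage example are invented purely to make the snippet runnable.

```python
def em_interpolation_weights(dev_ngrams, lm_probs, iters=20):
    """Estimate linear-interpolation weights for several component LMs.

    dev_ngrams: list of (history, word) pairs from development data.
    lm_probs:   list of functions p(word, history) -> probability,
                one per component LM (assumed to be already trained).
    """
    k = len(lm_probs)
    weights = [1.0 / k] * k                          # start from uniform weights
    for _ in range(iters):
        expected = [0.0] * k
        for hist, word in dev_ngrams:
            comps = [w * p(word, hist) for w, p in zip(weights, lm_probs)]
            total = sum(comps) or 1e-12
            for j in range(k):                        # E-step: posterior per LM
                expected[j] += comps[j] / total
        weights = [e / len(dev_ngrams) for e in expected]   # M-step
    return weights

# Toy usage with two fake component models (assumptions, for illustration only):
lm_a = lambda w, h: 0.2 if w == "rates" else 0.1
lm_b = lambda w, h: 0.05
dev = [("interest", "rates"), ("interest", "rates"), ("the", "news")]
print(em_interpolation_weights(dev, [lm_a, lm_b]))
```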

The relevance of a language model is usually measured in terms of test-set perplexity, defined as $\mathrm{Px} = \Pr(\text{text} \mid \text{LM})^{-1/n}$, where n is the number of words in the text. The perplexity is a measure of the average branching factor, i.e. the vocabulary size of a memoryless uniform language model with the same entropy as the language model under consideration.
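
As a quick check of this definition, the short snippet below computes perplexity from per-word log-probabilities; a memoryless uniform model over a 1,000-word vocabulary indeed comes out with a perplexity of 1,000.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities Pr(w_i | history)."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# A uniform model over a 1,000-word vocabulary has perplexity 1,000:
print(perplexity([math.log(1 / 1000)] * 50))   # -> 1000.0
```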

33.5 Decoding

In this section we discuss the LVCSR decoding problem, which is the design of an efficient search algorithm to deal with the huge search space obtained by combining the acoustic and language models. Strictly speaking, the aim of the decoder is to determine the word sequence with the highest likelihood, given the lexicon and the acoustic and language models. In practice, however, it is common to search for the most likely HMM state sequence, i.e. the best path through a trellis (the search space) where each node associates an HMM state with a given time. Since it is often prohibitive to exhaustively search for the best path, techniques have been developed to reduce the computational load by limiting the search to a small part of the search space. Even for research purposes, where real-time recognition is not needed, there is a limit on computing resources (memory and CPU time) above which the development process becomes too costly. The most commonly used approach for small and medium vocabulary sizes is the one-pass frame-synchronous Viterbi beam search (Ney 1984), which uses a dynamic programming algorithm. This basic strategy has been extended to deal with large vocabularies by adding features such as dynamic decoding, multi-pass search, and N-best rescoring.
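
The sketch below illustrates the basic frame-synchronous Viterbi beam search over a small HMM: at each frame every active hypothesis is extended, and hypotheses whose score falls more than a fixed beam below the current best are pruned. The transition and emission scores in the toy example are made up, and a real decoder would of course operate over a full lexical and language-model search space rather than a three-state HMM.

```python
import math

def viterbi_beam(log_trans, log_emit, n_frames, beam=10.0):
    """Frame-synchronous Viterbi search with beam pruning (sketch).

    log_trans[s][t]:        log-probability of moving from state s to state t.
    log_emit(frame, state): acoustic log-score of the frame in that state.
    Only hypotheses within `beam` of the best score at each frame survive.
    """
    n_states = len(log_trans)
    active = {0: (0.0, [0])}                 # assume decoding starts in state 0
    for frame in range(n_frames):
        new = {}
        for s, (score, path) in active.items():
            for t in range(n_states):
                if log_trans[s][t] == -math.inf:
                    continue
                cand = score + log_trans[s][t] + log_emit(frame, t)
                if t not in new or cand > new[t][0]:
                    new[t] = (cand, path + [t])
        best = max(sc for sc, _ in new.values())
        # Beam pruning: drop states whose score falls too far behind the best.
        active = {t: v for t, v in new.items() if v[0] >= best - beam}
    return max(active.values())              # (best score, best state sequence)

# Toy 3-state left-to-right HMM with made-up scores (illustrative only).
NEG = -math.inf
trans = [[math.log(0.6), math.log(0.4), NEG],
         [NEG, math.log(0.7), math.log(0.3)],
         [NEG, NEG, math.log(1.0)]]
emit = lambda frame, state: -abs(frame - state)   # fake acoustic scores
print(viterbi_beam(trans, emit, n_frames=4))
```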

Dynamic decoding can be combined with efficient pruning techniques in order to obtain a single-pass decoder that can provide the answer using all the available information (i.e. that in the models) in a single forward decoding pass over the speech signal. This kind of decoder is very attractive for real-time applications. Multi-pass decoding is used to progressively add knowledge sources in the decoding process and allows the complexity of the individual decoding passes to be reduced. For example, a first decoding pass can use a 2-gram language model and simple acoustic models, and later passes will make use of 3-gram and 4-gram language models with more complex acoustic models. This multiple-pass paradigm requires a proper interface between passes in order to avoid losing information and engendering search errors. Information is usually transmitted via word graphs, although some systems use N-best hypotheses (a list of the most likely word sequences with their respective scores). This approach is not well suited to real-time applications since no hypothesis can be returned until the entire utterance has been processed.
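
The following fragment sketches the N-best rescoring interface between passes: each first-pass hypothesis is re-scored by combining its acoustic score with a weighted score from a stronger LM (plus an optional word insertion penalty), and the best combined hypothesis is returned. The scores, weights, and the fake 4-gram LM in the example are assumptions for illustration only.

```python
def rescore_nbest(nbest, lm_score, lm_weight=0.8, word_penalty=0.0):
    """Rescore an N-best list with a stronger language model (sketch).

    nbest:    list of (word_sequence, acoustic_score) pairs from a first pass,
              with scores in the log domain.
    lm_score: function returning the log-probability of a word sequence under
              the higher-order LM applied in the second pass.
    """
    rescored = []
    for words, ac_score in nbest:
        total = ac_score + lm_weight * lm_score(words) + word_penalty * len(words)
        rescored.append((total, words))
    return max(rescored)[1]          # hypothesis with the best combined score

# Toy usage with made-up scores and a fake higher-order LM (assumptions):
hyps = [(["interest", "rates", "plunged"], -120.3),
        (["interest", "rate", "splurged"], -119.8)]
fake_lm = lambda ws: -2.0 if ws[-1] == "plunged" else -15.0
print(rescore_nbest(hyps, fake_lm))   # -> ['interest', 'rates', 'plunged']
```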

It can sometimes be difficult to add certain knowledge sources into the decoding process especially when they do not fit in the Markovian framework (i.e. short-distance dependency modelling). For example, this is the case when trying to use segmental information or to use grammatical information for long-term agreement. Such information can be more easily integrated in multi-pass systems by rescoring the recognizer hypotheses after applying the additional knowledge sources.

Mangu, Brill, and Stolcke (2000) proposed the technique of confusion network decoding (also called consensus decoding), which minimizes an approximate WER, as opposed to MAP decoding, which minimizes the sentence error rate (SER). This technique has since been adopted in most state-of-the-art systems, resulting in lower WERs and better confidence scores. Confidence scores are a measure of the reliability of the recognition hypotheses, and give an estimate of the word error rate (WER). For example, an average confidence of 0.9 corresponds to a word error rate of 10% if deletions are ignored. Jiang (2004) provides an overview of confidence measures for speech recognition, commenting on the capacity and limitations of the techniques.

33.6 State-of-the-Art Performance

The last decade has seen large performance improvements in speech recognition, particularly for large-vocabulary, speaker-independent, continuous speech. This progress has been substantially aided by the availability of large speech and text corpora and by significant increases in computer processing capabilities which have facilitated the implementation of more complex models and algorithms. 1 In this section we provide some illustrative results for different LVCSR tasks, but make no attempt to be exhaustive.

The commonly used metric for speech recognition performance is the ‘word error’ rate, which is a measure of the average number of errors taking into account three error types with respect to a reference transcription: substitutions (one word is replaced by another word), insertions (a word is hypothesized that was not in the reference), and deletions (a word is missed). The word error rate is defined as $\mathrm{WER} = \frac{\#\text{subs} + \#\text{ins} + \#\text{del}}{\#\text{reference words}}$, and is typically computed after a dynamic programming alignment of the reference and hypothesized transcriptions. Note that given this definition the word error can be more than 100%.
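
The snippet below computes this measure with the standard dynamic-programming alignment (a Levenshtein distance over words); the example transcription pair is invented to show one deletion, one substitution, and one insertion against an eight-word reference.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with the standard dynamic-programming (Levenshtein) alignment."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit cost aligning the first i reference words
    # with the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("in reaction to the news interest rates plunged",
                      "in reaction to news interest rate plunged quickly"))
# 1 deletion + 1 substitution + 1 insertion over 8 reference words -> 0.375
```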

Three types of tasks can be considered: small-vocabulary tasks, such as isolated command words, digits or digit strings; medium-size (1,000–3,000-word) vocabulary tasks such as are typically found in spoken dialogue systems (Chapter 44 ); and large-vocabulary tasks (typically over 100,000 words). Another dimension is the speaking style which can be read, prepared, spontaneous, or conversational. Very low error rates have been reported for small-vocabulary tasks, below 1% for digit strings, which has led to some commercial products, most notably in the telecommunications domain. Early benchmark evaluations focused on read speech tasks: the state of the art in speaker-independent, continuous speech recognition in 1992 is exemplified by the Resource Management task (1,000-word vocabulary, word-pair grammar, four hours of acoustic training data) with a word error rate of 3%. In 1995, on read newspaper texts (the Wall Street Journal task, 160 hours of acoustic training data and 400 million words of language model texts) word error rates around 8% were obtained using a 65,000-word vocabulary. The word errors roughly doubled for speech in the presence of noise, or on texts dictated by journalists. The maturity of the technology led to the commercialization of speaker-dependent continuous speech dictation systems for which comparable benchmarks are not publicly available.

Over the last decade research has focused on ‘found speech’, originating with the transcription of radio and television broadcasts and moving to any audio found on the Internet (podcasts). This was a major step for the community in that the test data are taken from a real task, as opposed to consisting of data recorded for evaluation purposes. The transcription of such varied data presents new challenges, as the signal is one continuous audio stream that contains segments of different acoustic and linguistic natures. Today well-trained transcription systems for broadcast data have been developed for at least 25 languages, achieving word error rates under 20% on unrestricted broadcast news data. The performance on studio-quality speech from announcers is often comparable to that obtained on WSJ read speech data.

Word error rates of under 20% have been reported for the transcription of conversational telephone speech (CTS) in English using the Switchboard corpus, with substantially higher WERs (30–40%) on the multiple language Callhome (Spanish, Arabic, Mandarin, Japanese, German) data and on data from the IARPA Babel Program (< http://www.iarpa.gov/index.php/research-programs/babel >; Sainath et al. 2013 ). A wide range of word error rates have been reported for the speech recognition components of spoken dialogue systems (Chapters 8 , 44 , and 45 ), ranging from under 5% for simple travel information tasks using close-talking microphones to over 25% for telephone-based information retrieval systems. It is quite difficult to compare results across systems and tasks as different transcription conventions and text normalizations are often used.

Speech-to-text (STT) systems historically produce a case-insensitive, unpunctuated output. Recently there have been a number of efforts to produce STT outputs with correct case and punctuation, as well as conversion of numbers, dates, and acronyms to a standard written form. This is essentially the reverse process of the text normalization steps described in section 33.4. Both linguistic and acoustic information (essentially pause and breath noise cues) are used to add punctuation marks in the speech recognizer output. An efficient method is to rescore word lattices that have been expanded to permit punctuation marks after each word and sentence boundaries at each pause, using a specialized case-sensitive, punctuated language model.

33.7 Discussion and Perspectives

Despite the numerous advances made over the last decade, speech recognition is far from a solved problem. Current research topics aim to develop generic recognition models with increased use of data perturbation and augmentation techniques for both acoustic and language modelling ( Ko et al. 2015 ; Huang et al. 2017 ; Park et al. 2019 ) and to use unannotated data for training purposes, in an effort to reduce the reliance on manually annotated training corpora. There has also been growing interest in End-to-End neural network models for speech recognition (< http://iscslp2018.org/Tutorials.html >, as well as tutorials at Interspeech 2019–2021, some of which also describe freely available toolkits) which aim to simultaneously train all automatic speech recognition (ASR) components optimizing the targeted evaluation metric (usually the WER), as opposed to the more traditional training described in this chapter.

Much of the progress in LVCSR has been fostered by supporting infrastructure for data collection, annotation, and evaluation. The Speech Group at the National Institute of Standards and Technology (NIST) has been organizing benchmark evaluations for a range of human language technologies (speech recognition, speaker and language recognition, spoken document retrieval, topic detection and tracking, automatic content extraction, spoken term detection) for over 20 years, recently extended to also include related multimodal technologies. 2 In recent years there has been a growing number of challenges and evaluations, often held in conjunction with major conferences, to promote research on a variety of topics. These challenges typically provide common training and testing data sets allowing different methods to be compared on a common basis.

While the performance of speech recognition technology has dramatically improved for a number of ‘dominant’ languages (English, Mandarin, Arabic, French, Spanish … ), generally speaking technologies for language and speech processing are available only for a small proportion of the world’s languages. By several estimations there are over 7,000 spoken languages in the world, but only about 15% of them are also written. Text corpora, which can be useful for training the language models used by speech recognizers, are becoming more and more readily available on the Internet. The site < http://www.omniglot.com > lists about 800 languages that have a written form.

It has often been observed that there is a large difference in recognition performance for the same system between the best and worst speakers. Unsupervised adaptation techniques do not necessarily reduce this difference; in fact, they often improve performance on good speakers more than on bad ones. Interspeaker differences are found not only at the acoustic level, but also at the phonological and word levels. Today’s modelling techniques are not able to take into account speaker-specific lexical and phonological choices.

Today’s systems often also provide additional information which is useful for structuring audio data. In addition to the linguistic message, the speech signal encodes information about the characteristics of the speaker, the acoustic environment, the recording conditions, and the transmission channel. Acoustic meta-data can be extracted from the audio to provide a description, including the language(s) spoken, the speaker’s (or speakers’) accent(s), acoustic background conditions, the speaker’s emotional state, etc. Such information can be used to improve speech recognition performance, and to provide an enriched text output for downstream processing. The automatic transcription can also be used to provide information about the linguistic content of the data (topic, named entities, speech style … ). By associating each word and sentence with a specific audio segment, an automatic transcription can allow access to any arbitrary portion of an audio document. If combined with other meta-data (language, speaker, entities, topics), access via other attributes can be facilitated.

A wide range of potential applications can be envisioned based on automatic annotation of broadcast data, particularly in light of the recent explosion of such media, which requires automated processing for indexation and retrieval (Chapters 37, 38, and 40), machine translation (Chapters 35 and 36), and question answering (Chapter 39). Important future research will address keeping vocabularies up to date, language model adaptation, automatic topic detection and labelling, and enriched transcriptions providing annotations for speaker turns, language, acoustic conditions, etc. Another challenging problem is recognizing spontaneous speech collected with far-field microphones (such as meetings and interviews), which involves difficult acoustic conditions (reverberation, background noise) and often overlapping speech from different speakers.

Further Reading and Relevant Resources

An excellent reference is Corpus-Based Methods in Language and Speech Processing, edited by Young and Bloothooft (1997). This book provides an overview of currently used statistically based techniques, their basic principles and problems. A theoretical presentation of the fundamentals of the subject is given in the book Statistical Methods for Speech Recognition by Jelinek (1997). A general introductory tutorial on HMMs can be found in Rabiner (1989). Pattern Recognition in Speech and Language Processing by Chou and Juang (2003), Spoken Language Processing: A Guide to Theory, Algorithm, and System Development by Huang, Acero, and Hon (2001), and Multilingual Speech Processing by Schultz and Kirchhoff (2006) provide more advanced reading. Two recent books, The Voice in the Machine: Building Computers That Understand Speech by Roberto Pieraccini (2012), which targets general audiences, and Automatic Speech Recognition: A Deep Learning Approach by Dong Yu and Li Deng (2015), provide an overview of recent advances in the field. For general speech processing reference, the classic book Digital Processing of Speech Signals (Rabiner and Schafer 1978) remains relevant. The most recent work in speech recognition can be found in the proceedings of major conferences (IEEE ICASSP, ISCA Interspeech) and workshops (most notably DARPA/IARPA, ISCA ITRWs, IEEE ASRU, SLT), as well as the journals Speech Communication and Computer Speech and Language.

Several websites of interest are:

European Language Resources Association (ELRA), < http://www.elda.fr/en/ >.

International Speech Communication Association (ISCA) < http://www.isca-speech.org >.

Linguistic Data Consortium (LDC), < http://www.ldc.upenn.edu/ >.

NIST Spoken Natural-Language Processing, < http://www.itl.nist.gov/iad/mig/tests >.

Survey of the State of the Art in Human Language Technology, < http://www.cslu.ogi.edu/HLTsurvey >.

Languages of the world, < http://www.omniglot.com >.

OLAC: Open Language Archives Community, < http://www.language-archives.org >.

Speech recognition software, < http://en.wikipedia.org/wiki/List_of_speech_recognition_software >.

Bahl, Lalit , James Baker , Paul Cohen , N. Rex Dixon , Frederick Jelinek , Robert Mercer , and Harvey Silverman (1976). ‘Preliminary Results on the Performance of a System for the Automatic Recognition of Continuous Speech’. In Proceedings of the IEEE Conference on Acoustics Speech and Signal Processing (ICASSP-76) , Philadelphia, 425–429. IEEE.

Bahl, Lalit , Peter Brown , Peter de Souza , and Robert Mercer (1986). ‘Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition’. In Proceedings of the IEEE Conference on Acoustics Speech and Signal Processing (ICASSP-86) , Tokyo, 49–52. IEEE Press.

Bahl, Lalit , Peter Brown , Peter de Souza , Robert Mercer , and Michael Picheny (1988). ‘Acoustic Markov Models Used in the Tangora Speech Recognition System’. In Proceedings of the IEEE Conference on Acoustics Speech and Signal Processing (ICASSP-88) , New York, 497–500. IEEE Press.

Baum, Leonard , Ted Petrie , Georges Soules , and Norman Weiss ( 1970 ). ‘ A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains ’, Annals of Mathematical Statistics 41: 164–171.

Bengio, Yoshua , Réjean Ducharme , and Pascal Vincent ( 2001 ). ‘A Neural Probabilistic Language Model’. In T. Leen , T. Dietterich , and V. Tresp (eds), Advances in Neural Information Processing Systems 13 (NIPS ‘00) , 932–938. Cambridge, MA: MIT Press.

Bourlard, Hervé , Sadaoki Furui , Nelson Morgan , and Helmer Strik (eds) ( 1999 ). Special issue on ‘Modeling Pronunciation Variation for Automatic Speech Recognition ’, Speech Communication 29(2–4), November.

Chen, Langzhou , Jean-Luc Gauvain , Lori Lamel , and Gilles Adda (2004). ‘Dynamic Language Modeling for Broadcast News’. In 8th International Conference on Spoken Language Processing, ICSLP-04 , Jeju Island, Korea, 1281–1284. ISCA (International Speech Communication Association).

Chou, Wu and Juang, Biing-Hwang ( 2003 ). Pattern Recognition in Speech and Language Processing . CRC Press.

Cohen, Michael (1989). ‘Phonological Structures for Speech Recognition’. PhD thesis, University of California, Berkeley.

Dalmia, Siddharth , Ramon Sanabria , Florian Metze , and Alan W. Black (2018). ‘Sequence-based Multi-lingual Low Resource Speech Recognition’. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 4909–4913. IEEE.

Davis, Steven and Paul Mermelstein ( 1980 ). ‘ Comparison of Parametric Representations of Monosyllabic Word Recognition in Continuously Spoken Sentences ’, IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4): 357–366.

Dempster, Arthur , Nan Laird , and Donald Rubin ( 1977 ). ‘ Maximum Likelihood from Incomplete Data via the EM Algorithm ’, Journal of the Royal Statistical Society Series B (Methodological) 39(1): 1–38.

Dreyfus-Graf, Jean ( 1949 ). ‘ Sonograph and Sound Mechanics ’, Journal of the Acoustic Society of America 22: 731–739.

Dudley, Homer and S. Balashek ( 1958 ). ‘ Automatic Recognition of Phonetic Patterns in Speech ’, Journal of the Acoustical Society of America 30: 721–732.

Fosler-Lussier, Eric , William Byrne , and Dan Jurafsky (eds) ( 2005 ). Special issue on ‘Pronunciation Modeling and Lexicon Adaptation ’, Speech Communication 46(2), June.

Fousek, Petr, Lori Lamel, and Jean-Luc Gauvain (2008). ‘On the Use of MLP Features for Broadcast News Transcription’. In P. Sojka et al. (eds), TSD ’08, Lecture Notes in Computer Science 5246, 303–310. Berlin and Heidelberg: Springer-Verlag.

Jelinek, Frederick (1997). Statistical Methods for Speech Recognition. Language, Speech and Communication series. Cambridge, MA: MIT Press (Bradford Book).

Gales, Mark J. F. , Kate M. Knill , and Anton Ragni (2015). ‘Unicode-based Graphemic Systems for Limited Resource Languages’. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 5186–5190. IEEE, 2015.

Gales, Mark and Steven Young ( 1995 ). ‘ Robust Speech Recognition in Additive and Convolutional Noise Using Parallel Model Combination ’, Computer Speech and Language 9(4): 289–307.

Gauvain, Jean-Luc and Chin-Hui Lee ( 1994 ). ‘Maximum a posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains’, IEEE Transactions on Speech and Audio Processing 2(2): 291–298.

Hermansky, Hynek ( 1990 ) ‘ Perceptual Linear Predictive (PLP) Analysis of Speech ’, Journal of the Acoustic Society America 87(4): 1738–1752.

Hermansky, Hynek and Sangita Sharma ( 1998 ). ‘TRAPs—Classifiers of Temporal Patterns’. In Robert H. Mannell and Jordi Robert-Ribes (eds), 5th International Conference on Spoken Language Processing, ICSLP ’98 , Sydney, 1003–1006. Australian Speech Science and Technology Association (ASSTA).

Hinton, Geoffrey , Li Deng , Dong Yu , George Dahl , Abdelrahman Mohamed , Navdeep Jaitly , Andrew Senior , Vincent Vanhoucke , Patrick Nguyen , Tara Sainath , and Brian Kingsbury ( 2012 ). ‘ Deep Neural Networks for Acoustic Modeling in Speech Recognition ’, IEEE Signal Processing Magazine 29(6): 82–97.

Huang, Xuedong , Acero, Alex and Hon, Hsiao-Wuen ( 2001 ). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development . London: Prentice Hall.

Huang, Guangpu , Thiago Fraga Da Silva , Lori Lamel , Jean-Luc Gauvain , Arseniy Gorin , Antoine Laurent , Rasa Lileikyté , and Abdelkhalek Messaoudi ( 2017 ). ‘ An Investigation into Language Model Data Augmentation for Low-resourced STT and KWS ’. In Proceedings of the IEEE-ICASSP , 5790–5794. New Orleans.

Hunt, Melvin ( 1996 ). ‘Signal Representation’. In Ron Cole et al. (eds), Survey of the State of the Art in Human Language Technology , 10–15 (ch. 1.3). Cambridge: Cambridge University Press and Giardini, < http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.7794&rep=rep1&type=pdf >.

Jelinek, Frederik (1976). ‘ Continuous Speech Recognition by Statistical Methods ’, Proceed­ings of the IEEE: Special Issue on Man–Machine Communication by Voice 64(4), April: 532–556.

Jiang, Hui ( 2004 ). ‘ Confidence Measures for Speech Recognition: A Survey ’, Speech Communi­cation 45(4): 455–470.

Juang, Biing-Hwang ( 1985 ). ‘ Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observations of Markov Chains ’, AT&T Technical Journal 64(6), July–August: 1235–1249.

Katz, Slava ( 1987 ). ‘ Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer ’, IEEE Transactions on Acoustics, Speech, and Signal Processing 35(3): 400–401.

Ko, Tom , Vijayaditya Peddinti , Daniel Povey , and Sanjeev Khudanpur (2015). ‘Audio Augmentation for Speech Recognition’. In Sixteenth Annual Conference of the International Speech Communication Association .

Leggetter, Chris and Philip Woodland ( 1995 ). ‘ Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models ’, Computer Speech and Language 9: 171–185.

Liporace, Louis ( 1982 ). ‘ Maximum Likelihood Estimation for Multivariate Observations of Markov Sources ’, IEEE Transactions on Information Theory 28(5): 729–734.

Liu, Xunying , Mark Gales , and Philip Woodland (2008). ‘Context-Dependent Language Model Adaptation’. In Interspeech ’08: 9th Annual Conference of the International Speech Communication Association , Brisbane, 837–840. International Speech Communication Association.

Mangu, Lidia , Eric Brill , and Andreas Stolcke ( 2000 ). ‘ Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks ’, Computer Speech and Language 144: 373–400.

Ney, Hermann (1984). ‘ The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition ’, IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2), April: 263–271.

Oshika, Beatrice , Victor Zue , Rollin Weeks , Helene Neu , and Joseph Aurbach ( 1975 ). ‘ The Role of Phonological Rules in Speech Understanding Research ’, IEEE Transactions on Acoustics, Speech, and Signal Processing 23: 104–112.

Park, Daniel S. , William Chan , Yu Zhang , Chung-Cheng Chiu , Barret Zoph , Ekin D. Cubuk , and Quoc V. Le (2019). ‘Specaugment: A Simple Data Augmentation Method for Automatic Speech Recognition’. arXiv preprint arXiv:1904.08779.

Povey, Daniel and Philip Woodland (2000). ‘Large-Scale MMIE Training for Conversational Telephone Speech Recognition’, in Proceedings of the NIST Speech Transcription Workshop , College Park, MD.

Rabiner, Lawrence ( 1989 ). ‘ A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition ’, Proceedings of the IEEE 77(2), February: 257–286.

Rabiner, Lawrence and Biing-Hwang Juang ( 1986 ). ‘ An Introduction to Hidden Markov Models ’, IEEE Acoustics, Speech, and Signal Processing Magazine (ASSP) 3(1), January: 4–16.

Rabiner, Lawrence and Ronald Schafer ( 1978 ). Digital Processing of Speech Signals . Englewood Cliffs, NJ and London: Prentice-Hall.

Riley, Michael and Andrej Ljojle ( 1996 ). ‘Automatic Generation of Detailed Pronunciation Lexicons’. In Chin-Hui Lee , Frank K. Soong , and Kuldip K. Paliwal (eds), Automatic Speech and Speaker Recognition , 285–301. Dordrecht: Kluwer Academic Publishers.

Pieraccini, Roberto (2012). The Voice in the Machine: Building Computers that Understand Speech. Cambridge, MA: MIT Press.

Sainath, Tara , Brian Kingsbury , Florian Metze , Nelson Morgan , and Stavros Tsakalidis ( 2013 ). ‘ An Overview of the Base Period of the Babel Program ’, SLTC Newsletter, November, < http://www.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2013-11/BabelBaseOverview >.

Schultz, Tanja and Katrin Kirchhoff ( 2006 ). Multilingual Speech Processing . London: Academic Press.

Schwartz, Richard , Yen-Lu Chow , Salim Roucos , Michael Krasner , and John Makhoul (1984). ‘Improved Hidden Markov Modeling of Phonemes for Continuous Speech Recognition’. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’84) , San Diego, 35.6.1–35.6.4. IEEE Press.

Stolcke, Andreas , Barry Chen , Horacio Franco , Venkata Gadde , M. Graciarena , Mei-Yuh Hwang , Katrin Kirchhoff , Arindam Mandal , Nelson Morgan , Xin Lei , Tim Ng , Mari Ostendorf , Kemal Sonmez , Anand Venkataraman , Dimitra Vergyri , Wen Wang , Jing Zheng , and Qifeng Zhu ( 2006 ). ‘ Recent Innovations in Speech-to-Text Transcription at SRI-ICSI-UW ’, IEEE Transactions on Audio, Speech, and Language Processing 14(5): 1729–1744.

Vintsyuk, Taras ( 1968 ). ‘ Speech Discrimination by Dynamic Programming ’, Kibnernetika 4: 81–88.

Young, Steve and Gerrit Bloothooft (eds) ( 1997 ). Corpus-Based Methods in Language and Speech Processing . Text, Speech and Language Technology Series. Amsterdam: Springer Netherlands.

Yu, Dong and Li Deng ( 2015 ). Automatic Speech Recognition: A Deep Learning Approach . Amsterdam: Springer.

Zhu, Qifeng , Andreas Stolcke , Barry Chen , and Nelson Morgan (2005). ‘Using MLP Features in SRI’s Conversational Speech Recognition System’. In Interspeech ’05: Proceedings of the 9th European Conference on Speech Communication and Technology , Lisbon, 2141–2144.

1. These advances can be clearly seen in the context of DARPA-supported benchmark evaluations. This framework, known in the community as the DARPA evaluation paradigm, has provided the training materials (transcribed audio and textual corpora for training acoustic and language models), test data, and a common evaluation framework. The data have generally been provided by the Linguistic Data Consortium (LDC) and the evaluations organized by the National Institute of Standards and Technology (NIST) in collaboration with representatives from the participating sites and other government agencies.

2. See < http://www.nist.gov/speech/tests >.


Title: Introduction to Speech Recognition

Abstract: This document contains lectures and practical experimentation using Matlab, implementing a system that correctly classifies three spoken words (one, two, and three) with the help of a very small database. To achieve this performance, it uses specific properties of speech modeling, powerful computer algorithms (dynamic time warping and Dijkstra's algorithm), and machine learning (nearest neighbor classification). The document also introduces some machine learning evaluation metrics.

What is Automatic Speech Recognition? A Comprehensive Overview of ASR Technology

This article aims to answer the question: What is ASR?, and provide a comprehensive overview of Automatic Speech Recognition technology.

Automatic Speech Recognition, also known as ASR, is the use of Machine Learning or Artificial Intelligence (AI) technology to process human speech into readable text. The field has grown exponentially over the past decade, with ASR systems popping up in popular applications we use every day such as TikTok and Instagram for real-time captions, Spotify for podcast transcriptions, Zoom for meeting transcriptions, and more.

As ASR quickly approaches human accuracy levels, there will be an explosion of applications taking advantage of ASR technology in their products to make audio and video data more accessible. Already, Speech-to-Text APIs like AssemblyAI are making ASR technology more affordable, accessible, and accurate.

This article aims to answer the question: What is Automatic Speech Recognition (ASR)?, and to provide a comprehensive overview of Automatic Speech Recognition technology, including:

  • What is Automatic Speech Recognition (ASR)? - A Brief History
  • How ASR Works
  • ASR Key Terms and Features
  • Key Applications of ASR
  • Challenges of ASR Today
  • On the Horizon for ASR

What is Automatic Speech Recognition (ASR)? - A Brief History

ASR as we know it extends back to 1952, when Bell Labs created “Audrey,” a digit recognizer. Audrey could only transcribe spoken numbers, but a decade later, researchers improved upon Audrey so that it could transcribe rudimentary spoken words like “hello”.

For most of the past fifteen years, ASR has been powered by classical Machine Learning technologies like Hidden Markov Models. Though once the industry standard, accuracy of these classical models had plateaued in recent years, opening the door for new approaches powered by advanced Deep Learning technology that’s also been behind the progress in other fields such as self-driving cars.

In 2014, Baidu published the paper, Deep Speech: Scaling up end-to-end speech recognition . In this paper, the researchers demonstrated the strength of applying Deep Learning research to power state-of-the-art, accurate speech recognition models. The paper kicked off a renaissance in the field of ASR, popularizing the Deep Learning approach and pushing model accuracy past the plateau and closer to human level.

Not only has accuracy skyrocketed, but access to ASR technology has also improved dramatically. Ten years ago, customers would have to engage in lengthy, expensive enterprise speech recognition software contracts to license ASR technology. Today, developers, startup companies, and Fortune 500s have access to state-of-the-art ASR technology via simple APIs like AssemblyAI’s Speech-to-Text API .

Let’s look more closely at these two dominant approaches to ASR.

Today, there are two main approaches to Automatic Speech Recognition: a traditional hybrid approach and an end-to-end Deep Learning approach.

Traditional Hybrid Approach

The traditional hybrid approach is the legacy approach to Speech Recognition and has dominated the field for the past fifteen years. Many companies still rely on this traditional hybrid approach simply because it’s the way it has always been done--there is more knowledge around how to build a robust model because of the extensive research and training data available, despite plateaus in accuracy.

Here’s how it works:

Traditional HMM and GMM systems

Traditional HMM (Hidden Markov Model) and GMM (Gaussian Mixture Model) systems require forced aligned data. Forced alignment is the process of taking the text transcription of an audio speech segment and determining where in time particular words occur in the speech segment.

This approach combines a lexicon model, an acoustic model, and a language model to make transcription predictions.

Each step is defined in more detail below:

Lexicon Model

The lexicon model describes how words are pronounced phonetically. You usually need a custom phoneme set for each language, handcrafted by expert phoneticians.

Acoustic Model

The acoustic model (AM) models the acoustic patterns of speech. Its job is to predict which sound or phoneme is being spoken at each speech segment from the forced aligned data. The acoustic model is usually an HMM or GMM variant.

Language Model

The language model (LM) models the statistics of language. It learns which sequences of words are most likely to be spoken, and its job is to predict which words will follow on from the current words and with what probability.

Decoding is a process of utilizing the lexicon, acoustic, and language model to produce a transcript.

Downsides of Using the Traditional Hybrid Approach

Though still widely used, the traditional hybrid approach to Speech Recognition does have a few drawbacks. Lower accuracy, as discussed previously, is the biggest. In addition, each model must be trained independently, making them time and labor intensive. Forced aligned data is also difficult to come by and a significant amount of human labor is needed, making them less accessible. Finally, experts are needed to build a custom phonetic set in order to boost the model’s accuracy.

End-to-End Deep Learning Approach

An end-to-end Deep Learning approach is a newer way of thinking about ASR, and how we approach ASR here at AssemblyAI.

How End-to-End Deep Learning Models Work

With an end-to-end system, you can directly map a sequence of input acoustic features into a sequence of words. The data does not need to be force-aligned. Depending on the architecture, a Deep Learning system can be trained to produce accurate transcripts without a lexicon model and language model, although language models can help produce more accurate results.

CTC, LAS, and RNNT

CTC (Connectionist Temporal Classification), LAS (Listen, Attend and Spell), and RNN-T (Recurrent Neural Network Transducer) are popular end-to-end Deep Learning architectures for Speech Recognition. These systems can be trained to produce highly accurate results without needing forced aligned data, lexicon models, or language models.
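
To give a feel for how simple the decoding side of a CTC model can be, here is a minimal sketch of greedy CTC decoding: pick the most likely symbol at each frame, collapse repeats, and drop the blank. The tiny probability matrix is made up purely for illustration; a real model would produce these frame-level distributions from audio features.

```python
import numpy as np

BLANK = 0
VOCAB = {1: "h", 2: "i"}          # toy two-letter vocabulary (illustrative)

def ctc_greedy_decode(frame_probs):
    best = frame_probs.argmax(axis=1)            # most likely symbol per frame
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != BLANK:          # collapse repeats, drop blanks
            out.append(VOCAB[sym])
        prev = sym
    return "".join(out)

# 6 frames x 3 symbols (blank, 'h', 'i'); each row sums to 1.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8],
                  [0.8, 0.1, 0.1]])
print(ctc_greedy_decode(probs))                   # -> "hi"
```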

Advantages of End-to-End Deep Learning Models

End-to-end Deep Learning models are easier to train and require less human labor than a traditional approach. They are also more accurate than the traditional models being used today.

The Deep Learning research community is actively searching for ways to constantly improve these models using the latest research as well, so there’s no concern of accuracy plateaus any time soon--in fact, we’ll see Deep Learning models reach human level accuracy in the next few years.

Acoustic Model: The acoustic model takes in audio waveforms and predicts what words are present in the waveform.

Language Model : The language model can be used to help guide and correct the acoustic model's predictions.

Word Error Rate : The industry standard measurement of how accurate an ASR transcription is, as compared to a human transcription.

Speaker Diarization : Answers the question, who spoke when? Also referred to as speaker labels.

Custom Vocabulary : Also referred to as Word Boost, custom vocabulary boosts accuracy for a list of specific keywords or phrases when transcribing an audio file.

Sentiment Analysis : The sentiment, typically positive, negative, or neutral, of specific speech segments in an audio or video file.

See more models specific to AssemblyAI .

The immense advances in the field of ASR have been matched by a corresponding growth in Speech-to-Text APIs. Companies are using ASR technology for Speech-to-Text applications across a diverse range of industries. Some examples include:

Telephony: Call tracking , cloud phone solutions, and contact centers need accurate transcriptions, as well as innovative analytical features like Conversation Intelligence , call analytics, speaker diarization, and more.

Video Platforms: Real-time and asynchronous video captioning are industry standard. Video editing platforms (and video editors alike) also need content categorization and content moderation to improve accessibility and search.

Media Monitoring : Speech-to-Text APIs can help broadcast TV, podcasts, radio, and more quickly and accurately detect brand and other topic mentions for better advertising.

Virtual Meetings: Meeting platforms like Zoom, Google Meet, WebEx, and more need accurate transcriptions and the ability to analyze this content to drive key insights and action.

Choosing a Speech-to-Text API

With more APIs on the market, how do you know which Speech-to-Text API is best for your application ?

Key considerations to keep in mind include:

  • How accurate the API is
  • What additional models are offered
  • What kind of support you can expect
  • Pricing and documentation transparency
  • Data security
  • Company innovation

What Can I Build with Automatic Speech Recognition?

Automatic Speech Recognition models serve as a key component of any AI stack for companies that need to process and analyze spoken data.

For example, a Contact Center as a Service company is using highly accurate ASR to power smart transcription and speed up QA for its customers.

A call tracking company doubled its Conversational Intelligence customers by integrating AI-powered ASR into its platform and building powerful Generative AI products on top of the transcription data.

A qualitative data analysis platform added AI transcription to build a suite of AI-powered tools and features that resulted in 60% less time analyzing research data for its customers.

One of the main challenges of ASR today is the continual push toward human accuracy levels. While both ASR approaches--traditional hybrid and end-to-end Deep Learning--are significantly more accurate than ever before, neither can claim 100% human accuracy. This is because there is so much nuance in the way we speak, from dialects to slang to pitch. Even the best Deep Learning models can’t be trained to cover this long tail of edge cases without significant effort.

Some think they can solve this accuracy problem with custom Speech-to-Text models . However, unless you have a very specific use case, like children’s speech, custom models are actually less accurate, harder to train, and more expensive in practice than a good end-to-end Deep Learning model.

Another top concern is Speech-to-Text privacy for APIs . Too many large ASR companies use customer data to train models without explicit permission, raising serious concerns over data privacy. Continual data storage in the cloud also raises concerns over potential security breaches, especially if raw audio or video files or transcription text contains Personally Identifiable Information.

As the field of ASR continues to grow, we can expect to see greater integration of Speech-to-Text technology into our everyday lives, as well as more widespread industry applications.

We’re already seeing advancements in ASR and related AI fields taking place at an accelerated rate, such as OpenAI’s ChatGPT, HuggingFace spaces and ML apps , and AssemblyAI's Conformer-2, a state-of-the-art speech recognition model , trained on 1.1M hours of audio data.

In regards to model building, we also expect to see a shift to a self-supervised learning system to solve some of the challenges with accuracy discussed above.

End-to-end Deep Learning models are data hungry. Our Conformer-2 model at AssemblyAI, for example, is trained on 1.1 million hours of raw audio and video training data for industry-best accuracy levels. However, obtaining human transcriptions for this same training data would be almost impossible given the time constraints associated with human processing speeds.

This is where self-supervised deep learning systems can help. Essentially, this is a way to get an abundance of unlabeled data and build a foundational model on top of it. Then, since we have statistical knowledge of the data, we can fine-tune it on downstream tasks with a smaller amount of data, making it a more accessible approach to model building. This is an exciting possibility with profound implications for the field.

If this transition occurs, expect ASR models to become even more accurate and affordable, making their use and acceptance more widespread.

Want to try ASR for free?

Play around with AssemblyAI's ASR and AI models in our no-code playground.

Speech Recognition in AI

Speech recognition is one technique that has advanced significantly in the field of artificial intelligence (AI) over the past few years. AI-based speech recognition has made it possible for computers to understand and recognize human speech, enabling frictionless interaction between humans and machines. Several sectors have been transformed by this technology, which also has the potential to have a big impact in the future.

Introduction

One of the most basic forms of human communication is speech. It serves as our main form of thought, emotion, and idea expression. The capacity of machines to analyze and comprehend human speech has grown in significance as technology develops. AI research in the area of speech recognition aims to make it possible for machines to understand and recognize human speech, enabling more efficient and natural communication.

Today, speech recognition in AI has numerous applications across various industries, from healthcare to telecommunications to media and marketing. The ability to recognize and interpret human speech has opened up new possibilities for machine-human interaction, enabling machines to perform tasks that were previously only possible through manual input. As technology continues to advance, we can expect to see even more exciting applications in the future.

What is Speech Recognition in AI?

Speech recognition is the process of identifying the words in a human voice and converting them into a machine-readable format. Typically, businesses create these programs and integrate them into various hardware devices to identify speech. When the program hears your voice or receives your command, it responds appropriately.

Numerous businesses create software that recognizes speech using cutting-edge technologies like artificial intelligence, machine learning, and neural networks. Technologies like Siri, Amazon Alexa, Google Assistant, and Cortana have changed the way individuals use hardware and electronic devices, including smartphones, home security devices, cars, and more.

Remember that voice recognition and speech recognition are not the same. Speech recognition takes an audio recording of a speaker, recognizes the words in the audio, and converts the words into text. Voice recognition, in contrast, is concerned with identifying who is speaking rather than what is being said.

How does Speech Recognition in AI Work?

  • Recording: The voice recorder that is built into the gadget is used to carry out the first stage. The user's voice is kept as an audio signal after being recorded.
  • Sampling: As you are aware, computers and other electronic gadgets use data in their discrete form. By basic physics, it is known that a sound wave is continuous. Therefore, for the system to understand and process it, it is converted to discrete values. This conversion from continuous to discrete is done at a particular frequency.
  • Transforming to Frequency Domain: The audio signal's time domain is changed to its frequency domain in this stage. This stage is very important because the frequency domain may be used to examine a lot of audio information. Time domain refers to the analysis of mathematical functions, physical signals, or time series of economic or environmental data, concerning time. Similarly, the frequency domain refers to the analysis of mathematical functions or signals concerning frequency, rather than time.

  • Information Extraction from Audio: This stage is the foundation of every voice recognition system. Here the audio is transformed into a vector format that can be used for recognition; many extraction methods, such as PLP and MFCC, are applied for this conversion.
  • Recognition of Extracted Information: The idea of pattern matching is applied in this step. Recognition is performed by taking the extracted data and comparing it to pre-defined reference data; pattern matching is used to accomplish this comparison. One of the most popular pieces of software for this is the Google Speech API. A minimal sketch of this template-matching step is shown below.
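
Here is a minimal, hypothetical sketch of that template-matching idea using dynamic time warping (DTW) and a nearest-neighbour decision: the input feature sequence is compared against stored word templates and the closest one wins. The 1-D 'feature' sequences below stand in for real MFCC or PLP vectors and are invented for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (frames x feature-dim arrays); a classic template-matching measure."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])      # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # insertion
                                 cost[i, j - 1],          # deletion
                                 cost[i - 1, j - 1])      # match
    return cost[n, m]

def recognize(features, templates):
    """Nearest-neighbour pattern matching: return the label of the stored
    template whose DTW distance to the input features is smallest."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

# Toy 1-D 'feature' sequences standing in for real MFCC/PLP vectors (made up).
templates = {"one": np.array([[0.1], [0.5], [0.9], [0.5]]),
             "two": np.array([[0.9], [0.4], [0.1], [0.4]])}
utterance = np.array([[0.1], [0.4], [0.6], [0.9], [0.5]])
print(recognize(utterance, templates))            # -> "one"
```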

Speech Recognition AI and Natural Language Processing

Speech recognition AI and natural language processing (NLP) are two closely related fields that have enabled machines to understand and interpret human language. While speech recognition AI focuses on the conversion of spoken words into digital text or commands, NLP encompasses a broader range of applications, including language translation, sentiment analysis, and text summarization.

One of the primary goals of NLP is to enable machines to understand and interpret human language in a way that is similar to how humans understand language. This involves not only recognizing individual words but also understanding the context and meaning behind those words. For example, the phrase "I saw a bat" could be interpreted in different ways depending on the context. It could refer to the animal, or it could refer to a piece of sporting equipment.

natural language processing

Speech recognition AI is a subset of NLP that focuses specifically on the conversion of spoken words into digital text or commands. To accomplish this, speech recognition AI systems use complex algorithms to analyze and interpret speech patterns, mapping them to phonetic units and creating statistical models of speech sounds.

Some of the techniques used in AI for speech recognition are:

  • Hidden Markov Models (HMMs): HMMs are statistical models that are widely used in speech recognition AI. HMMs work by modelling the probability distribution of speech sounds, and then using these models to match input speech to the most likely sequence of sounds.
  • Deep Neural Networks (DNNs): DNNs are a type of machine learning model that is used extensively in speech recognition AI. DNNs work by using a hierarchy of layers to model complex relationships between the input speech and the corresponding text output.
  • Convolutional Neural Networks (CNNs): CNNs are a type of machine learning model that is commonly used in image recognition, but have also been applied to speech recognition AI. CNNs work by applying filters to input speech signals to identify relevant features.

Some recent advancements in speech recognition AI include:

  • Transformer-based models: Transformer-based models, such as BERT and GPT, have been highly successful in natural language processing tasks, and are now being applied to speech recognition AI.
  • End-to-end models: End-to-end models are designed to directly map speech signals to text, without the need for intermediate steps. These models have shown promise in improving the accuracy and efficiency of speech recognition AI.
  • Multimodal models: Multimodal models combine speech recognition AI with other modalities, such as vision or touch, to enable more natural and intuitive interactions between humans and machines.
  • Data augmentation: Data augmentation techniques, such as adding background noise or changing the speaking rate, can be used to generate more training data for speech recognition AI models, improving their accuracy and robustness.

Use Cases of Speech Recognition AI

Across a wide range of fields and applications, speech recognition AI is employed as a commercial solution. AI is enabling more natural user interactions with technology and software, with higher data transcription accuracy than ever before, in everything from ATMs to call centres and voice-activated assistants.

  • Call centres: One of the most common applications of speech AI in call centres is speech recognition. Using cloud-based models, this technology transcribes what customers are saying so agents and automated systems can respond appropriately. Speech recognition also allows voice patterns to be used for identification or access authorization, without relying on passwords or other credentials such as fingerprints or eye scans. This helps resolve business problems like lost passwords or compromised security codes.
  • Banking: Banking and financial institutions use speech AI applications to assist consumers with their inquiries. For instance, you can ask your bank for your account balance or the current interest rate on your savings account. Customer support agents can then respond to inquiries more quickly and provide better service, because they no longer need to conduct extensive research or consult data stored in the cloud.
  • Telecommunications: Speech recognition models enable more effective call analysis and management, freeing agents to concentrate on their most valuable activities while still delivering better customer service. Consumers can now communicate with businesses in real time, around the clock, via text messaging or voice transcription services, which improves their overall experience and helps them feel more connected to the firm.
  • Healthcare: Speech-enabled AI is growing in popularity in the healthcare sector. Speech recognition lets clinicians dictate notes and documentation directly into electronic records, reducing the time spent on manual data entry.
  • Media and marketing: Speech recognition and AI are used in tools like dictation software to let users write more in less time. Copywriters and content writers can generally transcribe up to 3,000–4,000 words in as little as 30 minutes. Accuracy is still a consideration: these tools cannot guarantee 100% error-free transcription, but they are very helpful for media and marketing professionals producing their initial drafts.

Challenges in Working with Speech Recognition AI

Working with speech AI presents various difficulties.

Today, accuracy means more than just word-level precision. The required degree of accuracy varies from case to case and depends on several factors, which are frequently tailored to a use case or a specific business need. These include:

  • Background noise
  • Punctuation placement
  • Capitalization
  • Correct formatting
  • Timing of words
  • Domain-specific terminology
  • Speaker identification
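
Even so, word error rate (WER) remains the baseline accuracy metric. Below is a minimal, self-contained sketch of how WER is computed; the reference and hypothesis strings are made-up examples.

    # Word error rate: (substitutions + deletions + insertions) / reference length,
    # computed here with a standard edit-distance dynamic program.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution / match
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("turn on the kitchen lights", "turn the kitten lights"))  # 0.4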

Data Security and Privacy

Concerns regarding data security and privacy have significantly increased over the past year, rising from 5% to 42%. That might be the outcome of more daily interactions occurring online after the coronavirus pandemic caused a surge in remote work.

Voice technology, or any software for that matter, needs to be easy to deploy and integrate. Integration must be simple to perform and secure, regardless of whether a business needs deployment on-premises, in the cloud, or embedded. The process of integrating software can be time-consuming and expensive without the proper assistance or instructions. To circumvent this adoption hurdle, technology vendors must make installations and integrations as simple as feasible.

Language Coverage

There are gaps in the language coverage of several of the top voice technology providers. Most providers cover English, but when organizations wish to deploy speech technology in other languages, the absence of support becomes a hurdle to adoption.

Even when a provider does offer more languages, accuracy problems with accent or dialect identification frequently persist. What happens, for instance, when an American and a British speaker are on the same call? Which accent should the system assume? Universal language packs, which cover a variety of accents, help resolve the issue.

  • For commercial solutions, speech recognition enables computers, programs, and software to understand spoken input and convert it into text.
  • A speech recognition model works by analysing your voice and language with artificial intelligence (AI), understanding the words you are speaking, and then accurately rendering those words as text on a screen.
  • Speech recognition in AI works by converting spoken words into digital signals that can be analyzed and interpreted by machines.
  • This process involves several steps, including signal processing, feature extraction, acoustic modeling, language modeling, and decoding.
  • Speech recognition AI is closely related to natural language processing (NLP). NLP involves the ability of machines to understand and interpret human language, enabling them to perform tasks such as text summarization, sentiment analysis, and language translation.
  • Speech recognition AI is a subset of NLP that focuses on the conversion of spoken words into digital text or commands.
  • Working with speech AI presents various difficulties. For example, both the technology and the cloud platforms it runs on are new and evolving quickly, so it can be difficult to predict how long it will take a company to develop a speech-enabled device.
  • Speech recognition AI has the potential to transform the way we communicate with machines and has numerous applications across various industries.

Q. How does speech recognition work?

A. Speech recognition works by using algorithms to analyze and interpret the acoustic signal produced by human speech, and then convert it into text or other forms of output. Here is a general overview of the process:

  • Pre-processing: The incoming audio signal is first processed to remove noise and enhance the signal quality. This may involve filtering out unwanted frequencies, adjusting the volume levels, or normalizing the audio to a standard format.
  • Feature Extraction: The processed audio signal is then broken down into smaller, more manageable pieces called "features," which represent the frequency content and other characteristics of the speech (see the sketch after this list).
  • Acoustic Modeling: In this step, a statistical model of the speech signal is created using machine learning techniques. This model takes the feature vectors as input and produces a set of probabilities over a predefined set of linguistic units (phonemes, words, or sentences).
  • Language Modeling: Once the acoustic model has been created, a language model is built. This model uses statistical analysis of language to predict the probability of a particular word or sentence based on its context within a larger body of text.
  • Decoding: In this step, the acoustic and language models are combined to determine the most likely sequence of words or sentences that match the input audio signal. This process is called decoding.
  • Post-processing: Finally, the output of the speech recognition system is post-processed to correct errors and improve the quality of the output. This may involve applying language-specific rules to correct grammar, punctuation, or spelling errors.
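
As a hedged illustration of the pre-processing and feature-extraction steps above, the sketch below loads an audio file, normalizes and trims it, and extracts MFCC feature vectors with librosa. The file name and parameter choices are assumptions.

    # Sketch of the first two stages of the pipeline: pre-processing and
    # feature extraction. Acoustic modeling, language modeling, and decoding
    # would consume the resulting feature matrix.
    import numpy as np
    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical recording

    # Pre-processing: peak-normalize the signal and trim leading/trailing silence.
    y = y / np.max(np.abs(y))
    y, _ = librosa.effects.trim(y, top_db=25)

    # Feature extraction: 13 MFCCs per 25 ms window, hopped every 10 ms.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    print(mfcc.shape)   # (13, number_of_frames)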

Q. What is the purpose of speech recognition AI?

A. The purpose of speech recognition AI is to enable computers and other devices to understand and process human speech. This has a wide range of potential applications, including:

  • Voice commands: Speech recognition can be used to enable the hands-free operation of devices, such as smartphones, smart home devices, and virtual assistants. This allows users to control these devices with their voice, which can be particularly useful in situations where they cannot physically interact with the device.
  • Transcription: Speech recognition can be used to automatically transcribe audio recordings into text. This can be particularly useful in industries such as healthcare, legal, and finance, where accurate transcription of audio recordings is necessary for documentation and record-keeping.
  • Accessibility: Speech recognition can be used to enable people with disabilities, such as those who are visually impaired or have mobility issues, to interact with computers and other devices more easily.
  • Translation: Speech recognition can be used to automatically translate spoken language from one language to another. This can be particularly useful in situations where language barriers exist, such as in international business or travel.
  • Customer service: Speech recognition can be used to automate customer service interactions, such as phone-based support or chatbots. This can help reduce wait times and improve the overall customer experience.

Q. What is speech communication in AI?

A. Speech communication in AI refers to the ability of machines to communicate with humans using spoken language. This involves the use of speech recognition, natural language processing (NLP), and speech synthesis technologies to enable machines to understand and produce human language.

Speech communication in AI has become increasingly important in recent years, as more and more devices and applications are being designed to support voice-based interaction. Some examples of speech communication in AI include:

  • Virtual assistants: Virtual assistants like Siri, Alexa, and Google Assistant use speech recognition and NLP to understand user commands and queries, and speech synthesis to respond.
  • Smart home devices: Devices like smart speakers, thermostats, and lights can be controlled using voice commands, enabling hands-free operation.
  • Call centers: Many call centers now use speech recognition and NLP to automate customer service interactions, such as automated phone trees or chatbots.
  • Language translation: Speech recognition and NLP can be used to automatically translate spoken language from one language to another, enabling communication across language barriers.
  • Transcription: Speech recognition can be used to transcribe audio recordings into text, making it easier to search and analyze spoken language.

Speech communication in AI is still a developing field, and many challenges must be overcome, such as dealing with accents, dialects, and variations in speech patterns.
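
A minimal sketch of this kind of spoken interaction, assuming the third-party SpeechRecognition and pyttsx3 packages, a working microphone, and internet access for the Google Web Speech API:

    # Listen for one spoken phrase, transcribe it, and speak a reply back.
    import speech_recognition as sr
    import pyttsx3

    recognizer = sr.Recognizer()
    tts = pyttsx3.init()

    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)   # simple noise calibration
        print("Say something...")
        audio = recognizer.listen(source)

    try:
        text = recognizer.recognize_google(audio)     # cloud-based transcription
        reply = f"You said: {text}"
    except sr.UnknownValueError:
        reply = "Sorry, I did not catch that."

    tts.say(reply)        # speech synthesis for the machine's response
    tts.runAndWait()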

Q. Which type of AI is used in speech recognition?

A. There are different types of AI techniques used in speech recognition, but the most commonly used approach is Deep Learning.

Deep Learning is a type of machine learning that uses artificial neural networks to model and solve complex problems. In speech recognition, the neural network is trained on large datasets of human speech, which allows it to learn patterns and relationships between speech sounds and language.

The specific type of neural network used in speech recognition is often a type of Recurrent Neural Network (RNN) called a Long Short-Term Memory (LSTM) network. LSTMs can model long-term dependencies in sequences of data, making them well-suited for processing speech, which is a sequence of sounds over time.
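
As an illustrative (not production) sketch, the Keras model below stacks LSTM layers over sequences of MFCC frames and predicts a character distribution at every time step. The sequence length, feature dimension, and character set size are assumptions.

    # Minimal LSTM acoustic model sketch: sequences of MFCC frames in,
    # per-frame character probabilities out. A real recognizer would train
    # this with a sequence loss such as CTC.
    import tensorflow as tf

    MAX_FRAMES = 200    # assumed maximum number of frames per utterance
    FEATURE_DIM = 13    # assumed MFCC dimension
    NUM_CHARS = 29      # assumed: a-z, space, apostrophe, blank

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(MAX_FRAMES, FEATURE_DIM)),
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(NUM_CHARS, activation="softmax")),
    ])
    model.summary()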

Other AI techniques used in speech recognition include Hidden Markov Models (HMMs), Support Vector Machines (SVMs), and Gaussian Mixture Models (GMMs).

Q. What are the difficulties in voice recognition AI in artificial intelligence?

A. Despite advances in speech recognition technology, there are still several challenges that must be addressed to improve the accuracy and effectiveness of voice recognition AI. Here are some of the key difficulties in voice recognition AI:

  • Background noise: One of the biggest challenges in speech recognition is dealing with background noise. Ambient noise, such as music, traffic, or other people talking, can interfere with the accuracy of the system, making it difficult to distinguish between speech and noise.
  • Variations in speech: People speak in different accents, and dialects, and with varying levels of clarity. This can make it difficult for speech recognition systems to accurately transcribe spoken language, especially for individuals with non-standard speech patterns.
  • Speaker diarization: In situations where multiple people are speaking, it can be difficult for the system to identify and distinguish between different speakers. This can result in errors in transcription or the misattribution of words to the wrong speaker.
  • Contextual understanding: Speech recognition systems often struggle to understand the context of the spoken language. This can result in errors in transcription or misinterpretation of the intended meaning of the speech.
  • Limited training data: Building accurate speech recognition models requires large amounts of high-quality training data. However, collecting and labeling speech data can be time-consuming and expensive, especially for languages or dialects with limited resources.
  • Privacy concerns: Voice recognition systems often rely on collecting and storing user voice data, which can raise concerns about privacy and security.

Overall, these difficulties demonstrate that speech recognition technology is still a developing field, and there is a need for ongoing research and development to address these challenges and improve the accuracy and effectiveness of voice recognition AI.

Speech Recognition Module Python

Speech recognition, a field at the intersection of linguistics, computer science, and electrical engineering, aims at designing systems capable of recognizing and translating spoken language into text. Python, known for its simplicity and robust libraries, offers several modules to tackle speech recognition tasks effectively. In this article, we’ll explore the essence of speech recognition in Python, including an overview of its key libraries, how they can be implemented, and their practical applications.

Key Python Libraries for Speech Recognition

  • SpeechRecognition: One of the most popular Python libraries for recognizing speech. It provides support for several engines and APIs, such as the Google Web Speech API, Microsoft Bing Voice Recognition, and IBM Speech to Text. It’s known for its ease of use and flexibility, making it a great starting point for beginners and experienced developers alike.
  • PyAudio: Essential for audio input and output in Python, PyAudio provides Python bindings for PortAudio, the cross-platform audio I/O library. It’s often used alongside SpeechRecognition to capture microphone input for real-time speech recognition.
  • DeepSpeech: Developed by Mozilla, DeepSpeech is an open-source, deep learning-based speech recognition system based on Baidu’s Deep Speech research. It’s suitable for developers looking to implement more sophisticated speech recognition features with the power of deep learning.

Implementing Speech Recognition with Python

A basic implementation using the SpeechRecognition library involves several steps:

  • Audio Capture: Capturing audio from the microphone using PyAudio.
  • Audio Processing: Converting the audio signal into data that the SpeechRecognition library can work with.
  • Recognition: Calling the recognize_google() method (or another available recognition method) on a Recognizer object from the SpeechRecognition library to convert the audio data into text.

Here’s a simple example: a minimal sketch using SpeechRecognition together with PyAudio, assuming a working microphone and internet access for the Google Web Speech API.
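
    # Capture speech from the microphone and convert it to text.
    import speech_recognition as sr

    recognizer = sr.Recognizer()

    with sr.Microphone() as source:                    # requires PyAudio
        recognizer.adjust_for_ambient_noise(source)    # reduce background noise
        print("Speak now...")
        audio = recognizer.listen(source)              # audio capture

    try:
        text = recognizer.recognize_google(audio)      # Google Web Speech API
        print("You said:", text)
    except sr.UnknownValueError:
        print("Could not understand the audio.")
    except sr.RequestError as e:
        print("API request failed:", e)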

Practical Applications

Speech recognition has a wide range of applications:

  • Voice-activated Assistants: Creating personal assistants like Siri or Alexa.
  • Accessibility Tools: Helping individuals with disabilities interact with technology.
  • Home Automation: Enabling voice control over smart home devices.
  • Transcription Services: Automatically transcribing meetings, lectures, and interviews.

Challenges and Considerations

While implementing speech recognition, developers might face challenges such as background noise interference, accents, and dialects. It’s crucial to consider these factors and test the application under various conditions. Furthermore, privacy and ethical considerations must be addressed, especially when handling sensitive audio data.

Speech recognition in Python offers a powerful way to build applications that can interact with users in natural language. With the help of libraries like SpeechRecognition, PyAudio, and DeepSpeech, developers can create a range of applications from simple voice commands to complex conversational interfaces. Despite the challenges, the potential for innovative applications is vast, making speech recognition an exciting area of development in Python.

FAQ on Speech Recognition Module in Python

What is the Speech Recognition module in Python?

The Speech Recognition module, often referred to as SpeechRecognition, is a library that allows Python developers to convert spoken language into text by utilizing various speech recognition engines and APIs. It supports multiple services like Google Web Speech API, Microsoft Bing Voice Recognition, IBM Speech to Text, and others.

How can I install the Speech Recognition module?

You can install the Speech Recognition module by running the following command in your terminal or command prompt:

    pip install SpeechRecognition

For capturing audio from the microphone, you might also need to install PyAudio. On most systems, this can be done via pip:

    pip install PyAudio

Do I need an internet connection to use the Speech Recognition module?

Yes, for most of the supported APIs like Google Web Speech, Microsoft Bing Voice Recognition, and IBM Speech to Text, an active internet connection is required. However, if you use the CMU Sphinx engine, you do not need an internet connection as it operates offline.
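
For offline use, here is a hedged sketch with the CMU Sphinx engine; it assumes the pocketsphinx package is installed alongside SpeechRecognition and that a WAV file with the hypothetical name below exists.

    # Offline transcription of an audio file with CMU Sphinx (no internet needed).
    import speech_recognition as sr

    recognizer = sr.Recognizer()

    with sr.AudioFile("recording.wav") as source:      # hypothetical WAV file
        audio = recognizer.record(source)

    try:
        print(recognizer.recognize_sphinx(audio))
    except sr.UnknownValueError:
        print("Sphinx could not understand the audio.")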
