[
  
  
  
  
    {
      "title": "M2: Speech Signal Processing",
      "url": "/speech-recognition-course/M2_Speech_Signal_Processing/",
      "excerpt": "Speech Signal Processing Table of Contents Introduction Feature Extraction for Speech Recognition Mel filtering Logarithmic compression Other considerations Feature Normalization Summary Lab Introduction Speech sound waves propagate through the air and are captured by a microphone which converts the pressure wave into electrical activity which can be captured. The electrical...",
      "content": "Speech Signal ProcessingTable of Contents  Introduction  Feature Extraction for Speech Recognition  Mel filtering  Logarithmic compression  Other considerations  Feature Normalization  Summary  LabIntroductionSpeech sound waves propagate through the air and are captured by a microphone which converts the pressure wave into electrical activity which can be captured. The electrical activity is sampled to create a sequence of waveform samples that describe the signal. Music signals are typically sampled at 44,100 Hz (or 44,100 samples per second). Due to the Nyquist theorem, this means that audio with frequencies of up to 22,050 Hz can be faithfully captured by sampling. Speech signals have less high frequency (only up to 8000 Hz) information so a sampling rate of 16,000 Hz is typically used. Speech over conventional telephone lines and most mobile phones is band-limited to about 3400 Hz, so a sampling rate of 8000 Hz is typically used for telephone speech.A typical waveform is plotted here, for the partial sentence “speech recognition is cool stuff”.Recall from module 1, where the concepts of voiced and unvoiced phonemes were discussed. If we focus on the waveform for the last word “stuff” we can see that the waveform has 3 distinct parts, the initial unvoiced sound ‘st’, the middle voiced vowel sound ‘uh’, and the final unvoiced ‘f’. You can see that the unvoiced parts look noise-like and random which the voiced portion is periodic due to the vibration of the vocal chords.If we zoom further into the voiced vowel segment, you can see the periodic nature more clearly. 
The periodic nature arises from the vibration of the vocal cords. From observing these waveforms, it is apparent that two factors contribute to the characteristics of the waveform: 1) the excitation from the vocal cords that drives the air through the vocal tract and out the mouth, and 2) the shape of the vocal tract itself when making a particular sound. For example, we can see that both the ‘st’ and ‘f’ sounds are noise-like due to the unvoiced excitation but have different shapes because they are different sounds. The ‘uh’ sound is more periodic due to the voiced excitation and has its own shape due to the vocal tract. So, from the same speaker, a different vowel sound would have a similar periodicity but a different overall shape, because the same vocal cords are generating the excitation but the shape of the vocal tract differs when producing a different sound. This speech production process is most commonly modeled in signal processing using a source-filter model. The source is the excitation signal generated by the vocal cords that passes through the vocal tract, modeled as a time-varying linear filter. The source-filter model has many applications in speech recognition, synthesis, analysis, and coding, and there are many ways of estimating the parameters of the source signal and the filter, such as the well-known linear predictive coding (LPC) approach. For speech recognition, the phoneme classification is largely dependent on the vocal tract shape and, therefore, on the filter portion of the source-filter model. The excitation or source signal is largely ignored or discarded. Thus, the feature extraction process for speech recognition is largely designed to capture the time-varying filter shapes over the course of an utterance. Feature extraction for speech recognition. Short-time Fourier Analysis. One thing that is apparent from observing these waveforms is that speech is a non-stationary signal. That means its statistical properties change over time. 
Therefore, in order to properly analyze a speech signal, we need to examine the signal in chunks (also called windows or frames) that are small enough that the speech can be assumed to be stationary within those windows. Thus, we perform the analysis on a series of short, overlapping frames of audio. In speech recognition, we typically use windows of length 0.025 sec (25 ms) with a frameshift of 0.01 sec (10 ms). This corresponds to a frame rate of 100 frames per second. Because we are extracting a chunk from a longer continuous signal, it is important to take care of edge effects by applying a window to the frame of data. Typically, a Hamming window is used, although other windows may also be used. If we let m be the frame index, n the sample index, L the frame size in samples, and N the frameshift in samples, each frame of audio is extracted from the original signal as\\[x_m[n] = w[n] x[m N+n], n=0, 1, \\ldots, L-1\\]where $w[n]$ is the window function. We then transform each frame of data into the frequency domain using a discrete Fourier transform.\\[X_m[k]=\\sum_{n=0}^{L-1}x_m[n]e^{-j 2 \\pi k n / L}\\]Note that all modern software packages have routines for efficiently computing the Fast Fourier Transform (FFT), which is an efficient way of computing the discrete Fourier transform. The Fourier representation $X_m[k]$ is a complex number that represents both the spectral magnitude (absolute amplitude) and phase of each frame and frequency. For feature extraction purposes, we do not use the phase information, so we only consider the magnitude $\\vert X_m[k] \\vert$. A spectrogram is a 2D plot of the log magnitude (or log power) of the result of a short-time Fourier analysis of a speech signal. The horizontal axis shows the frame index (in 10 ms units), and the vertical axis shows frequency from 0 Hz up to the Nyquist frequency, which is one-half of the sampling rate. 
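The framing, windowing, and magnitude-spectrum steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the lab's speech_sigproc.py code; the function name stft_magnitude and its defaults are assumptions.

```python
import numpy as np

def stft_magnitude(x, fs=16000, win_dur=0.025, shift_dur=0.010):
    L = int(win_dur * fs)        # frame length in samples (25 ms -> 400)
    N = int(shift_dur * fs)      # frameshift in samples (10 ms -> 160)
    w = np.hamming(L)            # Hamming window to reduce edge effects
    n_frames = 1 + (len(x) - L) // N
    frames = np.stack([w * x[m * N : m * N + L] for m in range(n_frames)])
    # FFT of each windowed frame; keep only the magnitude, discard the phase
    return np.abs(np.fft.rfft(frames, axis=1))

x = np.random.randn(16000)       # one second of dummy "audio"
mag = stft_magnitude(x)
print(mag.shape)                 # (98, 201): 98 frames, L//2 + 1 frequency bins
```

At a 16 kHz sampling rate this yields the 100 frames-per-second rate described above (minus the edge frames that do not fit a full window).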
For example, the spectrogram of the original waveform “speech recognition is cool stuff” is shown here. In the spectrogram, high-energy regions are shown in orange and red. Mel filtering. From the spectrogram, you can see high-energy regions at the high frequencies (upper portion of the figure), which correspond roughly to unvoiced consonants, and high-energy regions at the lower frequencies, which correspond roughly to voiced vowels. You’ll also notice the horizontal lines in the voiced regions, which signify the harmonic structure of voiced speech. To remove variability in the spectrogram caused by the harmonic structure in the voiced regions and the random noise in the unvoiced regions, we perform a spectral smoothing operation on the magnitude spectrum. We apply a filterbank motivated by the processing done by the auditory system. This filterbank applies an approximately logarithmic scale to the frequency axis. That is, the filters become wider and farther apart as frequency increases. The most common filterbank used for feature extraction is known as the mel filterbank. A mel filterbank of $40$ filters is shown here. Each filter will average the power spectrum across a different frequency range. Observe that the filters are narrow and closely spaced on the left side of the figure and wider and farther apart on the right side of the figure. It is typical to represent the mel filterbank as a matrix, where each row corresponds to one filter in the filterbank. Thus, P-dimensional mel filterbank coefficients can be computed from the magnitude spectrum as\\[X_{\\tt{mel}}[p] = \\sum_k M[p,k] \\left|X_m[k]\\right|,      p = 0, 1, \\ldots, P-1\\]A mel filterbank of size $40$ is typical, though state-of-the-art systems have been built with fewer or more filters. Fewer filters result in more smoothing; more filters result in less. Logarithmic compression. The last step of the feature extraction process is to apply a logarithm operation. 
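A mel filterbank matrix like the one described above can be constructed and applied as follows. This is a simplified sketch of the standard triangular-filter construction; the function names, FFT size, and the bin mapping are illustrative assumptions, not the lab's provided code.

```python
import numpy as np

def hz_to_mel(f):
    # mel scale: approximately linear below 1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(P=40, n_fft=512, fs=16000):
    # P + 2 points equally spaced on the mel scale, mapped back to Hz,
    # then to FFT bin indices; each row of M is one triangular filter
    mel_pts = np.linspace(0.0, hz_to_mel(fs / 2.0), P + 2)
    bins = np.floor(mel_to_hz(mel_pts) / (fs / 2.0) * (n_fft // 2)).astype(int)
    M = np.zeros((P, n_fft // 2 + 1))
    for p in range(P):
        lo, center, hi = bins[p], bins[p + 1], bins[p + 2]
        M[p, lo:center] = np.linspace(0.0, 1.0, center - lo, endpoint=False)
        M[p, center:hi] = np.linspace(1.0, 0.0, hi - center, endpoint=False)
    return M

M = mel_filterbank()
mag = np.abs(np.fft.rfft(np.random.randn(512)))  # one frame's magnitude spectrum
mel_coeffs = M @ mag                             # P-dimensional mel coefficients
print(mel_coeffs.shape)                          # (40,)
```

Note how the matrix-vector product `M @ mag` implements the summation over k in the equation above.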
This helps compress the dynamic range of the signals and also closely models a nonlinear compression operation that occurs in the auditory system. We refer to the output of this logarithm operation as “filterbank” coefficients. The spectrogram-like view of the filterbank coefficients for the original waveform is shown here for a 40-dimensional filterbank. Compared to the original spectrogram, the filterbank coefficients are a much smoother version along the vertical (frequency) axis, where both the high-frequency noise variability and the pitch/harmonic structure have been removed. Other considerations. There are other pre-processing steps that can be applied prior to feature extraction. These include: Dithering: adding a very small amount of noise to the signal to prevent mathematical issues during feature computation (in particular, taking the logarithm of 0). DC-removal: removing any constant offset from the waveform. Pre-emphasis: applying a high-pass filter to the signal prior to feature extraction to counteract the fact that voiced speech at the lower frequencies typically has much higher energy than unvoiced speech at the high frequencies. Pre-emphasis is performed with a simple linear filter,\\[y[n] = x[n] - \\alpha x[n-1]\\]where a value of $\\alpha=0.97$ is commonly used. Feature normalization. It is possible that the communication channel will introduce some bias (constant filtering) on the captured speech signal. For example, a microphone may not have a flat frequency response. In addition, variations in signal gain can cause differences in the computed filterbank coefficients even though the underlying signals represent the same speech. 
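The pre-emphasis filter above is one line of array arithmetic; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged
    x = np.asarray(x, dtype=float)
    y = np.copy(x)
    y[1:] = x[1:] - alpha * x[:-1]
    return y

# A constant signal is flattened to roughly 0.03 after the first sample,
# showing how the filter suppresses low-frequency (slowly varying) content.
y = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```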
These channel effects can be modeled as a convolution in time, which is equivalent to element-wise multiplication in the frequency-domain representation of the signal. Thus, we can model the channel effects as a constant filter,\\[X_{t,{\\tt obs}}[k] = H[k] X_t[k]\\]and the magnitude of the observation as\\[\\left|X_{t,{\\tt obs}}[k]\\right| = \\left|H[k]\\right|\\left|X_t[k]\\right|.\\]If we take the log of both sides and compute the mean over all frames in the utterance, we have\\[\\begin{align*}\\mu_{\\tt obs} &amp;=\\frac{1}{T}\\sum_t \\log\\left(\\left|X_{t,{\\tt obs}}[k]\\right|\\right) \\\\\\\\&amp;=\\frac{1}{T}\\sum_t \\log\\left(\\left|H[k]\\right|\\left|X_t[k]\\right|\\right) \\\\\\\\&amp;=\\frac{1}{T}\\sum_t \\log\\left(\\left|H[k]\\right|\\right)+\\frac{1}{T}\\sum_t \\log\\left(\\left|X_t[k]\\right|\\right)\\end{align*}\\]Now, if we assume that the filter is constant over time and the log magnitude of the underlying speech signal has zero mean, this can be simplified to:\\[\\mu_{\\tt obs}=\\log\\left(\\left|H[k]\\right|\\right)\\]Thus, if we compute the mean of the log magnitude of the observed utterance and subtract it from every frame in the utterance, we’ll remove any constant channel effects from the signal. For convenience, we perform this normalization on the filterbank features directly after the log operation. Below is a spectrogram of the previous filterbank coefficients after mean normalization. Summary. To compute features for speech recognition from a speech signal, we are interested in extracting information about the time-varying spectral content that corresponds to the different underlying shapes of the vocal tract. These are modeled by the filter in the common source-filter model. 
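Per-utterance mean normalization of the log filterbank features is then a one-line operation; a minimal sketch, assuming a frames-by-coefficients array (names are illustrative):

```python
import numpy as np

def mean_normalize(log_fbank):
    # Subtract the per-coefficient mean over all frames of the utterance,
    # removing any constant (channel) offset in the log domain.
    return log_fbank - log_fbank.mean(axis=0, keepdims=True)

feats = np.random.randn(100, 40) + 5.0   # dummy log filterbank features with a bias
normed = mean_normalize(feats)
print(np.allclose(normed.mean(axis=0), 0.0))  # True: the constant bias is removed
```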
The steps for computing the features of an utterance can be summarized as: Pre-process the signal, including pre-emphasis and dithering. Segment the signal into a series of overlapping frames, typically 25 ms frames with a 10 ms frameshift. For each frame: apply a Hamming window function to the signal; compute the Fourier transform using the FFT operation; compute the magnitude of the spectrum; apply the mel filterbank; apply the log operation. If channel compensation is desired, apply mean normalization to the frames of filterbank coefficients. Lab: Feature extraction for speech recognition. Required files: M2_Wav2Feat_Single.py, M2_Wav2Feat_Batch.py, speech_sigproc.py, htk_featio.py. Instructions: In this lab, you will write the core functions necessary to perform feature extraction on audio waveforms. Your program will convert an audio file to a sequence of log mel frequency filterbank (“FBANK”) coefficients. The basic steps in feature extraction are: Pre-emphasis of the waveform. Dividing the signal into overlapping segments or frames. For each frame of audio: (a) windowing the frame; (b) computing the magnitude spectrum of the frame; (c) applying the mel filterbank to the spectrum to create mel filterbank coefficients; (d) applying a logarithm operation to the mel filterbank coefficients. In the lab, you will be supplied with a Python file called speech_sigproc.py. This file contains a partially completed Python class called FrontEnd that performs feature extraction using methods that perform the steps listed above. The methods for dividing the signal into frames (step 2) will be provided for you, as will the code for generating the coefficients of the mel filterbank that is used in step 3c. You are responsible for filling in the code in all the remaining methods. There are two top-level Python scripts that call this class. The first is called M2_Wav2Feat_Single.py. 
This script reads a single pre-specified audio file, computes the features, and writes them to a feature file in HTK format. In the first part of this lab, you are to complete the missing code in the FrontEnd class and then modify M2_Wav2Feat_Single.py to plot the following items: the waveform, the mel frequency filterbank, and the log mel filterbank coefficients. You can compare your figures to the figures below. Once the code is verified to be working, the feature extraction program should be used to create feature vector files for the training, development, and test sets. This will be done using M2_Wav2Feat_Batch.py. This program takes a command line argument --set (or -s) which takes as a value either train, dev, or test. For example:$ python M2_Wav2Feat_Batch.py --set trainThis program will use the code you write in the FrontEnd class to compute features for all the files in the LibriSpeech corpus. You need to call this program 3 times, once each for the train, dev, and test sets. When the training set features are computed (--set train), the code will also generate the global mean and precision (inverse standard deviation) of the features in the training set. These quantities will be stored in two ASCII files in the am directory and used during acoustic model training in the next module. Here are the outputs you should get from plotting:",
      "categories": [],
      "tags": [],
      "date": ""
    },
  
    {
      "title": "M1: Introduction",
      "url": "/speech-recognition-course/M1_Introduction/",
      "excerpt": "Module 1: Introduction Table of Contents Phonetics Words and Syntax Syllables and words Syntax and Semantics Measuring Performance Significance Testing Real-time Factor The Fundamental Equation Lab Developing and understanding Automatic Speech Recognition systems is an interdisciplinary activity, taking expertise in linguistics, computer science, and electrical engineering. This course will focus...",
      "content": "Module 1: IntroductionTable of Contents  Phonetics  Words and Syntax  Syllables and words  Syntax and Semantics  Measuring Performance  Significance Testing  Real-time Factor  The Fundamental Equation  LabDeveloping and understanding Automatic Speech Recognition systems is an interdisciplinary activity, taking expertise in linguistics, computer science, and electrical engineering.This course will focus on the structure of American English speech. Other languages may differ in more or less significant ways, from the use of tone to convey meaning to the sets of meaningful distinctions in the sound inventory of the language.PhoneticsPhonetics is the part of linguistics that focuses on the study of the sounds produced by human speech. It encompasses their production (through the human vocal apparatus), their acoustic properties, and perception. There are three basic branches of phonetics, all of which are relevant to automatic speech recognition.      Articulatory phonetics focuses on the production of speech sounds via the vocal tract and various articulators        Acoustic phonetics focuses on the transmission of speech sounds from a speaker to a listener        Auditory phonetics focuses on the reception and perception of speech sounds by the listener.  The atomic unit of a speech sound is called a phoneme. Words are comprised of one or more phonemes in sequence. The acoustic realization of a phoneme is called a phone. Below is a table of phonemes of U.S. English and common realizations.One major way to categorize phonemes is by dividing them into vowels and consonants.Vowels can be distinguished by two attributes. First, they are voiced sounds, meaning that the airflow from the vocal cords into the mouth cavity is created by the vibration of the vocal cords at a particular fundamental frequency (or pitch). Second, the tongue does not in any way form a constriction of airflow during production. 
The placement of the tongue, lips, and jaw distinguishes different vowel sounds from each other. These different positions form different resonances inside the vocal tract called formants, and the resonant frequencies of these formants characterize the different vowel sounds. Consonants are characterized by significant constriction of airflow in the airway or mouth. Like vowels, some consonants are voiced, while others are unvoiced. Unvoiced phonemes do not engage the vocal cords and, therefore, do not have a fundamental frequency or pitch. Some consonant phonemes occur in pairs that differ only in whether they are voiced or unvoiced but are otherwise identical. For example, the sounds /b/ and /p/ have identical articulatory characteristics (your mouth, tongue, and jaw are in the same position for both), but the former is voiced, and the latter is unvoiced. The sounds /d/ and /t/ are another such pair. One important aspect of phonemes is that their realization can change depending on the surrounding phones. This is called phonetic context, and it is caused by a phenomenon called coarticulation: the process of producing these sounds in succession changes their characteristics. Modified versions of a phoneme caused by coarticulation are called allophones. All state-of-the-art speech recognition systems use this context-dependent nature of phonemes to create a detailed model of phonemes in their various phonetic contexts. Words and Syntax. Syllables and words. A syllable is a sequence of speech sounds composed of a nucleus phone and optional initial and final phones. The nucleus is typically a vowel or syllabic consonant and is the voiced sound that can be shouted or sung. For example, the English word “bottle” contains two syllables. The first syllable has three phones, which are “b aa t” in the Arpabet phonetic transcription code. The “aa” is the nucleus, the “b” is a voiced consonant initial phone, and the “t” is an unvoiced consonant final phone. 
The second syllable consists only of the syllabic consonant “l.” A word can also be composed of a single syllable, which itself may be a single phoneme, e.g., “Eye,” “uh,” or “eau.” In speech recognition, syllable units are rarely considered, and words are commonly tokenized into constituent phonemes for modeling. Syntax and Semantics. Syntax describes how sentences can be put together given words and rules that define allowable grammatical constructs. Semantics generally refers to the way that meaning is attributed to the words or phrases in a sentence. Both syntax and semantics are a major part of natural language processing, but neither plays a major role in speech recognition. Measuring Performance. When we build and experiment with speech recognition systems, it is obviously very important to measure performance. Because speech recognition is a sequence classification task (in contrast to image labeling, where samples are independent), we must consider the entire sequence when we measure error. The most common metric for speech recognition accuracy is the Word Error Rate (WER). There are three types of errors a system can make: a substitution, where one word is incorrectly recognized as a different word; a deletion, where no word is hypothesized when the reference transcription has one; and an insertion, where the hypothesized transcription inserts extra words not present in the reference. The overall WER can be computed as\\[WER = \\frac{N_{\\text{sub}} + N_{\\text{ins}} + N_{\\text{del}}}{N_{\\text{ref}}}\\]where $N_{\\text{sub}}$, $N_{\\text{ins}}$, and $N_{\\text{del}}$ are the number of substitutions, insertions, and deletions, respectively, and $N_{\\text{ref}}$ is the number of words in the reference transcription. The WER is computed using a string edit distance between the reference transcription and the hypothesized transcription. String edit distance can be efficiently computed using dynamic programming. 
Because string edit distance can be unreliable over a long body of text, we typically accumulate the error counts on a sentence-by-sentence basis, and these counts are aggregated over all sentences in the test set to compute the overall WER. In the example below, the hypothesis “how never a little later he had comfortable chat” is measured against the reference “however a little later we had a comfortable chat” to reveal two substitution errors, one insertion error, and one deletion error. The word-by-word alignment (reference / hypothesis) is: however / how (substitution); (none) / never (insertion); a / a; little / little; later / later; we / he (substitution); had / had; a / (none) (deletion); comfortable / comfortable; chat / chat. The WER for this example is 4/9 = 0.4444 or 44.44%. It can be calculated as follows:\\[WER = \\frac{2 + 1 + 1}{9} = 0.4444\\]In some cases, the cost of the three different types of errors may not be equivalent. In this case, the edit distance computation can be adjusted accordingly. Sentence error rate (SER) is a less commonly used evaluation metric that treats each sentence as a single sample that is either correct or incorrect. If any word in the sentence is hypothesized incorrectly, the sentence is judged incorrect. SER is computed simply as the proportion of incorrect sentences to total sentences. Significance testing. Statistical significance testing involves measuring to what degree the difference between two experiments (or algorithms) can be attributed to actual differences in the two algorithms or is merely the result of inherent variability in the data, experimental setup, or other factors. The idea of statistical significance underlies all pattern classification tasks. 
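The edit-distance computation above can be sketched with standard dynamic programming. This is an illustrative sketch, not the lab's wer.py; it returns only the total edit distance, not the separate substitution/insertion/deletion counts the lab requires.

```python
def edit_distance(ref, hyp):
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        d[i][0] = i                       # i deletions
    for j in range(1, H + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return d[R][H]

ref = "however a little later we had a comfortable chat".split()
hyp = "how never a little later he had comfortable chat".split()
wer = edit_distance(ref, hyp) / len(ref)  # 4 errors over 9 reference words
print(round(wer, 4))                      # 0.4444
```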
However, the way statistical significance is measured is task-dependent. At the center of most approaches is the notion of a “hypothesis test” in which there is a “null” hypothesis. The question then becomes, with what confidence can you argue that the null hypothesis can be rejected? For speech recognition, the most commonly used measure to compare two experiments is called the Matched Pairs Sentence-Segment Word Error (MAPSSWE) Test, commonly shortened to just the Matched Pairs Test. It was suggested for speech recognition evaluations by Gillick et al. In this approach, the test set is divided into segments with the assumption that errors in one segment are statistically independent of those in another. This assumption is well-matched with typical speech recognition experiments where many test utterances are run through the recognizer one by one. Given the utterance-level error counts from the WER computation described above, constructing a matched pairs test is straightforward. More details of the algorithm can be found in Pallett et al. Real-time Factor. Besides accuracy, there may be computational requirements that impact performance, such as processing speed or latency. Decoding speed is usually measured with respect to a real-time factor (RTF). An RTF of 1.0 means that the system processes the data in real-time, e.g., it takes ten seconds to process ten seconds of audio.\\[RTF = \\frac{\\text{Total processing time}}{\\text{Total audio time}}\\]Factors above 1.0 indicate that the system needs more time to process the data. For some applications, this may be acceptable. For instance, when creating a transcription of a meeting or lecture, it may be more important to take more time and produce accurate transcriptions than to get the transcriptions quickly. When the RTF is below 1.0, the system processes the data more quickly than it arrives. This can be useful when more than one system runs on the same machine. 
In that case, multithreading can effectively use one machine to process multiple audio sources in parallel. An RTF below 1.0 also indicates that the system can “catch up” to real-time in online streaming applications. For instance, when performing a remote voice query on the phone, network congestion can cause gaps and delays in receiving the audio at the server. If the ASR system can process data faster than real-time, it can catch up after the data arrives, hiding the latency behind the speed of the recognition system. In general, any ASR system can be tuned to trade off speed for accuracy. But there is a limit. For a given model and test set, the speed-accuracy graph has an asymptote that is impossible to cross, even with unlimited computing power. The remaining errors can be entirely ascribed to modeling errors. Once the search finds the best result according to the model, further processing will not improve the accuracy. The Fundamental Equation. Speech recognition is cast as a statistical optimization problem. Specifically, for a given sequence of observations $\\mathbf{O} = \\lbrace O_{1},\\ldots,O_{N} \\rbrace$, we seek the most likely word sequence $\\mathbf{W} =\\lbrace W_{1},\\ldots,W_{M} \\rbrace$. That is, we are looking for the word sequence which maximizes the posterior probability $P(\\mathbf{W}\\vert\\mathbf{O})$. Mathematically, this can be expressed as:\\[\\hat{W} = \\mathrm{arg\\,max}_{W}P(W|O)\\]To solve this expression, we employ Bayes' rule,\\[P\\left( W \\middle| O \\right) = \\frac{P\\left( O \\middle| W \\right)P\\left( W \\right)}{P(O)}.\\]Because the marginal probability of the observation $P(O)$ does not depend on the word sequence, this term can be ignored in the maximization. Thus, we can rewrite this expression as\\[\\hat{W} = \\mathrm{arg\\,max}_{W}P\\left( O \\middle| W \\right)P(W)\\]This is known as the fundamental equation of speech recognition. 
The speech recognition problem can be cast as a search over this joint model for the best word sequence. The equation has a component $P(O\\vert W)$, known as the acoustic model, that describes the distribution over acoustic observations $O$ given the word sequence $W$. The acoustic model is responsible for modeling how sequences of words are converted into acoustic realizations and then into the acoustic observations presented to the ASR system. Acoustics and acoustic modeling are covered in Modules 2 and 3 of this course. The equation has a component $P(W)$, called the language model, based solely on the word sequence $W$. The language model assigns a probability to every possible word sequence. It is trained on sequences of words that are expected to be like those the final system will encounter in everyday use. A language model trained on English text will probably assign a high value to the word sequence “I like turtles” and a low value to “Turtles sing table.” The language model steers the search towards word sequences that follow the same patterns as in the training data. Language models can also be seen in purely text-based applications, such as the autocomplete feature in modern web browsers. Module 4 of this course is dedicated to language modeling. For a variety of reasons, building a speech recognition engine is much more complicated than this simple equation implies. In this course, we will describe how these models are constructed and used together in modern speech recognition systems. Lab for Module 1: Create a speech recognition scoring program. Required files: wer.py, M1_Score.py. Instructions: In this lab, you will write a program in Python to compute the word error rate (WER) and sentence error rate (SER) for a test corpus. 
A set of hypothesized transcriptions from a speech recognition system and a set of reference transcriptions with the correct word sequences will be provided for you. This lab assumes the transcriptions are in a format called the “trn” format (TRN files), created by NIST. The format is as follows: the transcription is output on a single line, followed by a single space, followed by the root name of the file, without any extension, in parentheses. For example, the audio file “tongue_twister.wav” would have the transcription: sally sells seashells by the seashore (tongue_twister). Notice that the transcription does not have any punctuation or capitalization, nor any other formatting (e.g., converting “doctor” to “dr.” or “eight” to “8”). This formatting is called Inverse Text Normalization and is not part of this course. The Python code M1_Score.py and wer.py contain the scaffolding for the first lab. A main function parses the command line arguments, and string_edit_distance() computes the string edit distance between two strings. Add code to read the TRN files for the hypothesis and reference transcriptions, compute the edit distance on each, and aggregate the error counts. Your code should report: the total number of reference sentences in the test set; the number of sentences with an error; the sentence error rate as a percentage; the total number of reference words; the total number of word errors; the total number of word substitutions, insertions, and deletions; and the percentage of total errors (WER) and the percentages of substitutions, insertions, and deletions. The specific format for outputting this information is up to you. Note that you should not assume that the order of sentences in the reference and hypothesis TRN files is consistent. You should use the utterance name as the key between the two transcriptions. When you believe your code is working, use it to process hyp.trn and ref.trn in the misc directory and compare your answers to the solution.",
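Parsing one line of the TRN format described above can be sketched as follows; the function name and the regular expression are illustrative assumptions about the format, not the lab's provided code.

```python
import re

def parse_trn_line(line):
    # "words of the transcription (utterance_id)" -> (utterance_id, [words])
    m = re.match(r"^(.*)\((\S+)\)\s*$", line.strip())
    if m is None:
        raise ValueError("not a valid trn line: %r" % line)
    return m.group(2), m.group(1).split()

utt, words = parse_trn_line("sally sells seashells by the seashore (tongue_twister)")
print(utt)    # tongue_twister
print(words)  # ['sally', 'sells', 'seashells', 'by', 'the', 'seashore']
```

Storing the parsed lines in a dict keyed by utterance name handles the fact that the reference and hypothesis files may list sentences in different orders.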
      "categories": [],
      "tags": [],
      "date": ""
    },
  
    {
      "title": "M5: Decoding",
      "url": "/speech-recognition-course/M5_Decoding/",
      "excerpt": "Speech Decoding Previous Table of Contents Overview Weighted Finite State Transducers The Decoding Problem The Grammar, a Finite State Acceptor The HMM State Transducer, a Finite State Transducer Weighted Finite State Transducers and Acceptors The N-gram Grammar Transducer, a More Complex Grammar WFST Graph Composition The Search Quiz Lab Overview...",
      "content": "Speech DecodingPreviousTable of Contents  Overview  Weighted Finite State Transducers  The Decoding Problem  The Grammar, a Finite State Acceptor  The HMM State Transducer, a Finite State Transducer  Weighted Finite State Transducers and Acceptors  The N-gram Grammar Transducer, a More Complex Grammar WFST  Graph Composition  The Search  Quiz  LabOverviewThe acoustic model scores sequences of acoustic model labels. For every time frame of input data, it computes the relative score for every label. Every sequence of labels has an associated score.The language model scores sequences of words. To every valid sequence of words in the vocabulary, it assigns a language model score.The glue that ties the two together is the decoding graph. It is a function that maps valid acoustic label sequences to the corresponding word sequences, together with the language model score.Weighted Finite State TransducersIn this course, we will be using some constructs from the field of weighted finite state transducers (WFST).Finite state automata (FSA) are a compact graph structures that encode sets of strings and are efficient to search.Finite state transducers (FST) are similar, but encode an mapping from one set of strings to another.Either FSA or FST can be weighted, in which case each element of the set (FSA), or each mapping pair (FST) is associated with a numeric weight.In this lesson, we explore how the structure of the speech recognition task can be encoded as WFST.The Grammar, a Finite State AcceptorFor example, assume we wish to build a trivial speech recognitionsystem, and the set of all valid phrases consists of the followingfive-element set. Such a set is efficiently expressed by a finite stateacceptor (FSA).  any thinkingsome thinkinganything kingsomething kingthinkingThe complete vocabulary of the system is six words.  anyanythingkingsomesomethingthinkingAn FSA that describes our phrases is shown below. 
Our phrases are encoded as paths through the graph, starting at the initial state, ending at the final state, and collecting the vocabulary words from the traversed arcs. The figure follows the usual visual conventions for these graphs:

All paths start at the initial state labeled 0, and end at any state with a double circle.
Arcs between states are directional, and labeled with one word from the vocabulary.
Acceptors have a single symbol per arc. A transducer would have two symbols, separated by a colon.
Arc weights are assumed to be zero, unless specified otherwise.

To compile this graph with the OpenFST toolkit, we need a symbol table and a textual description of the graph. The symbol table is a mapping for the vocabulary between human-readable text and machine-friendly integers. The vocabulary for our grammar is covered by the following table. The &lt;eps&gt; symbol is special, and will be described later.

Vocabulary.sym
&lt;eps&gt; 0
any 1
anything 2
king 3
some 4
something 5
thinking 6

The text representation of the graph consists of a series of arc and final state definitions. Arc definitions consist of start state, end state, input label, output label, and an optional weight field. Final state definitions consist of a state number and an optional weight field. Our grammar is represented as follows:

Grammar.tfst
0 1 any any
1 0 thinking thinking
0 2 some some
2 0 thinking thinking
0 3 anything anything
3 0 king king
0 4 something something
4 0 king king
0 0 thinking thinking
0 0

Note that the initial state of the first arc definition is assumed to be the initial state for the entire graph, and that we also define it as the final state with the last entry in the file. 
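As a quick sanity check, the Grammar.tfst arc list above can be simulated directly in plain Python. This is a hedged sketch, independent of the OpenFST toolkit; the `accepts` helper is hypothetical and not part of the course tools.

```python
# Simulate the Grammar.tfst acceptor defined above. State 0 is both the
# initial state and the final state, matching the text description.
ARCS = {
    (0, 'any'): 1,       (1, 'thinking'): 0,
    (0, 'some'): 2,      (2, 'thinking'): 0,
    (0, 'anything'): 3,  (3, 'king'): 0,
    (0, 'something'): 4, (4, 'king'): 0,
    (0, 'thinking'): 0,
}
FINAL_STATES = {0}

def accepts(words):
    # Follow one arc per word; the phrase is valid only if every arc
    # exists and the walk ends in a final state.
    state = 0
    for w in words:
        if (state, w) not in ARCS:
            return False
        state = ARCS[(state, w)]
    return state in FINAL_STATES

print(accepts('any thinking'.split()))       # True
print(accepts('anything thinking'.split()))  # False: no such path
```

The determinization and minimization operations discussed later compact a graph like this without changing the set of phrases such a simulation accepts.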
Also, the convention for finite state acceptors is that the input label and output label of each arc be identical.

To compile the graph, use the fstcompile command:

fstcompile --isymbols=Vocabulary.sym --osymbols=Vocabulary.sym --keep_isymbols --keep_osymbols Grammar.tfst Grammar.fst

The Pronunciation Lexicon, a Finite State Transducer

For some parts of our speech recognition system, we need to associate sequences of one type of symbol with sequences of another. For example, sequences of words imply sequences of phones, and sequences of phones imply sequences of acoustic model labels. Phone pronunciations for our six words are shown in the table below. This mapping from words to phone sequences is called a pronunciation lexicon. A finite state transducer that describes the pronunciation lexicon accepts valid sequences of phones as input, and produces sequences of words as output.

Pronunciation        Word
EH N IY              any
EH N IY TH IH NG     anything
K IH NG              king
S AH M               some
S AH M TH IH NG      something
TH IH NG K IH NG     thinking

One possible way of building such a transducer would be to make a path for every word in the pronunciation lexicon. The first part of this FST might look like this:

0 1 EH any
1 2 N &lt;eps&gt;
2 0 IY &lt;eps&gt;
0 3 EH anything
3 4 N &lt;eps&gt;
4 5 IY &lt;eps&gt;
5 6 TH &lt;eps&gt;
6 7 IH &lt;eps&gt;
7 0 NG &lt;eps&gt;
0 8 K king
8 9 IH &lt;eps&gt;
9 0 NG &lt;eps&gt;
…

Here, we first encounter the epsilon symbol, &lt;eps&gt;. When building the input or output string for a path through the graph, epsilon symbols are ignored. So, the path through states 0, 1, 2, 0 might seem to have the output symbol sequence any, &lt;eps&gt;, &lt;eps&gt;. But, in reality, it consists only of the single symbol “any”.

A close inspection of the lexical FST structure above reveals some redundancy in the graph. 
For instance, one valid phone at the beginning of the input sequence is “EH”, but that is encoded with two separate arcs, going to two separate states. We can eliminate one of these states by accepting EH as an input, but associating it with an &lt;eps&gt; output and delaying the output of either word “any” or “anything”:

0 1 EH &lt;eps&gt;
1 2 N any
2 0 IY &lt;eps&gt;
1 4 N anything
4 5 IY &lt;eps&gt;
5 6 TH &lt;eps&gt;
6 7 IH &lt;eps&gt;
7 0 NG &lt;eps&gt;
0 8 K king
8 9 IH &lt;eps&gt;
9 0 NG &lt;eps&gt;
…

The entry for “any” is encoded in the state path 0, 1, 2, 0, and the entry for “anything” is through the path 0, 1, 4, 5, 6, 7, 0. State 3 is eliminated, as well as one of the redundant arcs. Furthermore, state 1 now has two arcs that share an N input symbol, and the procedure can be repeated. This process, where the graph is compressed by merging arcs that start in the same state and share an input label, is called determinization. It ensures every unique input string prefix maps to a unique state of the graph. Its complementary algorithm, FST minimization, works analogously with the suffixes of the strings. Both determinization and minimization can compact the graph structure without changing the strings or mappings it describes, and it would be difficult to construct a reasonable decoding graph without them.

A compact FST that represents our pronunciation lexicon is shown here. It is a bit more complex than the grammar FSA from the previous lesson. One possible path through this FST is through the state sequence 0, 4, 8, 11, 3, 7, 0. This corresponds with an input string of “EH N IY TH IH NG” and an output string of “&lt;eps&gt; &lt;eps&gt; &lt;eps&gt; anything &lt;eps&gt; &lt;eps&gt;”. 
Because the &lt;eps&gt; symbols can be ignored, it is clear that this path represents one of the entries from the pronunciation lexicon above. A transducer, like this one, may have more than one path that accepts the same input string. If these paths are associated with different output strings, the transducer is said to be non-functional. This is the algebraic equivalent of an expression mapping a single input to two different outputs. In the graph above, there is a path through states 0, 4, 8, 11, 1, 5, 9, 3, 7, 0 that maps “EH N IY TH IH NG K IH NG” to “any thinking”. There is another path, through states 0, 4, 8, 11, 3, 7, 0, 3, 7, 0, that maps the same phone sequence to “anything king”. Because a function should associate a unique output with every input, and this transducer fails that property, it is non-functional. Although non-functional FSTs are valid structures, not all graph algorithms can be applied to them. For instance, standard determinization algorithms cannot be applied to a non-functional FST. In speech recognition, non-functional transducers arise from one of these situations:

Homophones in the pronunciation lexicon.
Sequence homophones, as in the “any thinking” and “anything king” example above.
Acoustic model state sequences that don’t map to unique phone sequences.

The first two cases are addressed by using a technique involving “disambiguation symbols,” and dealing with the third case is beyond the scope of this course. Disambiguation symbols modify each pronunciation in the lexicon to end with an artificial phone. 
With these symbols added, our pronunciation lexicon would become:

Phone Sequence           Word
EH N IY #0               any
EH N IY TH IH NG #0      anything
K IH NG #0               king
S AH M #0                some
S AH M TH IH NG #0       something
TH IH NG K IH NG #0      thinking

This is enough structure to break the sequence homophone example: “EH N IY #0 TH IH NG K IH NG #0” is unambiguously “any thinking,” and “EH N IY TH IH NG #0 K IH NG #0” is “anything king.” For the case where the pronunciation lexicon contains true homophones, each one would receive a unique symbol, starting with #0 and proceeding through #1, #2, #3, and so on. With proper use of disambiguation symbols, any lexical FST can be made functional.

Another common way to address the problem of determinizing a non-functional FST is to transform it into an FSA. This process is known as encoding the FST. To encode an FST into an FSA, the input and output symbols of every arc in the graph are fused into a single symbol. Because the notion of input and output has been eliminated, the result is an FSA that describes a sequence of input/output pairs. Because it is an FSA, it is necessarily determinizable. After determinizing the encoded FSA, the encoding process can be reversed.

The HMM State Transducer, a Finite State Transducer

The HMM state sequence transducer H maps sequences of acoustic model states to sequences of phone labels. As with the pronunciation lexicon, the desired mapping can be described with a table.

Acoustic Label Sequence    Phone
AH_s2 AH_s3 AH_s4          AH
EH_s2 EH_s3 EH_s4          EH
IH_s2 IH_s3 IH_s4          IH
IY_s2 IY_s3 IY_s4          IY
K_s2 K_s3 K_s4             K
N_s2 N_s3 N_s4             N
NG_s2 NG_s3 NG_s4          NG
S_s2 S_s3 S_s4             S
TH_s2 TH_s3 TH_s4          TH
M_s2 M_s3 M_s4             M
#0                         #0

The structure of our model is that each phone is associated with a sequence of three acoustic labels. 
These represent the beginning, middle, and end of the phone’s acoustic realization. Larger acoustic models typically have many more acoustic labels, capturing the way each phone’s acoustic realization changes depending on its neighboring sounds in the sequence. These context dependent models are beyond the scope of this course. Note that the table should include the disambiguation symbols from the lexical transducer. The H transducer should be constructed in such a way that any disambiguation symbols embedded in the phone sequence are mapped to corresponding symbols in the acoustic label sequence.

A visualization of the HMM transducer covering three of the ten phones in our toy example is shown below. Each phone model consists of a loop, where the state names, such as AH_s2, occur on the input side in sets of three, and the corresponding phone names, such as AH, occur on the output side. The other seven phones would have a similar structure, adding one loop per phone.

Weighted Finite State Transducers and Acceptors

A weighted finite state automaton (WFSA) is an FSA that assigns a score to every string that it accepts. A lexical FST might be augmented with weights to encode the relative probability of each word’s pronunciation variants. A grammar FSA can be augmented with weights to encode the relative probability of the word sequences it represents.

The N-gram Grammar Transducer, a More Complex Grammar WFST

An n-gram language model, such as those developed in Module 4, can be approximately expressed as a WFSA that accepts strings of words, interspersed with special non-word tokens that indicate when the language model context experiences a backoff. 
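To make the backoff mechanism concrete, here is a hedged Python sketch of ARPA-style backoff scoring. The n-gram table and weights below are hypothetical, not taken from the course language models.

```python
# ARPA-style backoff scoring: use the longest matching n-gram; otherwise
# pay the context's backoff weight and retry with a shorter context.
# All weights are log10 probabilities, as in ARPA files.
LOGPROB = {
    ('half', 'an', 'hour'): -0.03,  # trigram
    ('an', 'hour'): -1.17,          # bigrams
    ('an', 'old'): -1.27,
}
BACKOFF = {
    ('half', 'an'): -1.12,  # backoff weight of the 'half an' context
}

def score(context, word):
    ngram = context + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not context:
        return -99.0  # floor for words unseen even as unigrams
    # Drop the oldest context word and pay the backoff penalty.
    return BACKOFF.get(context, 0.0) + score(context[1:], word)

print(round(score(('half', 'an'), 'hour'), 2))  # -0.03 (trigram hit)
print(round(score(('half', 'an'), 'old'), 2))   # -2.39 = -1.12 + -1.27
```

In the WFSA encoding, the recursive call corresponds to traversing a backoff arc into a state with a shorter context, and the backoff weight appears as that arc's weight.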
As an example, we will examine a small part of a transducer that encodes the following n-grams:

-0.03195206 half an hour
-1.020315 half an -1.11897
-1.172884 an hour
-1.265547 an old
-1.642946 an instant
-1.698012 an end

The portion of the WFST that these n-grams encode is shown here:

In this graph, it is given that the first three tokens are the start token, followed by “half” and “an,” and the graph computes the weight for the next word. Because “half an hour” is quite a common phrase in English, it has its own path from state 3 to state 4, which encodes the probability of “hour” following the “half an” bigram. If any words followed “hour” here, they would have an “an hour” context. Other words are allowed to follow “half an,” even though there is no trigram for them in the language model definition. Whereas state 3 represents the “half an” context, state 5 represents only a context of “an”. The weight associated with contracting the context in this way is given by the backoff weight for the “half an” entry in the definition. After taking this penalty, the model allows for four words in the context of “an”. You can see that one of these words is “hour”, because “an hour” is also quite common in English.

Graph Composition

The final algorithm we are interested in is FST composition. Just as with algebraic function composition, FST composition merges two FSTs by feeding the output of one into the input of the second. If the first WFST maps string A to string B with weight x, and the second maps string B to string C with weight y, then the composed FST maps string A to string C with weight x+y.

The Decoding Graph

We desire a decoding graph that can unify sequences of acoustic model states and sequences of words. In lessons 2 and 3, we saw how the structure of the speech recognition problem can be encoded in WFSTs:

The grammar G can be an FSA that encodes a flat set of phrases, or a WFSA that assigns weights to the phrases like a language model does.  
The pronunciation lexicon L is a WFST that maps sequences of phones (with disambiguation symbols) to sequences of words.

The HMM transducer H maps from sequences of HMM states (senone labels) to sequences of HMM model names (phones, or triphones).

Not covered in this course, but common to many systems, is a fourth WFST that maps from sequences of triphone models to sequences of phones; it is called the context transducer C. To associate sequences of acoustic model states with sequences of words, the transducers H, C, L, and G should be consecutively applied. This could be accomplished at runtime in the decoder, but if this composition is done offline as a preprocessing step, then the decoder becomes much simpler. When fully composed, the HCLG of our toy example looks like this:

The general practice is to compose from right to left, and to determinize and minimize the graph after each individual composition. In the upcoming lab assignment, we have pre-composed H, C, and L, and all that is left is to create the G graph and compose it with the given HCL WFST. Recall that the graph G has the language model backoff symbols on its input side. The HCL graph passes these symbols through as they occur. As a result, the composed HCLG will have these symbols as input. Recall also that because the L graph contains pronunciation disambiguation symbols, the HCL graph has these on its input side as well. The fully composed HCLG will also have these symbols as input. Because the decoder doesn’t need this information, these symbols are usually replaced with the epsilon symbol, indicating that traversing these arcs doesn’t affect the input string.

The Search

Speech recognition decoding is the process of finding the word sequence that jointly maximizes the language model score and acoustic model score. 
A sequence of acoustic states is assigned an acoustic model score by the acoustic model, and a language model score by the path it describes through the decoding graph. Decoding is a path search through the decoding graph, where the score of a path is the sum of the score given to it by the decoding graph and the score given to it by the acoustic model. Due to the nature of our models, we can use a simple dynamic programming approach to find the shortest path: if the best path passes through state $S$ at time $T$, then it includes the best prefix path ending at time $T$ and state $S$.

A typical frame synchronous beam search proceeds in three stages for each time $t$:

1. Advance each partial hypothesis forward in the graph across arcs that have a non-epsilon input symbol. As a result, all the new partial hypotheses have exactly t input symbols in their path. If two partial hypotheses collide onto the same state, keep only the higher scoring hypothesis.

2. Eliminate any hypotheses that are “out of beam.” This could mean either keeping only the top K hypotheses (where K is the maximum token count), or eliminating any partial hypothesis whose score is more than B worse than the best (where B is the beam width).

3. Advance each of the remaining tokens across arcs with an epsilon input symbol. Again, if two partial hypotheses collide onto the same state, keep only the higher scoring hypothesis.

The important parameters for a pruned search are the beam width B and the maximum token count K. The beam width ensures that partial hypotheses that score far from the current best are abandoned. The maximum token count limits the total amount of work done on every frame. It is possible that the best overall path has a very low score for some time t, and is discarded by the pruning process. In this case, the beam search algorithm will return a sub-optimal path. When this happens, we say that the algorithm has produced a search error. 
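The three-stage search can be sketched in Python. This is a hedged toy implementation: the graph and acoustic costs are hypothetical, lower scores are treated as better, and the backpointers needed to recover the actual word sequence are omitted.

```python
# Frame-synchronous beam search over a toy WFST. Arcs are tuples of
# (src, dst, input_label, weight); input_label None marks an epsilon arc.
# am_scores[t] maps each acoustic label to its cost at frame t.
def beam_search(arcs, am_scores, beam_width, max_tokens, start=0):
    eps_arcs = [a for a in arcs if a[2] is None]
    emit_arcs = [a for a in arcs if a[2] is not None]
    tokens = {start: 0.0}  # state -> best path cost so far

    def merge(dst, cost, table):
        # Hypotheses colliding on a state: keep only the better one.
        if dst not in table or cost < table[dst]:
            table[dst] = cost

    def expand_eps(table):
        # Stage 3: advance across epsilon arcs (single pass, so this
        # sketch assumes no chains of epsilon arcs).
        for src, dst, _, w in eps_arcs:
            if src in table:
                merge(dst, table[src] + w, table)

    expand_eps(tokens)
    for frame in am_scores:
        nxt = {}
        # Stage 1: consume one input symbol per hypothesis.
        for src, dst, label, w in emit_arcs:
            if src in tokens and label in frame:
                merge(dst, tokens[src] + w + frame[label], nxt)
        # Stage 2: prune by beam width B, then by token count K.
        if nxt:
            best = min(nxt.values())
            nxt = {s: c for s, c in nxt.items() if c <= best + beam_width}
            nxt = dict(sorted(nxt.items(), key=lambda kv: kv[1])[:max_tokens])
        expand_eps(nxt)
        tokens = nxt
    return tokens

arcs = [(0, 1, 'a', 0.0), (1, 2, 'b', 0.0), (0, 2, 'b', 1.0)]
frames = [{'a': 0.1, 'b': 0.5}, {'b': 0.2}]
print(beam_search(arcs, frames, beam_width=10.0, max_tokens=100))
# surviving state and its best path cost (state 2, cost about 0.3)
```

Shrinking beam_width or max_tokens in this sketch is exactly the pruning that trades decoding speed against search errors.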
It is possible to reduce these errors arbitrarily close to zero by increasing the beam width and maximum token count.

Quiz

Question 1: Which of these can most compactly represent a set of strings? (Choose one)
  Weighted Finite State Transducer
  Weighted Finite State Acceptor
  Finite State Transducer
  Finite State Acceptor
  None of the above

Question 2: Which component of the decoding graph represents valid sequences of words? (Choose one)
  The Grammar
  The HMM State Transducer
  The Pronunciation Lexicon
  None of the above

Question 3: Which one of the following properties will cause an FST to be non-functional? (Choose one)
  More than one input string maps to the same output string
  For at least one path, the path score is negative
  One input string maps to more than one output string
  For at least one path, the input string and output string are identical

Question 4: Which component of the decoding graph generally needs disambiguation symbols to prevent non-functionality? (Choose one)
  The Grammar
  The HMM State Transducer
  The Pronunciation Lexicon
  None of the above

Question 5: Which component of the decoding graph describes the HMM structures of the acoustic model? (Choose one)
  The Grammar
  The HMM State Transducer
  The Pronunciation Lexicon
  None of the above

Question 6: When successfully applied, which of these algorithms tend to make the resulting structure larger? (Choose all that apply)
  Minimization
  Determinization
  Composition
  None of the above

Question 7: When successfully applied, which of these algorithms tend to make the resulting structure smaller? (Choose all that apply)
  Minimization
  Determinization
  Composition
  None of the above

Question 8: During beam search decoding, hypotheses may be discarded if which of the following conditions exist? (Choose all that apply)
  The token’s score falls outside the beam width.
  There are too many tokens, and this token isn’t one of the N best.
  The token enters a determinized state.
  The token enters a minimized state.
  
None of the above

Question 9: Which of these are standard conventions for WFSTs? (Choose all that apply)
  The initial state in the first line of a WFST description is assumed to be the initial state of the graph.
  Weights are assumed to be zero unless otherwise specified.
  The final state of the final line of a WFST description is assumed to be the final state of the graph.
  None of the above

Question 10: When a path traverses an arc with an epsilon symbol, what does this indicate? (Choose one)
  The path’s string is appended with an epsilon.
  The path’s string is unaffected by this arc.
  The path’s weight is reset to zero.
  None of the above

Lab

Lab for Module 5: Decoding

Required files:

HCL.fst
This is a finite state transducer that maps sequences of context independent phone states (acoustic model labels) to sequences of words. The lexicon has 200,000 words. HCL.fst has not been determinized. Instead, it has been prepared to make composition with a language model WFST as efficient as possible. This FST has disambiguation symbols on its input side. They ensure that every unique input sequence has a unique output sequence, regardless of homonyms. This preserves the functional nature of the transducer, which makes determinization possible. It is expected that the language model contains “\" symbols on its input side, which represent the backoff transitions of the language model. This HCL.fst contains arcs with \"\" labels on the input and output so these transitions will also be present on the input side of the fully composed graph. If it did not, then the composed graph would not be functional, and determinization would be impossible.

DecodingGraph.fst
This is the result of composing HCL.fst with a trigram language model, and applying a series of transformations to remove both disambiguation and language model backoff symbols, as well as to compact the structure into fewer arcs.

H.FST.isym and L.fst.osym
These are the input and output symbol tables that should cover the input and output of HCL.fst, DecodingGraph.fst, and any other decoding graph you build in this lab.

StaticDecoder.py
This is a simple Python-based beam decoder. It relies on loading a CNTK acoustic model, a WFST decoding graph, and pre-processed acoustic features.

Instructions:

Run StaticDecoder.py to decode the test data using the provided DecodingGraph.fst and the Experiments/lists/feat_test.rscp generated in lab 2.
  The provided DecodingGraph.fst is in OpenFst format, but the decoder expects it to be in text format. Create DecodingGraph.fst.txt using the OpenFST fstprint tool.
  Run the provided decoder with default parameters on the test data, using any acoustic model built in Section 3 of this class.
  Measure the word error rate of the decoder’s output with respect to the given reference text, using the word error rate module from Section 1 of this class.

Create a new decoding graph using a language model you have trained in Module 4 of this class.
  Convert the ARPA format language model to its FST approximation. The arpa2fst.py tool is provided for this purpose.
  Compose your new G.fst with the given HCL.fst.
  Process the graph using a mixture of label pushing, encoding, decoding, minimization, and determinization. As part of this process, all disambiguation symbols and language model backoff symbols should be manually converted into “\".
  Use the resulting HCLG.fst in place of DecodingGraph.fst to repeat Assignment 2 above.

Measure the time-accuracy tradeoff of the decoder.
  Run the decoder with the provided DecodingGraph.fst two more times: once with the beam width decreased by a factor of ten, and once with the beam width increased by a factor of ten.
  What do you observe about the relationship between decoding speed and word error rate? 
What do you expect if the beam width were increased to the point that no pruning occurred?

Supplementary Material:

https://cs.nyu.edu/~mohri/pub/hbka.pdf
This chapter from the Springer Handbook on Speech Processing and Speech Communication has more information than you will need to complete the labs, but may be interesting for the motivated student. Section 3 details some standard algorithms, and Section 4 describes how the WFST framework is typically applied for the speech recognition task.",
      "categories": [],
      "tags": [],
      "date": ""
    },
  
    {
      "title": "M3: Acoustic Modeling",
      "url": "/speech-recognition-course/M3_Acoustic_Modeling/",
      "excerpt": "M3: Acoustic Modeling Previous Table of Contents Introduction Hidden Markov Models The Evaluation Problem The Decoding Problem The Training Problem Hidden Markov Models for Speech Recognition Choice of subword units Deep Neural Network Acoustic Models Generate Frame based Sonal Levels Training Feedforward Deep Neural Networks Training Recurrent Neural Networks Long...",
      "content": "M3: Acoustic ModelingPreviousTable of Contents  Introduction  Hidden Markov Models  The Evaluation Problem  The Decoding Problem  The Training Problem  Hidden Markov Models for Speech Recognition  Choice of subword units  Deep Neural Network Acoustic Models  Generate Frame based Sonal Levels  Training Feedforward Deep Neural Networks  Training Recurrent Neural Networks  Long Short-Term Memory Networks  Using a Sequence-based Objective Function  Decoding with Neural Network Acoustic Models  LabIntroductionIn this module, we’ll talk about the acoustic model used in modern speech recognizers. In most systems today, the acoustic model is a hybrid model with uses deep neural networks to create frame-level predictions and then a hidden Markov model to transform these into a sequential prediction. A hidden Markov model (HMM) is a very well-known method for characterizing a discrete-time (sampled) sequence of events. The basic ideas of HMMs are decades old and have been applied to many fields.Before studying HMMs, it will be useful to briefly review Markov chains. Markov chains are a method for modeling random processes. In a Markov chains, discrete events are modeled with a number of states. The movement among states is governed by a random process.Let’s consider an example. In a weather prediction application, the states could be “Sunny”, “Partly Cloud”, “Cloudy”, and “Raining”. If we wanted to consider the probability of a particular 5 day forecast, e.g. 
$P(p,p,c,r,s)$, we would employ Bayes’ rule to break up this joint probability into a series of conditional probabilities.\\[p(X_1,X_2,X_3,X_4,X_5)=p(X_5|X_4,X_3,X_2,X_1)p(X_4|X_3,X_2,X_1)p(X_3|X_2,X_1)p(X_2|X_1)p(X_1)\\]This expression can be greatly simplified if we consider the first-order Markov assumption, which states that\\[p(X_i|X_1,\\ldots,X_{i-1})=p(X_i|X_{i-1})\\]Under this assumption, the joint probability of a 5-day forecast can be written as\\[\\begin{split}p(X_1,X_2,X_3,X_4,X_5) &amp;= p(X_5|X_4)p(X_4|X_3)p(X_3|X_2)p(X_2|X_1)p(X_1) \\\\&amp;=p(X_1)\\prod_{i=2}^5p(X_i|X_{i-1})\\end{split}\\]Thus, the key elements of a Markov chain are the state identities (weather forecasts in this case) and the transition probabilities $p(X_i \\vert X_{i-1})$ that express the probability of moving from one state to another (including back to the same state). For example, a complete (though likely inaccurate) Markov chain for weather prediction can be depicted as

Note that in addition to the conditional probabilities\\[p(X_i|X_{i-1})\\]in the equation above, there is also a probability associated with the first element of the sequence,\\[p(X_1).\\]So, in addition to the state inventory and the conditional transition probabilities, we also need a set of prior probabilities that indicate the probability of starting the chain in each of the states. Let us assume our prior probabilities are as follows:\\[p(p)=\\pi_p,\\quad p(c)=\\pi_c,\\quad p(r)=\\pi_r,\\quad p(s)=\\pi_s\\]Now, let us return to the example. We can now compute the probability $P(p,p,c,r,s)$ quite simply as\\[\\begin{split}p(p,p,c,r,s) &amp;= p(s|r,c,p,p) p(r|c,p,p) p(c|p,p) p(p|p) p(p) \\\\&amp;= p(s|r) p(r|c) p(c|p) p(p|p) p(p)\\end{split}\\]

Hidden Markov Models

Hidden Markov models (HMMs) are a generalization of Markov chains. In a Markov chain, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. 
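Before moving on to hidden states, the plain Markov chain computation above is easy to verify numerically. In this hedged Python check, the prior and transition probabilities are hypothetical stand-ins for the values in the weather diagram.

```python
# Worked check of the 5-day forecast example. All probabilities here are
# hypothetical. States: p = partly cloudy, c = cloudy, r = raining, s = sunny.
prior = {'p': 0.4, 'c': 0.3, 'r': 0.1, 's': 0.2}
trans = {('p', 'p'): 0.5, ('p', 'c'): 0.3, ('c', 'r'): 0.4, ('r', 's'): 0.2}

def chain_probability(states):
    # First-order Markov chain: prior of the first state times the
    # product of one-step transition probabilities.
    prob = prior[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= trans[(prev, cur)]
    return prob

# p(p,p,c,r,s) = p(p) * p(p|p) * p(c|p) * p(r|c) * p(s|r)
print(chain_probability(['p', 'p', 'c', 'r', 's']))  # 0.4*0.5*0.3*0.4*0.2
```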
In contrast, in an HMM, the state is not directly visible, but the output (in the form of data) is visible. Each state has a probability distribution over the possible output tokens. Therefore, the parameters of an HMM are the initial state distribution, the state transition probabilities, and the output token probabilities for each state. The Markov chains previously described are also known as observable Markov models. That is because once you land in a state, it is known what the outcome will be, e.g. it will rain. A hidden Markov model is different in that each state is defined not by a deterministic event or observation but by a probability distribution over events or observations. This makes the model doubly stochastic: the transitions between states are probabilistic, and so are the observations in the states themselves. We could convert the Markov chain on weather to a hidden Markov model by replacing the states with distributions. Specifically, each state could have a different probability of seeing various weather conditions, such as sunny, partly cloudy, cloudy, or rainy.

Thus, an HMM is characterized by a set of N states along with:

A transition matrix $A$ with elements $a_{ij}$ that defines the probabilities of transitioning among states
A probability distribution for each state, $B= \\lbrace b_i(x) \\rbrace,\\; i= 1,2,\\ldots, N$
A prior probability distribution over states, $\\pi= \\lbrace \\pi_1, \\pi_2, \\ldots, \\pi_N \\rbrace$

Thus, we can summarize the parameters of an HMM compactly as $\\Phi = \\left\\lbrace A, B, \\pi\\right\\rbrace$. There are three fundamental problems for hidden Markov models, each with well-known solutions. We will only briefly describe the problems and their solutions next. 
There are many good resources online and in the literature for additional details.

The Evaluation Problem

Given a model with parameters $\\Phi$ and a sequence of observations $X = \\left\\lbrace x_1, x_2, \\ldots, x_T\\right\\rbrace$, how do we compute the probability of the observation sequence, $P(X \\vert \\Phi)$? This is known as the evaluation problem. The solution is to use the forward algorithm. The evaluation problem can be solved by summing up the probability over all possible values of the hidden state sequence. Implemented naively, this can be quite expensive, as there are an exponential number of state sequences ($O(N^T)$, where $N$ is the number of states and $T$ the number of time steps). The forward algorithm is a far more efficient dynamic-programming solution. As its name implies, it processes the sequence in a single forward pass. It stores up to $N$ values at each time step, and reduces the computational complexity to $O(N^2T)$.

The Decoding Problem

Given a model $\\Phi$ and a sequence of observations $X = \\left\\lbrace x_1, x_2, \\ldots, x_T\\right\\rbrace$, how do we find the most likely sequence of hidden states $Q = \\left\\lbrace q_1, q_2, \\ldots, q_T\\right\\rbrace$ that produced the observations? This is known as the decoding problem. The solution is to use the Viterbi algorithm. 
The application of this algorithm to the special case of large vocabulary speech recognition is discussed in Module 5, and an example of how it can be integrated into the training criterion is discussed in Module 6.

The Training Problem

Given a model and an observation sequence (or a set of observation sequences), how can we adjust the model parameters $\\Phi$ to maximize the probability of the observation sequence? This problem can be efficiently solved using the Baum-Welch algorithm, which includes the forward-backward algorithm. A byproduct of the forward algorithm mentioned earlier in this lesson is that it computes the probability of being in a state i at time t given all observations up to and including time t. The backward algorithm has a similar structure, but computes the probability of being in state i at time t given all future observations starting at t+1. These two quantities are combined in the forward-backward algorithm to produce the posterior probability of being in state i at time t given all of the observations. Once we know the posterior probability for each state at each time, the Baum-Welch algorithm acts as if these were direct observations of the hidden state sequence, and updates the model parameters to improve the objective function. An example of how this applies to acoustic modeling is covered in more depth in Module 6.

Hidden Markov Models for Speech Recognition

In speech recognition, hidden Markov models are used to model the acoustic observations (feature vectors) at the subword level, such as phonemes. It is typical for each phoneme to be modeled with 3 states, to separately model the beginning, middle, and end of the phoneme. Each state has a self-transition and a transition to the next state. Word HMMs can be formed by concatenating their constituent phoneme HMMs. 
For example, the HMM for the word “cup” can be formed by concatenating the HMMs for its three phonemes. Thus, a high-quality pronunciation dictionary, which “spells” each word in the system by its phonemes, is critically important for successful acoustic modeling. Historically, each state in the HMM had a probability distribution defined by a Gaussian Mixture Model (GMM), which is defined as\\[p(x|s)=\\sum_m w_m {\\mathcal N}(x;\\mu_m, \\Sigma_m)\\]where ${\\mathcal N}(x;\\mu_m,\\Sigma_m)$ is a Gaussian distribution and $w_m$ is a mixture weight, with $\\sum_m w_m=1$. Thus, each state of the model has its own GMM. The Baum-Welch training algorithm estimated all the transition probabilities as well as the means, variances, and mixture weights of all GMMs. Modern speech recognition systems no longer model the observations using a collection of Gaussian mixture models, but rather a single deep neural network whose output labels represent the state labels of all HMM states of all phonemes. For example, if there were 40 phonemes and each had a 3-state HMM, the neural network would have $40\\times3=120$ output labels. Such acoustic models are called “hybrid” systems or DNN-HMM systems to reflect the fact that the observation probability estimation formerly done by GMMs is now done by a DNN, but that the rest of the HMM framework, in particular the HMM state topologies and transition probabilities, is still used.

Choice of subword units

In the previous section, we described how word HMMs can be constructed by chaining the HMMs for the individual phones in a word according to the pronunciation dictionary. These phonemes are referred to as “context-independent” phones, or CI phones for short. It turns out that the realization of a phoneme is, in fact, heavily dependent on the phonemes that precede and follow it. 
For example, the /ah/ sound in “bat” is different from the /ah/ sound in “cap.”For this reason, higher accuracy can be achieved using “context-dependent” (CD) phones. Thus, to model “bat,” we’d use an HMM representing the context-dependent phone /b-ah+t/ for the middle /ah/ sound, and for the word “cap,” we’d use a separate HMM that modeled /k-ah+p/. So, imagine the word “cup” was in the utterance “a cup of coffee”. Then, “cup” would be modeled by the following context-dependent HMMs.Because this choice of context-dependent phones models 3 consecutive phones, they are referred to as “triphones”. Though not common, some systems model an even longer phonetic context, such as “quinphones”, which span a sequence of 5 consecutive phones.When context-independent phones are used, there is a very manageable number of states: $N$ phones times $P$ states per phone. U.S. English is typically represented using 40 phones, with three states per phone. This results in 120 context-independent states. As we move to context-dependent units, the number of triphones is $N^3$. This leads to a significant increase in the number of states, for example: $40^3 \\times 3 = 192,000$.This explosion of the label space leads to two major problems:  Far less data is available to train each triphone  Some triphones will not be observed in training but may occur in testingA solution to these problems is in widespread use, which involves pooling the data associated with multiple context-dependent states that have similar properties and combining them into a single “tied” or “shared” HMM state. This tied state, known as a senone, is then used to compute the acoustic model scores for all of the original HMM states whose data was pooled to create it.Grouping a set of context-dependent triphone states into a collection of senones is performed using a decision-tree clustering process. 
A decision tree is constructed for every state of every context-independent phone.The clustering process is performed as follows:      Merge all triphones with a common center phone from a particular state together to form the root node. For example, state 2 of all triphones of the form /-p+/        Grow the decision tree by asking a series of linguistic binary questions about the left or right context of the triphones. For example, “Is the left context phone a back vowel?” or “Is the right context phone voiced?” At each node, choose the question that results in the largest increase in likelihood of the training data.        Continue to grow the tree until the desired number of nodes is obtained or the likelihood increase of a further split is below a threshold.        The leaves of this tree define the senones for this context-dependent phone state.  This process solves both problems listed above. First, the data can now be shared among several triphone states, so the parameter estimates are robust. Second, if a triphone is needed at test time that was unseen in training, its corresponding senone can be found by walking the decision tree and answering the splitting questions appropriately.Almost all modern speech recognition systems that use phone-based units utilize senones as the context-dependent unit. A production-grade large vocabulary recognizer typically has about 10,000 senones in the model. Note that this is far more than the 120 context-independent states but far less than the 192,000 states in an untied context-dependent system.Deep Neural Network Acoustic ModelsOne of the most significant advances in speech recognition in recent years is the use of deep neural network acoustic models. 
As mentioned earlier, the hybrid DNN systems replace a collection of GMMs (one for every senone) with a single deep neural network with output labels corresponding to senones.The most common objective function used for training neural networks for classification tasks is cross entropy. For an $M$-way multi-class classification task such as senone classification, the objective function for a single sample can be written as\\[E = -\\sum_{m=1}^M t_m \\log(y_m)\\]where $t_m$ is the label (1 if the data is from class m and 0 otherwise) and $y_m$ is the output of the network, which is a softmax layer over the output activations. Thus, for each frame, we need to generate an M-dimensional one-hot vector that consists of all zeros except for a single 1 corresponding to the true label. This means that we need to assign every frame of every utterance to a senone in order to generate these labels.Generating frame-based senone labelsTo label all frames of the training data with a corresponding senone label, a process known as forced alignment is used. In this process, we essentially perform HMM decoding but constrain the search to be along all paths that will produce the correct reference transcription. Forced alignment then generates the single most-likely path, and thus, the senone label for every frame in the utterance.The forced alignment process needs a speech recognition system to start from. This can be an initial GMM-based system or, if the senone set is the same, a previously trained neural network-based system.The output of forced alignment is typically a file that lists, for each utterance, the start frame and end frame of each segment and the corresponding senone label. This format can be different depending on the toolkit being used. HTK is a well-known speech recognition toolkit. In HTK, the output from forced alignment is called an MLF file. Here is a snippet from an MLF file produced by forced alignment. 
The columns of the MLF can be interpreted as  Start time (in 100ns time units)  End time (in 100ns time units)  Senone ID  Acoustic model score for that senone segment  Context-dependent triphone HMM model (appears at start of phone boundary)  Acoustic model score for the triphone HMM model  Word in the transcription (appears at start of word boundary)From this, or a similar output, we can easily generate the labels required for training a deep neural network acoustic model.Training Feedforward Deep Neural NetworksThe simplest and most common neural network used for acoustic modeling is the conventional fully connected feed-forward neural network. Information on feedforward DNNs is readily found online, so we will focus here on the key aspects of DNN-based acoustic models.Although we are training a DNN to predict the label for each frame of input, it is very beneficial for classification to provide a context window of frames to the network as input. Specifically, for the frame at time t, the input to the network is a symmetric window of the N frames before and N frames after. Thus, if $x_t$ is the feature vector at time t, the input to the network is\\[X_t = [ x_{t-N},  x_{t-N+1},  \\ldots,  x_t,  \\ldots,  x_{t+N-1},  x_{t+N} ]\\]Typical values of N are between 5 and 11, depending on the amount of training data. Larger context windows provide more information but require a larger matrix of parameters in the input layer of the model, which can be hard to train without ample data.It is often advisable to augment the feature vectors with their temporal derivatives, also known as delta features. These features can be computed from simple differences or more complicated regression formulae. 
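The context-window stacking described above can be sketched as follows (a numpy sketch; handling utterance edges by repeating the first and last frame is one common convention, assumed here for illustration):

```python
import numpy as np

def stack_context(feats, N):
    """Build X_t = [x_{t-N}, ..., x_t, ..., x_{t+N}] for every frame t.
    feats: (T, D) array of T frames of D-dimensional features.
    Returns a (T, (2N+1)*D) array; frames past the utterance boundary
    are padded by repeating the first/last frame."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], N, axis=0),
                             feats,
                             np.repeat(feats[-1:], N, axis=0)])
    return np.stack([padded[t:t + 2 * N + 1].reshape(-1) for t in range(T)])

feats = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2-dim features
X = stack_context(feats, N=2)                     # 5-frame window per input
```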
For example,\\[\\Delta x_t = x_{t+2} - x_{t-2}\\]\\[\\Delta^2 x_t = \\Delta x_{t+2} - \\Delta x_{t-2}\\]In this case the input to the network for each frame is a context window of stacked features which consist of the original feature vectors, the delta, and the delta-delta features\\[x_t, \\Delta x_t, \\Delta^2 x_t.\\]This input is then processed through a number of fully connected hidden layers and then finally by a softmax layer over senone labels to make the prediction. This network can then be trained by backpropagation in the usual manner.Training Recurrent Neural NetworksRecurrent neural networks (RNNs) are a type of neural network that is particularly well-suited to sequence modeling tasks such as speech recognition. RNNs are designed to process sequences of data, such as speech signals, by maintaining an internal state that is updated at each time step. This allows RNNs to capture dependencies between elements in the sequence and to model long-term dependencies.Unlike feedforward DNNs, recurrent networks process data as a sequence and have a temporal dependency between the weights. There are several standard forms of recurrent networks. A conventional RNN has a hidden layer output that can be expressed as\\[h_t^i = f(W^i h_t^{i-1} + U^i h_{t-1}^i + c^i)\\]where $f(\\cdot)$ is a nonlinearity such as a sigmoid or ReLU function, $i$ is the layer index, $t$ is the frame or time index, and the input $x$ is equivalent to the output of the zeroth layer, $h_t^0=x_t$.In contrast to a feedforward layer, a recurrent layer’s output has a dependence on both the current input and the output from the previous time step. If you are familiar with filtering operations in signal processing, an RNN layer can be considered a nonlinear infinite impulse response (IIR) filter.In offline applications, where latency is not a concern, it is possible to perform the recurrence in both the forward and backward directions. These networks are known as bidirectional recurrent neural networks. 
In this case, each layer has a set of parameters to process the sequence forward in time and a separate set of parameters to process the sequence in reverse. These two outputs can then be concatenated to form the input to the next layer. This can be expressed mathematically as\\[\\begin{split}\\overrightarrow{h_t^i} &amp;= f\\left(W_f^i h_t^{i-1} + U_f^i h_{t-1}^i + c_f^i\\right)  \\\\\\overleftarrow{h_t^i} &amp;= f\\left(W_b^i h_t^{i-1} + U_b^i h_{t+1}^i + c_b^i\\right) \\\\h_t^i &amp;= \\left[\\overrightarrow{h_t^i}, \\overleftarrow{h_t^i}\\right] \\end{split}\\]where the subscripts $f$ and $b$ indicate parameters for the forward and backward directions, respectively.RNNs are appealing for acoustic modeling because they can learn the temporal patterns in the feature vector sequences, which is very important for speech signals. In order to train RNNs, therefore, the sequential nature of the training sequences must be preserved. Thus, rather than the frame-based randomization typically performed for feedforward networks, we perform utterance-based randomization, where the ordering of the utterances is randomized but the sequential nature of the utterances themselves is preserved.Because the network itself is learning correlations in time of the data, a wide context window of frames on the input is no longer required. It can be helpful for unidirectional RNNs to provide several frames of future context, but this is typically much smaller than in the feed-forward case. In bidirectional RNNs, there is typically no benefit to providing any context window, because when processing any particular frame, the network has already seen the entire utterance via either the forward processing or the backward processing.Training an RNN can still be performed using the same cross-entropy objective function, with a slightly modified gradient computation. Due to the temporal nature of the model, a variation of back-propagation called back-propagation through time (BPTT) is used. 
This algorithm arises when you consider that in an RNN, the output at the current time step is dependent on the input at the current time step as well as the inputs at all previous time steps (assuming a unidirectional RNN).Like standard back-propagation, BPTT optimizes the model parameters using gradient descent. The gradient of the objective function with respect to the model parameters is computed via the chain rule. Because of the temporal nature of the model, the chain rule requires the multiplication of many gradient terms (proportional to the number of frames of history). Because there is no restriction on these terms, it is possible for the expression to become close to zero (in which case no learning occurs) or excessively large, which leads to training instability and divergent behavior. These phenomena are referred to as vanishing gradients and exploding gradients, respectively. To combat vanishing gradients, there are two well-known solutions: 1) employ specific recurrent structures that avoid these issues, such as the LSTM, which will be discussed in the next section, or 2) truncate the BPTT algorithm to only look back over a fixed-length history, which limits the total number of terms. To combat exploding gradients, a method called gradient clipping is employed, which sets an upper limit on the absolute value of the gradient for any parameter in the model. Gradients with an absolute value larger than the clipping threshold are set to the clipping threshold.All standard deep learning toolkits support training recurrent networks with these features.Long Short-Term Memory NetworksIn order to combat the issues with vanishing/exploding gradients and better learn long-term relationships in the training data, a new type of recurrent architecture was proposed. These networks are called Long Short-Term Memory (LSTM) networks. 
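The gradient clipping remedy described above can be sketched as follows (an illustrative numpy sketch of element-wise clipping; real toolkits apply this at each update step, and some clip by the global gradient norm instead):

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Element-wise clipping: any gradient component whose absolute value
    exceeds the threshold is set to +/- threshold."""
    return np.clip(grad, -threshold, threshold)

grad = np.array([0.5, -7.0, 120.0, -0.01])
clipped = clip_gradient(grad, threshold=5.0)  # -> [0.5, -5.0, 5.0, -0.01]
```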
Because of their widespread success in many tasks, LSTMs are now the most commonly used type of recurrent network.An LSTM uses the concept of a cell, which is like a memory that stores state information. This information can be preserved over time or overwritten by the current information using multiplicative interactions called gates. Gate values close to 0 block information, while gate values close to 1 pass information through. The input gate decides whether to pass information from the current time step into the cell. The forget gate decides whether to persist or erase the current contents of the cell, and the output gate decides whether to pass the cell information onward in the network. A diagram of the LSTM is shown below.There are many details of LSTMs that can be found online. For example, this blog post does a good job explaining the operation of LSTMs: https://colah.github.io/posts/2015-08-Understanding-LSTMs/.Other variants of the LSTM have been proposed, such as the Gated Recurrent Unit (GRU), which is a simplified version of the LSTM. On some tasks, GRUs have shown similar performance to LSTMs with fewer parameters.Using a Sequence-based Objective FunctionWhile RNNs are a sequential acoustic model in that they model the sequence of acoustic feature vectors as a time series, the objective function is still frame-independent. However, because speech recognition is inherently a sequence-classification task, it is beneficial to employ a sequence-based objective function. Sequence-based objective functions were first proposed for GMM-HMM speech recognition systems and have since been updated to work with neural-network acoustic models. The key difference between frame-based cross entropy and a sequence-discriminative objective function is that the sequence-based objective function more closely models the decoding process. 
Specifically, a language model is incorporated in order to determine the most likely competitors that the model needs to learn to classify correctly against. Put more simply, in frame-based cross-entropy, any incorrect class is penalized equally, even if that incorrect class would never be proposed in decoding due to the HMM state topology or the language model. With sequence training, the competitors to the correct class are determined by performing a decoding of the training data.There are several sequence-discriminative objective functions. One of the most well-known is maximum mutual information (MMI). The MMI objective function can be written as\\[F_{MMI}= \\sum_u \\log \\frac {p(X_u|S_u)p(W_u)} {\\sum_{W'}p(X_u|S_{W'})p(W')}\\]where $u$ is an index over utterances in the training set, $W_u$ is the reference word sequence for utterance $u$, and $S_u$ is the corresponding state sequence. The denominator is a sum over all possible word sequences, where $S_{W'}$ represents the state sequence corresponding to the alternative word sequence $W'$. This summation penalizes the model by considering competing hypotheses $W'$ that could explain the observed features $X_u$. It is typically approximated by a word lattice, which is a graph over the possible hypotheses encountered in decoding.Using this objective function adds significant complexity to the training of acoustic models but typically results in improved performance, and it is a component of most state-of-the-art systems. There are also many variations on this approach, such as Minimum Phone Error (MPE) training and state-level Minimum Bayes Risk (sMBR) training.Decoding with Neural Network Acoustic ModelsThe neural network acoustic models compute posterior probabilities $p\\left(s \\vert x_{t} \\right)$ over senone labels ($s$). These state-level posterior probabilities must be converted to state likelihoods $p\\left( x_{t} \\vert s \\right)$ for decoding with an HMM, as will be discussed in Module 5. 
This can be done by an application of Bayes’ rule:\\[p\\left( x_{t} \\middle| s \\right) = \\frac{p\\left( s \\middle| x_{t} \\right)p\\left( x_{t} \\right)}{p(s)} \\propto \\frac{p\\left( s \\middle| x_{t} \\right)}{p(s)}\\]Note that because the prior over the observations $p\\left( x_{t} \\right)$ is constant over all senones, it contributes a constant factor to all likelihood scores and can be dropped. Thus, the likelihood $p\\left( x_{t} \\vert s \\right)$ is computed by dividing the network’s posterior probabilities by the senone prior $p(s)$. This senone prior probability $p(s)$ can be easily estimated by counting the occurrences of each senone in the training data.This likelihood is known as a scaled likelihood, to reflect the fact that it has been computed by scaling the senone posterior by its prior.LabInstructions:In this lab, we will use the features generated in the previous lab along with the phoneme state alignments provided in the course materials to train two different neural network acoustic models, a DNN and an RNN.The inputs to the training program are:  lists/feat_train.rscp, lists/feat_dev.rscp - Lists of training and dev feature files, stored in a format called RSCP. This stands for “relative SCP” file, where SCP is HTK shorthand for script file. It is simply a list of the files in the two sets. The dev set is used in training to monitor overfitting and perform early stopping. These files should have been generated as part of completing lab 2.  am/feat_mean.ascii, am/feat_invstddev.ascii - The global mean and precision (inverse standard deviation) of the training features, also computed in lab 2  am/labels_all.cimlf - The phoneme-state alignments that have been generated as a result of forced alignment of the data to an initial acoustic model. Generating this file requires the construction of a GMM-HMM acoustic model, which is outside the scope of this course, so we are providing it to you. 
The labels for both the training and dev data are in this file.  am/labels.ciphones - The list of phoneme state symbols which correspond to the output labels of the neural network acoustic model  am/labels_ciprior.ascii - The prior probabilities of the phoneme state symbols, obtained by counting the occurrences of these labels in the training data.The training, dev, and test RSCP files and the training set global mean and precision files were generated by the lab in Module 2. The remaining files have been provided for you and are in the am directory.Part 1: Training a feedforward DNNWe have provided a python program called M3_Train_AM.py which will train a feed-forward deep network acoustic model using the files described above. The program is currently configured to train a network with the following hyperparameters:  4 hidden layers with 512 hidden units per layer  120 output units corresponding to the phoneme states  Input context window of 23 frames, which means the input to the network for a given frame is the current frame plus the 11 frames in the past and the 11 frames in the future  Minibatch size of 256  Learning is performed with momentum SGD with a learning rate of 1e-04 per sample and momentum specified as a time constant of 2500  One epoch is defined as a complete pass over the training data, and training will run for 100 epochsThe development set will be evaluated every 5 epochs.This can be executed by running$ python M3_Train_AM.pyOn a GTX 965M GPU running on a laptop, the network trained at a rate of 63,000 samples/sec, or about 20 seconds per epoch. 
Thus, 100 epochs will run in about 2000 seconds, or roughly 33 minutes.After 100 epochs, the result of training, obtained from the end of the log file, wasFinished Epoch[100 of 100]: [CE_Training] loss = 1.036854 * 1257104, metric = 32.74% * 1257104 17.146s (73317.6 samples/s);Finished Evaluation [20]: Minibatch[1-11573]: metric = 44.26% * 370331;Thus, the training set has a cross entropy of 1.04 per sample and a 32.74% frame error rate, while the held-out dev set has a frame error rate of 44.3%.After training is complete, you can visualize the training progress using M3_Plot_Training.py. It takes a CNTK log file as input and will plot epoch vs. cross-entropy of the training set on one figure and epoch vs. frame error rate of the training and development sets on another figure.$ python M3_Plot_Training.py --log &lt;logfile&gt;For this experiment, &lt;logfile&gt; would be `../am/dnn/log`Here is an example of the figure produced by this script.As you can see from the figure, overfitting has not yet occurred, as the development set performance is still the best in the final epoch. It is possible that small improvements can be made with additional training iterations.You can now experiment with this neural network training script. You can modify the various hyperparameters to see if the performance can be further improved. For example, you can vary the  Number of layers  Number of hidden units in each layer  Learning rate  Minibatch size  Number of epochs  Learning algorithm (see the CNTK documentation for details on using other learners, such as Adam or AdaGrad)Part 2: Training a Recurrent Neural NetworkIn the second part of this lab, you will modify the code to train a Bidirectional LSTM (BLSTM) network, a type of recurrent neural network.To train a BLSTM, there are several changes in the code that you should be aware of.In DNN training, all frames (samples) are processed independently, so the frames in the training data are randomized across all utterances. 
In RNN training, the network is trying to learn temporal patterns in the speech sequence, so the order of the utterances can be randomized but the utterances themselves must be kept intact. Thus, we set frame_mode=False in the MinibatchSource instantiated by create_mb_source().Change the network creation to create a BLSTMIn create_network(), we’ve created a function called MyBLSTMLayer as specified below. This function uses the optimized_rnnstack functionality in CNTK. A complete description and additional examples can be found in the CNTK documentation. One thing to be aware of is that with a BLSTM, the size of the hidden layer is actually applied to both directions. Thus, setting the number of hidden units to 512 means that both the forward and backward layers consist of 512 cells. The outputs of the forward and backward layers are then concatenated, forming an output of 1024 units. This is then projected back to 512 using the weight matrix W.def MyBLSTMLayer(hidden_size=128, num_layers=2):    W = C.Parameter((C.InferredDimension, hidden_size), init=C.he_normal(1.0), name='rnn_parameters')    def _func(operand):        return C.optimized_rnnstack(operand, weights=W, hidden_size=hidden_size, num_layers=num_layers, bidirectional=True, recurrent_op='lstm')    return _funcThe code calls MyBLSTMLayer when the model_type is BLSTM. We’ve reduced the number of hidden layers to 2, since the BLSTM layers have more total parameters than the DNN layers.For utterance-based processing, the entire utterance needs to be processed during training. Thus, the minibatch size specifies the total number of frames to process, but multiple utterances will be packed together if possible. Setting the minibatch size to a larger number allows for efficient processing with multiple utterances in each minibatch. 
We have set the minibatch size to 4096.To train the BLSTM model, you can execute the following command.$ python M3_Train_AM.py --type BLSTMBecause of the sequential nature of BLSTM processing, these networks are inherently less parallelizable and thus train much slower than DNNs. On a GTX 965M GPU running on a laptop, the network trained at a rate of 440 seconds per epoch, or 20 times slower than the DNN. Thus, we will only train for 10 epochs to keep processing time reasonable.Here too, you can use M3_Plot_Training.py to inspect the training progress. And again, if you are interested, you can vary the hyperparameters to try to find a better solution.",
      "categories": [],
      "tags": [],
      "date": ""
    },
  
    {
      "title": "M4: Language Modeling",
      "url": "/speech-recognition-course/M4_Language_Modeling/",
      "excerpt": "Language Modeling Previous Table of Contents Introduction Vocabulary Markov factorization and N-grams N-gram probability estimation N-gram smoothing and discounting Back-off in N-gram models Likelihood, Entropy, and Perplexity N-gram Pruning Interpolating Probabilities Merging Models Class-based Language Models Neural Network Language Models Lab Introduction This module covers the basics of language modeling...",
      "content": "Language ModelingPreviousTable of Contents  Introduction  Vocabulary  Markov factorization and N-grams  N-gram probability estimation  N-gram smoothing and discounting  Back-off in N-gram models  Likelihood, Entropy, and Perplexity  N-gram Pruning  Interpolating Probabilities  Merging Models  Class-based Language Models  Neural Network Language Models  LabIntroductionThis module covers the basics of language modeling – the component of a speech recognition that estimates the prior probabilities P(W) of possible spoken utterances. Recall that these prior probabilities are combined with the acoustic model likelihoods $P(O \\vert W)$ in the Fundamental Equation of Speech Recognition to arrive at the overall best hypothesis\\[\\hat{W} = \\mathrm{arg\\,max}_{W} P( O | W) P(W)\\]Thus, the language model (or LM) embodies the recognizer’s knowledge of what probable word sequences are, even before it has heard any actual speech sounds. Instead of encoding hard rules of syntax and semantics that allow some utterances and disallow others, the LM should assign high probabilities to likely utterances and low probabilities to unlikely ones, without ruling anything out completely (because one never knows what people might actually say). Furthermore, the assignment of probabilities is not done by linguistic or other types of rules. Instead, just as with the acoustic model, we will estimate a parameterized model from data. Thus, we let the statistics of actually observed training data determine what words are likely to be heard in a language, scenario, or application.A note on terminology: in language modeling we often talk about sentences as the word sequence corresponding to an entire speech utterance, without suggesting that these represent anything like a correct and complete sentence in the conventional grammatical sentences. 
In fact, a sentence for LM purposes can be anything a speaker would utter in the context of a speech application.VocabularyWe need to assign a probability to every possible sentence\\[W = w_1 w_2 \\ldots w_n\\]where $n$ is the number of words, which is unbounded in principle. First, we simplify the problem by limiting the choice of words to a finite set, the vocabulary of the LM. Note that the vocabulary of the LM is also the vocabulary of the speech recognizer – we cannot recognize a word that is not considered possible by the LM (i.e., its probability would be effectively zero).Words outside the vocabulary are called out-of-vocabulary words, or OOVs. If we ever encounter an OOV in the input data we will incur (at least) one word recognition error, so it is important to choose the vocabulary so as to minimize the chances of OOVs. An obvious strategy is to pick the words that have the highest prior probability of occurring, as estimated from data. In other words, we choose the most frequently occurring words in a corpus of training data. For example, we can pick the N most frequent words, or all words occurring more than K times in the data, for suitable values of N or K. There is often an optimal vocabulary size that represents a good tradeoff between recognizer speed (a larger vocabulary means more computation in decoding) and accuracy (a larger vocabulary reduces OOVs, but adding very rare words has negligible effect on accuracy, and might even hurt it, due to search errors and greater acoustic confusability within the vocabulary).Markov factorization and N-gramsEven with a finite vocabulary, we still have an infinite set of word sequences, so clearly we cannot parameterize the LM by listing the probability of every possible sentence. 
Even if we conceptually could do so, it would be impossible to get reliable probability estimates, as the majority of possible sentences are very rare (the smaller a probability, the more data is needed to estimate it reliably).We can work around both these problems by using a trick we already discussed in the acoustic modeling module: use the chain rule to factor the sentence probability into a product of conditional word probabilities, and then apply the Markov assumption to limit the number of states, and thereby, parameters.\\[P(W) = P(w_1) \\times P(w_2 |w_1) \\times P(w_3 | w_1 w_2)\\times \\ldots \\times P(w_n | w_1 \\ldots w_{n-1})\\]Now let’s assume that the Markov state of the LM (often called the context or history) is limited to just one word. This gives us\\[P(W) = P(w_1) \\times P(w_2 |w_1) \\times P(w_3 | w_2) \\times\\ldots \\times P(w_n | w_{n-1})\\]Note how each word is now predicted by only the immediately preceding one, i.e., we’re using a first-order Markov model. However, in language modeling this terminology is not usually used; instead we call such a model a bigram model, because it uses only statistics of two adjacent words at a time. Correspondingly, a second-order Markov model is called a trigram model, and predicts each word based on the preceding two, and so forth.The generalization of this scheme is the N-gram model, i.e., each word is conditioned on the previous N-1 words. The parameters of such a model are associated with N-grams, i.e., strings of N words. It turns out that there is a good improvement when going from bigrams to trigrams, but little improvement as N is increased further. Therefore, in practice we rarely use LMs beyond 4-grams and 5-grams. In the labs we will use trigrams, and in the remainder of this module we will stick to bigrams for the most part, just to simplify notation. 
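The bigram factorization above can be sketched with a toy example (illustrative Python; the corpus is invented, the first word is left unconditioned here, and real LMs require smoothing so that unseen bigrams do not get probability zero):

```python
from collections import Counter

def bigram_probs(corpus):
    """Estimate P(w2 | w1) = c(w1 w2) / c(w1) from tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sent in corpus:
        contexts.update(sent[:-1])            # every word that acts as a context
        bigrams.update(zip(sent, sent[1:]))
    return {bg: c / contexts[bg[0]] for bg, c in bigrams.items()}

def sentence_prob(sent, probs):
    """P(W) ~= P(w2|w1) * ... * P(wn|w_{n-1})."""
    p = 1.0
    for bg in zip(sent, sent[1:]):
        p *= probs.get(bg, 0.0)               # zero for unseen bigrams, hence smoothing
    return p

corpus = [["the", "dog", "bites"], ["the", "dog", "wags"]]
probs = bigram_probs(corpus)
# P(bites | dog) = c(dog bites) / c(dog) = 1/2
```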
Just keep in mind that the concepts generalize to longer N-grams.Sentence start and endTo let our N-gram model assign probabilities to all possible finite word sequences, we are left with one small problem: how will the model predict where to end the sentence? We could devise a separate model component for the sentence length n, but it is far easier to introduce a special end-of-sentence tag &lt;/s&gt; into the vocabulary that marks the end of a sentence. In other words, the LM generates words left to right, and stops as soon as &lt;/s&gt; is drawn according to the conditional probability distribution. Importantly, this also ensures that the (infinite) sum of all sentence probabilities is equal to one, as it should be for a probability distribution.Similarly, we also introduce a start-of-sentence tag &lt;s&gt;. It is inserted before the first word $w_1$, and in fact represents the context for the first real word. This is important because we want the first word to be predicted with the knowledge that it occurs first in the sentence. Certain words such as “I” and “well” are especially frequent in first position, and using the start-of-sentence tag we can represent this using the bigram probabilities $P( w_1 \\mid \\lt s \\gt)$.The complete sentence probability, according to the bigram model, is now\\[P(W) = P(w_1 | \\lt s \\gt) \\times P(w_2 |w_1) \\times \\ldots\\times P(w_n | w_{n-1}) \\times P(\\lt/s\\gt | w_n)\\]We now turn to the problem of actually estimating these N-gram probabilities.N-gram probability estimationThe conditional probabilities based on N-grams can be naively estimated by their relative frequencies. Let $c(w_1 \\dots w_k)$ be the number of occurrences (or count) of the k-gram $w_1 \\ldots w_k$. 
For example, the conditional probability of “bites” following “dog” is the ratio

\\[P(bites | dog) = \\frac{c(dog\\ bites)} {c(dog)}\\]

Exercise for the reader: prove that the conditional probabilities of all bigrams sharing the same context part,

\\[P(bites | dog ) + P(bite | dog) + P(wags | dog) + \\cdots\\]

sum to one (so they form a probability distribution over the entire vocabulary). More generally, k-gram probability estimates are

\\[P(w_k | w_1 \\ldots w_{k-1}) = \\frac{c(w_1 \\ldots w_k)} {c(w_1\\ldots w_{k-1})}\\]

N-gram smoothing and discounting

Relative frequencies as estimates for probabilities have one severe problem: they give probability zero to any N-gram that is not observed in the training data (the numerator in the equation becomes zero). Training data is finite, and we should not rule out a combination of words simply because our limited language sample did not contain it. (Another reason is that language is not a static system, and speakers come up with new expressions and even words all the time, either because they are being creative or because language production is error-prone.) So we need a principled way to assign nonzero probability estimates to N-grams that we have never seen, a process that is often called language model smoothing (we can think of the unobserved N-grams as “holes” in the model that have to be smoothed over). An entire sub-specialty of LM research has looked at this problem and many methods have been proposed, several of which are implemented by the SRILM tools. Here, we will discuss one method in detail, known as Witten-Bell smoothing, which was chosen for two reasons. First, it is relatively simple to explain and implement. Second, unlike some of the more sophisticated methods that make additional assumptions about the training data distribution, this method is very robust. The idea behind Witten-Bell smoothing is to treat the advent of a previously unseen word type as an event in itself, to be counted along with the seen words.
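Before turning to smoothing, the relative-frequency estimates above can be sketched directly from counts. The tiny corpus below is invented for illustration.

```python
from collections import Counter

# Relative-frequency (maximum-likelihood) bigram estimates from a tiny toy corpus.
corpus = [["dog", "bites"], ["dog", "bites"], ["dog", "wags"]]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1      # c(w1), counted as a context occurrence

def p_mle(w2, w1):
    """P(w2 | w1) = c(w1 w2) / c(w1)."""
    return bigram_counts[(w1, w2)] / context_counts[w1]

print(p_mle("bites", "dog"))  # 2/3
print(p_mle("wags", "dog"))   # 1/3
```

As the exercise above suggests, the estimates for a fixed context sum to one; any bigram absent from the corpus, however, gets probability zero, which is exactly the problem smoothing addresses.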
How many times does an “unseen” word occur in the training data? Once for every unique word type, since the first time we encounter it, it counts as a novel word. For unigram (context-independent) probability estimates, this means that

\\[\\hat{P}(w)=\\frac{c(w)} {c(.)+V}\\]

where $c(w)$ is the unigram word count, $c(.)$ is the sum of all word counts (the length of the training text), and $V$ is the count of “first seen” events, i.e., the vocabulary size. The extra term in the denominator lowers all probability estimates compared to the old relative frequencies. For this reason, LM smoothing is often called discounting, i.e., the N-gram probabilities are lowered relative to their relative frequencies. In aggregate, this then leaves some probability mass for the occurrence of new unseen words. For Witten-Bell, the exact amount of the freed-up probability is

\\[P(unseen\\ word)= \\frac {V} {c(.)+V}\\]

This is the unigram case. We generalize this to N-grams of length $k$ by treating the first $k-1$ words as the context for the last word, and counting the number of unique word types that occur in that context:

\\[\\hat{P}(w_k|w_1 \\ldots w_{k-1}) = \\frac {c(w_1\\ldots w_k)} {c(w_1 \\ldots w_{k-1})+V(w_1 \\ldots w_{k-1}\\cdot)}\\]

where $V(w_1 \\ldots w_{k-1}\\cdot)$ means the size of the vocabulary observed in the context (i.e., right after) $w_1 \\ldots w_{k-1}$. Also, the freed-up probability mass now goes to words that were not previously seen in that context.

Back-off in N-gram models

How should we distribute the discounted probability mass for a given context? One possibility is evenly over the entire vocabulary. Say we are looking at the context “white dog”, and in fact the only trigram with that context in the training data is “white dog barked”, occurring twice.
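Here is a quick numeric sketch of the unigram Witten-Bell formula, using toy counts invented for illustration:

```python
from collections import Counter

# Witten-Bell discounting, unigram case: P^(w) = c(w) / (c(.) + V),
# leaving mass V / (c(.) + V) for unseen words. Counts are toy values.
counts = Counter({"the": 3, "dog": 2, "barked": 1})
total = sum(counts.values())   # c(.) = 6
vocab = len(counts)            # V = 3 word types, i.e., 3 "first seen" events

def p_wb(w):
    return counts[w] / (total + vocab)

p_unseen = vocab / (total + vocab)
print(p_wb("the"))   # 3/9
print(p_unseen)      # 3/9, the total mass freed up for unseen words

# The discounted seen mass plus the unseen mass still sums to one:
print(sum(p_wb(w) for w in counts) + p_unseen)
```

The running “white dog” example continues this idea for the trigram case.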
The trigram probability under Witten-Bell becomes

\\[\\hat{P}(barked|white\\ dog) = \\frac{c(white\\ dog\\ barked)}{c(white\\ dog)+V(white\\ dog \\cdot)} = \\frac{2}{2+1}  = \\frac {2}{3}\\]

So we now have a probability of $1/3$ to share among all the other words that might follow “white dog.” Distributing it evenly would ignore the fact that some words are just overall more frequent than others. Therefore, we could distribute the $1/3$ in proportion to the unigram probabilities of the words. However, this would make “white dog the” much more probable than “white dog barks”, since “the” is much more common than “barks”. A better solution is to use a reduced context, in this case just “dog”, to allocate the probability mass. This means we can draw on all occurrences of “dog” to guess what could come next. This method is called back-off, since we are falling back to a shorter (1-word) version of the context when the following word has not been observed in the full (2-word) context. We can write this as

\\[\\hat P_{\\text{bo}}\\left(w_k \\mid w_1 \\ldots w_{k-1}\\right) = \\begin{cases} \\hat P \\left(w_k \\mid w_1 \\ldots w_{k-1}\\right) &amp; \\text{if } c\\left(w_1 \\ldots w_k\\right) &gt; 0 \\\\ \\hat P_{\\text{bo}}\\left(w_k \\mid w_2 \\ldots w_{k-1}\\right) \\alpha\\left(w_1 \\ldots w_{k-1}\\right) &amp; \\text{if } c\\left(w_1 \\ldots w_k\\right) = 0 \\end{cases}\\]

Note that $\\hat P_{bo}$ is the new back-off estimate for all N-grams. If an N-gram has been observed (count &gt; 0, the first branch), it makes direct use of the discounted estimates $\\hat{P}$. If the N-gram is unseen in training, it looks up the estimate recursively for the shortened context (leaving out $w_1$) and then scales it by a factor $\\alpha$, which is a function of the context, chosen so that the estimates for all $w_k$ again sum to one.
($\\alpha$ is the probability mass of the unseen words in context $w_1 \\ldots w_{k-1}$, as discussed earlier, divided by the sum of the same unseen words’ probabilities according to the back-off distribution $\\hat P_{bo} ( \\cdot \\mid w_2 \\ldots w_{k-1} )$.) The $\\alpha$ parameters are called backoff weights, but they are not free parameters of the model. Rather, once the N-gram probabilities $\\hat{P}$ have been determined, the backoff weights are completely determined. Computing them is sometimes called (re-)normalizing the model, since they are chosen just so that all the probability distributions sum to unity.

Likelihood, Entropy, and Perplexity

Given two language models A and B, how can we tell which is better? Intuitively, if model A always gives higher probabilities than B to the words that are found in a test (or evaluation) set, then A is better, since it “wastes” less probability on the words that did not occur in actuality. The total probability of a test set $w_{1}\\ldots w_{n}$ according to the model is

\\[P\\left( w_{1}\\ldots w_{n} \\right) = P\\left( w_{1}| &lt; s &gt; \\right) \\times P\\left( w_{2} \\right|\\ w_{1}) \\times P\\left( w_{3} \\right|\\ w_{2}) \\times \\ldots \\times P( &lt; /s &gt; \\ |\\ w_{n})\\]

(where we revert to the case of a bigram model just to keep the notation simple). We are now talking about a test set containing multiple sentences, so at sentence boundaries we reset the context to the &lt;s&gt; tag. The probabilities get very small very quickly, so it is more practical to carry out this computation with log probabilities:

\\[\\log{\\ P(w_{1}\\ldots w_{n})} = \\log{P(w_{1}| &lt; s &gt; )} + \\log{P\\left( w_{2} \\right|\\ w_{1})} + \\ldots + \\log{P( &lt; /s &gt; \\ |\\ w_{n})}\\]

Viewed as a function of the model, this is sometimes called the log likelihood of the model on the test data. Log likelihoods are always negative because the probabilities are less than one.
If we flip the sign and take the average over all the words in the test data,

\\[- \\frac{1}{n}\\log{P(w_{1}\\ldots w_{n})},\\]

we get a metric called entropy, which is a measure of the information rate of the word stream. The entropy gives us the average number of bits required to encode the word stream using a code based on the model probabilities (more probable words are encoded with fewer bits, to minimize the overall bit rate). Yet another metric for model quality (relative to some test data) is the average reciprocal of the word probability, or perplexity. So, if words on average are assigned probability 1/100, then the perplexity would be 100. In other words, the perplexity is the size of a vocabulary of equally probable words that produces the same uncertainty about what comes next as the actual model in question (justifying the term “perplexity”). How to compute perplexity: because probabilities are combined by multiplication, not addition, we must use the geometric average, i.e., the n-th root of the reciprocal of the total probability:

\\[\\sqrt[n]{\\frac{1}{P(w_{1}\\ldots w_{n})}} = {P(w_{1}\\ldots w_{n})}^{- \\frac{1}{n}}\\]

This means that perplexity is just the anti-logarithm (exponential) of the entropy, so all the above metrics are equivalent and related as follows:

HIGH likelihood $\\leftrightarrow$ LOW entropy $\\leftrightarrow$ LOW perplexity $\\leftrightarrow$ GOOD model
LOW likelihood $\\leftrightarrow$ HIGH entropy $\\leftrightarrow$ HIGH perplexity $\\leftrightarrow$ BAD model

Remember to evaluate model quality (by likelihood, entropy, or perplexity) on a test set that is independent of (not part of) the training data, to get an unbiased estimate.

N-gram Pruning

An N-gram language model effectively records the N-grams in the training data, since each such N-gram yields a probability estimate that becomes a parameter in the model. This has the disadvantage that model size grows almost linearly with the amount of training data.
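The equivalence of log likelihood, entropy, and perplexity described above can be checked with a few lines of Python. The per-word log probabilities below are made-up values for a toy word stream.

```python
import math

# Checking the relationships among log likelihood, entropy, and perplexity
# on a toy word stream (the per-word log10 probabilities are made up).
log10_probs = [-2.0, -1.5, -2.5, -2.0]

n = len(log10_probs)
logprob = sum(log10_probs)                      # total log likelihood: -8.0
entropy_bits = -(logprob / n) / math.log10(2)   # average bits per word
perplexity = 10 ** (-logprob / n)               # antilog of the average

print(logprob)      # -8.0
print(perplexity)   # 10**2.0 = 100.0
```

A higher log likelihood makes the average exponent smaller, which lowers both the entropy and the perplexity, as the table of equivalences states.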
It would be good to eliminate parameters that are redundant, i.e., where the back-off mechanism gives essentially the same result after removing a higher-order N-gram parameter. In a related task, we may want to shrink a model down to a certain size (for practical reasons), such that the least important parameters are removed to save space. We can use the notion of entropy (or perplexity) to perform these tasks in a principled way. For each N-gram probability in the model, we can compute the change in entropy (perplexity) its removal entails, and if the difference is below some threshold, eliminate, or prune, the parameter. After pruning probabilities, the model needs to be renormalized (back-off weights recomputed). To make this algorithm practical, we don’t need or want to use a separate test set to estimate entropy. Instead, we can use the entropy of the distribution embodied by the model itself. This leads to a succinct and efficient pruning criterion that uses only the information contained in the model.

Interpolating Probabilities

Assume you have two existing language models already trained, producing probability estimates $\\hat{P}_1$ and $\\hat{P}_2$, respectively. How can we combine these models for a better estimate of N-gram probabilities? If available, we could retrieve the training data for these models, pool it, and train a new model from the combined data. However, this is inconvenient and raises new problems. What can be done if one model has vastly more training data than the other? For example, the large model might be trained on newswire text, and the small model might be trained on a small amount of data collected for a new application. The large, mismatched corpus would completely swamp the N-gram statistics, and the resulting model would be mismatched to the intended application. A better approach is to combine the existing models at the probability level by interpolating their estimates.
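The pruning idea can be sketched with toy numbers. This is a deliberate simplification, not SRILM's exact entropy criterion: it weights the log-probability change from falling back to the lower-order estimate by how often the N-gram is expected to be used, and prunes when that weighted change is below a threshold. All values below are made up.

```python
import math

# Simplified entropy-based pruning sketch (NOT the exact SRILM criterion;
# all probabilities below are toy numbers).
trigram = {("white", "dog", "barked"): 0.667}
backoff_estimate = {("white", "dog", "barked"): 0.660}  # alpha * P(barked | dog)
context_prob = {("white", "dog"): 0.001}                # P(white dog)

threshold = 1e-5
kept = {}
for ngram, p in trigram.items():
    # Weighted change in log probability if we fall back instead of
    # keeping the explicit trigram parameter.
    delta = context_prob[ngram[:2]] * p * abs(math.log(p) - math.log(backoff_estimate[ngram]))
    if delta >= threshold:
        kept[ngram] = p

print(kept)  # the weighted delta is tiny here, so the trigram is pruned
```

Because the back-off estimate is nearly identical to the explicit trigram, and the context itself is rare, the parameter is removed; renormalization (recomputing back-off weights) would follow in a real implementation.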
Interpolation means we compute a weighted average of the two underlying probability estimates:

\\[\\hat{P}( w_k \\mid w_1 \\ldots w_{k-1}) = \\lambda \\hat P_1 ( w_k \\mid w_1 \\ldots w_{k-1}) + (1 - \\lambda) \\hat P_2 ( w_k \\mid w_1 \\ldots w_{k-1})\\]

The parameter $\\lambda$ controls the relative influence of the component models. A value close to 1 means the first model dominates; a value close to 0 gives most of the weight to the second model. The optimal value of $\\lambda$ can itself be estimated using held-out data (i.e., data that is separate from the training data of the component models), by choosing the value that minimizes the perplexity on the held-out data. Model interpolation is easily generalized to more than two models: a weighted combination of $M$ models using weights $\\lambda_{1}, \\lambda_{2}, \\ldots, \\lambda_{M}$, such that $\\lambda_{1} + \\lambda_{2} + \\ldots + \\lambda_{M} = 1$. The condition that the weights sum to 1 is needed to make sure that the interpolated model is again a properly normalized probability distribution over all word strings.

Merging Models

Even though model interpolation is often highly effective in lowering overall perplexity without requiring retraining from the original training data, there is one practical problem: we need to keep multiple models around (on disk, in memory), evaluate each, and combine their probabilities on the fly when evaluating the interpolated model on test data. Fortunately, in the case of backoff N-gram based LMs we can construct a single combined N-gram model that is a very good approximation to the interpolated model defined by the above equation.
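Tuning $\\lambda$ on held-out data can be sketched as a simple grid search that minimizes held-out perplexity. The component-model probabilities below are hypothetical numbers for a four-word held-out stream.

```python
import math

# Grid search for the interpolation weight lambda that minimizes held-out
# perplexity. p1 and p2 are the two component models' conditional word
# probabilities on a toy held-out stream (hypothetical numbers).
p1 = [0.10, 0.02, 0.30, 0.05]
p2 = [0.05, 0.20, 0.10, 0.01]

def perplexity(lam):
    logprob = sum(math.log10(lam * a + (1 - lam) * b) for a, b in zip(p1, p2))
    return 10 ** (-logprob / len(p1))

best = min((perplexity(l / 100), l / 100) for l in range(101))
print(best)  # (lowest held-out perplexity, corresponding lambda)
```

In practice SRILM-style tools estimate the weights with an EM procedure rather than a grid search, but the objective (minimum held-out perplexity) is the same.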
Such a model can be obtained by the following steps:

for k = 1, ..., N:
   for all ngrams w1...wk:
      Insert w1...wk into the new model
      Assign probability P^(wk | w1...wk-1) according to equation (\\*)
   end
   Compute backoff weights for all n-gram contexts of length k-1 in the new model
end

Note that when we compute the $\\hat{P}$ interpolated estimates, one of $\\hat{P}_1$ and $\\hat{P}_2$ (but not both) could be obtained by the back-off mechanism.

Class-based Language Models

In this section, we provide a high-level understanding of a couple of the more advanced techniques in language modeling that are now widely used in practice. The methods in language modeling are constantly evolving, and research in this area is very active. By some estimates, the perplexity of state-of-the-art LMs is still two to three times worse than the predictive powers of humans, so we have a long way to go! One of the drawbacks of N-gram models is that all words are treated as completely distinct. Consequently, the model needs to see a word sufficiently many times in the training data to learn the N-grams it typically appears in. This is not how humans use language. We know that the words ‘Tuesday’ and ‘Wednesday’ share many properties, both syntactically and meaning-wise, and seeing N-grams with one word primes us to expect similar ones using the other (seeing ‘store open Tuesday’ makes us expect ‘store open Wednesday’ as a likely N-gram). Word similarity should be exploited to improve generalization in the LM. Class-based language models therefore group (some) words into word classes, and then collect N-gram statistics involving the class labels instead of the words. So if we had defined a class ‘WEEKDAY’ with members ‘Monday’, ‘Tuesday’, …, ‘Friday’, the N-gram ‘store open Tuesday’ would be treated as an instance of ‘store open WEEKDAY’.
The probability of the pure word string according to the LM is now a product of two components: the probability of the string containing the class labels (computed in the usual way, the class labels being part of the N-gram vocabulary), multiplied by the class membership probabilities, such as $P(Tuesday \\mid WEEKDAY)$. The membership probabilities can be estimated from data (how many times ‘Tuesday’ occurs in the training corpus relative to all the weekdays), or set to a uniform distribution (e.g., all equal to 1/5). There are two basic approaches to coming up with good word classes. One involves prior knowledge, typically from an application domain. For example, when building a language model for a travel app, we know that entities such as the names of destinations (‘Tahiti’, ‘Oslo’), airlines, and days of the week will need to be covered regardless of training data coverage, even though the training data is unlikely to have usage samples of all the possible instantiations. We can ensure coverage by defining an N-gram model in terms of classes such as ‘AIRPORT’, ‘AIRLINE’, ‘WEEKDAY’, etc. This also means we can set class membership probabilities from domain statistics, such as the popularity of certain travel destinations. It also suggests generalizing the concept of word-class membership to word phrases, such as ‘Los Angeles’ for AIRPORT. The class N-gram modeling framework can accommodate word phrases, with some modifications. The other way to define word classes is in a purely data-driven way, without human or domain knowledge. We can search the space of all possible word/class mappings and pick one that minimizes the perplexity of the resulting class-based model on the training data (or equivalently, maximizes the likelihood).
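The class-based factorization described above is a simple product, which a short sketch makes concrete. The class N-gram probability below is a made-up number; the membership probability is the uniform 1/5 mentioned above.

```python
# Class-based factorization (toy numbers): the word-string probability is
# the class-level N-gram probability times the membership probability of
# the word within its class.
p_class_ngram = 0.01    # P(WEEKDAY | store open), from the class N-gram model (made up)
p_membership = 1 / 5    # P(Tuesday | WEEKDAY), uniform over the 5 weekdays

p_tuesday = p_class_ngram * p_membership   # P(Tuesday | store open)
print(p_tuesday)  # 0.01 * 0.2 = 0.002
```

Note that every weekday inherits the same class-level statistics, which is exactly the generalization the class model is after.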
The details of this search are nontrivial if it is to be carried out in reasonable time, but several practical algorithms have been proposed.

Neural Network Language Models

Neural network-based machine learning methods have taken over in many areas, including acoustic modeling for speech recognition, as we saw earlier in this course. Similarly, artificial neural networks (ANNs) have also been devised for language modeling and, given sufficient training data, have been shown to give superior performance compared to N-gram methods. Much of the success of ANNs in language modeling stems from overcoming two specific limitations of N-gram models. The first limitation is the lack of generalization across words. We saw how word classes tried to address this problem while introducing a new problem: how to define suitable word classes. The first proposed ANN LM architecture, now known as a feedforward language model, addressed the word generalization problem by including a word embedding layer that maps the discrete word labels at the input to a dense vector space. As depicted in the figure (taken from the Bengio et al. paper), the inputs to the network are unary (one-hot) encodings of the N - 1 words forming the N-gram context. The output is a vector of probabilities of predicted following words. (Both the inputs and outputs are vectors whose length is the vocabulary size.) The model is thus a drop-in replacement for the old N-gram-based LM. The key is that the input words are re-encoded via a shared matrix into new vectors, which are no longer one-hot, i.e., they live in a dense, continuous space. This mapping is shared across all context word positions and, crucially, is trained jointly with the next-word predictor. The beauty of this approach is that the learned word embeddings can be tuned to represent word similarity for the purposes of word prediction.
In other words, context words that affect the next word similarly will be encoded as nearby points in space, and the network can then exploit this similarity when encountering the words in novel combinations. This is because all network layers perform smooth mappings, i.e., nearby inputs will generate similar outputs. (It has been shown that words like ‘Tuesday’ and ‘Wednesday’ do indeed end up with similar embeddings.) The second limitation of N-grams that was overcome with ANN methods is the truncation of the context, which so far was always limited to the previous $N - 1$ words. This is a problem because language allows embedded clauses, arbitrarily long lists of adjectives, and other constructs that can put arbitrary distance between related words that would be useful in next-word prediction. Any reasonable value of $N$ would be insufficient to capture all predictive words in a context. This limitation is overcome in recurrent networks, which feed the activations of a hidden layer at time $t - 1$ as extra inputs to the next processing step at time $t$, as shown in this figure:

This allows the network to pass information from one word position to the next, repeatedly, without a hard limit on how far back in time the information originates that can be used to predict the current next word. There are practical issues with the trainability of such recurrent networks, because the mathematical rules governing ANN activations lead to an exponential dilution of information over time. However, these problems can be solved with mechanisms that gate the information flow from one time step to the next. Both feed-forward and recurrent network LMs have also benefited from general improvements in ANN technology, such as deeper stacking of network layers (‘deep learning’) and better training methods. Another trend in neural LMs is to base the model on characters rather than word units.
It is clear that the flexibility that ANNs provide for experimenting with model architectures in terms of high-level information flow, rather than having to worry about the detailed design of encodings and probability distributions, has greatly advanced the field, with more still to come.

Lab

Language Modeling

This lab covers the following topics:

Defining a top-N vocabulary from training data
Computing N-gram counts
Estimating backoff N-grams, understanding the backoff rule
Computing perplexity, OOV rates
Perplexity as a function of training data size
Interpolating two models
Pruning of an N-gram model, perplexity as a function of model size

Introduction and setup

In this lab, we will practice the main techniques for building N-gram based language models for our speech recognizer. In subsequent modules, the resulting LM will be used in the speech recognition decoder, together with the acoustic model from the preceding lab. This lab is carried out in a Linux command shell environment. To get started, make sure you know how to invoke the Linux bash shell, either on a native Linux system, using Cygwin on a Windows system, or in the Windows Subsystem for Linux. Bash is the default shell on most systems. Inside bash, change into the M4_Language_Modeling directory:

cd M4_Language_Modeling

We will be using pre-built executables from the SRI Language Modeling toolkit (SRILM). Start by adding the SRILM binary directories to your search path. If you are in a Cygwin environment, use

PATH=$PWD/srilm/bin/cygwin64:$PWD/srilm/bin:$PATH

In Linux or the Windows Subsystem for Linux, use

PATH=$PWD/srilm/bin/i686-m64:$PWD/srilm/bin:$PATH

You can put this command in the .bashrc file in your home directory, so it is run automatically the next time you invoke the shell. Also, make sure you have the gawk utility installed on your system. As a check that all is set up, run

ngram-count -write-vocab - &lt; /dev/null
compute-oov-rate &lt; /dev/null

which should each output a few lines of text without error messages.
It will be helpful to also install the optional wget command. Since language modeling involves a fair amount of text processing, it will be useful to have some familiarity with Linux text utilities such as sort, head, wc, sed, gawk or perl, and others, as well as Linux mechanisms for redirecting command standard input/output and pipelining several commands. We will show you commands that you can copy into the shell to follow along, like this:

command argument1 argument2 ...

but we encourage you to try your own solutions to achieve the stated goals of each exercise, and to explore variants.

Preparing the data

We will be using the transcripts of the acoustic development and test data as our dev and test sets for language modeling.

TASK: Locate the files in the ‘data’ subdirectory and count the number of lines and words in them.

SOLUTION:

ls data
wc -wl data/dev.txt data/test.txt

TASK: View the contents of these files, using your favorite pager, editor, or other tool. What do you notice about the format of these files? How do they differ from text you are used to?

SOLUTION:

head data/*.txt

You will notice that the data is in all-lowercase, without any punctuation. This is because we will model sequences of words only, devoid of textual layout, similar to how one would read or speak them. The spelling has to match the way words are represented in the acoustic model. The process of mapping text to the standard form adopted for modeling purposes is called text normalization (or TN for short), and typically involves stripping punctuation, mapping case, fixing typos, and standardizing spellings of words (like MR. versus MISTER). This step can consume considerable time and often relies on powerful text processing tools like sed or perl. Because it is so dependent on the source of the data, domain conventions, and tool knowledge, we will not elaborate on it here.
Instead, we will download an LM training corpus that has already been normalized:

wget http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz

If your system doesn’t have the wget command, you can download this file in a browser and move it into the LM lab directory.

TASK: Inspect the file and count lines and word tokens. How does the text normalization of this file differ from our test data?

SOLUTION: The file is compressed in the gzip (.gz) format, so we must use the gunzip tool:

gunzip -c librispeech-lm-norm.txt.gz | head
gunzip -c librispeech-lm-norm.txt.gz | wc -wl

The second command can take a while as the file is large. You will notice that this file is text normalized but uses all-uppercase instead of all-lowercase. Language model training data, and language models themselves, are often quite large but compress well since they contain text. Therefore, we like to keep them in compressed form. The SRILM tools know how to read/write .gz files, and it is easy to combine gzip/gunzip with Linux text processing tools.

OPTIONAL TASK: Download the raw training data at http://www.openslr.org/12/librispeech-lm-corpus.tgz, and compare it to the normalized text. How would you perform TN for this data?

SOLUTION: left to the reader!

Defining a vocabulary

The first step in building an LM is to define the set of words that it should model. We want to cover the largest possible share of the word tokens with the smallest set of words, to keep model size to a minimum. That suggests picking the words that are most frequent in the training data. One of the functions of the ngram-count tool is to count word and ngram occurrences in a text file:

ngram-count -text TEXT -order 1 -write COUNTS -tolower

will count 1-grams (i.e., words) and write the counts to a file. The final option above maps all text to lowercase, thus dealing with the case mismatch we have between our training and test data.

TASK: Extract the list of the 10,000 most frequent word types in the training data.
What kinds of words do you expect to be at the top of the list? Check your intuition.

HINT: Check out the Linux sort, head, and cut commands.

SOLUTION:

ngram-count -text librispeech-lm-norm.txt.gz -order 1 -write librispeech.1grams -tolower
sort -k 2,2 -n -r librispeech.1grams | head -10000 &gt; librispeech.top10k.1grams
cut -f 1 librispeech.top10k.1grams | sort &gt; librispeech.top10k.vocab

The intermediate file librispeech.top10k.1grams contains the words and their counts, sorted most frequent first. As you might expect, common function words like “the”, “and”, “of” appear at the top of the list. Near the top we also find two special tags, &lt;s&gt; and &lt;/s&gt;. These are added by ngram-count to mark the start and end, respectively, of each sentence. Their count equals the number of non-empty lines in the training data, since it is assumed that each line contains one sentence (empty lines are ignored). We now want to find out how well our 10k vocabulary covers the test data. We could again use Linux tools for that, but SRILM contains a handy script compute-oov-rate that takes two arguments: the list of vocabulary words and a unigram count file.

TASK: What is the rate of out-of-vocabulary (OOV) words on the training, dev and test sets?

HINT: Use the same method as before to generate the unigrams for the dev and test data.

SOLUTION:

compute-oov-rate librispeech.top10k.vocab librispeech.1grams
ngram-count -text data/dev.txt -order 1 -write dev.1grams
compute-oov-rate librispeech.top10k.vocab dev.1grams
ngram-count -text data/test.txt -order 1 -write test.1grams
compute-oov-rate librispeech.top10k.vocab test.1grams

Usually we would expect the OOV rate to be lowest on the training set, because we used it to select the words (the vocabulary is biased toward the training set), but in this case the test sets have been chosen to be “cleaner” and have lower OOV rates.
(The training data actually contains some languages other than English, though most of those words will not make it into the vocabulary.) Note that compute-oov-rate also reports “OOV types”: the number of unique words that are missing from the vocabulary, regardless of how many times they occur. The OOV rate of around 5% is quite high – remember that we will never be able to recognize those OOV words, since the LM does not include them (they effectively have probability zero). However, we chose the relatively small vocabulary size of 10k to speed up experiments with the decoder later.

OPTIONAL TASK: Repeat the steps above for different vocabulary sizes (5k, 20k, 50k, 100k, 200k). Plot the OOV rate as a function of vocabulary size. What shape do you see?

Training a model

We are now ready to build a language model from the training data and the chosen vocabulary. This is also done using the ngram-count command. For instructional purposes we will do this in two steps: compute the N-gram statistics (counts), and then estimate the model parameters. (ngram-count can do both in one step, but that’s not helpful for understanding what happens under the hood.)

TASK: Generate a file containing counts of all trigrams in the training data. Inspect the resulting file.

HINT: Consult the ngram-count man page and look up the options -order, -text, and -write. Remember the case mismatch issue.

SOLUTION: The first command uses about 10GB of memory and takes 15 minutes on a 2.4GHz Intel Xeon E5 CPU, so be sure to procure a sufficiently equipped machine and some patience.

ngram-count -text librispeech-lm-norm.txt.gz -tolower -order 3 -write librispeech.3grams.gz
gunzip -c librispeech.3grams.gz | less

Note that we want to compress the output file since it is large. The -order option in this case is strictly speaking optional, since order 3 is the default setting. Note that the output is grouped by common prefixes of N-grams, but that the words themselves are not alphabetically sorted.
You can use the -sort option to achieve the latter. Now we can build the LM itself. (Modify the output file names from previous steps according to your own choices.)

TASK: Estimate a backoff trigram LM from librispeech.3grams.gz, using the Witten-Bell smoothing method.

HINT: Consult the ngram-count man page for the options -read, -lm, -vocab, and -wbdiscount.

SOLUTION:

ngram-count -debug 1 -order 3 -vocab librispeech.top10k.vocab -read librispeech.3grams.gz -wbdiscount -lm librispeech.3bo.gz

We added the -debug 1 option to output a bit of information about the estimation and the resulting LM, in particular the number of N-grams output. We will now try to understand the way LM parameters are stored in the model file. Peruse the file using

gunzip -c librispeech.3bo.gz | less

or, if you prefer, gunzip the entire file using

gunzip librispeech.3bo.gz

and open librispeech.3bo in an editor. Note: the editor had better be able to handle very large files – the LM file has a size of 1.3 GB.

Model evaluation

Consult the description of the backoff LM file format, ngram-format(5), and compare it to what you see in our model file, to be used in the next task.

TASK: Given the sentence “a model was born”, what is the conditional probability of “born”?

SOLUTION: The model is a trigram model, so the longest N-gram that would yield a probability to predict “born” would be “model was born”. So let’s check the model for that trigram. (One way to locate information in the model file is the zgrep command, which searches a compressed file for text strings. Each search string below starts with a TAB character to avoid spurious matches against other words that contain the search string as a suffix. You can use your favorite tools to perform these searches.)

zgrep \" model was born\" librispeech.3bo.gz

This outputs nothing, meaning that the trigram is not found in the model, and we have to use the back-off mechanism.
We look for the line that contains the context bigram “model was” preceded by a whitespace character:

zgrep -E \"\\smodel was\" librispeech.3bo.gz | head -1
-2.001953 model was 0.02913048

The first number is the log probability $P(was \\vert model)$, which is of no use to us here. The number at the end is the backoff weight associated with the context “model was”. It, too, is encoded as a base-10 logarithm. Next, we need to find the bigram probability we’re backing off to, i.e., $P(born \\vert was)$:

zgrep -E \"\\swas born\" librispeech.3bo.gz | head -1
-2.597636 was born -0.4911189

The first number is the bigram probability $P(born \\vert was)$. We can now compute the log probability for $P(born \\vert model\\ was)$ as the sum of the backoff weight and the bigram probability: 0.02913048 + -2.597636 = -2.568506, or as a linear probability $10^{-2.568506} = 0.002700813$.

TASK: Compute the total sentence probability of “a model was born” using the ngram -ppl function. Verify that the conditional probability for “born” is as computed above.

SOLUTION: We feed the input sentence to the ngram command as a line of standard input, i.e., using “-” as the filename argument to -ppl. Use the option -debug 2 to get a detailed breakdown of the sentence-level probability:

echo \"a model was born\" | ngram -debug 2 -lm librispeech.3bo.gz -ppl -
a model was born
	p( a | &lt;s&gt; ) = [2gram] 0.01653415 [ -1.781618 ]
	p( model | a ...) = [3gram] 0.0001548981 [ -3.809954 ]
	p( was | model ...) = [3gram] 0.002774693 [ -2.556785 ]
	p( born | was ...) = [2gram] 0.002700813 [ -2.568506 ]
	p( &lt;/s&gt; | born ...) = [3gram] 0.1352684 [ -0.8688038 ]
1 sentences, 4 words, 0 OOVs
0 zeroprobs, logprob= -11.58567 ppl= 207.555 ppl1= 787.8011

Notice how ngram adds the sentence start and end tags, &lt;s&gt; and &lt;/s&gt;. The final line gives both the log probability and the perplexity of the entire sentence. The line starting $p(born \\vert was~ \\dots)$ has the conditional word probability that we computed previously.
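We can double-check the hand computations with a few lines of Python, using the numbers read from the model file and from the ngram output:

```python
# Double-check the back-off computation: log10 P(born | model was) is the
# backoff weight of "model was" plus log10 P(born | was), both read from
# the model file.
bow_model_was = 0.02913048          # log10 backoff weight of "model was"
log_p_born_given_was = -2.597636    # log10 P(born | was)

log_p = bow_model_was + log_p_born_given_was
print(log_p)        # about -2.568506
print(10 ** log_p)  # about 0.0027008, matching p( born | was ...) in the output

# The reported perplexity follows from the total logprob over 5 tokens
# (4 words plus the end-of-sentence tag):
ppl = 10 ** (-(-11.58567) / 5)
print(ppl)          # about 207.555
```

Both values agree with what ngram -ppl reports, confirming the back-off arithmetic.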
The label “2gram” indicates that a backoff to bigram was used. The final “logprob” value -11.58567 is just the sum of the log probabilities printed for each word token. Let’s verify the perplexity value based on its definition: we divide the logprob by the number of word tokens (including the end-of-sentence), convert to a probability and take the reciprocal (by negating the exponent): 10^(-(-11.58567 / 5)) = 207.555. Of course this is not a good estimate of perplexity as it is based on only 5 data points.TASK: Compute the perplexity of the model over the entire dev set.SOLUTION: The exact same invocation of ngram can be used, except we use the file containing the dev set as ppl input. We also omit the -debug option to avoid voluminous output. Note: these commands take a few seconds to run, only because loading the large LM file into memory takes some time; the model evaluation itself is virtually instantaneous.ngram -lm librispeech.3bo.gz -ppl data/dev.txt file dev.txt: 466 sentences, 10841 words, 625 OOVs 0 zeroprobs, logprob= -21939 ppl= 113.1955 ppl1= 140.4475 We thus have a perplexity of about 113. The first line of summary statistics also gives the number of out-of-vocabulary words (which don’t count toward the perplexity, since they get probability zero). In this case the OOV rate is 625/10841 = 5.8%.Running the same command on the test set (data/test.txt) yields a perplexity of 101 and an OOV rate of 4.9%. Both statistics indicate that the test portion of the data is a slightly better match to our model than the development set.TASK: Vary the size of the training data and observe the effect this has on model size and quality (perplexity).HINT: The original librispeech-lm-norm.txt.gz has about 40 million lines. Use gunzip and the head command to prepare training data that is 1/2, 1/4, …, of the full size. 
(This is very easy, but can you think of better ways to pare down the data?)SOLUTION: Rebuild the model (using the original vocabulary), and evaluate perplexity for different amounts of data. Plot model size (number of ngrams in the header of the model file) and perplexity as a function of training data size. Details left to the student, using the steps discussed earlier.Model adaptationWe will now work through the steps involved in adapting an existing LM to a new application domain. In this scenario we typically have a small amount of training data for the new, target domain, but a large amount of data, albeit mismatched, from other sources. For this exercise we target the AMI domain of multi-person meetings as our target domain. The language in this domain consists of spontaneous utterances from face-to-face interactions, whereas the “librispeech” data we used so far consisted of read books, a dramatic mismatch in speaking styles and topics.We will use the “librispeech” corpus as our out-of-domain data, and adapt the model we just created from that corpus to the AMI domain, using a small target-domain corpus. Corpus subsets for training and test are in the data directory:wc -wl data/ami-*.txt 2314 26473 data/ami-dev.txt 2096 20613 data/ami-test.txt 86685 924896 data/ami-train.txt Also provided is a target-domain vocabulary of all words occurring at least 3 times in the training data, comprising 6271 words:wc -l data/ami-train.min3.vocab 6271 data/ami-train.min3.vocab TASK: Build the same kind of Witten-Bell-smoothed trigram model as before, using the provided AMI training data and vocabulary. 
Evaluate its perplexity on the AMI dev data.SOLUTION:ngram-count -text data/ami-train.txt -tolower -order 3 -write ami.3grams.gzngram-count -debug 1 -order 3 -vocab data/ami-train.min3.vocab -read ami.3grams.gz -wbdiscount -lm ami.3bo.gzngram -lm ami.3bo.gz -ppl data/ami-dev.txt file data/ami-dev.txt: 2314 sentences, 26473 words, 1264 OOVs 0 zeroprobs, logprob= -55254.39 ppl= 101.7587 ppl1= 155.5435 TASK: Evaluate the previously built librispeech model on the AMI dev set.SOLUTION: Again, this takes a few seconds due to the loading time of the large model.ngram -lm librispeech.3bo.gz -ppl data/ami-dev.txt file data/ami-dev.txt: 2314 sentences, 26473 words, 3790 OOVs 0 zeroprobs, logprob= -56364.05 ppl= 179.8177 ppl1= 305.3926 Note how both the perplexity and the OOV count are substantially higher for this large model than for the much smaller, but well-matched AMI language model. If we modified the vocabulary of the old model to match the new domain its perplexity would increase further. (Can you explain why?)We will now adapt the old model by interpolating it with the small AMI LM. As explained in the course materials, model interpolation means that all N-gram probabilities are replaced by weighted averages of the probabilities from the two input models. So we need to specify the relative weights of the two existing models, which must sum to 1. 
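Before running the interpolation, it helps to see what the weighted average actually computes. The toy Python sketch below interpolates two made-up unigram tables; the words and probabilities are hypothetical, chosen only to illustrate the averaging and the vocabulary union. The real work on backoff models is done by ngram -mix-lm.

```python
def interpolate(p_in, p_out, lam):
    # Weighted average over the union of both vocabularies; a word absent
    # from one model contributes probability 0.0 from that side.
    vocab = set(p_in) | set(p_out)
    return {w: lam * p_in.get(w, 0.0) + (1 - lam) * p_out.get(w, 0.0)
            for w in vocab}

# Hypothetical unigram probabilities, for illustration only.
p_ami = {'okay': 0.020, 'uh': 0.050}         # in-domain (AMI-style) model
p_libri = {'okay': 0.001, 'chapter': 0.010}  # out-of-domain (read-books) model

mixed = interpolate(p_ami, p_libri, lam=0.8)
# mixed['okay']    = 0.8 * 0.020 + 0.2 * 0.001 = 0.0162
# mixed['chapter'] = 0.8 * 0.0   + 0.2 * 0.010 = 0.0020
```

Note how “chapter”, missing from the in-domain table, still receives a nonzero probability: interpolation enlarges the effective vocabulary as well as averaging the probabilities.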
A good rule of thumb is to give the majority of weight (0.8 or 0.9) to the in-domain model, leaving a small residual weight (0.2 or 0.1) to the out-of-domain model.TASK: Construct an interpolated model based on the existing librispeech and AMI models, giving weight 0.8 to the AMI model, and evaluate it on the dev set.HINT: Make use of the ngram options -mix-lm, -lambda, and -write-lm.ngram -debug 1 -order 3 -lm ami.3bo.gz -lambda 0.8 -mix-lm librispeech.3bo.gz -write-lm ami+librispeech.3bo.gzngram -lm ami+librispeech.3bo.gz -ppl data/ami-dev.txt file ami-dev.txt: 2314 sentences, 26473 words, 783 OOVs 0 zeroprobs, logprob= -56313.77 ppl= 102.546 ppl1= 155.6145 At first sight, this result is disappointing. Note how the perplexity is now 102, slightly up from the value that the AMI-only model produced. But also note how the number of OOVs was almost halved (from 1264 to 783), due to the addition of words covered by the out-of-domain model that were not in the AMI model. The model now has more words to choose from when making its predictions. An important lesson from this exercise is that we can only compare perplexity values when the underlying vocabularies are the same. That aside, the enlarged vocabulary is a good thing, as it reduces OOVs. The interpolation step has effectively adapted not only the model probabilities, but the vocabulary as well.Still, it would be nice to do an apples-to-apples comparison to see the effect of just the probability interpolation on model perplexity. We can do this by telling the ngram tool to only use words from the AMI vocabulary in the interpolated model:ngram -debug 1 -order 3 -lm ami.3bo.gz -lambda 0.8 -mix-lm librispeech.3bo.gz -write-lm ami+librispeech.3bo.gz -vocab data/ami-train.min3.vocab -limit-vocabThis is the same command as before, but with the -limit-vocab option added, telling ngram to only use the vocabulary specified by the -vocab option argument. 
We can now evaluate perplexity again:file ami-dev.txt: 2314 sentences, 26473 words, 1264 OOVs 0 zeroprobs, logprob= -53856.04 ppl= 90.52426 ppl1= 136.8931 The number of OOVs is now back to the same as with ami.3bo.gz, but perplexity is reduced from 102 to 90.TASK (optional): Repeat this process for different interpolation weights, and see if you can reduce perplexity further. Check results on both AMI dev and test sets.This step is best carried out using the enlarged vocabulary, since that is what we want to use in our final model. But notice how we are now effectively using the dev set to train another model parameter, the interpolation weight. The result will thus be tuned to the dev set. This is why it is better to have another test set held out (data/ami-test.txt in this case) to verify that the result of this tuning also improves the model (lowers the perplexity) on independent data.The tuning of interpolation weights would be rather tedious if carried out by trial and error. Fortunately, there is an efficient algorithm that finds optimal weights based on expectation maximization, implemented in the command compute-best-mix, described under ppl-scripts.TASK (optional): Use compute-best-mix to find the best -lambda value for interpolation for the two models we built.HINT: As input to the command, generate detailed perplexity output for both models, using ngram -debug 2 -ppl data/ami-dev.txt.Model pruningWe saw earlier that model size (and perplexity) varies with the amount of training data. However, if a model grows too big for deployment as the data size increases, it would be a shame to abandon it for that reason alone. A better approach is to train a model on all available data, and then eliminate parameters that are redundant or have little effect on model performance. This is what model pruning does.A widely used algorithm for model pruning based on entropy is implemented in the ngram tool. 
The option -prune takes a small value, such as 10^-8 or 10^-9, and removes all ngrams whose removal (by itself) raises the perplexity of the model by less than that value in relative terms.TASK: Shrink the large librispeech model trained earlier, using pruning values between 10^-5 and 10^-10 (stepping by powers of ten). Observe/plot the resulting model sizes and perplexities, and compare to the original model.SOLUTION: Starting with 1e-5 (= 10^-5 in floating-point notation), create the pruned model:ngram -debug 1 -lm librispeech.3bo.gz -prune 1e-5 -write-lm librispeech-pruned.3bo.gzThen evaluate the librispeech-pruned.3bo.gz model on the entire dev set, as before. The -debug option lets the tool output the number of ngrams pruned and written out. Add the resulting numbers of bigrams and trigrams to characterize the pruned model size (roughly, the number of model parameters, since the number of unigrams is fixed to the vocabulary, and the backoff weights are determined by the probability parameters).Next",
      "categories": [],
      "tags": [],
      "date": ""
    },
  
    {
      "title": "M6: End-to-End Models",
      "url": "/speech-recognition-course/M6_End_to_End_Models/",
      "excerpt": "Module 6: End-to-End Models Table of Contents End-to-End Models Improved Objective Functions Sequential Objective Function Connectionist Temporal Classification Sequence Discriminative Objective Functions Grapheme or Word Labels Encoder-Decoder Networks Improved Objective Functions Recall from Module 3 that the most common objective function used for training neural networks for classification tasks...",
      "content": "Module 6: End-to-End ModelsTable of Contents  End-to-End Models  Improved Objective Functions  Sequential Objective Function  Connectionist Temporal Classification  Sequence Discriminative Objective Functions  Grapheme or Word Labels  Encoder-Decoder NetworksImproved Objective FunctionsRecall from Module 3 that the most common objective function used for training neural networks for classification tasks is frame-based cross entropy. With this objective function, a single one-hot label $z\\left\\lbrack t \\right\\rbrack$ is specified for every input frame of data t, and compared with the softmax output of the acoustic model.If we define\\[z\\lbrack i,t\\rbrack = \\left\\lbrace \\begin{matrix} 1 &amp; z\\lbrack t\\rbrack = i, \\\\ 0 &amp; \\text{otherwise} \\end{matrix} \\right.\\]then the cross-entropy against the softmax network output $y[i,t]$ is as follows.\\[L = - \\sum_{t = 1}^{T} \\sum_{i = 1}^{M} z\\left\\lbrack i,t \\right\\rbrack\\log\\left( y\\left\\lbrack i,t \\right\\rbrack \\right)\\]Using a frame-based cross entropy objective function implies three things that are untrue for the acoustic modeling task.That every frame of acoustic data has exactly one correct label.That the correct label can be predicted independently of the other frames.That all frames of data are equally important.This module explores some alternative strategies that address these modeling deficiencies.Sequential Objective FunctionAcoustic modeling is essentially a sequential task. Given a sequence of acoustic feature vectors, the task is to output a sequence of words. If a model can do that well, the exact alignment from the feature vectors to acoustic labels is irrelevant. Sequential objective functions train models that produce the correct sequence of labels, without regard to their relative alignment with the acoustic signal. 
Note that this is a separate feature from the sequential discriminative objective functions, such as maximum mutual information (MMI), discussed in Module 3.Sequential objective functions allow the training labels to drift in time. As the model converges, it finds a segmentation that explains the labels and obeys the constraint that the ground-truth label sequence is unchanged.Whereas the frame-based cross entropy objective function requires a sequence of labels $z\\lbrack t\\rbrack$ that is the same length as the acoustic feature vector sequence, sequential objective functions specify a sequence of symbols $S=\\lbrace s_{0},s_{1},\\ldots,\\ s_{K - 1} \\rbrace$ for each utterance. An alignment from the T acoustic features to the K symbols is denoted by $\\pi\\left\\lbrack t \\right\\rbrack$. The label for time $t$ is found in the entry of $S$ indexed by $\\pi[t]$.The objective function can be improved by moving from a frame-based to a segment-based formulation. Whereas frame-based cross entropy assigns labels to frames, sequence-based cross entropy specifies the sequence of labels to assign, but is agnostic about which frames are assigned to the labels. Essentially, it allows the labels to drift in time. As the model trains, it finds a segmentation that is easy to model, explains the data, and obeys the constraint that the ground-truth sequence of labels should be unchanged. Instead of using the alignment $z\\left\\lbrack i,t \\right\\rbrack$, we produce a pseudo-alignment $\\gamma\\left\\lbrack i,t \\right\\rbrack$ that has the labels in the same order as the reference, but is also a function of the current network output. It can be either a soft alignment or a hard alignment. 
It is easily computed by turning the alignment sequence into an HMM and using standard Viterbi (hard alignment) or forward-backward (soft alignment) algorithms. The probability of the label sequence under a single alignment $\\pi$ is\\[P\\left( S \\middle| \\pi \\right)P\\left( \\pi \\right) = P\\left( \\pi \\right)\\prod_{t}^{}{y\\left\\lbrack \\pi\\left( t \\right),t \\right\\rbrack}\\]and training minimizes the cross entropy against the pseudo-alignment:\\[L = - \\sum_{t}^{}{\\sum_{i}^{}{\\gamma\\left\\lbrack i,t \\right\\rbrack\\log\\left( y\\left\\lbrack i,t \\right\\rbrack \\right)}}\\]Let $\\overset{\\overline{}}{z}\\left\\lbrack k \\right\\rbrack$ represent the $K$ symbols in the label sequence, after duplicates have been removed. Define an HMM that represents moving through each of these labels in order. It always begins in state zero, always ends in state $K - 1$, and will emit symbol $\\overset{\\overline{}}{z}\\lbrack k\\rbrack$ in state $k$. The soft alignment is the product of a forward variable $\\alpha$ and a backward variable $\\beta$. The nonzero values are given by:\\[\\gamma\\left\\lbrack \\overset{\\overline{}}{z}\\left\\lbrack k \\right\\rbrack,t \\right\\rbrack = \\alpha\\left\\lbrack k,t \\right\\rbrack\\beta\\lbrack k,t\\rbrack\\]The forward recursion computes the score of state k given the acoustic evidence up to, and including, time $t$. 
Its initial state is the model’s prediction for the score of the first label in the sequence.\\[\\alpha\\lbrack k,0\\rbrack = \\left\\lbrace    \\begin{matrix}     y\\left\\lbrack \\overset{\\overline{}}{z}\\left\\lbrack k \\right\\rbrack,\\ 0 \\right\\rbrack &amp; k = 0, \\\\     0 &amp; \\text{otherwise}    \\end{matrix} \\right.\\]The recursion moves forward in time by first projecting this score through a transition matrix $T$ with elements $t_{\\text{ij}}$, and then applying the model’s score for the labels.\\[\\alpha\\left\\lbrack k,t \\right\\rbrack = y\\left\\lbrack \\overset{\\overline{}}{z}\\left\\lbrack k \\right\\rbrack,t \\right\\rbrack\\sum_{j}^{}{t_{kj}\\alpha\\left\\lbrack j,\\ t - 1 \\right\\rbrack}\\]The transition matrix $T$ simply restricts the model topology to be left-to-right.\\[t_{\\text{ij}} = \\left\\lbrace \\begin{matrix} 1 &amp; i = j, \\\\ 1 &amp; i = j + 1, \\\\ 0 &amp; \\text{otherwise} \\end{matrix} \\right.\\]An example of this forward variable computed on an utterance about 2.6 seconds long, and containing 66 labels, is shown below. Yellow indicates larger values of the forward variable, and purple represents smaller values. Structures depart the main branch, searching possible paths forward in time. Because the alpha computation for a particular time has no information about the future, it is exploring all viable paths with the current information.The backward recursion computes the score of state k given acoustic evidence from the end of the segment back to, but not including, the current time t. 
Its initial state at time T - 1 doesn’t include any acoustic evidence and is simply the final state of the model.\\[\\beta\\left\\lbrack K - 1,T - 1 \\right\\rbrack = 1\\]The recursion applies the appropriate acoustic score from the model, and then projects the state backward in time using the transpose of the transition matrix $T$.\\[\\beta\\left\\lbrack k,t \\right\\rbrack = \\sum_{j}^{}{t_{jk}\\beta\\left\\lbrack j,t + 1 \\right\\rbrack y\\lbrack\\overset{\\overline{}}{z}\\lbrack j\\rbrack,t + 1\\rbrack}\\]When the forward and backward variables are combined into the $\\gamma$ variable, each time slice contains information from the entire utterance, and the branching structures disappear. What is left is a smooth alignment between the label index and time index.Connectionist Temporal ClassificationConnectionist Temporal Classification (CTC) is a special case of sequential objective functions that alleviates some of the modeling burden that exists in cross-entropy. One perceived weakness of the family of cross-entropy objective functions is that it forces the model to explain every frame of input data with a label. CTC modifies the label set to include a “don’t care” or “blank” symbol in the alphabet. The correct path through the labels is scored only by the non-blank symbols. If a frame of data doesn’t provide any information about the overall labeling of the utterance, a cross-entropy based objective function still forces it to make a choice. 
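The forward-backward soft alignment described above is compact enough to sketch in plain Python. This is a toy implementation for intuition only: it works in linear probability space with no scaling, so it would underflow on real utterance lengths, and y[t][i] here denotes the network posterior for label i at frame t (the transpose of the y[i,t] indexing in the equations).

```python
def soft_alignment(y, labels):
    # y: per-frame label posteriors, y[t][i] > 0; labels: the symbols of the
    # sequence (duplicates removed), given as indices into each y[t].
    # Returns gamma[k][t] = alpha[k][t] * beta[k][t], normalized per frame.
    T, K = len(y), len(labels)
    alpha = [[0.0] * T for _ in range(K)]
    beta = [[0.0] * T for _ in range(K)]

    # Forward pass: start in state 0; the left-to-right transition matrix
    # allows only a self-loop or an advance by one state.
    alpha[0][0] = y[0][labels[0]]
    for t in range(1, T):
        for k in range(K):
            prev = alpha[k][t - 1] + (alpha[k - 1][t - 1] if k > 0 else 0.0)
            alpha[k][t] = y[t][labels[k]] * prev

    # Backward pass: paths must end in state K-1; the final-frame
    # initialization carries no acoustic evidence.
    beta[K - 1][T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        for k in range(K):
            acc = beta[k][t + 1] * y[t + 1][labels[k]]
            if k < K - 1:
                acc += beta[k + 1][t + 1] * y[t + 1][labels[k + 1]]
            beta[k][t] = acc

    gamma = [[alpha[k][t] * beta[k][t] for t in range(T)] for k in range(K)]
    for t in range(T):  # normalize each time slice into a distribution
        total = sum(gamma[k][t] for k in range(K))
        for k in range(K):
            gamma[k][t] /= total
    return gamma

# Toy check: 4 frames, 3 labels. All mass sits on the first label at the
# first frame and on the last label at the last frame, as the constraints require.
y = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
gamma = soft_alignment(y, [0, 1, 2])
```

The normalized gamma plays the role of the pseudo-alignment used as the soft training target; replacing the sums in the two recursions with max operations gives the Viterbi hard alignment instead.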
The CTC system can output “blank” to indicate that there isn’t enough information to discriminate among the meaningful labels.\\[L = \\sum_{\\pi}^{}{P\\left( S \\middle| \\pi \\right)P\\left( \\pi \\right)} = \\sum_{\\pi}^{}{P\\left( \\pi \\right)\\prod_{t}^{}{y\\left\\lbrack \\pi\\left( t \\right),t \\right\\rbrack}}\\]Sequence Discriminative Objective FunctionsModule 3 introduced an entirely different set of objective functions, which are also sometimes referred to as sequential objective functions. But they are sequential in a different respect than the one considered so far in this module.In this module, “sequential objective function” means that the objective function only observes the sequence of labels along a path, ignoring the alignment of the labels to the acoustic data. In Module 3, “sequence based objective function” meant that the posterior probability of a path isn’t normalized against all sequences of labels, but only those sequences that are likely given the current model parameters and the decoding constraints.For instance, recall the maximum mutual information objective function:\\[F_{\\text{MMI}} = \\sum_{u}^{}{\\log\\frac{p\\left( X_{u} \\middle| S_{u} \\right)p\\left( W_{u} \\right)}{\\sum_{W'}^{}{p\\left( X_{u} \\middle| S_{W'} \\right)p(W^{'})}}}\\]Maximizing the numerator will increase the likelihood of the correct word sequence, and so will minimizing the denominator. If the denominator were not restricted to valid word sequences, then the objective function would simplify to basic frame-based cross entropy.To minimize confusion, we prefer to refer to objective functions that produce hard or soft alignments during training as sequence training, and those that restrict the set of competitors as discriminative training. If both features are present, this would be sequence discriminative training.Grapheme or Word LabelsDeriving a good senone label set on a new task is labor intensive and requires skill and linguistic information. 
One needs to know the phonetic inventory of the language, a model of coarticulation effects, a pronunciation lexicon, and have access to labeled data to drive the process. Consequently, although senone labels match the acoustic representation of speech, it is not always desirable to use them as acoustic model targets.Graphemes are a simpler alternative that can be used in place of senones. Whereas senones are related to the acoustic realization of the language sounds, graphemes are related to the written form. The table below illustrates possible graphemic and phonemic representations for six common words.      Word      Phonemic Representation      Graphemic Representation      any      EH N IY      A N Y      anything      EH N IY TH IH NG      A N Y T H I N G      king      K IH NG      K I N G      some      S AH M      S O M E      something      S AH M TH IH NG      S O M E T H I N G      thinking      TH IH NG K IH NG      T H I N K I N G      The grapheme set chosen for this example is the 26 letters of the English alphabet. The advantage of this representation is that it doesn’t require any knowledge about how English letters are expressed as English sounds. The disadvantage is that these rules must now be learned by the acoustic model, from data. As a result, graphemic systems tend to produce worse recognition accuracy than their senone equivalents, when trained on the same amount of labeled data.It is possible to improve graphemic system performance somewhat by choosing a more parsimonious set of symbols. We can apply light linguistic knowledge to exploit this effect. In English,      Letter pairs such as “T H” and “N G” are often associated with a single sound. We can replace them with “TH” and “NG” symbols.        The letter “Q” is often followed by “U.” We can introduce the “QU” symbol.        
The apostrophe doesn’t have a pronunciation, but we can have symbols for contraction and plural endings, such as ‘T and ‘S.        Some letter sequences are very rare and occur in only a handful of words, such as the double I at the end of Hawaii. Modeling these sequences by a single symbol alleviates the modeling burden caused by sparse data.  An extreme example of this is to eliminate graphemes altogether, and emit whole-word symbols directly. These types of models typically use recurrent acoustic models with a CTC objective function. Although the systems are simple, they are difficult to train properly, and can suffer from a severe out-of-vocabulary problem. A naively trained system will have a closed set of words that it can recognize, and must be retrained to increase the vocabulary size. Addressing this limitation is an area of active research.A grapheme decoder, analogous to the decoder used in phoneme-based systems, can often improve recognition results. Its decoding network maps from sequences of letters to sequences of words, and a search is performed to determine the path that corresponds to the best combined language model and grapheme model scores.Encoder-Decoder NetworksWhereas most speech recognition systems employ a separate decoding process to assign labels to the given speech frames, an encoder-decoder network uses a neural network to recursively generate its output.Encoder-decoder networks are common in machine translation systems, where the meaning of text in the source language must be transformed to an equivalent meaning in a target language. 
For this task, it is not necessary to maintain word order, and there generally isn’t a one-to-one correspondence between the words in the source and target language.When this concept is applied to speech recognition, the source language is the acoustic realization of the speech, and the target language is its textual representation.Unlike the translation application, the speech recognition task is a monotonic, one-to-one mapping from each spoken word to its written form. As a result, encoder-decoder networks are often modified when used in a speech recognition system.In its basic form, the encoder part of the network summarizes an entire segment as one vector, passing a single vector to the decoder part of the network, which should stimulate it to recursively produce the correct output. Because this sort of long-term memory and summarization is at the limit of what we can achieve with recurrent networks today, the structure is often supplemented with a feature known as an attention mechanism. The attention mechanism is an auxiliary input to each recurrent step of the decoder, where the decoder can essentially query, based on its internal state, some states of the encoder network.The decoder network is trained to recursively emit symbols and update its state, much like an RNN language model. The most likely output given the states of the encoder network is typically found using a beam search algorithm against the tree of possible decoder network states.",
      "categories": [],
      "tags": [],
      "date": ""
    },
  
    {
      "title": "Home",
      "url": "/speech-recognition-course/",
      "excerpt": "Speech Recognition Course Material for learning speech recognition, based on Microsoft teaching material on EdX (changed from CNTK to PyTorch). Learning/teaching materials are given in each module/directory. The comprehensive learning materials cover signal processing, acoustic modeling, language modeling, and modern end-to-end approaches Github Pages: https://bagustris.github.io/speech-recognition-course Repository: https://github.com/bagustris/speech-recognition-course Modules Module 1:...",
      "content": "Speech Recognition CourseMaterial for learning speech recognition, based on Microsoft teaching material on EdX (changed from CNTK to PyTorch). Learning/teaching materials are given in each module/directory. The comprehensive learning materials cover signal processing, acoustic modeling, language modeling, and modern end-to-end approaches.Github Pages:  https://bagustris.github.io/speech-recognition-courseRepository: https://github.com/bagustris/speech-recognition-courseModules  Module 1: Introduction to Speech Recognition  Module 2: Speech Signal Processing  Module 3: Acoustic Modeling  Module 4: Language Modeling  Module 5: Decoding  Module 6: End-to-End ModelsConvert from markdown to pdf with pandoc in each module:pandoc readme.md -o readme.pdfThen, you can inspect the generated PDFs.References:  L. Gillick and S. J. Cox, “Some statistical issues in the comparison of speech recognition algorithms,” in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 1989, vol. 1, pp. 532–535, doi: 10.1109/icassp.1989.266481.  M. Mohri, F. Pereira, and M. Riley, “Speech Recognition with Weighted Finite-State Transducers,” in Springer Handbook on Speech Processing and Speech Communication.  D. S. Pallet, W. M. Fisher, and J. G. Fiscus, “Tools for the analysis of benchmark speech recognition tests,” ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., vol. 1, pp. 97–100, 1990, doi: 10.1109/icassp.1990.115546.  T. Morioka, T. Iwata, T. Hori, and T. Kobayashi, “Multiscale recurrent neural network based language model,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2015, vol. 2015-Janua, pp. 2366–2370.  M. Sundermeyer, R. Schlüter, and H. 
Ney, “LSTM neural networks for language modeling,” in 13th Annual Conference of the International Speech Communication Association 2012, INTERSPEECH 2012, 2012, vol. 1, pp. 194–197, Accessed: Aug. 08, 2020. [Online]. Available: http://www.isca-speech.org/archive.  Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.  P. F. Brown, P. V DeSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai, “Class-Based n-gram Models of Natural Language,” Comput. Linguist., vol. 18, no. 4, pp. 467–480, 1992.  M. Levit, S. Parthasarathy, S. Chang, A. Stolcke, and B. Dumoulin, “Word-phrase-entity language models: Getting more mileage out of N-grams,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2014, pp. 666–670.  X. Shen, Y. Oualil, C. Greenberg, M. Singh, and D. Klakow, “Estimation of gap between current language models and human performance,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017, vol. 2017-Augus, pp. 553–557, doi: 10.21437/Interspeech.2017-729.  A. Stolcke, “Entropy-based Pruning of Backoff Language Models,” pp. 579–588, 2000, doi: 10.3115/1075218.107521.",
      "categories": [],
      "tags": [],
      "date": ""
    },
  
    {
      "title": "Search",
      "url": "/speech-recognition-course/search/",
      "excerpt": "Search the Course Use the search box above to find content throughout this speech recognition course. The search will look through all modules, lessons, and documentation for matching content. Search Features Intelligent Search: Powered by Lunr.js for fast, client-side full-text search Keyboard Navigation: Use ↑/↓ arrow keys to navigate results,...",
      "content": "Search the CourseUse the search box above to find content throughout this speech recognition course. The search will look through all modules, lessons, and documentation for matching content.Search Features  Intelligent Search: Powered by Lunr.js for fast, client-side full-text search  Keyboard Navigation: Use ↑/↓ arrow keys to navigate results, Enter to open  Quick Access: Press / from anywhere on the site to focus the search box  Highlighted Results: Search terms are highlighted in results for easy scanning  Content Types: Results show whether content is from a page or post  Categories &amp; Tags: Additional metadata helps contextualize resultsWhat You Can Search For  Technical terms (e.g., “acoustic model”, “language model”, “speech recognition”)  Module topics (e.g., “signal processing”, “decoding”, “neural networks”)  Code functions (e.g., “HTK”, “feature extraction”, “training”)  Concepts (e.g., “MFCC”, “HMM”, “CTC”, “beam search”)  File formats (e.g., “WAV”, “ARPA”, “MLF”)  Tools and frameworks (e.g., “CNTK”, “Kaldi”, “OpenFST”)  💡 Pro Tip: Start typing in the search box above to find relevant course content instantly. Use specific terms for better results.Browse by ModuleAlternatively, you can browse the course content by module:  Module 1: Introduction - Overview and scoring metrics  Module 2: Speech Signal Processing - Feature extraction and signal processing  Module 3: Acoustic Modeling - HMM and neural network acoustic models  Module 4: Language Modeling - N-gram and neural language models  Module 5: Decoding - Search algorithms and beam search  Module 6: End-to-End Models - Modern neural approaches",
      "categories": [],
      "tags": [],
      "date": ""
    }
  
]
