This blog walks through writing a basic end-to-end Automatic Speech Recognition (ASR) system using TensorFlow. I will go over each component of a minimal neural network and a prefix beam search decoder required to generate a readable transcript from audio.
I've come across a lot of resources on building basic machine learning systems around computer vision and natural language processing tasks, but very few when it comes to speech recognition. This is an attempt to fill that gap and make this field less daunting for beginners.
I will be focusing on the Neural Network, CTC loss, and Decoding part.
You need to convert your audio into a feature matrix to feed it into your neural network. One simple way is to create spectrograms.
This function computes the Short-time Fourier Transform of your audio signal and then computes the power spectrum. The output is a matrix called spectrogram. You can directly use this as your input. Other alternatives to this are filter banks and MFCCs. Audio preprocessing is a whole topic in itself. You can read about it in detail here.
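A minimal sketch of such a function, using SciPy's STFT; the sample rate, frame length, and stride below are illustrative assumptions, not the only valid choices.

```python
import numpy as np
from scipy import signal

def make_spectrogram(audio, sample_rate=16000, frame_ms=20, stride_ms=10):
    """Compute a log power spectrogram of shape (timesteps, features)."""
    nperseg = int(sample_rate * frame_ms / 1000)          # samples per frame
    noverlap = nperseg - int(sample_rate * stride_ms / 1000)
    # Short-time Fourier Transform of the signal
    _, _, stft = signal.stft(audio, fs=sample_rate,
                             nperseg=nperseg, noverlap=noverlap)
    power = np.abs(stft) ** 2                              # power spectrum
    return np.log(power + 1e-10).T                         # log for stability
```

Each row of the returned matrix is the feature vector for one timestep, which is the form the network below expects.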
Here is a simple architecture.
The spectrogram input can be thought of as a vector at each timestamp. A 1D convolutional layer extracts features out of each of these vectors to give you a sequence of feature vectors for the LSTM layer to process. The output of the (Bi)LSTM layer is passed to a Fully Connected layer for each time step which gives a probability distribution of the character at that time step using softmax activation. This network will be trained with CTC (Connectionist Temporal Classification) loss function. Feel free to experiment with more complex models after understanding the entire pipeline.
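The architecture above can be sketched in Keras as follows; the layer sizes and the number of output characters are illustrative assumptions, not tuned values.

```python
import tensorflow as tf

def build_model(num_features=161, num_chars=29):
    """Conv1D -> BiLSTM -> per-timestep softmax, as described above."""
    # input: (batch, timesteps, features) spectrogram; timesteps can vary
    inputs = tf.keras.Input(shape=(None, num_features))
    # 1D convolution extracts features from each window of spectrogram frames
    x = tf.keras.layers.Conv1D(128, 11, strides=2, padding="same",
                               activation="relu")(inputs)
    # BiLSTM processes the sequence of feature vectors
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(x)
    # per-timestep probability distribution over characters (incl. blank)
    outputs = tf.keras.layers.Dense(num_chars, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

The softmax output at each timestep is exactly the per-timestep character distribution that the CTC loss consumes.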
This network attempts to predict the character at each timestep. Our labels, however, are not the characters at each timestep but just the transcription of the audio. Keep in mind that each character in the transcription may stretch across multiple timesteps. The word C-A-T will come across as C-C-C-A-A-T-T if you somehow label each timestep in the audio.
Annotating your audio dataset at every 10ms is not feasible. CTC solves this problem as it does not require us to label every timestep. It takes as input the entire output probability matrix of the above neural network and the corresponding text, ignoring the position and actual offsets of each character in the transcript.
Suppose the ground truth label is CAT and the output matrix has four timesteps. Within these four timesteps, sequences like C-C-A-T, C-A-A-T, C-A-T-T, _-C-A-T, C-A-T-_ all correspond to our ground truth. We calculate the probability of our ground truth by summing up the probabilities of all these sequences. The probability of a single sequence is calculated by multiplying the probabilities of its characters as per the output probability matrix.
For the above sequences, the total probability comes out to be 0.0288 + 0.0144 + 0.0036 + 0.0576 + 0.0012 = 0.1056. The loss is the negative logarithm of this probability. The loss function is already implemented in TensorFlow. You can read the docs here.
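This summing-over-alignments idea can be checked by brute force on a small, hypothetical probability matrix: enumerate every possible character sequence, collapse it, and add up the probabilities of the ones that collapse to the ground truth. The matrix below is made up for illustration, so the total will differ from the 0.1056 above.

```python
import itertools
import numpy as np

def collapse(path, blank="_"):
    """CTC collapse: merge repeated characters, then drop blanks."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

chars = "CAT_"  # character set: C, A, T, and the blank token
# hypothetical 4-timestep output probability matrix (rows sum to 1)
probs = np.array([
    [0.6, 0.1, 0.1, 0.2],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.3, 0.4, 0.2],
    [0.1, 0.1, 0.6, 0.2],
])

# P("CAT") = sum over all length-4 alignments that collapse to "CAT"
total = 0.0
for path in itertools.product(range(len(chars)), repeat=4):
    if collapse("".join(chars[i] for i in path)) == "CAT":
        total += np.prod([probs[t, i] for t, i in enumerate(path)])

loss = -np.log(total)  # CTC loss is the negative log of this probability
```

The real CTC loss computes the same sum efficiently with dynamic programming instead of enumeration.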
The output you get from the above neural network is the CTC matrix, which gives the probability of each character in its character set at each timestep. We use Prefix Beam Search to make meaningful text out of this matrix.
The set of characters in the CTC matrix has two special tokens apart from the alphabet and the space character: the blank token and the end-of-string token.
Purpose of the blank token: Each timestep in the CTC matrix is short (~10 ms), so each character of the spoken sentence stretches across multiple timesteps. For example, C-A-T becomes C-C-C-A-A-T-T. Therefore, we collapse repeated characters in every candidate string decoded from the CTC matrix. What about words like FUNNY, where N is supposed to repeat? A blank token between the two Ns prevents them from collapsing into one without adding anything to the text. So F-F-U-N-[blank]-N-N-Y collapses into FUNNY.
Purpose of the end-of-string token: It denotes the end of the spoken sentence. Decoding at timesteps after the end-of-string token does not add anything to the candidate string.
The best candidate after all the timesteps is the output.
We make two modifications to make this process faster: at each timestep, we only consider characters whose probability is above a small threshold, and we only keep the top k candidate strings (the beam width) when moving to the next timestep.
Go through the code below for implementation details.
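Here is a minimal sketch of prefix beam search with both speedups (character pruning and a fixed beam width) but without a language model or end-of-string handling; the threshold and beam width defaults are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def prefix_beam_search(ctc, alphabet, beam_width=25, blank=0, prune=0.001):
    """Decode a CTC matrix of shape (timesteps, characters) into text.

    ctc[t, c] is the probability of character c at timestep t, and
    alphabet[blank] is the blank token. A bare-bones sketch without
    a language model.
    """
    # each prefix tracks (prob of ending in blank, prob of ending in non-blank)
    beams = {"": (1.0, 0.0)}
    for t in range(ctc.shape[0]):
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for c in np.where(ctc[t] > prune)[0]:  # prune unlikely characters
                p = ctc[t, c]
                if c == blank:
                    # blank extends the prefix without adding a character
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                    continue
                ch = alphabet[c]
                new_prefix = prefix + ch
                nb_b, nb_nb = next_beams[new_prefix]
                if prefix and ch == prefix[-1]:
                    # repeated character: only a path through blank extends it
                    next_beams[new_prefix] = (nb_b, nb_nb + p_b * p)
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, sb_nb + p_nb * p)
                else:
                    next_beams[new_prefix] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # keep only the beam_width most probable prefixes
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: sum(kv[1]), reverse=True)[:beam_width])
    return max(beams.items(), key=lambda kv: sum(kv[1]))[0]
```

A language model would multiply in a score each time a prefix is extended by a word boundary, which is exactly where the `lm` function of the full implementation hooks in.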
This completes a bare-bones speech recognition system. You can introduce a bunch of complications to get better outputs. Bigger networks and audio preprocessing tricks help a lot. Here is the complete code.
1. The code above uses TensorFlow 2.0 and the sample audio file has been taken from the LibriSpeech dataset.
2. You will need to write your own batch generators to train over an audio dataset. These implementation details are not included in the code.
3. You will need to write your own language model function for the decoding part. One of the simplest implementations would be to create a dictionary of bigrams and their probabilities based on some text corpus.
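A toy sketch of such a bigram language model; the corpus, function names, and the fallback probability are hypothetical stand-ins for whatever you build from a real text corpus.

```python
from collections import defaultdict

def build_bigram_lm(corpus):
    """Build a dictionary of word-bigram probabilities from a list of sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.lower().split()
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    lm = {}
    for w1, following in counts.items():
        total = sum(following.values())
        for w2, n in following.items():
            lm[(w1, w2)] = n / total   # P(w2 | w1)
    return lm

def lm_score(lm, text, default=1e-6):
    """Probability of the last word of `text` given the word before it."""
    words = text.lower().split()
    if len(words) < 2:
        return 1.0
    return lm.get((words[-2], words[-1]), default)
```

During decoding, a score like this is multiplied into a candidate's probability whenever a space completes a word, nudging the beam toward plausible word sequences.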
A. Y. Hannun et al., Prefix Search Decoding (2014), arXiv preprint arXiv:1408.2873
 A. Graves et al., CTC Loss (2006), ICML 2006
 L. Borgholt, Prefix Beam Search (2018), Medium