Nigel Stuke
EEN540 - University of Miami - Spring 2008
Under the Instruction of Dr. Scordilis
Project Number 3
Isolated Word Recognition
Abstract:
The purpose of this project is to develop a speech recognition system using dynamic time warping. The system recognizes spoken English digits from zero to nine. Two different feature sets are used, and their outcomes are compared and contrasted.
Overview:
This speech recognition system is speaker dependent, meaning each speaker needs a set of training (reference) data and then a testing sample. This type of system was a predecessor to Hidden Markov Models, which generally yield better results but require more computational overhead. Because it is less computationally complex, this method is still popular for certain uses, such as on cell phones.
The method used is called Dynamic Time Warping (DTW). It gets this name because the training sequences and the current test sequence may differ in length: a person never utters a word with exactly the same duration every time, which is what makes Dynamic Time Warping an appropriate solution. DTW non-linearly warps the test sequence against each of the training sequences and decides which one is the best fit. To do this, a few things must happen. First, the signals are broken into frames. Then a defined set of features is used to determine the "distance" between the test signal and each reference signal. The best match is the reference with the least distance from the beginning point to the end point. Variable begin/end points can be implemented, and many different constraints can be put in effect to limit the valid warping paths.
For this project, MATLAB was used as a simulation environment to implement the algorithm and generate a solution. Test data was provided from 10 sources, although 3 of the 10 provided sources were not utterances of digits, so they were not used. The project guidelines also specify that the spoken utterances should be encoded as linear .wav files using 16-bit depth and a sampling rate of 8 kHz. One of the test sources was provided at 16 bits and 48 kHz; this source was used anyway, and any relevant side effects are noted in the conclusion. In addition to the 7 provided sources that I used, I recorded samples of my own voice. The samples are provided here:
These .wav files need to be loaded into MATLAB; for the purposes of this project, four training sets were used along with one test data set for each speaker. The data extracted from the .wav file sets is used to create the feature sets, which in turn are used to calculate distances. Because loading these files is a tedious process and creating the feature set values takes time, I created a utility that lets you load any number of training sets and save them to a single file in your MATLAB work directory. First, you select all the training data you need, as seen below:
[Figure: dialog for selecting the training data files]
After this, you simply select the filename that the data will be saved under in the work directory. Once it is done loading and saving the training data, the utility automatically goes to the next step, which allows you to select your test data file.
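A rough sketch of what such a utility might look like is shown here; the function name load_training_sets and the helper make_features are hypothetical stand-ins, while wavread and uigetfile are the era-appropriate MATLAB calls:

```matlab
% Illustrative sketch of a loading/saving utility (all names hypothetical).
function load_training_sets(outName)
trainSets = {};
while true
    % Prompt for .wav files; pressing Cancel ends the loop.
    [f, p] = uigetfile('*.wav', 'Select training data', 'MultiSelect', 'on');
    if isequal(f, 0), break; end
    if ischar(f), f = {f}; end                   % single selection arrives as a string
    for k = 1:numel(f)
        [x, fs] = wavread(fullfile(p, f{k}));    % 16-bit linear .wav expected
        trainSets{end+1} = make_features(x, fs); % make_features: hypothetical feature step
    end
end
save(outName, 'trainSets');                      % one .mat file in the work directory
```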
The scripts for this and the feature set data files that were generated are both available below:
Feature Set 1 | Feature Set 2 | Loading Scripts | Training Data

For the training data, samples 1-4 were always used, and sample 5 was always used for the testing data.
The training data can later be loaded back into MATLAB by running the main function for the algorithm with the name of the training data file passed as a parameter. For example:
ASR1_main('nigel2.mat')
would be an appropriate call to load the data back in and continue from that point.
The user is then prompted to load the test data, and the digit judged to be the match is then reported. This output was put manually into a confusion matrix in order to analyze the results. The rest of the files used can be found here:
Synthesis:
After the .wav files are loaded, a feature set needs to be generated from them. To do this, the signals were broken into approximately 20 ms frames and windowed using Hamming windows with 50% overlap. FFTs of these frames were taken using the spectrogram function in MATLAB. After the FFT was taken, the energy was calculated using the formula |X[k]|², where X[k] is each bin of the FFT of each frame. For the first algorithm, the FFT generates 33 bins, which results in 17 linearly spaced non-negative bins. For the second algorithm, the FFT generates as many bins as there are samples in each frame; this data is later merged to form non-linear bins that match up to the Bark scale. After this is done, the energy needs to be normalized, since every time a phrase is spoken it will contain a different amount of energy. This is done by calculating the total energy in the entire phrase and dividing all bins by that amount, so that the total over all bins in all frames adds up to one. These values are used as the feature set.
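As a minimal sketch of this step for the Bark-scale feature set (the linear feature set differs only in keeping the 17 non-negative bins of a short FFT directly), the following fragment assumes an 8 kHz signal x and uses the standard Bark critical-band edges up to 3700 Hz; all variable names are illustrative:

```matlab
% Feature extraction sketch: ~20 ms Hamming-windowed frames, 50% overlap,
% per-frame FFT, energy |X[k]|^2, then merge linear bins into Bark bands.
fs    = 8000;
flen  = round(0.020 * fs);                 % ~20 ms frame (160 samples)
nover = round(flen / 2);                   % 50% overlap
win   = hamming(flen);

[S, F] = spectrogram(x, win, nover, flen, fs);   % one column per frame
E = abs(S).^2;                                   % energy per bin and frame

% Standard Bark critical-band edges (Hz); content above 3700 Hz is discarded.
edges = [0 100 200 300 400 510 630 770 920 1080 ...
         1270 1480 1720 2000 2320 2700 3150 3700];
B = zeros(numel(edges) - 1, size(E, 2));
for k = 1:numel(edges) - 1
    band = F >= edges(k) & F < edges(k+1);       % linear bins in Bark band k
    B(k, :) = sum(E(band, :), 1);
end

B = B / sum(B(:));    % normalize so the total energy over all frames is 1
features = B;         % each column is one frame's feature vector
```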
At this point, a matrix is made comparing the feature set values of each frame of the test data to each reference digit. One matrix is created for each reference digit in each training set. A single value is created for each position (i, j) in each matrix: the sum of the magnitudes of the differences between the feature set values of reference digit frame i and test data frame j. This is done to compare the test data to each digit in each training set. We refer to these matrices as local distance matrices.
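A minimal sketch of this step, assuming ref and test are feature matrices with one column per frame (names illustrative):

```matlab
% Local distance matrix: d(i,j) = sum of |feature differences| between
% reference frame i and test frame j (city-block distance).
nRef  = size(ref, 2);                 % ref:  bins x reference frames
nTest = size(test, 2);                % test: bins x test frames
d = zeros(nRef, nTest);
for i = 1:nRef
    for j = 1:nTest
        d(i, j) = sum(abs(ref(:, i) - test(:, j)));
    end
end
```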
The next step is to create an accumulated distance matrix from each local distance matrix. To do this, we enforce a rule that a word always has to start at the beginning point (bottom-left of the matrix) and always has to end at the end point (top-right of the matrix). We also enforce that a path can never warp forwards and then move back, and that every frame needs to warp to something. This allows us to use the rule that from any point (i, j), the next point must be (i+1, j+1), (i, j+1), or (i+1, j). Since the bottom row can only be reached from the left, we start by copying the bottom-left corner value from the local distance matrix. Then, to calculate the value of the spot to its right, we take the current value and add the corresponding value from the local distance matrix; that gives us the accumulated distance to that spot. We can apply the same approach to the entire left-most column. We then fill in the rest of the matrix, taking for each spot its local distance plus the minimum accumulated distance among its allowed predecessors.
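A sketch of this dynamic-programming fill, with d as the local distance matrix from the previous step (row 1, column 1 playing the role of the bottom-left start point):

```matlab
% Accumulated distance matrix D built from local distance matrix d.
[nRef, nTest] = size(d);
D = zeros(nRef, nTest);
D(1, 1) = d(1, 1);                    % path must start at the begin point
for i = 2:nRef                        % first column: reachable only from below
    D(i, 1) = D(i-1, 1) + d(i, 1);
end
for j = 2:nTest                       % first row: reachable only from the left
    D(1, j) = D(1, j-1) + d(1, j);
end
for i = 2:nRef                        % interior: cheapest of the 3 predecessors
    for j = 2:nTest
        D(i, j) = d(i, j) + min([D(i-1, j-1), D(i-1, j), D(i, j-1)]);
    end
end
endDist = D(nRef, nTest);             % end-point distance used for matching
```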
We follow this method to create an accumulated distance matrix for each possible digit in each possible training set. The end-point distance (the top-right corner value) is taken from each matrix and put into another matrix. This is illustrated in the image below:
[Figure: color map of end-point distances for each digit (x-axis) and training set]
As you can see, the lightest color represents the lowest value. In this example, the computer would pick digit zero, because digit zero from data set four matched the closest. All the digit zeros (represented by 1 on the x-axis) have the lightest colors, which means they matched far better than any other digit. In the case of a tie between two different digits, a secondary step is taken: the distances of the digits in question are summed over all training sets, and the digit with the lowest sum is picked.
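A sketch of this decision step, assuming endDist is a 10 × 4 matrix of end-point distances with rows for digits 0-9 and columns for training sets (names illustrative):

```matlab
% Pick the digit whose single best training-set match has the lowest distance.
bestPerDigit = min(endDist, [], 2);          % best end-point distance per digit
winner = find(bestPerDigit == min(bestPerDigit));

if numel(winner) > 1                         % tie: sum distances over all sets
    sums = sum(endDist(winner, :), 2);
    [minSum, k] = min(sums);
    winner = winner(k);
end
digit = winner - 1;                          % rows 1..10 map to digits 0..9
```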
Results:
The results can best be interpreted with a confusion matrix, one for each feature set method used. The per-digit results from each matrix are summarized here:
Linear Energy Bands Confusion Matrix (summarized per digit; 8 trials per digit)

| Actual Digit | Correct | Misrecognized | % Word Error Rate |
|---|---|---|---|
| 0 | 7 | 1 | 12.50% |
| 1 | 8 | 0 | 0.00% |
| 2 | 6 | 2 | 25.00% |
| 3 | 7 | 1 | 12.50% |
| 4 | 7 | 1 | 12.50% |
| 5 | 7 | 1 | 12.50% |
| 6 | 8 | 0 | 0.00% |
| 7 | 8 | 0 | 0.00% |
| 8 | 8 | 0 | 0.00% |
| 9 | 6 | 2 | 25.00% |
| Avg. WER | | | 10.00% |
Bark Spectrum Scaled Bands Confusion Matrix (summarized per digit; 8 trials per digit)

| Actual Digit | Correct | Misrecognized | % Word Error Rate |
|---|---|---|---|
| 0 | 4 | 4 | 50.00% |
| 1 | 8 | 0 | 0.00% |
| 2 | 8 | 0 | 0.00% |
| 3 | 4 | 4 | 50.00% |
| 4 | 8 | 0 | 0.00% |
| 5 | 8 | 0 | 0.00% |
| 6 | 8 | 0 | 0.00% |
| 7 | 8 | 0 | 0.00% |
| 8 | 7 | 1 | 12.50% |
| 9 | 8 | 0 | 0.00% |
| Avg. WER | | | 11.25% |
Each table also includes the Word Error Rate for each digit and the average Word Error Rate for the method.
Conclusion:
As one can see, the word error rate was 10% for the linear case and 11.25% for the Bark spectrum case. One would expect the Bark spectrum case to be more accurate, but a look at the confusion matrices shows that most of the errors for the Bark scale case happened at digit 0 or digit 3, while for the linear case the errors are much more randomly spread out. In either case, the system failed between one out of eight and one out of ten times. Depending on the usage, this could be acceptable and perhaps not even an annoyance.
When using the 48 kHz .wav files, the Bark spectrum case drastically outperformed the linear case. This could be because the linear case puts no weighting on any of the frequency bands, while the Bark case throws away all frequency content above 3700 Hz; random noise unrelated to the human speech could therefore be the reason.
It would also be fairly easy to add features to fix the Bark spectrum case. For example, certain phonemes could be detected and weighted more heavily to "fix" the problems with zero and three.