The goal of this project was to develop a speech recognition system using the dynamic time warping (DTW) method. This speaker-dependent system was developed and tested using recordings of my own voice. The system is expected to recognize spoken digits, zero through nine.
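As a rough illustration of the DTW idea, here is a minimal Python sketch (not the project's MATLAB code). It assumes each utterance has already been turned into a sequence of equal-length feature vectors, uses a Euclidean local distance with the basic three-way step pattern, and picks the template with the smallest accumulated distance. The names `dtw_distance`, `classify`, and `templates` are hypothetical, and the actual system may use different features, path constraints, or normalization.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors.

    Assumption: Euclidean local distance and the basic step pattern
    (match, insertion, deletion) with no path constraints.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def classify(test, templates):
    """Return the label whose template is closest to `test` under DTW."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))
```

For example, with toy one-dimensional "features", `classify([[0.0], [0.0], [1.0], [2.0]], {"zero": [[0.0], [1.0], [2.0]], "one": [[5.0], [5.0], [5.0]]})` selects `"zero"`, because the repeated first frame can be absorbed by the warping path at no cost.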
Audio Files can be found here as .wav files
All MATLAB Figures can be found here as .png files
All MATLAB Code can be found here as .m files
There are several key elements that define the system specification. A brief list of the specifications is found below...
The plots from top to bottom: Time Domain Waveform, Wideband Spectrogram, Zero Crossing Rate, Log-Magnitude Spectra & Critical Band Energies
During the testing, classification was done under 13 different conditions.
One Clean Condition
Six Conditions with varying amounts of Car Noise (SNRs of [30, 20, 10, 5, 0, -5] dB)
Six Conditions with varying amounts of Babble Noise (SNRs of [30, 20, 10, 5, 0, -5] dB)
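One common way to build noisy test conditions like these is to scale a noise recording so that the ratio of signal power to noise power hits the target SNR, then add it to the clean signal. The Python sketch below illustrates that calculation. It is an assumption about how such conditions can be generated, not a description of the project's MATLAB code, and the function names `mix_at_snr` and `measure_snr_db` are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Add `noise` to `signal`, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(signal):
        raise ValueError("noise must be at least as long as the signal")
    noise = noise[:len(signal)]
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10*log10(Ps / (g^2 * Pn))  =>  g = sqrt(Ps / (Pn * 10^(SNR_dB/10)))
    gain = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]


def measure_snr_db(signal, noisy):
    """Recover the SNR of a mixture, given the clean signal it was built from."""
    residual = [y - s for s, y in zip(signal, noisy)]
    return 10 * math.log10(mean_power(signal) / mean_power(residual))
```

Looping `snr_db` over `[30, 20, 10, 5, 0, -5]` for each noise type, plus the unmodified recordings, gives the 13 conditions listed above.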
The cooler the color, the smaller the distance between the (i, j) example pair.
Clearly, the confusion matrices shown above reveal the effect noise has on the classification algorithm, and a few relationships stand out. Starting with the most intuitive: as the signal-to-noise ratio decreases, that is, as there is more noise relative to the signal, classification accuracy decreases. We can also see how the different noises affected different words. Words containing mostly low-frequency energy were affected more by the car noise; however, not many words had a power spectral density highly concentrated at low frequencies. Words with comparatively more high-frequency energy were affected more by the babble noise. Because of this difference, the babble noise induced a higher percentage of errors. It's important to note that system performance remained stable throughout and was reasonably good even in the noisiest conditions, with a word error rate ranging from 0% (most cases) to 30% (-5 dB babble noise).
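For isolated-digit classification, the word error rate can be read directly off a confusion matrix as the share of examples that fall off the main diagonal. The Python sketch below shows that arithmetic. It assumes rows index the true digit and columns the predicted digit, and the function name `word_error_rate` and the small matrix in the usage note are purely illustrative, not results from this project.

```python
def word_error_rate(confusion):
    """Percentage of misclassified examples in a square confusion matrix.

    Assumption: confusion[i][j] counts examples of true class i
    that were classified as class j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * (total - correct) / total
```

With a made-up three-class matrix `[[10, 0, 0], [1, 9, 0], [0, 3, 7]]`, 4 of the 30 examples are off the diagonal, so the function returns roughly 13.3.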