Project 3

Isolated Spoken Word Recognition

Evan Shenkman

The goal of this project was to develop a speech recognition system using the dynamic time warping (DTW) method. This speaker-dependent system was developed and tested using recordings of my own voice. The system is expected to recognize the spoken digits zero through nine.
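As a rough illustration of the DTW idea, the following Python sketch computes a global distance between two utterances, each represented as a sequence of feature vectors. It is not the MATLAB code used in this project: the function name `dtw_distance`, the Euclidean local distance, and the simple three-way step pattern are assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Global DTW distance between two sequences of feature vectors.

    Assumptions (not taken from the project code): Euclidean local
    distance and the basic step pattern (vertical, horizontal, diagonal).
    """
    n, m = len(a), len(b)
    # acc[i][j] = minimum accumulated cost of aligning a[:i] with b[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            acc[i][j] = local + min(
                acc[i - 1][j],      # frame of a aligned to the same frame of b again
                acc[i][j - 1],      # frame of b aligned to the same frame of a again
                acc[i - 1][j - 1],  # both sequences advance together
            )
    return acc[n][m]
```

Because the warping path may repeat frames, a time-stretched copy of a sequence still aligns at zero cost, which is what makes the method tolerant of differences in speaking rate.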

Audio Files can be found here as .wav files

All MATLAB Figures can be found here as .png files

All MATLAB Code can be found here as .m files

System Development

There are several key elements that define the system specification. A brief list of the specifications is found below.



Analysis of Examples

The plots from top to bottom: Time Domain Waveform, Wideband Spectrogram, Zero Crossing Rate, Log-Magnitude Spectra & Critical Band Energies
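Of the quantities plotted, the zero crossing rate is the simplest to compute. The sketch below, in Python rather than the project's MATLAB, counts sign changes per frame. The function name and the `frame_len` and `hop` parameters are hypothetical, since the framing used for the plots is not stated here.

```python
def zero_crossing_rate(samples, frame_len, hop):
    """Fraction of adjacent sample pairs in each frame that change sign.

    frame_len and hop are in samples; both are assumed parameters.
    A sample equal to zero is treated as positive.
    """
    rates = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        crossings = sum(
            1 for x, y in zip(frame, frame[1:]) if (x >= 0) != (y >= 0)
        )
        rates.append(crossings / (frame_len - 1))
    return rates
```

High values typically indicate noisy, fricative-like segments, while voiced segments give lower values.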



System Testing & Results

During the testing, classification was done under 13 different conditions.

One Clean Condition
Six Conditions with varying amounts of Car Noise (SNRs of [30, 20, 10, 5, 0, -5] dB)
Six Conditions with varying amounts of Babble Noise (SNRs of [30, 20, 10, 5, 0, -5] dB)
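One common way to create noisy conditions like those listed above is to scale a noise recording so that the mixture reaches the target SNR. The Python sketch below shows that calculation. It is an assumption about the procedure, not the project's MATLAB code, and `mix_at_snr` is a hypothetical name.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB.

    Assumes noise is at least as long as speech and is not silent.
    """
    noise = noise[:len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Solve p_speech / (gain**2 * p_noise) = 10**(snr_db / 10) for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At 0 dB the scaled noise has the same average power as the speech, and at -5 dB the noise is stronger than the speech.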



Example of Global Distance Matrix

The colder the color, the smaller the distance between the (i,j) example pair
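Given such a matrix of global distances, a natural decision rule is to assign each test example the label of its closest template. The Python sketch below assumes that nearest-neighbor rule and a layout where rows are test examples and columns are templates. Neither detail is confirmed by this page, and `classify_nearest` is a hypothetical name.

```python
def classify_nearest(distances, template_labels):
    """Label each test example with the label of its nearest template.

    distances[i][j] is assumed to be the global DTW distance between
    test example i and template j.
    """
    predictions = []
    for row in distances:
        best = min(range(len(row)), key=row.__getitem__)
        predictions.append(template_labels[best])
    return predictions
```

For example, `classify_nearest([[0.2, 1.5], [2.0, 0.1]], ["zero", "one"])` returns `["zero", "one"]`.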



Output Confusion Matrices for All Conditions

Clearly, the confusion matrices shown above reveal the effect noise has on the classification algorithm. We can note a few relationships. Starting with the most intuitive: as the signal-to-noise ratio decreases, that is, as there is more noise relative to the signal, classification accuracy decreases. We can also see how the different noises affected different words. Words containing mostly low-frequency energy were affected more by the car noise; however, not many words had a power spectral density concentrated at low frequencies. Words with comparatively more high-frequency energy were affected more by the babble noise. Because of this difference, the babble noise induces a higher percentage of errors. It's important to note that system performance remained stable throughout, and the system performed reasonably well in even the noisiest conditions, with a word error rate ranging from 0% (most cases) to 30% (-5 dB babble noise).
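For reference, a confusion matrix and the corresponding word error rate can be tallied as in the Python sketch below. For isolated words, the error rate is taken here to be the fraction of misclassified examples. The function names and the integer digit labels are assumptions for illustration, not the project's MATLAB code.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=10):
    """matrix[t][p] counts examples of true class t classified as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def word_error_rate(matrix):
    """Fraction of examples that fall off the main diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return (total - correct) / total
```

A perfectly diagonal matrix gives an error rate of 0, and each off-diagonal count identifies which digit was mistaken for which.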