Project 3

Isolated Spoken Word Recognition

Evan Shenkman

The goal of this project was to develop a speech recognition system using the dynamic time warping (DTW) method. This speaker-dependent system was developed and tested using recordings of my own voice. The system is expected to recognize the spoken digits zero through nine.

Audio Files can be found here as .wav files

All MATLAB Figures can be found here as .png files

All MATLAB Code can be found here as .m files

System Development

Several key elements define the system specification. A brief list of the specifications is given below:

Correct Endpoints are assumed for each word
The data set will contain 5 recordings of each digit, spoken slightly differently each time
Each recording will be analyzed, frame by frame, extracting 17 critical band energies per frame
The Euclidean Distance Metric will be used during Classification
During classification, a single recording of each word will be selected from its five recordings to act as the testing set. The other four recordings of each word will make up the training set.
The system will be tested under clean and noisy conditions, including car noise and group chatter (babble) noise
The results will be cross-validated 5 times
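
The specification above can be illustrated with a minimal Python sketch (not the project's MATLAB code). It assumes each recording is already a list of per-frame feature vectors, such as the 17 critical band energies. The symmetric step pattern, the lack of path-length normalization, and the nearest-template decision rule are assumptions for illustration; the report only specifies DTW with a Euclidean distance metric. The names euclidean, dtw_distance, and classify are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Global DTW distance between two sequences of feature frames.

    Assumed step pattern: each cell may be reached from the left,
    from below, or diagonally, with equal weight.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest accumulated distance aligning
    # the first i frames of seq_a with the first j frames of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a frame of seq_b
                                 cost[i][j - 1],      # repeat a frame of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def classify(test_seq, templates):
    """Return the label of the training template closest to test_seq.

    templates is a list of (label, sequence) pairs, e.g. the four
    training recordings of each digit.
    """
    best_label, _ = min(
        ((label, dtw_distance(test_seq, seq)) for label, seq in templates),
        key=lambda pair: pair[1],
    )
    return best_label
```

For cross-validation under these assumptions, classify would be called once per held-out recording, rotating which of the five recordings of each digit is held out.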

Analysis of Examples

The plots from top to bottom: Time Domain Waveform, Wideband Spectrogram, Zero Crossing Rate, Log-Magnitude Spectra & Critical Band Energies
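
As an illustration of one of the plotted quantities, the zero crossing rate can be computed frame by frame as sketched below in Python. The frame length and hop size are hypothetical parameters; the report does not state the values used, and treating a zero sample as non-negative is an arbitrary choice made here.

```python
def zero_crossing_rate(samples, frame_len, hop):
    """Per-frame fraction of adjacent sample pairs that change sign.

    frame_len must be at least 2. Trailing samples that do not fill
    a whole frame are ignored.
    """
    rates = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # A crossing occurs when neighbouring samples fall on
        # opposite sides of zero (zero counts as non-negative).
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        rates.append(crossings / (frame_len - 1))
    return rates
```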

System Testing & Results

During the testing, classification was done under 13 different conditions.

One Clean Condition
Six Conditions with varying amounts of Car Noise (SNRs of [30, 20, 10, 5, 0, -5] dB)
Six Conditions with varying amounts of Babble Noise (SNRs of [30, 20, 10, 5, 0, -5] dB)
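
One common way to create a noisy condition at a given SNR is to scale the noise so that the ratio of speech power to noise power matches the target, then add it to the clean recording. The Python sketch below shows that idea. It assumes power is averaged over the whole recording and that the noise clip is at least as long as the speech; the report does not say how the noisy files were actually generated, and mix_at_snr is a hypothetical name.

```python
import math

def mean_power(x):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the result has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling mix_at_snr once for each value in [30, 20, 10, 5, 0, -5] with a car or babble noise clip would produce the six noisy versions of a recording under these assumptions.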

Example of Global Distance Matrix

The colder the color, the smaller the distance between the (i,j) example pair

Output Confusion Matrices for All Conditions

Clearly, the Confusion Matrices shown above illustrate the effect noise has on the classification algorithm, and a few relationships stand out. Starting with the most intuitive: as the Signal to Noise Ratio decreases, that is, as there is more noise relative to the signal, classification accuracy decreases. The different noises also affected different words. Words containing mostly low frequency energy were affected more by the car noise, although few words had a Power Spectral Density concentrated at low frequencies. Words with comparatively more high frequency energy were affected more by the babble noise. Because of this difference, the babble noise induced a higher percentage of errors. It's important to note that system performance remained stable throughout and was reasonably good even in the noisiest conditions, with a Word Error Rate ranging from 0% (most cases) to 30% (-5 dB Babble Noise).
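
For reference, a Word Error Rate like the one quoted above can be read off a confusion matrix as the share of test recordings that fall off the diagonal. The Python sketch below assumes rows index the spoken word and columns the recognized word; that orientation is an assumption and only matters for the per-word figure. The function names are hypothetical.

```python
def word_error_rate(confusion):
    """Percentage of test recordings that were misclassified."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * (total - correct) / total

def per_word_error(confusion):
    """Error percentage for each spoken word (one value per row)."""
    return [
        100.0 * (sum(row) - row[i]) / sum(row)
        for i, row in enumerate(confusion)
    ]
```

With a made-up two-word matrix [[5, 0], [1, 4]] (illustrative values, not results from this project), one of ten recordings is off the diagonal, so the overall rate is 10% and the per-word rates are 0% and 20%.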