Noise robustness of automatic speech recognition benefits from using missing data imputation: prior to recognition, the parts of the spectrogram dominated by noise are replaced by clean speech estimates. Especially at low SNRs, each frame contains at best only a few uncorrupted coefficients, so frame-by-frame restoration of corrupted feature vectors, and thus recognition accuracy, is sub-optimal. In this paper we present a novel imputation technique working on entire words. A word is sparsely represented in an overcomplete basis of exemplar (clean) speech signals using only the uncorrupted time-frequency elements of the word. The corrupted elements are replaced by estimates obtained by projecting the sparse representation in the basis. We achieve recognition accuracies of 90% at SNR -5 dB using oracle masks on AURORA-2 as compared to 60% using a conventional frame-based approach. The performance obtained with estimated masks can be directly related to the proportion of correctly identified uncorrupted coefficients.
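To make the idea in the abstract concrete, here is a minimal sketch of sparse imputation, not the paper's actual implementation. It assumes a word is flattened into one feature vector, that the exemplar dictionary is a NumPy array with one exemplar per column, and that the mask marks reliable elements with `True`. The l1 problem is solved with plain ISTA, which is my stand-in since the abstract does not name a solver; the function names `ista_lasso` and `impute_word` and the parameters `lam` and `n_iter` are hypothetical.

```python
import numpy as np

def ista_lasso(A, y, lam=0.01, n_iter=500):
    """Approximately minimise 0.5*||A x - y||^2 + lam*||x||_1 with ISTA."""
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))                        # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

def impute_word(observed, mask, exemplars, lam=0.01, n_iter=500):
    """Replace unreliable elements of `observed` with a sparse-model estimate.

    observed  : (n_features,) noisy word, flattened time-frequency elements
    mask      : (n_features,) boolean, True where the element is reliable
    exemplars : (n_features, n_exemplars) clean exemplar words as columns
    """
    reliable = np.asarray(mask, dtype=bool)
    # Fit the sparse code using only the rows the mask marks as uncorrupted.
    coeffs = ista_lasso(exemplars[reliable], observed[reliable], lam, n_iter)
    # Project the sparse code back through the full dictionary.
    estimate = exemplars @ coeffs
    # Keep reliable elements as observed, impute the rest.
    return np.where(reliable, observed, estimate), coeffs
```

Calling `impute_word(noisy, mask, exemplars)` returns the restored vector together with the sparse coefficients. Real spectrographic features are non-negative, so a non-negative solver may be a better fit than this unconstrained one; that choice is left open here.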
While using the l1 reconstruction technique, Jort also uses random projections to reduce the dimensionality of the very large data set he has to handle. In all, a clever use of sparse decomposition and random projection techniques in a notoriously difficult problem. The conclusion speaks for itself.
We showed the potential of the method by achieving recognition accuracies on AURORA-2 of 91% at SNR -5 dB using an oracle mask, an increase of 30 percent absolute over a state-of-the-art missing data speech recognizer.
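For the random projection step mentioned above, here is a rough sketch of how one might shrink the l1 problem before solving it. It rests on my assumptions rather than on the paper's exact recipe: a Gaussian projection matrix scaled by 1/sqrt(n_out), applied to the reliable rows of both the exemplar dictionary and the observation. The names `gaussian_projection`, `project_reliable`, `n_out` and `seed` are hypothetical.

```python
import numpy as np

def gaussian_projection(n_out, n_in, seed=0):
    """Random Gaussian matrix; the 1/sqrt(n_out) scaling roughly preserves norms."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) / np.sqrt(n_out)

def project_reliable(exemplars, observed, mask, n_out, seed=0):
    """Compress the reliable rows of the dictionary and the observation.

    exemplars : (n_features, n_exemplars) clean exemplar words as columns
    observed  : (n_features,) noisy word
    mask      : (n_features,) boolean, True where the element is reliable
    n_out     : number of rows after projection
    """
    reliable = np.asarray(mask, dtype=bool)
    P = gaussian_projection(n_out, int(reliable.sum()), seed)
    # The same P multiplies both sides, so any exact linear relation
    # observed = exemplars @ x on the reliable rows still holds afterwards.
    return P @ exemplars[reliable], P @ observed[reliable]
```

The sparse coefficients are then sought on the smaller pair returned by `project_reliable`, and the imputed word is still reconstructed from the full, unprojected dictionary.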
The moral of the story: contributing to the Monday Morning Algorithm series will get you to substantially improve your field of endeavor :-)
Photo credit: NASA/JPL/Space Science Institute, Prometheus moon. Photo taken yesterday.