Internship project: Investigating deep learning for speech enhancement

Duration: 6-9 month internship

Candidate: Student with a background in signal processing (adaptive filter theory and spectral post-processing) and, ideally but not necessarily, machine learning, interested in applying deep learning algorithms to speech enhancement applications. Candidates should be studying in or near Eindhoven.

Description:

The wave of breakthrough progress in training large (deep) neural networks over the last decade has seen their use explode in many fields, including self-driving cars, object recognition in images, and speech recognition.

Recently, deep learning has been applied to various speech enhancement algorithms, such as noise suppression and nonlinear acoustic echo cancellation/suppression, with promising results. Acoustic echo is caused by the acoustic coupling between a device’s loudspeaker and its microphone(s), which causes the “far-end” signal played back by the loudspeaker to leak into the microphone(s). Without an acoustic echo canceller/suppressor, the far-end talker would hear his/her own voice echoed back. Combined with potentially long transmission delays, this echo can severely impede a conversation.
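To make the setup concrete, the following NumPy sketch simulates a purely linear echo path: the far-end signal is convolved with a hypothetical room impulse response and added to the near-end speech at the microphone. All signals and parameters here are illustrative placeholders, not part of the project specification.

    import numpy as np

    rng = np.random.default_rng(0)
    fs = 16000  # sample rate in Hz (assumed)

    # Placeholder signals; in practice these would be speech recordings.
    far_end = rng.standard_normal(fs)    # one second of far-end signal
    near_end = rng.standard_normal(fs)   # one second of desired near-end speech

    # Hypothetical exponentially decaying room impulse response for the
    # loudspeaker-to-microphone path.
    rir = rng.standard_normal(512) * np.exp(-np.arange(512) / 100.0)
    rir /= np.linalg.norm(rir)

    echo = np.convolve(far_end, rir)[:len(far_end)]  # linear acoustic echo
    mic = near_end + echo                            # microphone observation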

Especially in hands-free applications, the device’s loudspeaker is often driven into its nonlinear range of operation because of the higher sound pressure levels required to bridge the larger distance between the device and the talker. For mobile phones, which pack small, low-cost components into a small form factor, nonlinear acoustic echoes are dominant even at lower playback levels.
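A common first-order way to capture this behavior in simulation (an assumption for illustration, not a model of any particular device) is a memoryless saturation applied to the far-end signal before the linear echo path, e.g. continuing the sketch above:

    import numpy as np

    def loudspeaker(x, drive=4.0):
        """Crude memoryless soft-clipping model of an overdriven loudspeaker.

        tanh is only one possible choice; real drivers also exhibit
        memory effects that this sketch ignores.
        """
        return np.tanh(drive * x) / drive

    # The echo at the microphone then stems from the *distorted* signal,
    # which a purely linear echo model can no longer fully capture:
    # echo = np.convolve(loudspeaker(far_end), rir)[:len(far_end)]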

Typically, acoustic echo cancellation/suppression systems use a linear model of the acoustic echo path between the device’s loudspeaker(s) and one or more microphones. Such a system consists of an acoustic echo canceller (AEC) employing a linear adaptive filter to model the acoustic echo path, followed by a post-processing stage that uses spectral subtraction to suppress the residual echo power remaining in the AEC error signal. However, due to the coarse modelling performed in the post-processing stage, the desired near-end signal is often distorted or completely attenuated during double-talk situations, when both talkers are active. Preserving as much of this desired signal as possible while still delivering full echo suppression remains a challenge.
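As a minimal sketch of the linear AEC stage, the following uses the normalized LMS (NLMS) algorithm, one common adaptive filter choice; filter length, step size, and signal names are illustrative assumptions:

    import numpy as np

    def nlms_aec(far_end, mic, n_taps=256, mu=0.5, eps=1e-8):
        """Time-domain NLMS echo canceller (single loudspeaker/microphone).

        Returns the echo estimate y and the residual e = mic - y, which a
        post-processing stage would further clean up, e.g. by applying a
        spectral-subtraction gain per frequency bin.
        """
        w = np.zeros(n_taps)      # adaptive filter modelling the echo path
        x_buf = np.zeros(n_taps)  # most recent far-end samples
        y = np.zeros(len(mic))
        e = np.zeros(len(mic))
        for n in range(len(mic)):
            x_buf = np.roll(x_buf, 1)
            x_buf[0] = far_end[n]
            y[n] = w @ x_buf                  # echo estimate
            e[n] = mic[n] - y[n]              # residual after cancellation
            w += mu * e[n] * x_buf / (x_buf @ x_buf + eps)  # NLMS update
        return y, e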

To tackle nonlinear acoustic echoes, many of the signal processing algorithms developed in the past decade employ higher-order polynomial models (e.g. Volterra systems), which impose a high computational burden while often still under-modelling the actual system.
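For illustration, a deliberately naive second-order Volterra filter shows where the cost comes from: the quadratic kernel alone has M x M coefficients for a memory length of M samples.

    import numpy as np

    def volterra2(x, h1, h2):
        """Second-order Volterra filter (illustrative, O(M**2) per sample).

        h1: linear kernel of length M; h2: (M, M) quadratic kernel.
        """
        M = len(h1)
        x_pad = np.concatenate([np.zeros(M - 1), x])
        y = np.zeros(len(x))
        for n in range(len(x)):
            frame = x_pad[n:n + M][::-1]   # x[n], x[n-1], ..., x[n-M+1]
            y[n] = h1 @ frame + frame @ h2 @ frame
        return y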

Recent devices and services like Amazon’s Echo and its Alexa voice service are pushing the envelope in terms of the performance and experience users can and will expect, which in turn is pushing voice communication requirements to new levels. Recent advances in deep learning may help model the complex nonlinear behavior of these communication systems and provide additional insight into it.

Student’s activities:

  • Learn the fundamentals of neural networks, including cost functions, the backpropagation algorithm for training, activation functions, regularization, and state-of-the-art techniques for improving their adaptation.

  • Set up a Python environment (Theano/TensorFlow) for quickly prototyping and working with multilayer neural networks (see the sketch after this list).

  • Learn the basics of nonlinear acoustic echo cancellation and suppression with emphasis on the latest work in the field.

  • Carry out recordings using a mobile device to accumulate a large database of training/validation/test data.

  • Use the background knowledge developed around these nonlinear acoustic echo management systems to design and train deep learning architectures, and benchmark them against state-of-the-art signal processing systems.

  • Gain insight from the trained network’s resulting parameters.

  • Potentially present the results via an internal or external publication/presentation.
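As a starting point for the prototyping environment mentioned above, the sketch below defines a small feed-forward network in TensorFlow/Keras that maps stacked log-spectral features of the microphone and far-end signals to a per-bin suppression gain, in the spirit of the DNN-based approaches of [5] and [6]. All dimensions, features, and targets are illustrative assumptions; choosing them is part of the project.

    import numpy as np
    import tensorflow as tf

    n_bins = 257     # frequency bins per frame (assuming a 512-point STFT)
    n_context = 5    # context frames stacked on the input side (assumption)

    # Input: stacked log-magnitude frames of the microphone and far-end
    # signals. Target: an ideal suppression gain per frequency bin.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(2 * n_context * n_bins,)),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(n_bins, activation="sigmoid"),  # gain in [0, 1]
    ])
    model.compile(optimizer="adam", loss="mse")

    # Random data standing in for the recorded training database.
    features = np.random.rand(64, 2 * n_context * n_bins).astype("float32")
    targets = np.random.rand(64, n_bins).astype("float32")
    model.fit(features, targets, epochs=1, batch_size=16, verbose=0)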

References:

[1] Schwarz, A., Hofmann, C. and Kellermann, W., 2013, October. Spectral feature-based nonlinear residual echo suppression. In Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on (pp. 1-4). IEEE.

[2] Lu, X., Tsao, Y., Matsuda, S. and Hori, C., 2013, August. Speech enhancement based on deep denoising autoencoder. In Interspeech (pp. 436-440).

[3] Xu, Y., Du, J., Dai, L.R. and Lee, C.H., 2014. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Processing Letters, 21(1), pp. 65-68.

[4] Huang, P.S., Kim, M., Hasegawa-Johnson, M. and Smaragdis, P., 2014, May. Deep learning for monaural speech separation. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 1562-1566). IEEE.

[5] Xu, Y., Du, J., Dai, L.R. and Lee, C.H., 2015. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1), pp. 7-19.

[6] Lee, C.M., Shin, J.W. and Kim, N.S., 2015. DNN-based residual echo suppression. In INTERSPEECH (pp. 1775-1779).


For more information, please contact patrick.kechichian@philips.com.