Common-Gain Autoencoder Network for Binaural Speech Enhancement

(Manuscript submitted to IEEE Open Journal of Signal Processing)

Stefan Thaleiser, Gerald Enzner, Rainer Martin, Aleksej Chinaev

Binaural processing of audio signals has become an important feature of high-end commercial headsets and medical hearing aids. Speech enhancement with binaural output requires adequate treatment of spatial cues in addition to the desired noise reduction and speech preservation. It has traditionally been approached with model-based statistical signal processing, where the principle of common-gain filtering, i.e., identical treatment of the left- and right-ear signals, achieves binaural enhancement under the constraint of strict binaural cue preservation. The model-based theory may now also prove instructive for the design of modern deep learning architectures. In this article, the common-gain paradigm is therefore embedded in an end-to-end autoencoder network architecture, and the associated capabilities and requirements for compressed feature formation, data normalization, and in-network constraints to maintain the desired common-gain property end-to-end are derived. Binaural experiments with moderate-sized neural networks demonstrate the superiority of the proposed common-gain autoencoder network over model-based processing and related unconstrained network architectures in terms of segmental SNR and the binaural perception-based metrics MBSTOI and better-ear HASQI.
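
As a rough illustration of the common-gain principle described above (a minimal sketch, not the architecture of the manuscript), the following Python snippet applies one shared real-valued time-frequency gain to the left- and right-ear STFT coefficients, which is what preserves the interaural level and phase differences. The power-law exponent and all function and variable names are illustrative assumptions.

    import numpy as np

    def power_law_features(Y_left, Y_right, alpha=0.3):
        # Hypothetical compressed feature formation: power-law compressed
        # magnitude spectra of both ears, stacked as network input.
        return np.stack([np.abs(Y_left) ** alpha,
                         np.abs(Y_right) ** alpha], axis=0)

    def common_gain_enhance(Y_left, Y_right, gain):
        # Apply the identical real-valued gain (mask) to both ear signals.
        # Because the same scaling acts on left and right, the interaural
        # level and phase differences (binaural cues) of the input remain
        # unchanged in the enhanced output.
        return gain * Y_left, gain * Y_right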


Audio samples:

In the following, we provide audio samples processed with linear and power-law compressed features by our proposed Common-Gain Autoencoder Network (CoGa-AN), by a binaural mapping Autoencoder Network, and by a bilateral Autoencoder Network. For reference, we also provide the audio signals processed by the model-based cue-preserving MMSE filter with SPP-based noise estimation (CP-MMSE-SPP) [Azarpour, Enzner, EURASIP Journal on Advances in Signal Processing 2017]. The noisy speech is simulated from binaural clean speech utterances with diverse directions of arrival (DOA) around the listener (0° in front of the listener, +/-90° to the left/right of the listener), superimposed with pub noise, road noise, and inside-train noise, respectively. In order to support the listening experience beyond the methods derived in the submitted manuscript, the masking-based processing options (i.e., all except the binaural mapping) easily allow for the application of an additional spectral floor. It is commonly used in speech enhancement and is set here to limit the noise reduction to -10 dB for the listening examples.
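
For orientation, here is a minimal sketch of the spectral floor mentioned above, assuming a real-valued gain mask as produced by the masking-based methods; a -10 dB floor corresponds to a minimum gain of 10^(-10/20) ≈ 0.32. Function and parameter names are illustrative, not taken from the manuscript.

    import numpy as np

    def apply_spectral_floor(gain, floor_db=-10.0):
        # Limit the maximum attenuation of a time-frequency gain mask:
        # gains below 10^(floor_db/20) are raised to that floor, so the
        # noise reduction never exceeds |floor_db| dB.
        g_min = 10.0 ** (floor_db / 20.0)
        return np.maximum(gain, g_min)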


Using power-law compression:

Processing Method | pub noise, 0 dB SNR, -35° DOA | pub noise, 10 dB SNR, -20° DOA | pub noise, 20 dB SNR, 28° DOA
noisy
CP-MMSE-SPP
Bilateral Network
Binaural Mapping
CoGa-AN

Processing Method | road noise, 0 dB SNR, -7° DOA | road noise, 10 dB SNR, 18° DOA | road noise, 20 dB SNR, 60° DOA
noisy
CP-MMSE-SPP
Bilateral Network
Binaural Mapping
CoGa-AN

Processing Method | inside train noise, 0 dB SNR, -48° DOA | inside train noise, 10 dB SNR, 17° DOA | inside train noise, 20 dB SNR, 36° DOA
noisy
CP-MMSE-SPP
Bilateral Network
Binaural Mapping
CoGa-AN


Using linear compression:

Processing Method | pub noise, 0 dB SNR, -53° DOA | pub noise, 10 dB SNR, 2° DOA | pub noise, 20 dB SNR, 37° DOA
noisy
CP-MMSE-SPP
Bilateral Network
Binaural Mapping
CoGa-AN

Processing Method | road noise, 0 dB SNR, -56° DOA | road noise, 10 dB SNR, 11° DOA | road noise, 20 dB SNR, 59° DOA
noisy
CP-MMSE-SPP
Bilateral Network
Binaural Mapping
CoGa-AN

Processing Method | inside train noise, 0 dB SNR, -24° DOA | inside train noise, 10 dB SNR, -4° DOA | inside train noise, 20 dB SNR, 21° DOA
noisy
CP-MMSE-SPP
Bilateral Network
Binaural Mapping
CoGa-AN