Stefan Thaleiser, Gerald Enzner, Rainer Martin, Aleksej Chinaev
Binaural processing is becoming an important feature of high-end commercial headsets and hearing aids. Speech enhancement with binaural output requires adequate treatment of spatial cues in addition to desirable noise reduction and simultaneous speech preservation. Binaural speech enhancement was traditionally approached with model-based statistical signal processing, where the principle of common-gain filtering with identical treatment of left- and right-ear signals has been designed to achieve enhancement constrained by strict binaural cue preservation. However, model-based approaches may also be instructive for the design of modern deep learning architectures. In this article, the common-gain paradigm is therefore embedded into an artificial neural network approach. In order to maintain the desired common-gain property end-to-end, we derive the requirements for compressed feature formation and data normalization. Binaural experiments with moderate-sized artificial neural networks demonstrate the superiority of the proposed common-gain autoencoder network over model-based processing and related unconstrained network architectures for anechoic and reverberant noisy speech in terms of segmental SNR, binaural perception-based metrics MBSTOI, better-ear HASQI, and a listening experiment.
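The common-gain principle described above can be illustrated with a small numerical sketch (our own illustration, not code from the article): applying one identical real-valued spectral gain to the left- and right-ear STFTs leaves the interaural level difference (ILD) and interaural phase difference (IPD) untouched, because the gain cancels in the left/right ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binaural STFT frames (freq bins x time frames), complex-valued.
# These are stand-in spectra for illustration, not network outputs.
left = rng.standard_normal((257, 10)) + 1j * rng.standard_normal((257, 10))
right = 0.5 * left * np.exp(1j * 0.3)  # fixed ILD (-6 dB) and IPD (0.3 rad)

# Common gain: ONE real-valued mask in (0, 1], applied to BOTH ears.
gain = rng.uniform(0.1, 1.0, size=left.shape)
left_hat, right_hat = gain * left, gain * right

# ILD and IPD are preserved, since the identical real gain
# cancels in the interaural ratio left/right.
ild_before = 20 * np.log10(np.abs(left) / np.abs(right))
ild_after = 20 * np.log10(np.abs(left_hat) / np.abs(right_hat))
ipd_before = np.angle(left / right)
ipd_after = np.angle(left_hat / right_hat)

assert np.allclose(ild_before, ild_after)
assert np.allclose(ipd_before, ipd_after)
```

A bilateral filter with independent left/right gains would not satisfy these identities; that is the constraint CoGa-AN enforces end-to-end.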
In the following, we provide audio samples processed with our proposed Common-Gain Autoencoder Network (CoGa-AN), with a binaural mapping Autoencoder Network, and with a bilateral Autoencoder Network. For reference, we also provide the audio signals processed by the binaural ConvTasnet [Han, Luo, Mesgarani, IEEE International Conference on Acoustics, Speech, and Signal Processing 2020], by the model-based cue-preserving MMSE filter with SPP-based noise estimation (CP-MMSE-SPP) [Azarpour, Enzner, EURASIP Journal on Advances in Signal Processing 2017], and by a Minimum Variance Distortionless Response beamformer with partial noise constraint (MVDR-N) [Marquardt, Doclo, IEEE/ACM Transactions on Audio, Speech, and Language Processing 2018]. The noisy speech is simulated from binaural clean speech utterances with diverse directions of arrival (DOA) around the listener (0° in front of the listener, +/-90° to the left/right of the listener), superimposed with either pub noise or road noise. To support the listening experience beyond the methods derived in the submitted manuscript, the processed audio samples easily allow for the application of an additional spectral floor. Such a floor is commonly used in speech enhancement and is set here to limit the noise reduction to -20 dB for the hearing experiment.
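The additional spectral floor mentioned above can be sketched as follows (an illustrative helper under our own naming, not the authors' code): the gain mask is lower-bounded by 10^(floor_dB/20), so the attenuation never exceeds the floor. Since the floor is applied identically to both ear channels, it does not disturb the common-gain property.

```python
import numpy as np

def apply_spectral_floor(gain, floor_db=-20.0):
    """Lower-bound a spectral gain mask so attenuation never exceeds floor_db.

    With floor_db = -20, any gain below 10**(-20/20) = 0.1 is raised
    to 0.1, i.e. the noise reduction is limited to 20 dB.
    (Illustrative helper; name and signature are our own.)
    """
    floor = 10.0 ** (floor_db / 20.0)
    return np.maximum(gain, floor)

# Example: a common gain mask between 0 and 1, shared by both ear channels.
g = np.array([0.0, 0.05, 0.1, 0.5, 1.0])
g_floored = apply_spectral_floor(g)  # gains below 0.1 are raised to 0.1
```

Keeping a small amount of residual noise in this way is a common practice in speech enhancement, as it reduces audible artifacts from aggressive suppression.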
Processing Method | pub noise, 20 dB SNR, 65° DOA | road noise, 20 dB SNR, -35° DOA |
---|---|---|
noisy | *(audio sample)* | *(audio sample)* |
MVDR-N | *(audio sample)* | *(audio sample)* |
CP-MMSE-SPP | *(audio sample)* | *(audio sample)* |
Binaural ConvTasnet | *(audio sample)* | *(audio sample)* |
CoGa-AN | *(audio sample)* | *(audio sample)* |
Processing Method | road noise, 10 dB SNR, 55° DOA | pub noise, 10 dB SNR, 35° DOA |
---|---|---|
noisy | *(audio sample)* | *(audio sample)* |
MVDR-N | *(audio sample)* | *(audio sample)* |
CP-MMSE-SPP | *(audio sample)* | *(audio sample)* |
Binaural ConvTasnet | *(audio sample)* | *(audio sample)* |
CoGa-AN | *(audio sample)* | *(audio sample)* |
Processing Method | pub noise, 0 dB SNR, 45° DOA | road noise, 0 dB SNR, -65° DOA |
---|---|---|
noisy | *(audio sample)* | *(audio sample)* |
MVDR-N | *(audio sample)* | *(audio sample)* |
CP-MMSE-SPP | *(audio sample)* | *(audio sample)* |
Binaural ConvTasnet | *(audio sample)* | *(audio sample)* |
CoGa-AN | *(audio sample)* | *(audio sample)* |