
Speech Enhancement: Theory and Practice, Second Edition - A Comprehensive Guide for Engineers



In this chapter, we study the best speech enhancement estimator in the time domain. The first part focuses on the single-channel scenario, where important insights are gained from different kinds of correlation coefficients; in the linear case, we obtain the well-known Wiener filter, whose behavior is explained within this general framework. The second part deals with the best binaural speech enhancement estimator; the approach taken here reformulates the binaural problem as a monaural one by means of complex random variables.
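For reference, in the single-channel linear case the estimate is formed as \(\hat{s}(n) = \mathbf{h}^T \mathbf{y}(n)\), and minimizing the mean-square error yields the classical Wiener solution. The following is a standard formulation with assumed notation, since the excerpt does not reproduce the chapter's equations:

\[
\mathbf{h}_{\mathrm{W}} = \arg\min_{\mathbf{h}}\, E\!\left[\left(s(n) - \mathbf{h}^T \mathbf{y}(n)\right)^2\right] = \mathbf{R}_{\mathbf{y}}^{-1}\,\mathbf{r}_{\mathbf{y}s},
\]

where \(\mathbf{R}_{\mathbf{y}} = E\!\left[\mathbf{y}(n)\,\mathbf{y}^T(n)\right]\) is the correlation matrix of the noisy observation vector and \(\mathbf{r}_{\mathbf{y}s} = E\!\left[\mathbf{y}(n)\,s(n)\right]\) is the cross-correlation vector between the noisy observation and the clean speech sample.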







A second approach for the combination of models in speech enhancement is to employ multi-stage processing, where either multiple identical models (cf. [33]) or, most often, different models are used in succession to improve the enhancement performance. Applied to classical speech enhancement, this principle is generally used to achieve higher noise attenuation, e.g., with the multi-stage Wiener filter approach [34], which in turn degrades speech quality. Different from that, some studies have focused on first performing speech separation and subsequently enhancing the separated signals using nonnegative matrix factorization [35] or Gaussian mixture models [36]. In combination with deep learning models, the multi-stage paradigm has been applied to music source separation, using feedforward DNNs both for the separation task and for the subsequent task of enhancing the separated signals [37]. A further possibility is proposed in [38], where denoising and dereverberation are addressed in subsequent stages using separately trained feedforward DNNs, and joint fine-tuning of the two-stage model is carried out in a second step.


The underlying idea of our newly proposed system is to employ separate processing stages for speech denoising and restoration, both using deep NN topologies with advantageous properties for the respective tasks. In the noise suppression stage, an LSTM-based network trained with the cMSA loss from (9) is employed to attain a strong noise attenuation, even at the cost of potentially introducing speech distortions. The subsequent restoration stage restores speech and further attenuates residual noise. For this second task, a CED network is used, which has been found to be well suited for the restoration of slightly corrupted structured signals, e.g., in image restoration [28] or enhancement of coded speech [18]. The CED network training employs the cSA loss function defined in (11), and therefore a direct spectral mapping is performed in the second stage. The cSA loss function is chosen over a mask-based loss for two reasons: First, the restoration of missing T-F regions in the estimated signal can be quite difficult for a mask-based approach, requiring very large mask values. Second, the CED network is specifically designed to map to outputs of the same domain as the input, in this case speech spectra rather than spectral masks. In the following, an overview of the system is given, and the chosen network topologies for both stages are described in detail.
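To make the distinction between the two loss types concrete, the following minimal PyTorch sketch contrasts them. Since Eqs. (9) and (11) are not reproduced in this excerpt, standard masked-spectrum-approximation and spectrum-approximation forms on the real and imaginary STFT parts are assumed; all tensor names are illustrative.

```python
import torch

def cmsa_loss(mask_re, mask_im, noisy_re, noisy_im, clean_re, clean_im):
    """cMSA (assumed form of Eq. (9)): the first-stage network outputs T-F masks
    for the real and imaginary parts; the masks are applied to the noisy spectrum
    and the result is compared to the clean target spectrum."""
    est_re = mask_re * noisy_re
    est_im = mask_im * noisy_im
    return ((est_re - clean_re) ** 2 + (est_im - clean_im) ** 2).mean()

def csa_loss(est_re, est_im, clean_re, clean_im):
    """cSA (assumed form of Eq. (11)): the second-stage network outputs a spectrum
    estimate directly, which is compared to the clean target; no mask is involved,
    so missing T-F regions can be filled in without requiring extreme mask values."""
    return ((est_re - clean_re) ** 2 + (est_im - clean_im) ** 2).mean()
```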


Block diagram of the proposed two-stage enhancement approach, including the first-stage noise suppression followed by the second-stage speech restoration, which is used to restore a natural-sounding speech signal. For details on the noise suppression network block and the speech restoration network block, refer to Figs. 2 and 3, respectively.


In the speech restoration stage, a second feature extraction, including MVN using the vectors of means \(\boldsymbol{\mu}_x^{(2)}\) and standard deviations \(\boldsymbol{\sigma}_x^{(2)}\) obtained during speech restoration network training, is employed. The resulting feature representation \(\tilde{\mathbf{x}}_\ell^{(2)}\) is input to the speech restoration network, which directly maps to the enhanced speech spectrum \(\hat{S}_\ell^{(2)}(k')\), using the trained network parameters \(\Theta^{(2)}\). Reconstruction of the corresponding enhanced time-domain signal \(\hat{s}^{(2)}(n)\) is subsequently realized through IDFT, synthesis windowing, and overlap-add (OLA).
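As a rough illustration of this second-stage signal flow, the following NumPy sketch chains MVN, the spectral mapping, and IDFT/OLA synthesis. The DFT size, frame shift, window, and the restoration_network callable are illustrative assumptions, not values from the paper.

```python
import numpy as np

K, SHIFT = 512, 256                              # assumed DFT size and frame shift
win = np.sqrt(np.hanning(K))                     # assumed root-Hann synthesis window

def second_stage(features, mu_x2, sigma_x2, restoration_network):
    """Sketch of the restoration stage: MVN -> spectral mapping -> IDFT/OLA.

    features:            (num_frames, feat_dim) second-stage input features
    mu_x2, sigma_x2:     MVN statistics obtained during restoration network training
    restoration_network: callable mapping normalized features to a complex
                         spectrum estimate of shape (num_frames, K // 2 + 1)
    """
    x_tilde = (features - mu_x2) / sigma_x2      # mean and variance normalization
    S_hat = restoration_network(x_tilde)         # enhanced speech spectrum per frame

    num_frames = S_hat.shape[0]
    s_hat = np.zeros((num_frames - 1) * SHIFT + K)
    for l in range(num_frames):
        frame = np.fft.irfft(S_hat[l], n=K) * win    # IDFT + synthesis windowing
        s_hat[l * SHIFT : l * SHIFT + K] += frame    # overlap-add (OLA)
    return s_hat
```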


The comparison of deep learning-based methods for unseen pub and office noise at 5 dB SNR, depicted in Fig. 5, shows that all single-stage methods perform notably better in office noise according to PESQ. Pub noise, however, which contains mostly interfering speech, appears to be quite difficult for the single-stage methods. This difference in overall quality can be mitigated to some extent by the use of the proposed second stage (LSTM-cMSA+CED-cSA-tr), which improves PESQ in pub noise by an impressive 0.17 MOS points relative to LSTM-cMSA. The additional analysis for unseen traffic noise, also depicted in Fig. 5, shows that the evaluated methods also generalize well to a non-stationary noise type not containing interfering speech (whereas model training focused on noise types including interfering speech). The proposed two-stage network LSTM-cMSA+CED-cSA-tr improves on using only LSTM-cMSA by 0.11 MOS points, whereas the two-stage reference network LSTM-cMSA+DNN-cSA does not provide an improvement in speech quality in terms of PESQ. For all three evaluated noise types, using the tr- over the du-setup of our proposed system improves all three quality measures.


To further analyze the reasons for the observed quality improvements with our proposed two-stage approach, the enhanced speech spectrograms obtained with the deep learning-based methods are compared, using an exemplary test set utterance in pub noise at 5 dB SNR. The spectrograms of clean speech \(s(n)\), noisy speech \(y(n)\), and enhanced speech \(\hat{s}(n)\) for the different methods are shown in Fig. 6. Comparing the output of the two single-stage methods LSTM-MSA and LSTM-cMSA (third and fourth spectrogram from the top, respectively) shows the higher noise suppression that can be obtained with LSTM-cMSA. This comes at the cost of suppressing some parts of the speech signal as well, which can be seen in the highlighted areas of the respective spectrograms. Proceeding to the outputs after second-stage processing with DNN-cSA (third from bottom) and CED-cSA-tr (second from bottom), it can be observed that certain previously missing or distorted parts are restored (again highlighted in the respective spectrograms). Furthermore, CED-cSA-tr is able to restore the harmonic details of the original clean speech more accurately than DNN-cSA, as can be seen, e.g., in the rightmost highlighted region. We credit this to the CED topology, which, as opposed to a fully connected topology, focuses on local dependencies over frequency through the use of convolutional kernels and is able to process different frequency regions with shared parameters, which we believe to be especially advantageous for the reconstruction of harmonic structures. Moreover, the CED is able to use high-resolution information on the clean speech inherent to the noisy features directly via its skip connections, which can also aid a more detailed reconstruction. The comparison of the proposed LSTM-cMSA+CED-cSA-tr network with the high-complexity reference LSTM-cMSA+CLED-cSA-du furthermore shows that comparable speech restoration and noise suppression capabilities can be achieved with our newly proposed method, while employing significantly fewer model parameters and computational resources.
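The two properties credited here, convolutional kernels shared across frequency and skip connections carrying high-resolution input detail to the decoder, can be illustrated with a minimal PyTorch sketch of a convolutional encoder-decoder. Channel counts, kernel sizes, and the single encoder/decoder pair are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyCED(nn.Module):
    def __init__(self, in_ch=2, mid_ch=16):
        super().__init__()
        # encoder: strided convolution over the frequency axis; the same kernel
        # weights are shared across all frequency regions
        self.enc = nn.Conv1d(in_ch, mid_ch, kernel_size=5, stride=2, padding=2)
        # decoder: transposed convolution restores the frequency resolution
        self.dec = nn.ConvTranspose1d(mid_ch, in_ch, kernel_size=5, stride=2,
                                      padding=2, output_padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):        # x: (batch, 2, num_bins), real/imag parts
        h = self.act(self.enc(x))
        y = self.dec(h)
        return y + x             # skip connection: pass high-resolution input
                                 # detail directly to the output

# usage: map a 256-bin real/imag spectrum frame to an enhanced spectrum estimate
model = TinyCED()
out = model(torch.randn(1, 2, 256))    # -> shape (1, 2, 256)
```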


In this paper, we have proposed a new two-stage approach to speech enhancement, using specifically chosen network topologies for the subsequent tasks of noise suppression and restoration of natural-sounding speech. The first stage consists of an LSTM network estimating T-F masks for the real and imaginary parts of the noisy speech spectrum, while the second stage performs spectral mapping using a convolutional encoder-decoder (CED) network. Employing only the noise suppression stage trained with the complex masked spectrum approximation (cMSA) loss, we observe an impressive gain of more than 5 dB in SNR compared to the baselines, but only slight or no gains in terms of overall quality (PESQ). When employing both stages, average PESQ improvements of about 0.1 MOS points can be obtained in unseen, highly non-stationary noises including interfering speech. Furthermore, our approach improves STOI in low-SNR conditions compared to the baselines.


Note that a part of this work has been pre-published in a workshop paper [39]; however, in [39], a convolutional LSTM layer was employed and pooling and upsampling layers were used in the CED architecture, resulting in a computationally much more complex second stage. Furthermore, in this work, a much more thorough evaluation of the two-stage approach to learned speech enhancement, including all relevant single-stage baselines, is provided. In addition, an analysis of the benefits of the two-stage approach and its effects on the enhanced speech spectra, as well as an analysis of the computational complexity, are presented.

