Noise Estimation Employing Variational Model Composition for Speech Enhancement in Time-Varying Noise Conditions

This paper proposes an effective noise estimation method for speech enhancement to improve speech recognition in time-varying noise conditions. The proposed noise estimation scheme employs the Variational Model Composition (VMC) method. The VMC method generates multiple noise models by selectively applying perturbation factors to the mean parameters of a basis noise model. The resulting collection of noise models is expected to reflect the unseen noise signal present during the speech segments. The obtained noise models are used to generate multiple environmental models by means of the parallel model combination method. The noise estimate is obtained using the posterior probabilities of the multiple environmental models. The proposed noise estimation method is applied to Spectral Subtraction. The proposed speech enhancement scheme is evaluated within the Aurora 2.0 evaluation framework over speech babble and background music noise conditions. Experimental results demonstrate that the proposed method is effective at increasing speech recognition accuracy in time-varying background noise conditions.
Sunmee Kang and Wooil Kim


Introduction
Acoustic mismatch between the training and operating conditions of an actual speech recognition system is one of the primary factors that severely degrade recognition performance. To minimize this mismatch, extensive research has been conducted in recent decades, including many types of speech/feature enhancement methods such as Spectral Subtraction, Cepstral Mean Normalization, and a variety of feature compensation schemes [1]-[3]. However, the conventional methods remain ineffective in time-varying background noise conditions, where the noise characteristics need to be re-estimated as time elapses.
In this study, a novel noise estimation method for speech enhancement is proposed to address time-varying background noise for improved speech recognition. Our previous work on the Variational Model Composition (VMC) method [4][5] is employed here for noise estimation. The motivation of the VMC is that each order of the cepstral coefficients represents how rapidly the log-spectral envelope changes along frequency [6]. In the VMC method, variational noise models are generated by selectively applying perturbation factors to a basis model in the cepstral domain in order to obtain various types of spectral patterns. The VMC method proved effective when it was employed to generate multiple environmental models for our feature compensation method [5]. In this study, the posterior probability of each environmental model is used for estimating time-varying background noise. The proposed method is evaluated on two types of time-varying background noise conditions, speech babble and background music, within the Aurora 2.0 evaluation framework [7].

Variational Model Composition
In the VMC (Variational Model Composition) method [4][5], it is assumed that (i) a basis noise model can be obtained from periods of silence within the speech stream, and (ii) the target time-varying noise present during the speech can be represented as variations of the estimated basis model. The variational models are generated by selectively applying weights to the components of the mean vector of the basis model in the cepstral domain.
First, a basis noise model is obtained from the non-speech segments of the input speech, which generally exist at the beginning and end of an utterance. The model is estimated as a Gaussian pdf $\mathcal{N}(\mu, \Sigma)$ in the cepstral domain. In general, the covariance $\Sigma$ is estimated as a diagonal matrix, which reduces to a variance vector $\sigma^2$. Next, the V largest components $\{v_1, v_2, \ldots, v_V\}$ of the variance vector $\sigma^2$ are selected. They are named Variational Components and are regarded as the most variable components, ranked by size.
Finally, a variation of the mean vector is generated by selectively applying the perturbation factor $\delta_i$ to the selected variational components $v_1$ to $v_V$ of the cepstral coefficients, where $\delta_i = 0$, $-\alpha$ or $+\alpha$, and $\alpha$ is a small positive value which we determine heuristically. The resulting model collection $\{\tilde{\lambda} = \mathcal{N}(\tilde{\mu}, \Sigma)\}$ consists of a total of $3^V$ variational models, one for each combination of the three gains (i.e., 0, $-\alpha$ or $+\alpha$) over the V variational components.
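The generation of the variational model collection can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `variational_means` and the toy values are hypothetical, and the sketch assumes that the perturbation factor acts as a gain on the selected mean component, i.e. mean × (1 + δ), because the exact update equation is not reproduced in this text.

```python
from itertools import product


def variational_means(mu, var, V, alpha):
    """Return the selected component indices and the 3**V perturbed mean vectors.

    Assumption (not confirmed by the text): the perturbation is applied
    as a gain, mu[i] * (1 + delta), with delta in {0, -alpha, +alpha}.
    """
    # Indices of the V largest variances: the "variational components".
    comps = sorted(range(len(var)), key=lambda i: var[i], reverse=True)[:V]
    models = []
    # Enumerate every combination of the three gains over the V components.
    for deltas in product((0.0, -alpha, +alpha), repeat=V):
        m = list(mu)  # unselected components keep the basis value
        for i, d in zip(comps, deltas):
            m[i] = mu[i] * (1.0 + d)
        models.append(m)
    return comps, models


# Toy 4-dimensional basis noise model (hypothetical values).
mu = [10.0, -2.0, 1.0, 0.5]
var = [4.0, 0.3, 2.5, 0.1]
comps, models = variational_means(mu, var, V=2, alpha=0.1)
print(comps, len(models))
```

The variance is shared by all generated models, so only the mean vectors are returned. The first combination (all perturbations equal to zero) reproduces the basis model itself.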

Noise Estimation Employing Variational Model Composition Method
In this section, a novel noise estimation method is proposed, which employs the Variational Model Composition presented in Sec. 2. In our previous study, a feature compensation method based on the Parallel Combined Gaussian Mixture Model (PCGMM) was proposed, showing robust speech recognition performance in various types of background noise conditions [8]. A series of experiments in that study confirmed that the noise-corrupted general speech model (i.e., a Gaussian mixture model) employed by the PCGMM method effectively represents the input noise-corrupted speech. Based on this observation, we integrate the PCGMM-based model estimation method, which provides the speech model, into the noise estimation method of this study.

Speech Model Estimation
The distribution of the clean speech feature x in the cepstral domain is represented with a Gaussian Mixture Model consisting of K components as follows:

$$p(x) = \sum_{k=1}^{K} \omega_k \, \mathcal{N}(x;\, \mu_{x,k}, \Sigma_{x,k}).$$

In this study, multiple noise models are obtained by the VMC method presented in Sec. 2. Therefore, multiple noise-corrupted speech models are generated through a model combination procedure using the clean speech model and each noise model $\mathcal{N}(\tilde{\mu}, \Sigma)$ of the variational model collection.

Noise Estimation
The multiple noise models obtained by the Variational Model Composition method are used to generate the multiple environmental models $\{G_e\}$. They are estimated through the model combination procedure using the clean speech GMM and the obtained variational noise models, as described in Sec. 3.1. With V variational components, $3^V$ (= E) environmental models are generated.
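As a rough illustration of this model combination step, the sketch below builds one environmental model per variational noise model by combining every clean speech mean with the noise mean. It is a simplification under stated assumptions, not the procedure used in the paper: it uses the log-add approximation, log(exp(x) + exp(n)), it works on log-spectral values rather than cepstral coefficients, and it updates the means only. All function names and values are hypothetical.

```python
import math


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def combine_means(speech_means, noise_mean):
    """Combine each clean speech component mean with one noise mean, band by band."""
    return [[log_add(x, n) for x, n in zip(mu_x, noise_mean)]
            for mu_x in speech_means]


def build_environments(speech_means, noise_means):
    """One noisy-speech model (list of component means) per variational noise model."""
    return [combine_means(speech_means, mu_n) for mu_n in noise_means]


# Toy clean speech GMM with K = 2 components over 2 log-spectral bands,
# and E = 3 variational noise means (hypothetical values).
speech_means = [[2.0, 1.0], [0.5, 3.0]]
noise_means = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
envs = build_environments(speech_means, noise_means)
print(len(envs), len(envs[0]))
```

Because the combination is applied once per noise model, the number of environmental models equals the number of variational noise models, and each one keeps the K components of the clean speech model.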
The use of multiple environmental models is considered effective for compensating input features adaptively under time-varying noisy conditions [8]. In the multiple model method, a sequential posterior probability of each possible environment is estimated over the incoming noisy speech. Given the input noisy speech feature vectors $Y_t = [y_{t-d+1}, y_{t-d+2}, \ldots, y_t]$ over an interval of d frames, the sequential posterior probability of a specific environment GMM $G_e$ among all E models can be written as

$$P(G_e \mid Y_t) = \frac{P(G_e)\, p(Y_t \mid G_e)}{\sum_{e'=1}^{E} P(G_{e'})\, p(Y_t \mid G_{e'})},$$

where

$$p(Y_t \mid G_e) = \prod_{\tau=t-d+1}^{t} p(y_\tau \mid G_e),$$

and $P(G_e)$ is the prior probability of each environment $G_e$ represented as a GMM.
Based on Eq. (5), the noise signal in the cepstral domain at frame t is estimated as the weighted combination of the mean parameters of the variational noise models over the set of E environments, using the posterior probabilities as weights:

$$\tilde{n}_t = \sum_{e=1}^{E} P(G_e \mid Y_t)\, \tilde{\mu}_e. \tag{6}$$

We believe that the noise estimate $\tilde{n}_t$ obtained by Eq. (6) represents the change of the unseen noise signal during the speech segments in time-varying background noise.
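The posterior-weighted noise estimate described above can be sketched in a few lines. This is an illustrative sketch only, assuming diagonal-covariance GMMs, frame-independent likelihoods within the d-frame window, and computation in the log domain for numerical stability. The data layout (a GMM as a list of `(weight, mean, var)` tuples) and all function names are hypothetical.

```python
import math


def log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector y."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(y, mean, var))


def logsumexp(vals):
    top = max(vals)
    return top + math.log(sum(math.exp(v - top) for v in vals))


def gmm_loglik(y, gmm):
    """gmm is a list of (weight, mean, var) tuples."""
    return logsumexp([math.log(w) + log_gauss_diag(y, m, v) for w, m, v in gmm])


def env_posteriors(frames, env_gmms, priors):
    """Posterior of each environment given the last d frames (Bayes rule)."""
    scores = [math.log(p) + sum(gmm_loglik(y, g) for y in frames)
              for g, p in zip(env_gmms, priors)]
    norm = logsumexp(scores)
    return [math.exp(s - norm) for s in scores]


def estimate_noise(posteriors, noise_means):
    """Posterior-weighted sum of the variational noise mean vectors."""
    dim = len(noise_means[0])
    return [sum(p * mu[i] for p, mu in zip(posteriors, noise_means))
            for i in range(dim)]


# Toy 1-D case: two environments, each a single-component GMM.
env_gmms = [[(1.0, [0.0], [1.0])], [(1.0, [5.0], [1.0])]]
noise_means = [[1.0], [2.0]]           # noise mean tied to each environment
frames = [[4.8], [5.1], [5.0]]         # d = 3 noisy frames near environment 1
post = env_posteriors(frames, env_gmms, priors=[0.5, 0.5])
noise = estimate_noise(post, noise_means)
print(post, noise)
```

In this toy case the frames lie close to the second environment, so its posterior dominates and the estimate moves toward that environment's noise mean.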

Spectral Subtraction with Noise Estimate
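The obtained noise estimate $\tilde{n}_t$ is in the cepstral domain; therefore, it needs to be converted to the linear spectral domain before it can be used in Spectral Subtraction. It is first converted to the log-spectral domain using an inverse DCT (Discrete Cosine Transform) as follows:

$$\tilde{n}_t^{\{l\}} = C^{-1}\, \tilde{n}_t, \tag{7}$$

where $C^{-1}$ denotes the inverse DCT matrix and the superscript $\{l\}$ denotes the log-spectral domain. The noise signal for Spectral Subtraction is then obtained as follows:

$$\tilde{N}_{t,k} = \exp\left(\tilde{n}_{t,m}^{\{l\}}\right), \quad k \in m\text{th Mel filter-bank}. \tag{8}$$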
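The conversion and the subsequent subtraction can be sketched as below. Several details are assumptions, since the text does not specify them: the DCT is taken to be the orthonormal DCT-II, each linear frequency bin is assigned to exactly one Mel band through a hypothetical lookup list `band_of_bin`, and the subtraction uses a simple spectral floor with a hypothetical factor `floor`. The function names are illustrative.

```python
import math


def idct_ortho(cep, n_bands):
    """Inverse of an orthonormal DCT-II; truncated higher-order
    coefficients are treated as zero."""
    out = []
    for m in range(n_bands):
        s = cep[0] / math.sqrt(n_bands)
        for k in range(1, len(cep)):
            s += (math.sqrt(2.0 / n_bands) * cep[k]
                  * math.cos(math.pi * k * (2 * m + 1) / (2 * n_bands)))
        out.append(s)
    return out


def noise_spectrum(cep, n_bands, band_of_bin):
    """Cepstral noise estimate -> log Mel spectrum -> linear value per bin."""
    log_mel = idct_ortho(cep, n_bands)
    return [math.exp(log_mel[m]) for m in band_of_bin]


def spectral_subtract(noisy, noise, floor=0.01):
    """Subtract the noise estimate, keeping at least floor * noisy per bin."""
    return [max(y - n, floor * y) for y, n in zip(noisy, noise)]


# Toy case: a flat log Mel spectrum of log(2) over 2 bands has a single
# non-zero cepstral coefficient, sqrt(2) * log(2).
cep = [math.sqrt(2.0) * math.log(2.0)]
band_of_bin = [0, 0, 1]                # 3 linear bins mapped to 2 Mel bands
noise = noise_spectrum(cep, n_bands=2, band_of_bin=band_of_bin)
clean = spectral_subtract([5.0, 1.0, 2.0], noise)
print(noise, clean)
```

The floor prevents negative spectral values in bins where the noise estimate exceeds the observed spectrum. Whether the subtraction is carried out on magnitude or power spectra is left open here.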

Experimental Results
Our evaluations of the proposed method were performed within the Aurora 2.0 evaluation framework as developed by the European Language Resources Association (ELRA) [7][10]. The task is connected English-language digits consisting of eleven words. The acoustic models of the speech recognizer were trained using a database that contains 8,440 utterances of clean speech from the Aurora 2.0 database. In order to evaluate performance under time-varying background noise conditions, the speech babble condition was selected from the Aurora 2.0 test database, and a new test data set was generated by combining clean speech samples with background music, which consists of the prelude parts of ten Korean popular songs with varying degrees of beat and tempo. Each test set consists of 1,001 samples at five different SNRs: 0, 5, 10, 15, and 20 dB.
In our experiments, two types of background noise estimation methods were employed for the conventional Spectral Subtraction. In the first method, the noise signal is estimated from the non-speech parts at the beginning and end of every input speech signal, in the same way that the basis noise model is estimated for the proposed VMC-based noise estimation. We used the first 12 frames and the last 12 frames, which are the same numbers of frames as those used by the proposed method. The second noise estimation method for the conventional Spectral Subtraction employs minimum statistics based estimation [11], where the previous 25 frames were used to estimate the minimum statistics. In this paper, SS and MSS denote Spectral Subtraction with noise estimation using non-speech segments (SS) and minimum statistics (MSS), respectively.
Table 1 shows the speech recognition performance (i.e., Word Error Rate, WER) of the baseline system (i.e., no processing), the conventional algorithms (SS and MSS), and the proposed method (VSS) at each SNR condition of speech babble noise. The performance combined with CMN is also shown. In the table, Spectral Subtraction with noise estimation during non-speech segments (SS) is slightly better than the proposed speech enhancement method (VSS).
Here the proposed method (VSS+CMN) shows the best performance compared to the conventional methods (SS and MSS) over all SNR conditions except 0 dB SNR. The SS significantly outperforms the MSS; however, the MSS is considerably better than the SS when combined with CMN. We believe that the SS is effective at increasing SNR by suppressing the noise, but the reconstructed speech is distorted by residual noise, resulting in low recognition performance when combined with CMN. It is worth noting that the proposed speech enhancement method is effective in both cases (i.e., without and with CMN) in the speech babble noise condition.
Table 2 shows the speech recognition performance without and with CMN at each SNR of the background music condition. Here the SS is consistently the best in both cases, while the MSS is not effective even compared to the baseline (i.e., no processing). The proposed method shows performance comparable to the SS at low SNR conditions such as 10, 5, and 0 dB SNR when combined with CMN.
It is interesting to compare the performance of SS and MSS for speech babble and music noise when combined with CMN in Tables 1 and 2. The MSS is highly effective for the speech babble noise; however, it shows the lowest recognition performance (i.e., the highest WER) for the background music noise. On the contrary, the SS shows the best performance for the music noise but the lowest performance for the speech babble. These results show that the conventional noise estimation methods for Spectral Subtraction are not consistently effective across different time-varying background noise conditions. The proposed VMC-based noise estimation with Spectral Subtraction is consistently effective for both noise conditions.
Table 3 shows the speech recognition performance without and with CMN at each SNR, averaged over the speech babble and music noise conditions. Here the proposed method (VSS+CMN) shows relative WER improvements of 9.02% and 4.86% over SS+CMN and MSS+CMN, respectively. These results demonstrate that the proposed noise estimation is effective at improving speech enhancement for speech recognition systems in different types of time-varying noise conditions such as speech babble and background music noise.

[...] environmental models employing the model combination method. The noise estimate was obtained using the posterior probabilities of the multiple environmental models. The proposed noise estimation method was employed for Spectral Subtraction (SS). The proposed speech enhancement scheme was evaluated within the Aurora 2.0 evaluation framework over speech babble and background music conditions. Experimental results demonstrated that the proposed method is effective at increasing speech recognition performance in time-varying background noise conditions.