Acoustic Echo Cancellation Techniques for Far-End Telephony Speech Recognition in Barge-In Situations

Jong Han Joo et al.

In this paper, we present acoustic echo cancellation techniques for far-end (server-side) telephony speech recognition during barge-in situations. We develop a normalized least mean square algorithm for the adaptive filter of an echo canceller, and a double-talk detector for online speech recognition services. In particular, we devise a voice activity detector for estimating the initial delay due to communication networks. In addition, we propose a hybrid method that uses the log-spectral distance measure, as well as the cross-correlation coefficients, to estimate the initial delay. From the simulation and the experiments in real environments, we conclude that the developed techniques can be successfully used for far-end telephony speech recognition services.


Introduction
In a far-end (server-side) or a near-end (terminal-side) speech recognition service, if the server and the user speak simultaneously, the user's voice and the server's message signal will reach the near-end microphone at the same time. This barge-in situation seriously degrades speech recognition performance [1]. Therefore, we use echo cancellation to remove the echo of the message signal from the microphone input. The quality of the echo canceller depends on the speed of convergence and the accuracy of the adaptive filter [2]. Echo paths consist of an initial time delay with no echo signal, and active regions in which the echo signal is present. To save computational costs and increase echo cancellation performance, we use an adaptive filter to match the echo path impulse response only in the active region. To accomplish this, we must develop an algorithm to estimate the initial delay and to identify the active region.
First, we devised an automatic voice activity detector (VAD) to find the active region of the message prompt signal and make a reference segment. Then, we developed methods to estimate the initial time delay.
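As an illustration of how a VAD can pick a reference segment out of the message prompt, the following minimal sketch assumes a simple frame-energy rule. The function names, the non-overlapping frames, the threshold, and the `min_frames` parameter are hypothetical choices for this example and are not taken from the paper.

```python
def frame_energies(signal, frame_len):
    """Average energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def find_reference_segment(signal, frame_len=4, threshold=0.01, min_frames=2):
    """Return (start, end) sample indices of the first active region, or None.

    A region is a run of at least `min_frames` consecutive frames whose
    average energy exceeds `threshold` (assumed rule, for illustration only).
    """
    energies = frame_energies(signal, frame_len)
    run_start = None
    for i, energy in enumerate(energies):
        if energy > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                return run_start * frame_len, i * frame_len
            run_start = None
    if run_start is not None and len(energies) - run_start >= min_frames:
        return run_start * frame_len, len(energies) * frame_len
    return None


# Toy message prompt: silence, a burst of "speech", then silence again.
message = [0.0] * 8 + [0.5, -0.5] * 6 + [0.0] * 8
segment = find_reference_segment(message)
print(segment)  # (8, 20)
```

The returned sample range marks the portion of the message signal that would be saved as the reference segment for delay estimation.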
Cross-correlation coefficients (CCCs) between the reference segment and an input signal segment are computed in a conventional manner, and the initial delay is estimated from the index of the peak value over the cross-correlation lags. However, the CCC-based method may exhibit poor performance when used with colored input signals such as speech signals [3]. In this research work, we propose a hybrid method that uses the log-spectral distance (LSD) measure as well as the CCC to estimate the initial delay.
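The conventional CCC-based estimate described above can be sketched as follows. This is a minimal illustration that assumes a normalized cross-correlation without mean removal and an exhaustive search over integer lags. The names `ccc` and `estimate_delay_ccc` and the toy signals are hypothetical.

```python
import math


def ccc(a, b):
    """Normalized cross-correlation coefficient of two equal-length segments."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def estimate_delay_ccc(reference, mic, max_lag):
    """Return (lag, peak): the lag whose microphone segment best matches
    the reference segment, and the CCC value at that lag."""
    n = len(reference)
    best_lag, best_val = 0, -1.0
    for lag in range(0, min(max_lag, len(mic) - n) + 1):
        val = ccc(reference, mic[lag:lag + n])
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag, best_val


# Toy example: the microphone receives an attenuated copy delayed by 5 samples.
reference = [0.9, 1.0, -0.5, 0.8, -1.0, 0.3, 0.6, -0.2]
mic = [0.0] * 5 + [0.6 * s for s in reference] + [0.0] * 4
lag, peak = estimate_delay_ccc(reference, mic, max_lag=8)
print(lag)  # 5
```

Because the coefficient is normalized, the attenuation of the echo does not affect the location of the peak.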
Since the echo signal is usually modeled as the convolution of the transmitted message signal and an echo path impulse response, an adaptive filter is used to estimate the echo path impulse response. However, the characteristics of the echo path vary depending on the surrounding conditions, and therefore, the echo canceller generally updates the filter coefficients using an adaptive algorithm [4]. We utilize a normalized least mean square (NLMS) algorithm as the adaptive algorithm.
The remainder of this paper is organized as follows: Section 2 describes the barge-in situation and our developed echo cancellation techniques. Section 3 describes the performance evaluation. Finally, Section 4 concludes the paper.

Barge-In Situation and Developed Echo Cancellation Techniques
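Fig. 1 shows the echo cancellation system in our research work. In the figure, $x(n)$ is the message signal, $v(n)$ is the near-end user's voice, $d(n)$ is the microphone input signal, $h(n)$ is the impulse response of the room echo path, $\hat{h}(n)$ is the adaptive filter coefficient, $\hat{y}(n)$ is the estimate of the echo $y(n)$, and $e(n)$ is the error signal.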

Voice Activity Detection and Delay Estimation
First, as shown in Fig. 2 (a), we developed a VAD algorithm to detect active regions in the message signal. The VAD flag is set if the average energy over several frames exceeds a predefined threshold. Then, a speech segment of the message signal is saved for delay estimation. The CCC between the saved segment and a segment of the microphone input signal has a peak value when the two segments match closely, and the CCC flag is set if the peak CCC value is greater than a predefined threshold. Moreover, we propose a hybrid method that uses the LSD measure as well as the CCC to estimate the initial delay. The LSD is a distance measure between two spectra, so a small LSD value indicates a close spectral match.
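One way to combine the two measures is sketched below. The exact combination rule of the hybrid method is not recoverable from this text, so the rule used here is an assumption: the delay is accepted when the CCC peak exceeds a threshold and the LSD minimum falls within a small tolerance of the same lag. The LSD is computed as the root-mean-square difference of log power spectra after unit-energy normalization, using a direct DFT to stay dependency-free. All function names and parameters are hypothetical.

```python
import cmath
import math


def ccc(a, b):
    """Normalized cross-correlation coefficient of two equal-length segments."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def unit_energy(frame):
    """Scale a frame to unit energy so that LSD ignores the echo gain."""
    energy = sum(s * s for s in frame)
    return [s / math.sqrt(energy) for s in frame] if energy > 0.0 else list(frame)


def log_power_spectrum(frame, floor=1e-12):
    """Log power spectrum in dB (bins 0..N/2) computed with a direct DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(10.0 * math.log10(abs(acc) ** 2 + floor))
    return spectrum


def lsd(frame_a, frame_b):
    """Log-spectral distance: RMS difference of the two log power spectra."""
    sa = log_power_spectrum(unit_energy(frame_a))
    sb = log_power_spectrum(unit_energy(frame_b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sa, sb)) / len(sa))


def hybrid_delay(reference, mic, max_lag, ccc_threshold=0.5, tolerance=1):
    """Return (ccc_lag, lsd_lag, accepted) under the assumed agreement rule."""
    n = len(reference)
    lags = range(0, min(max_lag, len(mic) - n) + 1)
    ccc_vals = [ccc(reference, mic[lag:lag + n]) for lag in lags]
    lsd_vals = [lsd(reference, mic[lag:lag + n]) for lag in lags]
    ccc_lag = max(lags, key=lambda lag: ccc_vals[lag])
    lsd_lag = min(lags, key=lambda lag: lsd_vals[lag])
    accepted = (ccc_vals[ccc_lag] > ccc_threshold
                and abs(ccc_lag - lsd_lag) <= tolerance)
    return ccc_lag, lsd_lag, accepted


reference = [0.9, 1.0, -0.5, 0.8, -1.0, 0.3, 0.6, -0.2]
mic = [0.0] * 5 + [0.6 * s for s in reference] + [0.0] * 4
result = hybrid_delay(reference, mic, max_lag=8)
print(result)  # (5, 5, True)
```

Requiring the two measures to agree is one plausible way to reject spurious CCC peaks on colored signals such as speech. A practical implementation would use an FFT and a coarser lag grid rather than a DFT per sample lag.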

NLMS Algorithm
The NLMS algorithm belongs to a class of adaptive filters that mimic a desired filter by iteratively finding its coefficients [5,6]. In this research work, the NLMS algorithm is adopted to estimate the impulse response $h(n)$ of the room echo path, as shown in Fig. 1. The filter coefficient $\hat{h}(n)$ is continuously updated for each sample using Eq. (3), where $\mu$ is the step size that determines the convergence speed [7].
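For readers who want to see the update in action, the sketch below implements the textbook NLMS recursion, $\hat{h}(n+1) = \hat{h}(n) + \mu\, e(n)\, x(n) / (x^T(n) x(n) + \epsilon)$, on a toy echo path. The regularization constant $\epsilon$, the 4-tap filter, the step size of 0.5, and the synthetic signals are assumptions chosen to keep the example small. They are not the settings evaluated in this paper.

```python
import random


def nlms_echo_canceller(x, d, filter_len=4, mu=0.5, eps=1e-8):
    """Run NLMS sample by sample; return (error signal, final coefficients)."""
    h_hat = [0.0] * filter_len
    x_buf = [0.0] * filter_len                 # most recent sample first
    errors = []
    for n in range(len(x)):
        x_buf = [x[n]] + x_buf[:-1]
        y_hat = sum(w * s for w, s in zip(h_hat, x_buf))   # estimated echo
        e = d[n] - y_hat                                   # residual after cancellation
        norm = sum(s * s for s in x_buf) + eps             # input energy (regularized)
        h_hat = [w + mu * e * s / norm for w, s in zip(h_hat, x_buf)]
        errors.append(e)
    return errors, h_hat


# Toy setup: white-noise "message" passed through a known 4-tap echo path.
rng = random.Random(0)
true_h = [0.5, -0.3, 0.2, 0.1]
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
d = [sum(true_h[k] * x[n - k] for k in range(len(true_h)) if n - k >= 0)
     for n in range(len(x))]

errors, h_hat = nlms_echo_canceller(x, d)
print([round(w, 3) for w in h_hat])  # [0.5, -0.3, 0.2, 0.1]
```

With no near-end speech and a white input, the estimate converges to the true echo path. Real speech input and longer filters converge more slowly, which is why the step size and filter length have to be tuned.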

Double-Talk Detection
In a barge-in situation, when the near-end speech and far-end speech occur simultaneously, the so-called double-talk (DT) mode, the adaptation of the adaptive filter will be severely disturbed by the near-end signal [8]. If the recognition system stops transmitting the message signal when DT is detected, recognition performance will be greatly improved. In addition, the DT detection (DTD) result can serve as the criterion for deciding whether the filter coefficients should be updated.
The error signal $e(n)$ is relatively small before the user's voice occurs and relatively large afterward, as shown in Eqs. (4) and (5).
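To illustrate how the error signal behaves before and after the near-end speech begins, the sketch below computes a frame-wise error energy and flags double talk at the first frame where it exceeds a threshold. Normalizing by the microphone input energy, the frame length, and the threshold value are assumptions made for this example. The function names are hypothetical.

```python
def normalized_error_energy(e, d, frame_len, eps=1e-12):
    """Per-frame ratio of error energy to microphone input energy."""
    ratios = []
    for start in range(0, len(e) - frame_len + 1, frame_len):
        err = sum(v * v for v in e[start:start + frame_len])
        ref = sum(v * v for v in d[start:start + frame_len])
        ratios.append(err / (ref + eps))
    return ratios


def detect_double_talk(e, d, frame_len=4, threshold=0.5):
    """Return the first sample index of the frame where DT is flagged, or None."""
    for i, ratio in enumerate(normalized_error_energy(e, d, frame_len)):
        if ratio > threshold:
            return i * frame_len
    return None


# Toy signals: a converged filter removes 98% of the echo amplitude,
# and the near-end user starts speaking at sample 8.
y = [0.4, -0.4] * 8                                     # echo at the microphone
v = [0.0] * 8 + [0.5, 0.3, -0.4, 0.6, -0.5, 0.2, 0.4, -0.3]  # near-end voice
d = [a + b for a, b in zip(y, v)]                       # microphone input
e = [di - 0.98 * yi for di, yi in zip(d, y)]            # error after cancellation

dt_point = detect_double_talk(e, d)
print(dt_point)  # 8
```

Before the user speaks, the residual is a small fraction of the microphone energy. Once the near-end voice appears, it passes straight through to the error signal and the ratio jumps.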
We determine the DTD point when the normalized error energy $E_{\mathrm{err}}(n)$ in Eq. (6) exceeds a predefined threshold, as shown in Fig. 3.

In the simulation, we used a woman's voice as the message signal $x(n)$, and the voice signals of five men and five women as the input signal $v(n)$. The audio files were sampled at 16 kHz. We assumed that the distance between the microphone and speaker was 10 cm. In addition, we generated an artificial echo path impulse response for the simulation experiments.
First, we verified the proposed hybrid method that uses both the LSD measure and the CCC. Fig. 4 shows the CCC and LSD values obtained using two speech samples of the message and microphone input. We found that the maximum value of the CCCs and the minimum LSD value were at approximately the same position. Therefore, we concluded that the message signal matched closely with the microphone input signal at that position.
To find the appropriate filter length and step size, and to evaluate how well the acoustic echo signal was removed, we utilized the echo-return loss enhancement (ERLE) [9], which is defined in terms of the echo $y(n)$ and its estimated value $\hat{y}(n)$. Fig. 5 shows the ERLE results when the filter lengths are 128, 256, and 512, and the step size belongs to (0.01, 0.1]. From the results, we determined that the proper filter length was 256 and the proper step size was 0.01 to 0.02.
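As a small worked example of the ERLE measure, the sketch below uses the commonly cited form, $10 \log_{10}$ of the echo energy divided by the energy of the residual $y(n) - \hat{y}(n)$. This form is assumed here. The function name `erle_db` and the toy values are hypothetical.

```python
import math


def erle_db(y, y_hat, eps=1e-12):
    """ERLE in dB: echo energy over residual-echo energy (assumed standard form)."""
    echo = sum(v * v for v in y)
    residual = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return 10.0 * math.log10((echo + eps) / (residual + eps))


# An estimate that captures 90% of the echo amplitude leaves a residual
# with 1% of the echo energy, which corresponds to about 20 dB.
y = [0.4, -0.2, 0.1, 0.3]
y_hat = [0.9 * v for v in y]
value = erle_db(y, y_hat)
print(round(value, 2))  # 20.0
```

A higher ERLE means more of the echo has been removed, so sweeping the filter length and step size and comparing ERLE curves is a natural way to pick operating values.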

Experiments in Real Environments
We integrated the developed techniques: VAD, delay estimation, echo cancellation, and double-talk detection. Then, we implemented the integrated program in a real environment. As shown in Fig. 6, each technique works well, and the implemented echo cancellation technology successfully operates in a real online environment.

Conclusion
We described acoustic echo cancellation techniques developed for far-end telephony speech recognition in a barge-in situation. We developed the adaptive filter and the double-talk detector for online speech recognition services. In particular, we devised a VAD to estimate the initial time delay in communication networks. Furthermore, we proposed a hybrid method that uses the LSD measure as well as the CCCs to estimate the initial delay. The simulation results and the experimental results in real environments showed that the developed techniques can be used successfully for far-end telephony speech recognition services.

Fig. 1. System block diagram of the echo canceller

Fig. 2 (b) shows the block diagram when the segment of the microphone input signal $\{d(n_s)\}$ matches closely with the saved message segment. To determine this, we use the cross-correlation coefficient (CCC).

Fig. 4. Values of (a) CCCs and (b) LSD at the message signal, in accordance with the microphone input signal