Ultra Dual-Path Compression For Joint Echo Cancellation and Noise Suppression

Accepted by Interspeech 2023

Authors: Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng

Tencent AI Lab, Audio and Speech Signal Processing Oteam

Email: erichtchen@tencent.com

Abstract: Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to replace manually designed filters for dimension reduction. For time compression, only using frame skipped prediction causes large performance degradation, which can be alleviated by a post-processing network with full sequence modeling. We have found that under fixed compression ratios, dual-path compression combining both the time and frequency methods will give further performance improvement, covering compression ratios from 4x to 32x with little model size change. Moreover, the proposed models show competitive performance compared with fast FullSubNet and DeepFilterNet.

Core source code is available.

Poster

A Quick Review

Background

Model performance depends on (1) the architecture and (2) the number of parameters and the computational cost. Once the architecture is fixed, both the parameter count and the computational cost can be tuned through hyperparameters.

Echo cancellation (AEC) and noise suppression (ANS) for full-duplex communication must run at low computational cost when deployed on PCs and phones. An important research topic is therefore to reduce computational cost while keeping performance degradation small.

Experimental setup

Models are trained on a simulated dataset for 100K iterations. Performance could be improved further with more iterations, but this budget was chosen to allow fast comparison.

Models are tested on a semi-simulated dataset in which the echoes are real recordings from the AEC challenge.

Models use a window size of 20ms and a hop size of 10ms. No additional look-ahead is allowed, since processing is causal.
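As a rough illustration of what these framing settings imply, the sketch below converts the 20ms window and 10ms hop into sample counts and counts the frames in one second of audio. The 16 kHz sample rate is an assumption made for the example; it is not stated on this page.

```python
def frame_params(sample_rate_hz, win_ms=20, hop_ms=10):
    """Convert window and hop sizes from milliseconds to samples."""
    win = sample_rate_hz * win_ms // 1000
    hop = sample_rate_hz * hop_ms // 1000
    return win, hop


def num_frames(num_samples, win, hop):
    """Number of full frames when no padding is applied."""
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


# Assumed sample rate (16 kHz), used only for illustration.
win, hop = frame_params(16000)
print(win, hop)                      # 320 160
print(num_frames(16000, win, hop))   # 99 full frames in one second
```

With no look-ahead, each output frame depends only on samples up to the end of its own 20ms window.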

Framework & Method

The whole framework consists of a linear AEC (Kalman-filter based), an online DPT-FSNet, compression and decompression modules and a post network.

Compression and decompression operate on the input spectra and the output features, respectively. The compression module compresses the input spectra first along the time axis and then along the frequency axis, using linear transforms. Time compression uses the past and current frames, while frequency compression follows the Mel scale. The decompression module decompresses the output features first along the frequency axis and then along the time axis, again using linear transforms. Time decompression predicts the current feature and copies it to future frames. A post network (a 1-layer GRU) is needed to remove the degradation caused by time compression.
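The order of operations described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names are hypothetical, the random matrices stand in for the trainable (Mel-scale) filters, the spectra are real-valued for simplicity, and the time decompression is reduced to the copy step.

```python
import numpy as np


def time_compress(x, factor, weight):
    """Keep every `factor`-th frame, built from the current and past frames.

    x: (frames, bins); weight: (factor * bins, bins), a stand-in for a
    trainable linear transform.
    """
    frames, bins = x.shape
    # Causal padding: frames before the start of the signal count as zeros.
    padded = np.concatenate([np.zeros((factor - 1, bins)), x], axis=0)
    kept = []
    for t in range(0, frames, factor):       # frame skipping
        context = padded[t:t + factor]       # frames t-factor+1 .. t
        kept.append(context.reshape(-1) @ weight)
    return np.stack(kept)


def time_decompress(y, factor, frames):
    """Copy each predicted frame to the following factor-1 frames, then trim."""
    return np.repeat(y, factor, axis=0)[:frames]


rng = np.random.default_rng(0)
frames, bins, t_factor, f_factor = 9, 8, 4, 2   # toy sizes, not the paper's

# Hypothetical random weights standing in for the learned filters.
w_time = rng.standard_normal((t_factor * bins, bins))
w_freq_down = rng.standard_normal((bins, bins // f_factor))
w_freq_up = rng.standard_normal((bins // f_factor, bins))

spectra = rng.standard_normal((frames, bins))

# Compression: time axis first, then frequency axis.
compressed = time_compress(spectra, t_factor, w_time) @ w_freq_down
# (The enhancement network would run on `compressed` here.)
# Decompression: frequency axis first, then time axis.
restored = time_decompress(compressed @ w_freq_up, t_factor, frames)

print(compressed.shape, restored.shape)  # (3, 4) (9, 8)
```

Because every block of `t_factor` output frames shares one prediction, the restored sequence is piecewise constant in time, which is the kind of degradation the post network is meant to address.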

Results

1. Time compression needs an ultra-light post-net.

2. Dual-path compression performs better than single-path compression. For example, going from 1.8G MACs/s to 140M MACs/s gives a 13x compression ratio with a PESQ drop of only 0.22, which is 0.16 and 0.14 PESQ higher than single-path compression.
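The compression ratios quoted here follow directly from the MACs/s figures reported on this page. A small check, where the helper name is hypothetical:

```python
def compression_ratio(base_macs_per_s, compressed_macs_per_s):
    """Ratio of uncompressed to compressed computational cost."""
    return base_macs_per_s / compressed_macs_per_s


base = 1.8e9  # uncompressed model, MACs/s

ratio_t4f4 = round(compression_ratio(base, 140e6), 1)
ratio_t2f2 = round(compression_ratio(base, 486e6), 1)

print(ratio_t4f4)  # 12.9, reported as 13x
print(ratio_t2f2)  # 3.7
```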

Detailed results

Double talk

Uncompressed model	Dual-path (T2 x F2)	Dual-path (T4 x F4)
MACs: 1.8G/s	MACs: 486M/s	MACs: 140M/s

Near-end single talk

Uncompressed model	Dual-path (T2 x F2)	Dual-path (T4 x F4)
MACs: 1.8G/s	MACs: 486M/s	MACs: 140M/s

Far-end single talk

Uncompressed model	Dual-path (T2 x F2)	Dual-path (T4 x F4)
MACs: 1.8G/s	MACs: 486M/s	MACs: 140M/s

Real-time factor (RTF) test

RTF test on an Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz using a model with a 13x compression ratio (T4 x F4).
The test is carried out with frame-by-frame processing to mimic real-world online processing.
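A frame-by-frame RTF measurement of this kind could be sketched as below. The function names and the dummy model are hypothetical placeholders, and the 160-sample frames assume 16 kHz audio with the 10ms hop. RTF is taken here as processing time divided by the duration of audio processed, so values below 1 mean faster than real time.

```python
import time


def measure_rtf(process_frame, frames, hop_seconds=0.010):
    """Process frames one at a time and return the real-time factor."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)          # one call per hop, as in online use
    elapsed = time.perf_counter() - start
    audio_seconds = len(frames) * hop_seconds
    return elapsed / audio_seconds


def dummy_model(frame):
    """Placeholder for the real network: scales each sample."""
    return [0.5 * s for s in frame]


# 200 hops of silence, i.e. 2 seconds of audio under the assumed settings.
frames = [[0.0] * 160 for _ in range(200)]
rtf = measure_rtf(dummy_model, frames)
print(f"RTF = {rtf:.4f}")
```

Calling the model once per hop, rather than on the whole utterance, includes the per-call overhead that an online deployment would actually incur.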

Related works

[1] F. Dang, H. Chen, and P. Zhang, “Dpt-fsnet: Dual-path transformer based full-band and sub-band fusion network for speech enhancement,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022. IEEE, 202

[2] S. Zhang, Z. Wang, J. Sun, Y. Fu, B. Tian, Q. Fu, and L. Xie, “Multi-task deep residual echo suppression with echo-aware loss,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. IEEE, 2022, pp. 9127–9

[3] X. Hao and X. Li, “Fast fullsubnet: Accelerate full-band and subband fusion model for single-channel speech enhancement,” arXiv preprint arXiv:2212.09019, 2022