Poster

A Quick Review

Background

Model performance depends on (1) the architecture and (2) the number of parameters and the computational cost. Once the architecture is fixed, both the number of parameters and the computational cost can be tuned through hyperparameters.

AEC and ANS for full-duplex communication must run at low computational cost when deployed on PCs and phones. An important research topic is therefore how to reduce computational cost while keeping performance degradation small.

Experimental setup

Models are trained on a simulated dataset for 100K iterations. Performance could be improved further with more iterations, but this budget is used to allow fast comparison.

Models are tested on a semi-simulated dataset in which the echoes are real recordings from the AEC challenge.

Models use a window size of 20 ms and a hop size of 10 ms. No additional look-ahead is allowed, since processing must be causal.
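To make the framing concrete, the sketch below converts the 20 ms window and 10 ms hop into sample counts and counts the frames available without look-ahead. The 16 kHz sampling rate and the helper names `frame_params` and `num_causal_frames` are assumptions for illustration; the poster does not state them.

```python
def frame_params(sample_rate_hz, window_ms=20, hop_ms=10):
    """Convert window and hop durations (ms) to sample counts."""
    window = sample_rate_hz * window_ms // 1000
    hop = sample_rate_hz * hop_ms // 1000
    return window, hop


def num_causal_frames(num_samples, window, hop):
    """Count full frames when each frame ends at or before the last
    available sample, i.e. no future samples are used."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


SAMPLE_RATE_HZ = 16_000  # assumption: not given on the poster

window, hop = frame_params(SAMPLE_RATE_HZ)
frames_per_second = SAMPLE_RATE_HZ // hop

print(window, hop, frames_per_second)
print(num_causal_frames(SAMPLE_RATE_HZ, window, hop))
```

Under this assumed sampling rate, the model has to produce one output frame every 10 ms, which is the budget that the computational-cost figures are measured against.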

Framework & Method

The framework consists of a linear AEC (Kalman-filter based), an online DPT-FSNet, compression and decompression modules, and a post network.

Compression and decompression operate on the input spectra and the output features, respectively. The compression module applies linear transforms to the input spectra, first along the time axis and then along the frequency axis. Time compression uses only the past and current frames, while frequency compression follows the Mel scale. The decompression module applies linear transforms to the output features, first along the frequency axis and then along the time axis. Time decompression predicts the feature for the current frame and copies it to the future frames. A post network (a 1-layer GRU) is needed to remove the degradation caused by time compression.
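The data flow described above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the poster's implementation: the time transform is taken to be a single weight vector that mixes T frames into one, the frequency transform is a fixed triangular Mel filterbank, its pseudo-inverse stands in for the decompression transform, and the DPT-FSNet and the GRU post network that sit between and after these steps are omitted. The function names, the 161-bin spectrum and the 16 kHz rate are all hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_matrix(num_bins, num_bands, sample_rate_hz):
    """Triangular Mel filterbank of shape (num_bins, num_bands)."""
    bin_hz = np.linspace(0.0, sample_rate_hz / 2.0, num_bins)
    top_mel = hz_to_mel(sample_rate_hz / 2.0)
    edges = mel_to_hz(np.linspace(0.0, top_mel, num_bands + 2))
    fb = np.zeros((num_bins, num_bands))
    for b in range(num_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        fb[:, b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def compress(spec, time_weights, fb):
    """(frames, bins) -> (steps, bands): time first, then frequency."""
    T = len(time_weights)
    # Zero-pad the past so that step k, computed at frame k*T, sees only
    # frames k*T-(T-1) .. k*T (past and current, no look-ahead).
    padded = np.concatenate([np.zeros((T - 1, spec.shape[1])), spec])
    starts = range(0, spec.shape[0], T)
    mixed = np.stack([time_weights @ padded[s:s + T] for s in starts])
    return mixed @ fb


def decompress(feat, fb_inv, T, num_frames):
    """(steps, bands) -> (num_frames, bins): frequency first, then time."""
    full_band = feat @ fb_inv
    # The feature predicted at frame k*T is copied to the next T-1 frames.
    return np.repeat(full_band, T, axis=0)[:num_frames]


rng = np.random.default_rng(0)
spec = rng.random((10, 161))             # toy input: 10 frames, 161 bins
T, F = 4, 4                              # assumed meaning of "T4 x F4"
fb = mel_matrix(161, 161 // F, 16_000)   # 161 bins -> 40 Mel bands
time_weights = np.full(T, 1.0 / T)       # stand-in for learned weights

feat = compress(spec, time_weights, fb)
out = decompress(feat, np.linalg.pinv(fb), T, spec.shape[0])
print(feat.shape, out.shape)
```

In this sketch the network between `compress` and `decompress` would run once every T frames on a spectrum with roughly 1/F as many bands, which is where the savings in MACs would come from. Because each group of T output frames shares one predicted feature, some frame-level detail is lost, which is consistent with the poster's remark that a post network is needed after time compression.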

Results

1. Time compression needs an ultra-light post-net.

2. Dual-path compression performs better than single-path compression. For example, going from 1.8G MACs/s to 140M MACs/s gives a 13x compression ratio with a PESQ drop of only 0.22, which is 0.16 and 0.14 PESQ higher than single-path compression.
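The compression ratios follow directly from the MACs figures reported on the poster. The short calculation below reproduces them and also converts the MACs/s values into a per-frame budget using the 10 ms hop; the variable and function names are illustrative only.

```python
UNCOMPRESSED_MACS_PER_S = 1.8e9   # uncompressed model
T2F2_MACS_PER_S = 486e6           # dual-path, T2 x F2
T4F4_MACS_PER_S = 140e6           # dual-path, T4 x F4
HOP_SECONDS = 0.010               # 10 ms hop from the experimental setup


def compression_ratio(baseline, compressed):
    """How many times cheaper the compressed model is."""
    return baseline / compressed


ratio_t2f2 = compression_ratio(UNCOMPRESSED_MACS_PER_S, T2F2_MACS_PER_S)
ratio_t4f4 = compression_ratio(UNCOMPRESSED_MACS_PER_S, T4F4_MACS_PER_S)
macs_per_frame_t4f4 = T4F4_MACS_PER_S * HOP_SECONDS

print(f"T2 x F2: {ratio_t2f2:.1f}x")
print(f"T4 x F4: {ratio_t4f4:.1f}x")
print(f"T4 x F4 MACs per 10 ms frame: {macs_per_frame_t4f4:.2e}")
```

The T4 x F4 ratio comes out slightly below 13 and rounds to the 13x quoted above.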

Detailed results

Double talk

[Figure panels: MIC; LPB (REF); Uncompressed model (MACs: 1.8G/s); Dual-path T2 x F2 (MACs: 486M/s); Dual-path T4 x F4 (MACs: 140M/s)]

Near-end single talk

[Figure panels: MIC; Uncompressed model (MACs: 1.8G/s); Dual-path T2 x F2 (MACs: 486M/s); Dual-path T4 x F4 (MACs: 140M/s)]

Far-end single talk

[Figure panels: MIC; LPB (REF); Uncompressed model (MACs: 1.8G/s); Dual-path T2 x F2 (MACs: 486M/s); Dual-path T4 x F4 (MACs: 140M/s)]

Real-time factor (RTF) test

RTF is measured on an Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz using the model with a 13x compression ratio (T4 x F4).
The test is carried out frame by frame to mimic real-world online processing.
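A frame-by-frame RTF measurement can be sketched as below: RTF is the processing time divided by the duration of the audio processed, so a value below 1 means the model keeps up with real time. The `real_time_factor` helper and the `dummy_model` stand-in are hypothetical; the actual test harness is not described on the poster.

```python
import time


def real_time_factor(process_frame, frames, hop_seconds):
    """Processing time divided by audio duration, one frame at a time."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)  # one call per hop, as in online processing
    elapsed = time.perf_counter() - start
    audio_seconds = len(frames) * hop_seconds
    return elapsed / audio_seconds


def dummy_model(frame):
    """Hypothetical stand-in for the compressed model's per-frame step."""
    return [0.5 * x for x in frame]


# 200 frames of 160 samples each: 2 s of audio at an assumed 16 kHz rate.
frames = [[0.0] * 160 for _ in range(200)]
rtf = real_time_factor(dummy_model, frames, hop_seconds=0.010)
print(f"RTF = {rtf:.4f}")
```

Calling the model once per hop, rather than on the whole utterance at once, keeps the measurement closer to online use, where batching across time is not available.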

Related works

[1] F. Dang, H. Chen, and P. Zhang, “DPT-FSNet: Dual-path transformer based full-band and sub-band fusion network for speech enhancement,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022. IEEE, 202

[2] S. Zhang, Z. Wang, J. Sun, Y. Fu, B. Tian, Q. Fu, and L. Xie, “Multi-task deep residual echo suppression with echo-aware loss,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. IEEE, 2022, pp. 9127–9

[3] X. Hao and X. Li, “Fast FullSubNet: Accelerate full-band and subband fusion model for single-channel speech enhancement,” arXiv preprint arXiv:2212.09019, 2022