Complexity Scaling for Speech Denoising

Accepted by ICASSP 2024

Authors: Hangting Chen & Jianwei Yu & Chao Weng

Tencent AI Lab, Audio and Speech Signal Processing Oteam

Email: chenhangting17@mails.ucas.ac.cn

Abstract: Computational complexity is critical when deploying deep learning-based speech denoising models for on-device applications. Most prior research focused on optimizing model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limitations. This study conducts complexity scaling for speech denoising tasks, aiming to consolidate models with various complexities into a unified architecture. We present a Multi-Path Transform-based (MPT) architecture to handle both low- and high-complexity scenarios. A series of MPT networks present high performance covering a wide range of computational complexities on the DNS challenge dataset. Moreover, inspired by the scaling experiments in natural language processing, we explore the empirical relationship between model performance and computational cost on the denoising task. As the complexity number of multiply-accumulate operations (MACs) is scaled from 50M/s to 15G/s on MPT networks, we observe a linear increase in the values of PESQ-WB and SI-SNR, proportional to the logarithm of MACs, which might contribute to the understanding and application of complexity scaling in speech denoising tasks.

Paper link: https://arxiv.org/pdf/2309.07757v1.pdf

This paper is accepted by ICASSP2024.

A Quick Review

Motivation

(1) Most previous works cover a narrow range of computational costs. Researchers must design models with different target computational costs to satisfy various resource requirements for real-world applications, which might lead to distinct neural network architectures.

(2) Neural scaling law indicates empirical relationships between parameters, facilitating designing and understanding neural network models. However, studies have yet to be conducted to present scaling laws in speech signal processing.

Experimental setup

The experiments were conducted on (1) a simulated dataset using Librispeech for ablation and (2) DNS challenge dataset to cover the complete scaling.

Framework & Method

(1) Multi-Path Transformer (MPT) network is proposed to satisfy different computational complexities towards the denoising tasks.

(2) Practical network scaling techniques are uncovered for better performance.

Results

(1) MPT network is the first to cover multiply-accumulate operations (MACs) from 50M/s to 25G/s and to show competitive performance on all computational complexity when tested on the DNS challenge data.

(2) Scaling law illustrates that wideband perceptual evaluation of speech quality (PESQ-WB) and scale-invariant signal-to-noise ratio (SI-SNR) will have around 0.09 and 0.36 dB improvement as the computation cost is doubled when the multiply-accumulate operations per second (MACs/s) are below 15G.

Detailed results

Denoising (w/o reverb)

MACs/s	Causal	1	2	3	4	5	6	7	8
Mixture	N/A
48M/s	✓
102M/s	✓
195M/s	✓
301M/s	✓
502M/s	✓
1G/s	✓
14G/s	✓
23G/s	✗
Clean	N/A

Denoising (w/ reverb)

MACs/s	Causal	1	2	3	4	5	6	7	8
Mixture	N/A
48M/s	✓
102M/s	✓
195M/s	✓
301M/s	✓
502M/s	✓
1G/s	✓
14G/s	✓
23G/s	✗
Clean	N/A

Real-recorded audio test

Note that all models here we used were trained on only the DNS challenge data. Further improvement can be achieved by adding music, noise, and speech data.

MACs/s	Causal	1	2	3
Mixture	N/A
100M	✓
300M	✓
1000M	✓
14G	✓
23G	✗