A Quick Review

Motivation

(1) Most previous works cover only a narrow range of computational costs. To satisfy the varied resource requirements of real-world applications, researchers must design models for different target computational costs, which often leads to distinct neural network architectures.

(2) Neural scaling laws capture empirical relationships between model performance and quantities such as parameter count and compute, which helps in designing and understanding neural network models. However, no study has yet presented scaling laws for speech signal processing.

Experimental setup

The experiments were conducted on (1) a simulated dataset built from LibriSpeech for ablation studies and (2) the DNS Challenge dataset to cover the complete scaling range.

Framework & Method

(1) A Multi-Path Transformer (MPT) network is proposed that can be scaled to different computational complexities for the denoising task (a hedged sketch follows this list).

(2) Practical network scaling techniques are identified that yield better performance.
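
The summary above does not spell out the MPT design, so the following is only a minimal sketch of the general idea of scaling compute through parallel transformer paths over split feature channels; the class name MultiPathBlock and all hyper-parameters are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultiPathBlock(nn.Module):
    """Hypothetical multi-path block: channels are split across several
    small transformer paths that run in parallel and are then merged.
    The number of paths and the width give knobs for hitting a target
    computational budget."""

    def __init__(self, dim: int, num_paths: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_paths == 0, "dim must split evenly across paths"
        path_dim = dim // num_paths
        # One small transformer encoder layer per parallel path.
        self.paths = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=path_dim, nhead=num_heads,
                dim_feedforward=2 * path_dim, batch_first=True)
            for _ in range(num_paths)
        )
        self.merge = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); split channels, process each path
        # independently, then merge the concatenated outputs.
        chunks = x.chunk(len(self.paths), dim=-1)
        out = torch.cat([p(c) for p, c in zip(self.paths, chunks)], dim=-1)
        return self.merge(out)

block = MultiPathBlock(dim=256, num_paths=4)
y = block(torch.randn(1, 100, 256))
print(y.shape)  # torch.Size([1, 100, 256])
```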

Results

(1) The MPT network is the first to cover multiply-accumulate operations per second (MACs/s) from 50M/s to 25G/s and shows competitive performance at every computational complexity when tested on the DNS Challenge data (see the MACs/s unit example after this list).

(2) The fitted scaling law shows that wideband perceptual evaluation of speech quality (PESQ-WB) and scale-invariant signal-to-noise ratio (SI-SNR) improve by roughly 0.09 points and 0.36 dB, respectively, each time the computational cost is doubled, as long as the MACs/s stay below 15G (see the extrapolation sketch below).
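
For context on the unit: MACs/s measures compute per second of processed audio, i.e., the per-frame MACs divided by the frame hop. A toy calculation with assumed numbers (the paper's hop size and per-frame costs are not given in this summary):

```python
def macs_per_second(macs_per_frame: float, hop_seconds: float) -> float:
    """Compute cost per second of processed audio."""
    return macs_per_frame / hop_seconds

# Assumed numbers for illustration: 0.4M MACs per frame at an 8 ms hop
# lands at 50M MACs/s, the low end of the range covered by MPT.
print(f"{macs_per_second(0.4e6, 0.008) / 1e6:.0f}M MACs/s")  # 50M MACs/s
```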

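A minimal sketch of what the reported log-linear trend implies, assuming the per-doubling gains quoted above hold uniformly below 15G MACs/s; the helper projected_gain is hypothetical and simply extrapolates the two quoted slopes:

```python
import math

# Per-doubling gains reported in the paper (valid below ~15G MACs/s):
PESQ_GAIN_PER_DOUBLING = 0.09   # PESQ-WB points
SISNR_GAIN_PER_DOUBLING = 0.36  # dB

def projected_gain(base_macs: float, target_macs: float):
    """Project metric improvements when scaling compute base -> target,
    assuming both operating points sit below the ~15G MACs/s knee."""
    doublings = math.log2(target_macs / base_macs)
    return (PESQ_GAIN_PER_DOUBLING * doublings,
            SISNR_GAIN_PER_DOUBLING * doublings)

# Example: scaling from 1G to 8G MACs/s is three doublings.
d_pesq, d_sisnr = projected_gain(1e9, 8e9)
print(f"expected gains: +{d_pesq:.2f} PESQ-WB, +{d_sisnr:.2f} dB SI-SNR")
# -> expected gains: +0.27 PESQ-WB, +1.08 dB SI-SNR
```
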
Detailed results

Denoising (w/o reverb)

Model     MACs/s
Mixture   N/A
1         48M/s
2         102M/s
3         195M/s
4         301M/s
5         502M/s
6         1G/s
7         14G/s
8         23G/s
Clean     N/A

[The original demo page attaches an audio sample to each row and a causal/non-causal indicator to each model; neither is reproducible in text.]

Denoising (w/ reverb)

[Same demo layout as above for the reverberant condition: the mixture, the eight MPT models from 48M/s to 23G/s MACs, and the clean reference; audio samples not reproduced.]

Real-recorded audio test