Tuesday, November 11, 2014

Speech enhancement 3: oversubtraction factor, noise power estimation by minimum statistics

If we can estimate the noise power automatically, we no longer need VAD (voice activity detection) to decide which time frames contain noise and which contain speech, and we can apply a spectral subtraction procedure directly after the noise power estimation. In the article "Spectral Subtraction Based on Minimum Statistics", Rainer Martin introduced a method that estimates the noise power from the smoothed short-time power spectrum of the noisy signal together with an SNR estimator.
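As a rough illustration of the minimum-statistics idea (the procedure in the article is more elaborate), the sketch below smooths the noisy power of one frequency bin recursively and takes the minimum over the most recent frames as the noise floor. The function name `track_noise_power` and the values of `alpha`, `window` and `bias` are illustrative assumptions, not taken from the article.

```python
from collections import deque


def track_noise_power(power_frames, alpha=0.9, window=50, bias=1.5):
    """Estimate the noise power of ONE frequency bin, frame by frame.

    power_frames: noisy short-time powers |Y(lambda, k)|^2 for a fixed k.
    alpha:        smoothing constant of the recursive average (assumed value).
    window:       number of past frames searched for the minimum (assumed value).
    bias:         factor compensating for the minimum being a biased
                  estimate of the mean noise power (assumed value).
    """
    smoothed = None
    recent = deque(maxlen=window)  # keeps only the last `window` smoothed values
    estimates = []
    for p in power_frames:
        # First-order recursive smoothing of the noisy power.
        smoothed = p if smoothed is None else alpha * smoothed + (1.0 - alpha) * p
        recent.append(smoothed)
        # The minimum over the window follows the noise floor,
        # because speech only adds power on top of the noise.
        estimates.append(bias * min(recent))
    return estimates
```

With a window longer than a typical speech burst, the estimate stays near the noise floor while speech is present, which is why no VAD is needed.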

In my opinion, Martin did not explain his noise power estimation method (section 2.3 of his article) very well. The diploma thesis "Kalman filtering and speech enhancement" by Jan Kybic explains it better. The function "specsubm" in the Matlab toolbox "Voicebox" implements this method.

When I was implementing Martin's algorithm, I was confused by three different ways of calculating the oversubtraction factor, so I explain their differences here.

1. Berouti's method
In Berouti's article "Enhancement of Speech Corrupted by Acoustic Noise", a simple formula for calculating the oversubtraction factor is given:
[Equation image: oversubtraction factor given by Berouti]
where $\lambda$ and $k$ are the frame index and the frequency bin index. We can see that a high SNR gives a low oversubtraction factor, and that when the SNR lies in the interval $[-5,20]$ dB, the factor is a linear function of it.
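A minimal sketch of this rule, assuming the commonly quoted form of Berouti's formula (a value of 4 at 0 dB and a slope of 3/20 per dB, held constant outside $[-5,20]$ dB). The exact parameters in the formula pictured above may differ, and the function name `berouti_oversubtraction` is only illustrative.

```python
def berouti_oversubtraction(snr_db, alpha0=4.0, slope=3.0 / 20.0):
    """Oversubtraction factor as a piecewise-linear function of the SNR in dB.

    alpha0: factor at 0 dB SNR (assumed default).
    slope:  decrease of the factor per dB (assumed default).
    """
    # Outside [-5, 20] dB the factor is held at its boundary value.
    snr_clamped = min(max(snr_db, -5.0), 20.0)
    return alpha0 - slope * snr_clamped
```

With these defaults the factor falls linearly from 4.75 at -5 dB to 1 at 20 dB, so frames with a poor SNR are subtracted much more aggressively.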

2. Jan Kybic's method
Kybic gave a formula in his diploma work to calculate the oversubtraction factor $\delta$:
[Equation image: oversubtraction factor given by Jan Kybic]
where $q_L=1$, $q_H=100$, $\delta_L=1$, $\delta_H=4$. The strange thing is that this method behaves in the opposite way to Berouti's: here a high SNR gives a high oversubtraction factor.

3. Voicebox's specsubm
To calculate the oversubtraction factor, this Matlab function first generates a curve $osf(k)$ for frequency weighting:
[Figure: frequency weighting curve]
Then it uses the formula below to calculate the oversubtraction factor:$$osub(\lambda,k)=1+osf(k)\frac{P_n(\lambda,k)}{P_x(\lambda,k)+P_n(\lambda,k)}$$where $P_n$ and $P_x$ are the noise power and the noisy speech power. The term $\frac{P_n(\lambda,k)}{P_x(\lambda,k)+P_n(\lambda,k)}$ decreases as the SNR increases, so it behaves like an inverse of the SNR. This method is therefore consistent with Berouti's: a high SNR gives a low oversubtraction factor.
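The formula can be written out for one frame as a short sketch. This is a plain transcription of the equation above, not the Voicebox source code; the function name is hypothetical, the weighting curve $osf(k)$ is taken as an input, and returning 1 when both powers are zero is an added assumption to avoid a division by zero.

```python
def voicebox_oversubtraction(osf, p_noise, p_noisy):
    """Per-bin oversubtraction factor for one frame.

    osf:     frequency weighting curve osf(k), one value per bin.
    p_noise: estimated noise power P_n(lambda, k) per bin.
    p_noisy: noisy speech power P_x(lambda, k) per bin.
    """
    factors = []
    for weight, pn, px in zip(osf, p_noise, p_noisy):
        total = px + pn
        if total <= 0.0:
            # Assumption: no energy at all, so no extra subtraction.
            factors.append(1.0)
        else:
            # osub = 1 + osf(k) * P_n / (P_x + P_n)
            factors.append(1.0 + weight * pn / total)
    return factors
```

For example, with $osf(k)=3$ and $P_n=1$, the factor is 2.5 when $P_x=1$ and drops to 1.3 when $P_x=9$. The factor always stays between 1 and $1+osf(k)$.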

4. Comparison
Listening to the results of the subtraction, I found that Berouti's oversubtraction factor is more drastic than the Voicebox one. With the same parameters, the former removes more noise but also suppresses more low-energy speech than the latter.
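To see why a larger factor also removes weak speech, here is a minimal sketch of power spectral subtraction with a spectral floor, in the style of Berouti's scheme. The function name and the floor value of 0.01 are illustrative assumptions.

```python
def spectral_subtract(p_noisy, p_noise, osub, floor=0.01):
    """Subtract osub * noise power from the noisy power, bin by bin.

    Any bin whose result falls below floor * noise power is set to that
    floor instead, so the output never becomes negative.
    """
    cleaned = []
    for px, pn, factor in zip(p_noisy, p_noise, osub):
        cleaned.append(max(px - factor * pn, floor * pn))
    return cleaned
```

Take a strong bin with power 10 and a weak bin with power 3, both with noise power 1. A factor of 2 leaves 8 and 1, while a factor of 4 leaves 6 and only the floor value 0.01: the weak bin, which may carry low-energy speech, is wiped out.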
5. Matlab Code
