Rong GONG's music/speech processing blog: 2014

Friday, November 21, 2014

Find chorus section in lyrics

I did a simple algorithm to find the chorus section of a song. There are several assumptions for this algorithm:

1. the processing lyrics should be complete, which means there is no abbreviation like "repeat 2 times chorus" etc.
2. the lyrics should be pre-processed by segmenting sections with blank line.
3. the chorus parts should be generally the same.

There are three steps in the algorithm:

1. calculate phrase-level similarity matrix, and it's lag-time form,
2. find repetitive sections which are the consecutive horizontal lines in the matrix,
3. group these repetitive sections.

Let's start!

1. Phrase-level similarity matrix
The similarity metric is defined as:$$metric=\frac{intersection\,words\,number}{largest\,word\,number\,of\,two\,phrases}$$And the similarity is:$$similarity=\left\{\begin{matrix} 0 & \mathrm{if} & metric < 0.8 \\ 1 & \mathrm{if} & metric \ge 0.8. \end{matrix}\right.$$Applying this similarity with each phrase pair in lyrics, we get the similarity matrix:

Fig. 1 phrase-level similarity matrix

Then we transform it to the lag-time form:

Fig. 2 lag-time form similarity matrix

It's quite clear that the horizontal lines represent the repetitive sections in lyrics, and the discrete points represent the repetitive phrases. Due to the variations between some repetitive sections (which breaks the third assumption), some points are not consecutive. We are going to deal with this problem afterwards.

2. Find the repetitive sections
Finding the repetitive sections is simple. At this stage, we only consider the consecutive horizontal lines. The number of the beginning and the ending phrase of repetitive sections are stored in the below matrix. Mind that we unwrap each horizontal line by adding the lag-time to its $x$ coordinates. The (gene) means generative section which can be considered as the parent section generating the horizontal lines in Fig. 2.

Table.1 beginning and ending phrases of horizontal lines
	begin	end	begin (gene)	end (gene)
repetition 1	43	50	34	41
repetition 2	34	41	17	24
repetition 3	43	50	17	24

By observing this table, we found that these three repetitions belong to the same group because they share either the repetitive section or the generative section. So in next step, we mange to group these repetitions and re-search missing segment.

3. Group repetitive sections
Alter grouping the repetitive sections and re-searching the missing sections, we end up with four group:

Table.2 after grouping and re-searching missing segment
	begin	end
group 1	17	24
	34	41
	43	50
group 2	1	15
group 3	26	32
group 4	52	64

Until now, it seems our job is done. But remember we said that we were going to deal with the discrete points in Fig. 2. Now we can be sure that these points are in our group 2, 3 or 4 and that also means one or several these groups belong to group 1 (remember we said these discrete points stand for the variation of repetitive section?). So we can set up a rule to re-group further the groups in Table. 2. For example, we could do the pair comparison of the first three words in the first lines of two groups.

Tuesday, November 18, 2014

Musical structure segmentation: Joan Serrà method

Goto's RefraiD method is accurate for detecting the repetitive sections (verse, chorus etc.) of a song. But it's not a efficient method due to its tedious post-processing procedures, for example, line segments grouping. And it can't detect automatically segment boundaries or group similar segments.

Joan Serrà published first time his musical structure segmentation method in 2012 in article "Unsupervised Detection of Music Boundaries by Time Series Structure Features" where he dealt with the segment boundary detection. In 2014, he published another article "Unsupervised Music Structure Annotation by Time Series Structure Features and Segment Similarity" which presents his segment labelling method.

1. Method overview
The method contains three main steps:

1. Descriptor extraction
2. Segment boundaries detection
3. Segment labeling

The descriptor used in article is HPCPs (Harmonic Pitch Class Profile) which is the an extension of Chroma descriptor. For the latter, we use all the spectrum content filtered by Chroma filter bank. Whereas, we just pick the harmonic peaks in the spectrum to calculate HPCPs, which theoretically immune to the spectrum noise content assuming the accuracy of harmonic peak picking.

Considering the contribution of each harmonic to the pitch perception, Harmonic summation method is integrated in the calculation of HPCPs.

2. Segment boundaries detection
The so-called music descriptor time series is the feature vector (HPCPs) of STFT spectrum. I think his innovation is incorporating the delayed coordinates to the feature vector (concatenate the near past feature vectors with the current one) and this does improve enormously the segment labeling accuracy.

Then he calculated the recurrence plot of the dissimilarity matrix which I think is the most ingenious detail of this method. The recurrence plot highlights the mutual similar segments (the similarities of frame $i$ to $j$ and $j$ to $i$ should both be above a threshold) by filtering out the dissimilarities with a dynamic threshold. The result of this plot is similar to Goto's de-noised self-similarity matrix, but with much less computation time.

Recurrence plot

He afterwards transformed the recurrence plot to lag-time form and smoothed the latter with a gaussian kernel which is similar to the gaussian-tapered kernel used by J. Foote. The difference here is that the convolution should be done on the whole matrix (plot) instead of only the diagonal.

Smoothed lag-time recurrence plot

Finally the boundary curve is obtained by calculating the distance of consecutive frames ($||frame_{i+1}-frame_{i}||$). And the peak positions on this curve are selected as the segment boundaries.

3. Segment labeling
This procedure is very clear in article. The enforcement of transitivity by matrix multiplication is one of my favorite tricks in this method. However, I don't understand why the $Q_{max}$ measure can stand for the global similarity. If someone could point it out I will be so grateful.

4. Matlab Code
Those who want the Matlab code of this method please leave a message below or send me an email. :)

Saturday, November 15, 2014

Melody extraction: Goto's and Salamon's methods

If someone works on the melody extraction algorithms, he definitely knows Goto's (PreFEst) and Salamon's methods. I don't know exactly why the first one is well known, perhaps it's a first well-performed melody extraction algorithm. Salamon's method (first published in 2011) has the best overall performance until now, and it's also an easy implemented method.

1. Salience function
Both methods are based on the calculation of the Salience function, but with different definitions.

1.1 Goto's F0 pdf
Goto constructed firstly two harmonic-structure tone models which have the same fundamental frequency (F0) but different harmonic amplitudes $\mu_{1,2}$. Each tone model is a weighted gaussian mixture where its harmonic components are modeled by gaussian function. Another weight coefficient $w_{1,2}$ is introduced to balance the overall amplitude of the tone models mixture.

Afterwards, he utilized EM algorithm to estimate the coefficients $\mu_{1,2}$ and $w_{1,2}$ assuming that each frame of spectrogram is a probability density function (pdf) and can be generated by this tone models mixture.

1.2 Salamon's salience function
Salamon's salience function is based on harmonic summation method. He assumed the frequency $f$ of each peak detected on the spectrogram is a harmonic partial of the fundamental frequency $f_0=f/h$. He calculate afterwards the distance between $f/h$ and the salience frequency bins $b$, which is weighted by the harmonic number $h$. Higher $h$ brings about lower contribution of the peak at frequency $f$.

This calculation is based on the Hermes's (Measurement of pitch by subharmonic summation) pitch detection theory. If we count the peak frequency $f$ as the $h_{th}$ harmonic of the fundamental frequency $f/h$, it's contribution to the detection of pitch $f/h$ is $\alpha^{h-1}$ which is a decreasing sequence.

Below is salience functions of a same audio clip:

Goto's F0 pdf (salience)

Salamon's salience function

2. F0 contour tracking
Due to less disturbed melody line in Goto's salience function, peaks detected in each frame are relatively less, which results in less F0 candidate contours in Goto's salience function.

Though Salamon set up two magnitude thresholds to filter out abundant peaks (one for each frame, another for overall spectrogram), there are still a lot of contours been tracked. Apart from the inherent characteristic of his salience function, the reasons of large amount of peaks can be the small hop size of STFT (which leads to more frames) and small pitch continuity limit (within 80 cents) in Salamon's implementation.

Goto's contour tracking

Salamon's contour tracking

3. Melody selection
Two methods have their own rules of selecting the melody from the F0 candidate contours. Goto chooses the highest total salience power contour as the melody. However, Salamon's rules are way more complicated. Voicing detection, elimination octave error and outlier melody based on contour characteristics (pitch mean, standard deviation; salience mean, standard deviation etc.) are integrated in this final selection step.

Goto's melody selection

Salamon's melody selection

4. Matlab Code
Those who want the Matlab code of these two methods please contact with me. Leave a message below or send me an email. :=)

Tuesday, November 11, 2014

Speech enhancement 3: oversubtraction factor, noise power estimation by minimum statistics

If we can estimate the noise power automatically, we don't need VAD (voice activity detection) any more to decide which time frame is noise or speech. And we could also implement some spectral subtraction procedures after the noise power estimation. In article "Spectral Subtraction Based on Minimum Statistics", Rainer Martin introduced a method to estimate noise power by using the smoothed short-time noisy power spectrum and the SNR estimator.

Martin explained not very well his noise power estimation method (section 2.3 of his article) in my opinion. There is a master diploma work "Kalman filtering and speech enhancement" by Jan Kybic which explained it better. A function called "specsubm" in Matlab toolbox "Voicebox" implemented this method.

When I was implementing Martin's algorithm, three oversubtraction factor calculation methods made me confused. So I explain here their differences.

1. Berouti's method
In Berouti's article "ENHANCEMENT OF SPEECH CORRUPTED ACOUSTIC NOISE", a simple formula of calculating the oversubtraction factor has been given:

oversubtraction factor given by Berouti

where $\lambda$ and $k$ are frame index and frequency bins. We could see that a high SNR ratio brings about a low oversubtraction factor. And when SNR is in the interval $[-5,20]$, this factor is its linear function.

2. Jan Kybic's method
Kybic gave a formula in his diploma work to calculate the oversubtraction factor $\delta$:

oversubtraction factor given by Jan Kybic

where $q_L=1,q_H=100,\delta_L=1,\delta_H=4$. The weird thing is that the value calculated by this method is just opposite of Berouti's. Because a high SNR ratio here brings about a high oversubtraction factor.

3. Voicebox's specsubm
To calculate oversubtraction factor, this Matlab function generates firstly a curve $osf(k)$ for frequency weighting:

frequency weighting curve

Then, it use the formula below to calculate the oversubtraction factor:$$osub(\lambda,k)=1+osf(k)\frac{P_n(\lambda,k)}{P_x(\lambda,k)+P_n(\lambda,k)}$$where $P_n$ and $P_x$ are noise power and noisy speech power. The part $\frac{P_n(\lambda,k)}{P_x(\lambda,k)+P_n(\lambda,k)}$ can be seen as the inverse of SNR. So this method is consistent with Berouti's.

4. Comparison
By listening the subtracted speech results, I found the Berouti's oversubtraction factor is more drastic than the Voicebox one. Using the same parameters, the first one eliminates more noise but also suppress more low energy speech than the second one.

click to enlarge

5. Matlab Code
https://github.com/ronggong/voiceenhance

Monday, November 10, 2014

Speech enhancement 2: iterative AR coefficients estimation Kalman filter

I talked about the spectral subtraction in my last article, its principal drawback is the filtered artifact (musical noise). So I am seeking these days for a better method which can immune from this artifact. Kalman filtering seems to be an attractive method, because almost everybody who did speech enhancement in 90s was talking about it. So I thought why not give it a try, maybe it could solve the problem of musical noise.

1. K. K. Paliwal's white noise Kalman filter
K. K. Paliwal's "A SPEECH ENHANCEMENT METHOD BASED ON KALMAN FILTERING" might be the first implementation of Kalman filter on speech enhancement. His assumptions are that the speech can be represented by an autoregressive (AR) process and the noise is a gaussian white noise. Base on that, he derived the algorithm of equations 15 to 19 in his article.

Esfandiar Zavarehei put the Matlab code of this algorithm on his website. (youpi again!) But he just assumed that the clean speech is available, that means that he estimated the AR coefficients from the clean speech instead of from the noisy one. So this brought about another problem: how to estimate noisy speech's AR coefficients?

2. Iterative AR coefficients estimation
An easy way of AR coefficients estimation is doing it iteratively. This method is introduced in the article "FILTERING OF COLORED NOISE FOR SPEECH ENHANCEMENT AND CODING":

1. chop the noisy signal into frames,
2. estimate the AR coefficients and the variances by Linear prediction coding (LPC) on these noisy frames,
3. Kalman filter the signal with these AR coefficients and variances, obtain the filtered signal frames,
4. do the LPC estimation again on the filtered frames to get new AR coefficients and variances,
5. iterate the steps 3 and 4 several times to obtain the clean speech.

Iterative AR coefficients estimation

In addition to coefficients estimation, this article solved another problem which is de-noising the colored noise. The author made an assumption that the noise is also an autoregressive process, so we can estimate its AR coefficients and variances by LPC in the same way.

3. Results

I tested this algorithm with the same noisy speech sample used in the last article. The SNR of this speech sample is quite low so it can evaluate the quality of the algorithm. However, this algorithm can't remove the artifact of musical tone. And when I increase the iteration time to eliminate more noise, the intelligibility degrades rapidly.

There is some variations or extensions of this simple Kalman filter on the internet, but they seem to all need complex mathematical derivation and a lot of computation time. So I suddenly lose interest in this method.

4. Matlab Code
I put the white noise and colored noise versions algorithms in https://github.com/ronggong/voiceenhance. Notice that the input noise signal should be stationary in this implementation.

Thursday, November 6, 2014

Speech enhancement 1: spectral subtraction, wiener filtering

I am working today on my personnel project which needs some algorithms of speech enhancement or source separation to highlight the speech/singing voice part. I bumped into some classical enhancement methods, like, spectral subtraction, Wiener filter. These kinds of methods are designed to eliminate the noise component in noisy speech signal.

1. Spectral subtraction
It's funny how scientist at the years of 80s utilises this rudimentary method for de-noising. The principle is so simple: do FFT to the noisy speech ($X(k)$), do FFT to a pure noise ($N(k)$), subtract the magnitude of these two spectrum ($|\hat{S}(k)|=|X(k)| - |N(k)|$), and do IFFT to reconstruct the temporel signal by add the phase information of $X(k)$.

More details can be referred to Boll's "Suppression of Acoustic Noise in Speech Using Spectral Subtraction". He included some pre/post processing method to improve the speech intelligibility, for instance, magnitude averaging, residual noise reduction, additional signal attenuation during nonspeech activity.

Pure noise spectrum profile should be build before the spectral subtraction step, then each time VAD (voice activity detection) detect a noise frame, this profile will be updated. This is not a bad idea, huh? :D But his VAD detector compares only the residual spectrum and the noise profile (proportion $T$). When $T$ < -12 dB, the current frame is indicated as noise, otherwise, it's speech.

I tested with this threshold $T$, and I found -12 dB might not be fit for all the signals:

click for enlarge

It is clearly that when $T$ = -12, the subtractor did nothing but the attenuator took charge of all the works, because all the frames are indicated as noise frame. Personally, I prefer the sound without additional signal attenuation. (Please point me out if you think that I did something wrong with this algorithm :=) )

The big disadvantage of this method is informed by author himself: it can't deal with the non stationary noise, that is, if the noise spectrum profile changes within the speech frames, this method fails.

2. Two variations
In article "Enhancement and Bandwidth Compression of Noisy Speech", we have two variations of this subtraction by using the power spectrum of $|X(k)|^2$ and $|N(k)|^2$:$$|\hat{S}(k)|=(|X(k)|^2-\alpha \mathbb{E}[|N(k)|^2])^{1/2}$$ $$|\hat{S}(k)|=\frac{1}{2}|X(k)|+\frac{1}{2}(|X(k)|^2-\mathbb{E}[|N(k)^2|])^{1/2}$$
The author proved that these two formulas can be deduced from the parametric implicit Wiener filtering. I tried these two, the first one gives a reasonable result, but the second one is really bad. I think that's due to the noisy component $\frac{1}{2}|X(k)|$ in this formula.

3. A priori SNR estimation Wiener filtering
The Signal-to-noise ratio measure in frequency domaine Wiener filter could be a posteriori or a priori. If it's a posteriori, it could be easily computed by:$$SNR_{post}=\frac{|X(k)|^2}{\mathbb{E}|N(k)|^2}$$because we know $|X(k)|$ is the noisy spectrum and $\mathbb{E}|N(k)|^2$ is the average magnitude of noise signal when there is no speech activity. The two variations of parametric implicit Wiener filtering utilise exactly this a posteriori SNR ratio.

The a priori one is defined by:$$SNR_{prio}=\frac{\mathbb{E}|S(k)|^2}{\mathbb{E}|N(k)|^2}$$However, we do know the $S(k)$ which is exactly the clean speech we want to obtain. Article "SPEECH ENHANCEMENT BASED ON A PRIORI SIGNAL TO NOISE ESTIMATION" introduced a iterative method to estimate the $SNR_{prio}$ which is called "decision-direct" estimate by the author.

The Matlab code of this method written by Esfandiar Zavarehei can be easily download from his website (youpi!). He translated the formulas of the article into code except having changing some notations. For the reason of legibility, I changed them back.

The "NoiseMargin" variable in his function "vad" is worth paying attention to. Because it indicates that a short-time frame would be considered as noise or speech. For instance, if the SNR ratio of noisy speech is 0dB, we assign 12dB to NoiseMargin, it turns out that almost all the frames will be indicated as noise.

A priori SNR estimation Wiener filter result, without pre/post processing

The result of this method is more enjoyable to me than the last two, but it still has a some artificial traces.
4. Matlab code
https://github.com/ronggong/voiceenhance

Wednesday, November 5, 2014

Chorus detection, Goto's algorithm implementation

I am working recently on a project of lyrics-audio alignment where an important step is segment a song so that we can align the lyrics to the audio on section level. Music segmentation itself is a hard subject that many researchers devote to it, the ideal output of the segmentation is the clusters of each section of the music, like 'verse', 'chorus'. A great summary article could be found on the internet AUDIO-BASED MUSIC STRUCTURE ANALYSIS.

1. J. Foote Novelty curve
The first segmentation method is proposed by J. Foote in his article "Automatic audio segmentation using a measure of audio novelty". He use a diagonal matrix which he called it "checkerboard" kernel to compute the Novelty of MFCC self-similarity matrix. The implementation of this method is really easy and the function Mirnovelty in Mirtoolbox can do this task by several code lines.

However, the result of this method is far from perfect. We can hardly do the next step processing to segment each section of the song from the novelty curve.

Fig. 1 Novelty curve of a pop song, red vertical lines are the ground truth of segmentation parts

Above (Fig. 1) is a novelty curve calculated from a pop song, and there is no clue to separate the chorus from the verses.

2. Goto's RefraiD

The fact of this poor result of novelty curve made me realized that segmenting precisely each section is quite a hard task, and considering of the current segmentation method can reach just 0.7 F-measure correctness, I decide to use a method which might deal with an easier task but can achieve a more robust result.

The method I found is called "RefraiD" which introduced by Masataka Goto in his article "A Chorus Section Detection Method for Musical Audio Signals and Its Application to a Music Listening Station". This method extract repetitive sections by detecting the 'stripe' in the Chroma self-similarity matrix.

3. General process of RefraiD

The method is no need to repeat here, because Goto explained it precisely in his article. I want to point out here some modifications of this method in my implementation to improve the detection performance. Below is the general process of this algorithm:

1. STFT of audio signal
2. calculate chromagram by chroma filter
3. lag time self-similarity matrix (SSM) of chromagram
4. remove noise to highlight repetitive line segments in SSM, de-noised SMM
5. extract line segments from de-noised SSM
6. integrate repeated sections (line segments) into groups
7. redetect missing line segments and integrate them into groups
8. remove inappropriate peaks (missing line segments)
9. unfold line segments in each group
10. integrate line segments inter-groups
11. post-processing line segments inner-group
12. adjust self-similarity probability of line segments

13. select chorus sections

The 2nd step “chromagram calculation” is done by a chroma filter, this is different from Goto’s method. In step 5 and 7, “dynamic threshold” of equation.8 in original article is replaced by the “True Envelope” threshold (step 5) and “low pass filter envelope” (step 7). Some line segments integration rules have been added in step 10. A post processing procedure has been added in step 11. In step 12, the chorus measure is not only weighted by average value of line segment length, but also by the length of each line.

4. Some details

Steps 5 and 7 Equation.7 and 10 has found not working to detect peaks of $R_{all}(t,l)$ and $R_{[T_s,T_e]}(l)$. The reason is these formulas low pass signal and cause diminution of peak amplitude and deviation of peak location. Besides, the dynamic threshold of equation.8 is not efficient for detecting all the peaks.

Fig. 2 True Envelope of Rall(t,l), and peak picking

Instead, “True envelop” is proved experimentally as an efficient threshold for the peak picking. Considering the calculation complexity of “True envelope” method, too much use might slow the algorithm. So in step 7, a low pass filter envelope is used instead of “True envelope”.

The peaks detected above this threshold (Fig. 2) are those have important amplitudes, which means there might be long repetitive line segments in their corresponding sections. So these important amplitude peaks need to be extracted.

Fig. 3 Rules of integrating line segments inter-groups

Step 10 The rules of integrating line segments inter-groups has been defined (Fig. 3):

line segment is overlapping with a generative line segment of another group. Generative line segment is which generate other line segments inside a group.
a group section includes another group section
two group sections are close (< threshold), but not overlapped. This rule can be only implemented once, and there is
a length limit for these two concatenated line segments.

Step 11 The post-processing are:

remove line segments which are too short : shorter than 1/3 of original group section
integrate inner line segments which are included by other segments in groups.

Step 13 The equation 16 in article has been changed to:

Because in our implementation, the line segments of the same group may have different lengths, whereas, in Goto’s article, all the line segments of the same group are equal length.

It is important to mention that the group which has the most great chorus measure might not be the chorus. Because this algorithm detect not only the chorus but in fact the repetitive sections in audio signal. Any repetitive section which has great self-similarity probability and which are long enough could have the most great score. If the most great score group has been proved not be the chorus section, we can try the second great score group.

5. Code Matlab
I put the code matlab at github https://github.com/ronggong/lyricsaudio, feel to use and change. Don't forget to leave a comment if you got nice ideas. :D

Rong GONG's music/speech processing blog