Wednesday, November 5, 2014

Chorus detection, Goto's algorithm implementation

I am working recently on a project of lyrics-audio alignment where an important step is segment a song so that we can align the lyrics to the audio on section level. Music segmentation itself is a hard subject that many researchers devote to it, the ideal output of the segmentation is the clusters of each section of the music, like 'verse', 'chorus'. A great summary article could be found on the internet AUDIO-BASED MUSIC STRUCTURE ANALYSIS.

1. J. Foote Novelty curve

The first segmentation method is proposed by J. Foote in his article "Automatic audio segmentation using a measure of audio novelty". He use a diagonal matrix which he called it "checkerboard" kernel to compute the Novelty of MFCC self-similarity matrix. The implementation of this method is really easy and the function Mirnovelty in Mirtoolbox can do this task by several code lines.

However, the result of this method is far from perfect. We can hardly do the next step processing to segment each section of the song from the novelty curve.

 Fig. 1 Novelty curve of a pop song, red vertical lines are the ground truth of segmentation parts
Above (Fig. 1) is a novelty curve calculated from a pop song, and there is no clue to separate the chorus from the verses.

2. Goto's RefraiD
The fact of this poor result of novelty curve made me realized that segmenting precisely each section is quite a hard task, and considering of the current segmentation method can reach just 0.7 F-measure correctness, I decide to use a method which might deal with an easier task but can achieve a more robust result.

The method I found is called "RefraiD" which introduced by Masataka Goto in his article "A Chorus Section Detection Method for Musical Audio Signals and Its Application to a Music Listening Station". This method extract repetitive sections by detecting the 'stripe' in the Chroma self-similarity matrix.

3. General process of RefraiD
The method is no need to repeat here, because Goto explained it precisely in his article. I want to point out here some modifications of this method in my implementation to improve the detection performance. Below is the general process of this algorithm:

1. STFT of audio signal
2. calculate chromagram by chroma filter
3. lag time self-similarity matrix (SSM) of chromagram
4. remove noise to highlight repetitive line segments in SSM, de-noised SMM

5. extract line segments from de-noised SSM
6. integrate repeated sections (line segments) into groups
7. redetect missing line segments and integrate them into groups
8. remove inappropriate peaks (missing line segments)
9. unfold line segments in each group

10. integrate line segments inter-groups
11. post-processing line segments inner-group

12. adjust self-similarity probability of line segments
13. select chorus sections

The 2nd step “chromagram calculation” is done by a chroma filter, this is different from Goto’s method. In step 5 and 7, “dynamic threshold” of equation.8 in original article is replaced by the “True Envelope” threshold (step 5) and “low pass filter envelope” (step 7). Some line segments integration rules have been added in step 10. A post processing procedure has been added in step 11. In step 12, the chorus measure is not only weighted by average value of line segment length, but also by the length of each line.

4. Some details

Steps 5 and 7 Equation.7 and 10 has found not working to detect peaks of $R_{all}(t,l)$ and $R_{[T_s,T_e]}(l)$. The reason is these formulas low pass signal and cause diminution of peak amplitude and deviation of peak location. Besides, the dynamic threshold of equation.8 is not efficient for detecting all the peaks.
 Fig. 2 True Envelope of Rall(t,l), and peak picking
Instead, “True envelop” is proved experimentally as an efficient threshold for the peak picking. Considering the calculation complexity of “True envelope” method, too much use might slow the algorithm. So in step 7, a low pass filter envelope is used instead of “True envelope”.

The peaks detected above this threshold (Fig. 2) are those have important amplitudes, which means there might be long repetitive line segments in their corresponding sections. So these important amplitude peaks need to be extracted.
 Fig. 3 Rules of integrating line segments inter-groups
Step 10 The rules of integrating line segments inter-groups has been defined (Fig. 3):
1. line segment is overlapping with a generative line segment of another group. Generative line segment is which generate other line segments inside a group.
2. a group section includes another group section
3. two group sections are close (< threshold), but not overlapped. This rule can be only implemented once, and there is
a length limit for these two concatenated line segments.
Step 11 The post-processing are:
1. remove line segments which are too short : shorter than 1/3 of original group section
2. integrate inner line segments which are included by other segments in groups.
Step 13 The equation 16 in article has been changed to:
Because in our implementation, the line segments of the same group may have different lengths, whereas, in Goto’s article, all the line segments of the same group are equal length.

It is important to mention that the group which has the most great chorus measure might not be the chorus. Because this algorithm detect not only the chorus but in fact the repetitive sections in audio signal. Any repetitive section which has great self-similarity probability and which are long enough could have the most great score. If the most great score group has been proved not be the chorus section, we can try the second great score group.

5. Code Matlab
I put the code matlab at github https://github.com/ronggong/lyricsaudio, feel to use and change. Don't forget to leave a comment if you got nice ideas. :D