Rong GONG's music/speech processing blog: Musical structure segmentation: Joan Serrà method

Goto's RefraiD method is accurate for detecting the repetitive sections (verse, chorus etc.) of a song. But it's not a efficient method due to its tedious post-processing procedures, for example, line segments grouping. And it can't detect automatically segment boundaries or group similar segments.

Joan Serrà published first time his musical structure segmentation method in 2012 in article "Unsupervised Detection of Music Boundaries by Time Series Structure Features" where he dealt with the segment boundary detection. In 2014, he published another article "Unsupervised Music Structure Annotation by Time Series Structure Features and Segment Similarity" which presents his segment labelling method.

1. Method overview
The method contains three main steps:

1. Descriptor extraction
2. Segment boundaries detection
3. Segment labeling

The descriptor used in article is HPCPs (Harmonic Pitch Class Profile) which is the an extension of Chroma descriptor. For the latter, we use all the spectrum content filtered by Chroma filter bank. Whereas, we just pick the harmonic peaks in the spectrum to calculate HPCPs, which theoretically immune to the spectrum noise content assuming the accuracy of harmonic peak picking.

Considering the contribution of each harmonic to the pitch perception, Harmonic summation method is integrated in the calculation of HPCPs.

2. Segment boundaries detection
The so-called music descriptor time series is the feature vector (HPCPs) of STFT spectrum. I think his innovation is incorporating the delayed coordinates to the feature vector (concatenate the near past feature vectors with the current one) and this does improve enormously the segment labeling accuracy.

Then he calculated the recurrence plot of the dissimilarity matrix which I think is the most ingenious detail of this method. The recurrence plot highlights the mutual similar segments (the similarities of frame $i$ to $j$ and $j$ to $i$ should both be above a threshold) by filtering out the dissimilarities with a dynamic threshold. The result of this plot is similar to Goto's de-noised self-similarity matrix, but with much less computation time.

Recurrence plot

He afterwards transformed the recurrence plot to lag-time form and smoothed the latter with a gaussian kernel which is similar to the gaussian-tapered kernel used by J. Foote. The difference here is that the convolution should be done on the whole matrix (plot) instead of only the diagonal.

Smoothed lag-time recurrence plot

Finally the boundary curve is obtained by calculating the distance of consecutive frames ($||frame_{i+1}-frame_{i}||$). And the peak positions on this curve are selected as the segment boundaries.

3. Segment labeling
This procedure is very clear in article. The enforcement of transitivity by matrix multiplication is one of my favorite tricks in this method. However, I don't understand why the $Q_{max}$ measure can stand for the global similarity. If someone could point it out I will be so grateful.

4. Matlab Code
Those who want the Matlab code of this method please leave a message below or send me an email. :)

Rong GONG's music/speech processing blog

Tuesday, November 18, 2014

Musical structure segmentation: Joan Serrà method

1 comment:

social network

Total Pageviews

Subheader

Blog Archive