Google+

Tuesday, February 20, 2018

Retrieve the final hidden states output of variable length sequences in Tensorflow

Assume you have an input batch containing variable-length sequences, with shape:

input: [batch_size, max_time, dim_feature]

and that you have also stored the length of each sequence in a vector, say sequence_length. You can then get the final state output with:

_, state = tf.nn.dynamic_rnn(some_RNN_cell, input, sequence_length=sequence_length)

then, for an LSTM cell, state is an LSTMStateTuple holding both the final hidden and cell states:

state.h: final hidden state, [batch_size, hidden_state_size]
state.c: final cell state, [batch_size, hidden_state_size]
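To see what sequence_length buys you, here is a NumPy sketch (the per-step RNN outputs are random placeholders, not a real RNN) of picking the final valid state per sequence rather than the padded last step:

```python
import numpy as np

batch_size, max_time, hidden_size = 3, 5, 4
rng = np.random.default_rng(0)

# Stand-in for the per-step RNN outputs: [batch_size, max_time, hidden_size]
outputs = rng.standard_normal((batch_size, max_time, hidden_size))
sequence_length = np.array([5, 3, 1])  # true length of each sequence

# The final state of each sequence is the output at its last valid step;
# this is what passing sequence_length to dynamic_rnn gives you, instead
# of the output at step max_time - 1 (which would be computed on padding).
final_states = outputs[np.arange(batch_size), sequence_length - 1]
```

Without sequence_length, you would have to do this gather yourself on the full outputs tensor.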

I give credit to these two sources:
https://danijar.com/variable-sequence-lengths-in-tensorflow/
https://github.com/shane-settle/neural-acoustic-word-embeddings/blob/4cc3878e6715860bcce202aea7c5a6b7284292a1/code/lstm.py#L25


Sunday, January 14, 2018

Sheet music and audio multimodal learning

https://arxiv.org/abs/1612.05050

Towards score following in sheet music: uses classification to find the note-head position in the sheet music. Given an audio spectrogram patch, classify which location bucket it belongs to.

https://arxiv.org/abs/1707.09887

Learning audio–sheet music correspondences for score identification and offline alignment: a pairwise ranking objective with a siamese network. How does this differ from a contrastive loss?
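One way to see the difference is to write both objectives down (a sketch in plain Python on scalar distances; the margin value of 1.0 is an arbitrary assumption, not taken from either paper). The contrastive (siamese) loss acts on one pair at a time, in absolute terms: pull matching pairs together, push non-matching pairs past a margin. The pairwise ranking loss compares a positive pair against a negative pair and only asks for a relative margin between the two distances:

```python
def contrastive_loss(d, is_positive, margin=1.0):
    """Siamese loss on a single pair at distance d."""
    if is_positive:
        return d ** 2                     # pull matching pairs together
    return max(0.0, margin - d) ** 2      # push non-matching pairs past the margin

def pairwise_ranking_loss(d_pos, d_neg, margin=1.0):
    """Hinge loss: only requires the positive pair to be closer than the negative."""
    return max(0.0, margin + d_pos - d_neg)

# Positive pair at distance 0.4, negative pair at distance 0.9:
# the contrastive loss is nearly satisfied pair by pair, but the
# ranking loss still penalizes because 0.9 - 0.4 < margin.
print(contrastive_loss(0.4, True))      # 0.16
print(pairwise_ranking_loss(0.4, 0.9))  # 0.5
```

So a ranking objective never cares about absolute distances, only about the ordering of positives before negatives.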

Wednesday, January 3, 2018

If I were to write this paper... Drum transcription CRNN

https://ismir2017.smcnus.org/wp-content/uploads/2017/10/123_Paper.pdf

(1) I would specify the dropout rate used for the BGRU layers; otherwise, the better performance of the CBGRU could simply be attributed to differences in overfitting.

(2) I would report the number of parameters in each model. A model with more parameters has more capacity, so the better performance of CBGRU-b over CNN-b could be attributed merely to its larger parameter count.

(3) The CNN-b seems to perform really well. I would keep the Conv layers of the CNN-b model fixed and swap its Dense layers for GRU layers, to see whether the GRU really does outperform.
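Point (2) is easy to act on: parameter counts for Conv and GRU layers can be computed by hand. A sketch with made-up layer sizes (chosen for illustration, not taken from the paper):

```python
def conv2d_params(k_h, k_w, in_ch, out_ch):
    # one k_h x k_w x in_ch kernel per output channel, plus a bias each
    return k_h * k_w * in_ch * out_ch + out_ch

def gru_params(input_dim, hidden):
    # 3 gates (update, reset, candidate), each with input weights,
    # recurrent weights, and a bias vector
    return 3 * (input_dim * hidden + hidden * hidden + hidden)

# Hypothetical sizes, just to show how quickly a GRU layer dominates:
print(conv2d_params(3, 3, 32, 64))  # 18496
print(gru_params(256, 128))         # 147840
```

A bidirectional GRU would double the recurrent count again, which is exactly why reporting these numbers matters when comparing a CNN against a CBGRU.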

Wednesday, December 27, 2017

Capacity and trainability of different RNNs

In the paper "Capacity and trainability in RNNs": https://arxiv.org/pdf/1611.09913.pdf

The authors claim that all common RNNs have similar capacity. The vanilla RNN is very hard to train. For tasks that are hard to learn, one should choose a gated architecture: the GRU is the most learnable for shallow networks, while the +RNN (Intersection RNN) performs best for deep networks. The LSTM is extremely reliable but does not perform the best. If the training environment is uncertain, the authors suggest using the GRU or +RNN.

Another paper, "On the state of the art of evaluation in neural language models" (https://arxiv.org/pdf/1707.05589.pdf), found that the standard LSTM performs best among three architectures (LSTM, Recurrent Highway Networks, and Neural Architecture Search). The models are trained with a modified Adam optimizer, and hyperparameters (learning rate, input embedding ratio, input dropout, output dropout, weight decay) are tuned by batched GP bandits.

It is also shown, in the Penn Treebank experiment, that for the recurrent state variational dropout helps, while recurrent dropout shows no advantage.
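The key property of variational dropout can be shown with masks alone (a NumPy sketch of how the masks are sampled, not a full RNN; note that "recurrent dropout" in the paper instead drops part of the recurrent state update): variational dropout samples one mask per sequence and reuses it at every time step, whereas naive dropout resamples the mask at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps, keep_prob = 6, 4, 0.5

# Variational dropout: one mask, shared across all time steps of a sequence
var_mask = (rng.random(hidden) < keep_prob).astype(float)
variational = np.stack([var_mask] * steps)   # [steps, hidden], identical rows

# Naive dropout: a fresh mask at every time step
per_step = (rng.random((steps, hidden)) < keep_prob).astype(float)
```

Sharing the mask means the same hidden units are dropped for the whole sequence, which is what makes the method a proper approximation to Bayesian inference in the original variational-dropout paper.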

Sunday, December 24, 2017

Deep learning practice and trends, some key points

I went through the first part of the tutorial, "Practice". Below are some key points from Oriol's talk:

CNN:

(1) Slide 7, "Deep learning: zooming in", is amazing! It lists the building blocks of deep learning models and sorts them into categories: non-linearities, optimizers, connectivity patterns, losses, and hyper-parameters.

(2) Slide 21, showing the convolution animation, is great: very intuitive for understanding the convolution mechanism.

(3) Slide 27, building very deep ConvNets: a deeper architecture with small 3×3 filters yields a large receptive field with fewer parameters than using large filters.

(4) Slide 35, U-Net: for image segmentation, a bottleneck encoder-decoder with skip connections.
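The claim in point (3) can be checked with simple arithmetic: a stack of n 3×3 conv layers has a receptive field of 2n+1, with fewer weights than a single filter of the same receptive field (a sketch assuming C input and C output channels everywhere, biases ignored):

```python
def stacked_3x3(n_layers, channels):
    receptive_field = 2 * n_layers + 1
    weights = n_layers * 3 * 3 * channels * channels
    return receptive_field, weights

def single_filter(k, channels):
    return k, k * k * channels * channels

# Two 3x3 layers vs one 5x5 filter: same 5x5 receptive field,
# 73728 vs 102400 weights at 64 channels.
print(stacked_3x3(2, 64))    # (5, 73728)
print(single_filter(5, 64))  # (5, 102400)
```

The stacked version also interposes extra non-linearities between the layers, which is a second advantage beyond the parameter savings.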

Seq2seq:

(1) Attention!

(2) Slide 62: tricks!

Video:

Slides: https://docs.google.com/presentation/d/e/2PACX-1vQMZsWfjjLLz_wi8iaMxHKawuTkdqeA3Gw00wy5dBHLhAkuLEvhB7k-4LcO5RQEVFzZXfS6ByABaRr4/pub?slide=id.g2a19ddb012_0_654

Monday, November 27, 2017

Formant and resonance in Western and Kunqu opera singing: literature

http://somaticvoicework.com/resonance-strategies-and-formant-tuning/
This blog explains that in classical singing one should tune, for example, the first formant to the fundamental frequency and the second formant to the second harmonic, and that timbre is shaped by the first five formants. But explaining this jargon to students is useless, because students don't know how to act on it. There are a great many variables involved in adjusting the voice, for example:

The formants align because of the pitch, the volume and the vowel sound, and the shape we make while singing one. There are multitudes of possibilities with vowel sound shapes and very small differences can make the sound “maximally efficient” or not quite “good enough”. The jaw, the tongue, the mouth/lips, the back of the mouth (velo-pharyngeal port), the height of the back of the tongue, the height of the larynx and the amount of open/closed quotient as well as the depth of the vocal folds during vibration all play a part in the overall sound we hear when someone sings. The “at rest” position of the length of the folds, the size of the larynx, the size (both diameter and length of the vocal tract) of the throat and mouth cavities, and the bones of the head and face all play a part as well. And “resonance” as a destination isn’t needed in anything but classical repertoire and some kinds of music that might be done acoustically.

The blog points out that the way to find a good timbre is still for the student to practice continually with a teacher, who corrects the student based on the auditory and visual information of the student's singing. In effect, the teacher is a function whose input is the audio and video of the student's singing and whose output is the way to correct and improve it.

https://www.ncbi.nlm.nih.gov/pubmed/23453594
Sundberg discusses the relationship between the position of the first formant and the fundamental pitch. He measured classically and non-classically trained singers singing several different vowels at different frequencies, and found no systematic shift of the first formant with the fundamental. Likewise, the higher formants showed no systematic change with the fundamental.

https://www.ncbi.nlm.nih.gov/pubmed/24902631
CONCLUSIONS: Formant tuning may be applied by a singer of the OM (old man) role, and both CF (color face) and OM role singers may use a rather pressed type of phonation, CF singers more than OM singers in the lower part of the pitch range. Most singers increased glottal adduction with rising F0.

https://www.ncbi.nlm.nih.gov/pubmed/24131362
The authors measured the long-term average spectrum (LTAS) of 10 Kunqu opera performers across 5 role types. No singer's formant was found, but the hualian (painted-face) role showed a speaker's formant around 3 kHz. The LTAS differs greatly both from ordinary speech and from Western classical singing.

Saturday, November 25, 2017

Optimizing DTW-based audio-to-MIDI alignment and matching, Colin Raffel paper

This paper introduces a method for optimizing various DTW parameters on a synthetic MIDI dataset. He optimizes the mean absolute alignment error with Bayesian optimization and the confidence score by exhaustive search.

Some interesting points in the paper:
(1) The best alignment systems don't use beat-synchronous features.

(2) He introduces two penalties: the first penalizes non-diagonal moves, and the second ensures the entire subsequence is used when doing subsequence alignment. The best systems use median values for both penalties.

(3) The synthetic MIDI corruptions include changing the tempo, cropping a MIDI segment, deleting the vocal track, changing instrument timbre, and changing velocities. All of this is done with pretty_midi.

(4) He evaluates the matching confidence score by computing the Kendall rank correlation between the score and the absolute alignment error, i.e., the error serves as the ground-truth matching confidence.

(5) All of the systems achieve the highest correlation when including the penalties in the score calculation, normalizing by the path length, and normalizing by the mean distance across the aligned portions.
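The first penalty from point (2) can be sketched as a plain additive cost phi charged on every horizontal or vertical step (an illustrative NumPy DTW, not Raffel's exact formulation):

```python
import numpy as np

def dtw_with_penalty(D, phi):
    """DTW over a pairwise distance matrix D, adding an additive
    penalty phi for each non-diagonal (horizontal/vertical) move."""
    n, m = D.shape
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = D[i - 1, j - 1] + min(
                cost[i - 1, j - 1],    # diagonal move, no penalty
                cost[i - 1, j] + phi,  # vertical move, penalized
                cost[i, j - 1] + phi,  # horizontal move, penalized
            )
    return cost[n, m]

rng = np.random.default_rng(0)
D = rng.random((8, 10))
# A larger penalty can only raise the total cost, pushing the
# optimal path toward the diagonal.
print(dtw_with_penalty(D, 1.0) >= dtw_with_penalty(D, 0.0))  # True
```

Point (5)'s normalizations would then divide this total by the path length and by the mean distance over the aligned region before using it as a confidence score.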