Unsupervised Incremental Learning and Prediction of Audio Signals: Supplementary Material

Ricard Marxer (1, 2), Hendrik Purwins (1, 3)

1. Music Technology Group, Universitat Pompeu Fabra, Roc Boronat, 138, 08018 Barcelona, Spain

2. Speech and Hearing Research Group, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK

3. Audio Analysis Lab, Aalborg Universitet København, A.C. Meyers Vænge 15, 2450 Copenhagen SV, Denmark

Abstract

Artful play with listeners' expectations is one of the supreme skills of a gifted musician. We present a system that analyzes an audio signal in an unsupervised manner in order to generate a musical representation of it on the fly. The system performs the task of next-note prediction using the emerging representation. The main difference between our system and other existing music prediction systems is that it creates the necessary representations dynamically, as needed. Therefore it can adapt to any type of sound, with an arbitrary number of timbre classes. The system consists of a clustering algorithm coupled with an algorithm for sequence learning that adapts its structure to the dynamically changing clustering tree. The flow of the system is as follows: 1) segmentation by onset detection, 2) timbre representation of each segment by Mel frequency cepstrum coefficients, 3) discretization by incremental clustering, yielding a tree of different sound classes (e.g. instruments) that can grow or shrink on the fly driven by the instantaneous sound events, resulting in a discrete symbol sequence, 4) extraction of statistical regularities from the symbol sequence, using hierarchical N-grams and the newly introduced conceptual Boltzmann machine, and 5) prediction of the next sound event in the sequence. The system is tested on drum loops and voice recordings. We assess the robustness of the performance with respect to the complexity and noisiness of the signal. Given that the number of estimated timbre clusters is not necessarily the same as the number of different ground truth timbre classes, we evaluate the performance of the system with the adjusted Rand index (ARI). We evaluate the different processing steps separately and finally the system as a whole, including the interaction of its components. Clustering in isolation yields an ARI of 82.7%/85.7% for the singing voice and drum data sets. Onset detection jointly with clustering achieves an ARI of 81.3%/76.3% (voice/drums). The prediction of the entire system yields an ARI of 27.2%/39.2% (voice/drums).
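The processing chain described above can be illustrated with a minimal, non-incremental sketch in Python. It uses librosa for onset detection and MFCC extraction and scikit-learn for clustering; the fixed k-means step and the bigram counts are simplified stand-ins for the paper's incremental clustering tree and its hierarchical N-grams / conceptual Boltzmann machine, and the file name is a placeholder.

from collections import Counter, defaultdict

import librosa
import numpy as np
from sklearn.cluster import KMeans

y, sr = librosa.load("drum_loop.wav", sr=None, mono=True)  # placeholder file

# 1) Segmentation by onset detection.
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")

# 2) Timbre representation: one mean MFCC vector per segment.
segments = [s for s in np.split(y, onsets) if len(s) > 0]
mfccs = np.array([librosa.feature.mfcc(y=s, sr=sr, n_mfcc=13).mean(axis=1)
                  for s in segments])

# 3) Discretization into a symbol sequence (a fixed number of clusters
#    here, unlike the adaptive clustering tree of the paper).
labels = KMeans(n_clusters=4, n_init=10).fit_predict(mfccs)

# 4) Statistical regularities: bigram counts over the symbol sequence.
bigrams = defaultdict(Counter)
for a, b in zip(labels[:-1], labels[1:]):
    bigrams[a][b] += 1

# 5) Prediction of the next sound event given the last observed symbol.
last = labels[-1]
prediction = bigrams[last].most_common(1)[0][0] if bigrams[last] else last
print("predicted next cluster:", prediction)

# Evaluation against ground-truth timbre classes would use the adjusted
# Rand index, which tolerates a mismatch between the number of clusters
# and the number of classes, e.g.:
#   from sklearn.metrics import adjusted_rand_score
#   ari = adjusted_rand_score(ground_truth_labels, labels)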

Audio-Visual Material

Videos

New clusters (orange circles, yellow diamonds) emerge on the fly. Shown are a scatter plot of the projection of the MFCC vectors (timbre representation) onto their first two principal components (above), the incremental clustering tree (below left), and a tree-map representation of the number of instances per cluster (below right). (Fig. 10)

Two clusters (circles, squares) merge into one cluster (squares). Shown are a scatter plot of the projection of the MFCC vectors (timbre representation) onto their first two principal components (above), the incremental clustering tree (below left), and a tree-map representation of the number of instances per cluster (below right). (Fig. 9)
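The scatter plots in both videos show the per-event MFCC vectors projected onto their first two principal components. The following is a minimal sketch of such a view, assuming arrays mfccs (one MFCC vector per sound event) and labels (cluster index per event), for instance as produced in the pipeline sketch above; it is an illustration, not the code used to render the videos.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the timbre vectors onto the first two principal components.
points = PCA(n_components=2).fit_transform(mfccs)

# Scatter plot coloured by cluster label.
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("MFCC timbre vectors, first two principal components")
plt.show()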

Audio

Voice data

Informal, short, low-quality voice recordings of very simplified beatboxing, used to evaluate the system in Section 3.3 (Testing of Processing Stages with Audio Recordings).

Example from Fig. 8

Example from Fig. 7

ENST data

We used a subset of the ENST-Drums database. In our simulations we employed the audio files at 44.1 kHz/stereo. However, here we can only make the files available downsampled to 8 kHz/mono. To obtain the original audio data, please refer to the ENST-Drums website.
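As a hypothetical illustration, the 8 kHz/mono versions offered here could be derived from the original 44.1 kHz/stereo recordings along the following lines in Python, using librosa and soundfile; the file names are placeholders.

import librosa
import soundfile as sf

# Load, downmix to mono, and resample to 8 kHz in one step.
y, sr = librosa.load("enst_original.wav", sr=8000, mono=True)
sf.write("enst_8k_mono.wav", y, sr)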

Examples

Example of how two clusters merge into one cluster in Section 3.4 (Examples; Fig. 9, audio used in the video above).

Example of how new clusters are generated in Section 3.4 (Examples; Fig. 10, audio used in the video above).