Final project for the UPC Postgraduate Course Artificial Intelligence with Deep Learning, edition Spring 2021
Team: Anne-Kristin Fischer, Joan Prat Rigol, Eduard Rosés Gibert, Marina Rosés Gibert
Advisor: Gerard I. Gállego
To install the project we recommend using a virtual environment (venv). Steps to follow:
sudo apt update
sudo apt install python3 python3-venv python3-dev
python3 -m venv .venv --prompt aidl-lyrics-recognition
source .venv/bin/activate
pip install -r requirements.txt
- Introduction
- Data set
- Working Environment
- General Architecture
- First Tests
- Results
- Results improvement
- The Google Cloud instance
- Conclusions
- Next steps
- References
To this day, little research has been done on music lyrics recognition, which is still considered a complex task. It can be approached as two subtasks:
- The singing voice needs to be extracted from the song by means of source separation. What seems to be an easy task for the human brain remains a brain teaser for digital signal processing because of the complex mixture of signals.
- The second subtask aims to transcribe the extracted singing voice into written text. This can be thought of as a speech recognition task. A lot of progress has been made on standard speech recognition. However, experiments with music have made evident that recognising the text of a singing voice is harder than pure speech recognition because of the richer acoustic characteristics of singing.
Practical applications of music lyrics recognition, such as the creation of karaoke versions or music information retrieval tasks, motivate us to tackle the aforementioned challenges.
Our decision to work on a lyrics recognition task with deep learning techniques is an attempt to combine several of our personal and professional interests. All team members have a more or less professional background in the music industry, in addition to a particular interest in source separation tasks and natural language processing.
- Extract the voice of a song and transcribe the lyrics with Demucs + Wav2Vec
- Analysis of results
- Deploy a web app for lyrics extraction
- Suggestions for further studies and investigation
To reach our goal, we set up the following milestones:
- Find a suitable data set
- Preprocess the data for its implementation into the model
- Define the model
- Implement the model
- Train the model
- Analyse the obtained results
- Implement the project inside a web application
- Make suggestions for further investigation
To train our model we opted for the [DALI data set](https://github.com/gabolsgabs/DALI), published in 2018. It is to this day the biggest data set in the field of singing voice research that aligns audio to notes and their lyrics to a high quality standard. Access was granted to us for the first version, DALI v1, with 5358 songs in full duration and in multiple languages. For more information please also check [this article](https://transactions.ismir.net/articles/10.5334/tismir.30/), published by the International Society for Music Information Retrieval.
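As a minimal sketch of how the DALI v1 annotations can be loaded, the snippet below uses the `DALI` Python package from the repository linked above. The helper and attribute names follow the package's README, but the data path is a placeholder and exact keys may differ between package versions.

```python
import DALI as dali_code

# Path to the folder with the DALI v1 annotation files -- adjust to your setup.
dali_data_path = "data/DALI_v1.0"

# Load every annotated entry into a dict keyed by DALI id.
dali_data = dali_code.get_the_DALI_dataset(dali_data_path, skip=[], keep=[])

# Each entry bundles metadata and note/word-level alignments.
entry = next(iter(dali_data.values()))
print(entry.info["title"], entry.info["artist"], entry.info["metadata"]["language"])

# Word-level annotations: text plus its (start, end) time in seconds.
for word in entry.annotations["annot"]["words"][:5]:
    print(word["text"], word["time"])
```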
To develop the base model with 395 MM parameters, we used Google Colab as it was fast and easy for us to access, and we tracked the first free tests with [wandb](https://wandb.ai/site). For the full training with 580 MM parameters we then switched to a VM instance on Google Cloud.
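For reference, this is the kind of experiment tracking we mean; the project name, config values and metrics below are illustrative placeholders, not the exact setup of this project.

```python
import random
import wandb

# Start a run; project name and config are illustrative.
wandb.init(project="aidl-lyrics-recognition", config={"lr": 1e-4, "batch_size": 8})

for epoch in range(3):
    # Dummy values standing in for the real training/validation losses.
    train_loss = random.random()
    val_loss = random.random()
    wandb.log({"epoch": epoch, "train/loss": train_loss, "val/loss": val_loss})

wandb.finish()
```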
Little research has been done so far on music lyrics recognition in general, and mostly spectrograms in combination with CNNs are used. In this project we explore a potentially high-performing alternative by combining two strong models: the Demucs model for the source separation task and a Wav2Vec model for the transcription task. Demucs is currently the best-performing waveform-based model for source separation and so far the only waveform-based model that can compete with the more commonly used spectrogram-based models. Wav2Vec is considered the current state-of-the-art model for automatic speech recognition.
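To illustrate the two-stage pipeline, here is a rough sketch combining the Demucs Python API with a pretrained Wav2Vec 2.0 checkpoint from Hugging Face. The model identifiers, the input file and the exact Demucs call signatures are assumptions; they may differ between demucs versions and from the code actually used in this project.

```python
import torch
import torchaudio
from demucs.pretrained import get_model
from demucs.apply import apply_model
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# 1) Source separation: isolate the vocal stem from the full mix (44.1 kHz stereo).
separator = get_model("htdemucs")               # pretrained model name is an assumption
mix, sr = torchaudio.load("song.wav")
sources = apply_model(separator, mix[None], split=True)[0]       # (sources, channels, time)
vocals = sources[separator.sources.index("vocals")].mean(dim=0)  # mono vocal track

# 2) Transcription: resample to 16 kHz and run Wav2Vec 2.0 with greedy CTC decoding.
vocals_16k = torchaudio.functional.resample(vocals, sr, 16_000)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(vocals_16k.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = asr(inputs.input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```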
For training and validation we opted for CTC (Connectionist Temporal Classification) loss.
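For completeness, a small PyTorch example of how CTC loss is computed over per-frame character logits; the dimensions are toy values chosen only to make the shapes explicit.

```python
import torch
import torch.nn as nn

# Toy dimensions: T time steps, N batch size, C characters (blank token at index 0).
T, N, C = 50, 2, 32
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)

# Target character indices (1..C-1, no blanks) and the true lengths of each sequence.
targets = torch.randint(low=1, high=C, size=(N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```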
Preprocessing the data set correctly for our purpose proved to be one of the major obstacles we encountered. We focused on songs in English only, that is 3491 songs in full duration. Preprocessing included omitting special characters as well as negative time stamps and converting the lyrics to upper case. To make sure we obtain meaningful results after training and to avoid cut-off lyrics, we split the songs into chunks. For these chunks we discarded words split across multiple notes at the beginning and end of each chunk, and we cut out silent passages without voice. To make the data accessible to our model, the audio waveform needed to be resampled to a sample rate of 44100 Hz. As alignment is done automatically in DALI and ground truth is available only for a few audio samples, we followed the authors' suggestions for the train/validation/test split. That is:
| Split      | NCCt correlation   | Tracks (v1) |
| ---------- | ------------------ | ----------- |
| Test       | NCCt >= .94        | 167         |
| Validation | .94 > NCCt >= .925 | 423         |
| Train      | .925 > NCCt >= .8  | 4768        |
where NCCt is a correlation score which indicates how accurate the automatic alignment is; higher means better. The number of tracks refers to the whole data set, including songs in other languages.
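A sketch of how the lyric normalisation and the NCCt-based split described above could look in code. The DALI attribute names (`entry.info["scores"]["NCC"]`, `entry.info["metadata"]["language"]`) follow the package's documentation and the regular expression is only one possible definition of "special characters"; treat both as assumptions rather than the exact code of this project.

```python
import re
import DALI as dali_code

def normalise_lyrics(text: str) -> str:
    """Upper-case the lyrics and keep only letters, digits, spaces and apostrophes."""
    text = text.upper()
    return re.sub(r"[^A-Z0-9' ]+", " ", text).strip()

dali_data = dali_code.get_the_DALI_dataset("data/DALI_v1.0", skip=[], keep=[])

splits = {"train": [], "validation": [], "test": []}
for entry in dali_data.values():
    if entry.info["metadata"]["language"] != "english":
        continue  # we only kept songs in English
    ncc = entry.info["scores"]["NCC"]
    if ncc >= 0.94:
        splits["test"].append(entry)
    elif ncc >= 0.925:
        splits["validation"].append(entry)
    elif ncc >= 0.8:
        splits["train"].append(entry)

print({name: len(entries) for name, entries in splits.items()})
```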
Now as then, the community laments the lack of large, well-structured, aligned data sets for music information retrieval tasks.
Further research could be done for:
- melody extraction
- chord transcription
- adding a language model to improve the results of the transcription task (see the sketch after this list)
- summarisation of the lyrics
- pitch recognition
- contributing to larger data sets of high quality
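As an illustration of the language-model idea mentioned above, a hedged sketch that rescores the Wav2Vec CTC output with an n-gram language model via `pyctcdecode`. The library choice, the KenLM file name and the random logits are assumptions for illustration only, not something used in this project.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# Vocabulary of the CTC head, ordered by token id.
vocab_dict = processor.tokenizer.get_vocab()
vocab = [tok for tok, _ in sorted(vocab_dict.items(), key=lambda kv: kv[1])]

# Beam-search decoder that rescores hypotheses with an n-gram LM (path is a placeholder).
decoder = build_ctcdecoder(vocab, kenlm_model_path="lyrics_4gram.arpa")

# `logits` would be the (time, vocab) output of the Wav2Vec model for one chunk.
logits = np.random.randn(100, len(vocab)).astype(np.float32)
print(decoder.decode(logits))
```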
- https://towardsdatascience.com/wav2vec-2-0-a-framework-for-self-supervised-learning-of-speech-representations-7d3728688cae
- https://ieeexplore.ieee.org/abstract/document/5179014
- https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1318.pdf
- https://ieeexplore.ieee.org/document/6682644?arnumber=6682644
- https://www.researchgate.net/publication/42386897_Automatic_Recognition_of_Lyrics_in_Singing
- https://europepmc.org/article/med/20095443
- https://asmp-eurasipjournals.springeropen.com/articles/10.1155/2010/546047
- https://arxiv.org/abs/2102.08575




