# Home
This project is an attempt to improve on the results obtained by Hawthorne et al. in "Sequence-to-Sequence Piano Transcription with Transformers" [1]. We plan to improve the model by pretraining it on a large synthetic dataset of paired raw audio and MIDI files, then fine-tuning it for more specific tasks (e.g., piano transcription on the MAESTRO dataset).
The current state of the art in music transcription comes from a 2021 paper out of Google Research [1]. The authors show that a generic encoder-decoder architecture achieves high performance translating spectrogram inputs into MIDI-like output events. Their results suggest that research effort toward improving transcription should focus more on dataset creation and labeling than on custom model design.
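To make the "MIDI-like output events" target concrete, here is a minimal sketch of how notes can be serialized into a flat event sequence. The exact vocabulary (time-shift, note-on, and note-off events) is an assumption loosely inspired by the paper, not its precise token set:

```python
# Hypothetical MIDI-like event vocabulary (an illustrative assumption, not the
# paper's exact tokens): ("shift", n) advances time by n * time_step seconds,
# ("on", pitch) starts a note, ("off", pitch) ends it.

def tokenize(notes, time_step=0.01):
    """Convert (onset_s, offset_s, pitch) notes into a flat event sequence."""
    events = []
    for onset, offset, pitch in notes:
        events.append((onset, 1, ("on", pitch)))
        events.append((offset, 0, ("off", pitch)))
    events.sort()  # chronological; note-offs (flag 0) before note-ons at the same time

    tokens, now = [], 0.0
    for t, _, ev in events:
        steps = round((t - now) / time_step)
        if steps > 0:
            tokens.append(("shift", steps))
            now += steps * time_step
        tokens.append(ev)
    return tokens

# Two overlapping notes: C4 from 0.0-0.5 s, E4 from 0.25-0.75 s.
notes = [(0.0, 0.5, 60), (0.25, 0.75, 64)]
print(tokenize(notes))
# → [('on', 60), ('shift', 25), ('on', 64), ('shift', 25), ('off', 60), ('shift', 25), ('off', 64)]
```

A decoder trained against such a sequence only has to emit one event per step, which is what lets a generic seq2seq architecture handle transcription.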
With this in mind, we will attempt to reproduce their transcription results using a similar model architecture trained on a large synthetic dataset of aligned MIDI and raw audio. We will then fine-tune our model on the MAESTRO [2] dataset so we can directly compare our results with Google's Transformer.
For our first venture into pretraining a transformer for music transcription, we will focus on piano. In the future, however, we envision pretraining a model on a variety of synthetic instruments and fine-tuning it for those instruments as well. Currently, the lack of suitable datasets is the largest obstacle to this idea.
Symbolic representations of music are much more flexible than their raw audio counterparts. Musicians, producers, composers, and teachers make use of MIDI (a common symbolic music format) frequently in their core tasks. Benefits of using symbolic representations include:
- Symbolic representations provide accurate instructions on pitch, rhythm, and dynamics and provide visual clarity, enabling quick understanding of complex musical structures.
- Symbolic representations are useful as practice tracks: it is easy to speed up or slow down the tempo, remove individual parts, or make parts louder or softer.
- Symbolic representations such as MIDI allow for easy arrangement changes.
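The benefits above can be illustrated with a toy example. Representing notes as plain tuples (a simplification of real MIDI messages, assumed here for illustration), the edits listed above reduce to arithmetic:

```python
# Toy illustration of symbolic flexibility: with notes as
# (onset_s, duration_s, pitch, velocity) tuples, common edits are arithmetic.
# The tuple layout is an assumption for this sketch, not a real MIDI format.

def change_tempo(notes, factor):
    """factor > 1 slows playback (times stretch); factor < 1 speeds it up."""
    return [(on * factor, dur * factor, p, v) for on, dur, p, v in notes]

def transpose(notes, semitones):
    """Shift every pitch by a number of semitones."""
    return [(on, dur, p + semitones, v) for on, dur, p, v in notes]

def change_dynamics(notes, gain):
    """Scale velocities, clamped to MIDI's 0-127 range."""
    return [(on, dur, p, min(127, max(0, round(v * gain)))) for on, dur, p, v in notes]

melody = [(0.0, 0.5, 60, 80), (0.5, 0.5, 64, 90)]
print(change_tempo(melody, 2.0))   # half speed: all times doubled
print(transpose(melody, 12))       # up one octave: pitches 72 and 76
print(change_dynamics(melody, 1.5))  # louder, clamped at 127
```

The same edits on raw audio would require signal-processing techniques (time stretching, pitch shifting, source separation), each of which degrades quality.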
Another important motivator for this project is the relative dearth of symbolic music datasets and matched raw audio/symbolic datasets. A high-fidelity audio transcription model could form part of a pipeline for creating large and/or genre-specific datasets for a wide variety of instruments and purposes.
To create our large dataset of synthetic data, we used the Lakh MIDI Dataset [3]. Open questions:
- How much faster will dataloading be if spectrograms are saved instead of raw audio? Is it more space-efficient?
- What is the distribution of note sparsity and track length across the data?
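On the space-efficiency question, a back-of-envelope estimate is possible before measuring anything. The parameters below (16 kHz 16-bit mono raw audio; a mel spectrogram with a 512-sample hop, 229 mel bins, stored as float32) are illustrative assumptions, not settings confirmed by this project:

```python
# Back-of-envelope storage comparison for one second of audio.
# All parameters are assumptions for illustration.

SAMPLE_RATE = 16_000
raw_bytes = SAMPLE_RATE * 2                 # 16-bit mono = 2 bytes per sample

HOP, MEL_BINS, FLOAT_BYTES = 512, 229, 4    # hop size, mel bins, float32
frames = SAMPLE_RATE // HOP                 # ~31 spectrogram frames per second
spec_bytes = frames * MEL_BINS * FLOAT_BYTES

print(f"raw audio:   {raw_bytes} B/s")      # 32000 B/s
print(f"spectrogram: {spec_bytes} B/s")     # 28396 B/s
print(f"spectrogram/raw ratio: {spec_bytes / raw_bytes:.2f}")
```

Under these assumptions the spectrogram is only marginally smaller than raw PCM (and storing it as float16 would halve that again), so the main win from caching spectrograms would be skipping the STFT at load time rather than saving disk space.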
"Raw audio is excellent for emotional expression, live performance, and creativity, while symbolic representations are valuable for precision, notation, collaboration, and formal learning."
[1] Hawthorne, Curtis, et al. "Sequence-to-sequence piano transcription with transformers." arXiv preprint arXiv:2107.09142 (2021).
[2] Hawthorne, Curtis, et al. "Enabling factorized piano music modeling and generation with the MAESTRO dataset." arXiv preprint arXiv:1810.12247 (2018).
[3] Raffel, Colin. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching." PhD thesis, 2016.