Synthetic Dataset
To obtain a large amount of matched audio and MIDI data, we rely on the Lakh MIDI Dataset (LMD). The LMD is a collection of 176,581 unique MIDI files scraped from publicly available sources on the internet.
We downloaded the LMD-full collection of deduped MIDI files from Colin Raffel's site. Then we separated each song into individual (non-drum) tracks, resulting in ~1.3 million files. We then filtered out any tracks in this set that were excessively sparse.
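The sparsity filter described above is not specified in detail, so the following is only a minimal sketch of one way such a check could work. It assumes each track has already been reduced to a list of `(start, end)` note times in seconds. The function names and the thresholds `min_notes` and `min_fraction` are hypothetical placeholders, not the values used for this dataset.

```python
from typing import List, Tuple

# A note is represented only by its onset and offset, in seconds.
Note = Tuple[float, float]


def sounding_fraction(notes: List[Note], total_duration: float) -> float:
    """Return the fraction of total_duration during which at least one note sounds."""
    if total_duration <= 0 or not notes:
        return 0.0

    covered = 0.0
    cur_start = cur_end = None
    # Merge overlapping note intervals so overlaps are not double counted.
    for start, end in sorted(notes):
        start = max(0.0, start)
        end = min(end, total_duration)
        if end <= start:
            continue  # note lies outside the track or has no length
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start

    return covered / total_duration


def is_too_sparse(
    notes: List[Note],
    total_duration: float,
    min_notes: int = 10,         # hypothetical threshold
    min_fraction: float = 0.1,   # hypothetical threshold
) -> bool:
    """Flag a track with very few notes or mostly silence (assumed criteria)."""
    if len(notes) < min_notes:
        return True
    return sounding_fraction(notes, total_duration) < min_fraction
```

A track would be dropped when `is_too_sparse(notes, duration)` returns `True`. Other definitions of sparsity (for example, notes per second) would fit the same structure.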
Once we had narrowed down our list of MIDIs, we converted them to raw audio using the software synthesizer FluidSynth. We used the GeneralUser GS SoundFont and the "Bright Grand Piano" instrument. In the future, synthesizing many other instruments as well could make the transcription model more robust.
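As a rough illustration of the rendering step, the sketch below builds a FluidSynth command line for offline rendering and optionally runs it. It assumes the `fluidsynth` executable is on the PATH. The helper names, file paths, and the 16 kHz sample rate are assumptions for illustration only, and the sketch does not show how the instrument was forced to "Bright Grand Piano".

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def fluidsynth_command(
    midi_path: PathLike,
    soundfont_path: PathLike,
    wav_path: PathLike,
    sample_rate: int = 16000,  # assumed value; the source does not state one
) -> List[str]:
    """Build the argument list for rendering one MIDI file to a WAV file."""
    return [
        "fluidsynth",
        "-ni",                    # no MIDI input device, no interactive shell
        "-F", str(wav_path),      # fast-render to this audio file
        "-r", str(sample_rate),   # output sample rate in Hz
        str(soundfont_path),
        str(midi_path),
    ]


def render(
    midi_path: PathLike,
    soundfont_path: PathLike,
    wav_path: PathLike,
    sample_rate: int = 16000,
) -> None:
    """Run FluidSynth and raise CalledProcessError if it exits with an error."""
    cmd = fluidsynth_command(midi_path, soundfont_path, wav_path, sample_rate)
    subprocess.run(cmd, check=True)
```

For example, `render("track_0001.mid", "GeneralUser_GS.sf2", "track_0001.wav")` would render a single track, where all three file names are placeholders.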