How Spotify trained an AI to transcribe music

Basic Pitch, an open-source tool on the web, can take sound recordings and turn them into a computer-recognizable MIDI score.

Before electronic music became an umbrella category for a distinct genre of modern music, the term referred to a technique for producing music that involved transferring audio made by real-life instruments into waveforms that could be recorded on tapes, or played through amps and loudspeakers. During the early to mid-1900s, special electronic instruments and music synthesizers (machines hooked up to computers that can electronically generate and modify sounds from a variety of instruments) started becoming popular. 

But there was a problem: almost every company used its own computer programming language to control its digital instruments, making it hard for musicians to combine instruments made by different manufacturers. So, in 1983, the industry came together and created a communications protocol called Musical Instrument Digital Interface, or MIDI, to standardize how external audio sources transmit messages to computers, and vice versa.

MIDI works like a command that tells the computer what instrument was played, what notes were played on the instrument, how loud and how long each note was played for, and with which effects, if any. The instructions cover the individual notes of individual instruments, and allow for the sound to be accurately played back. When songs are stored as MIDI files instead of regular audio files (like MP3 or CD audio), musicians can easily edit the tempo, key, and instrumentation of the track. They can also take out individual notes or entire instrument sections, change the instrument type, or duplicate a main vocal track and turn it into a harmony. Because MIDI keeps track of what notes get played at what times by what instruments, it is essentially a digital score, and software like Notation Player can effortlessly transcribe MIDI files into sheet music. 
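To make the idea of MIDI as a stream of commands concrete, here is a minimal sketch in Python of how three common channel messages are laid out as bytes. The status values (0x80 for note off, 0x90 for note on, 0xC0 for program change) come from the MIDI 1.0 specification; the function names are hypothetical helpers written for this illustration, not part of any library.

```python
NOTE_OFF = 0x80        # status for "stop playing this note"
NOTE_ON = 0x90         # status for "start playing this note"
PROGRAM_CHANGE = 0xC0  # status for "switch to this instrument"


def _check(value, upper, name):
    """Reject values outside the range a MIDI field can hold."""
    if not 0 <= value <= upper:
        raise ValueError(f"{name} must be between 0 and {upper}")


def note_on(channel, note, velocity):
    """Build a note-on message: which note, and how hard it was struck."""
    _check(channel, 15, "channel")
    _check(note, 127, "note")
    _check(velocity, 127, "velocity")
    return bytes([NOTE_ON | channel, note, velocity])


def note_off(channel, note, velocity=0):
    """Build a note-off message for a note that is currently sounding."""
    _check(channel, 15, "channel")
    _check(note, 127, "note")
    _check(velocity, 127, "velocity")
    return bytes([NOTE_OFF | channel, note, velocity])


def program_change(channel, program):
    """Build a program-change message that selects an instrument."""
    _check(channel, 15, "channel")
    _check(program, 127, "program")
    return bytes([PROGRAM_CHANGE | channel, program])


# Middle C (note 60) struck fairly hard on the first channel.
print(note_on(0, 60, 100).hex())  # 903c64
```

Note that no single message carries a duration. How long a note lasts is implied by the time that passes between its note-on and its matching note-off.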


Although MIDI is convenient for a lot of reasons, it usually requires musicians to have some sort of interface, like a MIDI controller keyboard, or knowledge of how to program notes by hand. But a tool made publicly available by engineers from Spotify and Soundtrap this summer, called Basic Pitch, promises to simplify this process and open up MIDI to musicians who lack specialty gear or coding experience. 

“Similar to how you ask your voice assistant to identify the words you’re saying and also make sense of the meaning behind those words, we’re using neural networks to understand and process audio in music and podcasts,” Rachel Bittner, a Spotify scientist who worked on the project, said in a September blog post. “This work combines our ML research and practices with domain knowledge about audio—understanding the fundamentals of how music works, like pitch, tone, tempo, the frequencies of different instruments, and more.”

Bittner envisions that the tool can serve as a “starting point” transcription that artists can make in the moment that saves them the trouble of writing out notes and melodies by hand. 



Previous research in this space has made the process of building this model easier, to an extent. There are devices called Disklaviers that record real-time piano performances and store them as MIDI files. And there are many audio recordings with paired MIDI files that researchers can use to create algorithms. “There are other tools that do many parts of what Basic Pitch does,” Bittner said in the podcast NerdOut@Spotify. “What I think makes Basic Pitch special is that it does a lot of things all in one tool, rather than having to use different tools for different types of audio.” 

Additionally, an advantage it offers over other note-detection systems is that it can track multiple notes from more than one instrument simultaneously. So, it can transcribe voice, guitar, and singing all at once (here’s a paper the team published this year on the tech behind this). Thanks to a pitch bend detection mechanism, Basic Pitch can also support sound effects like vibrato (a wiggle on a note), glissando (sliding between two notes), and bends (fluctuations in pitch). 

To understand the components in the model, here are some basic things to know about music: Perceived pitch corresponds to the fundamental frequency, otherwise known as the lowest frequency of a vibrating object (like a violin string or a vocal cord). Music can be represented as a bunch of sine waves, and each sine wave has its own particular frequency. In physics, most sounds we hear as pitched have other tones harmonically spaced above them. The hard thing that pitch tracking algorithms have to do is to wrap all the extra pitches down into a main one, Bittner noted. The team used something called a harmonic constant-Q transform to model the structure in pitched sound by harmonic, frequency, and time. 
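The relationship between a fundamental and the tones stacked above it can be sketched in a few lines of Python. This illustration assumes standard concert tuning (A4 = 440 Hz = MIDI note 69) and twelve-tone equal temperament, neither of which is stated in the article; the function names are hypothetical.

```python
import math

A4_HZ = 440.0  # assumed concert pitch, not taken from the article
A4_MIDI = 69   # MIDI note number conventionally assigned to A4


def midi_to_hz(note):
    """Frequency of a MIDI note number in equal temperament."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(freq):
    """Fractional MIDI note number for a frequency in hertz."""
    return A4_MIDI + 12.0 * math.log2(freq / A4_HZ)


def harmonics(f0, count):
    """The first `count` harmonics of f0, starting with f0 itself."""
    return [f0 * k for k in range(1, count + 1)]


# One played A4 contains energy at 440, 880, and 1320 Hz. Read
# naively, those partials look like three separate notes.
for freq in harmonics(midi_to_hz(69), 3):
    print(freq, round(hz_to_midi(freq)))
```

The loop shows why the "wrapping down" step is hard: the second and third harmonics of a single A4 land on the MIDI notes for A5 (81) and E6 (88), so a transcriber has to decide that all three belong to one note rather than to a chord.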

The Spotify team wanted to make the model fast and low-energy, so it had to be less computationally expensive and make fewer inputs go further. That means the machine learning model itself had to have simple parameters and few layers. Basic Pitch is based on a convolutional neural network (CNN) that has less than 20 MB peak memory and fewer than 17,000 parameters. Interestingly, CNNs were among the first models known to be good at image recognition. For this product, Spotify trained and tested its CNN on a variety of open datasets for vocals, acoustic guitar, piano, synthesizers, and orchestra, across many music genres. “In order to allow for a small model, Basic Pitch was built with a harmonic stacking layer and three types of outputs: onsets, notes, and pitch bends,” Spotify engineers wrote in a blog post.
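The general idea behind harmonic stacking can be sketched as follows. This is an illustration of the concept under stated assumptions, not Spotify's implementation: it assumes a single frame of a log-frequency spectrum (such as one column of a constant-Q transform) with a fixed number of bins per octave, and the function names and toy values are hypothetical. On a log-frequency axis, harmonic h of any bin sits a constant round(bins_per_octave * log2(h)) bins higher, so shifted copies of the frame line a note's harmonics up at the same index.

```python
import math


def harmonic_offsets(harmonics, bins_per_octave):
    """Bin shift for each harmonic on a log-frequency axis."""
    return [round(bins_per_octave * math.log2(h)) for h in harmonics]


def harmonic_stack(frame, harmonics, bins_per_octave):
    """Return one shifted copy of `frame` per harmonic.

    stacked[k][b] holds the energy found at harmonic k of bin b,
    or 0.0 when that harmonic falls outside the frame.
    """
    n = len(frame)
    stacked = []
    for shift in harmonic_offsets(harmonics, bins_per_octave):
        row = [frame[b + shift] if 0 <= b + shift < n else 0.0
               for b in range(n)]
        stacked.append(row)
    return stacked


# Toy frame: 24 bins, 12 per octave, one note whose fundamental is
# in bin 0 and whose 2nd and 3rd harmonics land in bins 12 and 19.
frame = [0.0] * 24
for b in (0, 12, 19):
    frame[b] = 1.0

stacked = harmonic_stack(frame, harmonics=[1, 2, 3], bins_per_octave=12)
evidence = [sum(channel[b] for channel in stacked) for b in range(24)]
print(evidence.index(max(evidence)))  # 0
```

After stacking, all the evidence for the note sits at one bin index across the channels, which is the kind of local pattern a small convolutional filter can pick up without needing a wide receptive field.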


So what is the benefit of using machine learning for a task like this? Bittner explained in the podcast that they could build a simple representation of pitch by using audio clips of one instrument played in one room on one microphone. But machine learning allows them to discern similar underlying patterns even when they have to work with varying instruments, microphones, and rooms. 

Compared to a 2020 multi-instrument automatic music transcription model trained on data from MusicNet, Basic Pitch was more accurate at detecting notes. However, Basic Pitch performed worse than models trained to detect notes from specific instruments, like guitar and piano. Spotify engineers acknowledge that the tool is not perfect, and they are eager to hear feedback from the community and see how musicians use it.

Curious to see how it works? Try it out here—you can record sounds directly on the web portal or upload an audio file.