How AI is Empowering (and Disempowering) Musicians

Exploring how generative AI is both empowering and disempowering musicians as it starts to reshape the world of music.


Robot with a piano

Image generated using Stable Diffusion.

Generative AI's ongoing disruption of the world raises questions about its impact on musicians. Creative people, hereafter referred to as creatives, seem to be the first group whose output has (arguably) been replicated by generative AI. To me, computers have always lacked creativity: they're good at computing and following the instructions they're given, yet now they're breaking into the creative space and making art? I find this fascinating, and I wanted to explore what's going on with AI and generative AI in the area of music production.

Music production is particularly interesting to me because (1) I'm a consumer and (2) my best friend is an incredibly talented electronic dance music (EDM) producer. Looking at music and the creative industry as a whole, it's an unfortunate truth that success does not necessarily correlate with talent. My passion projects are typically in the EDM niche for this reason, with the goal of delivering value in the space by helping talented musicians succeed.

What Do Music Producers Do?

Speaking as a layman, the workflow of a producer looks something like the following:

Producers start with writing lyrics and melodies. From there comes the production work of creating vocals and fleshing out the song with sounds (i.e., instruments). Once that is done, there is audio engineering work required to mix and master the produced song. This step makes the song sound crisp and clear on different speakers.

Once a song is created, if a producer wants to upload their music to a music streaming platform they need to pair it with another creative asset: cover art.

In summary, the creative workflow for a producer looks something like:

* Music
    * Writing
        * Lyrics
        * Melodies
    * Producing
        * Vocals
        * Sounds
    * Mixing & mastering
* Creating complementary art

Using AI to Improve the Music Production Workflow

Now that we have a general idea of the music production journey, let's look at some ways we can improve and optimize it with AI.

Generating Vocals

Getting good vocals for songs is a huge roadblock for producers. To get professional-quality vocals, producers often need to pay freelance singers, which can cost anywhere from hundreds to thousands of dollars.

There are some absolutely insane AI-generated vocals on the internet nowadays. For example, check out this Gangsta's Paradise cover 'by' Frank Sinatra.

The cover is cool, but is such technology accessible to modern-day producers? The answer: yes. It's extremely accessible - all the software used is free and open source. There is a Google Colab notebook with instructions on how to isolate existing vocals and convert the voice to a different person's. The catch is that the vocals are limited to the available voice models, and those models can only be applied to existing vocals. Regardless, this capability will most likely have a large impact on the music industry: an artist's sound can now be completely digitized and replicated by anyone.
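The first step of that pipeline, isolating vocals from a finished track, can be reproduced locally with an open-source source separator such as Demucs (my pick for illustration; the Colab notebook may use different tooling). A minimal sketch, assuming Demucs is installed via pip and a local file named song.mp3 exists:

```python
# Sketch: isolate vocals from a mixed track with Demucs (pip install demucs).
# The file name "song.mp3" is illustrative.
import demucs.separate

# "--two-stems vocals" splits the track into vocals and everything else
# (no_vocals), writing the stems under ./separated/<model>/song/.
demucs.separate.main(["--two-stems", "vocals", "song.mp3"])
```

The isolated vocal stem is what a voice-conversion model would then be applied to.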

There are some tools that create speech just from text, such as murf.ai and uberduck.ai, but from my tinkering they seem pretty limited and not really geared towards lyrics. I imagine generating vocals from just lyrics that fit a song is difficult: the tempo, key, and general vibe of the song's instrumental must all be compatible. I'm hopeful we'll see more progress in this approach in the future.
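Neither murf.ai nor uberduck.ai is open source, but for a taste of what text-to-speech looks like programmatically, here's a minimal sketch using the open-source Coqui TTS library (my substitution, not what those sites run). Note that it produces spoken words, not singing, which is exactly the gap when it comes to lyrics:

```python
# Sketch: plain text-to-speech with Coqui TTS (pip install TTS).
# This produces spoken audio, not singing -- melody, tempo, and key
# would still need to be imposed on top somehow.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(
    text="Lyrics go here, but they'll come out spoken, not sung.",
    file_path="vocals_draft.wav",
)
```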

Mixing and Mastering

Mixing and mastering is another skillset that is largely separate from producing. "Mixing is when an engineer carves and balances the separate tracks in a song to sound good when played together. While mastering a song means putting the finishing touches on a track by enhancing the overall sound, creating consistency across the album, and preparing it for distribution." [1]

Producers will often pay experienced audio engineers to mix and master their songs, since the skills required to properly mix and master a song are largely disconnected from those required to produce one. Naturally, this makes it a good opportunity for automation with AI-based tools.

Currently this mixing and mastering workflow isn't fully automated, but it is being improved with AI-backed tools such as Ozone by iZotope. Ozone abstracts away some of the complexity of mixing and mastering, enabling music producers to easily tweak sounds to their liking. This is another area where I'm hopeful for further innovation.
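Ozone itself is a commercial DAW plugin, but to give a feel for what programmatic mastering primitives look like, here's a minimal sketch using Spotify's open-source pedalboard library. The effect chain and settings are illustrative assumptions, not what Ozone does internally:

```python
# Sketch: a toy "mastering" chain with Spotify's pedalboard
# (pip install pedalboard). Settings are illustrative, not a
# substitute for a real audio engineer.
from pedalboard import Pedalboard, Compressor, Gain, Limiter
from pedalboard.io import AudioFile

# A simple chain: gentle compression, makeup gain, then a limiter as a ceiling.
board = Pedalboard([
    Compressor(threshold_db=-16, ratio=2.5),
    Gain(gain_db=3),
    Limiter(threshold_db=-1),
])

with AudioFile("mix.wav") as f:  # assumes a local file "mix.wav"
    audio = f.read(f.frames)
    samplerate = f.samplerate

mastered = board(audio, samplerate)

with AudioFile("master.wav", "w", samplerate, mastered.shape[0]) as f:
    f.write(mastered)
```

Tools like Ozone layer AI-driven analysis on top of primitives like these, suggesting settings instead of making you pick them by hand.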

Generating Visual Art

One example of visual art musicians need is cover art for their songs. Every song on a music streaming platform is paired with cover art. Cover art is uniquely important for musicians because it is often a listener's first impression of a song, and frequently the reason they choose to listen at all.

Considering how important cover art is for a music producer, yet how completely disparate the skill sets of creating music and creating art are, it's clear that there are workflow improvements to be made. Generative AI in the world of 2023 can create photorealistic images, so I'm sure it can help with this use case.

I recently got Stable Diffusion installed locally on my computer (guide here). I immediately asked my EDM producer friend for the vibe of a song he's finished. His response: "idk maybe some silhouette of a girl trapped in a tunnel or something." With that illustrative description, I prompted Stable Diffusion with "cover art for an EDM song. The vibe of the song is silhouette of a girl trapped in a tunnel" and got the following result in about 15 seconds:

silhouette of a girl trapped in a tunnel

The image isn't perfect, but with further prompt tuning and combing through the outputs I'm sure it could be close. The previous workflow would have a producer either creating the image themselves or hiring someone to, both of which take a lot of time. With that said, I am positive that generative AI will become the go-to for cover art creation.
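For anyone who'd rather script the experiment than use a GUI, the same thing can be done with Hugging Face's diffusers library. A minimal sketch, where the checkpoint is a commonly used public one (my choice, not necessarily what my local install runs) and the prompt is the one from above:

```python
# Sketch: generating cover art with Stable Diffusion via Hugging Face
# diffusers (pip install diffusers transformers torch). Assumes a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # a commonly used public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = ("cover art for an EDM song. The vibe of the song is "
          "silhouette of a girl trapped in a tunnel")
image = pipe(prompt).images[0]
image.save("cover_art.png")
```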

Generative AI in Music

We talked about how music producers can improve their workflows, but can AI just generate the entirety of a song instead?

It is important to note that some songs have gone viral as "songs created by AI," but typically they just use the vocal generation method described above. In other words, everything besides the sound of the voice is human-made.

Let's see what kind of generative AI exists for music.

OpenAI

This song is generated by OpenAI's Jukebox, a neural network that generates raw audio. According to OpenAI, "A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps." [2] Generating raw audio is an incredibly compute-heavy task, even for an incredibly compute-heavy field. OpenAI goes into detail, stating, "It takes approximately 9 hours to fully render one minute of audio through our models." [3] To add to this complexity, in contrast to something like visual art, music is high fidelity: a small mistake in a song is extremely apparent.
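That timestep figure is easy to sanity-check with back-of-the-envelope arithmetic, since CD-quality audio is 44,100 samples per second:

```python
# Back-of-the-envelope check of OpenAI's "over 10 million timesteps" figure.
sample_rate = 44_100   # CD-quality audio: samples per second
song_length = 4 * 60   # a typical 4-minute song, in seconds
print(sample_rate * song_length)  # 10,584,000 -- over 10 million, as stated
```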

After listening through the OpenAI SoundCloud, my impression is that the audio quality is subpar, the vocals sound human but are jumbled, the tracks lack the feeling of end-to-end songs, and the instruments don't sound distinct, blurring instead into one incohesive sound.

At the time of writing, the Jukebox project no longer seems to be maintained or actively worked on.

Meta

Meta recently released MusicGen, a model for music generation, with an online demo tool available on Hugging Face. MusicGen takes text as input and outputs a music file based on the prompt.
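The same model behind the demo can be run locally through Meta's audiocraft library. A minimal sketch, where the small checkpoint and the prompt are my choices for illustration:

```python
# Sketch: text-to-music with Meta's MusicGen via audiocraft
# (pip install audiocraft).
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # seconds of audio to generate

# One prompt in, one waveform tensor out per prompt.
wavs = model.generate(["upbeat EDM track with a driving bassline"])

for i, wav in enumerate(wavs):
    # Writes clip_0.wav with loudness normalization.
    audio_write(f"clip_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```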

Like OpenAI, Meta takes the approach of training its model on raw audio; specifically, its dataset consists of 10,000 licensed songs from Shutterstock and Pond5. I'm unfamiliar with Shutterstock's presence in the audio space, but to me that feels like a recipe for bland music. And after personally playing with it, I can say the generated songs do feel very surface-level; the music feels best suited for the background of other content rather than as standalone pieces.

AIVA

In the generative music space there is a lesser-known name: AIVA (Artificial Intelligence Virtual Artist). An AIVA-generated song was recently used in NVIDIA's keynote.

Unlike Meta's and OpenAI's models, AIVA's models are not publicly accessible. There isn't even much information available on the web about how AIVA's models work; in fact, their blog hasn't published an article since 2018.

I was able to find this video interview with the founder, where he explains that AIVA's generative AI is trained on written notes from scores that have been digitized into MIDI format. This is a clear distinction from OpenAI's and Meta's models, which were trained on raw audio, and it could potentially explain the discrepancy in quality between them.
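To make the distinction concrete, here's a minimal sketch of what a symbolic (MIDI) representation looks like, using the mido library. The notes are an arbitrary C-major arpeggio I made up; the point is that a handful of note events can describe a passage that would take millions of samples as raw audio:

```python
# Sketch: a musical passage as symbolic MIDI events rather than raw audio
# samples (pip install mido).
import mido

mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)

# An arbitrary C-major arpeggio: each note is just (pitch, velocity, duration).
for note in [60, 64, 67, 72]:  # C4, E4, G4, C5
    track.append(mido.Message("note_on", note=note, velocity=64, time=0))
    track.append(mido.Message("note_off", note=note, velocity=64, time=480))

mid.save("arpeggio.mid")
```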

AIVA is hands down on another level compared to OpenAI's model and even Meta's recently released one. AIVA's music genuinely evokes emotion, provides a story-like atmosphere, and overall just sounds great. It is super exciting to see such great work coming out of AIVA, especially up against behemoths like OpenAI and Meta.

Conclusion

Overall, generative AI is here to stay, and as innovation continues, industries will be optimized and improved. Or potentially just completely replaced.

[1] https://www.izotope.com/en/learn/what-is-the-difference-between-mixing-and-mastering.html
[2] https://openai.com/research/jukebox#motivation-and-prior-work
[3] https://openai.com/research/jukebox#Limitations