AI music generators could be a boon for artists — but also problematic

9th October, 2022 // General Interest, Technology

[Image: Dance Diffusion robot]

It was only five years ago that electronic punk band YACHT entered the recording studio with a daunting task: They would train an AI on 14 years of their music, then synthesize the results into the album “Chain Tripping.”

“I’m not interested in being a reactionary,” YACHT member and tech writer Claire L. Evans said in a documentary about the album. “I don’t want to return to my roots and play acoustic guitar because I’m so freaked out about the coming robot apocalypse, but I also don’t want to jump into the trenches and welcome our new robot overlords either.”

But our new robot overlords are making a whole lot of progress in the space of AI music generation. Even though the Grammy-nominated “Chain Tripping” was released in 2019, the technology behind it is already becoming outdated. Now, the startup behind the open source AI image generator Stable Diffusion is pushing us forward again with its next act: making music.

Creating harmony

Harmonai is an organization with financial backing from Stability AI, the London-based startup behind Stable Diffusion. In late September, Harmonai released Dance Diffusion, an algorithm and set of tools that can generate clips of music by training on hundreds of hours of existing songs.

“I started my work on audio diffusion around the same time as I started working with Stability AI,” Zach Evans, who heads development of Dance Diffusion, told TechCrunch in an email interview. “I was brought on to the company due to my development work with [the image-generating algorithm] Disco Diffusion and I quickly decided to pivot to audio research. To facilitate my own learning and research, and make a community that focuses on audio AI, I started Harmonai.”

Dance Diffusion remains in the testing stages — at present, the system can only generate clips a few seconds long. But the early results provide a tantalizing glimpse at what could be the future of music creation, while at the same time raising questions about the potential impact on artists.

The emergence of Dance Diffusion comes several years after OpenAI, the San Francisco-based lab behind DALL-E 2, detailed its grand experiment with music generation, dubbed Jukebox. Given a genre, artist and a snippet of lyrics, Jukebox could generate relatively coherent music complete with vocals. But the songs Jukebox produced lacked larger musical structures like choruses that repeat and often contained nonsense lyrics.

Google’s AudioLM, detailed for the first time earlier this week, shows more promise, with an uncanny ability to generate piano music given a short snippet of playing. But it hasn’t been open sourced.

Dance Diffusion aims to overcome the limitations of previous open source tools by borrowing technology from image generators such as Stable Diffusion. The system is what’s known as a diffusion model, which generates new data (e.g., songs) by learning how to destroy and recover many existing samples of data. As it’s fed the existing samples — say, the entire Smashing Pumpkins discography — the model gets better at recovering what it destroyed; once trained, it can run that same recovery process on pure noise to create new works in the style of its training data.
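To make the "destroy and recover" idea concrete, here is a minimal, illustrative sketch of the "destroy" half: a clean waveform is progressively corrupted with Gaussian noise according to a schedule, and the training objective is to teach a network to predict and remove that noise. The sine-wave "audio," the linear schedule, and every parameter value below are stand-ins chosen for readability, not Harmonai's actual code.

```python
import numpy as np

# Toy "audio": one second of a 440 Hz sine wave at 16 kHz.
# (Real training uses hundreds of hours of recorded music.)
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
clean_audio = np.sin(2 * np.pi * 440.0 * t)

# A simple linear noise schedule: alpha_bar runs from near 1 (almost clean)
# down to near 0 (almost pure static) over num_steps diffusion steps.
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)
alpha_bars = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)

def destroy(x0, step):
    """Forward ("destroy") process: corrupt the clean signal up to `step`.

    x_step = sqrt(alpha_bar[step]) * x0 + sqrt(1 - alpha_bar[step]) * noise
    """
    noise = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[step]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise, noise

# Early step: still mostly music. Late step: mostly static.
slightly_noisy, _ = destroy(clean_audio, step=50)
nearly_destroyed, target_noise = destroy(clean_audio, step=950)

# Training teaches a neural network to look at `nearly_destroyed` (plus the
# step index) and predict `target_noise`, i.e. to "recover" what was
# destroyed. Repeat that over hundreds of hours of audio, and the trained
# denoiser can be run in reverse from pure static to brand-new clips.
print(np.std(slightly_noisy - clean_audio), np.std(nearly_destroyed))
```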

Kyle Worrall, a Ph.D. student at the University of York in the U.K. studying the musical applications of machine learning, explained the nuances of diffusion systems in an interview with TechCrunch:

“In the training of a diffusion model, training data such as the MAESTRO data set of piano performances is ‘destroyed’ and ‘recovered,’ and the model improves at performing these tasks as it works its way through the training data,” he said via email. “Eventually the trained model can take noise and turn that into music similar to the training data (i.e., piano performances in MAESTRO’s case). Users can then use the trained model to do one of three tasks: Generate new audio, regenerate existing audio that the user chooses or interpolate between two input tracks.”

It’s not the most intuitive idea. But as DALL-E 2, Stable Diffusion and other such systems have shown, the results can be remarkably realistic.
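The three ways Worrall says a trained model gets used (generating new audio, regenerating an existing clip, and interpolating between two tracks) all come down to how far along the "destroy" process the recovery starts. The sketch below shows that pattern with a dummy denoiser standing in for a real trained network; the noise schedule, the 50/50 blend, and the interpolate-by-blending approach are illustrative assumptions rather than Harmonai's actual sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

# The same kind of linear noise schedule as in the previous sketch.
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def dummy_denoiser(x_t, step):
    """Stand-in for a trained network that predicts the noise in x_t.

    A real model (for example, one fine-tuned on a single artist) is a large
    neural network; predicting zeros here just keeps the sketch runnable so
    the three usage patterns below can be read end to end.
    """
    return np.zeros_like(x_t)

def noise_to(x0, step):
    """Forward ("destroy") process: noise a clean clip up to a given step."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[step]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def recover(x_t, start_step):
    """Reverse ("recover") process: walk from start_step back down to 0."""
    x = x_t
    for step in range(start_step, -1, -1):
        eps = dummy_denoiser(x, step)
        a, a_bar = alphas[step], alpha_bars[step]
        x = (x - betas[step] / np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a)
        if step > 0:  # inject a little fresh noise except on the final step
            x = x + np.sqrt(betas[step]) * rng.standard_normal(x.shape)
    return x

# Two toy one-second "tracks" standing in for real recordings.
n = 16_000
t = np.arange(n) / n
track_a = np.sin(2 * np.pi * 440.0 * t)
track_b = np.sin(2 * np.pi * 220.0 * t)

# 1) Generate new audio: start from pure noise and recover all the way.
generated = recover(rng.standard_normal(n), num_steps - 1)

# 2) Regenerate existing audio: partially destroy a clip, then recover it,
#    which yields a variation that stays close to the original.
regenerated = recover(noise_to(track_a, 500), 500)

# 3) Interpolate: partially destroy two clips, blend them, then recover.
blended = 0.5 * noise_to(track_a, 500) + 0.5 * noise_to(track_b, 500)
interpolated = recover(blended, 500)

print(generated.shape, regenerated.shape, interpolated.shape)
```

The starting step is the knob: beginning from pure noise yields something entirely new, while beginning partway through the schedule keeps more of the original clip intact.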

Artist perspective

Jona Bechtolt of YACHT was impressed by what Dance Diffusion can create. “Our initial reaction was like, ‘Okay, this is a leap forward from where we were at before with raw audio,’” Bechtolt told TechCrunch.

Unlike popular image-generating systems, Dance Diffusion is somewhat limited in what it can create — at least for the time being. While it can be fine-tuned on a particular artist, genre or even instrument, the system isn’t as general as Jukebox. The handful of Dance Diffusion models available — a hodgepodge from Harmonai and early adopters on the official Discord server, including models fine-tuned with clips from Billy Joel, The Beatles, Daft Punk and musician Jonathan Mann’s Song A Day project — stay within their respective lanes. That is to say, the Jonathan Mann model always generates songs in Mann’s musical style.

And Dance Diffusion-generated music won’t fool anyone today. While the system can “style transfer” songs by applying the style of one artist to a song by another, essentially creating covers, it can’t generate clips longer than a few seconds, and the lyrics it produces are gibberish. That’s the result of technical hurdles Harmonai has yet to overcome, says Nicolas Martel, a self-taught game developer and member of the Harmonai Discord.

“The model is only trained on short 1.5-second samples at a time so it can’t learn or reason about long-term structure,” Martel told TechCrunch. “The authors seem to be saying this isn’t a problem, but in my experience — and logically anyway — that hasn’t been very true.”
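Martel's point is easier to see with the pre-processing step spelled out: before training, long recordings are sliced into short fixed-length windows, and each window is all the model ever sees at once, so structure that spans verses or choruses never fits inside a single training example. The 48 kHz sample rate and the exact window length below are illustrative assumptions, not Harmonai's actual settings.

```python
import numpy as np

def chunk_waveform(waveform, sample_rate=48_000, chunk_seconds=1.5):
    """Slice a long recording into fixed-length training windows.

    The sample rate and window length are illustrative assumptions; the key
    point is that each training example is one short, isolated slice, so a
    song's long-range structure never fits inside a single window.
    """
    chunk_len = int(sample_rate * chunk_seconds)
    usable = (len(waveform) // chunk_len) * chunk_len  # drop the ragged tail
    return waveform[:usable].reshape(-1, chunk_len)

# Stand-in for a three-minute song loaded from disk.
song = np.random.default_rng(0).standard_normal(48_000 * 180)
chunks = chunk_waveform(song)
print(chunks.shape)  # (120, 72000): 120 windows of 1.5 seconds each
```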

YACHT’s Evans and Bechtolt are concerned about the ethical implications of AI — they are working artists, after all — but they observe that these “style transfers” are already part of the natural creative process.

[Image: Dance Diffusion art]

“That’s something that artists are already doing in the studio in a much more informal and sloppy way,” Evans said. “You sit down to write a song and you’re like, I want a Fall bass line and a B-52’s melody, and I want it to sound like it came from London in 1977.”

But Evans isn’t interested in writing the dark, post-punk rendition of “Love Shack.” Rather, she thinks that interesting music comes from experimentation in the studio — even if you take inspiration from the B-52’s, your final product may not bear the signs of those influences.

“In trying to achieve that, you fail,” Evans told TechCrunch. “One of the things that attracted us to machine learning tools and AI art was the ways in which it was failing, because these models aren’t perfect. They’re just guessing at what we want.”

Evans describes artists as “the ultimate beta testers,” using tools in ways they were never intended to be used in order to create something new.

“Oftentimes, the output can be really weird and damaged and upsetting, or it can sound really strange and novel, and that failure is delightful,” Evans said.
