New Project, Riffusion, Applies AI Image Generation Technology To Music, Makes Disturbing Sample Fodder


By now, most readers will be aware of recent innovations in the area of image generation with artificial intelligence (AI).

With tools like Stable Diffusion, you can generate original images by supplying text prompts, like “photograph of an astronaut riding a horse”, as shown in the upper right image.

Stable Diffusion does this by starting with random noise and refining it step by step. Rather than looking images up in an index, it relies on a neural network that has been trained on a very large set of captioned images. At each step, the network estimates which parts of the current image are noise, given the prompt text, and a little of that noise is removed. With each iteration, the image’s qualities get closer and closer to those of images that match the text prompt.

This is artificial intelligence in a narrow sense. The program is not intentionally drawing a picture of an astronaut on a horse; it’s generating an image with qualities similar to those of images it has learned to associate with ‘astronaut’ and ‘horse’.

This is an important difference, and helps explain why these AI images can be amazing, but are very likely to have some weirdness to them.

Did you notice that the horse only has three legs?
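
The start-from-noise, refine-in-steps idea described above can be sketched as a toy loop. This is only an illustration, not how Stable Diffusion is implemented: a real diffusion model uses a trained neural network to predict the noise to remove, guided by the prompt. Here a hypothetical `toy_denoise_step` function stands in for that network and simply nudges the values toward a fixed `target` list, which is an assumption made for the sake of the example.

```python
import random


def toy_denoise_step(image, target, strength=0.2):
    """Move each value a fraction of the way toward the target.

    A real diffusion model has no fixed target: a trained network predicts
    the noise to remove, guided by the text prompt. The fixed target here
    is a stand-in (an assumption) for that learned guidance.
    """
    return [x + strength * (t - x) for x, t in zip(image, target)]


def distance(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def generate(target, steps=30, seed=0):
    """Start from random noise and refine it over several steps."""
    rng = random.Random(seed)
    image = [rng.random() for _ in target]  # pure noise to begin with
    history = [distance(image, target)]
    for _ in range(steps):
        image = toy_denoise_step(image, target)
        history.append(distance(image, target))
    return image, history


if __name__ == "__main__":
    # Hypothetical 8-"pixel" image standing in for what the prompt describes.
    target = [0.0, 0.2, 0.9, 1.0, 1.0, 0.9, 0.2, 0.0]
    image, history = generate(target)
    print(f"distance at start: {history[0]:.4f}")
    print(f"distance at end:   {history[-1]:.4f}")
```

The point of the sketch is that no single step draws anything. Each pass only makes the result slightly less noisy, which is part of why the output can land close to the prompt and still be wrong in the details.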

Riffusion is a new project that builds on the success of recent AI image generation work, but applies it to sound.

Riffusion works by fine-tuning the Stable Diffusion model on a collection of spectrograms, which are images that plot the frequency content of audio over time. Each spectrogram is paired with text describing the style of music it captures.

Once it is trained on this body of spectrograms, the model can use the same approach as Stable Diffusion, iterating on noise to arrive at a spectrogram image with qualities similar to those of spectrograms that match your text prompt.

If you ask for ‘swing jazz trumpet’, it will generate a spectrogram that is similar to spectrograms that closely match your prompt. The application then converts the spectrogram to audio, so you can listen to the result.
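
To make the spectrogram-to-audio step more concrete, here is a minimal sketch that treats each column of a spectrogram as a list of loudness values, one per frequency, and renders it as a sum of sine waves. This is plain additive synthesis swapped in for illustration, not Riffusion's actual conversion. A spectrogram image does not store phase, so real systems have to estimate it, while this sketch just assumes every sine runs continuously from time zero. The function names, the 100 Hz bin spacing and the tiny two-column spectrogram are all hypothetical.

```python
import math


def column_to_audio(magnitudes, bin_hz, sample_rate, n_samples, start_sample=0):
    """Render one spectrogram column as a sum of sine waves.

    magnitudes[k] is the loudness of the frequency k * bin_hz. Phase is not
    stored in a spectrogram image, so this sketch assumes each sine is
    continuous from time zero (a simplification).
    """
    samples = []
    for n in range(start_sample, start_sample + n_samples):
        t = n / sample_rate
        value = sum(
            mag * math.sin(2 * math.pi * k * bin_hz * t)
            for k, mag in enumerate(magnitudes)
            if mag > 0
        )
        samples.append(value)
    return samples


def spectrogram_to_audio(spectrogram, bin_hz, sample_rate, samples_per_column):
    """Concatenate the audio rendered from each column (time slice)."""
    audio = []
    for i, column in enumerate(spectrogram):
        audio.extend(
            column_to_audio(
                column, bin_hz, sample_rate, samples_per_column,
                start_sample=i * samples_per_column,
            )
        )
    return audio


if __name__ == "__main__":
    # Hypothetical 2-column spectrogram with 4 frequency bins, 100 Hz apart:
    # a 100 Hz tone followed by a 300 Hz tone.
    spectrogram = [
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    audio = spectrogram_to_audio(
        spectrogram, bin_hz=100, sample_rate=8000, samples_per_column=800
    )
    print(len(audio), "samples")
```

Because the phase is guessed rather than known, audio rebuilt this way tends to sound rougher than the recording a spectrogram was made from.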

The results are currently crude, but they demonstrate that the process does produce original audio that matches the text prompts. The process is limited by a small training set of spectrograms, compared to the 2.3 billion images used to train Stable Diffusion. It is also limited by the resolution of the spectrograms, which gives the resulting audio a lo-fi quality.
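
A bit of arithmetic shows why image resolution caps audio quality. The sketch below assumes a spectrogram with a linear frequency axis, and the figures used (a 512 by 512 pixel image covering 5 seconds of audio up to 10 kHz) are illustrative assumptions rather than Riffusion's published settings. Real spectrograms often use a perceptual frequency scale instead, which changes the per-row numbers but not the basic trade-off.

```python
def spectrogram_resolution(width_px, height_px, clip_seconds, max_hz):
    """Return (seconds per column, Hz per row) for a linear-frequency image."""
    seconds_per_column = clip_seconds / width_px
    hz_per_row = max_hz / height_px
    return seconds_per_column, hz_per_row


if __name__ == "__main__":
    # Assumed figures, for illustration only.
    dt, df = spectrogram_resolution(512, 512, clip_seconds=5.0, max_hz=10_000)
    print(f"each column spans {dt * 1000:.1f} ms of audio")
    print(f"each row spans {df:.1f} Hz")
```

Anything that happens faster than one column, or between two rows, cannot be represented in the image, so it cannot be present in the generated audio either.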

It’s unlikely that this process will result in AI generating anything conventionally musical in the near future, because the process does not account for form: the idea that music is sound intentionally organized in time to create an artistic effect.

The approach shows potential, though. It’s currently up to the task of generating disturbing sample fodder, much as AI image generation, even 6 months ago, was limited to generating creepy lo-fi images. This suggests that, with a much larger training set and higher-resolution spectrograms, AI audio generation could make similar leaps in quality in the next year.

What do you think? Is there a future in music for AI directly synthesizing audio? Share your thoughts in the comments!

via John Lehmkuhl

