Riffusion makes music by fine-tuning Stable Diffusion on spectrograms

Riffusion, released on December 15, 2022 by Seth Forsgren and Hayk Martiros as a hobby project, generated music through an unexpected detour: it made pictures. The pair fine-tuned Stable Diffusion v1.5, the open text-to-image model, to produce spectrograms - images in which the horizontal axis is time, the vertical axis is frequency, and the brightness of each pixel shows how loud that frequency is at that moment. A text prompt like “jazzy saxophone solo” yields a spectrogram image, which is then converted back into audio.
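To make the spectrogram idea concrete, here is a minimal sketch of turning a sound into such an image. It assumes NumPy and a synthetic 440 Hz test tone; the function names, frame size, hop length, and 80 dB display range are illustrative choices, not Riffusion's actual settings.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_size=1024, hop=256):
    """Return a (frequency_bins, frames) array of short-time Fourier magnitudes."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    # Slice the signal into overlapping, windowed frames (one row per frame).
    frames = np.stack([
        signal[i * hop : i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    # FFT each frame, then transpose so rows are frequency and columns are time.
    return np.abs(np.fft.rfft(frames, axis=1)).T

def to_image(spec, dynamic_range_db=80.0):
    """Map magnitudes to 8-bit grayscale pixels: brighter means louder."""
    db = 20 * np.log10(spec + 1e-6)
    db = np.clip(db, db.max() - dynamic_range_db, db.max())
    scaled = (db - db.min()) / (db.max() - db.min())
    # Flip vertically so low frequencies sit at the bottom of the picture.
    return (255 * scaled[::-1]).astype(np.uint8)

# Illustrative input: one second of a 440 Hz sine wave at 16 kHz (an assumption).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec = magnitude_spectrogram(tone)
image = to_image(spec)

# The loudest frequency row should sit near 440 Hz.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 1024
```

The image keeps only magnitudes, so going back from picture to sound requires estimating the phase information that was discarded. That reconstruction step is not shown here.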

Because Stable Diffusion was built to generate images, Riffusion inherited its entire toolkit for free: prompting, blending between styles, and image-to-image editing all worked on sound once it was represented as a picture. The web app let users type prompts and hear short clips generated in near real time, smoothly interpolating from one prompt to the next to create continuous, evolving music.
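One common way to blend smoothly between two generated outputs is to interpolate between their starting points in the model's latent space. The sketch below assumes spherical linear interpolation (slerp) and uses random vectors as stand-ins for real latents. The article does not specify Riffusion's exact method, and the names `latent_a` and `latent_b` are hypothetical.

```python
import numpy as np

def slerp(start, end, t):
    """Spherical linear interpolation between two vectors for 0 <= t <= 1."""
    start_unit = start / np.linalg.norm(start)
    end_unit = end / np.linalg.norm(end)
    # Angle between the two directions.
    omega = np.arccos(np.clip(np.dot(start_unit, end_unit), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return (1 - t) * start + t * end
    return (np.sin((1 - t) * omega) * start + np.sin(t * omega) * end) / np.sin(omega)

# Stand-in latents (assumption): in a real pipeline these would be the noise
# tensors that seed each generated spectrogram.
rng = np.random.default_rng(0)
latent_a = rng.standard_normal(64)
latent_b = rng.standard_normal(64)

# Five evenly spaced steps from the first latent to the second.
path = [slerp(latent_a, latent_b, t) for t in np.linspace(0, 1, 5)]
```

Each intermediate vector in `path` would be decoded into its own clip, so consecutive clips differ only slightly and the music appears to evolve rather than jump.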

The project went viral and the founders later incorporated Riffusion as a startup, raising seed funding in 2023. Its clever reuse of an image model became a frequently cited example of how general-purpose generative tools can be redirected to new media.

Why business readers should care: Riffusion showed that a breakthrough in one modality can be cheaply repurposed for another by changing the data representation rather than building a new model from scratch. That kind of leverage - turning sound into images to borrow an image model’s power - is a recurring source of fast, low-cost innovation in generative AI.
