Music has in many ways been intertwined with the progress of society and technology. We started with our voices, and slowly built other means to create music: drums, stringed instruments from around the world like the sitar, veena, erhu, and shamisen, and eventually complex instruments like harpsichords, pianos, and harmoniums.
But these were just the beginning. In the roughly 150 years since the advent of electricity, electronic instruments have been the in-thing, and most of our music enjoys the benefits of this technological explosion. Of course, that doesn’t mean we don’t enjoy a fancy pianoforte (if the Interstellar theme below is any indication), but even those recordings are often mastered to make them sound perfect, with no acoustical blips.
Most of the music we listen to is synthesized in DAWs (digital audio workstations) and mastered to sound perfect to our ears. With computational tools, we as a species have mastered the ability to craft exactly the waveform we want. Yet we still don’t have a perfect objective measure of consonance and dissonance.
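The closest things we have are empirical roughness curves fit to listening experiments. As a minimal sketch of what such a proxy looks like, here is the Sethares parameterization of the Plomp-Levelt roughness curve for two sine tones; the constants and the helper function are my own illustration, not something used later in this post.

```python
import numpy as np

def pair_roughness(f1, f2, a1=1.0, a2=1.0):
    """Sethares' fit of the Plomp-Levelt roughness curve for two sine partials.

    Returns an arbitrary-unit roughness score: near zero for unisons and wide
    intervals, peaking when the two partials sit inside the same critical band.
    """
    f_min = min(f1, f2)
    s = 0.24 / (0.021 * f_min + 19.0)   # critical-bandwidth scaling
    d = abs(f2 - f1)
    return a1 * a2 * (np.exp(-3.5 * s * d) - np.exp(-5.75 * s * d))

# Compare a few intervals above A4 = 440 Hz (pure sine tones only).
for name, ratio in [("unison", 1.0), ("minor 2nd", 16 / 15),
                    ("perfect 5th", 3 / 2), ("octave", 2.0)]:
    print(f"{name:12s} roughness = {pair_roughness(440.0, 440.0 * ratio):.3f}")
```

Even this only scores pairs of pure sine tones; real instruments carry whole stacks of partials, and perceived consonance depends on all of them at once, which is part of why no single objective has stuck.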
So when it comes to creating music from scratch, we often still start with the same fundamentals: chords, melodies, and samples. The samples can come from other musicians we jam with, or from songs we listen to every day. Chords and melodies rest on fundamentals shared by Eastern and Western classical music, and we can generally lean on chord progressions we know are consonant.
Once we have these basic components, we can put any filter on them and modulate the waveforms however we want.
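As a deliberately simple illustration of that idea, here is a sketch in Python that renders a consonant I-V-vi-IV progression as raw sine tones and runs it through a basic low-pass filter, a crude stand-in for the kind of processing a DAW does far better. The note choices, envelope, and cutoff frequency are all arbitrary picks for the example.

```python
import numpy as np
from scipy.signal import butter, lfilter
from scipy.io import wavfile

SR = 44100  # sample rate in Hz

def note(freq, dur=1.0, amp=0.2):
    """A single sine-tone 'voice' with a short fade-out to avoid clicks."""
    t = np.linspace(0, dur, int(SR * dur), endpoint=False)
    env = np.minimum(1.0, (dur - t) * 8)          # crude release envelope
    return amp * env * np.sin(2 * np.pi * freq * t)

def chord(freqs, dur=1.0):
    """Sum several voices into one chord."""
    return sum(note(f, dur) for f in freqs)

# A I-V-vi-IV progression in C major, one chord per second (frequencies in Hz).
progression = [
    [261.63, 329.63, 392.00],   # C major
    [392.00, 493.88, 587.33],   # G major
    [220.00, 261.63, 329.63],   # A minor
    [174.61, 220.00, 261.63],   # F major
]
audio = np.concatenate([chord(c) for c in progression])

# "Put a filter on it": a 4th-order Butterworth low-pass at 800 Hz
# to mellow the raw sine tones, standing in for a DAW filter plugin.
b, a = butter(4, 800 / (SR / 2), btype="low")
filtered = lfilter(b, a, audio)

wavfile.write("progression_lofi.wav", SR, (filtered * 32767).astype(np.int16))
```

Swap the Butterworth low-pass for any other filter, or modulate its cutoff over time, and you already have the skeleton of a lofi-style production chain.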
Now, where does that leave us with AI? Well, AI is definitely a different beast. The mathematical cascade of functions that makes up a Transformer, or any deep learning model, was originally meant to emulate the neurons in our brain, but it has created a whole new mechanism for making music.
Neural networks imitate humans very well but do not think like us. To demonstrate, let’s try a text-to-image decoder, a text-to-speech decoder, and a text-to-music decoder with a simple prompt: “a calming beach scene with waves crashing against the shore”.
Agatha speaking ‘a calming beach scene with waves crashing against the shore’ aloud.
‘a calming beach scene with waves crashing against the shore’ as interpreted by a decoder.
Well, it’s not too bad. Many text and language encoders have a deep understanding of concepts like “calming” and “beach”, since those concepts are common in their unsupervised pre-training data, and because each of these models uses a variation of such encoders, it’s not surprising to see them do a good job. Of the three, however, the music is the one that doesn’t quite match what I was looking for. It is chill and repetitive, with a rhythm and a beat like music should have, but it doesn’t have that organic wave-crashing sound I wanted.
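If you want to try the text-to-music leg of this experiment yourself, here is a rough sketch using an open model (MusicGen through the Hugging Face transformers library). The specific checkpoint, sampling settings, and clip length are illustrative choices, not necessarily what produced the clip above.

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

prompt = "a calming beach scene with waves crashing against the shore"

# Small MusicGen checkpoint; larger variants trade speed for quality.
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=[prompt], padding=True, return_tensors="pt")

# ~256 audio tokens is roughly five seconds of audio for this model.
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(
    "beach_scene.wav",
    rate=sampling_rate,
    data=audio_values[0, 0].cpu().numpy(),
)
```

The max_new_tokens argument controls the clip length (roughly 50 tokens per second of audio for this model), so raise it for longer clips at the cost of generation time.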
It’s safe to say that music as a medium is more complex and nuanced, and language is not often the means by which we describe it. That is, data that describes music in detail is not as widespread, so while language encodings are helpful, mapping them to sequential outputs in a spectrogram is not as straightforward.
Musicians create music by “jamming” with each other: going back and forth with different musical samples, using language to direct attention rather than to fully describe a piece of music.
So…with this in mind, how can we leverage the latest in generative AI? Well, first we should try a small project to test whether it can generate useful music. I’ll start by creating some lofi Bollywood music, since I enjoy it, and explore how generative models can help me in the process. I have no idea where this will take me…but I hope that eventually we can develop representations that let musicians “jam” with AI productively, expanding possibilities and creating “better” (as defined by humans) music.
To be continued…..