The Dawn of AI-Assisted Music Creation: Embracing the Future of Generative AI

The rapid advancement of generative artificial intelligence, particularly with the emergence of services like Udio, is poised to revolutionize the music production landscape. As these technologies become more accessible and sophisticated, they have the potential to democratize music production and reshape the role of human creators.

I am late submitting this article. When I was originally asked to write something for Sounding Future, I was very excited. I like contributing to the creation of knowledge and understanding. After all, I am a teacher at a research-intensive university, and I have been one for pretty much all my life. And now I'm late submitting my work.

And yes, of course, there are the usual suspects: an extremely demanding day job, the fact that I foolishly decided to run a YouTube channel on spatial audio, for which I am forcing myself to create a new video every week. And then there are all the personal things that need to be done and organized. Like probably all of you out there, I do a lot of stuff, and sometimes things fall through the cracks.

But this time the reason for me being late is different, and it is as profound as it can get. Things are happening right now that will change music forever in ways we have not even begun to fully understand yet. And on top of everything, they are happening at a speed we have never seen before. These developments will have a massive impact on everything we do, professionally as well as personally.

I am, of course, talking about the rise of generative artificial intelligence.

Udio is Here

And so here it goes: Over the past couple of days, I could not get myself to disconnect from udio.com. For those of you who have not heard of this service yet, Udio is a new music creation platform that uses generative artificial intelligence to produce music with very little human intervention.

Udio is not the first system of its kind. Not that long ago, another service called Suno made the rounds, and it works reasonably well. Experimental AI music systems have been around for decades at this point. And let's not forget all the systems that are currently in development by the big AI companies and organizations, including OpenAI, Stability AI, and the like.

But Udio is different. Not only was it founded by a group of extremely capable engineers and music professionals and backed by one of the most influential venture capital firms, but it is also extremely good at what it does. The quality of music Udio produces, while not perfect, is something I did not expect to see in the generative artificial intelligence space for at least another 5 years.

So, for the last couple of days, my evenings have looked pretty much like this: I sit down at my computer to do a final check on my email. I then decide to open up Udio to play around for just a few minutes - and about 5 hours later, with a little bit of remixing and remastering (more about that later), I end up with a track that sounds better and more professional than anything I have ever created before.

It is at this point that I should probably talk a little bit about my background because, while I would consider myself a passionate "bedroom producer", I am not a professional in the traditional sense.

My expertise is more in digital media in general. I started as an applied mathematician, then turned into a computer scientist, and from there into a game design educator and finally, through game audio, into what some consider a spatial audio expert. Somewhere on my CV, you will even find research publications about artificial intelligence, which I wrote during my computer science period almost 30 years ago.

In other words, I know a lot of things, but my practical skills in music production are most certainly no match for the capabilities of professional producers and engineers who have developed their craft for years, if not decades.

But this is exactly where it gets interesting. At my skill level, Udio is a true game-changer. It allows me to create songs that are "good and professional sounding enough" for my taste. And it allows me to do that as a reasonably skilled hobbyist producer on the side at the end of a busy day.

Disruptive Innovation

One could say that Udio is the first service that is truly capable of fully democratizing music production. Sure, it is still lacking in quality, and it is therefore no match for professionally produced music. But at a time when most people listen to songs on their mobile devices, quite often even without headphones, it is important to remember that it simply is good enough for most music consumers.

In his pivotal work "The Innovator's Dilemma", Harvard Business School scholar Clayton Christensen describes the concept of disruptive innovation as "a process by which a product or service takes root initially in simple applications at the bottom of a market and then relentlessly moves up market, eventually displacing established competitors."

Udio is the perfect example of how generative artificial intelligence is taking root at the bottom of the market in the music production space. But this is only the beginning. There is very little doubt that these innovations will move up the market as relentlessly as Christensen predicts, eventually displacing competitors and even current production practices.

So, with that in mind, I decided to scrap my original plan to write about spatial audio and talk about generative artificial intelligence in the music production space instead. I plan to do that in a way that is easily accessible to anyone regardless of their prior knowledge of artificial intelligence or even computing in general. Because, while the underlying math is complex, the basic principles of these systems are not.

Neural Networks

At the heart of the newly emerging artificial intelligence technology lie the principles of neural networks. The basics are extremely simple. A neural network aims to mimic the functionality of our brain, in particular how neurons interact with other neurons to process information inside our heads. Simply put, a neuron in a neural network is a computational object that receives an input and distributes that input to other neurons in the network.

It does that by applying very simple computations, passing on varying amounts of the incoming information to the neurons it is connected to. How much information is passed on is defined by the parameters of the neuron.
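To make this a little more tangible, here is a minimal sketch of a single neuron in Python. All numbers are invented, and real systems are vastly more elaborate; the point is simply to show how inputs are weighted by parameters and passed on.

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: weight each input, sum, then squash.

    The weights and the bias are the neuron's parameters; they determine
    how much of the incoming information is passed on.
    """
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation

# Example: three inputs, hand-picked parameters
print(neuron([0.5, -1.0, 2.0], weights=[0.8, 0.2, -0.5], bias=0.1))
```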

Information that is entered into the neural network is first transformed into numbers, which are then passed on to the neurons that form the input layer of the network. The information is then channeled through the network, eventually producing an output at the neurons that form the output layer.

The quality of the output depends on the parameters of the neurons. If the parameters are chosen randomly, the output will most likely make very little sense at all. However, if the parameters are chosen carefully, the expectation is that the neural network should process the input in a meaningful way.

Training, then, is the process by which these parameters are optimized for the particular task the neural network is supposed to solve. Simply put, during training the network is given many inputs along with the expected outputs, and the parameters are adjusted such that the next time one of these inputs is presented to the network, it will produce something close to the expected output.
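Again as a toy sketch rather than anything resembling a production system, here is what such a training loop can look like: a single neuron with two parameters learns example input/output pairs by nudging its parameters a little after every guess.

```python
import random

# Toy training data: inputs paired with expected outputs (here y = 2x - 1)
data = [(x, 2 * x - 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

# One neuron with two parameters, initialized randomly
w, b = random.uniform(-1, 1), random.uniform(-1, 1)
learning_rate = 0.05

for step in range(2000):
    x, target = random.choice(data)
    prediction = w * x + b           # forward pass through the "network"
    error = prediction - target     # how far off the expected output?
    w -= learning_rate * error * x  # nudge each parameter so the same
    b -= learning_rate * error      # input produces a closer output

print(f"learned w={w:.3f}, b={b:.3f}  (expected w=2, b=-1)")
```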

Different terminologies are used for this process. Depending on how the process is set up, it will be called machine learning, deep learning, or something similar. For this discussion, we do not need to understand the nuances between these terminologies. It is sufficient to understand the basic functionality of a neural network and how it is trained.

There is one important aspect that I need to add here. During the training process, the parameters of the network are adjusted, but the data used for this purpose is never stored inside the network. When trained correctly, the network will be able to take an input and produce a matching output, but neither the input nor the output is stored within the parameters of the network.

But neural networks have been around for a very long time, so what is the big deal and why did they suddenly become so important?

The answer to that question is scale, and how much computers have evolved over the decades. When I worked with neural networks almost 30 years ago, we could only simulate networks with a few dozen parameters. What prevented us from scaling up were the mathematical computations needed to adjust the parameters during training: computers back then were simply not fast enough to handle larger numbers of parameters.

This has changed dramatically. Modern neural networks work with billions and sometimes trillions of parameters. ChatGPT, for example, is reported to run on a network with 1.7 trillion parameters. By comparison, our current understanding is that the human brain needs approximately 700 trillion parameters to do its work. In other words, at roughly a quarter of a percent (1.7 / 700 ≈ 0.24%), the complexity of computational neural networks is starting to enter the percentage range of the complexity of the human brain.

At this point, I need to add that the story is a little bit more complicated than just comparing the number of parameters. The human brain also has a significantly higher complexity in terms of how neurons are interconnected. And we also need to be aware that we do not yet fully understand how exactly the brain processes information. However, it is still remarkable that computational neural networks have achieved that level of complexity.

Sushi – Japan + Germany = Bratwurst

Maybe the biggest driver of the continuously accelerating development of generative artificial intelligence is the discovery of Generative Pre-trained Transformers, or GPTs. I am purposefully using the term "discovery" here because, while we have a very good understanding of how this technology works, we know very little about why it works. It just happens to work, and it does so extremely well.

The technology itself is very mathematical, and explaining it is not possible within the context of this article. Mathematics YouTuber 3blue1brown recently published an extremely well-made introductory video on the functionality of GPTs. I highly recommend that anybody interested in learning more watch his video.

Video: "But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning" (3blue1brown)

In essence, a GPT is an extremely large neural network, usually containing many billions of parameters, with a very particular network structure. It functions by converting content into contextual information. It does so by first breaking down any content it receives into small units called tokens, and then converting these tokens into mathematical representations that somehow encapsulate their context.
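As a heavily simplified illustration of that first step, here is a Python sketch. Real tokenizers split text into sub-word pieces, and real representations have thousands of learned dimensions; the vocabulary and numbers below are entirely invented.

```python
# Invented toy vocabulary: each token maps to a 3-dimensional vector
embeddings = {
    "the":   [0.1, 0.0, 0.2],
    "music": [0.9, 0.4, 0.1],
    "plays": [0.3, 0.8, 0.5],
}

def tokenize(text):
    """Break content into tokens; real tokenizers also split sub-words."""
    return text.lower().split()

def embed(text):
    """Convert each token into its mathematical representation."""
    return [embeddings[token] for token in tokenize(text)]

print(embed("the music plays"))
```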

In the above video, 3blue1brown gives an example that I think explains this best. As it turns out, if you take the mathematical description of the term "Sushi" within a well-trained GPT, subtract the mathematical description of "Japan", and add the mathematical description of "Germany", you end up with a mathematical description of something that is astonishingly close to the mathematical description of "Bratwurst".

The GPT has somehow learned how these concepts are related to each other. It is important to reiterate that this is unexpected, and we currently do not know why a GPT can do that. We know how it does it, but not why.
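For readers who want to see the arithmetic, here is a toy reconstruction in Python. The three-dimensional vectors are hand-made for this example; actual GPT representations are learned and have thousands of dimensions, but the "subtract Japan, add Germany" logic is the same.

```python
import math

# Invented toy vectors; real embeddings are learned, not hand-written
vec = {
    "sushi":     [0.9, 0.8, 0.1],
    "japan":     [0.1, 0.9, 0.0],
    "germany":   [0.1, 0.0, 0.9],
    "bratwurst": [0.9, 0.0, 1.0],
    "violin":    [0.2, 0.3, 0.4],
}

def add_sub(a, b, c):
    """Compute a - b + c componentwise."""
    return [x - y + z for x, y, z in zip(a, b, c)]

def cosine(a, b):
    """Similarity of direction between two vectors (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

query = add_sub(vec["sushi"], vec["japan"], vec["germany"])

# Find the known word whose vector points in the most similar direction
best = max(vec, key=lambda word: cosine(vec[word], query))
print(best)  # -> "bratwurst" with these made-up numbers
```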

GPTs can be thought of as advanced pattern recognition systems. In their most common application, they are used to build advanced chatbots such as ChatGPT or Claude. The way this works is that the GPT processes the input prompt along with what is called a system prompt, which provides additional instructions that are invisible to the user, and then simply tries to identify the most likely next word (or, more precisely, the next token).

It then does that repeatedly, word by word, and by doing so produces cohesive text and meaning. And yes, we do not exactly know why it can do this, and this is as astonishing as it is mysterious.
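Schematically, that word-by-word loop looks like the sketch below. The most_likely_next_word stub stands in for the entire GPT, and the canned continuation is a placeholder; nothing here resembles a real chatbot's internals.

```python
SYSTEM_PROMPT = "You are a helpful assistant."  # extra instructions, invisible to the user
USER_PROMPT = "How do chatbots generate text?"

# Placeholder for the GPT: a real model scores every token in its vocabulary
# and picks (or samples) the most likely one; this canned stub only
# illustrates the loop structure.
CANNED = iter(["Chatbots", "predict", "one", "word", "at", "a", "time."])

def most_likely_next_word(text_so_far):
    return next(CANNED, None)

text = SYSTEM_PROMPT + " " + USER_PROMPT
answer = []
while (word := most_likely_next_word(text)) is not None:
    answer.append(word)  # keep the prediction...
    text += " " + word   # ...and feed the longer text back into the model

print(" ".join(answer))
```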

Because GPTs can extract context out of content and thereby interpret meaning, they have applications in many creative disciplines. They are the underlying technology that drives many generative AI systems, including systems for image creation, video production, game design, animation, and much more. And, as seen with Udio, they can also be used in music production.

The Emerging Era of Content Remixing

Which brings us back to where we started. Music is about to change forever in ways we cannot even comprehend yet, but what does this mean for people currently working in this field?

The unfortunate answer is that because we do not fully understand why this technology works in the first place, we also do not really have a good understanding of where its limits are and how far it can potentially develop. There is one thing, however, that is becoming increasingly apparent: regardless of how well these systems work in the future, humans will always need to make their work stand out among the vastness of AI-generated artifacts.

The more access everybody has to AI-assisted production workflows, such as the one provided by Udio, the more valuable people with advanced skills will become. And these advanced skills will require nuanced and critical thinking. Future professionals will need to understand where the AI did well and where it failed.

And, most importantly, they will need to understand how something the AI failed to do can be fixed with traditional methods. I recently published a YouTube video where I called this the "return of the true artist".

The way I use Udio is different from how most people use it right now. I do not simply generate a music track and then post it on social media. Instead, I develop a musical idea with Udio as my AI collaborator. When Udio and I are done, I download the audio, bring it into my DAW, and remix it to my liking to achieve the final result.

In its current iteration, Udio still has many issues. It usually washes out transients, tends to be inconsistent in developing a musical idea, and sometimes generates annoying digital artifacts, just to name a few. But all these issues can be dealt with through post-production and post-processing.

And this is true not only for AI-generated music. The same also applies to everything else in the generative AI space, whether it be text, images, video, music, or other forms of media that can be created through generative artificial intelligence. The more traditional production workflows are automated through artificial intelligence, the more we will transition into an era in which most of our work consists of remixing and remastering content that has been generated by AI.

Generative AI will fundamentally reshape the landscape of music production and many other creative fields. The rise of services like Udio points to a future where AI becomes an integral part of the creative process, democratizing access to powerful tools and enabling new forms of expression.

However, it's crucial to recognize that the role of human creators will not diminish; rather, it will evolve. As AI takes on more of the heavy lifting, the true value of human artistry will lie in our ability to curate, refine, and remix the raw materials generated by these systems. We will need to develop new skills and adapt to new workflows, but in doing so, we will also unlock unprecedented possibilities for creative expression.

The path ahead is uncertain, but one thing is clear: the future of music will be shaped by the interplay between human ingenuity and artificial intelligence.

And that future is already unfolding before our eyes.

Michael G Wagner

I am a Professor of Digital Media and Head of the Digital Media Department at the Antoinette Westphal College of Media Arts & Design at Drexel University. I currently also serve as the Program Director of the Digital Media PhD degree program as well as the eSports undergraduate minor. Before my affiliation with Drexel, I held academic teaching, research, and management positions at Vienna University of Technology, Austria, at the Department of Computer Science at Arizona State University, at Danube University Krems, Austria, and the KPH Vienna/Krems, Austria, for which I served as Rector. My work focuses on the theory and practice of the educational use of digital media, immersive audio, computer games, and Blockchain technology.
