Text is cool, but audio & video unlock even more value

In the past weeks, the new category of generative AI has justifiably been put in the spotlight. The ability of unsupervised AI/ML models to generate new types of content is unprecedented and marks a significant leap for the field. Most articles have focused on the initial use cases of text and image generation (GPT-3, DALL-E 2, Imagen, Stable Diffusion). Less has been said about the effects of ML on audio and video.

Is the hype of Generative AI justified?

The popularity of generative AI in the startup world is driven by both the new AI models announced in recent weeks and the success of the first wave of companies built on top of AI models:

Jasper announced $100M in revenue and a $1.5B valuation for their AI marketing tool - all done in 18 months. They built a text-to-text application that generates marketing content for text-based channels (sales emails, marketing copy) on top of the GPT-3 engine. This shows that a unicorn company can be built without owning the underlying “blackbox”.
Stability AI raised $101M for their AI platform for image generation. This platform will allow the next wave of companies to leverage their model for text-to-image generation. More companies like Jasper that build out a vertical use-case application will emerge.

VCs are declaring these technologies the next big technological breakthrough that will unlock new opportunities - claims which I think are completely warranted. See the references at the end to read the manifestos published by Bessemer, Sequoia, NFX and Elad Gil with their predictions. We’ll see many more successes pop up around this in the next 5 years.

My top 7 bets on companies that are well positioned to become the “Jasper for audio/video” and realize similar success:

1. WellSaid (text-to-audio)

Founded in 2018 | Status: Series A | Raised $10M

WellSaid Labs is working on text-to-audio conversion for B2B voiceover use. They offer a catalog of unique voices that companies can use to create new audio pieces. Customers can also upload their own custom voices and use an API to access these capabilities programmatically. This tech has the potential to disrupt the traditional voiceover and dubbing markets.

Examples of WellSaid voices that are generated automatically

2. Resemble.AI (text-to-audio)

Founded in 2018 | Status: Seed | Raised $4M

Resemble are focusing their text-to-audio capabilities on capturing the full range of human emotion (their whispering voice is pretty good!). Another product focus is localization - record a single voice and hear it played back in any language through their neural text-to-speech engine. One of their interesting projects is the voiceover in Netflix's documentary, The Andy Warhol Diaries.

Examples of Resemble.AI voices that are generated automatically

3. Runway (text-to-video)

Founded in 2018 | Status: Series B | Raised $45M

Runway is an interesting ML company working on various content creation challenges in the text, photo and video world. In the video world, they are working on a text-to-video model (waitlist) and a tool that lets users mask objects in videos. Masking is a great example of a day-to-day task for video creators that can easily be automated.

Example of masking an object in a video - extracting and pasting on a new background

4. HourOne (text-to-video)

Founded in 2019 | Status: Series A | Raised $25M

HourOne is building a text-to-video model using virtual presenters, like a newscaster in a studio reading out the news. This replaces the need for humans to record video while producing a similar effect to presenter-led videos. This can create a lot of value for training courses and news publishers.

Example of HourOne's dynamically created videos

5. Synthesia (text-to-video)

Founded in 2017 | Status: Series B | Raised $66M

Likely one of the largest players in the text-to-video space, Synthesia are challenging traditional video production with their AI content generation platform. Their main use cases are training videos, how-to’s and marketing videos. They offer multiple avatars and voices and promise results within minutes, not weeks.

Behind the scenes on how Synthesia works

6. AudioLabs (audio-to-video)

Founded in 2021 | Status: Pre-Seed | Undisclosed funding

AudioLabs is working on audio-to-video models. What’s interesting about this company is that they don’t make you start with a blank canvas. Their application connects to existing libraries of audio content such as podcasts & audiobooks. Each video is dynamically created using models such as DALL-E 2 and Meta’s Make-A-Video to generate unique videos optimized for algo-based recommendations. This is especially useful in the “TikTok era” as enterprises look to publish short-form content at scale.


7. Descript (text-to-audio)

Founded in 2017 | Status: Series B | Raised $50M

Descript is one of the most interesting companies in this space. Their product was among the first to make editing audio as easy as deleting a word. It has since evolved to cover both video and audio editing, with a dash of AI magic. One of their key features is text-to-audio “overdub” for filling in words or fixing mistakes made in podcasts. Their products are mostly used for post-production of audio and video content in e-learning and podcasting.

The broader use cases of audio & video:

While most articles discuss how text and image models will change the way we work, the section below outlines business use cases around audio and video that will become more developed:

#1) Text-to-Audio: Synthetic, Editing, Dubbing & Highlights

The world is changing as more of our day-to-day activities involve audio (smart homes, podcasting, social audio). Creators, enterprises and media companies are all producing audio at a high pace, with double-digit YoY growth in podcasting, audiobooks, digital audio and the still-nascent social audio. This demand will bring new solutions within:

Synthetic audio - generating new voice snippets is coming along, with companies like WellSaid, HourOne and Descript leading the way. These ML models can be trained to read out text in a specific pre-learned voice or in a custom voice, given enough training data. These capabilities will enable new personalized listening experiences. Millions of new podcasts can be generated for each long-tail user segment. Listen to the recent AI-generated podcast with Joe Rogan and Steve Jobs to see the potential of these technologies in action.
Podcast editing - one of the leading services around podcasting today is editing. Podcast editors / producers remove filler words, trim the “boring” parts and bring forward the highlights. Although this is 99% manual today, we will see most of these roles automated using AI in the next couple of years. Producers will shift from editing clips to overseeing the automation and keeping the content aligned with the brand. Vertical AI models will be trained on the best performing podcast episodes to learn how to do this efficiently.
Audio highlights - Hollywood movie trailers help viewers quickly understand what a movie is about, and GPT-3 models are used to summarize long blobs of text. As audio grows, we’ll see more demand for audio highlight tools that can condense long-form audio content into a short version for listeners on the go. Who has time to listen to the latest three-hour Joe Rogan episode? Give me the 1-minute CliffsNotes.


#2) Audio-to-Video: Synthetic video, B-Roll, Short-form TikTok

While audio is growing, issues around discovery and monetization have made video relevant as a medium to showcase voice. You can see this in the amount of voice-first content, such as podcast clips, appearing as video on YouTube and TikTok. As audio becomes more fragmented, a popular tactic among leading audio marketers is to create video derivatives designed for each platform. These come in formats such as snippets & highlights, which are published on apps such as YouTube and TikTok. Oftentimes, these derivatives reach more users than the original piece. But creating videos from audio-first content is challenging:

Synthetic video - today only 2% of podcasts have video to accompany the audio. Without video, audio publishers can’t distribute their content on video platforms. Then there are the other audio-only formats such as audiobooks, digital audio and social audio. We will see more solutions working on generating synthetic video, utilizing frameworks such as Meta’s Make-A-Video. Dynamic videos can be generated to accompany audio inputs. This changes the game - instead of pulling old stored content for the end user, we can generate new content at the edge of the network. For example, a one-hour podcast or six-hour audiobook could be converted into a long-form feature film for YouTube.
Short-form video - as attention spans get shorter and the success of TikTok’s algo-based model is replicated by other social media platforms, we will see short-form become more dominant. I highly recommend reading The End of Social Media and the Rise of Recommendation Media by Michael Mignano (Lightspeed, ex-Spotify, ex-Anchor). Solutions that create short-form content, especially ones that can convert existing “anchor” content, will be in high demand. Many incumbent content marketing teams will need to adapt to the TikTokification of the world or risk going extinct. AI models will be able to study the patterns of the most successful short-form videos and generate new pieces of content that will see similar success.

#3) Text-to-Video: Editing, Masking, Repurposing, Multimodal

From marketing videos to webinars and training videos, the growth of video production in recent years has increased demand for services around production, editing and repurposing. Many solutions have launched to automate parts of the process. The majority solve this without ML, relying instead on templates and other rule-based processes. Here is how we see AI changing the game in the video world:

Video Production: creating new video assets with AI is a lucrative use case. Existing platforms let users input text and generate presenter-style videos of an avatar saying the text. As the quality increases, these will be used by content teams to generate sales, marketing and educational content at scale.
Video Editing: producing a 20-minute edited video for YouTube will likely take an experienced editor up to 6 hours of sifting through footage and putting together the ideal storyboard. This is a task that ML can solve given a model trained on a good data set. The producer’s job will shift to monitoring and adjusting the story points, potentially reducing the time spent from 360 minutes down to 15 minutes. The result will be highly effective videos at scale and a general increase in quality.
Repurposing for each social media channel: as the content landscape gets more fragmented, we’ll see each channel prioritize content created in its own format. Content marketing teams will need to optimize content at the channel level. AI solutions will be built to assist with this. Early players in this space include Munch, a platform to repurpose long-form YouTube videos into short-form.

There’s a lot to be optimistic about

Generative AI is one of those macro tech shifts that really is 10x better than the alternatives. Unlike the zero-to-one solutions we’ve seen in tech so far, this wave of ML can unlock zero-to-ten solutions that don’t just make the process a bit better, but orders of magnitude better. Over the next few years we’ll see these companies mature and disrupt existing solutions that don’t have these AI models embedded. Furthermore, these models are built to improve over time, giving built-in network effects to the companies that create them.

References:

The rise of synthetic media: Bessemer Venture Partners
Generative Tech Begins: NFX
Generative AI: A Creative New World: Sequoia
AI: Startup Vs Incumbent Value: Elad Gil
The End of Social Media and the Rise of Recommendation Media: Michael Mignano

Generative AI: Jasper cracked $100M revenue with text. Who will do the same with video? 🎙️📹