MEMBER INSIGHT
AI in podcasting: Lessons from the ABC’s episode title optimisation pilot
16th October 2024
New ABC research shows that a majority of audiences find and choose podcasts through titles and descriptions. The public broadcaster ran a month-long trial to see whether generative AI could help ABC teams speed up the process of finding titles and descriptions that are the perfect match for their potential audiences.

This insight was originally published by the ABC and is republished with permission.
By Andrew Davies and Craig McCosker
The growth in podcast listening has been a significant trend in recent years.
The ABC has been a podcast pioneer in Australia right from the start of the medium, creating a wide range of high-quality, award-winning shows, and consistently has more podcasts in the top 200 of the Australian Podcast Ranker than any other publisher.
There’s a tremendous diversity to the range of podcasts we publish. Our most popular podcast is Conversations, a show that draws you into the life story of someone you may or may not have heard of. Stuff The British Stole was a co-production with the CBC Podcasts team in Canada. It’s had millions of downloads, won numerous awards and was adapted for ABC and CBC TV. If You’re Listening, a weekly explainer show, is consistently in the top 30 on the ranker.
Read more: Podcasts: How they complement the news
It’s one thing to create great podcasts but another thing to ensure audiences find them. The ABC listen app is a key platform for our audiences to engage with the full range of podcast shows we offer. However, the global podcasting platforms attract millions of potential listeners looking for podcasts. They also have just as many podcasts competing for attention.
It may seem strange when talking about audio, but a listener’s first impression of a show is formed from what they see and read, not what they listen to.
The ABC’s research has found a large section of audiences find and choose podcasts through the titles and descriptions. This applies to people searching for a topic and coming across a show as much as it does to regular followers.

The vast majority of downloads for podcasts come from people listening via their phones while they’re at home or on the bus or train.
Podcast titles and descriptions have only a few seconds to catch a potential listener’s attention and convince them to press play. Importantly, it’s not only people we are trying to get the attention of — podcast search and recommendation algorithms analyse the titles and descriptions to decide whether to present the episode to potential listeners.
Podcast titles and descriptions are a lever we can pull to help improve the discovery and recommendation of our podcasts. There has always been a big focus on the quality of podcast titles and descriptions, but teams are always looking for ways to improve.
Can we apply AI to the problem?

Thinking about this opportunity led us to conduct a month-long trial in June with 10 audio teams from our News, Content and International departments, to see if we could apply generative AI to help teams land more quickly on titles and descriptions that are the perfect match for their potential audiences.
Our thought process on how generative AI might provide a solution started with the fact that Large Language Models (LLMs) already know a fair bit about podcasts and about writing titles and descriptions. ChatGPT can chat about podcasts and give you some decent reviews and recommendations for podcasts to listen to – leaving you disappointed when you find out they don’t exist. And the chatbot’s training would have included a mind-boggling number of titles and descriptions. So, we wondered whether an LLM’s broad training might give us insights into diverse and effective options for podcast titles and descriptions.
LLMs generate their outputs based on text instructions called prompts. Short prompts such as “Write me a title for a podcast about cats and technology” tend to generate very generic, stereotypical responses, but we were after high-quality, precise, attention-grabbing, distinctive writing.
Prompts can now contain much more information than earlier generations of LLMs could handle. Subject matter experts can craft much longer, more detailed and structured prompts that shape and guide the LLM towards higher-quality outcomes. We wondered whether we could use the ABC’s expertise in podcasting to develop an LLM prompt that incorporated the ABC’s take on podcasting best practices and style.
Lastly, LLMs are currently most effective as personal productivity assistants for people working on tasks – an idea also known as ‘co-pilots’ or ‘co-intelligence’. Keeping a human in the loop is an approach that deals with the risks of using LLMs in media: the unpredictable and unreliable nature of LLM outputs, biases and stereotypes, and the famous issue of ‘hallucinations’ (where LLMs make stuff up).
This led us towards thinking about how the tool could engage its users in the process and provide some transparency on its workings, instead of just producing a final “answer” for the producer to select from.
After thinking through these considerations, we decided there could be value in creating and testing an interactive AI tool: one that leveraged an LLM’s broad knowledge of podcasting, focused it on our specific requirements through prompting, and generated best-practice title and description suggestions for podcast producers.
We demoed the idea to members of our audio teams to get their feedback on whether it had value to them. There would be no point moving ahead if they did not understand the idea, did not see any significant value in it, or even rejected the use of AI in content making.
We found staff were generally curious and interested in the potential of AI to save them time and help improve the overall quality of episode titles and descriptions. Some had very little exposure to AI beyond the hype about world domination — but they came away from demos intrigued by its potential as an assistant. “So, it’s no different to a thesaurus really,” one producer concluded.
Design of the trial
To get the trial going we needed software for staff to run the AI tool. We looked at several options, but Microsoft Copilot was the most accessible tool in the ABC. Staff already had access to the business version of it in their browsers. Fortunately, it also had features that made it a suitable option.
Next, using Copilot, we developed an initial title and description suggestion prompt that incorporated best practices for podcast title and description creation. We based the best practices on our own internal training resources, other publicly available sources and a set of examples of good titles and descriptions from our shows.
We did everything we could to optimise the prompt before handing it over to the audio teams. But there is only so far you can go when (a) each run of the prompt generates different results and (b) the additional inputs of the podcast producers would change the output in unknown ways.
We expected the prompt would just be a starter prompt that would be rapidly modified with the feedback from the producers.
We wanted to know early on how a wide range of producers would accept and use a tool like this in real life. We identified a range of teams so we could cover a wide variety of real-world uses and workflows. We were keen to incorporate a range of shows that had different formats, genres, workflows and target audiences. We were also keen on teams who were willing to give this a go!
There was little previous experience or understanding of AI usage among the group. We trained the teams in generative AI concepts, how to use the tool and LLM gotchas to look out for. Teams were also encouraged to experiment with editing the prompt in Copilot.
Evaluations were done by the teams in a spreadsheet after running the tests.
We asked them to rate the quality of the LLM suggestions overall, highlight any stand-out suggestions that would have been usable without modification, and to record their comments.
We also captured what they gave the LLM in the prompt and the suggestions they got. The idea was to create a dataset that would be useful as an input for decision making, prompt development and evaluating other AI models. We had weekly meetings to share what people had experienced and ideas for prompt improvements.
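To make that dataset concrete: each logbook entry effectively captured the inputs, outputs and ratings for one run of the prompt. A minimal sketch of such a record, using hypothetical field names rather than the ABC’s actual spreadsheet columns, might look like this in Python:

```python
# A minimal sketch of one trial logbook entry.
# Field names are illustrative assumptions, not the ABC's actual schema.
from dataclasses import dataclass, field

@dataclass
class TrialRecord:
    show: str                    # which podcast team ran the test
    episode_notes: str           # what the producer gave the LLM
    extra_instructions: str      # any per-run additions to the prompt
    suggestions: list[str]       # the titles/descriptions the LLM returned
    rating: int                  # overall usefulness, rated 1-5
    standouts: list[str] = field(default_factory=list)  # usable without modification
    comments: str = ""           # free-text producer feedback
```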
Risks and mitigation
In any discussion about generative AI, risk and responsible use is a strong consideration. It’s something the ABC has thought a lot about as part of the development of our AI Principles. This pilot was designed from the start to mitigate risks and was reviewed carefully by a cross-disciplinary AI Work Group before getting the green light.
A key component was that there was no public output. We even had the producers writing their published titles first so they wouldn’t be AI-influenced. Even if the trial was totally successful, the tool would still only be generating suggestions for the producers as an assistant.
Another key component was the inputs into the AI – the content the producers were likely to add was considered public information. No personal, confidential or otherwise sensitive information was given to the LLM.
It’s important to note that the producers were experts in the content they were working with and the outcomes they wanted for their shows, so they were in the best position to catch AI-generated errors and evaluate the overall quality of the suggestions.
Prompt design

We created a detailed prompt for generating podcast titles and descriptions. The prompt was structured to guide the LLM through an expert process that would hopefully lead to more accurate, reliable and higher-quality outputs.
It was based on research into best practices for writing podcast titles and descriptions and broke the title and description generation task down into a process of smaller steps that mirror the goals, context and standards that an expert human would work through. Audio teams were involved in shaping the prompt at the beginning and also during the evaluation.
To build this prompt, we made use of a feature in MS Copilot called Notebook that supports editing detailed sets of instructions and lengthy data about the podcast for the LLM to chew on. Since we ran the pilot, Microsoft appears to have dropped the Notebook feature in its most recent Copilot update; however, you can now do much the same thing from the central Copilot interface.
The prompt provided the LLM with context about its role and goal, detailed task instructions, and examples to learn from, to help it deliver customised output. It also instructed the LLM to work through that expert process step by step and show its working to the producer.
Prompts are just text and can be easily copied, pasted, shared and changed. Producers were asked to copy and paste the prompt into their Copilot and then add in their podcast episode notes and any extra instructions for the LLM to work with. They would review the LLM’s output and, if time permitted, optionally adjust their prompt and run again.
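The ABC’s actual prompt hasn’t been published, but to make the structure concrete, here is a minimal sketch of a prompt of this kind, with placeholder wording for the role, goal, process and style rules, assembled in Python purely for illustration:

```python
# Illustrative sketch only: the shape of a detailed, staged prompt.
# All wording below is hypothetical; the ABC's actual prompt was not published.

PROMPT_TEMPLATE = """\
ROLE: You are an experienced podcast producer and subeditor.

GOAL: Suggest episode titles and descriptions that help listeners (and
recommendation algorithms) quickly understand and choose this episode.

PROCESS (show your working at each step):
1. Summarise the episode notes in one sentence.
2. Identify the hook: the most specific, attention-grabbing element.
3. Draft five titles in different styles. Avoid colons and sensationalism.
4. Draft two descriptions that expand on the hook.

STYLE RULES:
- Plain, precise language; no cliches.
- Use prominent guest names when they are supplied.

EXAMPLES OF GOOD TITLES:
{examples}

EPISODE NOTES:
{episode_notes}
"""

def build_prompt(episode_notes: str, examples: list[str]) -> str:
    """Fill the template with this episode's notes and house-style examples."""
    return PROMPT_TEMPLATE.format(
        examples="\n".join(f"- {t}" for t in examples),
        episode_notes=episode_notes.strip(),
    )
```

In the trial itself this all lived as plain text pasted into Copilot; the point of the sketch is only the staged structure — role, goal, a process the model must show, style rules, examples, and the episode notes appended at the end.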
A key part of this trial was getting feedback from teams with different workflows so we could get an early understanding of how this could fit into those workflows — or not.
What were the suggestions like?
Across the trial, the average usefulness score of the suggested titles was 2.3 out of 5, although some teams rated a number of title suggestions at 3.8 and 4. These figures were drawn from the logbooks producers were asked to keep.
Three shows gave the highest title ratings overall, and the teams that found the tool least useful were mostly those under the most time pressure in their workflows – two of our News shows in particular.
Good and bad examples of the suggested titles appeared here as images in the original article.
Colons and capitalisation were issues that proved surprisingly hard to control from the prompt; both are very popular formats for podcast titles, particularly in the US. We encountered one really strange misspelling (‘Algopseak’) that we didn’t see in any of the other suggestions, and one example of hallucination was noted.
What were the overall findings?
After the pilot concluded, one of the journalists who participated sent us some feedback that summed up the overall experience perfectly — better than we, or AI, could:
“The downside (and somewhat encouraging) learning from this experiment for me has been that AI doesn’t seem to be better than humans at headline writing. While it occasionally comes up with a usable (if uninspired) headline, it’s a long way from what we humans can come up with.
Having said this, it is still enormously useful on those busy days when you need some inspiration as a starting point. It also seems to be incredibly useful at summarising content which can be useful for writing prospect notes or even just pitching your latest program in emails.
Over the past four weeks of the experiment, I’ve come to think of AI as useful as a possible second brain for the menial and often time-consuming tasks I perform every day that require a little creativity and thought, but can bog me down and keep me from the day’s priorities. Like an enthusiastic and very diligent, but slightly untrustworthy, assistant.”
— Mario Christodoulou, investigative journalist and producer, Background Briefing
In a nutshell, our use of Copilot for generating podcast titles and descriptions did not meet our producers’ requirements for nuanced creativity, tone, and editorial judgement.
Some runs of the prompt instructions were way better than others, but it was a game of prompt roulette, and our podcast producers wanted time savings as well as inspiration.
Despite our best efforts optimising the prompt, there was a tendency to sensationalise, there was a lack of originality, and many of the suggestions were overly simplistic. Some of the words used by teams to describe the suggested titles and descriptions included: generic, corny, sappy and cringe.
Our prompt explicitly tried to coax a wide diversity of approaches from Copilot to provide creative inspiration for podcast producers. We saw that as a potential benefit of using an LLM that has seen a massive range of approaches. But Copilot’s suggestions in most runs of the prompt were very similar in wording, style and format.
The biggest frustration for everyone was the over-use of punctuation. Despite our best efforts to stamp them out via the prompt instructions, we couldn’t reduce the number of suggested titles with colons in them.
With due respect to all the subeditors out there, we didn’t expect a relatively small thing like a colon would prove such a big challenge! We feel that solving the colon bias problem would have influenced the scores significantly.
Copilot also sometimes ignored the podcast guest names it had been given in the prompt, despite explicit instructions about the value of using prominent guest names.
One useful finding was that more focused input increased quality. When producers crafted concise “Episode Notes” for their episodes, we saw better results. Broader inputs (e.g. full podcast transcripts, or lists of all the segments from multi-segment shows) had a negative impact – we saw broader, more generic and clichéd responses.
The importance of human judgement was evident when it came to what to give the LLM to work with.
Also on the human side of the work, as we progressed through the pilot we saw different requirements and preferences for titles and descriptions expressed by each of the teams.
All the shows had unique styles and workflows and the producers involved developed suggestions for customisations of the prompt that could better fit the needs of their show, down to different types and formats of series and episodes they published.
The two news teams involved reported that the process was slower and of less value than if they had written the titles and descriptions themselves.
One show publishes 12-13 segments per day and the other one publishes 6-7 per day, so that’s a significant time investment when weighed against the volume of content each one produces. The producers get very good at working quickly.
But as we saw in the feedback from Background Briefing’s Mario, a consistent theme from many teams was the value of AI as a creative assistant. The teams clearly saw it as a useful starting point for ideas, suggestions and possible angles for episode titles and descriptions.
Next steps
The podcast producers found the AI tool broadly useful. The focus now is on lifting the average rating of the suggestions the LLM provides. We could simply wait for LLMs to improve, but the trial feedback suggested some directions we can dig into now.
We would like to target show-level prompt customisation to see if title prompts that directly focus on the specific style and tone of a particular show produce better suggestions. We’re curious to see if this narrower and more specific focus around a couple of shows may help meet producer expectations and get more reliable responses.
We used Microsoft Copilot for the pilot, but we now want to test other leading models to see if there are improvements in prompt steerability, understanding of the source content and writing quality. We can use the data gathered in this project to rerun the tests and see roughly how the results compare.
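As a rough sketch of how such a rerun could work, with `generate` as a hypothetical stand-in for whichever model API is under test (no real vendor SDK is implied):

```python
# Sketch of replaying logged trial inputs against several candidate models,
# so producers can re-rate the outputs side by side.

def generate(model: str, prompt: str) -> str:
    """Placeholder: call the model under test and return its suggestions."""
    raise NotImplementedError("wire up the model API under test here")

def replay(logged_prompts: list[str], models: list[str]) -> dict[str, list[str]]:
    """Run every logged prompt through each candidate model."""
    return {
        model: [generate(model, prompt) for prompt in logged_prompts]
        for model in models
    }
```

Because the trial showed that quality judgement is the producers’ strength, any comparison of models would still rest on humans re-rating each model’s outputs.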
Some of the producers involved in this trial came up with their own use cases for chatbots using their new prompting skills. We’d like to think more about workflows where AI serves as an interactive assistant rather than a replacement.
The feedback from the teams involved was that the initial training in using AI was very useful, so expanding training could help producers explore the potential productive uses of AI themselves. Producers commented that they were much less nervous about the impact of AI on their jobs now that they had actually used it.
A number of other public service media organisations (PSMs) around the world are also experimenting with AI in their workflows, so we’re interested to see whether we can compare what we learned in this trial with colleagues at other PSMs.
About the authors
Andrew Davies is the manager, Audio-On-Demand Impact and Strategy, in the ABC’s Content division.
Craig McCosker is the group product manager for Future Focus in the ABC Digital Product team.