The biggest theater festival in the world takes place every July in the south of France. It lasts a whole month, and brings together ~300 000 participants for 1500 daily shows in the beautiful medieval city of Avignon.
With 1500 shows, even the most dedicated attendee won’t see even 10% of the program. If only there were a way to explain my preferences to a helpful advisor and then ask them to review all 1500 shows for things I might like 🤔.
Let’s ask GPT
Large Language Model–based tools like ChatGPT and Bard are good at language. We can explain the types of shows we like and dislike, and then ask for their advice about one specific show.
See for example:
GPT gets it right. For a contemporary show with elements of improv that I loved, it responds:
And for a comedy play I wasn’t interested in, it responds:
Note: the descriptions of the shows I used are in French while the rest of the prompt is in English. The LLM doesn’t seem to care at all 💫.
Scaling challenge
The only remaining problem is that we can only ask ChatGPT or Bard about one show at a time. What we really want is to ask AI to review all 1500 shows and find the ones that look best for my preferences.
Ask and you shall receive doesn’t quite work. If we modify the prompt above:
then ChatGPT confesses its limitations: Unfortunately, as an AI, I don’t have direct access to browse the internet or specific websites like the Festival of Avignon. (And then it nevertheless hallucinates a few suggestions of plays that don’t exist.)
Bard does much better in the sense that it suggests plays that actually play in Avignon :). But the result is a handful of plays that seem to be selected from web searches:
This is nice, but I’d like a more comprehensive report that ranks every play in the program with the likelihood that I’d enjoy it.
Building a robot advisor
Our recommendation system will be composed of two parts:
- The crawler. A program that will go through the festival website and record the description of every show
- The ranker. A program that will go through the extracted descriptions one by one, and for each show ask a GPT model to estimate the probability that I will like it
At the end we will sort the results by score and I will go to see one of the shows that came out on top :).
The crawler
The crawler has one job: visit every page of the Off Avignon festival program and capture the description of each play.
We could have done it by hand, visiting the pages one by one and copying the description to a text file. Assuming 10s to load the page, 10s to save the description and 10s to click to the next show, this would take around 12h with no breaks. We’re on vacation so let’s save some time.
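The 12h figure is a back-of-the-envelope estimate; the arithmetic, using only the numbers quoted above, looks like this:

```python
# Rough cost of copying every show description by hand.
SHOWS = 1500
SECONDS_PER_SHOW = 10 + 10 + 10  # load page, save description, click next

total_hours = SHOWS * SECONDS_PER_SHOW / 3600
print(f"{total_hours:.1f} hours")  # 12.5 hours
```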
The Python library scrapy makes it pretty quick:
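import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

class CrawlerSpider(scrapy.Spider):
    name = "crawler"
    start_urls = ["https://www.festivaloffavignon.com/programme/2023/"]
    allowed_domains = ["festivaloffavignon.com"]
    custom_settings = {"FEED_EXPORT_ENCODING": "utf-8"}
    # Note: `rules` only takes effect on CrawlSpider subclasses; with a plain
    # scrapy.Spider the links are followed explicitly in parse() instead.
    rules = (
        Rule(LinkExtractor(allow=(r"programme/2023/",))),
    )

    def parse(self, response):
        # TODO: process page
        pass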
The parse method will be called for every page retrieved by the crawler. All we need to do is:
- decide if the page contains a description of a show
- if yes, extract the description
- follow any links in the page that may link to other shows
The first two steps require some manual review of the HTML structure of the pages we’re going to crawl. We can do that by right-clicking on the webpage in Chrome and selecting “Inspect”. It seems that pages containing show descriptions have the .image-spectacle class set on the show cover picture.
So to decide if the specific page contains the show info, we can simply check for that class being set anywhere in the response:
class CrawlerSpider(scrapy.Spider):
    def parse(self, response):
        if response.css('div.image-spectacle'):
            # TODO: extract show info
            pass
To extract the show description, we notice that it’s set in a paragraph immediately following a header that says “Résumé du spectacle”:
<h4 class="category text-black">Résumé du spectacle</h4>
<p>Seul-en-scène improvisé. Dans un bar, alors qu’il noie sa mélancolie à une heure tardive ...</p>
This is a bit harder to express cleanly, but this xpath selector will teach the crawler to grab the first paragraph after a header that says “Résumé du spectacle”:
response.xpath("//*[contains(text(), 'Résumé du spectacle')]/following-sibling::p[1]/text()").getall()
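To make the selector's intent concrete, here is a dependency-free sketch of the same idea, using Python's standard html.parser rather than the XPath engine Scrapy uses. The class and function names (ResumeParser, extract_resume) are made up for this illustration. It is also looser than the XPath: it takes the first paragraph that appears anywhere after the heading, not strictly a sibling.

```python
from html.parser import HTMLParser


class ResumeParser(HTMLParser):
    """Collects the text of the first <p> that appears after a given heading."""

    def __init__(self, heading_text):
        super().__init__()
        self.heading_text = heading_text
        self.seen_heading = False  # True once the heading text was found
        self.in_paragraph = False  # True while inside the paragraph we want
        self.done = False          # True once that paragraph was closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.seen_heading and not self.done:
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            self.done = True

    def handle_data(self, data):
        if self.in_paragraph:
            self.chunks.append(data.strip())
        elif not self.seen_heading and self.heading_text in data:
            self.seen_heading = True


def extract_resume(html, heading_text="Résumé du spectacle"):
    parser = ResumeParser(heading_text)
    parser.feed(html)
    parser.close()
    # Same clean-up as in the spider: strip each chunk, join with spaces
    return " ".join(chunk for chunk in parser.chunks if chunk)


# The first two lines are the markup quoted above; the last one is made up.
SAMPLE = """
<h4 class="category text-black">Résumé du spectacle</h4>
<p>Seul-en-scène improvisé. Dans un bar, alors qu’il noie sa mélancolie à une heure tardive ...</p>
<p>Unrelated paragraph that should be ignored.</p>
"""

print(extract_resume(SAMPLE))
```

In the real spider the XPath one-liner is the better choice; this version only spells out what "the paragraph right after that heading" means.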
While at it, we will also record the name and the URL of each show.
def parse(self, response):
    if response.css('div.image-spectacle'):
        show_resume_chunks = response.xpath("//*[contains(text(), 'Résumé du spectacle')]/following-sibling::p[1]/text()").getall()
        show_resume = ' '.join(t.strip() for t in show_resume_chunks)
        yield {
            "url": response.url,
            "title": response.css('h1.page-titre::text').get(),
            "resume": show_resume
        }
Finally, we add the bit that will follow the relevant links in the page, so that the crawler keeps crawling:
def parse(self, response):
    # (...)
    for link in response.css("a::attr(href)").getall():
        if link and not (link.strip().startswith("https://www.festivaloffavignon.com/programme/2023/")
                         or link.strip().startswith("/programme/2023/")):
            continue
        yield response.follow(link, callback=self.parse)
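The link filter can also be pulled out into a small function that is easy to test without running a crawl. The sketch below is one possible variant, not the code used above: is_program_link is a hypothetical helper, and unlike the prefix check it resolves relative links against the page URL first, so a link like "paquita-s33433/" found on a program page would be accepted too.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.festivaloffavignon.com/programme/2023/"


def is_program_link(href, page_url=BASE_URL):
    """Return True if href points inside the 2023 program section."""
    if not href or not href.strip():
        return False
    # Resolve relative links against the page they were found on
    absolute = urljoin(page_url, href.strip())
    parsed = urlparse(absolute)
    return (parsed.netloc == "www.festivaloffavignon.com"
            and parsed.path.startswith("/programme/2023/"))


print(is_program_link("/programme/2023/paquita-s33433/"))  # True
print(is_program_link("https://example.com/programme/2023/"))  # False
print(is_program_link("mailto:someone@example.com"))  # False
```

Inside parse, the loop body would then reduce to a check like `if is_program_link(link, response.url)` before calling response.follow.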
Since the crawler is making about one request per second, I started it and went to see a theater show :). Upon coming back I had a neat JSON file containing the description of each show:
[
  {
    "url": "https://www.festivaloffavignon.com/programme/2023/slapstick-s-brothers-s33165/",
    "title": "Slapstick's Brothers",
    "resume": "Slapstick's Brothers raconte l'histoire de la naissance du cinéma, (...)"
  },
  {
    "url": "https://www.festivaloffavignon.com/programme/2023/boops-sisters-cabaret-show-s34302/",
    "title": "Boops Sisters' Cabaret Show",
    "resume": "Un duo clownesque musical et déjanté !! Deux sœurs tout droit venues (...)"
  },
  {
    "url": "https://www.festivaloffavignon.com/programme/2023/paquita-s33433/",
    "title": "Paquita !",
    "resume": "Janvier 1939. Paquita, 11 ans, doit fuir l’Espagne suite à l’arrivée (...)"
  },
  (...)
]
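The ranker code further down reads fields as attributes (show_info.resume, show_info.title) and calls a save_show_info helper that isn't shown in this post. A minimal sketch of how the JSON file could be loaded and saved that way follows. load_show_info and its body are assumptions, the body of save_show_info is a guess at what such a helper might do, and the file name shows.json is only an example.

```python
import json
from types import SimpleNamespace


def load_show_info(path):
    """Load the crawler output as objects with .url, .title and .resume."""
    with open(path, encoding="utf-8") as f:
        return [SimpleNamespace(**record) for record in json.load(f)]


def save_show_info(shows, path):
    """Write the shows back to JSON, including any attributes added later."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the French accents readable in the file
        json.dump([vars(show) for show in shows], f, ensure_ascii=False, indent=2)


# Example usage (file name is an assumption):
# loaded_show_info_list = load_show_info("shows.json")
```

SimpleNamespace is used here because it lets the ranker attach new fields such as a prediction to each show without declaring a class up front.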
The ranker
Now that we know the description of every show that’s playing in Avignon, we can ask GPT to score them for us.
We’re going to use a prompt like the one we used when experimenting with ChatGPT/Bard above. Because we have 1500 requests to make, we just need to automate them. I used the OpenAI API and the GPT 3.5 model.
The program is pretty straightforward. For every show it does two things:
- make a request for GPT to score it
- parse the response
At the end, we want to sort the results by the score.
Making the request is just combining the fixed prompt with the recorded show description and then making an API call:
TEMPLATE = """
I'm at a theatre festival and there are a lot of shows. I'd like to see:
- elements of improv and audience participation
- characters in their 30s searching for meaning in life
- dystopian commentaries on society and technology
I'd like to avoid:
- mass-appeal comedy
- shows intended for children or seniors
Based on these preferences, estimate whether I will like the show described
by the following pitch in triple backticks below. Respond with a probability
number from 0% (no chance I will like it) to 100% (certain that I will like it)
along with a short rationale (single sentence).
"""
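import openai  # pre-1.0 openai SDK, which exposes ChatCompletion

def evaluate(show_info):
    # Wrap the pitch in triple backticks, as announced in the prompt
    message = TEMPLATE + "```\n" + show_info.resume + "\n```\n"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message["content"]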
To parse the results, we use a regular expression to extract anything that looks like a percentage number from the server response:
import re

def find_percent(string):
    # Capture the digits right before the first '%' sign
    regex = r"(\d+)%"
    match = re.search(regex, string)
    if match is not None:
        return int(match.group(1))
    return None
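A few quick checks show how this kind of parsing behaves on different reply shapes. The replies below are made up for illustration, not actual model output, and the function is repeated so the snippet runs on its own.

```python
import re


def find_percent(text):
    """Return the first 'NN%' found in text as an int, or None."""
    match = re.search(r"(\d+)%", text)
    return int(match.group(1)) if match else None


# Made-up replies, not real GPT output
print(find_percent("85% - the pitch mentions improv."))  # 85
print(find_percent("I would estimate 0%."))              # 0
print(find_percent("Hard to say from this pitch."))      # None
print(find_percent("Between 60% and 70%."))              # 60
```

Two things are worth keeping in mind: only the first percentage in the reply is used, and 0 is a valid score that is falsy in Python, so callers should compare the result against None rather than test its truthiness.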
Then the part that puts it together:
results = []
for show_info in loaded_show_info_list:
    # Ignore server errors, a more robust solution would be to retry
    try:
        response = evaluate(show_info)
    except Exception:
        print(f'Skipping {show_info.title}, server issue?')
        continue  # no response to parse for this show
    print(response)
    percent = find_percent(response)
    # Compare against None so that a legitimate 0% score is kept
    if percent is None:
        print(f'Skipping {show_info.title}, no percentage in response')
        continue
    show_info.prediction = percent
    show_info.rationale = response
    results.append(show_info)

# Highest predicted probability first
ordered_results = sorted(results, key=lambda show_info: -show_info.prediction)
save_show_info(ordered_results, 'predicted_v0.json')
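The comment in the loop mentions that retrying would be more robust than skipping. One way to do that is a small wrapper with exponential backoff. This is a sketch, not what I ran: with_retries and its parameters are hypothetical, and the delays are arbitrary.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on an exception wait and try again, up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide to skip the show
            # Wait 1s, 2s, 4s, ... (with the default base_delay) before retrying
            sleep(base_delay * 2 ** attempt)


# In the ranking loop, the call would become something like:
# response = with_retries(lambda: evaluate(show_info))
```

The existing try/except around the call still makes sense: it now only triggers once all attempts have failed.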
That’s it! Because the ranker also takes a while to go through 1500 shows (and make a blocking call to GPT each time), I once more left to see a theater show while it was doing its ranking :).
Results
The top four shows were:
Name | Prediction | Rationale |
---|---|---|
Sous le Plancher | 100% | Based on the description, the show seems to explore the theme of boredom and the possibilities it holds, which aligns with the idea of characters in their 30s searching for meaning in life. Additionally, the mention of live music, imagery, and the invitation for both children and adults to travel … |
Martyr | 95% | The pitch mentions a chaotic era and a teenager searching for reasons to exist, which suggests elements of dystopian commentary and characters searching for meaning in life. Additionally, the mention of amateur and professional actors of various ages and backgrounds implies potential for audience … |
Ne quittez pas [s'il vous plaît] | 95% | This show seems to fulfill all of your preferences - it includes elements of improv and audience participation, features characters in their 30s searching for meaning in life, and is a dystopian commentary on society and technology. It also avoids mass-appeal comedy and shows intended for children … |
Penetrator | 90% | Based on your preferences, the show described in the pitch seems to align with your interests: The pitch mentions elements of audience participation and characters in their 30s searching for meaning in life. Additionally, it hints at dystopian commentaries on society and technology. Although the … |
Out of those, Penetrator seemed the most intriguing, and that’s the one I went to see.
It turned out to be a Scottish play of the In-yer-face genre (the most prominent representative of which is Trainspotting). Two flatmates receive a surprise visit from an old friend who spent the last few years in military service. Now he’s back, hiding from a mysterious organization, “Penetrator”, bent on hunting him down.
It could be a placebo effect, but I loved it :).