The Foulest Mouth of Them All
Your Task
Find the foulest mouthed contestant in UK Taskmaster to date. Bonus points for finding the foulest mouth in each series.
A Side Amble in the Preamble
In the spirit of good coding practice, namely avoiding duplicated code and centralising common code in a single location where possible, here is a set of preamble scripts that I will source at the beginning of each post.
These `R` scripts:
- Configure some global output settings to make these posts and graphics aesthetically pleasing.
- Establish the connection to the `TdlM` database file.
Contents of this preamble code can be found in this git location.
library(here)
here("static")
## [1] "C:/Users/cfhna/Documents/Git_Projects/RStudio/themedianduck/static"
preamble_dir <- here("static", "code", "R", "preamble")
preamble_file <- "post_preamble.R"
source(file.path(preamble_dir, preamble_file))
source(file.path(preamble_dir, "database_preamble.R"))
## [1] "Database Connection, tm_db, is now ready to use."
source(file.path(preamble_dir, "graphics_preamble.R"))
Some Motivation To This Preamble
Before this preamble existed, I copied roughly 40 lines of setup code from post to post and executed them each time. That is fine for quick, one-off purposes, but it is not good practice in the longer term. For example, if I wanted to change the preamble code, I would have to change every place it is used (that is, the markdown file behind each post).
Going forward, the preamble files will be sourced in about 5 lines of code. If I want to change the preamble, I only have to make the change in one location, and it will naturally propagate to wherever the code is used.
Profanity Insanity
To answer our questions of interest, we likely need to access the following tables:
- `profanity`: Table detailing the profanity observed in an episode.
- `people`: High-level information on contestants, Greg Davies and Alex Horne (gender, DOB, dominant hand).
- `series`: Snapshot of overall series information.
The Rate of Profanity Caveat
It is worth reminding ourselves that the number of episodes per series has varied over Taskmaster’s broadcast run. Consequently, we cannot compare series using the total profanity uttered by a contestant (or any speaker) in a series, as contestants from longer series will likely appear more foul-mouthed.
In order to allow for valid comparisons between series, we introduce a new metric, namely the profanity rate which normalises the profanity total by the number of episodes in the series.
\[\texttt{Profanity Rate for Contestant C in series S} = \frac{\sum{\texttt{Profanity by contestant C in series S}}}{\texttt{Number of episodes in series S}} \]
The profanity rate can be thought of as the average number of times contestant C swears per episode of series S.
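As a small worked example of the formula, here is a minimal sketch in `R`. The helper `profanity_rate` and the contestants and counts below are made up for illustration and do not come from the database.

```r
# Profanity rate: total profanities in a series divided by the
# number of episodes in that series.
profanity_rate <- function(total_profanity, num_episodes) {
  stopifnot(all(num_episodes > 0))
  total_profanity / num_episodes
}

# Hypothetical contestants (made-up numbers, not from the database).
toy_rates <- data.frame(
  contestant = c("A", "B"),
  total_profanity = c(12, 15),
  num_episodes = c(6, 10)
)
toy_rates$rate <- profanity_rate(toy_rates$total_profanity, toy_rates$num_episodes)
toy_rates
```

Contestant B swears more in total, but contestant A has the higher rate because A's series is shorter. This is exactly the distortion the rate is meant to remove.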
The Next Level of Profanity
To assist our foul mouthed quest, it would be useful to create new temporary subtables which combine, transform and/or aggregate data from the various database tables we outlined above.
For example, we might want to create:
- an enhanced version of the `profanity` table which contains interpretable information on the speaker and series rather than numerical ids (they are people with names, not numbers).
- an aggregate of this enhanced `profanity` table (which is at a series, speaker, task level of granularity), such that we have profanity at a series level for a contestant.
These transformations can be done in `SQL` or `R`. Based on personal preference, the former will be used for joins and aggregations, and the latter for more technical transformations (for example, the calculation of new statistics).
Enhanced Profanity
The following query combines the data in the `profanity`, `people` and `series` tables. The data remains at a low level of granularity, namely the utterance of a profanity by a particular contestant in a task.
The output of this query is stored directly as an `R` object named `profanity_enh`:
-- Stored as an R dataframe profanity_enh
SELECT
    pf.series,
    pf.episode,
    pf.task,
    pf.speaker AS speaker_id,
    pp.name AS speaker_name,
    pf.roots,
    pf.quote,
    pf.studio,
    pp.gender,
    pp.hand,
    pp.champion,
    pp.tmi AS speaker_tmi,
    sp.name AS series_name,
    sp.episodes AS num_episodes_in_series,
    sp.champion AS series_champion_id,
    sp.special
FROM profanity pf
-- Attach speaker details by matching on the speaker id.
LEFT JOIN people pp
    ON pf.speaker = pp.id
-- Attach series details by matching on the series id.
LEFT JOIN series sp
    ON pf.series = sp.id
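The post does not show how the query is run from `R`. One possible route, sketched below, is `DBI::dbGetQuery`. To keep the sketch self-contained it uses an in-memory SQLite database (via the `RSQLite` package) with tiny hypothetical tables and a cut-down version of the query, not the real `TdlM` data or the `tm_db` connection.

```r
library(DBI)

# In-memory SQLite database with hypothetical toy tables (not the real TdlM data).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "profanity", data.frame(
  series = c(1, 1), episode = c(1, 2), task = c(10, 20),
  speaker = c(101, 999),
  roots = c('["blimey"]', '["crikey", "blimey"]')
))
dbWriteTable(con, "people", data.frame(
  id = 101, series = 1, name = "Contestant A"
))
dbWriteTable(con, "series", data.frame(
  id = 1, name = "Series 1", episodes = 6
))

# Cut-down version of the enhancement query: ids are replaced by names.
toy_enh <- dbGetQuery(con, "
  SELECT pf.series, pf.episode, pf.speaker AS speaker_id,
         pp.name AS speaker_name, pf.roots,
         sp.name AS series_name, sp.episodes AS num_episodes_in_series
  FROM profanity pf
  LEFT JOIN people pp ON pf.speaker = pp.id
  LEFT JOIN series sp ON pf.series = sp.id
  ORDER BY pf.episode
")
dbDisconnect(con)

toy_enh
```

Because both joins are `LEFT JOIN`s, every row of `profanity` survives. The second toy row has a speaker id with no match in `people`, so its `speaker_name` comes back as `NA` rather than the row being dropped.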
Series Profanity
The following code takes the recently created `profanity_enh` dataframe and performs a number of operations which eventually result in a new dataset, `series_profanity`, at a series, contestant level. Operations include:
- Counting the number of profanities uttered in a given quote.
- Aggregating data to a series and speaker level:
  - Sum of the profanities uttered.[1]
  - Number of distinct episodes that the profanities are uttered over.
  - Number of episodes in the series.
- Adding a new column which calculates the profanity rate.
library(reticulate)
library(dplyr)

series_profanity <- profanity_enh %>%
  # Count the number of profanities uttered in each quote.
  rowwise() %>%
  mutate(num_profanity = length(reticulate::py_eval(roots))) %>%
  # Aggregate and summarise the data at a series, speaker level.
  group_by(series, series_name, special, speaker_id, speaker_name, speaker_tmi, gender, hand) %>%
  summarise(
    speaker_episode_count = dplyr::n_distinct(episode),
    sum_profanity_series = sum(num_profanity),
    no_episodes_in_series = max(num_episodes_in_series)
  ) %>%
  # Profanity rate: profanities per episode of the series.
  mutate(profanity_per_episode = sum_profanity_series / no_episodes_in_series)
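The counting step above needs a working Python installation for `reticulate`. If the `roots` values are double-quoted lists such as `["little", "alex", "horne"]`, they are also valid JSON, so one alternative is to count them with `jsonlite`. This is a sketch of a swapped-in technique, not the approach used in this post, and `count_roots` is a hypothetical helper name.

```r
library(jsonlite)

# Count the entries in each JSON-style list string, e.g. '["a", "b"]' -> 2.
# Assumes double-quoted strings; single-quoted Python lists are not valid JSON.
count_roots <- function(roots) {
  vapply(roots, function(x) length(fromJSON(x)), integer(1), USE.NAMES = FALSE)
}

root_counts <- count_roots(c('["little", "alex", "horne"]', '["little"]', '[]'))
root_counts
```

Since the helper is vectorised over its input, it could in principle be used as `mutate(num_profanity = count_roots(roots))` without the `rowwise()` step.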
The Foulest Mouth of Them All…
We are nearly ready to answer our first foul-mouthed question! A few more considerations will form the basis of our logic:
- we will consider only standard series of UK Taskmaster and not specials (no New Year Treats or Champion of Champions).
- we will consider only contestants and not Greg Davies or Alex Horne.
- the foulest-mouthed contestant will be the one with the largest profanity rate.
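These three rules can be sketched in `dplyr` as a filter followed by `top_n`. The data frame below is a hypothetical stand-in for `series_profanity`, with made-up names and rates, so that the example runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for series_profanity (made-up names and rates).
toy_series_profanity <- tibble(
  series_name = c("Series A", "Series A", "Series B", "Special X"),
  speaker_name = c("Contestant 1", "Greg Davies", "Contestant 2", "Contestant 3"),
  special = c(0, 0, 0, 1),
  profanity_per_episode = c(2.5, 9.0, 4.0, 12.0)
)

foulest <- toy_series_profanity %>%
  # Rules 1 and 2: standard series only, contestants only.
  filter(special == 0, !(speaker_name %in% c("Greg Davies", "Alex Horne"))) %>%
  # top_n() works within groups, so drop any grouping first.
  ungroup() %>%
  # Rule 3: keep the row with the largest profanity rate.
  top_n(1, profanity_per_episode) %>%
  select(series_name, speaker_name, profanity_per_episode)

foulest
```

The host and the special both have higher rates in this toy data, but they are filtered out before the ranking, so only a regular-series contestant can win.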
And with that, our foul mouthed winner is…
series_name | speaker_name | gender | hand | profanity_per_episode |
---|---|---|---|---|
Series 1 | Romesh Ranganathan | M | R | 7.667 |
And there we have it: Romesh Ranganathan[2] from Series 1 is the foulest-mouthed contestant on UK Taskmaster, with a profanity rate of 7.667. In other words, Romesh swore about 7.667 times per episode on average.
Based on my recollection of Series 1 and Romesh’s angry persona, it is not entirely surprising that he is the most foul-mouthed contestant!
A Close Finish?
Some readers may be interested in knowing whether it was a close finish in the profanity rate race.
We can quickly find out by changing the `top_n` call to select, say, 5 rows rather than 1 when selecting on profanity rate.
series_name | speaker_name | gender | hand | profanity_per_episode |
---|---|---|---|---|
Series 1 | Romesh Ranganathan | M | R | 7.667 |
Series 6 | Asim Chaudhry | M | R | 7.300 |
Series 6 | Russell Howard | M | R | 5.800 |
Series 2 | Doc Brown | M | R | 4.200 |
Series 3 | Rob Beckett | M | R | 4.200 |
There are no big surprises in these finishing positions, although Asim Chaudhry being a close second is somewhat surprising, since I don’t remember him being particularly angry, nor any foul-mouthed incident involving him[3]. He’s a relatively mild-mannered comedy actor who just wants everyone to know he is a vegan.
Another observation is that the top 5 are all male and right handed. Make of that what you will…
Bonus Task: Foulest Mouth in Each Series
To find the foulest mouth in each Taskmaster series, we can continue to use the existing data and logic, but introduce an additional line of logic to rank within a series; the `group_by` function is our friend here. The `rank` function provides a ranking with respect to profanity rate; for a descending ranking, a minus sign is placed in front of the variable we want to rank by.
within_series_profanity <- series_profanity %>%
  filter(special == 0 & !(speaker_name %in% c("Greg Davies", "Alex Horne"))) %>%
  arrange(series, -profanity_per_episode, -speaker_episode_count) %>%
  group_by(series, series_name) %>%
  mutate(
    profanity_rank = rank(-profanity_per_episode, ties.method = "first"),
    name_prof_rate = sprintf("%s (%#.3f)", speaker_name, profanity_per_episode)
  )
Filtering on `profanity_rank == 1` will then give the foulest-mouthed contestant in each series.
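As a sketch of the rank-then-filter step, the example below uses hypothetical toy data (two series, made-up contestants and rates) rather than the real `within_series_profanity` table.

```r
library(dplyr)

# Hypothetical toy data: two series, made-up contestants and rates.
toy_ranked <- tibble(
  series = c(1, 1, 1, 2, 2),
  speaker_name = c("A", "B", "C", "D", "E"),
  profanity_per_episode = c(1.2, 3.4, 3.4, 0.5, 2.0)
)

foulest_by_series <- toy_ranked %>%
  group_by(series) %>%
  # Negating the rate gives rank 1 to the highest rate within each series.
  mutate(profanity_rank = rank(-profanity_per_episode, ties.method = "first")) %>%
  filter(profanity_rank == 1) %>%
  ungroup()

foulest_by_series
```

With `ties.method = "first"`, tied contestants are ranked in row order, so B is picked over C in series 1. This is why the order set by `arrange` before ranking matters when two contestants share the same rate.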
Within Series Foul Mouthed Races
Similar to the overall foul-mouthed analysis, we might be interested in seeing how close the profanity race was in each series. We may also find our surprising insights less surprising once we have looked at the whole race.
What Have We Learnt Today?
[1] To count the number of profanities, we rely on the `roots` column, and the `reticulate` library has been employed. An example value in the `roots` column might be `["little", "alex", "horne"]`, which has the form of a Python `list` object. `reticulate` and its `py_eval` function allow us to interpret this as a `list` object from within `R`, convert it to its `R` equivalent (an `R` vector), and manipulate it as an `R` object (namely taking the `length` of it to count the number of profanity occurrences). It’s a convenient way for me to deal with these data types, which may not be natural in `R` but are natural in another language.
[2] aka tree wizard
[3] I’m looking at you Ed Gamble and Daisy May Cooper for these sorts of outbursts.