Fig 1.
Viewer rating structure across series.
IMDb vote distributions decomposed into a tri-modal mixture model for each series. Red and green spikes indicate the proportion of 1-star and 10-star ratings respectively, while the Gaussian curve captures central ratings (2–9). Series are labeled with official IMDb scores.
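As a rough illustration of this decomposition, the sketch below splits a 10-bin vote histogram into a 1-star share, a 10-star share, and a Gaussian summary of the central ratings (2–9). It assumes a simple moment-based fit (weighted mean and standard deviation), which may differ from the fitting procedure actually used for the figure. The function name `decompose_votes` and the vote counts are hypothetical.

```python
from math import sqrt

def decompose_votes(counts):
    """Split vote counts for ratings 1..10 (index 0 = 1 star) into
    a 1-star share, a 10-star share, and Gaussian moments for ratings 2-9.

    Assumption: the Gaussian component is summarized by weighted moments,
    not by a likelihood-based mixture fit.
    """
    if len(counts) != 10:
        raise ValueError("expected 10 vote counts, one per rating")
    total = sum(counts)
    if total == 0:
        raise ValueError("no votes")
    central = {rating: counts[rating - 1] for rating in range(2, 10)}
    n_central = sum(central.values())
    if n_central == 0:
        raise ValueError("no central (2-9) votes")
    mean = sum(r * c for r, c in central.items()) / n_central
    var = sum(c * (r - mean) ** 2 for r, c in central.items()) / n_central
    return {
        "share_1": counts[0] / total,
        "share_10": counts[9] / total,
        "gauss_mean": mean,
        "gauss_std": sqrt(var),
    }

# Made-up vote counts for ratings 1..10, for illustration only.
example = [50, 10, 20, 40, 80, 150, 300, 400, 250, 200]
result = decompose_votes(example)
```

The two spike shares are taken relative to all votes, while the Gaussian moments use only the central votes, mirroring the three components described in the caption.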
Fig 2.
Principal component analysis of series ratings.
Principal component analysis of four series-level statistics (mean and standard deviation of the Gaussian ratings, percentage of 1-star ratings, and percentage of 10-star ratings) reveals a quality–polarization space that separates high-rated, controversial, and consensus series.
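A minimal sketch of such an analysis is shown below. It assumes the four statistics are standardized before projection and uses NumPy's singular value decomposition. The helper name `pca` and the input rows are hypothetical and are not data from the study.

```python
import numpy as np

def pca(features):
    """Return component scores, component axes, and explained-variance ratios.

    Assumption: each column is z-scored first so that no single statistic
    dominates because of its scale.
    """
    X = np.asarray(features, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes; singular values come out in descending order.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * S
    explained = S**2 / np.sum(S**2)
    return scores, Vt, explained

# Hypothetical rows: [gauss_mean, gauss_std, pct_1, pct_10], one per series.
series_stats = [
    [7.9, 1.2, 1.5, 22.0],
    [7.4, 1.5, 3.0, 15.0],
    [8.1, 1.1, 1.0, 30.0],
    [7.0, 1.6, 4.5, 12.0],
    [7.7, 1.3, 2.0, 18.0],
]
scores, components, explained = pca(series_stats)
```

Plotting the first two columns of `scores` would give a two-dimensional map of the series comparable in spirit to the one in the figure.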
Fig 3.
Task type proportions by series.
The frequency of each task type remains relatively unchanged over time. No consistent trend is observed across the 18 series, underscoring the show’s adherence to a stable format.
Fig 4.
Task skill intensity profiles.
Spider plots of four polarized Taskmaster tasks based on GPT-4o ratings along eight skill dimensions. Each task shows a distinct cognitive or physical emphasis, illustrating the diversity of the design space. These plots are for illustrative purposes only and were not included in the statistical models.
Fig 5.
Geographic origins of Taskmaster contestants.
A heatmap showing the birthplaces of all 90 Taskmaster contestants, based on manually collected data and geocoded city-level coordinates. The strongest concentration appears in London (29 contestants), with notable secondary clusters across the UK, Ireland, and a small number of international origins. Map created using Natural Earth public domain data and Cartopy/GeoPandas.
Fig 6.
Task type distribution across series.
Each of the 917 competitive tasks was labeled by activity type (Creative or Physical) and judgment type (Objective or Subjective). Physical–Objective tasks were most common overall. The proportions of all four task categories remained stable over the show’s 18 series, suggesting a consistent and deliberate design philosophy.
Fig 7.
Temporal patterns of episode ratings.
Episode-level IMDb ratings tend to rise within each series. A majority of series follow either a Rising or J-shaped pattern, with final episodes significantly outperforming first episodes on average (p < 0.001). These consistent patterns suggest an emergent narrative across series.
Fig 8.
Performance archetypes across series.
Contestant scoring trajectories were clustered into five recurring archetypes based on their task-by-task performance patterns. Each series features one representative of each type. The most common winning profiles are the Steady Performer (top-left), characterized by consistent scoring, and the Late Bloomer (top-right), who improves substantially over time. Other archetypes include the Early Star (bottom-left), who starts strong but declines in later episodes; the Chaotic Wildcard (bottom-right), marked by erratic and unpredictable performance; and the Consistent Middle (center), who maintains a moderate rank throughout without major highs or lows.
Fig 9.
Geometry of task scoring patterns.
Each point represents a unique five-contestant scoring pattern, positioned by its mean, variance, and skewness. Black circles indicate patterns used in the show. Used patterns cluster in a mid-range region, suggesting a preference for scoring dynamics that promote moderate competitiveness and differentiation.
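The space of possible patterns can be enumerated directly. The sketch below assumes that each of the five contestants receives an integer score from 0 to 5, that ties are allowed, and that the order of contestants is ignored. These are illustrative assumptions, not necessarily the definition used for the figure. It computes the population mean, variance, and skewness of each pattern, and `pattern_moments` is a hypothetical helper name.

```python
from itertools import combinations_with_replacement

def pattern_moments(pattern):
    """Return (mean, population variance, skewness) of one scoring pattern."""
    n = len(pattern)
    mean = sum(pattern) / n
    var = sum((x - mean) ** 2 for x in pattern) / n
    if var == 0:
        # Skewness is undefined when all scores are equal; use 0.0 by convention.
        skew = 0.0
    else:
        skew = sum((x - mean) ** 3 for x in pattern) / n / var ** 1.5
    return mean, var, skew

# Assumption: scores range over 0..5 and patterns are unordered multisets.
patterns = list(combinations_with_replacement(range(6), 5))
moments = {p: pattern_moments(p) for p in patterns}
```

Under these assumptions, the classic 1-2-3-4-5 allocation sits at mean 3, variance 2, and skewness 0, and the patterns actually used on the show could be marked by looking up their tuples in `moments`.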
Fig 10.
Sentiment trends across 18 series.
Among seven analyzed sentiment categories, only awkwardness shows a significant increase over time. The upward trend suggests a shift in comedic tone toward more discomfort-based humor, consistent with broader trends in contemporary media.
Fig 11.
Predicting episode ratings using machine learning.
A Random Forest model predicts IMDb episode ratings using 45 features derived from contestant demographics, task attributes, and emotional tone. Cross-validation was performed by holding out entire series to prevent overfitting to seasonal tone or cast-specific patterns. The strongest predictors were contestant age, prior experience, and mean awkwardness. All major features showed positive associations with ratings, except awkwardness, which was negatively correlated. These three features together accounted for over 88% of total model importance.
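The series-level hold-out scheme can be sketched as follows. This is only an outline of the splitting logic in plain Python, with a hypothetical helper `leave_one_series_out` and made-up series labels. It does not reproduce the model or the 45 features. In practice a regressor would be fit on the training indices of each split and evaluated on the held-out series, and scikit-learn's `LeaveOneGroupOut` provides comparable splitting.

```python
def leave_one_series_out(series_ids):
    """Yield (held_out_series, train_indices, test_indices) for each series.

    All episodes from the held-out series go to the test set, so the model
    never sees any episode of the series it is evaluated on.
    """
    for held_out in sorted(set(series_ids)):
        train = [i for i, s in enumerate(series_ids) if s != held_out]
        test = [i for i, s in enumerate(series_ids) if s == held_out]
        yield held_out, train, test

# Made-up example: series label for each of eight episodes.
episode_series = [1, 1, 1, 2, 2, 3, 3, 3]
splits = list(leave_one_series_out(episode_series))
```

Splitting by series rather than by episode keeps cast-specific signal out of the training folds, which is the leakage the caption describes guarding against.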
Fig 12.
Relative feature importances from the Random Forest model, sorted from most to least predictive. Contestant-level features, particularly age, awkwardness, and experience, dominate. Task-related and scoring dynamics variables have minimal influence on predicted ratings.