est. 2025 · double-blind · peer-reviewed*

JokeBench

The definitive scientific leaderboard of algorithmic humor.

[Image: Scientist evaluating a jester's joke with rigorous bewilderment]
The Crisis

The artificial intelligence community is facing a critical measurement crisis. As large language models grow increasingly sophisticated, they are rapidly saturating our most rigorous cognitive evaluations. Models now routinely ace the bar exam, conquer MMLU, saturate SWE-bench, and are fast approaching the theoretical limits of Humanity's Last Exam.

We are running out of metrics. To accurately map the future trajectory of machine intelligence, we require a benchmark so demanding that it pushes the absolute boundaries of emergent reasoning.

[Image: AI models ranked on a benchmark leaderboard podium]

We are proud to introduce

JokeBench

The Frontier

If there is one final, unconquered frontier of artificial general intelligence, it is the ability to generate a joke that is actually funny.

[Image: Scientist perplexed by jester's flying pratfall punchline]
The Methodology

Operating on the cutting edge of evaluation methodology, JokeBench employs a rigorous, double-blind A/B testing framework. You will be presented with two responses to a single comedic prompt, generated by anonymous state-of-the-art models that have survived our strict originality screening.

Your task as an adjudicator is to perform a highly calibrated qualitative analysis: determining which output makes you exhale slightly harder through your nose.

Your vital contributions will be synthesized into a Bradley-Terry ranking model, complete with bootstrap confidence intervals, yielding the definitive scientific leaderboard of algorithmic humor.
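For the statistically curious, the aggregation step above can be sketched in a few dozen lines. This is a minimal illustration, not JokeBench's actual pipeline: the vote format, model names, and iteration counts are hypothetical assumptions, and the Bradley-Terry fit uses the classic minorization-maximization update with a naive percentile bootstrap.

```python
import random

def fit_bradley_terry(wins, models, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[(a, b)] = number of votes where model a beat model b.
    Returns strengths normalized to sum to 1 (higher = funnier, allegedly).
    """
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for a in models:
            total_wins = sum(wins.get((a, b), 0) for b in models if b != a)
            denom = 0.0
            for b in models:
                if b == a:
                    continue
                n_ab = wins.get((a, b), 0) + wins.get((b, a), 0)
                if n_ab:
                    denom += n_ab / (strength[a] + strength[b])
            new[a] = total_wins / denom if denom else strength[a]
        total = sum(new.values())
        strength = {m: s / total for m, s in new.items()}
    return strength

def bootstrap_ci(votes, models, n_boot=200, alpha=0.05):
    """Percentile bootstrap CIs: resample individual votes with replacement,
    refit the model each time, and take empirical quantiles per model.

    votes: list of (winner, loser) pairs.
    """
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        resampled = random.choices(votes, k=len(votes))
        wins = {}
        for w, l in resampled:
            wins[(w, l)] = wins.get((w, l), 0) + 1
        fitted = fit_bradley_terry(wins, models)
        for m in models:
            samples[m].append(fitted[m])
    ci = {}
    for m in models:
        ordered = sorted(samples[m])
        lo = ordered[int(alpha / 2 * n_boot)]
        hi = ordered[int((1 - alpha / 2) * n_boot) - 1]
        ci[m] = (lo, hi)
    return ci
```

In practice one vote becomes one `(winner, loser)` tuple, so a model that reliably wins the nose-exhale contest climbs the leaderboard with a suitably narrow confidence interval.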

[Image: Scientists at work in the JokeBench evaluation laboratory]

Join us in advancing the science of machine intelligence. Your votes will help us discover which multi-billion-parameter neural network is the least terrible at stand-up comedy.

Start voting