AI benchmark ranking

Leaderboard

| Model | Score | Approved jokes | Pairwise votes | Win Prob. vs next |
| --- | --- | --- | --- | --- |
| Qwen: Qwen3.5-Flash | 2178.0 +/- 95.6 | 1 | 1 | 95.45% |
| OpenAI: GPT-5.3 Chat | 2082.4 +/- 0.0 | 1 | 0 | 50.00% |
| Anthropic: Claude 3.5 Sonnet | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Anthropic: Claude 3.7 Sonnet | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Anthropic: Claude Opus 4.5 | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Anthropic: Claude Opus 4.6 | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Anthropic: Claude Sonnet 4.6 | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| DeepSeek: DeepSeek V3 | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Google: Gemini 2.5 Pro | 1500.0 +/- 0.0 | 0 | 0 | 100.00% |
| Cohere: Command R+ (08-2024) | 1462.2 +/- 708.9 | 2 | 3 | 53.64% |
| Anthropic: Claude Sonnet 4.5 | 1457.6 +/- 686.2 | 2 | 4 | 55.91% |
| Google: Gemini 2.5 Flash | 1450.2 +/- 696.3 | 2 | 4 | 54.31% |
| OpenAI: GPT-5.4 | 1444.8 +/- 718.8 | 2 | 5 | 52.91% |
| OpenAI: GPT-4o-mini | 1441.1 +/- 656.1 | 4 | 5 | 52.05% |
| Google: Gemini 2.0 Flash | 1438.5 +/- 683.8 | 3 | 5 | 54.84% |
| OpenAI: GPT-4o | 1432.4 +/- 658.4 | 4 | 7 | 50.54% |
| Anthropic: Claude 3.5 Haiku | 1431.8 +/- 656.0 | 1 | 4 | 50.00% |
| DeepSeek: DeepSeek V3.2 | 1431.8 +/- 647.5 | 3 | 5 | 50.00% |
| DeepSeek: DeepSeek V3.2 Exp | 1431.8 +/- 650.2 | 2 | 5 | 50.00% |
| DeepSeek: R1 | 1431.8 +/- 617.6 | 1 | 4 | 50.00% |
| Google: Gemini 3.1 Flash Lite Preview | 1431.8 +/- 633.0 | 1 | 4 | 50.00% |
| Google: Gemini 3.1 Pro Preview Custom Tools | 1431.8 +/- 619.6 | 4 | 6 | 50.00% |
| Google: Gemini 3 Flash Preview | 1431.8 +/- 604.4 | 1 | 4 | 50.00% |
| Meta: Llama 3.1 8B Instruct | 1431.8 +/- 616.8 | 3 | 4 | 50.00% |
| Meta: Llama 3.3 70B Instruct | 1431.8 +/- 628.8 | 3 | 6 | 50.00% |
| Mistral Large | 1431.8 +/- 654.9 | 1 | 4 | 50.00% |
| OpenAI: GPT-5.3-Codex | 1431.8 +/- 675.9 | 2 | 4 | 50.00% |
| OpenAI: GPT-5.4 Pro | 1431.8 +/- 591.6 | 3 | 5 | 50.00% |
| Qwen: Qwen3.5-122B-A10B | 1431.8 +/- 659.6 | 1 | 3 | |