| Qwen: Qwen3.5-Flash | 2178.0 +/- 95.6 | 1 | 1 | 95.45% |
| OpenAI: GPT-5.3 Chat | 2082.4 +/- 0.0 | 1 | 0 | 50.00% |
| Anthropic: Claude 3.5 Sonnet | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Anthropic: Claude 3.7 Sonnet | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Anthropic: Claude Opus 4.5 | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Anthropic: Claude Opus 4.6 | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Anthropic: Claude Sonnet 4.6 | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| DeepSeek: DeepSeek V3 | 1500.0 +/- 0.0 | 0 | 0 | 50.00% |
| Google: Gemini 2.5 Pro | 1500.0 +/- 0.0 | 0 | 0 | 100.00% |
| Cohere: Command R+ (08-2024) | 1462.2 +/- 708.9 | 2 | 3 | 53.64% |
| Anthropic: Claude Sonnet 4.5 | 1457.6 +/- 686.2 | 2 | 4 | 55.91% |
| Google: Gemini 2.5 Flash | 1450.2 +/- 696.3 | 2 | 4 | 54.31% |
| OpenAI: GPT-5.4 | 1444.8 +/- 718.8 | 2 | 5 | 52.91% |
| OpenAI: GPT-4o-mini | 1441.1 +/- 656.1 | 4 | 5 | 52.05% |
| Google: Gemini 2.0 Flash | 1438.5 +/- 683.8 | 3 | 5 | 54.84% |
| OpenAI: GPT-4o | 1432.4 +/- 658.4 | 4 | 7 | 50.54% |
| Anthropic: Claude 3.5 Haiku | 1431.8 +/- 656.0 | 1 | 4 | 50.00% |
| DeepSeek: DeepSeek V3.2 | 1431.8 +/- 647.5 | 3 | 5 | 50.00% |
| DeepSeek: DeepSeek V3.2 Exp | 1431.8 +/- 650.2 | 2 | 5 | 50.00% |
| DeepSeek: R1 | 1431.8 +/- 617.6 | 1 | 4 | 50.00% |
| Google: Gemini 3.1 Flash Lite Preview | 1431.8 +/- 633.0 | 1 | 4 | 50.00% |
| Google: Gemini 3.1 Pro Preview Custom Tools | 1431.8 +/- 619.6 | 4 | 6 | 50.00% |
| Google: Gemini 3 Flash Preview | 1431.8 +/- 604.4 | 1 | 4 | 50.00% |
| Meta: Llama 3.1 8B Instruct | 1431.8 +/- 616.8 | 3 | 4 | 50.00% |
| Meta: Llama 3.3 70B Instruct | 1431.8 +/- 628.8 | 3 | 6 | 50.00% |
| Mistral Large | 1431.8 +/- 654.9 | 1 | 4 | 50.00% |
| OpenAI: GPT-5.3-Codex | 1431.8 +/- 675.9 | 2 | 4 | 50.00% |
| OpenAI: GPT-5.4 Pro | 1431.8 +/- 591.6 | 3 | 5 | 50.00% |
| Qwen: Qwen3.5-122B-A10B | 1431.8 +/- 659.6 | 1 | 3 | |