Comprehensive benchmarks and performance metrics for the latest language models. Make informed decisions with current benchmark data and detailed comparisons.
Compare performance metrics across leading AI models (2025 data)
OpenAI
Leading language understanding with 90.2% MMLU score
Anthropic
Top coding performance with 62.3% SWE-Bench score
Google
Exceptional reasoning with 84% GPQA and 92% AIME scores
Meta
Industry-leading 10M token context with MoE architecture
Real benchmark scores and API pricing (2025)
| Model | Provider | MMLU | Coding | Reasoning | Math | Context | Price per 1M tokens |
|---|---|---|---|---|---|---|---|
| GPT-4.1 | OpenAI | 90.2% | 54.6% | 66.3% | 48.1% | 1M tokens | In: $3-12 / Out: $3-12 |
| Claude 3.7 Sonnet | Anthropic | 86% | 62.3% | 78.2% | 61.3% | 200K tokens | In: $3 / Out: $15 |
| Gemini 2.5 Pro | Google | 85.8% | 63.8% | 84% | 92% | 1M+ tokens | In: $1.25-2.50 / Out: $10-15 |
| Llama 4 Scout | Meta | 88% | 65% | 80% | 58% | 10M tokens | In: $0.50 / Out: $2 |
Data sources: MMLU (Language Understanding), SWE-Bench (Coding), GPQA (Reasoning), AIME 2024 (Math). Pricing is in USD per 1M tokens as of 2025. Benchmark scores may vary with test conditions.
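The price column quotes separate input and output rates per 1M tokens, so a request's cost is roughly `input_tokens / 1,000,000 × input rate + output_tokens / 1,000,000 × output rate`. Below is a minimal sketch of that arithmetic in Python, assuming the rates from the table above; the model keys and the `estimate_cost` helper are illustrative, a single point is picked where the table shows a range, and actual provider billing may differ (tiered context pricing, caching, and batch discounts are not modeled).

```python
# Rough per-request cost estimator using the per-1M-token rates from the table above.
# Rates are illustrative 2025 snapshots; where the table shows a range, one point is
# assumed (noted inline). Real bills depend on tier, caching, and batch discounts.

PRICING_PER_1M = {
    # model key: (input USD per 1M tokens, output USD per 1M tokens)
    "gpt-4.1": (3.00, 12.00),         # assumption: low end in, high end out of the $3-12 range
    "claude-3.7-sonnet": (3.00, 15.00),
    "gemini-2.5-pro": (1.25, 10.00),  # assumption: lower tier of the listed range
    "llama-4-scout": (0.50, 2.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request for the given token counts."""
    in_rate, out_rate = PRICING_PER_1M[model]
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

if __name__ == "__main__":
    # Example: a 20K-token prompt producing a 2K-token completion.
    for model in PRICING_PER_1M:
        print(f"{model:>18}: ${estimate_cost(model, 20_000, 2_000):.4f}")
```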