AI ModelBench

AI Model Testing & Comparison:
Evaluating Performance Across Leading LLMs

Comprehensive benchmarks and performance metrics for the latest language models. Make informed decisions with real-time data and detailed comparisons.

Live Benchmarks
12+ Models Tracked

Featured Models

Compare performance metrics across leading AI models (2025 data)

GPT-4.1

OpenAI

Active

Leading language understanding with 90.2% MMLU score

Speed: 88%
Accuracy: 90.2%
Cost Efficiency: 75%
Context Window: 1M tokens
Text & Vision
Function Calling
JSON Mode

Claude 3.7 Sonnet

Anthropic

Active

Strong coding performance with 62.3% SWE-Bench score

Speed: 85%
Accuracy: 86%
Cost Efficiency: 90%
Context Window: 200K tokens
Text & Vision
Extended Context
Deep Thinking

Gemini 2.5 Pro

Google

Active

Exceptional reasoning with 84% GPQA and 92% AIME scores

Speed: 90%
Accuracy: 85.8%
Cost Efficiency: 62%
Context Window: 1M+ tokens
Multimodal
Long Context
Real-time Data

Llama 4 Scout

Meta

Active

Industry-leading 10M-token context window with a mixture-of-experts (MoE) architecture

Speed: 92%
Accuracy: 88%
Cost Efficiency: 25%
Context Window: 10M tokens
Open Source
Multimodal
Video Processing
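
The card metrics above (Speed, Accuracy, Cost Efficiency, Context Window, capability tags) can be treated as a small structured dataset. The TypeScript sketch below shows one possible shape for that data plus a simple weighted ranking helper; the interface, field names, and weights are illustrative assumptions, not this site's actual schema.

```typescript
// Hypothetical shape of one featured-model card; field names are
// illustrative, not this page's actual data model.
interface ModelCard {
  name: string;
  provider: string;
  speed: number;          // card's Speed score, in percent
  accuracy: number;       // card's Accuracy score, in percent
  costEfficiency: number; // card's Cost Efficiency score, in percent
  contextWindow: string;  // e.g. "1M tokens"
  capabilities: string[]; // capability tags shown on the card
}

const featuredModels: ModelCard[] = [
  { name: "GPT-4.1", provider: "OpenAI", speed: 88, accuracy: 90.2, costEfficiency: 75,
    contextWindow: "1M tokens", capabilities: ["Text & Vision", "Function Calling", "JSON Mode"] },
  { name: "Claude 3.7 Sonnet", provider: "Anthropic", speed: 85, accuracy: 86, costEfficiency: 90,
    contextWindow: "200K tokens", capabilities: ["Text & Vision", "Extended Context", "Deep Thinking"] },
  { name: "Gemini 2.5 Pro", provider: "Google", speed: 90, accuracy: 85.8, costEfficiency: 62,
    contextWindow: "1M+ tokens", capabilities: ["Multimodal", "Long Context", "Real-time Data"] },
  { name: "Llama 4 Scout", provider: "Meta", speed: 92, accuracy: 88, costEfficiency: 25,
    contextWindow: "10M tokens", capabilities: ["Open Source", "Multimodal", "Video Processing"] },
];

// Rank models by a weighted average of the three card metrics.
// The weights are arbitrary placeholders; adjust them to your own priorities.
function rankModels(models: ModelCard[], wSpeed = 0.3, wAccuracy = 0.5, wCost = 0.2): ModelCard[] {
  const score = (m: ModelCard) =>
    wSpeed * m.speed + wAccuracy * m.accuracy + wCost * m.costEfficiency;
  return [...models].sort((a, b) => score(b) - score(a));
}

console.log(rankModels(featuredModels).map((m) => m.name));
```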

Detailed Comparison

Real benchmark scores and API pricing (2025)

Model             | Provider  | MMLU  | Coding | Reasoning | Math  | Context     | Price per 1M tokens
GPT-4.1           | OpenAI    | 90.2% | 54.6%  | 66.3%     | 48.1% | 1M tokens   | In: $3-12 / Out: $3-12
Claude 3.7 Sonnet | Anthropic | 86%   | 62.3%  | 78.2%     | 61.3% | 200K tokens | In: $3 / Out: $15
Gemini 2.5 Pro    | Google    | 85.8% | 63.8%  | 84%       | 92%   | 1M+ tokens  | In: $1.25-2.50 / Out: $10-15
Llama 4 Scout     | Meta      | 88%   | 65%    | 80%       | 58%   | 10M tokens  | In: $0.50 / Out: $2

Data sources: MMLU (Language Understanding), SWE-Bench (Coding), GPQA (Reasoning), AIME 2024 (Math). Pricing per 1M tokens as of 2025. Benchmarks may vary based on test conditions.
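
To see how the per-1M-token prices translate into real spend, the TypeScript sketch below works through a per-request cost: tokens divided by one million, multiplied by the listed input and output rates. The type and function names are illustrative; for the models with price ranges (GPT-4.1, Gemini 2.5 Pro) substitute the tier you actually use.

```typescript
// Per-1M-token pricing for a model; values are USD taken from the
// comparison table above (single-price rows only).
interface Pricing {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Cost of one request = (input tokens / 1M) * input rate + (output tokens / 1M) * output rate.
function requestCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * p.inputPerMillion
       + (outputTokens / 1_000_000) * p.outputPerMillion;
}

// Worked example: a 10K-token prompt with a 2K-token completion.
const claude37Sonnet: Pricing = { inputPerMillion: 3, outputPerMillion: 15 };
const llama4Scout: Pricing = { inputPerMillion: 0.5, outputPerMillion: 2 };

console.log(requestCost(claude37Sonnet, 10_000, 2_000).toFixed(4)); // "0.0600" -> $0.06 per request
console.log(requestCost(llama4Scout, 10_000, 2_000).toFixed(4));    // "0.0090" -> under a cent
```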
