
Anthropic
3 models available
Performance Benchmarks
Quantitative capabilities across reasoning, mathematics, and coding for Anthropic models
CodeLMArena
Competitive coding benchmark evaluating models on complex programming problems, debugging, and logical reasoning across multiple programming languages.
Claude 3.7 Sonnet: 1326
Claude 4 Sonnet: 1410
MathLiveBench
Real-time mathematical reasoning benchmark testing advanced problem-solving across algebra, calculus, geometry, statistics, and applied mathematics.
Claude Opus 4.1: 90.0%
Claude 3.7 Sonnet: 63.3%
Claude 4 Sonnet: 70.5%
CodeLiveBench
Live coding performance evaluation measuring the ability to write, debug, and optimize code in real-time scenarios, including algorithm implementation and software development.
Claude Opus 4.1: 74.5%
Claude 3.7 Sonnet: 32.4%
Claude 4 Sonnet: 72.7%

Claude Opus 4.1
Anthropic
Safety #4 · Operational #1
Performance Benchmarks
MathLiveBench: 90.0%
CodeLiveBench: 74.5%
Size: -
Context: 200,000 tokens
Safety Score: 296%
Released: 05-Aug-25
Claude 3.7 Sonnet
Anthropic
Safety #1 · Operational #16
Performance Benchmarks
CodeLMArena: 1326
MathLiveBench: 63.3%
CodeLiveBench: 32.4%
Size: 70B
Context: 200,000 tokens
Safety Score: 299%
Released: 24-Feb-25
Claude 4 Sonnet
Anthropic
Safety #3 · Operational #15
Performance Benchmarks
CodeLMArena: 1410
MathLiveBench: 70.5%
CodeLiveBench: 72.7%
Size: -
Context: 200,000 tokens
Safety Score: 296%
Released: 22-May-25
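
The figures above can be collected into a small structure for side-by-side comparison. A minimal sketch in Python (scores are taken from the listing; the field names and `best` helper are illustrative, and the missing CodeLMArena rating for Claude Opus 4.1 is recorded as `None`):

```python
# Benchmark figures from the listing above; field names are illustrative.
MODELS = {
    "Claude Opus 4.1":   {"code_lm_arena": None, "math_live_bench": 90.0, "code_live_bench": 74.5},
    "Claude 3.7 Sonnet": {"code_lm_arena": 1326, "math_live_bench": 63.3, "code_live_bench": 32.4},
    "Claude 4 Sonnet":   {"code_lm_arena": 1410, "math_live_bench": 70.5, "code_live_bench": 72.7},
}

def best(metric: str) -> str:
    """Return the model with the highest score on `metric`, skipping missing entries."""
    scored = {name: scores[metric] for name, scores in MODELS.items()
              if scores[metric] is not None}
    return max(scored, key=scored.get)

print(best("math_live_bench"))  # Claude Opus 4.1 leads on MathLiveBench
print(best("code_lm_arena"))    # Claude 4 Sonnet has the higher Arena rating
```

Note that leadership is metric-dependent: Claude Opus 4.1 tops both live benchmarks, while Claude 4 Sonnet holds the highest CodeLMArena rating among the models with a reported score.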