
Anthropic
4 models available
Performance Benchmarks
Quantitative capabilities across reasoning, mathematics, and coding for Anthropic models
CodeLMArena
Competitive coding benchmark evaluating models on complex programming problems, debugging, and logical reasoning across multiple programming languages.
Claude 3.7 Sonnet: 1326
Claude Sonnet 4.5: 1420
Claude 4 Sonnet: 1410
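CodeLMArena scores are arena-style ratings rather than percentages. Assuming they follow the standard Elo convention (an assumption based on the score range; the listing does not specify the rating system), a rating gap maps to an expected head-to-head win rate, as in this minimal sketch:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Illustrative only, assuming CodeLMArena uses Elo-style ratings.
# Claude Sonnet 4.5 (1420) vs Claude 3.7 Sonnet (1326), per the list above:
print(f"{elo_expected_score(1420, 1326):.3f}")  # ≈ 0.632
```

Under that assumption, the 94-point gap between Claude Sonnet 4.5 and Claude 3.7 Sonnet corresponds to roughly a 63% expected win rate in head-to-head comparisons.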
MathLiveBench
Real-time mathematical reasoning benchmark testing advanced problem-solving across algebra, calculus, geometry, statistics, and applied mathematics.
Claude Opus 4.1: 90.0%
Claude 3.7 Sonnet: 63.3%
Claude Sonnet 4.5: 75.0%
Claude 4 Sonnet: 70.5%
CodeLiveBench
Live coding evaluation measuring the ability to write, debug, and optimize code in real-time scenarios, including algorithm implementation and software development.
Claude Opus 4.1: 74.5%
Claude 3.7 Sonnet: 73.2%
Claude Sonnet 4.5: 77.2%
Claude 4 Sonnet: 64.8%
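Because MathLiveBench and CodeLiveBench both report percentage accuracies, they can be compared on the same scale. A minimal sketch combining the two via an unweighted mean (an illustrative choice, not an official composite metric; scores are taken from the listings above):

```python
# Percent scores from the listings above (the Claude 4 Sonnet figures
# appear in its model card below).
math_live = {
    "Claude Opus 4.1": 90.0,
    "Claude 3.7 Sonnet": 63.3,
    "Claude Sonnet 4.5": 75.0,
    "Claude 4 Sonnet": 70.5,
}
code_live = {
    "Claude Opus 4.1": 74.5,
    "Claude 3.7 Sonnet": 73.2,
    "Claude Sonnet 4.5": 77.2,
    "Claude 4 Sonnet": 64.8,
}

# Unweighted mean of the two live benchmarks, highest first.
combined = {m: (math_live[m] + code_live[m]) / 2 for m in math_live}
for model, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}%")
```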

Claude Opus 4.1
Anthropic
Safety #5 · Operational #1
Performance Benchmarks
MathLiveBench: 90.0%
CodeLiveBench: 74.5%
Size: -
Context: 200K
Safety Score: 296%
Released: 05-Aug-25
Claude 3.7 Sonnet
Anthropic
Safety #2 · Operational #16
Performance Benchmarks
CodeLMArena: 1326
MathLiveBench: 63.3%
CodeLiveBench: 73.2%
Size: 70B
Context: 200K
Safety Score: 299%
Released: 24-Feb-25
Claude Sonnet 4.5
Anthropic
Safety #1 · Operational #3
Performance Benchmarks
CodeLMArena: 1420
MathLiveBench: 75.0%
CodeLiveBench: 77.2%
Size: -
Context: 200K
Safety Score: 299%
Released: 29-Sep-25
Claude 4 Sonnet
Anthropic
Safety #4 · Operational #15
Performance Benchmarks
CodeLMArena: 1410
MathLiveBench: 70.5%
CodeLiveBench: 64.8%
Size: -
Context: 200K
Safety Score: 296%
Released: 22-May-25
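As a closing illustration, the four model cards above can be captured in a small record type. The `ModelCard` class and its field names are invented for this sketch; the values are copied directly from the cards.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelCard:
    """Hypothetical record mirroring the card fields above."""
    name: str
    safety_rank: int
    operational_rank: int
    context: str
    released: str
    code_lm_arena: Optional[int] = None      # arena rating, where listed
    math_live_bench: Optional[float] = None  # percent
    code_live_bench: Optional[float] = None  # percent

CARDS = [
    ModelCard("Claude Opus 4.1", 5, 1, "200K", "05-Aug-25", None, 90.0, 74.5),
    ModelCard("Claude 3.7 Sonnet", 2, 16, "200K", "24-Feb-25", 1326, 63.3, 73.2),
    ModelCard("Claude Sonnet 4.5", 1, 3, "200K", "29-Sep-25", 1420, 75.0, 77.2),
    ModelCard("Claude 4 Sonnet", 4, 15, "200K", "22-May-25", 1410, 70.5, 64.8),
]

# Top scorer on each percent-based benchmark.
print(max(CARDS, key=lambda c: c.math_live_bench).name)  # Claude Opus 4.1
print(max(CARDS, key=lambda c: c.code_live_bench).name)  # Claude Sonnet 4.5
```

Optional fields cover gaps in the source data: Claude Opus 4.1 has no CodeLMArena rating listed, so that field defaults to None rather than a fabricated value.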