Holistic AI LLM Decision Hub

Helping senior leaders make confident, well-informed decisions about their LLM environment.

Holistic AI provides trusted, independent rankings of large language models across performance, red teaming, jailbreaking safety, and real-world usability. Our insights are grounded in rigorous internal red teaming and jailbreaking tests, alongside publicly available benchmarks. This enables CIOs, CTOs, developers, researchers, and organizations to choose the right model faster and with greater confidence.

| Model | Org. | Safety Rank | Safe Responses | Unsafe Responses | Jailbreaking Resistance | Code LMArena | Math LiveBench | GPQA | Code LiveBench | Multimodal Support | Size (Parameters) | Released | Knowledge Cutoff | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Context Length | Latency (TTFT) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Grok 4 | xAI | #19 | 61% | 39% (116 / 300) | 10% (10 / 100) | 1420 | 88.84% | 88.10% | 71.34% | Yes | - | 09-Jul-25 | 2025-07 | $3.00 | $15.00 | 256,000 tokens | ~11s |
| GPT-4.1 | OpenAI | #7 | 93% | 7% (22 / 300) | 83% (83 / 100) | 1385 | 72.00% | 66.30% | 54.60% | Yes | - | 14-Apr-25 | 2024-06-01 | $2.00 | $8.00 | 1,047,576 tokens | ~0.4s |
| GPT-4.5 | OpenAI | #2 | 100% | 0% (1 / 237) | 97% (36 / 37) | 1362 | 69.30% | 69.50% | 76.10% | Yes | - | 27-Feb-25 | 2023-10-01 | $75.00 | $150.00 | 128,000 tokens | ~0.5s |
| GPT-5 | OpenAI | #9 | 91% | 9% (26 / 300) | 75% (75 / 100) | 1450 | 92.77% | 86.00% | 75.31% | Yes | - | 07-Aug-25 | 2024-10-01 | $1.25 | $10.00 | 400,000 tokens | Medium |
| Claude Opus 4.1 | Anthropic | #4 | 99% | 1% (4 / 300) | 97% (97 / 100) | - | 90.0% | 83.3% | 74.5% | Yes | - | 05-Aug-25 | 2025-03-01 | $15.00 | $75.00 | 200,000 tokens | Moderately Fast |
| GPT-4o | OpenAI | #8 | 93% | 7% (20 / 300) | 82% (82 / 100) | 1385 | 41.48% | 46.00% | 77.50% | Yes | - | 20-Nov-24 | 2023-10-01 | $2.50 | $10.00 | 128,000 tokens | ~0.6s |
| Claude 3.7 Sonnet | Anthropic | #1 | 100% | 0% (1 / 300) | 99% (99 / 100) | 1326 | 63.30% | 84.80% | 32.40% | Yes | 70B | 24-Feb-25 | 2024-10-01 | $3.00 | $15.00 | 200,000 tokens | - |
| DeepSeek R1 | DeepSeek | #10 | 89% | 11% (26 / 237) | 32% (12 / 37) | 1380 | 79.50% | 74.24% | 48.50% | No | 671B | 20-Jan-25 | 2024-07-01 | $0.55 | $2.19 | 128,000 tokens | - |
| Gemini 2.5 Pro Preview | Google | #16 | 66% | 34% (102 / 300) | 26% (26 / 100) | 1395 | 84.19% | 86.40% | 73.90% | Yes | - | 05-Jun-25 | 2025-01-31 | $1.25 | $10.00 | 1,048,576 tokens | - |
| Gemini 2.5 Flash Preview 05-20 | Google | #18 | 63% | 37% (111 / 300) | 12% (12 / 100) | 1350 | 84.10% | 82.80% | 63.53% | Yes | - | 17-Apr-25 | 2025-01-31 | $0.15 | $0.60 | 1,048,576 tokens | - |
| Gemini 2.0 Flash | Google | #17 | 66% | 34% (102 / 300) | 10% (10 / 100) | 1310 | 83.33% | 62.10% | 70.70% | Yes | - | 05-Feb-25 | 2024-08-01 | $0.10 | $0.40 | 1,048,576 tokens | - |
| Gemma 2 9B | Google | #12 | 84% | 16% (37 / 237) | 5% (2 / 37) | 1180 | 52.27% | - | 48.94% | No | 9B | 27-Jun-24 | 2024-04-01 | - | - | 8,192 tokens | - |
| Grok-3 | xAI | #20 | 22% | 78% (235 / 300) | 3% (3 / 100) | 1320 | 62.75% | 84.60% | 73.58% | Yes | 300B | 01-May-24 | 2024-11 | $3.00 | $15.00 | 128,000 tokens | - |
| Llama 4 Maverick 128e Instruct | Meta | #14 | 79% | 21% (50 / 237) | 3% (1 / 37) | 1290 | 60.58% | - | 54.19% | No | 17B | 05-Apr-25 | 2024-08-01 | $0.20 | $0.60 | 128,000 tokens | - |
| Llama 3.1 Instant | Meta | #13 | 79% | 21% (49 / 237) | 3% (1 / 37) | 1200 | 51.88% | - | 57.26% | No | 8B | 23-Jul-24 | 2023-12-01 | - | - | 128,000 tokens | - |
| Llama 4 Scout 16e Instruct | Meta | #15 | 78% | 22% (52 / 237) | 3% (1 / 37) | 1250 | 48.14% | - | 42.16% | No | 17B | 05-Apr-25 | 2024-08-01 | - | - | 128,000 tokens | - |
| QWen-qwq-32b | Alibaba | #11 | 87% | 13% (31 / 237) | 32% (12 / 37) | 1340 | 81.14% | - | 67.18% | No | 32B | 12-Oct-24 | 2024-09-01 | - | - | 128,000 tokens | - |
| GPT-OSS-20B | OpenAI | #6 | 96% | 4% (11 / 300) | 89% (89 / 100) | - | 69.54% | 71.5% | - | No | 20B | 05-Aug-25 | - | - | - | 128,000 tokens | - |
| GPT-OSS-120B | OpenAI | #5 | 97% | 3% (8 / 300) | 92% (92 / 100) | - | 69.54% | 80.1% | 58.80% | No | 120B | 05-Aug-25 | - | - | - | 128,000 tokens | - |
| Claude 4 Sonnet | Anthropic | #3 | 99% | 1% (4 / 300) | 97% (97 / 100) | 1410 | 70.5% | 75.40% | 72.7% | Yes | - | 22-May-25 | 2024-04-01 | $3.00 | $15.00 | 200,000 tokens | ~0.5-1s |
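
The percentage columns in the leaderboard appear to be simple ratios of the raw counts shown in parentheses, rounded to the nearest whole percent. The sketch below recomputes them for three rows under that assumption. The helper name `pct`, the rounding rule, and the reading of the jailbreaking count as "attempts resisted" are illustrative assumptions; the source does not document how the figures are derived.

```python
def pct(count: int, total: int) -> int:
    """Return count/total as a whole-number percentage, rounding half up.

    Assumption: the leaderboard rounds to the nearest whole percent.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return int(100 * count / total + 0.5)


# (model, unsafe responses, safety prompts, jailbreaks resisted, jailbreak prompts)
# Counts are copied from the leaderboard rows for these three models.
rows = [
    ("Grok 4", 116, 300, 10, 100),
    ("GPT-4.5", 1, 237, 36, 37),
    ("DeepSeek R1", 26, 237, 12, 37),
]

results = {}
for model, unsafe, n_safety, resisted, n_jailbreak in rows:
    results[model] = {
        "unsafe_pct": pct(unsafe, n_safety),
        "safe_pct": pct(n_safety - unsafe, n_safety),
        "jailbreak_resistance_pct": pct(resisted, n_jailbreak),
    }

for model, values in results.items():
    print(model, values)
```

Note that the models were not all tested on the same number of prompts (300 or 237 safety prompts, 100 or 37 jailbreaking prompts), so a one-response difference moves the percentage more for the smaller test sets.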

📊 Data Source

All comparative insights are based on a combination of rigorous red teaming and jailbreaking testing performed by Holistic AI, as well as publicly available benchmark data. External benchmarks include Code LMArena, Math LiveBench, Code LiveBench, and GPQA. These were sourced from official model provider websites, public leaderboards, benchmark sites, and other accessible resources to ensure transparency, accuracy, and reliability.