LLM Inference Benchmarks

Independent benchmarks for production LLM deployments: throughput, latency, and capacity measured across hardware configurations.

GLM-4.7-Flash
Format: BF16 | Parameters: 30B | Released: 1/19/2026 | Organization: Z.ai | License: MIT | Configs: 1

Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K)
1x H200 SXM | 141 GB | 458 tok/s | 1 - 4 | 1K - 200K | 12 users

Qwen3-Coder-Next
Format: FP8 | Parameters: 80B | Released: 2/3/2026 | Organization: Qwen | License: Apache 2.0 | Configs: 3

Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K)
1x RTX Pro 6000 Blackwell | 96 GB | 306 tok/s | 1 - 4 | 1K - 256K | 5 users
2x RTX Pro 6000 Blackwell | 192 GB | 765 tok/s | 1 - 10 | 1K - 256K | 6 users
1x H200 SXM | 141 GB | 853 tok/s | 1 - 10 | 1K - 256K | 8 users

Devstral-Small-2-24B-Instruct-2512
Format: FP8 | Parameters: 24B | Released: 12/9/2025 | Organization: Mistral AI | License: Apache 2.0 | Configs: 3

Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K)
1x RTX Pro 6000 Blackwell | 96 GB | 102 tok/s | 1 - 3 | 1K - 256K | 5 users
1x H100 SXM | 80 GB | 274 tok/s | 1 - 3 | 1K - 256K | 8 users
1x H200 SXM | 141 GB | 564 tok/s | 1 - 5 | 1K - 256K | 8 users

Qwen3-Coder-30B-A3B-Instruct
Format: FP8 | Parameters: 30B | Released: 7/31/2025 | Organization: Qwen | License: Apache 2.0 | Configs: 3

Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K)
1x RTX Pro 6000 Blackwell | 96 GB | 334 tok/s | 1 - 4 | 1K - 256K | 10 users
1x H100 SXM | 80 GB | 584 tok/s | 1 - 6 | 1K - 192K | 15 users
1x H200 SXM | 141 GB | 600 tok/s | 1 - 6 | 1K - 256K | 17 users

gpt-oss-120b
Format: MXFP4 | Parameters: 117B | Released: 8/5/2025 | Organization: OpenAI | License: Apache 2.0 | Configs: 4

Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K)
1x RTX Pro 6000 Blackwell | 96 GB | 361 tok/s | 1 - 4 | 1K - 128K | 5 users
1x H100 SXM | 80 GB | 511 tok/s | 1 - 4 | 1K - 96K | 7 users
2x RTX Pro 6000 Blackwell | 192 GB | 664 tok/s | 1 - 6 | 1K - 128K | 8 users
1x H200 SXM | 141 GB | 849 tok/s | 1 - 10 | 1K - 128K | 26 users

gpt-oss-20b
Format: MXFP4 | Parameters: 22B | Released: 8/5/2025 | Organization: OpenAI | License: Apache 2.0 | Configs: 3

Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K)
1x RTX Pro 6000 Blackwell | 96 GB | 642 tok/s | 1 - 5 | 1K - 128K | 8 users
1x H100 SXM | 80 GB | 2,168 tok/s | 1 - 15 | 1K - 128K | 62 users
1x H200 SXM | 141 GB | 2,471 tok/s | 1 - 18 | 1K - 128K | 55 users

Methodology

How We Test

Each configuration runs 50+ test scenarios across context lengths and concurrency levels, up to the limits of the model and hardware. We measure standard metrics such as throughput and latency, and we run capacity tests that increase concurrent requests until performance thresholds are exceeded.
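To make the capacity test concrete, here is a minimal sketch of the ramp loop, assuming an OpenAI-compatible HTTP endpoint. The endpoint URL, latency budget, payload, and function names are illustrative placeholders, not our actual harness.

```python
# Minimal capacity-ramp sketch: raise concurrency until a latency
# threshold is exceeded. All constants below are illustrative.
import asyncio
import time

import httpx

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed OpenAI-compatible server
LATENCY_BUDGET_S = 2.0  # assumed per-request threshold


async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    """Send one completion request and return its wall-clock latency."""
    start = time.perf_counter()
    await client.post(ENDPOINT, json={"prompt": prompt, "max_tokens": 256})
    return time.perf_counter() - start


async def capacity_ramp(prompt: str, max_concurrency: int = 32) -> int:
    """Increase concurrency one level at a time; return the last level
    at which every request stayed within the latency budget."""
    supported = 0
    async with httpx.AsyncClient(timeout=120) as client:
        for concurrency in range(1, max_concurrency + 1):
            latencies = await asyncio.gather(
                *(one_request(client, prompt) for _ in range(concurrency))
            )
            if max(latencies) > LATENCY_BUDGET_S:
                break  # threshold exceeded: the previous level is the capacity
            supported = concurrency
    return supported


if __name__ == "__main__":
    print(asyncio.run(capacity_ramp("Hello " * 100)))
```

In practice a threshold would more likely be defined on time-to-first-token or inter-token latency than on whole-request time, but the ramp structure is the same.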

Prompts use calibrated token counts. No prompt caching. No speculative decoding unless specifically noted.
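Calibrated token counts mean each prompt is built to hit an exact token length under the target model's tokenizer. Below is a sketch of one way to do that, assuming a Hugging Face tokenizer; the model ID and filler text are examples, not a description of our calibration code.

```python
# Illustrative only: build a prompt that tokenizes to an exact length.
from transformers import AutoTokenizer

# Any benchmarked model's tokenizer works; this ID is just an example.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")


def calibrated_prompt(target_tokens: int, filler: str = " lorem") -> str:
    """Overshoot with filler text, then truncate at the token level."""
    ids = tokenizer(filler * target_tokens, add_special_tokens=False)["input_ids"]
    # Decoding and re-tokenizing can drift by a few tokens for some
    # tokenizers, so a real harness would verify and trim the result.
    return tokenizer.decode(ids[:target_tokens])


prompt_32k = calibrated_prompt(32_000)
```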

Read Full Methodology →

Not Sure What You Need?

We work with teams to figure out the right model + hardware combination for your throughput, latency, and budget requirements.

Get a Recommendation