LLM Inference Benchmarks

Independent benchmarks for production deployments. Throughput, latency, and capacity tested across hardware configurations.

Filters

Format: NVFP4, MXFP4, FP8, BF16
License: NVIDIA, Apache 2.0, Modified MIT, MIT
Organization: Z.ai, MiniMax, Qwen, OpenAI, Mistral AI
Hardware: 8x RTX Pro 6000 Blackwell, 4x RTX Pro 6000 Blackwell, 4x H200 SXM, 2x RTX Pro 6000 Blackwell, 1x RTX Pro 6000 Blackwell, 1x H200 SXM, 1x H100 SXM
| Model | Format | Parameters | Released | Organization | License | Configs |
|---|---|---|---|---|---|---|
| gpt-oss-20b | MXFP4 | 22B | Aug 5, 2025 | OpenAI | Apache 2.0 | 3 |

Configs tested: 1x RTX Pro 6000 Blackwell, 1x H100 SXM, 1x H200 SXM
| Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K) | Full Report |
|---|---|---|---|---|---|---|
| 1x RTX Pro 6000 Blackwell | 96 GB | 642 tok/s | 1-5 | 1K-128K | 8 users | View |
| 1x H100 SXM | 80 GB | 2,168 tok/s | 1-15 | 1K-128K | 62 users | View |
| 1x H200 SXM | 141 GB | 2,471 tok/s | 1-18 | 1K-128K | 55 users | View |
Methodology

How We Test

Each configuration runs 50+ test scenarios across different context lengths and concurrency levels, up to what the model and hardware can support. We measure standard metrics such as throughput and latency, and run capacity tests that increase concurrent requests until performance thresholds are exceeded.
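
For illustration, a minimal sketch of a capacity test of this shape is below. It is not our harness: the `client.stream` interface, the thresholds, and the one-user step size are all assumptions.

```python
import asyncio
import time

# Illustrative thresholds only; real runs define their own limits.
MAX_TTFT_S = 2.0         # max acceptable time to first token
MIN_DECODE_TOK_S = 10.0  # min acceptable per-user decode speed

async def one_request(client, prompt):
    """Stream one completion; return (ttft_seconds, tokens_per_second)."""
    start = time.perf_counter()
    ttft, tokens = None, 0
    async for _ in client.stream(prompt):  # hypothetical streaming client
        tokens += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    elapsed = time.perf_counter() - start
    return ttft if ttft is not None else float("inf"), tokens / elapsed

async def capacity(client, prompt):
    """Raise concurrency one user at a time until a threshold is exceeded."""
    users = 0
    while True:
        users += 1
        results = await asyncio.gather(
            *(one_request(client, prompt) for _ in range(users)))
        if any(ttft > MAX_TTFT_S or tps < MIN_DECODE_TOK_S
               for ttft, tps in results):
            return users - 1  # last level that stayed within thresholds
```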

Prompts use calibrated token counts. No prompt caching. No speculative decoding unless specifically noted.
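
As a sketch of what calibrated token counts can mean in practice, the snippet below builds a prompt that encodes to an exact token count. The tokenizer (tiktoken) and filler text are illustrative assumptions, not our exact procedure.

```python
import tiktoken  # example tokenizer; any tokenizer with encode/decode works

def calibrated_prompt(target_tokens: int, filler: str = "benchmark ") -> str:
    """Build a prompt that encodes to exactly target_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    # Over-generate filler, then truncate at a token boundary.
    token_ids = enc.encode(filler * target_tokens)[:target_tokens]
    prompt = enc.decode(token_ids)
    # A regular filler like this round-trips through decode/encode
    # exactly; arbitrary text may need iterative adjustment.
    assert len(enc.encode(prompt)) == target_tokens
    return prompt

# e.g. calibrated_prompt(32_768) yields a 32K-token prompt
```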

Read Full Methodology →

Not Sure What You Need?

We work with teams to figure out the right model + hardware combination for your throughput, latency, and budget requirements.

Get a Recommendation