Independent benchmarks for production deployments. Throughput, latency, and capacity tested across hardware configurations.
| Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K context) | Full Report |
|---|---|---|---|---|---|---|
| 1x H200 SXM | 141GB | 458 tok/s | 1 - 4 | 1K - 200K | 12 users | View |
| 1x RTX Pro 6000 Blackwell | 96GB | 361 tok/s | 1 - 4 | 1K - 128K | 5 users | View |
| 1x H100 SXM | 80GB | 511 tok/s | 1 - 4 | 1K - 96K | 7 users | View |
| 2x RTX Pro 6000 Blackwell | 192GB | 664 tok/s | 1 - 6 | 1K - 128K | 8 users | View |
| 1x H200 SXM | 141GB | 849 tok/s | 1 - 10 | 1K - 128K | 26 users | View |
Each configuration runs 50+ test scenarios spanning context lengths and concurrency levels up to the limits of the model and hardware. We measure standard metrics such as throughput and latency, and run capacity tests that increase concurrent requests until performance thresholds are exceeded.
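To make the capacity test concrete, here is a minimal sketch in Python that ramps concurrency against an OpenAI-compatible completions endpoint and stops once a latency threshold is exceeded. The endpoint URL, model id, request shape, and threshold are illustrative assumptions, not the exact harness behind these benchmarks.

```python
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1/completions"  # assumed OpenAI-compatible endpoint
MODEL = "example-model"                            # placeholder model id
LATENCY_LIMIT_S = 30.0                             # example per-request latency ceiling


async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    resp = await client.post(
        BASE_URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 256},
        timeout=300.0,
    )
    resp.raise_for_status()
    return time.perf_counter() - start


async def worst_latency_at(concurrency: int, prompt: str) -> float:
    """Fire `concurrency` simultaneous requests and return the slowest latency."""
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(
            *(one_request(client, prompt) for _ in range(concurrency))
        )
    return max(latencies)


async def ramp(prompt: str, max_concurrency: int = 32) -> int:
    """Increase concurrency step by step until the latency threshold is exceeded."""
    supported = 0
    for concurrency in range(1, max_concurrency + 1):
        worst = await worst_latency_at(concurrency, prompt)
        print(f"concurrency={concurrency:>3}  worst latency={worst:6.1f}s")
        if worst > LATENCY_LIMIT_S:
            break
        supported = concurrency
    return supported


if __name__ == "__main__":
    users = asyncio.run(ramp("Summarize the plot of Hamlet in one paragraph."))
    print(f"Concurrent users supported under the threshold: {users}")
```

A full capacity test would also track time-to-first-token and per-user token throughput, but a single end-to-end latency ceiling is enough to show the ramp-until-threshold shape of the test.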
Prompts use calibrated token counts. No prompt caching. No speculative decoding unless specifically noted.
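As an example of prompt calibration, the sketch below builds a prompt that encodes to a target token count using a Hugging Face tokenizer. The tokenizer id and filler text are stand-ins for illustration; the actual prompts and tokenizer used in these reports are not specified here.

```python
from transformers import AutoTokenizer


def calibrated_prompt(tokenizer, target_tokens: int, filler: str = "lorem ipsum ") -> str:
    """Build a prompt whose encoded length matches `target_tokens`.

    Overshoot with repeated filler text, truncate the token ids, then decode.
    Decoding and re-encoding can shift the count by a token or two for some
    tokenizers, so verify and trim if an exact count matters.
    """
    ids = tokenizer.encode(filler * target_tokens, add_special_tokens=False)
    return tokenizer.decode(ids[:target_tokens])


if __name__ == "__main__":
    # "gpt2" is an ungated stand-in tokenizer used only for illustration.
    tok = AutoTokenizer.from_pretrained("gpt2")
    prompt = calibrated_prompt(tok, target_tokens=1024)
    print(len(tok.encode(prompt, add_special_tokens=False)))  # ~1024
```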
We work with your team to find the right model and hardware combination for your throughput, latency, and budget requirements.