LLM Inference Benchmarks

Independent benchmarks for production deployments. Throughput, latency, and capacity tested across hardware configurations.

Model	Format	Parameters	Released	Organization	License	Configs

Step-3.7-Flash	NVFP4	198B	5/28/2026	StepFun	Apache 2.0	2x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
2x RTX Pro 6000 Blackwell	192GB	218tok/s	1 - 5	1K - 256K	—	View

Qwen3.6-35B-A3B-MTP	FP8	35B	4/16/2026	Qwen	Apache 2.0	1x MI300X, 1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x MI300X	192GB	376tok/s	1 - 3	1K - 128K	2 users	View
1x RTX Pro 6000 Blackwell	96GB	621tok/s	1 - 5	1K - 256K	39 users	View

Qwen3.6-35B-A3B	FP8	35B	4/16/2026	Qwen	Apache 2.0	1x MI300X, 1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x MI300X	192GB	256tok/s	1 - 3	1K - 128K	3 users	View
1x RTX Pro 6000 Blackwell	96GB	449tok/s	1 - 5	1K - 256K	41 users	View

Qwen3.6-27B-MTP	FP8	27B	4/22/2026	Qwen	Apache 2.0	1x MI300X, 1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x MI300X	192GB	254tok/s	1 - 3	1K - 128K	—	View
1x RTX Pro 6000 Blackwell	96GB	337tok/s	1 - 5	1K - 256K	5 users	View

Qwen3.6-27B	FP8	27B	4/22/2026	Qwen	Apache 2.0	1x MI300X, 1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x MI300X	192GB	143tok/s	1 - 3	1K - 128K	—	View
1x RTX Pro 6000 Blackwell	96GB	189tok/s	1 - 5	1K - 256K	3 users	View

Gemma-4-26B-A4B	FP8	26B	4/2/2026	Google	Apache 2.0	1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x RTX Pro 6000 Blackwell	96GB	674tok/s	1 - 10	1K - 256K	13 users	View

Gemma-4-31B	FP8	31B	4/2/2026	Google	Apache 2.0	1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x RTX Pro 6000 Blackwell	96GB	129tok/s	1 - 4	1K - 96K	1 user	View

Gemma-4-31B	NVFP4	31B	4/2/2026	Google	Apache 2.0	1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x RTX Pro 6000 Blackwell	96GB	126tok/s	1 - 4	1K - 128K	2 users	View

Mistral-Small-4-119B-2603	NVFP4	119B	3/17/2026	Mistral AI	Apache 2.0	1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x RTX Pro 6000 Blackwell	96GB	262tok/s	1 - 5	1K - 256K	4 users	View

Nemotron-3-Super-120B-A12B	NVFP4	120B	3/11/2026	NVIDIA		1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x RTX Pro 6000 Blackwell	96GB	178tok/s	1 - 5	1K - 512K	6 users	View

Qwen3.5-397B-A17B	FP8	397B	2/17/2026	Qwen	Apache 2.0	8x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
8x RTX Pro 6000 Blackwell	768GB	244tok/s	1 - 5	1K - 256K	4 users	View

Qwen3.5-122B-A10B	FP8	122B	2/24/2026	Qwen	Apache 2.0	2x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
2x RTX Pro 6000 Blackwell	192GB	237tok/s	1 - 5	1K - 256K	7 users	View

Qwen3.5-27B	FP8	27B	2/24/2026	Qwen	Apache 2.0	1x RTX Pro 6000 Blackwell, 1x H100 SXM

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x RTX Pro 6000 Blackwell	96GB	102tok/s	1 - 4	1K - 256K	3 users	View
1x H100 SXM	80GB	312tok/s	1 - 5	1K - 256K	6 users	View

Qwen3.5-35B-A3B	FP8	35B	2/24/2026	Qwen	Apache 2.0	1x RTX Pro 6000 Blackwell, 1x H100 SXM, 2x RTX Pro 6000 Blackwell, 1x H200 SXM

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x RTX Pro 6000 Blackwell	96GB	598tok/s	1 - 10	1K - 256K	34 users	View
1x H100 SXM	80GB	908tok/s	1 - 10	1K - 256K	45 users	View
2x RTX Pro 6000 Blackwell	192GB	1,164tok/s	1 - 15	1K - 256K	27 users	View
1x H200 SXM	141GB	1,479tok/s	1 - 15	1K - 256K	62 users	View

Ministral-3-3B-Instruct-2512	FP8		12/2/2025	Mistral AI	Apache 2.0	1x RTX Pro 6000 Blackwell

Hardware	VRAM	Peak Throughput	Concurrency Tested	Context Length Tested	Chatbot User Capacity (32K)	Full Report
1x RTX Pro 6000 Blackwell	96GB	1,030tok/s	1 - 6	1K - 256K	23 users	View


Methodology

How We Test

Each configuration runs 50+ test scenarios across context lengths and concurrency levels, up to the limits the model and hardware can support. We measure standard metrics such as throughput and latency, and we run capacity tests that raise the number of concurrent requests until performance thresholds are exceeded.
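The capacity test described above can be sketched roughly as a concurrency ramp. This is a minimal illustration, not our actual harness: the `Sample` shape, the `find_capacity` function, the two thresholds (per-user decode speed and time to first token), and the `simulated_measure` stand-in are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    """Metrics observed at one concurrency level (hypothetical shape)."""
    concurrency: int
    tok_s_per_user: float  # decode speed seen by each user
    ttft_s: float          # time to first token, in seconds


def find_capacity(
    measure: Callable[[int], Sample],
    max_concurrency: int,
    min_tok_s_per_user: float,
    max_ttft_s: float,
) -> int:
    """Return the highest concurrency that still meets both thresholds.

    Ramps concurrency upward from 1 and stops at the first level that
    violates a threshold. Returns 0 if even a single user fails.
    """
    capacity = 0
    for n in range(1, max_concurrency + 1):
        sample = measure(n)
        too_slow = sample.tok_s_per_user < min_tok_s_per_user
        too_laggy = sample.ttft_s > max_ttft_s
        if too_slow or too_laggy:
            break
        capacity = n
    return capacity


def simulated_measure(n: int) -> Sample:
    """Toy stand-in for a real load test (assumed numbers, not measured).

    A fixed 200 tok/s budget is shared evenly by n users, and time to
    first token grows by 0.5 s per concurrent user.
    """
    return Sample(concurrency=n, tok_s_per_user=200.0 / n, ttft_s=0.5 * n)
```

With the toy model, `find_capacity(simulated_measure, 20, 20.0, 4.0)` returns 8: the ninth user pushes time to first token past the 4-second limit before per-user speed drops below 20 tok/s. In a real run, `measure` would issue that many concurrent requests against the server and aggregate the observed metrics.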

Prompts use calibrated token counts. Prompt caching is disabled. Speculative decoding is disabled unless specifically noted.
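One way to produce a prompt with a calibrated token count is to tokenize filler text, cut it to the target length, and verify the result by re-tokenizing. The sketch below assumes that approach; `build_prompt`, `toy_encode`, and `toy_decode` are hypothetical names, and the whitespace tokenizer is only a placeholder for the model's real tokenizer.

```python
from typing import Callable, List


def build_prompt(
    target_tokens: int,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    filler: str,
) -> str:
    """Build a prompt that tokenizes to exactly target_tokens tokens."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    if not encode(filler):
        raise ValueError("filler must produce at least one token")

    # Repeat the filler until there are enough tokens to cut from.
    text = filler
    while len(encode(text)) < target_tokens:
        text = text + " " + filler

    # Cut to the exact length in token space, then map back to text.
    prompt = decode(encode(text)[:target_tokens])

    # Real tokenizers do not always round-trip, so check the final count.
    actual = len(encode(prompt))
    if actual != target_tokens:
        raise ValueError(f"expected {target_tokens} tokens, got {actual}")
    return prompt


def toy_encode(text: str) -> List[str]:
    """Placeholder tokenizer: one token per whitespace-separated word."""
    return text.split()


def toy_decode(tokens: List[str]) -> str:
    """Inverse of toy_encode."""
    return " ".join(tokens)
```

For example, `build_prompt(1000, toy_encode, toy_decode, "the quick brown fox")` yields a prompt of exactly 1000 toy tokens. The round-trip check matters more with a subword tokenizer, where decoding a cut token sequence and encoding it again can change the count.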


Not Sure What You Need?

We work with teams to figure out the right model + hardware combination for your throughput, latency, and budget requirements.
