Independent benchmarks for production deployments. Throughput, latency, and capacity tested across hardware configurations.
| Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K context) | Full Report |
|---|---|---|---|---|---|---|
| 1x H200 SXM | 141GB | 458 tok/s | 1 - 4 | 1K - 200K | 12 users | View |
| 1x RTX Pro 6000 Blackwell | 96GB | 361 tok/s | 1 - 4 | 1K - 128K | 5 users | View |
| 1x H100 SXM | 80GB | 511 tok/s | 1 - 4 | 1K - 96K | 7 users | View |
| 2x RTX Pro 6000 Blackwell | 192GB | 664 tok/s | 1 - 6 | 1K - 128K | 8 users | View |
| 1x H200 SXM | 141GB | 849 tok/s | 1 - 10 | 1K - 128K | 26 users | View |
Each configuration runs 50+ test scenarios spanning context lengths and concurrency levels up to the limits of the model and hardware. We measure standard metrics such as throughput and latency, and run capacity tests that increase concurrent requests until performance thresholds are exceeded.
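To make the capacity test concrete, here is a minimal sketch in Python that ramps concurrency against an OpenAI-compatible completions endpoint and stops once a latency threshold is exceeded. The endpoint URL, model id, request shape, and threshold are illustrative assumptions, not the exact harness behind these benchmarks.

```python
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1/completions"  # assumed OpenAI-compatible endpoint
MODEL = "example-model"                            # placeholder model id
LATENCY_LIMIT_S = 30.0                             # example per-request latency ceiling


async def one_request(client: httpx.AsyncClient, prompt: str) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    resp = await client.post(
        BASE_URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 256},
        timeout=300.0,
    )
    resp.raise_for_status()
    return time.perf_counter() - start


async def worst_latency_at(concurrency: int, prompt: str) -> float:
    """Fire `concurrency` simultaneous requests and return the slowest latency."""
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(
            *(one_request(client, prompt) for _ in range(concurrency))
        )
    return max(latencies)


async def ramp(prompt: str, max_concurrency: int = 32) -> int:
    """Increase concurrency step by step until the latency threshold is exceeded."""
    supported = 0
    for concurrency in range(1, max_concurrency + 1):
        worst = await worst_latency_at(concurrency, prompt)
        print(f"concurrency={concurrency:>3}  worst latency={worst:6.1f}s")
        if worst > LATENCY_LIMIT_S:
            break
        supported = concurrency
    return supported


if __name__ == "__main__":
    users = asyncio.run(ramp("Summarize the plot of Hamlet in one paragraph."))
    print(f"Concurrent users supported under the threshold: {users}")
```

A full capacity test would also track time-to-first-token and per-user token throughput, but a single end-to-end latency ceiling is enough to show the ramp-until-threshold shape of the test.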
Prompts use calibrated token counts. No prompt caching. No speculative decoding unless specifically noted.
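As an example of prompt calibration, the sketch below builds a prompt that encodes to a target token count using a Hugging Face tokenizer. The tokenizer id and filler text are stand-ins for illustration; the actual prompts and tokenizer used in these reports are not specified here.

```python
from transformers import AutoTokenizer


def calibrated_prompt(tokenizer, target_tokens: int, filler: str = "lorem ipsum ") -> str:
    """Build a prompt whose encoded length matches `target_tokens`.

    Overshoot with repeated filler text, truncate the token ids, then decode.
    Decoding and re-encoding can shift the count by a token or two for some
    tokenizers, so verify and trim if an exact count matters.
    """
    ids = tokenizer.encode(filler * target_tokens, add_special_tokens=False)
    return tokenizer.decode(ids[:target_tokens])


if __name__ == "__main__":
    # "gpt2" is an ungated stand-in tokenizer used only for illustration.
    tok = AutoTokenizer.from_pretrained("gpt2")
    prompt = calibrated_prompt(tok, target_tokens=1024)
    print(len(tok.encode(prompt, add_special_tokens=False)))  # ~1024
```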
We work with your team to find the right model and hardware combination for your throughput, latency, and budget requirements.