gpt-oss-120b is a 117B-parameter Mixture-of-Experts (MoE) model with 5.1B active parameters, released by OpenAI under a permissive Apache 2.0 license. Designed for production-grade reasoning and agentic tasks, it features configurable reasoning effort (low, medium, high) and exposes its full chain-of-thought for debugging and transparency. The model supports native agentic capabilities including function calling, web browsing, Python code execution, and Structured Outputs. It was post-trained with MXFP4 quantization and requires OpenAI's harmony response format to function correctly.
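As a minimal sketch of how a deployment like the ones benchmarked here might be queried, assuming an OpenAI-compatible endpoint at http://localhost:8000/v1 (the endpoint, model name, and the reasoning_effort field are illustrative assumptions, not details taken from this report):

```python
# Minimal sketch: querying a locally served gpt-oss-120b through an
# OpenAI-compatible endpoint. The base_url, model name, and reasoning_effort
# field are assumptions for illustration, not values from this report.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Summarize the trade-offs of MoE models in three bullet points."},
    ],
    # Reasoning effort (low / medium / high); passed via extra_body since it is
    # not a standard Chat Completions parameter (server support is assumed).
    extra_body={"reasoning_effort": "high"},
)
print(response.choices[0].message.content)
```

In this kind of setup the serving layer is expected to apply the harmony response format internally, so the client only exchanges standard chat messages.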
Compare performance across different hardware configurations. See each full report for detailed metrics.
| Hardware | VRAM | Peak Throughput | Concurrency Tested | Context Length Tested | Chatbot User Capacity (32K) | Full Report |
|---|---|---|---|---|---|---|
| 1x RTX Pro 6000 Blackwell | 96 GB | 361 tok/s | 1 - 4 | 1K - 128K | 5 users | View |
| 1x H100 SXM | 80 GB | 511 tok/s | 1 - 4 | 1K - 96K | 7 users | View |
| 2x RTX Pro 6000 Blackwell | 192 GB | 664 tok/s | 1 - 6 | 1K - 128K | 8 users | View |
| 1x H200 SXM | 141 GB | 849 tok/s | 1 - 10 | 1K - 128K | 26 users | View |
Per-user generation speed across context lengths. The top edge of each band shows single-user performance; the bottom edge shows performance at maximum tested concurrency.
Time until the first token is generated. The bottom edge of each band shows single-user TTFT; the top edge shows TTFT at maximum tested concurrency. Lower is better for responsive user experience.
Maximum concurrent requests while maintaining acceptable user experience. Thresholds vary by use case. See individual reports for details.
| Hardware | Code Completion (1K context) | Short-form Chat (8K context) | General Chatbot (32K context) | Document Processing (64K context) | Agentic Coding (96K context) |
|---|---|---|---|---|---|
| 1x RTX Pro 6000 Blackwell | 12 | ~59 | 5 | 1 | — |
| 1x H100 SXM | ~18 | ~32 | 7 | 3 | 1 |
| 2x RTX Pro 6000 Blackwell | 27 | 125+ | 8 | 3 | 1 |
| 1x H200 SXM | 20 | 125+ | 26 | 9 | 4 |
"+" indicates capacity exceeded tested concurrency range. "~" indicates estimated from available data.
Benchmark methodology →
Get a custom recommendation based on your specific workload, budget, and performance requirements.
The table below maps each configuration's performance to common deployment scenarios. Capacity limits mark the point where TTFT rises above or generation speed falls below the accepted thresholds for a comfortable user experience.
The limits shown are conservative. Beyond these points, the system continues functioning with slower response times that may still be acceptable for your specific use case.
Want to validate your specific configuration?
Aggregate token generation rate across all concurrent requests. Measures output tokens only; prompt tokens processed during prefill are excluded.
Token generation rate experienced by each individual user. This is the speed at which text streams into their response, also referred to as "decode speed" or "decode throughput." As concurrency increases, per-user speed decreases since GPU resources are shared across requests.
Time from request submission to first response token. The primary metric for perceived responsiveness. TTFT has two components: prefill time (processing the input context) and queue wait (time spent waiting for GPU availability).
At low concurrency, prefill dominates. Under load, queue wait becomes the larger factor. See Queue Wait Times and Prefill Speed in Technical Analysis.
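As a rough illustration of how these user-facing metrics relate, the sketch below derives them from per-token arrival timestamps; the timestamp collection and field names are assumptions, not the harness used for this benchmark.

```python
# Sketch: deriving user-facing metrics from per-token timestamps collected while
# streaming a single response. Timestamps are assumed to be time.monotonic()
# values recorded when the request was sent and as each output token arrived.
from statistics import mean

def summarize_request(t_sent: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - t_sent                 # time to first token (s)
    e2e_latency = token_times[-1] - t_sent         # end-to-end latency (s)
    n_tokens = len(token_times)
    if n_tokens > 1:
        gen_speed = (n_tokens - 1) / (token_times[-1] - token_times[0])
        itl = mean(b - a for a, b in zip(token_times, token_times[1:]))
    else:
        gen_speed, itl = 0.0, 0.0
    return {
        "ttft_s": ttft,                # perceived responsiveness
        "gen_speed_tok_s": gen_speed,  # per-user decode speed
        "mean_itl_s": itl,             # inter-token latency (streaming smoothness)
        "e2e_latency_s": e2e_latency,  # what the code-completion scenario thresholds on
    }

# Aggregate throughput is then the sum of output tokens across all concurrent
# requests divided by the wall-clock window in which they were generated.
```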
How many concurrent requests can this configuration handle for different workloads? Each chart below shows performance metrics as concurrency increases at a specific context length. Dashed lines indicate quality thresholds, the point where user experience degrades below acceptable levels. The "capacity limit" is the tested or estimated point where the first threshold is reached.
Inline code suggestions in IDEs, like autocomplete. Responsiveness is critical. This test generates 128 output tokens per request (vs. 1024 elsewhere) to match typical autocomplete length. The key metric is end-to-end latency, not TTFT.
Threshold: End-to-end latency < 2,000ms
Quick conversational exchanges: customer support queries, simple Q&A, single-turn requests. 8K context accommodates a few back-and-forth messages plus system prompt. User expectations are more forgiving for these scenarios. 10+ tok/s is acceptable for reading streamed responses from a support chatbot.
Thresholds: TTFT < 10s, generation speed > 10 tok/s
ChatGPT-style chatbot. If you're deploying a multi-turn conversational chatbot, this benchmark shows how many concurrent requests you can support while maintaining acceptable responsiveness. 32K context matches ChatGPT's limit.
Thresholds: TTFT < 8s, generation speed > 15 tok/s
Summarizing reports, extracting data from contracts, analyzing lengthy documents. 64K tokens handles documents up to roughly 125-160 pages depending on formatting and density.
Users typically tolerate higher latency for document processing since they understand large inputs require more processing time. However, generation speed still needs to stay at or above reading speed.
Thresholds: TTFT < 12s, generation speed > 15 tok/s
Agentic coding workloads: AI assistants that read large portions of a codebase to answer questions, refactor code, or implement features. 96K tokens handles roughly 8,000-9,000 lines of code, enough for significant repository context.
Agentic workflows chain multiple LLM calls (tool use, retrieval, iterative refinement). With caching properly configured, context persists between requests and only new tokens require processing, dramatically reducing TTFT for each step. These results represent worst-case TTFT where all context is processed at once.
Thresholds: TTFT < 12s, generation speed > 20 tok/s
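The sketch below encodes the scenario thresholds listed above and picks out the capacity limit from a concurrency sweep; the measurement format is an assumption, not this benchmark's actual tooling.

```python
# Sketch: encoding the scenario thresholds above and finding the capacity limit,
# i.e. the highest tested concurrency at which all thresholds still hold.
# The sweep format (concurrency -> averaged metrics) is an assumption.
THRESHOLDS = {
    "code_completion":     {"e2e_latency_s": 2.0},
    "short_form_chat":     {"ttft_s": 10.0, "gen_speed_tok_s": 10.0},
    "general_chatbot":     {"ttft_s": 8.0,  "gen_speed_tok_s": 15.0},
    "document_processing": {"ttft_s": 12.0, "gen_speed_tok_s": 15.0},
    "agentic_coding":      {"ttft_s": 12.0, "gen_speed_tok_s": 20.0},
}

def meets(metrics: dict, limits: dict) -> bool:
    checks = []
    if "e2e_latency_s" in limits:
        checks.append(metrics["e2e_latency_s"] < limits["e2e_latency_s"])
    if "ttft_s" in limits:
        checks.append(metrics["ttft_s"] < limits["ttft_s"])
    if "gen_speed_tok_s" in limits:
        checks.append(metrics["gen_speed_tok_s"] > limits["gen_speed_tok_s"])
    return all(checks)

def capacity_limit(sweep: dict[int, dict], scenario: str) -> int:
    """sweep maps tested concurrency -> averaged metrics at that concurrency."""
    passing = [c for c, m in sorted(sweep.items()) if meets(m, THRESHOLDS[scenario])]
    return max(passing) if passing else 0
```

Applied to a sweep like the ones charted here, this is essentially how a capacity figure such as "26 users at 32K context" is read off: the last tested concurrency before the first threshold is crossed.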
Infrastructure-level metrics that explain user-facing performance: queue wait, prefill throughput, inter-token latency, and scaling efficiency across load conditions. These help diagnose bottlenecks and validate infrastructure decisions.
Time a request waits for GPU availability before processing begins. At low concurrency, queue wait is near zero. As load increases, requests queue and wait times grow.
Queue wait is included in TTFT. Breaking it out separately helps identify whether latency is caused by GPU saturation (high queue wait) or context processing (high prefill time).
Rate at which the model processes input context before generating output. Prefill speed determines the non-queue portion of TTFT. Higher prefill speeds mean faster time-to-first-token at a given context length.
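A small sketch of this decomposition, with placeholder numbers rather than values measured here:

```python
# Sketch: decomposing TTFT into its two components. Prefill time is estimated
# from prompt length and a measured prefill speed; queue wait is the remainder.
# The prefill speed and TTFT figures below are placeholders, not benchmarked values.
def decompose_ttft(ttft_s: float, prompt_tokens: int, prefill_speed_tok_s: float) -> dict:
    prefill_s = prompt_tokens / prefill_speed_tok_s
    queue_wait_s = max(ttft_s - prefill_s, 0.0)
    return {"prefill_s": prefill_s, "queue_wait_s": queue_wait_s}

# Example: a 32K-token prompt at an assumed 8,000 tok/s prefill speed implies
# ~4 s of prefill; if measured TTFT is 6 s, roughly 2 s was queue wait,
# pointing at GPU saturation rather than context processing.
print(decompose_ttft(6.0, 32_000, 8_000.0))
```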
Time between consecutive tokens during generation. Determines the smoothness of streaming responses. Lower latency produces more fluid text output. ITL helps diagnose the underlying token-level behavior.
Percentage of ideal linear scaling achieved as concurrency increases. 100% efficiency means doubling concurrent requests doubles total throughput with no per-user degradation. Real-world efficiency is always lower due to shared GPU resources.
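As a worked example of this definition (the numbers are illustrative, not measured values):

```python
# Sketch: scaling efficiency as a percentage of ideal linear scaling.
# throughput_1 is total throughput at concurrency 1; throughput_n at concurrency n.
def scaling_efficiency(throughput_1: float, throughput_n: float, n: int) -> float:
    return 100.0 * throughput_n / (n * throughput_1)

# Example: if one user sees 100 tok/s and four concurrent users see 300 tok/s
# total, efficiency is 100 * 300 / (4 * 100) = 75%.
print(scaling_efficiency(100.0, 300.0, 4))
```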
This page shows averages. Full percentile breakdowns (P50–P99) and GPU metrics (utilization, VRAM, temperature) are available on request.
GPU power draw under varying load conditions. Relevant for operational cost estimation, cooling requirements, and data center power budgeting.
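A back-of-the-envelope sketch of how power draw feeds into cost estimates; the wattage and electricity price below are placeholders, not measured values:

```python
# Sketch: rough operating-cost estimate from average power draw under load.
# All numbers below are placeholders, not values measured in this report.
def monthly_energy_cost(avg_power_w: float, hours: float = 730.0,
                        price_per_kwh: float = 0.15) -> float:
    kwh = avg_power_w / 1000.0 * hours
    return kwh * price_per_kwh

# Example: a GPU averaging 600 W for a month (~730 h) uses ~438 kWh,
# roughly $66 at $0.15/kWh.
print(round(monthly_energy_cost(600.0), 2))
```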
Get a custom benchmark for your configuration, or talk through your requirements with our team.