This document describes the methodology used to produce our benchmark results. It covers the metrics collected, testing procedures, integrity measures, and the assumptions underlying our use case recommendations.
Last updated: January 2026
Most LLM benchmarks measure batch throughput: how much data can you push through a system per hour. These metrics matter for offline workloads, but they don't show how configurations perform for real users. This benchmark focuses on user-facing performance: what individual users experience when sharing a system with others, how many concurrent requests you can support, and at what point things start to slow down. These are the questions that determine whether a configuration works for your application.
A system that generates 100 tokens per second for a single user may only deliver 25 tokens per second to each user when five requests run simultaneously. Our benchmarks measure this behavior directly, providing data for infrastructure decisions based on realistic concurrent workloads rather than peak single-user performance.
We test across a matrix of conditions: context lengths from 1K tokens up to the maximum the model and hardware can support, and concurrency levels from single-user to high-load scenarios. We also run dedicated capacity tests that push concurrency until performance thresholds are exceeded. Together, this produces 50+ distinct test scenarios per hardware configuration. The goal is to characterize not only peak throughput but how gracefully performance degrades under load.
Every methodology decision prioritizes production-realistic measurement over impressive numbers. We prefer conservative, defensible results that match deployment behavior over optimistic figures that fail to materialize under real load. Where our approach differs from common benchmarking practices, we document the rationale in Section 7.
We collect metrics that answer specific questions about production performance. This section describes each metric, its meaning, and why it matters for deployment decisions.
These metrics appear directly in benchmark reports and drive use case recommendations.
System Throughput is the total tokens per second generated across all concurrent requests. This measures aggregate system capacity. If five users share a system producing 100 tokens per second total, each user effectively receives 20 tokens per second. System throughput typically increases with concurrency as the GPU processes multiple requests in parallel, though per-user speed decreases.
Per-User Generation Speed is the rate at which tokens stream into an individual user's response, measured in tokens per second. This is what determines how fast or slow responses feel during streaming. We measure actual per-user decode speed and do not include queue wait time or time to first token.
Time to First Token (TTFT) is the elapsed time from request submission to the first response token. TTFT is the primary metric for perceived responsiveness. It has two main components: queue wait time (time spent waiting for GPU availability) and prefill time (time to process the input context). At low concurrency, queue wait time is negligible and TTFT is dominated by prefill time. Under load, queue wait time becomes the larger factor.
Inter-Token Latency (ITL) is the time between consecutive tokens during generation. ITL determines whether streaming responses feel smooth or choppy. We measure actual per-token timing rather than deriving from other metrics. This captures the true distribution, including tail latency that affects user experience.
Queue Wait Time measures how long requests spend waiting before processing begins. At low concurrency, this approaches zero. As load increases, requests queue and wait times grow. Queue wait time helps identify system saturation points where adding more concurrent requests yields diminishing returns.
Scaling Efficiency is the ratio of per-user throughput at N concurrent requests to per-user throughput at one request. An efficiency of 100% would indicate no degradation from concurrency. In practice, efficiency decreases as requests compete for GPU compute, memory bandwidth, and KV cache space. This metric shows where diminishing returns begin and helps right-size deployments.
Per-User Prefill Speed is the rate at which the model processes an individual user's input context, measured in tokens per second. Higher prefill speed means lower TTFT at a given context length. Prefill speed typically increases with context length (better GPU utilization on larger batches) until the GPU is saturated, at which point it plateaus or declines.
End-to-End Latency is the total time from request submission to final token received. This equals TTFT plus total generation time. End-to-end latency matters for batch processing workflows where users wait for complete responses rather than streaming output.
Success Rate is the percentage of requests that complete successfully without errors or timeouts. We report success rate so you can see how stable a configuration is across all tested scenarios, including high-load conditions designed to find system limits.
Beyond primary metrics shown in reports, we collect additional data available upon request.
Latency Percentiles. Full percentile breakdowns (P50, P75, P90, P95, P99) for TTFT, ITL, and end-to-end latency. Percentile data is essential for understanding tail latency and planning against SLA requirements.
GPU Utilization. Compute utilization over time, showing how efficiently the hardware is used and where headroom exists for additional load.
VRAM Usage. Memory consumption across context lengths and concurrency levels. Critical for understanding capacity limits and out-of-memory risks at high context or concurrency.
Power and Temperature. GPU power draw and thermal behavior under load. Useful for operational cost estimation, cooling requirements, and data center power budgeting.
Full Capacity Test Metrics. Reports show TTFT and generation speed for capacity tests, but we collect all primary metrics at every concurrency level tested.
Each benchmark follows a consistent procedure designed to produce reliable, comparable results that reflect production behavior.
We evaluate multiple inference engines including vLLM, SGLang, and TensorRT-LLM. For each model and hardware combination, we select the engine that delivers the best concurrent request performance. Single-user speed is not the selection criterion because production deployments typically serve multiple users simultaneously. Each engine is configured for optimal throughput at the target concurrency range.
Before measurement begins, we run warm-up requests to avoid cold-start overhead on initial requests. Warm-up data is excluded from all reported metrics.
We test across a matrix of context lengths and concurrency levels. Context lengths range from 1K tokens up to the maximum the model and hardware can support, with breakpoints chosen to correspond to common use cases like code completions, chatbots, document analysis, and coding assistants.
Concurrency levels always start at a single request. Beyond that, the range depends on what the model and hardware combination can handle. Some configurations max out at 5 concurrent requests; others scale past 100. For capacity testing, we extend concurrency incrementally until performance thresholds are exceeded.
The full matrix produces 50 or more distinct test scenarios per hardware configuration.
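To make the matrix concrete, here is a minimal sketch of how such a scenario grid can be generated. The context breakpoints and concurrency levels below are illustrative placeholders, not our exact values.

```python
from itertools import product

# Illustrative breakpoints only; actual values depend on the model and hardware.
CONTEXT_LENGTHS = [1_024, 4_096, 8_192, 16_384, 32_768, 65_536, 131_072]
CONCURRENCY_LEVELS = [1, 2, 5, 10, 25, 50, 100]

# Every (context length, concurrency) pair becomes one test scenario;
# dedicated capacity tests are added on top of this grid.
scenarios = [
    {"context_tokens": ctx, "concurrency": users}
    for ctx, users in product(CONTEXT_LENGTHS, CONCURRENCY_LEVELS)
]
print(len(scenarios))  # 49 grid scenarios before capacity tests
```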
We use gradual ramp-up rather than instant load: simulated users are added incrementally over a few seconds. This creates a mix of requests at different stages of processing and brings the system into a realistic operating state sooner than instant load, where requests stay synchronized for longer.
Each simulated user operates serially: send request, wait for complete response, immediately send next request. There is no artificial wait time between requests, which produces maximum sustained load for the given concurrency level.
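The sketch below shows this load model in miniature: staggered user start times for the ramp-up, and a serial send-wait-send loop per virtual user. The send_request coroutine is a placeholder for the real streaming client, and the timing constants are illustrative.

```python
import asyncio
import time

RAMP_UP_SECONDS = 5.0        # illustrative ramp-up window
TEST_DURATION_SECONDS = 300  # in practice the stability check decides when to stop

async def send_request(user_id: int) -> None:
    """Placeholder: issue one completion request and consume the full response."""
    await asyncio.sleep(1.0)  # stands in for network plus inference time

async def virtual_user(user_id: int, start_delay: float, stop_at: float) -> None:
    # Stagger start times so requests sit at different stages of processing.
    await asyncio.sleep(start_delay)
    while time.monotonic() < stop_at:
        # Serial loop: send, wait for the complete response, then send again.
        await send_request(user_id)

async def run_load(concurrency: int) -> None:
    stop_at = time.monotonic() + TEST_DURATION_SECONDS
    users = [
        virtual_user(i, start_delay=i * RAMP_UP_SECONDS / concurrency, stop_at=stop_at)
        for i in range(concurrency)
    ]
    await asyncio.gather(*users)

if __name__ == "__main__":
    asyncio.run(run_load(concurrency=10))
```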
Rather than running a fixed number of requests, we run until metrics stabilize. Throughput readings must converge over consecutive measurement windows before the test concludes. This ensures results reflect consistent performance rather than transient behavior.
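A simplified version of that convergence check might look like this; the window count and tolerance are placeholders rather than the values used in the actual benchmark.

```python
def throughput_has_stabilized(window_throughputs: list[float],
                              windows_required: int = 3,
                              tolerance: float = 0.02) -> bool:
    """Return True once the last few per-window throughput readings agree
    within a relative tolerance. Parameters here are illustrative."""
    if len(window_throughputs) < windows_required:
        return False
    recent = window_throughputs[-windows_required:]
    mean = sum(recent) / len(recent)
    return mean > 0 and (max(recent) - min(recent)) / mean <= tolerance
```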
Performance metrics are collected from the inference engine's exposed metrics where available, and measured directly when they aren't or for additional verification. We poll the inference engine and GPU monitoring software (e.g. DCGM Exporter) directly, collecting GPU metrics like utilization, VRAM, power, and temperature throughout each test.
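As a rough sketch, the poller can scrape Prometheus-format endpoints on a fixed interval; the URLs below (a vLLM-style /metrics endpoint and DCGM Exporter's default port) are assumptions for illustration, and a real harness would parse the series with a proper Prometheus client.

```python
import time
import requests

# Assumed endpoints; adjust to the engine and exporter in your deployment.
ENDPOINTS = {
    "engine": "http://localhost:8000/metrics",  # inference engine metrics
    "gpu": "http://localhost:9400/metrics",     # DCGM Exporter default
}

def scrape_once() -> dict[str, str]:
    """Fetch raw Prometheus text from each endpoint."""
    return {name: requests.get(url, timeout=5).text for name, url in ENDPOINTS.items()}

def poll(interval_s: float = 1.0, duration_s: float = 60.0) -> list[dict[str, str]]:
    """Collect samples throughout a test window."""
    samples, deadline = [], time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(scrape_once())
        time.sleep(interval_s)
    return samples
```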
Benchmark results are useful only if they are reliable and reproducible. This section describes measures taken to ensure measurements reflect true inference performance.
When a test specifies "8K context," the prompt contains exactly 8,192 tokens. We use the model's actual tokenizer to construct prompts that hit precise token counts.
We found in testing that estimating token counts using approximations like "4 characters per token" can be off by 20% or more depending on content. Our prompt construction process tokenizes candidate text, measures actual token count, and adjusts content until reaching the target exactly. This ensures fair comparisons across context lengths and between test runs.
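A minimal sketch of this construction using a Hugging Face tokenizer follows; the filler text and final verification step are illustrative, and a real harness repeats the adjust-and-verify step until the count is exact.

```python
from transformers import AutoTokenizer

def build_prompt(model_name: str, target_tokens: int,
                 filler: str = " benchmark filler text") -> str:
    """Build a prompt whose token count, measured by the model's own tokenizer,
    matches target_tokens. Filler and adjustment strategy are illustrative."""
    tok = AutoTokenizer.from_pretrained(model_name)
    text = ""
    # Grow past the target using the real tokenizer, never a chars-per-token guess.
    while len(tok.encode(text, add_special_tokens=False)) < target_tokens:
        text += filler
    # Trim at the token level, then verify the final count.
    ids = tok.encode(text, add_special_tokens=False)[:target_tokens]
    prompt = tok.decode(ids)
    final_count = len(tok.encode(prompt, add_special_tokens=False))
    assert final_count == target_tokens, f"off by {final_count - target_tokens}; adjust and retry"
    return prompt
```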
We disable prompt caching at the server level and use unique prompt prefixes for each request. This ensures we measure the full cost of processing each request from scratch.
Production systems with caching enabled will achieve significantly better TTFT for repeated or incrementally-built contexts. Our results represent worst-case TTFT. Applications building context incrementally (such as multi-turn conversations) will see actual TTFT closer to short-context results regardless of total conversation length, since only new tokens require processing.
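One simple way to guarantee that no two requests share a cacheable prefix is to tag each prompt with a unique marker, as in this illustrative helper (the marker's token cost would be accounted for when hitting exact context-length targets):

```python
import uuid

def make_uncacheable(prompt: str) -> str:
    """Prepend a unique marker so no request can reuse another's cached prefix."""
    return f"[run-{uuid.uuid4().hex}] {prompt}"
```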
All requests generate exactly 1,024 output tokens, except for code completion tests which use 128 tokens to match typical autocomplete length. End-of-sequence detection is disabled to prevent early termination. This eliminates variability from different response lengths and ensures throughput measurements are comparable across scenarios.
The 1,024-token output length represents sizeable responses (detailed explanations, multiple tool calls, analysis) while remaining practical for test duration. Shorter outputs would over-emphasize prefill relative to decode.
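In practice this comes down to pinning the sampling parameters on every request. The fields below assume an OpenAI-compatible endpoint with vLLM's ignore_eos extension; other engines expose equivalent but differently named knobs.

```python
# Request-body fields used to pin output length (illustrative; engine-specific).
completion_params = {
    "max_tokens": 1024,   # 128 for code-completion scenarios
    "ignore_eos": True,   # vLLM extension: disable end-of-sequence early termination
}
```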
Tests measure baseline inference performance without speculative decoding (unless otherwise noted). This provides a consistent baseline across models and hardware configurations. Production deployments with speculative decoding enabled may achieve 1.5x to 2x better per-user throughput depending on the workload and draft model quality.
Each test scenario captures metrics independently using delta calculations. We record engine state before and after each test, then compute the difference. An inter-test delay allows the system to return to idle state before the next test begins.
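A stripped-down version of the delta pattern looks like this; the snapshot and test callables, and the counter names, are placeholders.

```python
import time

def run_scenario(snapshot_counters, run_test) -> dict[str, float]:
    """Attribute only the difference in cumulative engine counters to this test.
    snapshot_counters() returns a dict of monotonically increasing counters;
    run_test() drives the load for one scenario."""
    before = snapshot_counters()
    start = time.monotonic()
    run_test()
    wall_clock_s = time.monotonic() - start
    after = snapshot_counters()
    deltas = {name: after[name] - before[name] for name in after if name in before}
    deltas["system_throughput_tok_s"] = deltas.get("output_tokens", 0.0) / wall_clock_s
    return deltas
```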
Given identical model, hardware, and engine configuration, our benchmarks produce consistent results due to stability-based termination. However, outside factors can affect results: cooling effectiveness, silicon variance between GPU units, and inference engine improvements over time. We retain the full test configuration for each benchmark to support such comparisons.
Benchmark reports include a use case guidance table showing how many concurrent requests a configuration can handle for different workloads. These recommendations are based on user experience thresholds: the point where performance degrades below acceptable levels for each use case.
Different applications have different tolerance for latency and generation speed. A code completion tool requires sub-second response times. A document analysis system can tolerate longer waits.
These thresholds assume the worst case, where all context is processed at once with no caching. In production with caching enabled, users would only experience these wait times when submitting a large context all at once. Subsequent turns in a conversation would be much faster since only new tokens need processing.
The TTFT thresholds are derived from user experience research on acceptable response times for interactive applications. They reflect how long users will wait before perceiving the system as slow or unresponsive. The speed thresholds reflect minimum generation rates for comfortable reading or code review, based on typical human reading speeds.
For each use case, we run dedicated capacity tests that increment concurrency until either threshold is violated. The capacity limit is the highest concurrency level where both TTFT and generation speed remain within acceptable bounds.
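Conceptually, the capacity search is a loop like the sketch below; run_at, the step size, and the threshold values are placeholders rather than our exact procedure.

```python
def find_capacity(run_at, ttft_limit_s: float, min_tok_s: float,
                  step: int = 1, max_users: int = 256) -> int:
    """Increase concurrency until either threshold is violated and return the
    highest level where both held. run_at(n) is assumed to run the load test at
    n concurrent users and return (ttft_seconds, per_user_tok_s)."""
    capacity = 0
    for n in range(1, max_users + 1, step):
        ttft_s, per_user_tok_s = run_at(n)
        if ttft_s > ttft_limit_s or per_user_tok_s < min_tok_s:
            break
        capacity = n
    return capacity
```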
Capacity values in reports use the following notation:
The thresholds above represent general guidance for common application patterns. Specific deployments may have different requirements. A background batch-processing system may tolerate 30-second TTFT. A real-time coding assistant may require sub-200 ms responses. An internal tool with captive users may accept slower generation speeds than a consumer product.
We can apply custom thresholds to benchmark data on request. This allows capacity planning based on your specific latency and throughput requirements rather than our default assumptions.
Note: Capacity limits represent where user experience begins to degrade, not where the system fails. Beyond these points, the system continues functioning with slower response times that may still be acceptable for specific requirements.
This section documents the formulas used to calculate each metric. These align with industry-standard approaches used by inference engines and benchmarking tools.
System Throughput
System Throughput (tok/s) = Total Output Tokens / Wall Clock Time

System throughput measures aggregate generation capacity across all concurrent requests. Wall clock time is measured from test start to test end, including ramp-up. Higher concurrency typically increases system throughput while decreasing per-user speed.
Per-User Generation Speed
Per-User Speed (tok/s) = Output Tokens / Decode Time

Per-user speed is calculated from engine-side decode duration, not wall-clock time. This captures actual generation speed excluding queue wait and prefill time. Decode time is obtained from the inference engine's exposed metrics.
Scaling Efficiency
Efficiency = (Throughput at N users / N) / Throughput at 1 user

A value of 1.0 indicates perfect linear scaling with no degradation.
Time to First Token
TTFT ≈ Queue Wait Time + Prefill Time

We collect TTFT directly from the inference engine. The equation above generally holds, but under heavy load, preemptions and other overhead mean that queue wait and prefill time may not sum exactly to TTFT.
Inter-Token Latency
ITL = Time(token[i]) - Time(token[i-1])
We measure actual per-token timing from the inference engine, capturing the real distribution of inter-token delays for accurate percentile reporting.
End-to-End Latency
E2E Latency = TTFT + (Output Tokens / Decode Speed)
Total request duration from submission to final token. For streaming applications, TTFT and ITL are more relevant. For batch workflows, end-to-end latency determines total processing time.
Prefill Speed
Prefill Speed (tok/s) = Input Tokens / Prefill Time
Prefill speed indicates how fast the model processes input context. Higher prefill speed yields lower TTFT at a given context length.
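For reference, the formulas above map directly onto code. The helper below assumes the raw measurements have already been collected from the engine; variable names are illustrative.

```python
def derive_metrics(total_output_tokens: int, wall_clock_s: float,
                   output_tokens: int, decode_time_s: float,
                   input_tokens: int, prefill_time_s: float,
                   queue_wait_s: float, concurrency: int,
                   per_user_tok_s_at_one_user: float) -> dict[str, float]:
    """Apply the report formulas to raw per-scenario measurements."""
    system_throughput = total_output_tokens / wall_clock_s
    per_user_speed = output_tokens / decode_time_s
    ttft_s = queue_wait_s + prefill_time_s  # approximation; see the note above
    return {
        "system_throughput_tok_s": system_throughput,
        "per_user_speed_tok_s": per_user_speed,
        "prefill_speed_tok_s": input_tokens / prefill_time_s,
        "ttft_s": ttft_s,
        "e2e_latency_s": ttft_s + output_tokens / per_user_speed,
        "scaling_efficiency": (system_throughput / concurrency) / per_user_tok_s_at_one_user,
    }
```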
Results from our benchmarks may differ from vendor numbers, NVIDIA GenAI-Perf results, or internal testing. Most discrepancies stem from methodology choices rather than measurement errors. This section documents key differences.
Our system throughput numbers include the full test window, including 5 seconds of gradual user ramp-up at test start. Benchmarks that measure only from "first request sent to last response received" exclude this overhead and report higher throughput for the same system.
We include ramp-up time because it produces a mix of requests at different stages of processing, which reaches a realistic operating state sooner than instant load, where requests remain synchronized for longer.
Additionally, our system throughput metric counts only output tokens. Some benchmarks include input tokens in their throughput calculations, which produces far higher numbers, especially at long context lengths, but doesn't reflect actual generation capability.
We test with prompt caching disabled, measuring the full cost of processing each request's context from scratch. Production systems with caching enabled achieve significantly better TTFT for incremental contexts.
Our numbers represent worst-case TTFT. For applications where context is built incrementally (such as multi-turn chat), actual TTFT will be closer to short-context results regardless of total conversation length.
We use a "virtual user" model where each simulated user sends requests serially: send request, wait for response, send next request. Some benchmarks maintain a fixed number of in-flight requests at all times, immediately replacing each completed request.
We measure actual inter-token timing from the inference engine's metrics. Some benchmarks derive ITL as (End-to-End Latency - TTFT) / (Output Tokens - 1). The derived approach produces an average but loses distribution information. Our approach captures true tail latency visible in percentile breakdowns.
For standard output lengths (1,024 tokens), the numerical difference between approaches is negligible. The distinction matters for understanding latency variance rather than average values.
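To make the distinction concrete, the two approaches can be computed side by side; token timestamps are assumed to come from the streaming client or the engine's metrics.

```python
def derived_itl(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Average ITL inferred from aggregate timings; no distribution information."""
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

def measured_itl(token_timestamps_s: list[float]) -> list[float]:
    """Per-token gaps from actual timestamps; P95/P99 come from this list."""
    return [b - a for a, b in zip(token_timestamps_s, token_timestamps_s[1:])]
```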
Summary: If you see different numbers elsewhere, common differences include: instant ramp-up vs. gradual, cached vs. uncached prompts, output-only vs. input+output throughput, and constant request pressure vs. virtual users. These methodology choices explain most discrepancies.
Every benchmark has limitations. Understanding what our results do not capture helps interpret them correctly.
Prompt content. We use representative text that mimics typical user interaction patterns. Specialized content (code with unusual syntax, structured data formats, non-English languages) may tokenize differently and produce different results.
Output length. All tests use 1,024 output tokens. Applications with shorter or longer outputs may see different performance characteristics. Short outputs emphasize prefill; long outputs emphasize decode.
Synthetic load pattern. Real traffic has variable request sizes, arrival patterns, and user behavior. Our controlled load is more consistent than production traffic, which may exhibit burstiness, diurnal patterns, and correlated request characteristics.
Prompt caching. We disable prompt caching to measure worst-case performance. Production systems with caching enabled will see substantially better TTFT for repeated or incremental contexts.
Speculative decoding. Tests measure baseline inference without speculation. Systems with speculative decoding may achieve 1.5-2x better per-user throughput depending on workload and draft model quality.
Single configuration. Results reflect one specific engine configuration (batch size, memory allocation, scheduling parameters). Different tuning may yield different results. We select configurations optimized for concurrent performance, which may differ from configurations optimized for single-user latency.
Hardware variance. Even identical GPU models can show 5-10% performance variation due to silicon quality, thermal conditions, and power delivery. Results are from specific hardware samples and may not match all units of the same model.
Network latency. We minimize network latency between the load generator and inference server where possible, but this may vary between benchmarks. Network latency between your application and inference server will depend on your deployment architecture.
Model quality. We measure inference speed, not output quality or accuracy. Quantized models trade quality for speed. Benchmark results do not indicate whether a model produces correct or useful outputs.
Multi-tenant isolation. We simulate concurrent load but do not test actual multi-tenant isolation mechanisms. Production deployments with per-tenant resource limits or priority queuing will behave differently.
Point-in-time snapshot. Inference engines improve rapidly. Results reflect engine versions at test time. Newer versions may perform differently.
Use these benchmarks as a reference point for comparison, not as a guarantee of production performance. Your actual results will depend on your specific workload, configuration, and infrastructure. For critical decisions, we recommend validating with your own data.
We are available to discuss testing methodology, provide additional metrics from our benchmarks, or run custom tests for specific requirements.