Independent benchmarks for production deployments. Throughput, latency, and capacity tested across hardware configurations.
Each configuration runs 50+ test scenarios spanning context lengths and concurrency levels, up to the limits the model and hardware support. We measure standard metrics such as throughput and latency, and run capacity tests that ramp concurrent requests until performance thresholds are exceeded.
Prompts use calibrated token counts. No prompt caching. No speculative decoding unless specifically noted.
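The capacity-test loop above can be sketched as follows. This is a minimal illustration, not our actual harness: `simulate_request`, `measure_p95`, and the latency threshold are all hypothetical stand-ins for real inference calls and SLO settings.

```python
import asyncio

# Hypothetical threshold: a p95 latency budget standing in for a real SLO.
LATENCY_SLO_S = 0.5

async def simulate_request(concurrency: int) -> float:
    """Stand-in for one inference call; latency grows with load (toy model)."""
    latency = 0.05 + 0.002 * concurrency  # pretend queueing delay
    await asyncio.sleep(0)  # yield control; a real call would await the server
    return latency

async def measure_p95(concurrency: int, n: int = 100) -> float:
    """Fire n concurrent requests and take the 95th-percentile latency."""
    latencies = await asyncio.gather(
        *(simulate_request(concurrency) for _ in range(n))
    )
    return sorted(latencies)[int(0.95 * n)]

async def find_capacity(max_concurrency: int = 512) -> int:
    """Double concurrency until the p95 latency threshold is exceeded."""
    capacity, concurrency = 0, 1
    while concurrency <= max_concurrency:
        if await measure_p95(concurrency) > LATENCY_SLO_S:
            break
        capacity = concurrency
        concurrency *= 2
    return capacity

if __name__ == "__main__":
    print(asyncio.run(find_capacity()))  # last concurrency level within budget
```

A real run would replace the toy latency model with live requests against the serving endpoint and repeat at each context length under test.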
We work with teams to find the right model and hardware combination for their throughput, latency, and budget requirements.