TL;DR
zerank-2 delivers consistent, low-latency performance under realistic production conditions. In our testing, 97.3% of requests completed under 500ms with zero failures. This document presents our latency measurements and explains how to properly benchmark reranker performance.
Why Proper Latency Testing Matters
When evaluating reranker latency, it's critical that your testing reflects actual production usage patterns. Real user traffic doesn't arrive at uniform intervals. It comes in bursts and clusters. Testing with sequential requests or artificial patterns will give you misleading results that don't predict real-world performance.
Our tests use Poisson arrival patterns because they model the random, bursty nature of production traffic. This approach reveals how systems behave under realistic load conditions, including queueing effects and concurrent request handling.
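A Poisson arrival pattern can be generated by drawing exponentially distributed gaps between consecutive requests. The snippet below is a minimal sketch of that idea, not the harness used for the measurements in this document; the function name `poisson_arrival_times` and the fixed seed are illustrative assumptions.

```python
import random

def poisson_arrival_times(rate_per_sec, duration_sec, seed=None):
    """Return increasing arrival offsets (seconds) for a Poisson process.

    Gaps between arrivals are exponential with mean 1 / rate_per_sec,
    which produces the bursts and lulls of random traffic.
    """
    rng = random.Random(seed)
    arrivals = []
    t = rng.expovariate(rate_per_sec)
    while t < duration_sec:
        arrivals.append(t)
        t += rng.expovariate(rate_per_sec)
    return arrivals

# Example: an average of 5 requests/second over a 60-second window.
arrivals = poisson_arrival_times(rate_per_sec=5, duration_sec=60, seed=0)
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
print(f"{len(arrivals)} requests, shortest gap {min(gaps):.4f}s, longest gap {max(gaps):.4f}s")
```

The spread between the shortest and longest gap is the point: some requests arrive almost back to back, which is what exercises queueing and concurrent handling.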
Testing Methodology
All tests were conducted using:
Poisson arrival patterns at 1-10 requests/second
60-second test duration
50 documents per request
Payload size ≤2KB per document
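To reproduce a setup like this, requests should be sent at their scheduled arrival times whether or not earlier requests have finished (an open-loop test). The following is a hedged sketch of such a driver, assuming a caller-supplied async `send` function; `run_open_loop` and `fake_send` are hypothetical names, and `fake_send` stands in for a real HTTP call to the reranker, which is not shown here.

```python
import asyncio
import time

async def run_open_loop(send, arrival_times):
    """Call send() at each arrival offset; return (latencies_sec, failures)."""
    start = time.perf_counter()
    latencies, failures = [], 0

    async def one_request(offset):
        nonlocal failures
        # Wait until this request's scheduled arrival, independent of
        # any requests that are still in flight.
        delay = offset - (time.perf_counter() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        t0 = time.perf_counter()
        try:
            await send()
            latencies.append(time.perf_counter() - t0)
        except Exception:
            failures += 1

    await asyncio.gather(*(one_request(t) for t in arrival_times))
    return latencies, failures

async def fake_send():
    # Placeholder for a real rerank request (hypothetical).
    await asyncio.sleep(0.01)

latencies, failures = asyncio.run(run_open_loop(fake_send, [0.0, 0.005, 0.02]))
print(len(latencies), "completed,", failures, "failed")
```

A closed-loop driver that waits for each response before sending the next request would hide queueing delay, which is why the sketch schedules every request up front.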
Performance Results
zerank-2 Latency Distribution
| Latency Threshold | Requests Exceeding Threshold |
|---|---|
| >75ms | 100.0% |
| >100ms | 100.0% |
| >150ms | 50.5% |
| >200ms | 21.2% |
| >250ms | 11.3% |
| >500ms | 2.7% |
| >750ms | 1.4% |
| >1s | 0.9% |
| >3s | 0.0% |
| >5s | 0.0% |
| >10s | 0.0% |
| >30s | 0.0% |
| Failed | 0.0% |
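A table of this shape can be derived from raw latency samples by counting, for each threshold, the share of requests that took strictly longer. The sketch below assumes latencies are recorded in milliseconds for successful requests only; the sample values are made up for illustration and are not the measurements reported here. How failed requests are counted against each threshold is not specified in this document, so they would need separate handling.

```python
def exceedance_table(latencies_ms, thresholds_ms):
    """Map each threshold to the percent of requests slower than it."""
    n = len(latencies_ms)
    if n == 0:
        raise ValueError("no latency samples")
    return {
        t: 100.0 * sum(1 for x in latencies_ms if x > t) / n
        for t in thresholds_ms
    }

# Made-up sample of ten request latencies (ms), for illustration only.
sample = [120, 140, 160, 180, 210, 260, 300, 520, 90, 110]
table = exceedance_table(sample, [100, 150, 200, 500])
for threshold, pct in table.items():
    print(f">{threshold}ms | {pct:.1f}%")
```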
Comparative Performance
| Threshold | zerank-2 | Cohere rerank-3.5 | Jina reranker m0 | Voyage rerank-2.5 |
|---|---|---|---|---|
| >150ms | 50.5% | 34.3% | 100.0% | 80.5% |
| >500ms | 2.7% | 14.3% | 70.8% | 10.9% |
| >1s | 0.9% | 11.6% | 57.4% | 9.7% |
| >10s | 0.0% | 6.4% | 55.7% | 9.2% |
| Failed | 0.0% | 0.0% | 55.7% | 9.2% |
Key Metrics
Zero failures across all test conditions
97.3% of requests completed under 500ms
99.1% of requests completed under 1 second
100% of requests completed under 3 seconds
zerank-2 maintains consistent performance across the entire latency distribution, with no requests exceeding 3 seconds.
Important Note on Rate Limits
When testing zerank-2, keep in mind that our API enforces rate limits to ensure fair resource allocation. If your usage exceeds 2,000,000 bytes per minute, requests will be moved to a slower processing queue, which will negatively impact the latency you observe.
For accurate latency testing, ensure your test traffic stays within these limits. If your production needs require higher rate limits, please contact us or join our Slack to discuss custom arrangements.
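Before running a test, it can help to estimate how many requests per minute fit inside the 2,000,000-byte budget. The sketch below is a rough estimate under stated assumptions: it treats the budget as the sum of document and query bytes per request, which may not match how the API actually counts usage, and the function name and example sizes are hypothetical.

```python
BYTES_PER_MINUTE_LIMIT = 2_000_000  # limit stated in the rate-limit note

def max_requests_per_minute(docs_per_request, bytes_per_doc,
                            query_bytes=0, limit=BYTES_PER_MINUTE_LIMIT):
    """Estimate how many requests per minute stay within the byte budget."""
    per_request = docs_per_request * bytes_per_doc + query_bytes
    if per_request <= 0:
        raise ValueError("request size must be positive")
    return limit // per_request

# Hypothetical example: 50 documents of 500 bytes each per request.
print(max_requests_per_minute(50, 500))  # 2_000_000 // 25_000 = 80
```

If the planned request rate is higher than this estimate, either shrink the payloads or arrange a higher limit before drawing conclusions from the measured latencies.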