LLM Rate Limiting That Actually Works at Scale
Every LLM provider enforces limits — requests per minute, tokens per minute, and concurrent connections. When multiple workers or services share the same API key, you need central coordination. Not retries. Not a proxy gateway. Coordination.
The LLM Quota Problem
LLM requests are expensive — in latency, in compute, and in the work done to build the prompt. When a 429 hits after all that prep, you've wasted it. The underlying cause is always the same: RPM limits, TPM limits, concurrency caps, and multiple workers all sharing the same quota with no coordination between them.
The fix isn't per-worker throttling — it's centralized coordination. One control plane that all workers check before sending a request. When capacity is available, they proceed. When it's not, they wait in line.
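The pattern is check-then-send rather than send-then-retry. A real deployment keeps this state in a shared service so every worker sees the same counters; as a single-process analogue of the pattern (all names here are illustrative, not RateQueue's API):

```python
import asyncio
import time

class CentralGate:
    """Single-process sketch of a central control plane: every worker
    calls wait_for_slot() before sending, and waits in line when the
    per-minute budget is spent."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self._stamps: list[float] = []   # send times within the last 60 s
        self._lock = asyncio.Lock()

    async def wait_for_slot(self) -> None:
        while True:
            async with self._lock:
                now = time.monotonic()
                # drop sends older than the sliding 60-second window
                self._stamps = [t for t in self._stamps if now - t < 60]
                if len(self._stamps) < self.rpm:
                    self._stamps.append(now)
                    return                # capacity available: proceed
            await asyncio.sleep(0.05)     # no capacity: wait, re-check

gate = CentralGate(rpm=1000)

async def worker(prompt: str) -> None:
    await gate.wait_for_slot()            # coordinate before sending
    # response = await client.messages.create(...)  # then call the API
```

Because every worker goes through the same gate, the prompt-building work happens only for requests that will actually be sent.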
Load-Based Rate Limiting for Token Budgets
A TPM limit means different requests consume different fractions of your budget. A 500-token prompt costs far less than a 15,000-token context window. RateQueue's load-scoped limiting lets you pass the actual token count — it deducts the right amount from the budget per request, not a flat "1 per call."
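The accounting behind load-scoped limiting can be sketched in a few lines, assuming a 60-second sliding window (the class and method names are hypothetical, not RateQueue internals):

```python
class TokenBudget:
    """Sketch of load-scoped limiting: deduct each request's actual
    token count from a per-minute budget instead of a flat 1."""

    def __init__(self, tpm: int):
        self.tpm = tpm
        self.spent: list[tuple[float, int]] = []  # (timestamp, tokens)

    def try_acquire(self, load: int, now: float) -> bool:
        # keep only spending inside the sliding 60-second window
        self.spent = [(t, n) for t, n in self.spent if now - t < 60]
        used = sum(n for _, n in self.spent)
        if used + load <= self.tpm:
            self.spent.append((now, load))   # deduct the real cost
            return True
        return False                         # over budget: caller waits
```

Under a 200,000 TPM budget, a 500-token prompt leaves room for hundreds more calls, while a single 15,000-token context window consumes thirty times as much headroom per request.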
```python
import ratequeue.aio as rq

async with rq.acquire(
    "anthropic-claude",
    load=prompt_token_count,  # actual tokens consumed
    api_key=RATEQUEUE_API_KEY,
):
    response = await anthropic_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
```
Stack Multiple Limits on One Resource
A real LLM resource might enforce all of these at once:
- RPM: 1,000 requests per minute. Counts each acquire call as 1.
- TPM: 200,000 tokens per minute. Deducts the load value you pass.
- Concurrency: 10 active requests at once. Released when the context exits.
All three enforced simultaneously. A request waits if any constraint would be violated.
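One way to picture the semantics: admission is the logical AND of every limit on the resource. A single-process sketch (illustrative names, not RateQueue internals):

```python
class StackedLimits:
    """Sketch: a request is admitted only when every limit on the
    resource has headroom; if any one would be violated, it waits."""

    def __init__(self, rpm: int, tpm: int, concurrency: int):
        self.rpm, self.tpm, self.concurrency = rpm, tpm, concurrency
        self.calls: list[float] = []               # request timestamps
        self.tokens: list[tuple[float, int]] = []  # (timestamp, tokens)
        self.active = 0                            # in-flight requests

    def _prune(self, now: float) -> None:
        self.calls = [t for t in self.calls if now - t < 60]
        self.tokens = [(t, n) for t, n in self.tokens if now - t < 60]

    def try_acquire(self, load: int, now: float) -> bool:
        self._prune(now)
        if (len(self.calls) + 1 > self.rpm
                or sum(n for _, n in self.tokens) + load > self.tpm
                or self.active + 1 > self.concurrency):
            return False              # any one violation blocks admission
        self.calls.append(now)        # count 1 against RPM
        self.tokens.append((now, load))  # deduct load against TPM
        self.active += 1              # hold a concurrency slot
        return True

    def release(self) -> None:
        self.active -= 1              # slot freed when the context exits
```

Note the asymmetry: RPM and TPM spending ages out of the sliding window on its own, while the concurrency slot is held until the request finishes and is explicitly released.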
Works with Any LLM Provider
RateQueue isn't provider-specific. Any API with rate limits works:
- OpenAI (GPT-4o, o1, o3)
- Anthropic Claude
- Google Gemini
- Cohere Command
- Mistral
- Amazon Bedrock
- Azure OpenAI
- Any provider with rate limits
Coordinate your LLM quota in minutes
Free plan includes one resource and one limit. Enough to validate it works in production before upgrading.