LLM Rate Limiting That Actually Works at Scale

Every LLM provider enforces limits — requests per minute, tokens per minute, and concurrent connections. When multiple workers or services share the same API key, you need central coordination. Not retries. Not a proxy gateway. Coordination.

The LLM Quota Problem

LLM requests are expensive — in latency, in compute, and in the work done to build the prompt. When a 429 hits after all that prep, you've wasted it. The underlying cause is always the same: RPM limits, TPM limits, concurrency caps, and multiple workers all sharing the same quota with no coordination between them.

The fix isn't per-worker throttling — it's centralized coordination. One control plane that all workers check before sending a request. When capacity is available, they proceed. When it's not, they wait in line.

Load-Based Rate Limiting for Token Budgets

A TPM limit means different requests consume different fractions of your budget. A 500-token prompt costs far less than a 15,000-token context window. RateQueue's load-scoped limiting lets you pass the actual token count — it deducts the right amount from the budget per request, not a flat "1 per call."

Python

import anthropic
import ratequeue.aio as rq

anthropic_client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the env

async def generate(prompt: str, prompt_token_count: int) -> str:
    async with rq.acquire(
        "anthropic-claude",
        load=prompt_token_count,  # actual tokens consumed
        api_key=RATEQUEUE_API_KEY,
    ):
        response = await anthropic_client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
    return response.content[0].text

Stack Multiple Limits on One Resource

A real LLM resource might enforce all of these at once:

  • RPM: 1,000 requests per minute. Each acquire call counts as 1.
  • TPM: 200,000 tokens per minute. Deducts the load value you pass.
  • Concurrency: 10 active requests at once. The slot is released when the context exits.

All three enforced simultaneously. A request waits if any constraint would be violated.

Works with Any LLM Provider

RateQueue isn't provider-specific. Any API with rate limits works:

  • OpenAI (GPT-4o, o1, o3)
  • Anthropic Claude
  • Google Gemini
  • Cohere Command
  • Mistral
  • Amazon Bedrock
  • Azure OpenAI
  • Any provider with rate limits

Coordinate your LLM quota in minutes

Free plan includes one resource and one limit. Enough to validate it works in production before upgrading.