OpenAI Rate Limits: Fix Them for Real
OpenAI enforces rate limits per API key — requests per minute (RPM), tokens per minute (TPM), and requests per day (RPD). When you run multiple workers against the same key, they all hit the same limits but have no visibility into what the others are consuming.
Why Multiple Workers Break OpenAI's Limits
OpenAI's limits are per API key, not per process. If you have 4 workers and each one sends 25 RPM, you're at 100 RPM total — well over a 60 RPM limit. Each worker thinks it's behaving within the rules. Collectively, they aren't. This is a coordination problem, not a per-worker problem, and no amount of per-worker throttling will reliably fix it when worker count changes.
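The arithmetic is worth spelling out. This minimal sketch uses the hypothetical numbers from above (60 RPM key limit, 25 RPM per worker); OpenAI's actual limits vary by model and usage tier:

```python
# Per-worker throttling vs. a shared per-key limit.
# Each worker honors its own budget, but the key sees the sum.

KEY_LIMIT_RPM = 60

def aggregate_rpm(workers: int, per_worker_rpm: int) -> int:
    """Total request rate hitting the shared API key."""
    return workers * per_worker_rpm

# 4 workers at 25 RPM each: every worker is "compliant", the key is not.
assert aggregate_rpm(4, 25) == 100   # over the 60 RPM key limit
assert aggregate_rpm(2, 25) == 50    # under it -- until you scale up
```

The per-worker budget that keeps you compliant depends on the current worker count, which is exactly why static per-worker throttling breaks the moment you scale.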
Token-Aware Rate Limiting with RateQueue
OpenAI's TPM limit is tricky because each request consumes a different number of tokens. RateQueue handles load-based limiting — you pass the expected token count per request, and it coordinates against the total budget. A short prompt consumes little, a large context window consumes a lot. Both are accounted for correctly.
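You need an estimated token count before the request is sent. One common heuristic, sketched below, is roughly 4 characters per token for English text plus headroom for the completion; for exact counts you would use a real tokenizer such as tiktoken. The function name and the overhead constants here are illustrative assumptions, not part of any SDK:

```python
# Rough token estimate for a chat request, assuming ~4 chars per token.
# A small per-message overhead and a completion reservation are added so
# the estimate errs high rather than low (hypothetical constants).

def estimate_tokens(messages: list[dict], max_completion: int = 512) -> int:
    prompt_chars = sum(len(m.get("content", "")) for m in messages)
    prompt_tokens = prompt_chars // 4 + 4 * len(messages)  # + per-message overhead
    return prompt_tokens + max_completion  # reserve room for the reply

estimate_tokens([{"role": "user", "content": "a" * 400}])  # -> 616
```

Overestimating slightly is the safe direction: you briefly under-use the token budget instead of blowing through it.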
Python
import ratequeue.aio as rq

async with rq.acquire(
    "openai-gpt4",
    load=estimated_tokens,  # estimated tokens count against the TPM budget
    priority=priority,
    api_key=RATEQUEUE_API_KEY,
):
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
TypeScript
import { ratequeue } from "@ratequeue/sdk";

await ratequeue.acquire(
  "openai-gpt4",
  { load: estimatedTokens, priority, apiKey: process.env.RATEQUEUE_API_KEY! },
  async () => {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
    });
  },
);
Multiple Limits on One Resource
OpenAI enforces both RPM and TPM simultaneously — so you need to as well. You can stack multiple limits on a single RateQueue resource, and all of them enforce at the same time:
Limit 1: RPM
60 requests per minute, request-scoped. Each call to acquire counts as 1 regardless of size.
Limit 2: TPM
150,000 tokens per minute, load-scoped. The load value you pass is deducted from the token budget.
Both constraints enforce simultaneously. A request waits if either limit would be exceeded.
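The admission rule is "room in both budgets, or wait." This in-process sketch models the semantics for a single window; it is a simplification for illustration, since RateQueue's job is to run the same check coordinated across all your workers:

```python
# Minimal model of stacked limits: a request is admitted only if BOTH
# the request budget (RPM) and the token budget (TPM) have room.

class WindowBudget:
    def __init__(self, rpm: int, tpm: int):
        self.requests_left = rpm
        self.tokens_left = tpm

    def try_acquire(self, load: int) -> bool:
        # A real limiter would wait here; this sketch just refuses.
        if self.requests_left < 1 or self.tokens_left < load:
            return False
        self.requests_left -= 1
        self.tokens_left -= load
        return True

budget = WindowBudget(rpm=60, tpm=150_000)
assert budget.try_acquire(100_000)       # fits both budgets
assert not budget.try_acquire(60_000)    # RPM has room, TPM does not
```

Note the second request is blocked by the token budget even though 59 request slots remain: either limit alone is enough to make a caller wait.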
Prioritize Interactive Requests
Background jobs and user-facing requests both hit the same OpenAI key. When capacity is constrained, you want interactive requests served first. Priority is a numeric value — higher numbers are served before lower numbers when the resource is near its limit. Background enrichment jobs set priority 1; user-facing calls set priority 100. The user doesn't wait behind the batch job.
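The ordering this produces can be sketched with a standard priority queue. The sketch below assumes higher numbers win and that enqueue order breaks ties (FIFO within a priority level); the job names are made up for illustration:

```python
import heapq

# When capacity frees up, serve the highest-priority waiter first.
# heapq is a min-heap, so priorities are negated; a sequence number
# keeps same-priority waiters in FIFO order.

waiting: list[tuple[int, int, str]] = []
jobs = [(1, "batch-enrich"), (100, "user-chat"), (1, "batch-backfill")]
for seq, (prio, name) in enumerate(jobs):
    heapq.heappush(waiting, (-prio, seq, name))

served = [heapq.heappop(waiting)[2] for _ in range(len(waiting))]
# served == ["user-chat", "batch-enrich", "batch-backfill"]
```

The priority-100 user request jumps ahead of both priority-1 batch jobs, even though one of them was enqueued first.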
Start free, no infrastructure to set up
Create an OpenAI resource, set your RPM and TPM limits, and wrap your first API call. Works across all your workers immediately.