OpenAI Rate Limits: Fix Them for Real

OpenAI enforces rate limits per API key — requests per minute (RPM), tokens per minute (TPM), and requests per day (RPD). When you run multiple workers against the same key, they all hit the same limits but have no visibility into what the others are consuming.

Why Multiple Workers Break OpenAI's Limits

OpenAI's limits are per API key, not per process. If you have 4 workers and each one sends 25 RPM, you're at 100 RPM total — well over a 60 RPM limit. Each worker thinks it's behaving within the rules. Collectively, they aren't. This is a coordination problem, not a per-worker problem, and no amount of per-worker throttling will reliably fix it when worker count changes.
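A quick back-of-envelope check makes the problem concrete (the numbers below are hypothetical). The obvious per-worker fix is to divide the key's limit evenly across workers, but that static split silently breaks the moment the worker count changes:

```python
# A naive fix divides the key limit evenly across workers:
def per_worker_budget(key_limit_rpm: int, workers: int) -> int:
    return key_limit_rpm // workers

budget = per_worker_budget(60, 4)   # 15 RPM each, 60 RPM total: looks safe
# ...but the static split is only valid for the worker count it was
# computed against. Scale out without reconfiguring every worker, and
# the key is over its limit again:
scaled_workers = 6                  # e.g. an autoscaler adds two workers
print(budget * scaled_workers)      # 90 RPM against a 60 RPM key limit
```

Central coordination sidesteps this entirely: workers ask one shared budget for capacity instead of each guessing their share.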

Token-Aware Rate Limiting with RateQueue

OpenAI's TPM limit is tricky because each request consumes a different number of tokens. RateQueue handles load-based limiting — you pass the expected token count per request, and it coordinates against the total budget. A short prompt consumes little, a large context window consumes a lot. Both are accounted for correctly.

Python

import ratequeue.aio as rq

async with rq.acquire(
    "openai-gpt4",
    load=estimated_tokens,   # token count is deducted from the TPM budget
    priority=priority,
    api_key=RATEQUEUE_API_KEY
):
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
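The `estimated_tokens` value is yours to supply. One simple approach is a character-based heuristic that deliberately overestimates so the TPM budget is never under-reserved — a sketch, assuming the common ~4-characters-per-token rule of thumb for English text (for exact counts you would use a tokenizer such as tiktoken):

```python
def estimate_tokens(messages: list[dict], max_output_tokens: int = 512) -> int:
    """Rough token estimate for a chat request.

    Assumes ~4 characters per token for English text, adds a small
    per-message overhead, and reserves room for the model's reply.
    A slight overestimate is the safe direction for rate limiting.
    """
    prompt_chars = sum(len(m.get("content", "")) for m in messages)
    prompt_tokens = prompt_chars // 4 + 4 * len(messages)  # + per-message overhead
    return prompt_tokens + max_output_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
estimated_tokens = estimate_tokens(messages)
```

If you need the reservation to match actual usage more tightly, you can reconcile after the call using the `usage` field OpenAI returns on each response.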

TypeScript

import { ratequeue } from "@ratequeue/sdk";

let response;

await ratequeue.acquire(
  "openai-gpt4",
  { load: estimatedTokens, priority, apiKey: process.env.RATEQUEUE_API_KEY! },
  async () => {
    response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages,
    });
  }
);

Multiple Limits on One Resource

OpenAI enforces both RPM and TPM simultaneously, so your limiter needs to as well. You can stack multiple limits on a single RateQueue resource, and all of them are enforced at once:

Limit 1: RPM

60 requests per minute, request-scoped. Each call to acquire counts as 1 regardless of size.

Limit 2: TPM

150,000 tokens per minute, load-scoped. The load value you pass is deducted from the token budget.

Both constraints are enforced simultaneously. A request waits if either limit would be exceeded.
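RateQueue's server does this coordination for you, but the behavior is easy to picture as two token buckets consulted together before either is debited. Here is a local, single-process sketch of that idea — an illustration, not RateQueue's implementation:

```python
import time

class Bucket:
    """Continuously refilling token bucket: `capacity` units per `period` seconds."""
    def __init__(self, capacity: float, period: float = 60.0):
        self.capacity = capacity
        self.rate = capacity / period
        self.level = capacity
        self.last = time.monotonic()

    def refill(self) -> None:
        now = time.monotonic()
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now

rpm = Bucket(60)        # Limit 1: request-scoped, every acquire costs 1
tpm = Bucket(150_000)   # Limit 2: load-scoped, an acquire costs its token estimate

def try_acquire(estimated_tokens: int) -> bool:
    rpm.refill()
    tpm.refill()
    # Check BOTH limits before taking from either, so a refusal by one
    # bucket never drains the other.
    if rpm.level >= 1 and tpm.level >= estimated_tokens:
        rpm.level -= 1
        tpm.level -= estimated_tokens
        return True
    return False
```

Two 100,000-token requests back to back show the interplay: the first succeeds, the second is refused by the TPM bucket even though the RPM bucket still has dozens of requests of headroom.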

Prioritize Interactive Requests

Background jobs and user-facing requests both hit the same OpenAI key. When capacity is constrained, you want interactive requests served first. Priority is a numeric value — higher numbers are served before lower numbers when the resource is near its limit. Background enrichment jobs set priority 1; user-facing calls set priority 100. The user doesn't wait behind the batch job.
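Under the hood this is a scheduling question: when several requests are waiting for capacity, which one gets the next slot? Conceptually it behaves like a max-priority queue with FIFO tie-breaking — a sketch of the idea, not RateQueue's actual scheduler:

```python
import heapq
import itertools

class PriorityWaitQueue:
    """Pending requests ordered by priority (higher served first);
    ties are broken FIFO by arrival order."""
    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def enqueue(self, priority: int, request: str) -> None:
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), request))

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]

q = PriorityWaitQueue()
q.enqueue(1, "batch-enrichment")     # background job, queued first
q.enqueue(100, "chat-completion")    # user-facing call arrives later
q.enqueue(1, "batch-enrichment-2")

print(q.next_request())  # prints "chat-completion": served first despite arriving later
```

Equal-priority requests still drain in arrival order, so background work makes progress whenever no interactive request is waiting.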

Start free, no infrastructure to set up

Create an OpenAI resource, set your RPM and TPM limits, and wrap your first API call. Works across all your workers immediately.