Load-Scoped Limits

Not All Requests Are Equal — Your Rate Limiter Shouldn't Treat Them That Way

A standard rate limiter counts requests. A 10-token prompt and a 100k-token context window both count as “1 request.” But they consume wildly different amounts of capacity. Load-scoped limiting counts the actual weight of each request — so your limits reflect real resource consumption.

The Problem with Request-Count Limiting

Your LLM API allows 100,000 TPM. Your rate limiter allows 60 RPM. You send 60 requests with 3,000 tokens each — 180,000 tokens total. You stayed within the RPM limit and blew through the TPM limit by 80%. The request-count limiter had no concept of how much each request actually consumed.
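The arithmetic of the mismatch, worked out with the figures above:

```python
TPM_LIMIT = 100_000       # tokens per minute the API allows
RPM_LIMIT = 60            # requests per minute the rate limiter allows
TOKENS_PER_REQUEST = 3_000

# 60 requests/minute at 3,000 tokens each:
tokens_sent = RPM_LIMIT * TOKENS_PER_REQUEST   # 180,000 tokens

# How far over the token budget the request-count limiter let you go:
overshoot = tokens_sent - TPM_LIMIT            # 80,000 tokens over

# To stay under TPM with request counting alone, you'd have to throttle to:
safe_rpm = TPM_LIMIT // TOKENS_PER_REQUEST     # 33 requests/minute
```

And 33 RPM is only safe for this particular request size — the moment prompt sizes change, the number is wrong again.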

This mismatch is especially sharp with LLMs, where request size varies by orders of magnitude. It also applies to file uploads (bytes), database queries (rows scanned), and any API where the actual cost of a call varies per request.

Load-Scoped Limiting: Count What Actually Matters

With load-scoped limits, each request passes a load value — the actual weight of that request. The limit window tracks cumulative load, not request count. Short requests barely affect the budget. Long-context requests consume their real weight.

import os

import anthropic
import ratequeue.aio as rq

anthropic_client = anthropic.AsyncAnthropic()
RATEQUEUE_API_KEY = os.environ["RATEQUEUE_API_KEY"]

def estimate_tokens(messages: list) -> int:
    # rough estimate: ~4 chars per token
    return sum(len(m["content"]) // 4 for m in messages) + 100

async def call_llm(messages: list):
    estimated = estimate_tokens(messages)

    async with rq.acquire(
        "anthropic-claude",
        load=estimated,          # this is the token weight
        api_key=RATEQUEUE_API_KEY
    ):
        return await anthropic_client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=messages
        )

Works for Any Unit of Work

Load doesn't have to be tokens. Any numeric value that represents the actual weight of a request works.

import { ratequeue } from "@ratequeue/sdk";

// Limit by data volume for an export API
await ratequeue.acquire(
  "export-api",
  {
    load: estimatedBytes,  // e.g. 1_500_000 for a 1.5MB export
    apiKey: process.env.RATEQUEUE_API_KEY!
  },
  async () => {
    await exportData(query);
  }
);

Tokens

Enforce TPM limits for LLM APIs. Short prompts consume less budget than long context windows.

Bytes

Limit by data transfer volume for upload APIs, export endpoints, or storage writes.

Credits

If your API provider bills by credit, pass the credit cost as load to stay within budget.

Work units

Abstract units representing processing cost — useful for internal services where requests have variable cost.
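For internal services, load is whatever cost function you define. A hypothetical example that prices a report job by rows scanned plus bytes rendered — the weights here are invented purely for illustration:

```python
def report_cost(rows_scanned: int, bytes_rendered: int) -> int:
    """Convert a report job's resource usage into abstract work units.

    Hypothetical weights: 1 unit per 1,000 rows scanned,
    plus 1 unit per 100 KB rendered, with a floor of 1 unit.
    """
    units = rows_scanned // 1_000 + bytes_rendered // 100_000
    return max(units, 1)
```

The result is passed as `load` exactly the way the token estimate was above — the limiter doesn't care what the unit means, only that it's consistent.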

Combine Request Count and Load Limits

Stack a request-count limit and a load-scoped limit on the same resource. Both are enforced simultaneously — whichever is the binding constraint at any given moment determines when requests wait. This mirrors how providers like OpenAI enforce their own limits, where RPM and TPM apply at the same time.
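Stacked limits can be sketched as two independent budgets that must both admit a request. This is a simplified fixed-window model (window reset omitted for brevity; not the ratequeue internals):

```python
class FixedWindow:
    """A budget that admits spending until `capacity` is exhausted
    within one window interval (reset logic omitted for brevity)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.spent = 0

    def admits(self, amount: int) -> bool:
        return self.spent + amount <= self.capacity


def try_acquire(rpm: FixedWindow, tpm: FixedWindow, tokens: int) -> bool:
    # Both limits must have room; whichever is tighter blocks first.
    if rpm.admits(1) and tpm.admits(tokens):
        rpm.spent += 1          # one request against the RPM budget
        tpm.spent += tokens     # its token weight against the TPM budget
        return True
    return False
```

With small prompts the RPM budget runs out first; with long-context prompts the TPM budget does — which is exactly why you want both.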

Enforce limits that match how your API actually charges you

Start free with one resource. Add load-scoped limits alongside request-count limits on the paid plan.