Rate limits

Understand TokenAir rate limits and design clients that recover cleanly.

Explanation
Last updated: July 2, 2026

How limits work

Rate limits can depend on account configuration, model family, request volume, and early access status. Different model families may have different throughput, latency, and availability profiles. The public docs do not list a fixed global quota, because limits can change during onboarding and production rollout.

Client behavior

  • Handle 429 responses explicitly.
  • Use exponential backoff with jitter.
  • Limit concurrency per model family.
  • Track latency and error rate separately for each model ID.
  • Keep queues bounded so retries do not overload your own app.
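
The first three points above can be sketched roughly as follows. This is a minimal illustration, not an official TokenAir client: `send` is a hypothetical callable that performs one request and returns a `(status_code, body)` tuple, and the base delay, delay cap, attempt count, and per-family concurrency limit are placeholder values chosen for the example, not documented TokenAir settings.

```python
import random
import threading
import time
from typing import Callable, Dict, Tuple

# Placeholder values for illustration; tune them for your own account.
BASE_DELAY_S = 0.5
MAX_DELAY_S = 30.0
MAX_IN_FLIGHT_PER_FAMILY = 4


def backoff_delay(attempt: int,
                  base: float = BASE_DELAY_S,
                  cap: float = MAX_DELAY_S,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff with full jitter.

    The upper bound doubles with each attempt until it reaches `cap`;
    the actual delay is drawn uniformly between 0 and that bound.
    """
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(send: Callable[[], Tuple[int, str]],
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send` (hypothetical), retrying only when it reports HTTP 429."""
    status, body = send()
    for attempt in range(max_attempts - 1):
        if status != 429:
            break
        sleep(backoff_delay(attempt))
        status, body = send()
    # If every attempt was rate limited, the last 429 is returned to the caller.
    return status, body


_slots_lock = threading.Lock()
_family_slots: Dict[str, threading.BoundedSemaphore] = {}


def family_slot(family: str) -> threading.BoundedSemaphore:
    """Return the semaphore that caps in-flight requests for one model family."""
    with _slots_lock:
        if family not in _family_slots:
            _family_slots[family] = threading.BoundedSemaphore(MAX_IN_FLIGHT_PER_FAMILY)
        return _family_slots[family]


def call_for_family(family: str, send: Callable[[], Tuple[int, str]], **kwargs) -> Tuple[int, str]:
    """Run one request with retries while holding a slot for its model family."""
    with family_slot(family):
        return call_with_retries(send, **kwargs)
```

A slot stays held while a request backs off, so retries count against the same per-family concurrency limit as first attempts. For the bounded-queue point, Python's standard `queue.Queue(maxsize=...)` is one option: `put_nowait` raises `queue.Full` once the limit is reached, which lets the caller shed load instead of piling up work.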

When to request higher limits

Ask for higher limits when you can share expected monthly volume, peak request rate, model mix, and production launch timing. This helps TokenAir tune access without making unsupported public promises.

Signals to monitor

  • 429 rate by model ID.
  • Queue length and time spent waiting before a retry.
  • Latency percentiles for normal traffic and retry traffic.
  • Spend or quota consumption after large batch jobs.
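
As a rough sketch of the first and third signals, the in-memory tracker below keeps a 429 rate and latency percentiles per model ID, with normal and retry traffic recorded separately. The `ModelStats` class, its method names, and the model ID in the usage note are all hypothetical; in production you would more likely export these values to your existing metrics system than hold them in process memory.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelStats:
    """Counters and latency samples for a single model ID (hypothetical helper)."""
    requests: int = 0
    rate_limited: int = 0
    latencies: Dict[str, List[float]] = field(
        default_factory=lambda: {"normal": [], "retry": []}
    )

    def record(self, status: int, latency_s: float, is_retry: bool = False) -> None:
        """Record one completed request and its latency in seconds."""
        self.requests += 1
        if status == 429:
            self.rate_limited += 1
        self.latencies["retry" if is_retry else "normal"].append(latency_s)

    def rate_429(self) -> float:
        """Fraction of recorded requests that returned 429."""
        return self.rate_limited / self.requests if self.requests else 0.0

    def percentile(self, kind: str, p: float) -> Optional[float]:
        """Nearest-rank percentile of 'normal' or 'retry' latencies."""
        data = sorted(self.latencies[kind])
        if not data:
            return None
        rank = max(1, math.ceil(p / 100 * len(data)))
        return data[rank - 1]


# One ModelStats per model ID, created on first use.
stats: Dict[str, ModelStats] = defaultdict(ModelStats)
```

For example, after `stats["example-model"].record(429, 0.05)`, calling `stats["example-model"].rate_429()` reports the share of rate-limited requests for that model ID. Keeping the retry latencies in their own bucket stops backoff delays from distorting the percentiles for normal traffic.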

Next steps