
API Gateways for Beginners: Managing, Securing, and Scaling Web APIs


Chapter 5: Rate Limiting, Quotas, and Traffic Shaping for API Protection


Why traffic controls matter at the gateway

Rate limiting, quotas, and traffic shaping are gateway controls that protect backend services from overload and abuse while keeping access fair across consumers. They also help you predict capacity, prevent noisy neighbors from degrading everyone’s experience, and reduce cost surprises (especially with pay-per-request backends and serverless).

Rate limiting vs quotas vs burst control

Rate limiting (requests per time window)

Rate limiting caps how fast a client can send requests, such as “100 requests per minute.” The goal is to smooth traffic so the backend stays within safe throughput and latency targets.

  • Good for: preventing short-term overload, controlling abusive scraping, keeping latency stable.
  • Typical expression: X requests per second/minute.

Quotas (total usage over a longer period)

Quotas cap total consumption over a longer period, such as “1,000,000 requests per month” or “10 GB egress per day.” Quotas are about entitlement and cost control rather than instantaneous load.

  • Good for: plan enforcement (free vs paid tiers), budget control, limiting total compute/egress.
  • Typical expression: X requests per day/month, or X units (bytes, tokens, compute seconds).

Burst control (allowing short spikes safely)

Burst control allows brief spikes above the steady rate while still protecting the backend. For example, you may allow a client to burst to 50 requests in a second, but only if they have been idle and have “saved up” burst capacity.

  • Good for: user-driven spikes (page loads), batch operations, mobile clients reconnecting.
  • Typical expression: steady rate + burst size (or “burst limit”).

Common algorithms (conceptual)

Token bucket (common for burst-friendly limits)

Imagine a bucket that fills with tokens at a steady rate (for example, 10 tokens per second) up to a maximum capacity (for example, 100 tokens). Each request spends one token (or more, if you weight requests). If the bucket is empty, the request is rejected (or delayed, depending on your design).


  • Why it’s popular: it naturally supports bursts (spend saved tokens) while enforcing an average rate over time.
  • How to tune: fill rate controls steady throughput; bucket size controls burst tolerance.
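The following minimal, single-process sketch (Python; an illustration of the idea, not production code, since real gateways implement this natively) shows how the two knobs interact. The cost parameter is reused later for weighted requests.

import time

class TokenBucket:
    """Minimal token bucket: tokens refill at a steady rate and requests spend them."""

    def __init__(self, fill_rate, capacity):
        self.fill_rate = fill_rate            # tokens added per second (steady rate)
        self.capacity = capacity              # maximum tokens (burst tolerance)
        self.tokens = capacity                # start full so an idle client can burst
        self.last_refill = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.fill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: 10 requests/second sustained, with bursts of up to 100.
bucket = TokenBucket(fill_rate=10, capacity=100)
if not bucket.allow():
    print("reject with 429")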

Leaky bucket (common for smoothing)

Imagine requests entering a bucket that leaks at a constant rate. If requests arrive faster than the leak rate, the bucket fills; once full, new requests are rejected (or queued). This produces a steadier output rate toward the backend.

  • Why it’s useful: it smooths traffic and protects backends that dislike bursts.
  • Trade-off: less burst-friendly unless you add queueing (which increases latency).
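As a rough sketch (using the "leaky bucket as a meter" variant, without a real queue; rejected requests are simply dropped here):

import time

class LeakyBucket:
    """Leaky bucket as a meter: the level drains at a constant rate; each request adds to it."""

    def __init__(self, leak_rate, capacity):
        self.leak_rate = leak_rate            # requests drained per second (steady output rate)
        self.capacity = capacity              # how much backlog is tolerated before rejecting
        self.level = 0.0
        self.last_leak = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain the bucket according to elapsed time.
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False                          # full: reject (or queue, in a shaping design)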

Fixed window and sliding window (simple counting)

Many systems implement “N requests per minute” by counting requests in time windows.

  • Fixed window: reset the counter at the start of each minute. Simple but can allow boundary bursts (e.g., 100 requests at 12:00:59 and another 100 at 12:01:00); see the counter sketch after this list.
  • Sliding window: approximates a rolling minute, reducing boundary effects. More accurate but typically needs more bookkeeping.
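A minimal fixed-window counter (a sketch only; it never cleans up old windows) makes the boundary-burst issue concrete, because the counter simply starts from zero at every window edge:

import time
from collections import defaultdict

WINDOW = 60      # seconds
LIMIT = 100      # requests allowed per window, per key

counters = defaultdict(int)

def allow(key):
    window = int(time.time() // WINDOW)       # current minute bucket
    counters[(key, window)] += 1              # count the request in this window only
    return counters[(key, window)] <= LIMIT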

Choosing the right limit dimensions

A “limit key” defines who the limit applies to. Gateways often combine multiple dimensions to match your fairness and protection goals.

Per API key (or per client application)

Use when each consumer has a distinct key and you want plan-based fairness. This is the most common dimension for public APIs.

  • Pros: fair across customers; easy to explain and monetize.
  • Cons: if many end-users share one key (a shared client), they compete with each other.

Per IP address

Use when you need a coarse control for anonymous traffic or to reduce abuse from a single source.

  • Pros: works even without a key; useful for early protection.
  • Cons: NAT and corporate proxies can cause many users to share one IP; attackers can rotate IPs.

Per token subject (per end-user)

Use when you want fairness at the end-user level (for example, each user gets 60 requests/minute), even if they use the same client application.

  • Pros: prevents a single heavy user from exhausting a shared app’s allowance.
  • Cons: requires a stable user identifier; consider privacy and the meaning of “user” for machine-to-machine integrations.

Per route (endpoint-level limits)

Apply different limits to different endpoints. For example, allow higher rates for read-only endpoints and stricter limits for expensive operations.

  • Examples:
    • GET /catalog: higher rate
    • POST /orders: moderate rate + burst control
    • POST /reports: low rate, strong quota, or async-only

Composite keys (recommended in many real systems)

Often you combine dimensions, such as “per API key + per route” or “per API key + per user.” This lets you protect expensive endpoints without punishing all usage.
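In practice this usually comes down to how the counter key is built. A hypothetical helper (the key format below is an assumption for illustration, not a gateway convention):

def limit_key(api_key, route, user_id=None):
    # Combine the dimensions you care about into a single counter key,
    # e.g. "per API key + per route", optionally "+ per user" for shared clients.
    parts = [api_key, route]
    if user_id is not None:
        parts.append(user_id)
    return ":".join(parts)

# e.g. "key_123:POST /reports" or "key_123:GET /catalog:user_42"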

Response behavior when limits are exceeded

HTTP 429 Too Many Requests

When a request exceeds the configured rate limit, the gateway should return 429. The response should be consistent and machine-readable so clients can back off.

Retry-After

Include Retry-After to indicate when the client can retry. Gateways may provide seconds (e.g., Retry-After: 12) or an HTTP date. If your gateway can compute the next available time precisely, this header significantly improves client behavior.

Helpful limit headers (communicating limits)

Even if not standardized across all gateways, it’s common to expose headers that help clients self-throttle. Examples include remaining requests and reset time. If your platform supports it, document the exact header names you use and keep them stable.

HTTP/1.1 429 Too Many Requests
Retry-After: 10
Content-Type: application/json

{"error":"rate_limit_exceeded","message":"Too many requests. Retry after 10 seconds."}

Setting thresholds (practical steps)

Step 1: Identify what you are protecting

Start from backend constraints and business risk, not arbitrary numbers.

  • Backend capacity: max safe RPS, CPU, DB connections, downstream rate limits.
  • Latency SLOs: at what concurrency does p95 latency degrade?
  • Cost sensitivity: endpoints that trigger expensive compute, external APIs, or large egress.

Step 2: Classify endpoints by cost

Create a simple cost model per route:

  • Cheap reads: cacheable, low DB load
  • Standard operations: typical business logic
  • Expensive operations: fan-out calls, heavy DB writes, report generation, file processing

Then assign stricter limits to expensive routes and more generous limits to cheap reads.

Step 3: Choose the fairness unit (limit key)

Decide whether fairness should be per customer app, per end-user, per IP, or a combination. For B2B APIs, per customer (API key) is often the primary unit; for consumer-facing APIs, per end-user can prevent one user from exhausting a shared app’s allowance.

Step 4: Pick a rate + burst policy

Use token bucket-style thinking:

  • Steady rate: what you can sustain continuously
  • Burst size: what you can tolerate briefly without harming the backend

Example policy: “20 requests/second with a burst of 40” for a read endpoint, and “2 requests/second with a burst of 5” for a costly write endpoint.

Step 5: Add quotas for plan enforcement and cost control

Define monthly/daily quotas aligned to pricing tiers or internal budgets. If you have multiple resource types, consider multiple quotas (requests, bytes, and compute-like units).
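Putting steps 4 and 5 together, a per-route policy can be captured in a simple table before being translated into your gateway's own configuration format. The routes and numbers below are hypothetical:

# Hypothetical per-route policies: steady rate, burst size, and quota.
POLICIES = {
    "GET /catalog":  {"rate_per_sec": 20, "burst": 40, "monthly_quota": 2_000_000},
    "POST /orders":  {"rate_per_sec": 5,  "burst": 10, "monthly_quota": 200_000},
    "POST /reports": {"rate_per_sec": 2,  "burst": 5,  "daily_quota": 500},
}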

Step 6: Validate with load tests and production telemetry

Use realistic traffic patterns (including bursts) and observe backend saturation signals. Adjust limits based on error rates, latency, and downstream throttling. Revisit limits when you change backend capacity or add new endpoints.

Handling shared clients and “noisy neighbor” scenarios

Problem: many users behind one API key

If a single API key represents many end-users (for example, a mobile app or partner integration), a per-key limit can cause internal competition and unpredictable failures.

Mitigations

  • Layered limits: apply a per-key limit (protects backend) plus a per-user limit (fairness within the key).
  • Partner-specific policies: allocate higher limits for trusted partners, but still enforce per-route caps for expensive endpoints.
  • Separate keys per tenant: if the partner is multi-tenant, require a key per tenant or include a tenant identifier in the limit key.
  • Weighted requests: charge more tokens for expensive operations so one heavy endpoint can’t dominate (see the sketch after this list).
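Weighted requests fall out naturally from the token bucket sketch shown earlier: expensive routes simply spend more tokens from the same bucket. The weights below are hypothetical.

# Hypothetical weights: expensive routes spend more tokens from the same bucket.
COSTS = {"GET /catalog": 1, "POST /orders": 3, "POST /reports": 10}

def allow_request(bucket, route):
    # Reuses the TokenBucket sketch from earlier; unknown routes cost 1 token.
    return bucket.allow(cost=COSTS.get(route, 1))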

Storage choices for counters and tokens (conceptual)

In-memory (local to a gateway instance)

The gateway keeps counters/tokens in its own memory.

  • Pros: very fast; simple; no external dependency.
  • Cons: limits are per-instance, not global. With multiple gateway replicas, a client can effectively get a higher total rate by hitting different instances (unless you use sticky routing). Restarts lose state.
  • Best for: single-instance deployments, edge throttling as a first line of defense, or when approximate limits are acceptable.

Distributed storage (shared across instances)

The gateway uses a shared data store to coordinate limits across all replicas.

  • Pros: consistent global enforcement across a cluster; better for strict plan enforcement.
  • Cons: adds network latency and a dependency; must handle store outages gracefully.
  • Best for: multi-region or horizontally scaled gateways where fairness must be consistent.

Practical guidance

  • Decide how strict you need to be: for paid quotas, prefer distributed enforcement; for basic abuse protection, local may be sufficient.
  • Plan for failure modes: decide whether the gateway should “fail open” (allow traffic) or “fail closed” (block) if the rate-limit store is unavailable. For protection, fail closed is safer; for availability, fail open may be chosen with conservative local fallback limits (both choices appear in the sketch below).
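A rough sketch of distributed enforcement, assuming a Redis store and the redis-py client (a fixed-window counter is used for simplicity, and the fail-open default is only an example of the trade-off):

import time
import redis  # assumes the redis-py client and a reachable Redis instance

r = redis.Redis(host="localhost", port=6379)

LIMIT = 100       # requests per window, per key
WINDOW = 60       # seconds
FAIL_OPEN = True  # allow traffic if the store is down (availability over strict protection)

def allow(key):
    window = int(time.time() // WINDOW)
    counter = f"rl:{key}:{window}"
    try:
        count = r.incr(counter)               # atomic shared counter across all replicas
        if count == 1:
            r.expire(counter, WINDOW * 2)     # let old windows expire on their own
        return count <= LIMIT
    except redis.RedisError:
        return FAIL_OPEN                      # fail open or fail closed: an explicit choice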

Traffic shaping patterns for spikes and overload

Soft throttling vs hard throttling

  • Hard throttling: reject immediately with 429 when over limit. Simple and predictable.
  • Soft throttling: delay requests (queue) to smooth bursts. This can protect backends but increases latency and can tie up gateway resources if queues grow.

Spike arrest (very short window protection)

Use a tight limit over a very small interval (for example, per second) to stop sudden floods, even if the per-minute rate looks acceptable. This is useful against accidental loops and sudden retries.

Concurrency limits (protecting slow backends)

Rate limits control request arrival rate, but if the backend slows down, in-flight requests can pile up. A concurrency cap limits how many requests can be active at once per route or per client, preventing cascading timeouts.
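Conceptually, a concurrency cap is just a semaphore around the backend call. A minimal sketch (call_backend is a hypothetical placeholder for proxying the request upstream):

import threading

MAX_IN_FLIGHT = 20                          # per route or per client, depending on the key
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def call_backend(request):
    # Hypothetical placeholder: in a real gateway this proxies to the upstream service.
    return 200

def handle(request):
    # Reject immediately if too many requests are already in flight.
    if not slots.acquire(blocking=False):
        return 429                          # or 503, depending on your error design
    try:
        return call_backend(request)
    finally:
        slots.release()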

Backoff-friendly error design

When returning 429, provide a stable error code and a clear retry hint. Clients should implement exponential backoff and jitter; your gateway responses should make that easy by including Retry-After and consistent error payloads.
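On the client side, that might look like the following sketch (using the requests library; it assumes Retry-After is given in seconds rather than as an HTTP date):

import random
import time
import requests  # assumes the requests library is available

def get_with_backoff(url, max_attempts=5):
    delay = 1.0
    for _ in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Prefer the server's hint; otherwise use exponential backoff with jitter.
        retry_after = resp.headers.get("Retry-After")
        wait = float(retry_after) if retry_after else delay + random.uniform(0, delay)
        time.sleep(wait)
        delay *= 2
    return resp                              # still throttled after max_attempts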

Protecting serverless and pay-per-request backends from cost blowups

Why serverless needs stricter controls

Serverless platforms can scale quickly, which is great for availability but risky for cost: a sudden spike can multiply invocations and downstream calls. Rate limits and quotas act as a budget guardrail.

Practical steps

  • Set per-route limits on expensive functions: stricter rate and smaller burst for endpoints that trigger heavy compute or external API calls.
  • Add quotas aligned to budget: daily quotas can stop runaway costs faster than monthly quotas.
  • Use weighted limits: charge more “units” for endpoints proportional to cost (e.g., report generation costs 10 units).
  • Prefer async for heavy work: for expensive operations, accept the request, enqueue work, and return a job ID (then rate-limit job creation and job status polling separately).
  • Protect against retry storms: if downstream errors cause clients to retry, enforce stricter spike arrest and ensure 429 responses are cacheable only when appropriate (usually they are not).

Communicating limits to API consumers

Document the policy in consumer-friendly terms

Clients behave better when limits are explicit. Document:

  • Rate limits: per minute/second and whether bursts are allowed
  • Quotas: reset schedule and what happens when exceeded
  • Scope of limits: per key, per user, per IP, per route
  • Over-limit behavior: 429 usage, Retry-After, and whether limits differ by plan

Expose runtime feedback

If possible, return headers that help clients self-regulate (remaining, reset time). Keep the behavior consistent across endpoints so client libraries can implement a single backoff strategy.

Provide upgrade and support paths

When a client hits limits frequently, they need a clear next step: optimize usage (batching, caching), reduce polling, or request higher limits. Make sure your error message and documentation point to the correct action without revealing sensitive internal capacity details.

Now answer the exercise about the content:

You want to allow brief request spikes above a steady rate while still enforcing an average limit over time. Which algorithm best fits this goal, and what two parameters mainly control its behavior?


Token bucket naturally supports bursts by letting clients spend saved tokens, while still enforcing an average rate. The fill rate sets the sustained throughput and the bucket size determines how large a short spike can be.

Next chapter

Request and Response Transformation: Headers, Payloads, and Protocols
