Understanding Rate Limiting Algorithms

By Amit Bisht

Introduction

Every popular API you've ever used — Twitter, GitHub, Stripe — enforces rate limiting. It's the traffic cop that says: "You can make 100 requests per minute. After that, slow down."

Without it, a single misbehaving client (or an attacker) could flood your servers and bring everything down for everyone else.

But how does the "counting" actually work under the hood? There are five classic algorithms, each with different trade-offs. Let's break them all down — simply.


1. Token Bucket

The Idea

Imagine you have a bucket that holds tokens. Tokens are added to the bucket at a fixed rate (e.g., 10 tokens per second). Every time you make a request, it consumes one token. If the bucket is empty — your request is rejected.

The bucket has a maximum capacity, so tokens don't pile up forever.

Real-World Analogy

Think of it like a Delhi Metro Smart Card. Your card gets topped up automatically — say, ₹20 every hour, up to a max of ₹200. Every metro ride costs ₹30. If your balance is ₹0, the turnstile won't let you through. But if you haven't travelled in a while, the balance has accumulated and you can make several trips back-to-back without worrying.

Burst-friendly: If you haven't made requests in a while, tokens accumulate and you can fire off a burst of requests.

How It Works

Max tokens (capacity): 10
Refill rate: 2 tokens/second

t=0s → bucket has 10 tokens
Request 1 → 9 tokens left
Request 2 → 8 tokens left
...
Request 10 → 0 tokens left
Request 11 → ❌ rejected (bucket empty)

t=1s → 2 new tokens added → 2 tokens available
Request 12 → ✅ allowed
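As a concrete sketch, here's a minimal single-process token bucket in Python. The class name, the lazy refill-on-access strategy, and the use of a monotonic clock are illustrative choices, not taken from any particular library:

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity            # max tokens the bucket can hold
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(capacity)       # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazily refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A full bucket of 10 absorbs a burst of 10, then rejects the 11th
bucket = TokenBucket(capacity=10, refill_rate=2)
results = [bucket.allow() for _ in range(11)]
print(results.count(True))  # → 10
```

Note that only the token count and the last refill time are stored, which is why the memory footprint stays constant no matter how many requests arrive.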

Pros & Cons

Token Bucket
✅ Handles burst traffic gracefully
✅ Simple and memory-efficient (just store token count + last refill time)
❌ Slightly tricky to implement correctly in distributed systems

2. Leaky Bucket

The Idea

Picture a bucket with a small hole at the bottom. Water (requests) pours in from the top, and leaks out at a constant rate from the bottom. If water pours in faster than it leaks out, the bucket overflows — those requests are dropped.

Real-World Analogy

Think of a chai tapri (roadside tea stall) with one bhaiya. No matter how many people show up, bhaiya makes exactly one cup of chai per minute. People queue up patiently. If the line gets too long (bucket overflows), he tells newcomers to come back later.

Smooth output: Unlike token bucket, leaky bucket enforces a perfectly steady outflow rate, regardless of bursty input.

How It Works

Bucket capacity: 10 requests
Outflow (processing) rate: 2 requests/second

t=0s → 5 requests arrive → queue has 5
t=0.5s → 2 requests processed → queue has 3
t=1s → 8 more requests arrive → queue has 11 → ❌ 1 dropped (overflow)
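A minimal queue-based sketch in the same Python style as before; the drain-on-access approach below is one of several ways to model the constant leak:

```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity        # max requests the queue can hold
        self.leak_rate = leak_rate      # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain whole requests that have "leaked out" since the last check
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) < self.capacity:
            self.queue.append(now)      # enqueue the request
            return True
        return False                    # overflow → drop

# 12 requests arrive at once: 10 fit in the bucket, 2 overflow
bucket = LeakyBucket(capacity=10, leak_rate=2)
results = [bucket.allow() for _ in range(12)]
print(results.count(True))  # → 10
```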

Pros & Cons

Leaky Bucket
✅ Guarantees a smooth, stable output rate
✅ Protects downstream services from sudden spikes
❌ Burst requests just get queued/dropped — no friendly burst allowance
❌ Requests can experience extra latency waiting in the queue

3. Fixed Window Counter

The Idea

Divide time into fixed windows (e.g., each minute). Keep a counter per window. Every request increments the counter. If the counter exceeds the limit, reject the request. When the window resets, start counting from zero again.

Real-World Analogy

Imagine a marriage banquet with a fixed thali limit per session. The caterer allows 10 refills per sitting. At 3:00 PM the session resets and your refill count goes back to zero — it doesn't matter that you had 10 refills at 2:59 PM.

Simplest algorithm to implement, but has a notorious edge-case: the "boundary burst" problem.

How It Works

Limit: 10 requests per minute

Window 1: 12:00:00 – 12:00:59
  Requests: 10 → ✅ all allowed

Window 2: 12:01:00 – 12:01:59
  Requests: 10 → ✅ all allowed

But here's the problem:

12:00:55 → 10 requests arrive → ✅ all allowed (end of window 1)
12:01:05 → 10 requests arrive → ✅ all allowed (start of window 2)

In just 10 seconds, 20 requests slipped through — double the intended limit! This is the boundary burst problem.
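A minimal sketch that makes the boundary burst reproducible; the timestamp is passed in explicitly (seconds since 12:00:00) rather than read from a clock, purely so the edge case is easy to demonstrate:

```python
class FixedWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None      # index of the active window
        self.count = 0

    def allow(self, now: float) -> bool:
        # Identify which fixed window this timestamp falls into
        window = int(now // self.window_seconds)
        if window != self.current_window:
            self.current_window = window    # new window → counter resets
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowCounter(limit=10, window_seconds=60)
# 10 requests at 12:00:55 (t=55) plus 10 more at 12:01:05 (t=65):
burst = [limiter.allow(55.0) for _ in range(10)] + \
        [limiter.allow(65.0) for _ in range(10)]
print(burst.count(True))  # → 20: the boundary burst in action
```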

Pros & Cons

Fixed Window Counter
✅ Extremely simple to implement
✅ Very low memory usage
❌ Boundary burst: 2× the allowed rate can slip through near window edges

4. Sliding Window Log

The Idea

Instead of fixed time buckets, keep a log (list) of timestamps for every request. When a new request arrives, remove all timestamps older than the window size, then check if the remaining count is within the limit.

Real-World Analogy

Think of an IRCTC tatkal booking window at a railway station. The clerk keeps a written log of every ticket issued with the exact time. When someone new comes to book, he crosses out all entries older than one hour, counts the remaining ones, and only proceeds if the hourly quota isn't exhausted. No estimation — he goes by the exact log, entry by entry.

No boundary burst problem. The window is always anchored to right now, not to a fixed clock tick.

How It Works

Limit: 3 requests per minute

12:00:10 → request → log: [10]       → count=1 ✅ allowed
12:00:30 → request → log: [10,30]    → count=2 ✅ allowed
12:00:50 → request → log: [10,30,50] → count=3 ✅ allowed
12:00:55 → request → already 3 timestamps in the last minute → ❌ rejected

12:01:15 → request → remove 10 (older than 1 min ago)
           log: [30,50]    → count=2 ✅ allowed → log: [30,50,75]

12:01:35 → request → remove 30 as well
           log: [50,75]    → count=2 ✅ allowed
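Here's a minimal sketch in Python. One design choice to flag: this version logs only *accepted* requests; some implementations also log rejected ones, which makes the limiter stricter against clients that keep retrying:

```python
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()      # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Evict timestamps that have fallen out of the sliding window
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

# Timestamps in seconds since 12:00:00
limiter = SlidingWindowLog(limit=3, window_seconds=60)
print([limiter.allow(t) for t in [10, 30, 50, 55, 75, 95]])
# → [True, True, True, False, True, True]
```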

Pros & Cons

Sliding Window Log
✅ Most accurate — no boundary bursts
✅ Enforces the limit precisely at any point in time
❌ High memory usage — must store a timestamp for every request
❌ Doesn't scale well for high-traffic APIs

5. Sliding Window Counter

The Idea

A clever hybrid of Fixed Window Counter and Sliding Window Log. Instead of storing every timestamp, use the counts from the current and previous fixed windows, then estimate the sliding window count using a weighted formula.

Estimated count = (previous window count × overlap ratio) + current window count

The overlap ratio is how much of the previous window falls inside the current sliding window.

Real-World Analogy

You're a dhaba owner tracking how many rotis were served. You don't note the exact time of every roti — that's too much hassle. Instead, you keep a count per shift (morning/evening). When someone asks "how many rotis in the last two hours?", you take the previous shift's total, estimate how much of it falls in the last two hours based on the overlap, and add the current shift's count. It's not perfectly precise, but it's close enough to manage the kitchen load.

How It Works

Limit: 10 requests per minute
Previous window (12:00 - 12:01): 8 requests
Current window (12:01 - 12:02): 3 requests so far
Current time: 12:01:45 (75% into the current window → 25% overlap with the previous one)

Estimated count = (8 × 0.25) + 3 = 2 + 3 = 5 ✅ allowed
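A minimal sketch that reproduces the worked example above. As before, timestamps are passed in explicitly (seconds since 12:00:00) for easy testing, and the class structure is my own illustrative choice:

```python
class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.prev_count = 0
        self.curr_count = 0
        self.curr_start = None      # start time of the current fixed window

    def allow(self, now: float) -> bool:
        window_start = now - (now % self.window)
        if self.curr_start is None:
            self.curr_start = window_start
        elif window_start > self.curr_start:
            # Roll over: current becomes previous (or zero if windows were skipped)
            is_adjacent = (window_start - self.curr_start == self.window)
            self.prev_count = self.curr_count if is_adjacent else 0
            self.curr_count = 0
            self.curr_start = window_start
        # Fraction of the previous window still inside the sliding window
        overlap = 1.0 - (now - self.curr_start) / self.window
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False

# 8 requests in 12:00–12:01, then 3 early in 12:01–12:02
limiter = SlidingWindowCounter(limit=10, window_seconds=60)
for _ in range(8):
    limiter.allow(30.0)
for _ in range(3):
    limiter.allow(70.0)
print(limiter.allow(105.0))  # 12:01:45 → estimate 8 × 0.25 + 3 = 5 < 10 → True
```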

Pros & Cons

Sliding Window Counter
✅ Low memory — only two counters needed
✅ No boundary burst problem (mostly)
✅ Good balance of accuracy and efficiency
❌ Slightly approximate — assumes requests in the previous window were evenly distributed

Side-by-Side Comparison

| Algorithm | Burst Friendly | Memory Usage | Accuracy | Complexity |
|---|---|---|---|---|
| Token Bucket | ✅ Yes | Low | High | Medium |
| Leaky Bucket | ❌ No | Low | High (smooth) | Medium |
| Fixed Window Counter | ⚠️ Edge case | Very Low | Medium | Low |
| Sliding Window Log | ❌ No | High | Very High | Medium |
| Sliding Window Counter | ⚠️ Approximate | Low | High | Medium |

Which One Should You Use?

There's no single "best" algorithm — the right choice depends on your traffic patterns, accuracy requirements, and infrastructure constraints. Here's a practical guide:

Token Bucket — "Burst is fine, but keep the average honest"

Use this when your users naturally have spiky behaviour but you still want a long-term average cap. It's forgiving for legitimate users who batch up a few requests occasionally, without letting anyone hammer your API indefinitely.

  • Good for: Public REST APIs, mobile app backends, GitHub/Twitter-style developer APIs.
  • Ask yourself: Can I tolerate a short burst of requests as long as the average rate stays within limits? If yes, token bucket is your friend.
  • Real production example: AWS API Gateway uses a token bucket model internally for its throttling.

Leaky Bucket — "I need clockwork regularity downstream"

Use this when the service behind the rate limiter can only handle a fixed, steady throughput — and a sudden spike, even a brief one, would break it. The bucket acts as a buffer and smooths out all irregularities.

  • Good for: Payment processors, SMS/email gateways, third-party API integrations with strict per-second limits.
  • Ask yourself: Does my downstream service get overwhelmed even by short bursts? If yes, leaky bucket enforces the discipline you need.
  • Watch out: Users making legitimate burst requests will experience added latency as their requests sit in the queue.

Fixed Window Counter — "Keep it simple, stakes are low"

Use this when you need something up and running quickly and the boundary burst edge case won't cause real harm. It's the easiest to implement and reason about — just a counter in Redis with a TTL.

  • Good for: Internal dashboards, admin tools, non-critical background jobs, feature-flag APIs.
  • Ask yourself: What's the worst that happens if someone sneaks in 2× the allowed requests for 10 seconds? If the answer is "not much", fixed window is perfectly fine.
  • Implementation tip: A single INCR + EXPIRE in Redis is all you need.

Sliding Window Log — "Accuracy above all else"

Use this when correctness is non-negotiable and you can afford the memory cost. Every request is tracked with its exact timestamp, so the limit is enforced precisely at every moment — no approximation, no loopholes.

  • Good for: Financial transaction APIs, compliance-sensitive endpoints, admin or privileged APIs with strict per-user contracts.
  • Ask yourself: Would a double-spend or a quota bypass — even for a fraction of a second — cause serious problems? If yes, use sliding window log.
  • Watch out: Memory usage grows linearly with request volume. Not suitable for high-throughput public APIs without aggressive pruning.

Sliding Window Counter — "Production-grade accuracy without the memory bill"

This is the algorithm most large-scale systems end up with. It fixes the boundary burst flaw of fixed windows using a simple weighted estimate, costs almost nothing in memory, and scales horizontally.

  • Good for: High-traffic public APIs, multi-tenant SaaS platforms, any system where you need per-user rate limiting at scale.
  • Ask yourself: Do I need better accuracy than fixed window, but can't afford to store a timestamp per request? Sliding window counter hits the sweet spot.
  • Real production example: Cloudflare has described using a variant of this approach for its edge rate limiting.
  • The trade-off to accept: The weighted estimate assumes requests were evenly distributed in the previous window. In practice, this approximation is accurate enough for the vast majority of use cases.

Rate Limiting in a Distributed Environment

Everything we've discussed so far assumes a single server with a single in-memory counter. The moment you run multiple instances of your service — which every production system does — things get significantly harder.

The Core Problem: Split Counters

Imagine you have 3 server instances behind a load balancer, and your limit is 10 requests per minute per user.

User sends 10 requests →
  Instance A handles 4 → counter: 4
  Instance B handles 3 → counter: 3
  Instance C handles 3 → counter: 3
Total: 10 requests → but each instance thinks the user is well within limits.

Each instance only sees its own slice. The user effectively gets 3× the allowed quota. This is the race condition at the heart of distributed rate limiting.


Solution 1: Centralised Data Store (Redis)

The most common fix is to move the counter out of each instance's memory and into a shared store like Redis.

User request → Any instance → Increment counter in Redis → Allow or Reject

Redis's atomic operations (INCR, EXPIRE, Lua scripts) ensure that no two instances double-count the same request.

-- Lua script executed atomically in Redis
local count = redis.call("INCR", key)
if count == 1 then
  redis.call("EXPIRE", key, window_seconds)
end
return count

Trade-off: Redis becomes a single point of failure. Mitigate with Redis Sentinel or Redis Cluster. Also adds a network round-trip per request — keep latency in mind.


Solution 2: Sticky Sessions (Consistent Hashing)

Route each user's requests to the same server instance consistently, using consistent hashing on the user ID or IP at the load balancer level.

User A → always → Instance 1
User B → always → Instance 2
User C → always → Instance 3

Now each instance owns its users exclusively and local in-memory counters are accurate again.

Trade-off: If an instance goes down, its users get rerouted and their counters reset — potentially allowing a burst. Also limits horizontal scalability since users aren't distributed evenly across instances.


Solution 3: Approximate Local Counters with Periodic Sync

Each instance maintains its own local counter and periodically syncs the delta to a central store (e.g., every 100ms). Instances read the global count from the store before making allow/reject decisions.

t=0ms   → Instance A: local=0, global=0
t=50ms  → Instance A processes 5 requests → local=5
t=100ms → Instance A syncs → global=5
t=110ms → Instance B reads global=5 → allows up to 5 more

Trade-off: There's a window of inaccuracy between syncs. A user could exceed the limit briefly before all instances catch up. Acceptable for most use cases; not for financial or compliance-critical systems.
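The overrun window can be seen in a toy single-process simulation. Everything here (the class name, the dict standing in for the central store) is a stand-in for illustration, not a real distributed setup:

```python
class SyncedCounters:
    """Toy simulation: per-instance local deltas plus a shared global count."""
    def __init__(self, limit: int):
        self.limit = limit
        self.global_count = 0   # stands in for the central store
        self.local = {}         # instance id → delta not yet synced

    def allow(self, instance: str) -> bool:
        delta = self.local.get(instance, 0)
        # Each instance decides from the last-synced global count + its own delta
        if self.global_count + delta < self.limit:
            self.local[instance] = delta + 1
            return True
        return False

    def sync(self):
        # Periodically (e.g. every 100ms) each instance flushes its delta
        self.global_count += sum(self.local.values())
        self.local = {k: 0 for k in self.local}

limiter = SyncedCounters(limit=10)
# Before the first sync, instances A and B each see global=0 and over-admit:
allowed = sum(limiter.allow("A") for _ in range(7)) + \
          sum(limiter.allow("B") for _ in range(7))
print(allowed)             # → 14: a brief overrun past the limit of 10
limiter.sync()
print(limiter.allow("A"))  # → False: after syncing, global=14 ≥ 10
```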


Solution 4: Distributed Rate Limiting with a Sidecar / Service Mesh

In Kubernetes environments, rate limiting is often offloaded to a sidecar proxy (like Envoy) or a service mesh (like Istio). Each pod has a sidecar that intercepts requests and communicates with a central rate limit service.

Request → Envoy Sidecar → Rate Limit Service (centralised) → Allow/Deny → App Container

This keeps rate limiting logic completely out of application code and centralises policy management.

  • Envoy + Lyft's ratelimit service is a popular open-source implementation of this pattern.
  • Istio supports rate limiting by configuring Envoy's rate limit filters through its EnvoyFilter resource.

Choosing the Right Approach

| Approach | Accuracy | Latency Impact | Complexity | Best For |
|---|---|---|---|---|
| Redis centralised | High | Low–Medium | Low | Most production systems |
| Sticky sessions | High | None | Medium | Stateful, low-churn user bases |
| Local + periodic sync | Approximate | None | Medium | High-throughput, tolerance for brief overruns |
| Sidecar / service mesh | High | Low | High | Kubernetes-native microservices |

In most systems, Redis with atomic Lua scripts is the pragmatic default. It's battle-tested, simple to reason about, and handles the distributed counter problem cleanly.


Rate limiting might feel like a background concern, but picking the right algorithm directly impacts your API's reliability, fairness, and user experience. Now you know exactly what's ticking away behind that 429 Too Many Requests response.

Thanks for reading!