As Hotstar scales to hundreds of millions of customers, we need to protect our APIs from “unwanted” usage. Here we will discuss how we implemented a custom rate limit system that scales with Hotstar and protects our APIs.

NOTE: This section talks at length about the story behind the need for a “custom” rate limit system that our API Gateway couldn’t provide. You can skip ahead to the implementation section if you want to.
Initially, the rate limiting system was introduced for a very specific use-case. The product has a feature called the “free-timer”. Using the free-timer, a “guest” user could watch paid content for a few minutes (typically 5 minutes). This helped new customers experience Hotstar without paying upfront, and it was heavily used during popular events. After the timer ended, they were shown a paywall.
Free Timer — Non Trivial Tech
The free-timer was given to a “guest” user. A “guest” is a user who hasn’t logged into the product yet, and so has a “guest token”. The guest token carries an identifier that uniquely identifies the customer.
When a “playback” request comes from a guest user, we check whether the identifier of the guest user (sent in the request) has already used the free timer or not. This state was maintained in the backend (the free-timer checking service).
Malicious actors devised a way around the free-timer check. Whenever the free timer ended, these rogue actors called the API for creating a new guest user, which gave them a new guest token with a new identifier. The subsequent playback request using this new token successfully received the playback content, since the backend (the free-timer checking service) didn’t find the identifier (as it was new). Using this loophole, rogue actors were able to play paid content for an unlimited time.

Preventing Abuse
To prevent the above exploitation, we had to put barriers in front of our guest user creation API. We explored a bunch of solutions. First we thought of introducing rate limits at the CDN (our edge API Gateway). This approach was a no-go since CDN rate limits are simple, generic DDoS protection. This was not a DDoS; it was more of a slow DoS, where the request count was relatively low.
Sure, we were seeing far more calls to the guest user API than usual, but it still wasn’t DDoS scale. Scale wasn’t the issue, since we could easily handle the guest user API and serve new tokens. What we needed was to prevent more than one or two requests to the guest user creation API from a single “source”. A source is a combination of multiple attributes like IP, userID and deviceID. We needed something more sophisticated, where we could identify and rate limit requests from certain malicious sources while letting non-malicious requests go through.
We use Ambassador (which internally uses Envoy proxy) as our API Gateway. Envoy provides pluggable “filters” which work as middleware. Using filters, we could inject checks before the request even reaches the Kubernetes service deployed in the backend.
There are many open-source filters written for Ambassador that perform functions like authentication, logging, tracing and, of course, rate limiting. Two years back, when we were searching for a rate limit plugin, we found a promising one from Lyft.
Lyft Rate Limiter-POC
We quickly did a POC and integrated Lyft’s rate-limit plugin into our Ambassador stack. After running both functional and non-functional tests, we discovered the following shortcomings:
Key formation algorithm was not performant
Redis was used as the database in Lyft’s plugin. For each request, a key of the form API_SOURCE_BUCKETID would be created, where
API is the HTTP resource path on which the rate limit is being applied, e.g. /login
SOURCE is the request source identifier, which can be the IP or the userID in the request
BUCKET ID is calculated as request time / rate limit window (integer division). E.g. if the request was made at epoch time 210 and the rate limit setup is 10req/min, then the bucket id will be 210/60 = 3, where 60 signifies 60 seconds or a minute. All requests between epoch 180 and 240 (a 60-second window) have the same bucket id of 3. Aggregation of requests from the same “source” was done in this manner.
For every request, this Redis key was “formed” and an INCR +1 operation was performed on it in Redis. Thus, each redis key held the count of requests from a source for an API in that rate limit window. Along with the INCR, the key’s expiry (basically a TTL) was also set to (BUCKET_ID + 1) * rate limit window, i.e. the end of the bucket, to clean up old expired data.
# algorithm: IP based rate-limit config of 10req/min on "/login" API
1. time_now = current_time()
2. rate_limit_window = 60 //since rate limit is per min or 60sec
3. req_ip = incoming_request.ip
4. bucket_id = time_now / rate_limit_window
5. redis_key = format(login_%req_ip%_%bucket_id%)
6. key_ttl = (bucket_id + 1) * rate_limit_window
7. pipeline two redis commands:
     INCR redis_key              // increment the request count for this key by 1
     EXPIREAT redis_key key_ttl  // set expiry at the end of the bucket (key_ttl is an absolute epoch time)
8. current_count = INCR-command.result()
9. if current_count > 10, return 429 (over limit)
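To make this concrete, here is a minimal Go sketch of the same bucketed-key check, assuming the go-redis v9 client; the function name checkBucketedLimit and the exact key layout are illustrative, not the plugin’s actual code.

package ratelimit

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// checkBucketedLimit reports whether this request pushes the source over the limit.
func checkBucketedLimit(ctx context.Context, rdb *redis.Client, api, source string, limit int64, window time.Duration) (bool, error) {
	now := time.Now().Unix()
	windowSecs := int64(window.Seconds())

	bucketID := now / windowSecs                          // same value for every request in this window
	key := fmt.Sprintf("%s_%s_%d", api, source, bucketID) // e.g. login_UserA_3
	expireAt := time.Unix((bucketID+1)*windowSecs, 0)     // end of the bucket: identical for every key in the bucket

	pipe := rdb.Pipeline()
	count := pipe.Incr(ctx, key)      // count this request
	pipe.ExpireAt(ctx, key, expireAt) // (re)set the fixed TTL on every request
	if _, err := pipe.Exec(ctx); err != nil {
		return false, err
	}
	return count.Val() > limit, nil
}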
Depending on the source, there can be millions of redis keys whose bucket id is the same but whose source is different, e.g. login_UserA_3 … login_UserB_3 … login_UserC_3 and so on. Since the bucket id was the same, the redis keys’ expiry times were also the same (in the above case, (3+1) * 60).

We had a lot of per-minute rate limit use-cases. Every redis key based on a per-minute rate limit expired at a 60-second boundary, so at every 60-second mark redis was trying to “expire” millions of keys at once.
During these times, we saw spikes in redis latencies. The latency spikes exactly matched the expiry times of the buckets, and they caused timeouts and 5xx in the service. The same was true for per-second and per-hour rate limits as well.
Ability to Blacklist — Gap
After detecting that a certain user or IP might be making over-limit requests, we wanted to block that source; however, Lyft’s rate-limit didn’t have that feature.
Master-Slave Redis — Gap
Lyft’s rate-limit could only talk to a master-slave redis cluster. This type of redis cluster just wouldn’t have scaled, since this was a very write-heavy system and we would be dealing with 1–5 million RPS. Also, the data size could be huge and wouldn’t fit on a single redis machine.
Shadow mode — Gap
We wanted to first identify APIs that we suspected were being misused, without actually blocking them. These may be P0 APIs, and blocking without verification might cause user impact. So a shadow or alert mode was needed, where the system would only report over-limit requests to us but not block them.
Due to these shortcomings in the Lyft rate-limit plugin, we decided to build our own in-house system with performance and scale in mind. We made the following changes.
Key formation algorithm changes
We needed a way to distribute the TTLs of the inserted keys (under a bucket) so that they don’t all expire at the same time. One alternative was a sliding-window algorithm, but we concluded that doing aggregations in a sliding window is very expensive and won’t work at a scale of a few million TPS.
We found a middle-ground approach: we changed the algorithm to make the redis key’s TTL flexible. Previously, given a rate-limit (say 10 req/min), the TTL for a redis key would have been pre-determined using math (ref. the Key formation algorithm section above).
In the new algorithm, the redis key structure was changed to API_SOURCE (the bucket id was removed). We do an INCR +1 and a TTL on this key to get the current count and the currently set TTL. If no TTL was set, it meant this was the first time we were seeing this key, and we would then set an expiry on it.
# new algo: IP based rate-limit config of 10req/min on "/login" API
1. time_now = current_time()
2. rate_limit_window = 60
3. req_ip = incoming_request.ip
4. redis_key = format(login_%req_ip%)
5. pipeline two redis commands:
"INCR redis_key 1"
"TTL redis_key" // get TTL of the key
6. if TTL-command.result() = 0, // no TTL set yet
a. key_ttl = time_now + rate_limit_window
b. send redis command "EXPIRE redis_key key_ttl"
7. if INCR-command.result() > 10, return 429 (over limit)
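Here is the same sketch adapted to the new algorithm, again assuming the go-redis v9 client; checkLimit is an illustrative name, and the sketch sets the expiry relative to the first-seen time rather than the service’s exact code.

package ratelimit

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// checkLimit reports whether this request pushes the source over the limit,
// starting the window from the first time the key is seen.
func checkLimit(ctx context.Context, rdb *redis.Client, api, source string, limit int64, window time.Duration) (bool, error) {
	key := fmt.Sprintf("%s_%s", api, source) // bucket id removed, e.g. login_UserA

	pipe := rdb.Pipeline()
	count := pipe.Incr(ctx, key) // count this request
	ttl := pipe.TTL(ctx, key)    // read the currently set TTL
	if _, err := pipe.Exec(ctx); err != nil {
		return false, err
	}

	// A negative TTL means no expiry is set yet, i.e. this is the first
	// request from this source in the current window.
	if ttl.Val() < 0 {
		if err := rdb.Expire(ctx, key, window).Err(); err != nil {
			return false, err
		}
	}
	return count.Val() > limit, nil
}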
This change negated the redis latency spikes by distributing the TTLs: the TTLs were now based on the first-seen time of each key, which is naturally spread out. It also added one more advantage.
Previously the window/bucket sizes were fixed, so even if a request came at the very end of a bucket, it would be counted in the bucket that was about to expire. E.g. suppose a bucket spans from T60 to T120, and a request surge comes at T119. All these requests fall under the T60–T120 bucket, which expires at T120, meaning the source gets its limit reset for the next 60 seconds. So within a window of ~10 seconds the source could make 2X the configured limit, i.e. X requests from T115–T120 (the expiring window) + X requests from T120–T125 (the new window), without being rate limited.
With the new algorithm, since the window starts when a request from a source is seen for the first time, the above flaw was negated: the source couldn’t make more requests than the limit set within a particular window.

However, the above approach brought with it a consistency issue. In the Lyft algorithm, since the window size and TTL were based on the configured rate-limit itself, finding the TTL was basic math. So every redis operation could consistently do a SET TTL without worrying that some other concurrent redis operation would change it, since the other operation would be setting the same TTL (because the TTL was fixed for a bucket).
In the new algorithm, for each request, we did an INCR and a GET TTL (instead of an INCR and a SET TTL). If the GET TTL reported that no expiry was set, we would do a SET TTL. Since GET and SET TTL are two separate redis calls over the network, there are plenty of cases where multiple concurrent requests to redis for the same key see no TTL (each appearing to see the key for the first time), and so multiple subsequent SET TTL calls are made. Thus, setting the TTL wasn’t consistent.
We figured that this was an okay trade-off to make, since our intention was to protect our APIs from misuse; the system wasn’t meant to solve a business use-case like concurrent device restrictions, where it would have to be consistent. We were okay if multiple concurrent requests did a SET TTL for the same key, since all of those requests would set a TTL within a (+ or -) 20ms variance of each other.
We ran fresh load tests with the new algorithm, and got the below performance.

We found that a machine with 2 vCPUs and 2GB RAM could do 6000 RPS. Around 7000 RPS is where it “broke” and latencies spiked to over a second.
Some other changes
We changed the redis code to support a multi-master redis cluster. This was a must for our scale; at one point during peak, we were using nine (9) cache.m5.4xlarge instances in the redis cluster.
We did another optimisation: batching redis queries and pipelining them. We wrote a batching algorithm which used the redis cluster information to determine the node a given key belongs to and batched those keys together, to be sent as a single pipelined command to that node. This behaviour was later natively supported by go-redis, to which we ported our service.
Pipelining gave us a 3x performance boost: we saw 3x less CPU usage in redis when we did micro-batching and pipelining.
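As an illustration of what that looks like today, here is a sketch of a cluster client with a pipelined batch using go-redis v9, which groups the queued commands per cluster node for us; the addresses and keys are placeholders.

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Multi-master (cluster mode) redis client.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis-node-1:6379", "redis-node-2:6379", "redis-node-3:6379"},
	})

	// Commands queued on one pipeline are grouped by the node that owns each
	// key's slot, so we pay one round trip per node instead of one per key.
	pipe := rdb.Pipeline()
	var counts []*redis.IntCmd
	for _, key := range []string{"login_UserA", "login_UserB", "login_UserC"} {
		counts = append(counts, pipe.Incr(ctx, key))
	}
	if _, err := pipe.Exec(ctx); err != nil {
		panic(err)
	}
	for _, c := range counts {
		fmt.Println(c.Val())
	}
}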

We also put in support for blocking a malicious source. Now that the key could have a flexible TTL, we could “extend” the TTL beyond the rate-limit window. If the rate limit was breached, we could extend/block the key for Y more time by simply doing another SET TTL with a new, larger value.
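A rough Go sketch of that idea, using the same go-redis v9 client as the earlier sketches; blockFor is a hypothetical block duration, not a value from the actual service.

package ratelimit

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// blockSource pushes the counter key's expiry out beyond the normal rate-limit
// window, so the over-limit source keeps getting rejected until the key expires.
func blockSource(ctx context.Context, rdb *redis.Client, key string, blockFor time.Duration) error {
	return rdb.Expire(ctx, key, blockFor).Err() // just another SET TTL with a larger value
}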
We put in an alert/shadow mode where we would set up the rate-limit for an API but not throw 429s in case the API was over-limit. This was done by pushing “over-limit” metrics to a dashboard where API owners could review them and take action (maybe choose to start blocking over-limit requests).
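A small sketch of how the shadow-mode decision can be wired, with a hypothetical Metrics interface standing in for whatever the real service pushes to the dashboard.

package ratelimit

// Metrics is a hypothetical stand-in for the metrics pipeline feeding the dashboard.
type Metrics interface {
	IncrOverLimit(api, source string)
}

// decide returns the HTTP status for a request that has already been counted.
func decide(overLimit, shadowMode bool, api, source string, m Metrics) int {
	if !overLimit {
		return 200 // under the configured limit, let it through
	}
	if shadowMode {
		m.IncrOverLimit(api, source) // report only; API owners review the dashboard
		return 200                   // do not block P0 traffic without verification
	}
	return 429 // enforcing mode: block the over-limit request
}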