We recently wrapped up our wildly successful COVID edition of the Dream 11 Indian Premier League. This year was different from previous years: we had switched our business model to make the event pay-only. This meant new, never-before-seen traffic patterns, while at the same time dealing with the same tsunamis on our ad-supported streams as well.
Behind the scenes, despite a lower peak concurrent audience than in previous years (by design), our infrastructure experienced a scale larger than ever before, thanks to a slew of new capabilities and services that we added to serve our customers: the Disney+ catalog, more geographies and subscription options, better social features, better recommendations and many more.
This meant that our infrastructure was stretched to its limits and provided us with a great deal of learning that we can now use to scale beyond them. We were expecting some of these issues and were prepared for them, but not so much for others. As we wind up this mad year, it’s worth retrospecting what we’ve been through, infrastructure-wise.
So, let us start with our monitoring stack.
Prometheus and Thanos: Can you fly blind?
At peak, our production Prometheus instances were ingesting and processing about 1 Gbps of metrics. A normally well-behaved set of Prometheus instances strained to do everything it was asked to at peak: target discovery, scraping, metrics ingestion, querying and alerting.
It gave up at about 250 GB of memory. Restarts were particularly painful because it took over 30 minutes for the write-ahead log to be replayed before ingestion could start again. While we eventually chased this down to misbehaving services sending in an inordinately large amount of metrics, it started us down the path of reconciling our metrics architecture with our growth. Here are some of the issues we hit and how we are thinking about working around them.
Let us start with the limitations.
- Performant or Slow-As-Molasses?: Prometheus does not do enough to provide metrics for measuring slow or resource-intensive queries. To be sure, newer versions of Prometheus do provide a query log, but we did not find it very helpful in practice because it is hard to know what to baseline the results against. For example, if a particular query takes 2 seconds to return, is that good or bad? What may help is measuring performance in terms of real memory consumed, even if it is an approximation, or the number of CPU cycles a query took to evaluate.
- Options, Not Really: There are options to set time limits on queries, but these unfortunately apply to recording rules and alerts (which the infrastructure platform team manages) as well as Grafana dashboard queries, which our end users have the freedom to define. So we could not set them too low (see the sketch after this list).
- Got No Legs: We think sharding Prometheus is only a short-term fix. If the number of applications grows by an order of magnitude every year, sharding will limit you at some point in the future. For example, we manage a Kubernetes namespace for every team, so running a pair of stateful Prometheus instances per namespace is not just costly but operationally intensive. Even if we bite the bullet now, we cannot operationally go beyond namespace granularity when the time comes sometime next year.
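To make those knobs concrete, here is a minimal sketch (with illustrative values, not our production settings) of where the query log and the query time limits live:

```yaml
# prometheus.yml (excerpt) — enable the query log (available in newer
# Prometheus releases) so slow queries can at least be inspected later.
global:
  scrape_interval: 30s
  query_log_file: /prometheus/query.log

# The query time limit is a server-wide flag, for example:
#   prometheus --query.timeout=2m --query.max-samples=50000000
# It applies equally to recording rules, alerts and dashboard queries,
# which is why we could not set it very low.
```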
We’re not arguing against adopting Prometheus though; you should in fact run Prometheus first, because it is the leanest and most scalable cloud-native metrics system out there. But there are certain precautions you should take, as an infrastructure platform team, to keep your setup stable and robust.
- Limit It : Enforce limits on the amount and type of metrics that are allowed to be scraped. Prometheus allows dropping metrics for a target if it exceeds certain limits (see `sample_limit` and `target_limit` in the scrape configuration). Unfortunately, we could not find any means to get more granular than that. A sketch of such a configuration follows this list.
- Homogenise Metrics & Labels : Standardize metric and label names across services, especially for compute-intensive types such as histograms. This lets you write global recording rules, which also improves querying efficiency downstream.
- Not Analytics For You : Discourage the use of Prometheus for analytics use cases. It is very tempting for application teams to use Prometheus to store and query business metrics because it is so easy to do, but Prometheus does not do well with high-cardinality metrics, querying which loads a large number of series and samples into memory.
- Watch It : Track some important Prometheus metrics through its life cycle, such as `scrape_samples_scraped` at the job as well as the target level, to understand what the utilization is across your infrastructure. Also keep a tab on cardinality using the steps outlined in this Robust Perception article (a sample alerting rule follows this list).
- Iceberg Ahead : When you do hit up against limits, consider a solution such as Cortex or Victoria Metrics, which should tide you over nicely. If that is still insufficient, it is time to get cracking and go the custom route with filtering, throttling and sampling capabilities using a middleware such as Kafka. Beware that you may need application changes when you arrive at this point, and hopefully you will have your own observability team by then.
Re-inventing auto scaling
Another thing we have outgrown is the dependency of our scaling system on the monitoring system. One of the more convenient ways to scale is to use the trusty Prometheus Adapter, which emulates a custom metrics server that the Kubernetes HPA can query. Not all of our applications depend on this architecture, but a sizeable chunk do.
At some point this breaks, not just because the underlying monitoring system breaks (see above), but also because the scaling metrics you care about, like overall service latency and throughput, get intertwined with other application infrastructure metrics such as granular API latency and throughput. Which means it is an “all or no metrics” affair. We are moving towards making scaling metrics independent of applications (which is quite a different challenge and warrants its own article).
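For context, this is roughly what the pattern looks like: an HPA scaling on a custom metric served by the Prometheus Adapter. The metric and deployment names below are illustrative, not our actual configuration.

```yaml
# Scale a deployment on a custom metric exposed via the Prometheus Adapter.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api          # illustrative workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 10
  maxReplicas: 400
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # served by the Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "500"
```

If the adapter cannot reach Prometheus, or the metric gets drowned out by other application metrics, the HPA simply has nothing to act on, which is the “all or no metrics” problem described above.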
KIAM and Node taint controller at scale
Because of our early adoption of Kubernetes (before EKS was even available), we chose to use KIAM to grant specific service pods granular access to AWS resources.
KIAM has served us well for a long time; the only requirement being that the KIAM agent daemon set pods have to start on a node before any application pods are scheduled on it. To ensure this, we deployed the nifty Nidhogg node taint controller, which is built exactly for this purpose.
This controller watches nodes and keeps them tainted until the KIAM agent starts running. Great, but the one quirk? There is a small window between a node being created and the Nidhogg controller tainting it, during which application pods are able to sneak in and fail. This becomes especially apparent at scale, when the controller is busy due to a sudden spurt of new nodes, all of which have to be reconciled.
We are currently debating whether we should spend time working on the controller or move to the AWS-native solution, which uses OIDC to marry IAM policies with Kubernetes service accounts. The latter is more effort because the trust relationships on application roles have to change, but it is a better, more native way forward.
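To make the tradeoff concrete, here is a minimal sketch of both approaches; the role and workload names are illustrative.

```yaml
# Today: KIAM resolves the role from a pod annotation (the namespace must
# also whitelist the role via the iam.amazonaws.com/permitted annotation).
apiVersion: v1
kind: Pod
metadata:
  name: payment-api
  annotations:
    iam.amazonaws.com/role: payment-api-role
spec:
  containers:
    - name: app
      image: payment-api:latest
---
# The AWS-native alternative (IRSA): the role ARN is attached to a service
# account, and pods using that service account get credentials via the
# cluster's OIDC provider — no node-level agent required.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payment-api-role
```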
Ingress controllers at scale
Another issue we ran up against was with ingresses, specifically with AWS application load balancers (ALB). We observed that some high-traffic services were experiencing connection issues during sizeable scale-down events (after the end of each match).
Careful analysis of Kubernetes and load balancer timings revealed that the state of the pods in the Kubernetes cluster was not being synced to the load balancer backends as quickly as we needed it to be.
We further chased this issue down to the fact that the reconciliation loop in the ALB ingress controller processed one load balancer at a time rather than targeting only the load balancers that needed updating.
Weirdly, in addition to pod and endpoint events, the controller also watched Kubernetes node events, which caused the reconciliation loop to run more often than required. This may be a requirement for the classic load balancer, which does not have an IP mode and can only register EC2 nodes, but it should not have been the case for application load balancers in IP mode. The issue was further amplified by the fact that we run several load balancers to cater to our traffic spikes, which leads to a large sync delay when there are many load balancers to reconcile.
Thankfully, AWS recently released a new major version of the AWS Load Balancer Controller, which addresses some of these issues and even supports network load balancers! They’ve also made it play nice with previous versions through a later release 🙌
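For reference, this is roughly what an ALB ingress running in IP mode looks like with the controller; names and hosts are illustrative, not our actual manifests.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: playback-api
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Register pod IPs directly as targets instead of EC2 nodes,
    # so node events should not matter for this load balancer.
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - host: playback.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: playback-api
                port:
                  number: 80
```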
Conclusion
A careful reader may have observed that the issues described above occur during high-slope traffic, that is, when a large number of customers arrive at your door within an extremely short span of time. This may happen to your service if there is a sudden spurt of interest, or during timed holiday shopping events.
Operationally, it is challenging to react to high-slope issues without dropping some functionality, which is one of the reasons why we spend a lot of time thinking about scalable infrastructure architectures at Hotstar.
Ultimately, we look at this exercise as discovering the limits of our tools and techniques, re-learning age-old operational principles in the context of a new set of tools, and getting better every day.
If you want to push the limits of what’s out there and have a ball of a time doing it, we’re hiring across all roles! Please check out open jobs at https://tech.hotstar.com.