Hotstar’s Journey from EC2 to Containers

Prakhar

This article describes the journey of Hotstar’s infrastructure from EC2 to Kubernetes: why, what, and how we migrated the Hotstar platform.

In 2017 we saw a maximum concurrency of around 4.7M, while in 2018 we saw around 10.3M — more than double, and huge for that time.

During IPL’18, Hotstar’s infrastructure was running on the EC2 stack (ELBs in front of EC2 machines). While we had configured Auto Scaling Groups (ASGs) for scaling up applications, we did not trust ASGs and pre-emptively scaled up before big events.

We were using pre-baked AMIs for our micro-services to reduce the boot-up time of our applications. For CI/CD we were using Jenkins, and Terraform for deploying our infrastructure. Quite straightforward, right?

Challenges at Scale

Since 10M concurrent users was a huge deal for us at the time, scaling the infrastructure for the traffic pattern we were seeing was not straightforward.

The traffic pattern of a high-concurrency match

There were many challenges: surge handling, insufficient compute capacity, and API throttling during infrastructure scale-up, to name a few.

Even with AMIs in place, spinning up new EC2 machines takes a long time, which can cause delays or outages when a surge of requests arrives on the platform within a few minutes. Since scaling was slow, we needed to keep a healthy headroom of resources for our platform, which left resources under-utilised most of the time. And since applications were not containerised, we were tied to EC2 machines, wasting memory and CPU because we couldn’t optimise EC2 instance sizes beyond a point.
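To make the headroom trade-off concrete, here is a minimal sketch of the kind of back-of-the-envelope maths behind pre-scaling. The function name, numbers, and the per-instance throughput figure are all hypothetical, not Hotstar’s actual values:

```python
import math

def capacity_with_headroom(expected_peak_rps: int,
                           per_instance_rps: int,
                           headroom_fraction: float = 0.5) -> int:
    """Instances to pre-provision so a surge can be absorbed while
    slow EC2 boot-up catches up. headroom_fraction is the extra
    capacity kept idle on top of the expected peak."""
    base = math.ceil(expected_peak_rps / per_instance_rps)
    return math.ceil(base * (1 + headroom_fraction))

# e.g. 100k RPS peak, 500 RPS per instance, 50% headroom
print(capacity_with_headroom(100_000, 500, 0.5))
```

With 50% headroom, a third of the fleet sits idle in the steady state — exactly the wastage described above.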

Another factor was the capacity available for those EC2 machines in a specific AWS region. Since we needed more machines to scale up applications, we had reached the resource limits for specific EC2 instance classes in those regions. Also, since the traffic pattern was unpredictable, we had observed internal AWS API throttling during ASG scale-ups, so we needed to scale applications in steps using ASGs.
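Step scaling means adding capacity in fixed increments tied to how far a metric has been breached, rather than jumping straight to the target. A rough sketch of the idea (the thresholds and step sizes here are illustrative, not the actual ASG policy):

```python
def step_scale(current: int, load_pct: float,
               steps=((90, 30), (75, 10), (60, 5))) -> int:
    """Return the new desired capacity. Each (threshold, increment)
    pair adds a fixed number of instances when load crosses the
    threshold, keeping individual scale-up calls small enough to
    stay under AWS API rate limits."""
    for threshold, increment in steps:
        if load_pct >= threshold:
            return current + increment
    return current

print(step_scale(100, 92))  # heavy breach -> biggest step
print(step_scale(100, 50))  # below all thresholds -> no change
```

The downside, as noted above, is that reaching a large target capacity takes several rounds of stepping, which is exactly why slow scale-up forced pre-emptive provisioning.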

So, after IPL 2018, we concluded that scaling the existing infrastructure for upcoming IPLs would be a nightmare. Given the pace at which Hotstar was growing, we knew we would see a huge surge in traffic during the upcoming big tournaments (VIVO IPL’19 and ICC WC 2019).

In 2018, after the tournament, we decided to move our platform to containers. As simple as it sounds, it was a daunting task at the time. We were running around 40–50 core applications in production, and migrating them was no cakewalk. However, the advantages we stood to gain from containers kept us motivated.

Containers are platform agnostic, so we could run our applications on any machine on any cloud provider. They boot up quickly, taking significantly less time than EC2 instances. Container resources can be tuned to each application’s requirements, which gave us huge resource optimisation. And since boot-up time is low, scaling applications is no longer an issue: we could scale up or down seamlessly compared to EC2 machines.

One more advantage of containers is that we now had standard procedures for configuring applications in terms of logging, alerts, etc. This helped us a lot in deploying new applications to production quickly. We won’t go into the details of the deployment procedures in this post.

While containerising the applications, we were also exploring options for orchestrating these containers. We decided to go with Kubernetes because of its community, case studies, and features. Kubernetes brought a new way of deploying applications at Hotstar.

Now all deployments follow the same standards. We created a Kubernetes library for standard application deployments. The library takes application configuration via a JSON file and converts it into Kubernetes objects in YAML files.
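To illustrate the shape of such a library (this is a hypothetical sketch, not Hotstar’s actual code — the field names in the JSON config are assumptions), a function can take the app’s JSON config and emit a standard Kubernetes Deployment object, ready to be serialised to YAML:

```python
import json

def render_deployment(config_json: str) -> dict:
    """Turn a small JSON app config into a Kubernetes Deployment
    object. Standard concerns (labels, selectors, resource
    requests/limits) are filled in by the library, not the developer."""
    cfg = json.loads(config_json)
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": cfg["name"], "labels": {"team": cfg["team"]}},
        "spec": {
            "replicas": cfg.get("replicas", 2),
            "selector": {"matchLabels": {"app": cfg["name"]}},
            "template": {
                "metadata": {"labels": {"app": cfg["name"]}},
                "spec": {"containers": [{
                    "name": cfg["name"],
                    "image": cfg["image"],
                    "resources": {
                        "requests": {"cpu": cfg["cpu"], "memory": cfg["memory"]},
                        "limits": {"cpu": cfg["cpu"], "memory": cfg["memory"]},
                    },
                }]},
            },
        },
    }

manifest = render_deployment(
    '{"name": "playback", "team": "infra", "image": "playback:1.0",'
    ' "cpu": "500m", "memory": "512Mi"}')
print(manifest["kind"], manifest["spec"]["replicas"])
```

The developer writes a handful of JSON fields; everything else — API version, selectors, standard labels — comes from the library, which is how deployments stay uniform across teams.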

We have also built CLI tools that use these libraries to deploy applications. This helps us separate infrastructure configuration from application configuration. For example, developers no longer have to worry about configuring logging for their applications; they just provide the team name and write logs to a standard path.

We deployed GoCD on Kubernetes and use it for our CI/CD pipelines. GoCD is a great CI/CD tool and has scaled with Hotstar’s requirements. With GoCD, we have one-click deployments of applications.

The MVP of Kubernetes at Hotstar was request-based scaling. As discussed earlier, in the EC2 world we couldn’t let the machines auto-scale, which led to a lot of wastage. With this feature, we saved a lot in infrastructure resources. Here is a more detailed post about scaling Hotstar.
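The core of request-based scaling can be sketched with the standard Kubernetes HPA formula — desired replicas grow in proportion to how far the observed per-pod request rate exceeds the target (the target and bounds below are illustrative assumptions):

```python
import math

def desired_replicas(current_replicas: int,
                     current_rps_per_pod: float,
                     target_rps_per_pod: float,
                     min_replicas: int = 2,
                     max_replicas: int = 500) -> int:
    """HPA-style calculation:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to configured bounds."""
    desired = math.ceil(
        current_replicas * current_rps_per_pod / target_rps_per_pod)
    return max(min_replicas, min(max_replicas, desired))

# 10 pods each serving 150 RPS against a 100 RPS target -> scale to 15
print(desired_replicas(10, 150, 100))
```

Because containers boot in seconds rather than minutes, reacting to the live request rate like this is viable — the headroom that EC2 forced us to pre-provision is no longer needed.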

Migration and its issues

Containers and Kubernetes sound awesome, but every good thing comes at a price! Hotstar’s infrastructure is quite large, and containerising each application was not easy; applications are also built in multiple languages. We came up with core Hotstar base images for different environments, such as Java, Python, and Golang, which take care of most of the common practices: environment setup for the specific language, logging configuration, and other infra-related configuration. Now developers just use those core images and build their applications on top of them.

With this migration, applications would run in a whole new environment, so we needed to check the performance of each application on the new setup and tune it accordingly. For example, we needed to run stress tests to verify an application’s Java parameters, such as thread count, heap size, and queue length.
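One concrete example of this tuning: unlike on a dedicated EC2 machine, a JVM in a container must size its heap against the container’s memory limit, leaving room for metaspace, thread stacks, and off-heap buffers. A minimal sketch of deriving an `-Xmx` flag from the limit (the 75% fraction is an illustrative assumption, not a Hotstar standard):

```python
def jvm_heap_flag(container_limit_mb: int,
                  heap_fraction: float = 0.75) -> str:
    """Derive an -Xmx flag from a container memory limit, keeping
    headroom for non-heap JVM memory so the container is not
    OOM-killed at full heap."""
    return f"-Xmx{int(container_limit_mb * heap_fraction)}m"

# a pod with a 2 GiB memory limit
print(jvm_heap_flag(2048))
```

Stress testing then validates that this fraction, together with thread counts and queue lengths, holds up under match-day traffic.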

We also needed to tune the Kubernetes clusters to make sure our spend was optimised and we were not wasting money. We tuned the node auto-scaling configuration and the instance types used for those nodes.

Another important area is monitoring the Kubernetes cluster, which is critical for alerts, metrics, and application scaling. We needed to tune the metrics and parameters for horizontally scaling the monitoring components (and other infra components) so they could handle the traffic.

In 2019 we streamed two big tournaments (VIVO IPL’19 and ICC WC’19) on the new Kubernetes infrastructure. We used GoCD for our CI/CD pipelines, itself deployed on Kubernetes. We used Vault/Consul for secret management for Kubernetes apps, and we were still using Terraform for infra deployment (we love Terraform!).

We run around 10 Kubernetes clusters in production and maintain those clusters in-house. The most important win for an infra engineer at Hotstar is that we now run Hotstar in auto-pilot mode, scaling up and down during any event.

In our case, Kubernetes proved to be a success in terms of frequent deployments and reduced go-to-production time for applications. The auto-pilot mode for our infrastructure is like a dream come true, because in the EC2 world scaling was our major pain point. We have also built a lot of cool features on top of Kubernetes, like an API gateway. We will explain them in future blog posts.

It’s important to note that we are not comparing platforms (Kubernetes, EC2, etc.) in general; however, for what we wanted to do with our infrastructure, Kubernetes worked out very well.

Finally, if you’d like to be on the other side of the looking glass, we’re hiring! Check out the open positions in our DevOps team here!