Hotstar as a platform deals with more than 1M RPS of pure API calls from various clients (this excludes calls for images/video). At peak, the volume goes up to 5M RPS. Hotstar is a global platform with more than 200 micro-services working together to give our customers a smooth experience.
There are multiple patterns in which one can monitor systems. Two of the patterns we use are:
Inside Out Monitoring
In this pattern, each system running a service emits metrics from within. The source of truth is the system being monitored itself, which means you are trusting that the system will remain well-behaved even when it is stressed out, and that it will still send the distress signal.
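To make this concrete, here is a minimal inside-out sketch, assuming a Python service instrumented with the prometheus_client library; the service and metric names are hypothetical, not our actual instrumentation.

```python
# Minimal inside-out monitoring sketch: the service itself emits its own metrics.
# Assumes the prometheus_client library; names are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("playback_api_requests_total", "Requests handled by this service")
LATENCY = Histogram("playback_api_latency_seconds", "Time spent handling a request")

def handle_request():
    # The service reports on itself, from the inside.
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work
    REQUESTS.inc()

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for a central scraper to collect
    while True:
        handle_request()
```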
Outside In Monitoring
This approach assumes that every component in your platform can fail, and hence you can't trust the data it gives you. That's why you always ask the layer above it, the one calling it, to tell you how this system is performing, and you do this all the way up the chain.
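For contrast, a minimal outside-in sketch: the caller, not the dependency, records how the dependency is behaving. The endpoint, labels, and metric names below are hypothetical.

```python
# Minimal outside-in monitoring sketch: the calling layer measures the layer below it.
# Assumes the requests and prometheus_client libraries; names are illustrative only.
import time

import requests
from prometheus_client import Counter, Histogram

DOWNSTREAM_LATENCY = Histogram(
    "downstream_latency_seconds", "Latency observed by the caller", ["dependency"]
)
DOWNSTREAM_ERRORS = Counter(
    "downstream_errors_total", "Errors observed by the caller", ["dependency"]
)

def call_downstream(url: str, dependency: str):
    # The caller records latency and errors for the system it depends on,
    # instead of trusting that system's own metrics.
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=2)
        response.raise_for_status()
        return response
    except requests.RequestException:
        DOWNSTREAM_ERRORS.labels(dependency=dependency).inc()
        raise
    finally:
        DOWNSTREAM_LATENCY.labels(dependency=dependency).observe(time.monotonic() - start)
```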
Our tech stack varies by use-case. We have a variety of data stores in use across the platform; some are fully managed services and some are managed by us. Unfortunately, there is no uniform monitoring solution for these: each one emits metrics in its own way.
To root-cause an incident, it's important for us to be able to see all the various metrics from these different systems in a single place and correlate them. We have achieved this over time by exporting metrics into a central collection layer. We use Grafana for visualisation and alerting.
With centralisation, each team created their own dashboards, which helped them to a certain extent, but there were still hundreds of data points that needed to be observed.
The centralisation of metrics has been helpful in root-causing incidents, and this has reduced the mean time to mitigation. During a live event, it is also important to detect incidents much faster; every second of delay can be harmful.
In 2018, we started with a very basic dashboard, with each service listed horizontally and its layers listed vertically. Each block showed a metric that could be red/amber/green, indicating the health of that layer.
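The colour of a block can be thought of as a simple threshold check on a metric. A toy sketch follows; the error-rate metric and thresholds are made up for illustration, not the values we actually used.

```python
# Toy red/amber/green health check for a single dashboard block.
# The metric (error rate) and thresholds are illustrative only.
def block_status(error_rate: float,
                 amber_threshold: float = 0.01,
                 red_threshold: float = 0.05) -> str:
    """Map a metric value to the colour of a dashboard block."""
    if error_rate >= red_threshold:
        return "red"
    if error_rate >= amber_threshold:
        return "amber"
    return "green"

# Example: a 2% error rate on some layer shows up as amber.
print(block_status(0.02))  # -> "amber"
```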
In 2019, we added a few more services and a single dashboard was no longer sufficient, so we split the dashboard by service priority. This met the requirements for most of the events in 2019.
In 2020, we added a lot more services to the platform. At this point we started to realise that this mechanism was not scalable, and we began exploring tools that could do this more effectively.
We needed a system that could provide an overview of the entire platform, was smart enough to detect traffic anomalies, and could quickly highlight correlations as our service ecosystem grew year on year.
Acceptance Criteria
An upgraded system needed to meet the following key criteria:
- Build correlations across different services and different layers.
- Drill down to the layer/component causing the trouble.
- Detect degradations post changes/deployments on the platform.
- Automatically detect thresholds and set alerts on the data.
- Detect anomalous patterns across the platform (a simple sketch of the idea follows this list).
- Monitor all resources created in production, and auto-discover correlations.
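As a rough illustration of what "detect anomalous patterns" means in practice, here is a simple rolling z-score check. Real anomaly detection is considerably more sophisticated; the window size and threshold below are arbitrary choices for the sketch.

```python
# Toy anomaly detector: flag a data point that deviates sharply from the recent window.
# Window size and threshold are arbitrary, purely for illustration.
from collections import deque
from statistics import mean, stdev

def make_detector(window: int = 60, threshold: float = 3.0):
    history = deque(maxlen=window)

    def is_anomalous(value: float) -> bool:
        anomalous = False
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            # Flag values more than `threshold` standard deviations from the recent mean.
            anomalous = sigma > 0 and abs(value - mu) > threshold * sigma
        history.append(value)
        return anomalous

    return is_anomalous

# Example: a sudden traffic spike after a flat baseline gets flagged.
check = make_detector()
for rps in [1000, 1005, 998, 1002, 5000]:
    print(rps, check(rps))  # only the 5000 reading is reported as anomalous
```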
With the above criteria in mind, we went out and started looking for a tool. At some point, we thought maybe we should build this internally. During this search, we came across Last9, a platform built on solid SRE principles with a twist for modern platforms.
One of the key features that caught our attention was Last9's system graph. We could map each and every component of the system in it, be it physical or virtual, and then build relations between them.
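Purely as an illustration of the idea (and not Last9's actual data model), a system graph can be thought of as components plus the relations between them, which then lets you reason about what is affected when something degrades. The components and edges below are made up.

```python
# Toy illustration of the system-graph idea: components and the relations between them.
# This is NOT Last9's data model; the components and edges here are made up.
system_graph = {
    # "A": ["B"] means component A calls (depends on) component B.
    "cdn": ["api-gateway"],
    "api-gateway": ["playback-svc", "user-svc"],
    "playback-svc": ["playback-db"],
    "user-svc": ["user-db", "cache"],
}

def dependencies_of(component: str, graph: dict) -> set:
    """Everything `component` depends on, directly or transitively."""
    seen, stack = set(), [component]
    while stack:
        for neighbour in graph.get(stack.pop(), []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

# If any of these degrades, the api gateway (and anything calling it) may be impacted.
print(dependencies_of("api-gateway", system_graph))
# e.g. {'playback-svc', 'user-svc', 'playback-db', 'user-db', 'cache'}
```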
We want to ingest more event sources into the product so that we are able to draw even more correlations and learn when things are starting to smoke, versus when they are on fire!
Our vision is that when a certain business metric is impacted, we should be able to tell what caused it: a change in the product, or a certain engineering incident. Ultimately, knowing about a problem before a customer figures it out is the key goal we chase when it comes to monitoring, followed rapidly by problem isolation that leads to the quickest mitigation.