Creating a strong subscription business means protecting the product from being accessed without the right entitlements. With great power comes great responsibility, and with high growth come even higher risks. Given our premium Disney content and live events, there is always a risk of piracy, account sharing, account takeover and data or content leaks.
Mitigating these risks is extremely important from a business perspective: left unchecked, they can lead to a loss of subscribers and, ultimately, to monetary and reputational losses.
Hence, we at Disney+Hotstar floated the idea of building a risk engine that gives us a potent way to detect bad actors and patterns without hurting our legitimate customers. Using the telemetry that we collect, we’re able to take instant corrective actions like logging out a user or even disabling their services.
Challenges
Introducing a new element in any business requires considerable thought. We needed to be mindful of not impacting the business as we rolled this capability out.
Following are some of the complexities that we had to be cognisant of:
- Our products cover more than 16 different platforms, including Android, iOS, Web, and all kinds of living-room devices.
- Our service is available in multiple regions across the globe, including India, Southeast Asia, North America, Europe, Middle East and South Africa.
- We offer a variety of entitlement options. We have both free and paid content, as well as Video on Demand (VOD) and LIVE content. The subscription tiers also differ in entitlements and limitations, so there are many combinations to support.
- We have a high scale platform that sees huge traffic every day.
We operate in plain sight while hackers hide in the darkness, so attacks can come from anywhere, at any time, targeting any weakness in our system. The threats to us can be categorized as follows:
- Account: account sharing, account takeover, mass registration, subscription fraud, promotion abuse
- Content: VOD piracy, LIVE piracy
- Application: app modification, fake installs, ad blocking, voting cheats
- Service: DDoS, request forgery, malicious crawlers
Therefore, to deal with the above threats, the risk engine needs to analyze all kinds of data across every business domain.
Another challenge is the action. Simply detecting malicious activities is not enough; we also need to act on them. The actions typically orchestrate multiple services and span complex business logic, so this is a key concern as well. We need strategies to decide which action to take based on the characteristics and severity of the breaches.
Last but not least, we need to validate the performance of the risk engine by answering two questions: Is it effective at stopping fraud? Is it affecting normal business and users?
For this, we need to define quantitative metrics to evaluate and track how the risk engine is performing, as a feedback mechanism.
Before we decided to build, we did some market research to evaluate whether there were any open-source or third-party solutions we could use. However, we didn’t find any that were flexible enough to support our use cases and handle our complexity.
To prove the concept of an in-house risk engine, we decided to build a minimum viable version. We wanted to validate whether a few rules could capture malicious users, and whether the resulting actions were effective against them.
The first version of our risk engine came online shortly afterwards. It was built on top of offline queries. A simple Rule Engine maintained some static rules and triggered actions when certain thresholds were met.
The workflow was straightforward:
- A Spark ETL job queries data from our Data Lake to detect breaches based on hardcoded thresholds. The job is triggered on an hourly or daily basis by Airflow, and outputs raw results to the Rule Engine via Kafka.
- We coded some rules in the Rule Engine, which check the breach frequencies. If a certain threshold is met, the corresponding actions are taken.
- The Action Engine calls service APIs to take actions, like logging the user out or blocking the IP.
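The rule-checking step of this batch flow can be sketched roughly as follows. This is a minimal illustration only: the breach types, thresholds and action names are hypothetical, and the in-memory list stands in for the Spark output delivered via Kafka.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Breach:
    """Hypothetical record standing in for one row of the batch ETL output."""
    user_id: str
    breach_type: str


# Hypothetical static rules: breach type -> (threshold, action name).
# In the first version these values lived in code, so changing one meant a redeploy.
STATIC_RULES = {
    "concurrent_streams": (3, "logout_user"),
    "failed_logins": (5, "block_ip"),
}


def evaluate_batch(breaches):
    """Return sorted (user_id, action) pairs whose breach count met the threshold."""
    counts = Counter((b.user_id, b.breach_type) for b in breaches)
    actions = []
    for (user_id, breach_type), count in counts.items():
        rule = STATIC_RULES.get(breach_type)
        if rule is None:
            continue  # no rule configured for this breach type
        threshold, action = rule
        if count >= threshold:
            actions.append((user_id, action))
    return sorted(actions)


batch = [Breach("u1", "concurrent_streams")] * 3 + [Breach("u2", "failed_logins")] * 2
print(evaluate_batch(batch))  # [('u1', 'logout_user')]
```

Only `u1` reaches its threshold here, so only one action is emitted; `u2` stays below the hypothetical limit of five failed logins.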
This solution is mainly based on Spark batch processing. It relies mostly on the existing Data Lake, which stores billions of watch, login and payment records every day. The system ran for several months, and the results proved that it was able to capture a lot of abnormalities every day. This gave us the confidence to invest more in the risk engine. However, this approach has some major limitations due to its architecture.
Speed
The response time of the risk engine is crucial: the longer it takes to react, the longer malicious actors can keep abusing our platform. However, due to the batch nature of this solution, we had delays of hours before detection and action. The batch job could take exceptionally long during high-scale events.
Configurability
All the jobs and strategies are basically hardcoded. If we want to change a threshold or modify a rule, we’ll need to update the code and redeploy.
There is no mechanism like percentage rollout, so when we want to launch a new rule, we need to be very cautious to avoid harming normal users. As a result, only our team is capable of maintaining it.
Product managers and data analysts cannot contribute, because configuring the system requires engineering work.
Visibility
There’s no UI frontend to visualize and manage the data. We can only rely on a general-purpose Redash dashboard for queries and analytics.
The first version proved that we were able to detect malicious activities with some rules, and that the actions we took were effective. However, due to the above constraints, the system was neither responsive nor scalable enough. This motivated us to build a new version of the risk engine. In this version, we aimed to improve on the following aspects:
- Real-time analysis and action.
- Scalable for 10 billion data points daily.
- Everything is easily configurable and visualized via a web portal.
- Non-developers should be able to self-serve using the portal to manage their strategies.
In the next sections, we will first talk about how we structure rules to make them easily understandable and configurable by everyone. Then we will discuss the new real-time processing flow, in which Flink plays a major role.
Configurability: Rule Structure
When we break down a rule, some parts have to be implemented in code, while others can be exposed via configuration. The key is to define the configuration points. To standardize this, we define a rule structure that consists of three components: Factor, Rule and Action.
This structure lets us separate concerns and keeps the system agile.
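One way to picture the three components is as small data types. The sketch below is illustrative only: the field names, the example factor `device_count_24h` and the helper `apply_rule` are assumptions, not the actual schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Factor:
    """A measured signal about one entity (hypothetical fields)."""
    name: str        # e.g. "device_count_24h"
    entity_id: str   # a user, device or IP
    value: float


@dataclass(frozen=True)
class Action:
    """What to do when a rule fires, e.g. "logout_user" (hypothetical name)."""
    name: str


@dataclass(frozen=True)
class Rule:
    """A configurable condition over factor values plus the actions it triggers."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    actions: Tuple[Action, ...]


def apply_rule(rule, factors):
    """Group factors by entity and return (entity_id, action name) for each match."""
    by_entity = {}
    for f in factors:
        by_entity.setdefault(f.entity_id, {})[f.name] = f.value
    triggered = []
    for entity_id, values in sorted(by_entity.items()):
        if rule.condition(values):
            triggered.extend((entity_id, a.name) for a in rule.actions)
    return triggered


rule = Rule(
    name="too_many_devices",
    condition=lambda v: v.get("device_count_24h", 0) > 5,
    actions=(Action("logout_user"),),
)
factors = [Factor("device_count_24h", "u1", 9), Factor("device_count_24h", "u2", 2)]
print(apply_rule(rule, factors))  # [('u1', 'logout_user')]
```

In this sketch, computing factors stays in code, while the condition and the list of actions are the parts that could be exposed as configuration.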
Speed: Stream Processing with Flink
To achieve faster processing, we introduced Flink, which processes events from Kafka in real time. Below is the basic workflow:
Scale and Cost Stats: With 16 pods of 8 CPU cores and 8 GB memory each, we are able to handle 10 billion events daily, with a peak RPS of 200k. One optimization we made was to store checkpoints in S3 instead of in memory, which saves a lot of memory and GC overhead.
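To give a feel for the kind of aggregation a streaming job performs, here is a plain-Python stand-in for a keyed tumbling-window count. It does not use the Flink API; the window size, event shape and function name are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size


def tumbling_window_counts(events, window_seconds=WINDOW_SECONDS):
    """Count events per (window start, key).

    `events` is an iterable of (timestamp_seconds, key) pairs, standing in
    for records consumed from a Kafka topic.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = int(ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "u1"), (10, "u1"), (59, "u2"), (60, "u1"), (125, "u1")]
window_counts = tumbling_window_counts(events)
print(window_counts)
# {(0, 'u1'): 2, (0, 'u2'): 1, (60, 'u1'): 1, (120, 'u1'): 1}
```

A real streaming job would emit each window's result as soon as the window closes rather than after reading all input, which is what brings detection delay down from hours to near real time.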
Rule Engine
We want rule configuration to be simple but extensible. Therefore, we define a simple rule schema like the one below:
On the left arm, we have:
- Factors, which are the preliminary results produced by Flink.
- Logical operators of NOT / AND / OR.
- High-level Functions like RATE and MAX to check frequency and scale of Factors.
On the right arm, we have:
Rules are configured in the web portal, and we implemented the editor and parser with ANTLR. The Rule Engine continuously checks whether the left-arm condition of each rule is met and, if so, takes the actions on the right arm against the respective users, devices or IPs.
Meanwhile, the rules’ lifecycle is managed in the portal. We can add, enable, disable and modify rules there. Moreover, we support two advanced deployment options for rules:
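A rule of this shape can be evaluated with a small recursive function once it has been parsed. The sketch below skips parsing entirely and starts from an already-parsed tree of nested tuples. The tuple layout, the `GT` comparison node, the factor names, and the exact meaning given to RATE (number of occurrences in the evaluation window) and MAX (largest observed value) are all assumptions.

```python
def evaluate(node, observations):
    """Evaluate a parsed left-arm expression.

    `node` is a nested tuple such as ("AND", child, child) or
    ("GT", "RATE", "factor_name", threshold). `observations` maps a factor
    name to the list of values seen in the current evaluation window.
    """
    op = node[0]
    if op == "AND":
        return all(evaluate(child, observations) for child in node[1:])
    if op == "OR":
        return any(evaluate(child, observations) for child in node[1:])
    if op == "NOT":
        return not evaluate(node[1], observations)
    if op == "GT":
        _, func, factor, threshold = node
        values = observations.get(factor, [])
        if func == "RATE":
            measured = len(values)             # assumed: occurrences in the window
        elif func == "MAX":
            measured = max(values, default=0)  # assumed: largest observed value
        else:
            raise ValueError(f"unknown function: {func}")
        return measured > threshold
    raise ValueError(f"unknown operator: {op}")


def actions_for(rule, observations):
    """Return the right-arm actions if the left-arm condition holds, else none."""
    return list(rule["then"]) if evaluate(rule["when"], observations) else []


rule = {
    "when": ("AND",
             ("GT", "RATE", "login_failure", 3),
             ("NOT", ("GT", "MAX", "payment_amount", 1000))),
    "then": ["logout_user"],
}
observations = {"login_failure": [1, 1, 1, 1], "payment_amount": [200]}
print(actions_for(rule, observations))  # ['logout_user']
```

Keeping the evaluator separate from the parser means the same tree can be produced by any front end, such as a portal editor.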
- Passive mode: Rules running in passive mode won’t take actual actions, but results will be captured for analysis.
- Percentage rollout: For active rules, we can roll out to a certain percentage.
With these levers, we have the freedom to configure any kind of strategy that fits our business purpose, and at the same time we can experiment with and verify our rules with minimal impact on normal business.
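A common way to implement these two levers is deterministic hash bucketing, sketched below. The function names, the mode strings and the choice to skip entities outside the rollout percentage are assumptions, not a description of the actual implementation.

```python
import hashlib


def in_rollout(rule_name, entity_id, percentage):
    """Deterministically map (rule, entity) to one of 100 buckets.

    The same entity always lands in the same bucket for a given rule, so
    raising the percentage only ever adds entities to the rollout.
    """
    digest = hashlib.sha256(f"{rule_name}:{entity_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage


def decide(rule_name, entity_id, mode, percentage=100):
    """Decide what to do with an entity that matched a rule.

    Returns "record_only" for passive rules, "act" for active rules inside
    the rollout, and "skip" otherwise (an assumed behaviour).
    """
    if mode == "passive":
        return "record_only"
    if in_rollout(rule_name, entity_id, percentage):
        return "act"
    return "skip"


print(decide("too_many_devices", "u1", mode="passive"))                # record_only
print(decide("too_many_devices", "u1", mode="active", percentage=0))   # skip
print(decide("too_many_devices", "u1", mode="active", percentage=100)) # act
```

Including the rule name in the hash input keeps rollouts of different rules independent, so the same users are not always the first to be affected by every new rule.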
Score Engine
Apart from taking actions in an active manner, our risk engine also generates a score passively. The actions taken by the Rule Engine focus on the most severe breaches, which favors precision over recall. As a complement, the Score Engine focuses more on recall.
Based on Factors, we compute a score from 0 to 100; the higher the score, the higher the risk. To quantify the score, we adopt the FAIR model (Factor Analysis of Information Risk). It uses the frequency and magnitude of each factor to derive a score based on a lognormal model. The scores are then aggregated by weights and normalized to the range of 0 to 100.
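The exact formula is not spelled out here, so the following is only one plausible reading: each factor's frequency times magnitude is passed through a lognormal cumulative distribution function to get a value between 0 and 1, and the per-factor values are combined as a weighted average scaled to 0 to 100. The parameters `mu` and `sigma`, the factor names and the weights are all hypothetical.

```python
import math


def lognormal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a lognormal distribution; returns 0 for non-positive x."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def factor_score(frequency, magnitude, mu=0.0, sigma=1.0):
    """Assumed per-factor score in [0, 1] from frequency times magnitude."""
    return lognormal_cdf(frequency * magnitude, mu, sigma)


def risk_score(factors, weights):
    """Weighted average of per-factor scores, scaled to the range 0 to 100.

    `factors` maps a factor name to a (frequency, magnitude) pair and
    `weights` maps the same names to non-negative weights.
    """
    total_weight = sum(weights[name] for name in factors)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[name] * factor_score(*factors[name]) for name in factors)
    return 100.0 * weighted / total_weight


factors = {"login_failure": (4, 0.5), "device_switch": (1, 1.0)}
weights = {"login_failure": 2.0, "device_switch": 1.0}
score = risk_score(factors, weights)
print(round(score, 1))
```

Because the CDF is bounded and monotonic, a factor that occurs more often or with a larger magnitude can only push the score up, and the result always stays within 0 to 100.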
The scores are open to all our services, which can use them at their convenience in either online or offline scenarios. A typical use case is that our recommendation team excludes high-risk users when training the recommendation model, so that extreme and abnormal behavior does not affect model performance.
Visibility: Web Portal
The web portal is the main operation point for carrying out all the actual strategies, and it is open for use across various teams in the company. Business teams can configure their own strategies and validate the results with a few clicks. The Customer Support team also uses the portal to check users’ breach history and take manual actions.
In this blog, we talked about why we wanted to build a risk engine, and how it evolved from a hardcoded batch system to a flexible real-time one. We discussed our design principles for making it configurable and dynamic. To achieve better detection speed, we introduced Flink. On top of that, we built the Rule Engine, Action Engine, Score Engine and Web Portal to make our risk engine more agile, versatile and self-serving.
Want to build stuff like this? We’re hiring & we’re 100% remote! Check out open roles at https://tech.hotstar.com.