Cloud Resource Explorer using simple ChatOps
Need to know your cloud dependencies in a pinch? Yes, we’ve been there. Here’s how we leveraged ChatOps to make our lives easier.
We’ve got cloud, and we’ve got 99 problems about what’s residing in our clouds. On most days you might have the luxury of time to unravel this dependency graph. If you’re chasing down an incident, though, you need to know in a hurry! Here’s how the Sentinels, our security team at Hotstar, solved this using ChatOps.
A cloud Resource Explorer is one of the most important items in the toolkit for anyone who builds in a modern engineering team. While the reasons can vary, people need to know what resides where, and the metadata around it, without much drama.
Here’s what most people use today to discover items in the cloud, along with the challenges of each.
- Console : Does not scale in a multi-account setup; complex correlation is not possible.
- CLI or SDK (e.g. Boto) : The CLI needs setup such as configuring keys and role-assumption settings. The SDK requires some programming comfort, so it does not scale to team members who do not code regularly. A built-in problem with this method is managing keys and their rotation at scale.
- Cloud Inventory or a Cloud Security Posture Management (CSPM) solution : The focus of these tools is security rather than inventory. The data is therefore stale and works only as a coarse method, which might not serve all use cases.
In combination these things might work, but not in a pinch, and they require stitching a solution together.
What we wanted to solve is a problem you only see on a day-to-day basis at scale. For example, someone has a simple question. This someone could be a customer care executive or a backend developer. Their question might go something like:
“I want to know where is x.x.x.x IP in our Infra”
Our goal was to make it as easy for people as querying an Excel sheet or a simple database. The traditional methods fail for the simple reason that they require stitching and additional work each time the question is asked, unless you pool together some tooling. Add the complexity of multi-cloud, or multiple access levels, both of which are very common. In general, the headwind against answering even a simple question like this is intimidating.
We began to look closely at the questions our teams were asking. Here is a sampling:
- Which account does this S3 bucket belong to, and what type of encryption is enabled on it?
- For an access key, which account and user does it belong to?
- I want to know what xyz.hs.com points to. Which account’s R53 should I check?
Each of these takes a different quantum of complexity to answer! Imagine spinning up bespoke scripts to handle each question; this is just not scalable.
We use Slack extensively for communication, though ChatOps can run on any chat app. Anyone with questions about infrastructure comes to Slack first and asks someone. Most of the time it is the DevOps, Infrastructure and Security teams who get these questions.
Our goal was simple: nothing should stop someone from asking a question, and getting an answer should involve minimal dependency on others.
Querying the cloud still works the same way: it is either the CLI, an SDK, or existing data from a source like a CSPM, which already pulls most of the data for you.
Whether to use real-time queries or CSPM data depends on the use case and how live you expect the data to be. For example, I expect IP data to be almost live (a 1–2 hour window), as a lot of IPs keep changing for various reasons such as Spot nodes and Auto Scaling. IAM data can be 6–12 hours old, since user and access key creation is not that frequent. Similarly, S3 or R53 data can also be around 6–12 hours old.
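The freshness trade-off can be written down as a small policy table. The following is a minimal sketch, not our production code: the resource type names, the `MAX_AGE` table and the `choose_source` function are hypothetical, and the budgets simply take the upper end of the windows mentioned above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budgets per resource type (upper end of the
# windows discussed above). Tune these for your own environment.
MAX_AGE = {
    "ip": timedelta(hours=2),
    "iam": timedelta(hours=12),
    "s3": timedelta(hours=12),
    "r53": timedelta(hours=12),
}


def choose_source(resource_type, last_synced, now=None):
    """Return "cache" if the stored copy is fresh enough, else "realtime"."""
    now = now or datetime.now(timezone.utc)
    budget = MAX_AGE.get(resource_type)
    if budget is None:
        # Unknown resource types have no cached copy we can trust.
        return "realtime"
    return "cache" if now - last_synced <= budget else "realtime"
```

With a table like this, changing how live a given kind of data must be is a one-line edit instead of a code change in every handler.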
A simple architecture diagram to explain how it is built and used is here —
Components in the architecture:
Slack: This is where someone fires a Slash Command depending on the info they wish to get. The command can be fired from a DM or a dedicated channel; the response goes to a pre-defined channel.
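Slack delivers a Slash Command to your endpoint as a form-encoded POST body with fields such as `command`, `text` and `user_id`. Here is a minimal sketch of pulling out the parts a router would need. The function name `parse_slash_command` and the `/whereis` command in the usage line are made up for illustration.

```python
from urllib.parse import parse_qs


def parse_slash_command(body):
    """Extract the caller, command and arguments from a form-encoded body."""
    fields = parse_qs(body, keep_blank_values=True)

    def first(key):
        # parse_qs returns a list per key; we only expect one value each.
        return fields.get(key, [""])[0]

    return {
        "user_id": first("user_id"),
        "command": first("command"),
        "args": first("text").split(),
    }


# Example: a user typed "/whereis 10.0.0.5" (hypothetical command name).
parsed = parse_slash_command("command=%2Fwhereis&text=10.0.0.5&user_id=U123")
```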
API Router: This is where most of the logic sits. It authenticates the Slack user, validates the incoming payload, and then routes the request to the corresponding API. The decision whether to use the CSPM API, ES, or a real-time CLI query is made here. This component also sends the response to Slack. It is a simple Flask app.
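One common way to validate the incoming payload is Slack's signed-request scheme. Slack sends an `X-Slack-Request-Timestamp` header and an `X-Slack-Signature` header, the latter being an HMAC-SHA256 of `v0:<timestamp>:<raw body>` keyed with the app's signing secret. The sketch below assumes that scheme. The function name and the five-minute `max_skew` are our own choices, and in a Flask handler the raw body would typically come from `request.get_data(as_text=True)`.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_skew=300):
    """Return True if `signature` matches the body and is recent enough."""
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Reject old requests to limit replay attacks.
    if abs(now - ts) > max_skew:
        return False
    base = f"v0:{timestamp}:{body}".encode()
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The router would run this check before looking at any field in the payload, and answer with an error status if it fails.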
CSPM API: This can be your CSPM, or an alternative cloud inventory service, which pulls your posture data every 24 hrs. It will expose some API that can be used to query that data.
Custom Full Text Search: You can use any full-text search engine; we used Elasticsearch. We have a few cron jobs running to pull data and keep it as live as possible. The frequency of each cron job depends on the kind of data being pulled from the cloud. As mentioned before, IAM data can be pulled every 6–12 hrs, IP data every hour, and so on. The frequency depends on your environment and the priority given to certain resources.
Real Time Queries: You can fire custom queries using the CLI, an SDK such as Boto, or a tool like Steampipe.
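As an illustration of a real-time query, here is a rough sketch of answering the "where is this IP" question with the EC2 `describe_network_interfaces` call. The `find_ip` helper and the shape of its result are hypothetical. The client is passed in and assumed to behave like `boto3.client("ec2")` for one account and region, and pagination and the multi-account loop are left out for brevity.

```python
def find_ip(ec2_client, ip):
    """Look up network interfaces that carry `ip` in one account and region.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """
    matches = []
    # Private and public addresses are exposed through different filters.
    for filter_name in ("addresses.private-ip-address", "association.public-ip"):
        response = ec2_client.describe_network_interfaces(
            Filters=[{"Name": filter_name, "Values": [ip]}]
        )
        for eni in response.get("NetworkInterfaces", []):
            matches.append({
                "eni": eni["NetworkInterfaceId"],
                "account": eni.get("OwnerId"),
                "vpc": eni.get("VpcId"),
                "description": eni.get("Description", ""),
            })
    return matches
```

Passing the client in keeps the helper easy to run against several accounts: the caller assumes a role per account and calls `find_ip` once for each client.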
Note: Access Control. Of course, not everyone is allowed to see everything, so we want some restrictions on what kind of data can be queried by which category of people. Simple access controls can be written based on the Slack User ID that fires the command. Groups of Slack User IDs can be allowed or denied access to certain APIs.
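A check like this can be as small as two lookup tables. The sketch below is illustrative only: the group names, user IDs and command names are invented, and a real setup would likely load them from configuration rather than hard-code them.

```python
# Hypothetical groups of Slack user IDs.
GROUPS = {
    "security": {"U111", "U222"},
    "backend": {"U333"},
}

# Hypothetical mapping of slash commands to the groups allowed to run them.
ALLOWED_GROUPS = {
    "/whereis": {"security", "backend"},
    "/whose-key": {"security"},
}


def is_allowed(user_id, command):
    """Deny by default: unknown commands and unknown users get no access."""
    groups = ALLOWED_GROUPS.get(command, set())
    return any(user_id in GROUPS.get(group, set()) for group in groups)
```

Denying by default means a newly added command stays locked until someone explicitly grants a group access to it.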