Strengthening S2S Security: How JioHotstar Secures Its Service Ecosystems

As organisations increasingly rely on distributed systems and micro-services, security breaches in service-to-service (S2S) communication can have catastrophic consequences. According to Cybersecurity Insiders, 74% of survey respondents reported that their organisation was moderately to extremely vulnerable to insider attacks. Furthermore, 48% agreed that insider attacks are more difficult to detect and prevent than external attacks. It’s encouraging to see many teams proactively tackle insider threats without regulatory pressure.

In our previous article, “Code Less, Achieve More: Disney+ Hotstar’s Approach to Modern Access Control,” we introduced IAuth — a powerful platform offering authentication, authorization, and accounting (AAA security) solutions for both S2S and client-to-service (C2S) interactions. This blog dives deeper into how IAuth enhances the security and resilience of S2S communication, simplifying security across complex service ecosystems.

In the early days of micro-services, the lack of security measures led to a dangerously open environment. Services could communicate with one another freely within the intranet, as long as the network was accessible. This was possible because firewalls existed only at the network’s perimeter, leaving internal services unrestricted.

For developers, this was ideal — seamless invocations without the hassle of security protocols. However, for security professionals, it was a disaster waiting to happen. As services exchanged critical data without protection, it became clear that unrestricted access to sensitive services posed a significant security risk.

Our first step in securing S2S interactions was the introduction of static tokens (long-term tokens). These tokens served as proof of a service’s identity, preventing forgery and tampering. A service requesting access would attach a static token to each API call, and the responder would validate the token to authenticate the requester.

Of course, mere possession of a valid token wasn’t enough. To ensure that services only accessed APIs they were authorized to, we implemented Role-Based Access Control (RBAC). This added a layer of protection, granting access based on predefined roles and permissions.
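The two checks described above (authenticate the token, then authorize the call by role) can be sketched as follows. This is a minimal illustration, not IAuth's actual API: the lookup tables, the function names, and the example service, role, and endpoint are all hypothetical.

```python
# Hypothetical sketch of static-token authentication followed by an RBAC check.
import hmac
from typing import Optional

# Assumed in-memory tables; a real platform would load these from its backend.
STATIC_TOKENS = {"tok-playback-123": "playback-service"}          # token -> service ID
SERVICE_ROLES = {"playback-service": {"content-reader"}}          # service ID -> roles
ROLE_PERMISSIONS = {"content-reader": {("GET", "/v1/content")}}   # role -> allowed APIs


def authenticate(token: str) -> Optional[str]:
    """Return the service ID that owns the token, or None if the token is unknown."""
    for known, service_id in STATIC_TOKENS.items():
        # compare_digest avoids leaking token contents through timing differences.
        if hmac.compare_digest(known.encode(), token.encode()):
            return service_id
    return None


def authorize(token: str, method: str, path: str) -> bool:
    """Allow the call only if the token is valid and one of its roles grants the API."""
    service_id = authenticate(token)
    if service_id is None:
        return False
    roles = SERVICE_ROLES.get(service_id, set())
    return any((method, path) in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

With these tables, a valid token can read `/v1/content` but cannot delete it, and an unknown token is rejected before any role lookup happens.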

While static tokens were easy to understand and implement, they had a major drawback: they never expired. If a token was leaked — whether through log files or compromised developers — it posed a long-term security risk. The diagram below illustrates the flow of static tokens.

To overcome the limitations of static tokens, we introduced dynamic tokens (short-term tokens) that have a Time-to-Live (TTL) of one hour. The IAuth SDK runs a probe thread on a 30-second interval that refreshes the dynamic token at least five minutes before it expires. While the overall process remains unchanged, services must now use their credentials (service ID and secret) to periodically request fresh dynamic tokens from the IAuth backend.
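One way to read the refresh schedule is sketched below. The three constants come from the numbers in the paragraph above; everything else (the class, the `fetch_token` callable standing in for the credential-based call to the IAuth backend, and the interpretation that the thread checks every 30 seconds) is an assumption made for illustration.

```python
# Hypothetical sketch of the dynamic-token refresh loop; not the real IAuth SDK.
from dataclasses import dataclass
from typing import Callable, Optional

TOKEN_TTL_SECONDS = 3600       # one-hour TTL
REFRESH_MARGIN_SECONDS = 300   # refresh at least five minutes before expiry
CHECK_INTERVAL_SECONDS = 30    # assumed wake-up interval of the probe thread


@dataclass
class DynamicToken:
    value: str
    expires_at: float  # epoch seconds


class TokenRefresher:
    def __init__(self, fetch_token: Callable[[float], DynamicToken]):
        # fetch_token(now) is a stand-in for exchanging service ID + secret for a token.
        self._fetch_token = fetch_token
        self.token: Optional[DynamicToken] = None

    def needs_refresh(self, now: float) -> bool:
        # Refresh when there is no token yet or it has entered the safety margin.
        return self.token is None or self.token.expires_at - now <= REFRESH_MARGIN_SECONDS

    def tick(self, now: float) -> bool:
        """One iteration of the probe loop; returns True if a new token was stored."""
        if not self.needs_refresh(now):
            return False
        try:
            self.token = self._fetch_token(now)
            return True
        except Exception:
            # Keep the current token; the next tick retries CHECK_INTERVAL_SECONDS later.
            return False
```

In a real SDK a background thread would call `tick(time.time())` every `CHECK_INTERVAL_SECONDS`. Under these assumptions the five-minute margin leaves room for roughly ten retries (300 / 30) before the current token expires.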

This raises a valid question: “Are we merely substituting static tokens with another static element — service credentials?” While the concern is understandable, the difference is crucial. Static tokens were shared across three parties: the requester, responder, and IAuth, thereby widening the attack surface. In contrast, service credentials are only shared between the requester and IAuth, significantly reducing exposure. Moreover, if a dynamic token is compromised, its short TTL minimizes the window for abuse. But what if service credentials themselves are compromised? To mitigate this, we store them securely in a key vault, and we’re actively working on mechanisms to prevent even the developers of services from accessing these secrets directly.

Despite this improvement, dynamic tokens had a potential flaw. If the IAuth backend crashed, the requester couldn’t obtain a fresh token — posing a resiliency issue, and an unacceptable risk for critical S2S interactions.

To address the need for security and resiliency, we developed a hybrid model. This approach uses dynamic tokens during normal operations and falls back on static tokens when the IAuth backend is unavailable.

If the IAuth SDK fails to retrieve a dynamic token, it seamlessly switches to a cached static token, ensuring continuity. This model reduces the attack surface by using dynamic tokens most of the time, while static tokens act as a backup during failures.
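The fallback itself is simple, as the sketch below suggests. The exception type, the function name, and the `fetch_dynamic` callable are hypothetical stand-ins; the article does not describe the SDK's real interface.

```python
# Hypothetical sketch of the hybrid model: prefer a dynamic token, fall back to static.
from typing import Callable, Tuple


class BackendUnavailable(Exception):
    """Raised by the (assumed) fetcher when the IAuth backend cannot be reached."""


def select_token(fetch_dynamic: Callable[[], str], cached_static: str) -> Tuple[str, str]:
    """Return (kind, token), where kind is 'dynamic' or 'static'."""
    try:
        return ("dynamic", fetch_dynamic())
    except BackendUnavailable:
        # Backend is down: use the cached long-term token so calls keep flowing.
        return ("static", cached_static)
```

Returning the token kind alongside the value lets the caller log or count how often the fallback path is taken, which is useful when static tokens are meant to be the exception.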

But this raises another challenge: what if static tokens, meant for emergencies, are compromised and used during normal operations? How can we ensure they are only used during IAuth backend failures?

In response, we refined the hybrid approach to prevent static tokens from being used in normal operations while allowing them during IAuth backend failures. The key was implementing a heartbeat mechanism that detects backend availability and triggers the IAuth SDK to switch between static and dynamic tokens.

Although the heartbeat mechanism appears simple, it runs into a significant challenge in distributed systems: data inconsistency introduced by timing. In the IAuth ecosystem, different services (or their Kubernetes pods) may observe different heartbeat statuses in the moments around an IAuth crash or recovery, and these misaligned views of backend health result in unexpected behaviour.

In our scenario, we encountered two critical moments of inconsistency that can cause severe issues:

Backend Crash: Service A detects the backend failure earlier and switches to static tokens, while Service B has not yet detected the failure and continues to reject static tokens. This leads to live issues because Service B should have accepted static tokens during the failure period.

Service B should have accepted static tokens during backend crash

Backend Recovery: Service B detects the recovery earlier and resumes rejecting static tokens, while Service A continues using them until it successfully retrieves a dynamic token. This also leads to severe live issues, because Service B should keep accepting static tokens until Service A has detected the recovery.

Service B resumes rejecting static tokens during backend recovery

These unexpected behaviours can severely impact overall system resiliency. A naive solution would be to have the IAuth SDK call the backend on every request to check the latest health status. However, this approach would significantly increase latency and place an unsustainable load on the IAuth system.

To mitigate the severe issues caused by data inconsistency, we designed a more robust mechanism based on the following steps:

Two Cached Flags: The IAuth SDK maintains two cached flags, ObservedAvailability and isBlockingStaticToken (also referred to as CurrentAvailability). ObservedAvailability is a byte that tracks the health status of the last eight probes, one bit per probe. isBlockingStaticToken is a boolean indicating whether static tokens should be rejected.
Periodic Health Probe: Every 5 seconds, the SDK performs a health probe against the IAuth backend and updates both flags based on the result. isBlockingStaticToken is set to true only when ObservedAvailability equals 0xFF, meaning all of the last eight probes reported a healthy status.
Immediate Probe on Demand: When a static token is received and the current ObservedAvailability indicates a healthy status, the SDK performs an immediate synchronous probe instead of waiting for the next scheduled one. This ensures that a health state transition is not missed. As shown in the diagram below, the blue arrow represents this on-demand probe, which updates the cached status ahead of the scheduled probe at time t1.
Graceful Transition Period: The SDK rejects static tokens only if isBlockingStaticToken is true. Because eight consecutive healthy probes taken 5 seconds apart span at least 35 seconds, services effectively get a grace period of at least 35 seconds to transition back to dynamic tokens. The SDK attempts to refresh dynamic tokens every 30 seconds, which comfortably fits within the grace period. As shown in the diagram, the callee resumes rejecting static tokens only after all eight recent probe results indicate a healthy backend.
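The four steps above can be sketched as a small state machine. The field names mirror ObservedAvailability and isBlockingStaticToken from the text; the class, its methods, the `probe` callable, and the pessimistic starting value of 0x00 are assumptions. The sketch also assumes that "indicates a healthy status" means the byte equals 0xFF, so the on-demand probe only runs when a static token would otherwise be rejected.

```python
# Hypothetical sketch of the callee-side availability tracking; not the real IAuth SDK.
from typing import Callable

PROBE_INTERVAL_SECONDS = 5   # scheduled probe period from the text
ALL_HEALTHY = 0xFF           # all of the last eight probes were healthy


class AvailabilityTracker:
    def __init__(self, probe: Callable[[], bool]):
        self._probe = probe                    # returns True if the backend is healthy
        self.observed_availability = 0x00      # assumption: start with no healthy history
        self.is_blocking_static_token = False

    def record_probe(self) -> bool:
        """Run one probe and shift its result into the 8-bit history."""
        try:
            healthy = bool(self._probe())
        except Exception:
            healthy = False                    # an unreachable backend counts as unhealthy
        # Shift left, drop the oldest bit, and store the newest result in bit 0.
        self.observed_availability = ((self.observed_availability << 1) | int(healthy)) & 0xFF
        self.is_blocking_static_token = self.observed_availability == ALL_HEALTHY
        return healthy

    def accept_static_token(self) -> bool:
        """Decide whether an incoming static token may be accepted right now."""
        if self.observed_availability == ALL_HEALTHY:
            # Cached state says "healthy": probe immediately in case the backend
            # crashed after the last scheduled probe.
            self.record_probe()
        return not self.is_blocking_static_token
```

A scheduler would call `record_probe()` every `PROBE_INTERVAL_SECONDS`. After a single failed probe the zero bit takes eight further healthy probes to shift out of the byte, which is what produces the grace period before static tokens are rejected again.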

While the above mechanism significantly improves system resiliency, it is not without limitations.

One such edge case involves network partitions or targeted attacks, where Service A (the caller) cannot reach the IAuth backend, but Service B (the callee) can. In this scenario, the caller falls back to using static tokens due to the inability to obtain a dynamic token, while the callee, having observed a healthy backend, continues rejecting static tokens. This discrepancy leads to a live issue — a classic manifestation of partition tolerance failure.

Although such occurrences are rare, we’ve built safeguards to mitigate their impact. Specifically, IAuth supports a global configuration flag that controls whether static tokens should be blocked. In critical cases, this flag can be dynamically adjusted to temporarily allow all static tokens, ensuring service continuity during network partitions or backend isolation events.
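How such a global flag might combine with the locally cached state is sketched below. The function and parameter names are hypothetical, and the AND semantics is an assumption about how the override behaves.

```python
# Hypothetical sketch of the global override; names and semantics are assumed.
def should_reject_static_token(block_static_tokens_globally: bool,
                               is_blocking_static_token: bool) -> bool:
    """Reject a static token only when the global policy and local health both say so."""
    # With the global flag switched off, static tokens are always accepted,
    # no matter what this instance has observed about backend health.
    return block_static_tokens_globally and is_blocking_static_token
```

Under this reading, turning the global flag off during a partition immediately unblocks callers that are stuck on static tokens, and turning it back on restores the normal health-based behaviour.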

This flexibility ensures that, even in the face of extreme conditions, we can prioritize availability and mitigate live issues without compromising long-term security guarantees.

The evolution of IAuth’s token management mechanism at JioHotstar reflects our commitment to protecting S2S communication while balancing security, resiliency, and operational continuity. By transitioning from static tokens to dynamic tokens and finally to a sophisticated hybrid model, we’ve addressed critical vulnerabilities and ensured seamless, uninterrupted service interactions. The enhanced hybrid approach, with its dynamic fallback mechanisms and synchronized heartbeats, enables IAuth to protect our service ecosystem against potential attacks without compromising on performance.

Securing distributed systems requires constant innovation and a deep understanding of potential failure points. With IAuth, we’ve created a flexible and robust solution to safeguard interactions across a complex landscape of services. As the threats to S2S communications continue to evolve, so too must our defenses, and IAuth will remain at the forefront of that effort.

We’d love to hear your thoughts on our approach or discuss how your organisation addresses similar security challenges. What strategies have worked for you in protecting critical services from internal vulnerabilities?
