Scaling Infrastructure for Millions: Datacenter Abstraction (Part 2)

In our previous blog, we explored the infrastructure limitations that made it challenging to achieve the performance and scalability required at a massive scale. We shared how we overcame those hurdles one by one, setting the stage for the next chapter in our journey. 📈

In this article, we take a deep dive into one of our most transformative projects: Datacenter (DC) Abstraction. This initiative redefined availability and scalability, unlocking new possibilities and transforming our infrastructure into a leaner, more efficient platform.

Before delving into DC abstraction, let’s revisit our infrastructure landscape and the challenges that demanded a paradigm shift in our architecture.

The Infrastructure Landscape

As briefly mentioned in our previous blog, we managed our infrastructure on two self-managed Kubernetes clusters. These clusters were capable of handling spikes of up to ~25 million concurrent users on the platform.

We deployed over 800 services across these clusters, with high-priority services deployed on both clusters for better scalability, availability, and reliability. Each service had its own AWS load balancer and used NodePort-based Kubernetes Services to expose specific ports on the worker nodes; those nodes were then registered as load balancer targets to distribute traffic.
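To make this concrete, here is a minimal sketch of what one of these NodePort-based Services looked like in spirit (the service name and port numbers are illustrative, not taken from our actual manifests):

# Illustrative only: a NodePort Service opens the same port (from the
# 30000–32767 range) on every worker node; those nodes were then registered
# as targets of the service's dedicated AWS load balancer.
apiVersion: v1
kind: Service
metadata:
  name: playback          # hypothetical service name
spec:
  type: NodePort
  selector:
    app: playback
  ports:
    - port: 8080          # cluster-internal port
      targetPort: 8080    # container port
      nodePort: 31080     # opened on every worker node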

The Client Flow: Traffic Routing

The client apps made calls to the external API Gateway (we used our CDNs as the external API Gateway), which ran security checks and executed a series of routing rules to forward each request to the respective service-specific load balancer. The load balancer then forwarded the request to a NodePort on any of the worker nodes, regardless of whether the targeted pods were running on that specific node. If the pod was not on the node that received the request, Kubernetes’ kube-proxy routed it to the correct node. Service-to-service calls followed the same request path, as depicted in the diagram below.

Could this setup carry us to 50+ million concurrent users? When we asked ourselves this question, the answer was clear: No, not with the existing infrastructure 🛑. Here’s why:

Key Limitations

  • Service Port Exhaustion:
    Our services primarily used LoadBalancer-type services, which rely on NodePort services to expose the pods on specific node ports. However, NodePort services can only use ports within a fixed range (typically 30000–32767). Given our large number of services, we were hitting these port limits.
  • Hardware and Instance Type Limitations:
    Our clusters were running Kubernetes v1.17, a version that is no longer supported, on older-generation hardware, which hurt both scalability and performance. We couldn’t leverage newer-generation instance types such as c6i or Graviton-based families in our KOPS clusters because that version lacked support for newer AWS instance families. Using these newer instance types was critical to reaching 50+ million concurrency at an optimized performance and price point.
  • IP Exhaustion:
    Our legacy KOPS clusters were massive — each hosting over 800 microservices. With every service deployment consuming multiple IP addresses across Pods, Services, and LoadBalancers, we were approaching subnet and VPC CIDR exhaustion. This wasn’t just a theoretical limit; we were hitting real-world barriers where adding even a few more nodes or services required painful IP juggling, service cleanup, or downscaling.
  • Scaling Operational Overhead:
    Before every major live tournament, we had to coordinate across the entire organization to gather pre-warming requirements for each service’s load balancers. This was a high-touch process that amounted to an operationally challenging, organization-wide campaign.
  • Cost Consideration:
    At high scale, cost management becomes critical. The traditional cluster autoscaler on our legacy clusters would have led to significant expense, as it is slower at consolidating nodes and therefore wastes resources. More advanced node-management solutions like Karpenter consolidate faster and more efficiently, optimizing costs. However, our legacy clusters couldn’t support the latest Kubernetes controllers like Karpenter, so these savings were out of reach (a rough sketch of such a configuration follows this list).
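For context, the kind of configuration we had in mind looks roughly like the Karpenter NodePool below. The values are illustrative and the fields follow Karpenter's v1 NodePool API, which may differ across versions; this is a sketch, not our production setup.

# Illustrative Karpenter NodePool: consolidation policies like this reclaim
# under-utilized nodes far faster than the traditional cluster autoscaler.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "10000"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m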

Given these challenges, we realized that scaling our infrastructure to meet the demands of 50+ million concurrent users required more than incremental improvements. Even if we naively scaled by simply adding more clusters, we’d still hit hard limits — managing dozens of ALBs, bumping into NodePort and IP exhaustion, and multiplying operational overhead. These problems don’t disappear with horizontal cluster sprawl. We needed a higher-order abstraction; enter the “data center”.

A data center, in our architecture, is essentially a higher-level abstraction that logically groups multiple Kubernetes clusters within a region, allowing them to function as a single, unified compute unit. This multi-cluster abstraction enables us to treat the entire collection of clusters within a region as one cohesive unit for deployment and operations.

Each application team is assigned a single logical namespace per data center, which spans across the underlying clusters but is used to deploy services into just one of them. This means applications are no longer tied to specific clusters, nor do they need to be spread across multiple clusters with custom routing logic. Instead, services are deployed into one chosen cluster within the data center, and the data center abstraction — powered by platform-level routing and proxy layers — handles all the complexity behind the scenes.

This shift enables clear team ownership, reduces operational sprawl, simplifies failover and capacity management, and ensures that traffic routing remains seamless and centralized. Essentially, applications are now namespaced by the data center, not the cluster, allowing us to scale horizontally while presenting a unified and cohesive platform to engineers.

By centralizing traffic management, observability, and platform-level services (such as rate limiting, authentication, and service discovery) into a shared infrastructure layer, we ensure that applications remain decoupled from the specifics of any one cluster. This architecture provides several key benefits:

  • Simplified failover and disaster recovery, since services can shift to another cluster within the same data center without changing namespaces or endpoints.
  • Improved resource utilization and elasticity, because capacity can be distributed and optimized across multiple clusters behind the scenes.
  • Reduced operational overhead, as platform teams can manage infrastructure uniformly at the data center level rather than per cluster.

To make the data center abstraction truly viable, we introduced a central proxy layer 🛡️ — an internal API Gateway built on Envoy. Think of it like address translation in an operating system: rather than pre-warming and managing 200+ discrete load balancers (the equivalent of touching every memory page individually), we scale a single fleet of Envoy-based internal gateways that, much like the page table and its TLB, translate each request to the right backend service and hide the complexity below.
This transition radically simplified operations and removed a whole class of risks. We no longer think about per-service load balancers — the gateway layer routes traffic to the right service on our behalf, behind the scenes.

With this, teams no longer need to think about routing traffic across clusters or deploying services in multiple places for redundancy. The gateway, combined with centralized auth, rate limiting, and load shedding, provides a unified control layer that’s installed on every cluster.
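To give a flavour of what such a gateway does (the hostnames, cluster names, and ports below are purely illustrative, not our production configuration), a trimmed-down Envoy config maps an incoming host to an upstream Kubernetes Service:

# Illustrative Envoy snippet: one listener fronts many services; routes
# decide which upstream (Kubernetes Service) each request goes to.
static_resources:
  listeners:
    - name: ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_routes
                  virtual_hosts:
                    - name: playback
                      domains: ["playback.internal.example.com"]   # hypothetical host
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: playback-svc }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: playback-svc
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: playback-svc
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: playback.prod.svc.cluster.local   # in-cluster Service DNS
                      port_value: 8080

In practice, routes like these are generated and pushed by a control plane rather than hand-written per service.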

Once the foundational architecture was in place, we made a deliberate choice to build our platform on top of cloud-managed EKS clusters. This wasn’t just about offloading cluster maintenance; it was about choosing a base that aligned with our core principles — scalability, resilience, and consistency. With EKS, we gain access to the latest Kubernetes features without the burden of manual upgrades or patching the control plane. It gives us the elasticity to rapidly provision new clusters, the confidence of built-in reliability from AWS, and the uniformity required to run multiple clusters as part of a single logical data center. EKS became the enabling substrate that allowed our multi-cluster platform to scale cleanly and operate predictably across regions.
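Provisioning a new member cluster for a data center is now a declarative exercise. As a purely illustrative sketch (we are not prescribing the exact tooling, and the names and sizes are hypothetical), an eksctl-style cluster definition looks like this:

# Illustrative only: declarative definition of one EKS cluster that would
# join a logical data center in a region.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: dc1-cluster-a        # hypothetical cluster name
  region: ap-south-1
  version: "1.29"
managedNodeGroups:
  - name: general
    instanceTypes: ["c6i.4xlarge"]   # newer-generation instances
    minSize: 3
    maxSize: 50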

  • Standardized Endpoints 🌍
    To simplify service discovery and communication across our infrastructure, we introduced a unified and standardized endpoint structure for all services across data centers. Previously, service owners often created multiple endpoints per service — spread across different domains — with no clear distinction between public and private interfaces. This led to confusion, redundant configurations, and an increased burden on both development and DevOps teams.
    With this initiative, we eliminated ambiguity by enforcing a consistent naming convention for service endpoints across environments. Each service is now exposed through three well-defined types of endpoints: intra-DC (used within the same data center), inter-DC (used for communication across data centers), and external (exposed publicly via the external API Gateway or CDN). These endpoints follow a predictable structure:
    Intra-DC: .internal.
    Inter-DC: .internal..
    External: .
    This standardization not only made traffic routing more intuitive for both clients and services, but also enabled seamless service migrations between clusters without breaking connectivity.
  • One Manifest to Rule Them All 📜
    We significantly simplified the application deployment process by introducing a single Kubernetes manifest that works across all environments and data centers. Previously, engineers had to maintain 5–6 environment-specific manifest files, even though the majority of the configuration was identical. This introduced redundancy, increased the chance of errors, and slowed down deployments. With our new approach, teams define their services using a single base manifest and apply environment-specific overrides only when necessary — for example, to adjust CPU or memory thresholds (an illustrative override appears after the sample manifest below).
    In addition to unifying the manifests, we abstracted away infrastructure-specific configurations such as endpoint definitions and ALB (Application Load Balancer) settings. These are now centrally managed by our internal infrastructure platform, removing the burden from individual engineers. Engineers no longer have to concern themselves with low-level wiring; they declare what their service needs, and the platform handles how it’s delivered. The result is faster, safer, and more consistent deployments with dramatically reduced operational overhead.

Sample base manifest snippet:

app:
  name:
  team:
  processes:
    main:
      image:
      resources:
        burst:
          cpu: 2000m
          memory: 4Gi
        guaranteed:
          cpu: 2000m
          memory: 4Gi
      healthCheck:
        type: tcpSocket
        port: 4001
        initialDelaySeconds: 120
        periodSeconds: 10
      preStopCommand: /bin/sleep 90
      metrics:
        port: 4001
        path: /metrics
      ports:
        - 4001
        - 8080
  ingress:
    rules:
      - path: /
        port: 8080
      - path: /
        port: 8080
        internetFacing: true
  dnsPolicy:
  deployment:
    terminationGracePeriodSeconds:
    maxUnavailable:
  logging:
    multiline: true
    regex:
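As an illustration of how the overrides work (the keys and values here are hypothetical; the exact schema is defined by our internal platform), a production override carries only the fields that differ from the base manifest:

# prod-overrides.yaml (hypothetical): everything else is inherited from the base manifest
app:
  processes:
    main:
      resources:
        burst:
          cpu: 4000m
          memory: 8Gi
        guaranteed:
          cpu: 4000m
          memory: 8Gi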
  • Eliminating NodePort Service Limitations with ClusterIP
    We transitioned from NodePort-based services to ClusterIP-based services, taking advantage of the newer AWS Load Balancer Controller, which can register Pod IPs directly as load balancer targets. This eliminated the need for NodePort-based services and addressed the NodePort limit, where the fixed port range (30000–32767) constrained how many services we could run. By adopting ClusterIP, we unlocked more flexibility and scalability while simplifying the overall traffic flow (see the sketch below).
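A minimal sketch of this pattern (the service name, ports, and annotation values are illustrative) pairs a ClusterIP Service with an Ingress in IP target mode, so the controller registers Pod IPs directly with the ALB:

# Illustrative only: ClusterIP Service + ALB Ingress in IP target mode.
apiVersion: v1
kind: Service
metadata:
  name: playback              # hypothetical service name
spec:
  type: ClusterIP
  selector:
    app: playback
  ports:
    - port: 8080
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: playback
  annotations:
    alb.ingress.kubernetes.io/target-type: ip    # register Pod IPs, no NodePorts involved
    alb.ingress.kubernetes.io/scheme: internal
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: playback
                port:
                  number: 8080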

With these changes, we solved the limitations of our legacy clusters, paving the way for a future-proof platform that is scalable, secure, and easy to manage! 🎉

Transitioning to a multi-cluster architecture wasn’t as simple as just creating new clusters and deploying applications. We needed to ensure that these clusters had enough compute and network capacity to handle traffic at the massive scale we were targeting. But here’s the challenge — services are not one-size-fits-all. Some are compute-intensive, others are memory-hungry, and some need high network throughput. So, how do we decide which services go where❓

We took a step back and worked closely with teams across the organization to collect critical data about each service. We asked questions like: What compute resources are required? How much network bandwidth and throughput will these services need at a scale of 50+ million concurrent users? This data-driven approach allowed us to make informed decisions about how to allocate services to clusters.

We avoided overloading any cluster by ensuring multiple compute- or network-heavy services weren’t deployed within the same cluster. For example, high-priority services like Ads and Personalisation couldn’t coexist in the same cluster since both require significant compute and network resources during peak traffic. We limited each cluster to one P0 stack to ensure peak performance and resource isolation. 🎯

The outcome? We deployed six EKS clusters in production, carefully balancing resources and scalability — thanks to close collaboration across teams that made this transition smooth and customer-safe. We believe this setup gives us enough headroom to scale and meet the challenges of tomorrow. 💪

Key Outcomes 👏

  • Successfully migrated over 200 services from legacy clusters to our new data center (DC) clusters, overcoming key infrastructure challenges and positioning us to meet our ambitious targets.
  • Standardized service endpoints across all data centers, streamlining traffic routing and management. The abstraction enables the Cloud Engineering Team to switch a service’s proxy or cluster without developer intervention.
  • Replaced individual service load balancers with a centralized API gateway, simplifying operations and improving performance.
  • Faster failure recovery: In case of cluster issues, apps can be moved to new clusters; no other changes are required, as the endpoint remains the same.
  • We successfully scaled our infrastructure to handle over 61 million concurrent users 🚀, and we’re just getting started — our infrastructure is built to go even further beyond this milestone. 🌟

In the next set of articles, we will dive deep into the internal platform we built to operate our systems at this massive scale. Stay tuned!
