    How Netflix Uses Java - 2025 Edition

    This article delves into Netflix's extensive use of Java in its backend systems, highlighting how the company continually evolves its architecture and technology stack. It covers everything from the high-level application architecture for both streaming and enterprise services to specific details about JDK versions, garbage collectors, threading models, and application frameworks. The discussion addresses common misconceptions, such as Java being used for UI, and emphasizes Netflix's strategic move away from reactive programming towards virtual threads. Furthermore, it provides insights into how Netflix manages its vast number of microservices, crucial for supporting millions of users and ensuring low latency across multiple regions. The article also touches upon the company's internal tools and processes for maintaining a robust and efficient Java platform, including their efforts to upgrade legacy systems and standardize on modern technologies like Spring Boot and GraphQL.

    Java at Netflix's Backend

    Netflix predominantly uses Java for its backend services, despite common assumptions that it might be used across all tiers. The presentation clarified that user interfaces (UIs) are developed using languages best suited for specific devices, with no Java involvement unless it's related to Android, which has a Java heritage. The primary focus for Java development at Netflix is on high-performance backend systems that support millions of users and high requests per second (RPS).

    There are two main categories of applications at Netflix:

    Netflix Streaming Applications

    These applications are characterized by extremely high RPS due to the massive user base. They operate across four different Amazon Web Services (AWS) regions to ensure low latency for users worldwide. Cross-region communication is inherently expensive and slower, adding complexity to the backend design. A typical request from a device fans out to numerous microservices. For failure scenarios, these services often employ retries with aggressive timeouts to maintain low latency. If retries fail, it's often acceptable to return a response with missing data, as the overall user experience might not be significantly impacted. Unlike traditional systems, relational data stores are generally avoided in the streaming backend due to multi-region complexities and the nature of the data, favoring in-memory distributed data stores for caching.
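
    The retry-then-degrade behavior described above can be sketched in plain Java. This is a minimal illustration using only the JDK, not Netflix's actual IPC client; the class name `RetryWithFallback`, the method `callWithRetry`, and the sample values are hypothetical.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class RetryWithFallback {

    /**
     * Runs the call up to maxAttempts times, allowing each attempt at most
     * `timeout`. Returns Optional.empty() when every attempt fails or times
     * out, so the caller can build a response without this piece of data.
     */
    static <T> Optional<T> callWithRetry(Supplier<T> call, int maxAttempts, Duration timeout) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T value = CompletableFuture.supplyAsync(call)
                        .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
                        .join();
                return Optional.ofNullable(value);
            } catch (RuntimeException e) {
                // join() reports timeouts and downstream failures as
                // unchecked exceptions; swallow them and try again.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        // Hypothetical downstream call: fails on the first attempt only.
        Optional<String> rating = callWithRetry(() -> {
            if (attempts.incrementAndGet() == 1) {
                throw new IllegalStateException("transient failure");
            }
            return "4.5 stars";
        }, 3, Duration.ofMillis(200));
        System.out.println(rating.orElse("(rating omitted)"));
    }
}
```

    Note that a timed-out attempt is abandoned rather than cancelled in this sketch; a production client would also propagate cancellation and bound the total time spent across attempts.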

    Enterprise Applications

    In addition to streaming, Netflix functions as a major film studio, necessitating numerous traditional enterprise applications for managing movie production, personnel, equipment, and stages. These apps typically exhibit very low RPS compared to streaming services and generally operate within a single AWS region. Data often fits well into a relational database model, and unlike streaming, data persistence is critical, meaning failures resulting in data loss are unacceptable. This leads to a different failure model where retries are not always a viable option, and data must reliably end up in a database.

    Architectural Overview

    Despite the differences in traffic patterns and failure tolerance, both Netflix's streaming and enterprise applications share a surprisingly similar architecture, centered around GraphQL.

    GraphQL Federation

    Incoming requests, whether from a TV for streaming or a laptop for an enterprise app, first hit an API Gateway. This gateway handles federated GraphQL queries, meaning that while users perceive a single GraphQL schema, it's actually implemented by many different backend services. The gateway uses a schema registry to determine which backend service is responsible for fetching specific data. These services are often referred to as Domain Graph Services (DGS), built using the DGS framework and Spring Boot, making them entirely Java-based.
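
    To make the gateway's routing step concrete, here is a toy sketch of a registry lookup that groups requested fields by the service that owns them. It is a deliberate simplification: real GraphQL federation plans queries over types, nested fields, and entity keys, and the class `SchemaRegistrySketch` and the service names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SchemaRegistrySketch {
    // Hypothetical registry contents: top-level field name -> owning service.
    private final Map<String, String> ownerByField;

    SchemaRegistrySketch(Map<String, String> ownerByField) {
        this.ownerByField = Map.copyOf(ownerByField);
    }

    /** Groups the requested fields by owning service, in first-seen order. */
    Map<String, List<String>> plan(List<String> requestedFields) {
        Map<String, List<String>> fieldsByService = new LinkedHashMap<>();
        for (String field : requestedFields) {
            String owner = ownerByField.get(field);
            if (owner == null) {
                throw new IllegalArgumentException("No service owns field: " + field);
            }
            fieldsByService.computeIfAbsent(owner, k -> new ArrayList<>()).add(field);
        }
        return fieldsByService;
    }

    public static void main(String[] args) {
        SchemaRegistrySketch registry = new SchemaRegistrySketch(Map.of(
                "title", "show-dgs",
                "synopsis", "show-dgs",
                "rating", "rating-dgs"));
        // One client query, two backend calls.
        System.out.println(registry.plan(List.of("title", "rating", "synopsis")));
    }
}
```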

    Inter-Service Communication

    From these initial services, there's often further fan out to other backend services. While GraphQL is preferred for device-to-backend communication due to its HTTP-based nature and flexibility, gRPC is frequently used for Java service-to-Java service communication. gRPC is a highly efficient binary protocol that allows services to interact as if calling a method on another service. Various data stores are utilized, including in-memory distributed data stores like EVCache, Kafka for streaming, and Cassandra, chosen based on specific use case requirements.

    Beyond Core Streaming and Enterprise Apps

    The architecture described primarily covers the "discovery" aspect of Netflix, where users browse titles. Once a user clicks play, other systems come into action, primarily managed by Open Connect. This system involves appliances (servers) housed at internet providers globally to deliver movie content directly to users with minimal latency. The management software for Open Connect is almost entirely Java-based. Additionally, media encoding pipelines and various stream processing systems within Netflix are also built using Java. While other languages like Go for low-level platform tasks and Python for machine learning are present, Java remains the dominant language for the vast majority of Netflix's backend.

    JDK Evolution at Netflix

    Netflix has made significant strides in upgrading its Java Development Kit (JDK) versions, moving from a previously outdated JDK 8 to modern versions like JDK 17, JDK 21, and even experimental use of JDK 23/24. This transition involved overcoming several challenges:

    Overcoming Legacy Hurdles

    Netflix was stuck on JDK 8 for an extended period due to an outdated in-house application framework and numerous old libraries incompatible with newer Java versions. To break this cycle, they embarked on a multi-pronged approach. Firstly, they systematically patched unmaintained libraries to ensure JDK compatibility, minimizing the effort required for service owners to upgrade. This strategic patching of a handful of critical libraries proved effective in unblocking the upgrade path without forcing extensive code changes.

    Secondly, a massive migration of all Java services to Spring Boot was initiated approximately two to three years ago. This involved significant effort and custom tooling, including automated code transformations, to facilitate the transition of over 3,000 applications. The successful migration means nearly all services now run on Spring Boot, with a handful of legacy exceptions maintained only for older device compatibility. Consequently, almost all services now run on JDK 17 or newer, with high-RPS services leveraging JDK 20/21/23 to utilize advanced garbage collectors.

    Garbage Collector Improvements

    Upgrading JDK versions yielded substantial performance benefits, particularly concerning garbage collection (GC). With JDK 17, the G1 garbage collector, already in use, saw significant improvements. Netflix observed approximately 20% less CPU time spent on garbage collection in high-RPS services, a substantial performance gain essentially "for free" just by upgrading the JDK.

    The introduction of a generational ZGC (Z Garbage Collector) in JDK 21 marked a pivotal improvement. Previously, ZGC, a low-pause-time GC, was not generational, making it inefficient for services with large, long-lived objects. The generational ZGC in JDK 21 transformed it into a highly effective general-purpose garbage collector for most workloads. Metrics showed a dramatic reduction in maximum GC pause times, from over a second (leading to timeouts and retries) to virtually zero with generational ZGC. This significantly reduced error rates on Inter-Process Communication (IPC) calls, improved service consistency, and allowed services to run at higher CPU loads, squeezing more performance out of existing machines. The switch simply involved a configuration change from G1 to ZGC.
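
    The configuration change mentioned above is a JVM flag rather than a code change. On JDK 21, generational ZGC is selected with `-XX:+UseZGC -XX:+ZGenerational` (JEP 439). The small program below, a sketch that is not taken from the talk, uses the standard `java.lang.management` API to report which collectors the running JVM is using, which is a quick way to confirm that such a flag took effect. The class name `ActiveGcReport` is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

final class ActiveGcReport {

    /** Names of the garbage collector beans registered in this JVM. */
    static List<String> collectorNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .toList();
    }

    public static void main(String[] args) {
        // Example launch on JDK 21:
        //   java -XX:+UseZGC -XX:+ZGenerational ActiveGcReport.java
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms accumulated%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

    The bean names differ per collector, so running the program once with the default collector and once with the ZGC flags shows the switch directly.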

    "When we moved to JDK 17 what we saw is that the G1 garbage collector just got a lot better. So on Java 8 we were using G1... on 17 we were still using G1, it just got a lot better, because that was a lot of Java releases where work had been done on the performance of the JVM, mostly on the garbage collectors. And what we saw is that we got about 20% less CPU time spent on garbage collection on a lot of these high-RPS services, and that is just a lot of performance we get basically for free by just upgrading to the new JDK."

    "When we switch to ZGC... the graph just drops, and it doesn't mean it stopped measuring it. It's measuring, but it's running ZGC, and you see there's just no pause times anymore. So that's really impressive: we went from like more than a second pause times to zero."

    The Promise of Virtual Threads

    Netflix is keenly interested in virtual threads, especially with the features in JDK 21 and beyond. The strategy involves integrating virtual thread support into their Spring Boot-based application framework and DGS framework so developers can benefit automatically without altering their coding style. This allows for parallel execution of potentially slow operations that previously ran serially on platform threads, leading to significant performance gains, especially for GraphQL resolvers. The "free" nature of virtual threads (low overhead) makes this parallel execution a default behavior, improving developer experience by eliminating the need for manual thread pool management or complex reactive programming constructs.
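
    As an illustration of running slow, independent lookups in parallel on virtual threads, the sketch below uses the standard `Executors.newVirtualThreadPerTaskExecutor()` API (final in JDK 21). It is not the DGS framework's implementation; `ParallelResolvers`, `resolveAll`, and `slowLookup` are hypothetical names, and the 200 ms sleep stands in for a remote call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelResolvers {

    /** Runs every resolver on its own virtual thread; results keep input order. */
    static <T> List<T> resolveAll(List<Callable<T>> resolvers) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<T> results = new ArrayList<>();
            for (Future<T> future : executor.invokeAll(resolvers)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while resolving", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Resolver failed", e.getCause());
        }
    }

    // Stand-in for a blocking call to another service.
    private static String slowLookup(String field) throws InterruptedException {
        Thread.sleep(200);
        return field + "-value";
    }

    public static void main(String[] args) {
        List<Callable<String>> resolvers = List.of(
                () -> slowLookup("title"),
                () -> slowLookup("rating"),
                () -> slowLookup("artwork"));
        long start = System.nanoTime();
        List<String> values = resolveAll(resolvers);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Elapsed time should be close to one lookup, not the sum of three.
        System.out.println(values + " in ~" + elapsedMs + " ms");
    }
}
```

    Because the blocking calls are written in ordinary sequential style, no thread pool sizing or reactive operators are involved, which is the developer-experience point made above.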

    A bold claim was made that virtual threads, combined with structured concurrency, are expected to completely replace reactive programming. This is significant given Netflix's historical role in developing RxJava, a foundational reactive programming library. While reactive programming offers concurrency benefits, it often introduces considerable code and debugging complexity. Netflix has largely moved away from reactive programming because it found this trade-off unfavorable. The integration of structured concurrency is anticipated to remove the last remaining needs for reactive paradigms, simplifying application development.

    However, the initial rollout of virtual threads in JDK 23 encountered unexpected deadlocks. This was traced back to external libraries using the `synchronized` keyword, which would "pin" a virtual thread to a platform thread. If all platform threads became pinned while waiting for a lock held by another virtual thread that couldn't run due to lack of available platform threads, a deadlock ensued. This led to services becoming unresponsive. Fortunately, JDK 24 (via JEP 491) addressed this by reimplementing how `synchronized` interacts with virtual threads, eliminating the pinning issue. With this fix, Netflix expects to resume the aggressive rollout of virtual threads, confident in their stability and performance benefits.
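
    For code that a team controls, the mitigation commonly recommended before JDK 24 (for example in JEP 444) was to guard blocking sections with `java.util.concurrent.locks.ReentrantLock` instead of `synchronized`, since a virtual thread that blocks while holding a `ReentrantLock` can unmount from its carrier thread. The sketch below shows that pattern; it is an assumption-labeled illustration, not what Netflix did, because per the text above the offending `synchronized` blocks lived in external libraries. `LockGuardedLoader` and its methods are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

final class LockGuardedLoader {
    private final ReentrantLock lock = new ReentrantLock();
    private int loads;

    /** Performs a blocking step while holding the lock and counts the call. */
    int load() {
        lock.lock();
        try {
            Thread.sleep(1); // stands in for blocking I/O done under the lock
            return ++loads;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            lock.unlock();
        }
    }

    int total() {
        lock.lock();
        try {
            return loads;
        } finally {
            lock.unlock();
        }
    }

    /** Calls load() once from each of `callers` virtual threads. */
    static int loadConcurrently(int callers) {
        LockGuardedLoader loader = new LockGuardedLoader();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < callers; i++) {
                executor.execute(loader::load);
            }
        } // close() waits for all submitted tasks to finish
        return loader.total();
    }

    public static void main(String[] args) {
        System.out.println("completed loads: " + loadConcurrently(100));
    }
}
```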

    Application Framework: Spring Boot Netflix

    At the core of Netflix's application development is "Spring Boot Netflix," which is essentially open-source Spring Boot augmented with a suite of internal modules. This custom distribution integrates Spring Boot seamlessly into Netflix's unique infrastructure and ecosystem, while maintaining the standard Spring Boot programming model. Developers interact with familiar concepts, annotations, and APIs, ensuring a low learning curve and consistency with broader Spring development practices.

    Key components and integrations added to Spring Boot Netflix include:

    • Security Integration: Seamless integration with Netflix's internal authentication and authorization systems, exposed through standard Spring Security annotations like `@Secured` and `@PreAuthorize`.
    • Service Mesh Integration: Integration with their service mesh, based on Proxied, for service discovery, TLS, and other network functionalities.
    • gRPC Client and Server Programming Model: An annotation-based programming model for easily implementing gRPC servers and clients, abstracting away the complexities of gRPC for developers.
    • Observability: Integration for distributed logging, tracing, and metrics, primarily leveraging Micrometer, but utilizing Netflix's custom-built high-scale systems for data storage.
    • Fast Properties: A dynamic configuration system that allows changing most application configurations at runtime without requiring service restarts, critical for incident management and feature flag toggling.
    • IPC Clients: Extensions to standard clients like Spring's WebClient to incorporate Netflix-specific resiliency behaviors, such as retries and circuit breakers.
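
    The idea behind a runtime-changeable property, as in the Fast Properties item above, can be sketched with a small in-memory store whose readers always see the latest value. This is only a conceptual illustration: the real system's API is not described in the source, and `DynamicProperties`, `intProperty`, and the key `ipc.retries` are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class DynamicProperties {
    private final Map<String, String> values = new ConcurrentHashMap<>();

    /** Simulates an operator pushing a new value at runtime. */
    void set(String key, String value) {
        values.put(key, value);
    }

    /**
     * Returns a handle that re-reads the property on every call, so callers
     * pick up changes without a restart. Falls back to the default when the
     * property is missing or not a valid integer.
     */
    Supplier<Integer> intProperty(String key, int defaultValue) {
        return () -> {
            String raw = values.get(key);
            if (raw == null) {
                return defaultValue;
            }
            try {
                return Integer.parseInt(raw.trim());
            } catch (NumberFormatException e) {
                return defaultValue;
            }
        };
    }

    public static void main(String[] args) {
        DynamicProperties props = new DynamicProperties();
        Supplier<Integer> retries = props.intProperty("ipc.retries", 2);
        System.out.println("before change: " + retries.get());
        props.set("ipc.retries", "0"); // e.g. disable retries during an incident
        System.out.println("after change: " + retries.get());
    }
}
```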

    Netflix's strong relationship with the Spring team is a significant factor in their commitment to Spring Boot. They actively collaborate, providing feedback and working together on new features, ensuring the framework continues to meet their evolving needs. Spring Boot's long-term reliability, continuous innovation in leveraging new Java features (like virtual threads), and its widespread adoption in the developer community (simplifying new hire onboarding) reinforce its position as Netflix's preferred application framework.

    Deployment and Startup Time

    Netflix deploys Spring Boot applications either directly on AWS instances or on Titus, their in-house container platform (which shares similarities with Kubernetes). Applications are deployed as exploded JAR files with embedded Tomcat. While they have experimented with GraalVM Native Image for faster startup times, it has not been adopted for widespread use due to implementation complexities and adverse effects on developer experience (e.g., increased build times during development). Instead, Netflix is betting on Project Leyden's Ahead-of-Time (AOT) compilation capabilities to improve startup performance in the future, again in collaboration with the Spring team.

    Spring Boot 3 Upgrade and Jakarta EE

    The upgrade to Spring Boot 3, which baselines on JDK 17, presented a notable challenge due to the shift from `javax` to `jakarta` namespaces. While this is a simple find-and-replace in application code, the change significantly impacted shared libraries that might still be built against Spring Boot 2. To mitigate this, Netflix implemented a Gradle transform plugin that performs a bytecode rewrite at artifact resolution time, converting `javax` to `jakarta` namespaces on the fly for dependencies. This solution, open-sourced as part of the Nebula ecosystem, allowed them to safely bridge the gap between legacy and modern libraries without breaking existing applications during the transition.
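
    The core decision in such a rewrite is which class names to remap. The sketch below shows only that name-mapping step, not actual bytecode transformation and not the Nebula plugin's code. It also illustrates one subtlety: only the packages that moved to Jakarta EE should be renamed, while JDK-owned packages such as `javax.sql` and `javax.crypto` keep their names. The prefix list is a deliberately incomplete, illustrative subset, and `JakartaNamespaceMapper` is a hypothetical name.

```java
import java.util.List;

final class JakartaNamespaceMapper {
    // Illustrative subset of packages that moved from javax.* to jakarta.*.
    // JDK packages (javax.sql, javax.crypto, javax.net, ...) must be left alone.
    private static final List<String> MOVED_PREFIXES = List.of(
            "javax.servlet.",
            "javax.persistence.",
            "javax.validation.",
            "javax.inject.",
            "javax.ws.rs.");

    /** Returns the jakarta.* name for moved packages, or the input unchanged. */
    static String map(String className) {
        for (String prefix : MOVED_PREFIXES) {
            if (className.startsWith(prefix)) {
                return "jakarta." + className.substring("javax.".length());
            }
        }
        return className;
    }

    public static void main(String[] args) {
        System.out.println(map("javax.servlet.http.HttpServletRequest"));
        System.out.println(map("javax.sql.DataSource"));
    }
}
```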

    DGS Framework and GraphQL Adoption

    Netflix has been a strong proponent of GraphQL, leading to the creation and open-sourcing of the DGS (Domain Graph Service) framework in 2020. This framework, built on top of the lower-level GraphQL Java library, provides a Spring Boot integration with an annotation-based programming model for writing GraphQL resolvers. It also includes a robust testing framework that allows running GraphQL queries against services without needing to start a full web server, significantly speeding up development and testing cycles.

    Recognizing the growing importance of GraphQL, the Spring team also began developing their own GraphQL support. To avoid fragmented efforts, Netflix extensively collaborated with the Spring team to shape Spring for GraphQL. As a result, the DGS framework now integrates Spring for GraphQL components under the hood, allowing developers to seamlessly use features from both programming models.

    IPC Communication Strategy

    Netflix's Inter-Process Communication (IPC) strategy heavily favors GraphQL for UI-to-backend communication and gRPC for server-to-server communication, with a strong recommendation against using REST.

    Communication Type | Recommended Protocol | Reasoning
    -------------------|----------------------|----------
    UI to Backend | GraphQL | Offers a flexible API with schemas, allowing UI developers to query precisely the data they need. Facilitates collaboration by thinking in terms of data.
    Server to Server | gRPC | Extremely performant due to its binary protocol. Provides a strong schema (Protobuf) and allows services to model interactions as method calls, aligning with the mental model for inter-service communication.
    REST | Not Recommended | Lacks a flexible API and schema, often forcing clients to receive more data than necessary. While easier for quick data dumps, it leads to a poor developer experience for UI and backend collaboration. Adding a schema (e.g., OpenAPI) is an afterthought.

    "If you think about a UI talking to a back end, you want to have a flexible schema, or a flexible API basically, that kind of works for all these different clients that you need to deal with, and GraphQL gives you that. It gives you that really flexible way of querying data, and very importantly you have a schema, so that is the way you collaborate between UI developers and backend developers... If you're talking about server-to-server communication, you often want to think a little bit more about, okay, now I'm actually just calling a method, it just happens to run on another server. That's kind of the mental model that you're in, and that is what gRPC is really good at."

    This strategic approach to IPC reflects Netflix's continuous effort to optimize performance, maintain flexibility, and enhance developer experience across its vast and evolving microservices landscape.

    Takeaways

    1. Java's Dominance in Backend: Netflix primarily uses Java for its backend services, including high-RPS streaming applications and traditional enterprise systems. Contrary to a common assumption, Java is not used for its UIs.
    2. Strategic JDK Upgrades: Moving from JDK 8 to JDK 17, 21, and beyond, Netflix achieved significant performance gains, particularly from improved garbage collectors like generational ZGC, which nearly eliminated GC pause times and reduced error rates.
    3. Virtual Threads Revolution: Netflix is heavily invested in virtual threads to simplify concurrent programming, aiming to replace reactive programming paradigms by providing automatic parallel execution with minimal overhead, although initial deployments required fixes in JDK 24 for synchronization issues.
    4. Spring Boot as Foundation: All new Java applications are built on "Spring Boot Netflix," an augmented version of open-source Spring Boot, providing deep integration with Netflix's internal infrastructure while maintaining a standard developer experience.
    5. GraphQL and gRPC for IPC: Netflix has standardized on GraphQL for UI-to-backend communication due to its flexible schemas and data-centric approach, and on gRPC for high-performance, schema-driven server-to-server communication, while strongly recommending against REST for modern applications.

    This article was AI generated. It may contain errors and should be verified with the original source.