Navigating the Dataflow Runner Dilemma: Why Java and Kafka Often Favor Runner V1

Modernizing a data pipeline isn’t always a linear path to the latest version. While Cloud Dataflow Runner v2 is the modern standard, certain high-performance Java streaming workloads, specifically those utilizing KafkaIO, encounter unique architectural hurdles. For pure Java streaming, Runner v1 is often still the preferred, battle-hardened choice for strict Service Level Agreements (SLAs).

The Performance Wall: V1 vs. V2 Architecture

The transition from Runner V1 to V2 is a shift from a monolithic process to a decoupled, portable architecture. While V2 offers many benefits, the “portability tax” can be heavy for latency-sensitive applications, especially Java and Kafka workloads.

The KafkaIO and Java Performance Gap

For pipelines requiring sub-5-second latency, migrating to Runner V2 under a modern Apache Beam SDK release can sometimes trigger an unexpected performance regression.

For example, in one customer pipeline ingesting high-throughput data from Kafka into AlloyDB, the team upgraded from Apache Beam 2.62 to 2.68 and switched from Runner v1 to v2.

Latency Spike: In production, event processing time jumped from under 5 seconds to 2.5 minutes.
Technical Root Cause: The regression was attributed to the overhead inherent in Runner V2’s portable architecture when handling KafkaIO sources. Unlike Runner V1’s native execution, the portable layer introduced latency that violated the SLA.
Solution: Reverting to Runner V1 while retaining the upgraded Beam SDK version allowed the pipeline to meet its SLA immediately while keeping its code modernised.

What is a Splittable DoFn?

A Splittable DoFn (SDF) is a specialised `DoFn` that enables non-monolithic element processing. Unlike a standard `DoFn`, which must process an entire element in a single invocation, an SDF allows the processing of a single element to be parallelised, dynamically split, or checkpointed.

This is achieved by pairing each input element with a “restriction” that defines the subset of work to be performed, such as an offset range in a Kafka partition. The SDF framework uses a `RestrictionTracker` (such as `OffsetRangeTracker`) to coordinate dynamic adjustments to these restrictions in a thread-safe way. This allows the runner to split the remaining work for parallelisation or to checkpoint work to free up worker threads.
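
The claim-then-split interplay described above can be sketched in plain Java. This is a toy model written for illustration only: `ToyOffsetTracker` and its methods are hypothetical names, not Beam's `OffsetRangeTracker`, and the real tracker handles more cases. It only aims to show how processing code claims offsets while the runner shrinks the restriction and takes the rest as residual work.

```java
import java.util.Optional;

/** Toy model of an offset-range restriction. NOT Beam's OffsetRangeTracker. */
final class ToyOffsetTracker {
    private long to;           // exclusive end of the restriction; shrinks on split
    private long lastClaimed;  // last offset successfully claimed

    ToyOffsetTracker(long from, long to) {
        if (from > to) {
            throw new IllegalArgumentException("from must be <= to");
        }
        this.to = to;
        this.lastClaimed = from - 1;
    }

    /** Called by processing code before it handles an offset. */
    synchronized boolean tryClaim(long offset) {
        if (offset <= lastClaimed) {
            throw new IllegalArgumentException("offsets must be claimed in increasing order");
        }
        if (offset >= to) {
            return false; // outside the (possibly shrunk) restriction: stop processing
        }
        lastClaimed = offset;
        return true;
    }

    /**
     * Called by the runner. Keeps the given fraction of the unclaimed work and
     * returns the rest as a residual [start, end) range. A fraction of 0 acts
     * as a checkpoint: everything not yet claimed is handed back.
     */
    synchronized Optional<long[]> trySplit(double fractionOfRemainder) {
        if (fractionOfRemainder < 0.0 || fractionOfRemainder > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        long firstUnclaimed = lastClaimed + 1;
        long remaining = to - firstUnclaimed;
        if (remaining <= 0) {
            return Optional.empty(); // nothing left to split
        }
        long splitAt = firstUnclaimed + (long) Math.floor(remaining * fractionOfRemainder);
        if (splitAt >= to) {
            return Optional.empty(); // the primary keeps everything
        }
        long[] residual = {splitAt, to};
        to = splitAt; // the primary restriction now ends at the split point
        return Optional.of(residual);
    }

    public static void main(String[] args) {
        ToyOffsetTracker tracker = new ToyOffsetTracker(0, 10);
        tracker.tryClaim(0);
        tracker.tryClaim(1);
        tracker.tryClaim(2);
        long[] residual = tracker.trySplit(0.0).get(); // checkpoint after offset 2
        System.out.println("residual = [" + residual[0] + ", " + residual[1] + ")");
        System.out.println("claim 3 after checkpoint: " + tracker.tryClaim(3));
    }
}
```

Both methods are synchronized because, in this model, the claim comes from the processing thread while the split request comes from the runner.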

The Runner v2 Paradox: Built for SDF, but at a Cost

Under Runner v2, modern I/O connectors such as KafkaIO run entirely as SDFs. While this enables features like dynamic work-stealing and cleaner checkpointing, it forces every data claim through the decoupled SDK Harness layer via the Beam Fn API. For high-throughput Java pipelines, routing data through this API imposes a severe “portability tax” in the form of continuous serialization, deserialization, and inter-process communication (IPC) overhead.
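
To make the serialization part of that tax concrete, here is a minimal sketch in plain Java. It does not use or reproduce the Beam Fn API; `BoundaryCostSketch` and its length-prefixed encoding are assumptions chosen for illustration. The point is only that data crossing a process boundary must be encoded and decoded, whereas an in-process handoff passes a reference and copies nothing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a stand-in for "data must be encoded to cross a boundary". */
final class BoundaryCostSketch {

    /** Encodes records into one buffer: a record count, then each record. */
    static byte[] encode(List<String> records) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(records.size());
            for (String record : records) {
                out.writeUTF(record); // 2-byte length prefix + modified UTF-8 bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds the records on the "other side" of the boundary. */
    static List<String> decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int count = in.readInt();
            List<String> records = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                records.add(in.readUTF());
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("order-1", "order-2", "order-3");

        // In-process handoff: the consumer receives the same object, no copy.
        List<String> inProcess = records;

        // Cross-boundary handoff: every byte is written once and read once.
        byte[] wire = encode(records);
        List<String> crossBoundary = decode(wire);

        System.out.println("same object in-process: " + (inProcess == records));
        System.out.println("bytes on the wire: " + wire.length);
        System.out.println("equal after round trip: " + crossBoundary.equals(records));
    }
}
```

In a streaming job this encode and decode work repeats for every batch of records, which is why it shows up as latency rather than as a one-time cost.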

By deliberately choosing Runner v1, we bypass this default SDF wrapping layer. Runner v1 allows the pipeline to utilize the legacy UnboundedSource framework natively inside the worker’s monolithic memory space, entirely avoiding the constant Fn API handshakes required by v2’s portable implementation.
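
As a rough sketch of how a job might opt out of Runner v2, the helper below appends an experiment flag to a pipeline's arguments. The experiment name `disable_runner_v2` is an assumption based on the Dataflow documentation and should be checked against the docs for the SDK version in use; `RunnerV1Args` is a hypothetical helper, not part of Beam.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that adds a Runner v1 opt-out to existing job arguments. */
final class RunnerV1Args {

    // Assumption: the opt-out experiment name; verify it for your Beam SDK version.
    static final String DISABLE_V2 = "--experiments=disable_runner_v2";

    /** Returns the base arguments plus the opt-out flag, without duplicating it. */
    static String[] withRunnerV1(String... baseArgs) {
        List<String> args = new ArrayList<>(Arrays.asList(baseArgs));
        if (!args.contains(DISABLE_V2)) {
            args.add(DISABLE_V2);
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // In a real pipeline the result would be handed to
        // PipelineOptionsFactory.fromArgs(...) instead of being printed.
        String[] jobArgs = withRunnerV1("--runner=DataflowRunner", "--streaming=true");
        System.out.println(String.join(" ", jobArgs));
    }
}
```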

Additional Flags Used for Runner V1 Optimization

To squeeze every millisecond of performance out of Runner v1’s native execution, its pipeline options can be fine-tuned.

unboundedReaderMaxReadTimeMs=500

By default, the legacy reader may wait up to 10 seconds to fill and commit a bundle. Capping this maximum wait at 500 ms via a command-line argument forces smaller bundles that commit downstream much sooner.
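
The effect of the cap can be illustrated with a small simulation. This is a simplified model under stated assumptions, not Dataflow's reader: records arrive at a fixed interval starting at time 0, the clock is simulated, and the 10,000-record bundle limit used in `main` is a hypothetical value. A bundle closes when it is full or when the read-time cap expires, whichever comes first.

```java
/** Simplified model of a time-capped bundle read. NOT Dataflow's actual reader. */
final class ReadTimeCapSketch {

    /** Outcome of one simulated bundle. */
    static final class Bundle {
        final int records;     // how many records the bundle holds
        final long closedAtMs; // simulated time at which the bundle was committed

        Bundle(int records, long closedAtMs) {
            this.records = records;
            this.closedAtMs = closedAtMs;
        }
    }

    /**
     * Fills a bundle from a source that emits one record every arrivalIntervalMs,
     * starting at t = 0, until the bundle is full or maxReadTimeMs has elapsed.
     */
    static Bundle readBundle(long arrivalIntervalMs, int maxRecords, long maxReadTimeMs) {
        if (arrivalIntervalMs <= 0 || maxRecords < 1 || maxReadTimeMs < 0) {
            throw new IllegalArgumentException("invalid simulation parameters");
        }
        int records = 0;
        long now = 0;
        while (now < maxReadTimeMs) {
            records++; // the record arriving at `now` joins the bundle
            if (records == maxRecords) {
                return new Bundle(records, now); // full: commit immediately
            }
            now += arrivalIntervalMs; // wait for the next record
        }
        return new Bundle(records, maxReadTimeMs); // cap expired: commit what we have
    }

    public static void main(String[] args) {
        // One record every 100 ms; the bundle size limit is never reached.
        Bundle defaultCap = readBundle(100, 10_000, 10_000);
        Bundle tightCap = readBundle(100, 10_000, 500);
        System.out.println("10 s cap:   " + defaultCap.records
                + " records, committed at " + defaultCap.closedAtMs + " ms");
        System.out.println("500 ms cap: " + tightCap.records
                + " records, committed at " + tightCap.closedAtMs + " ms");
    }
}
```

Because the first record arrives at time 0 in this model, `closedAtMs` is also how long that record waits before moving downstream, which is the quantity the cap bounds.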

Conclusion: Choosing Your Path

While Runner V2 is essential for multi-language pipelines and advanced features like Dataflow Prime, Runner V1 remains a powerful, low-latency alternative for pure Java streaming jobs. For Kafka-heavy workloads where every millisecond counts, the combination of Runner V1 and the legacy UnboundedSource architecture provides a validated path to stability and performance.

Big gratitude to collaborators and reviewers: Sri Harshini Donthineni (Cloud Data & AI Consultant) and Abdulsalam Abdullateef (Cloud Data & AI and Agentic Cloud Consultant).

Navigating the Dataflow Runner Dilemma: Why Java and Kafka Often Favor Runner V1 was originally published in Google Cloud – Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source Credit: https://medium.com/google-cloud/navigating-the-dataflow-runner-dilemma-why-java-and-kafka-often-favor-runner-v1-60cce4c61662?source=rss—-e52cf94d98af—4