Managed Service for Kafka benchmarking and scaling guidance

Important considerations

To achieve optimal consumer throughput, the fetch-size property is crucial for tuning. The default fetch-size configuration is largely determined by your consumption and throughput needs, and can range from up to 1MB for smaller messages to 1-50MB for larger ones. It’s advisable to analyze the effects of different fetch sizes on both application responsiveness and throughput. By carefully documenting these tests and examining the resulting information, you can pinpoint performance limitations and refine your settings accordingly.

How to benchmark throughput and latencies

Benchmarking the producer

When conducting tests to measure the throughput and latencies of Kafka producers, the key parameters are batch.size, or the maximum size of a batch of messages, and linger.ms, the maximum time to wait for a batch to fill before sending. For the purposes of this benchmark, we suggest keeping acks at 1 (acknowledgment from the leader broker) to balance durability and performance. This helps us to estimate the expected throughput and latencies for a producer. Note that message size is kept constant as 1KB.

Throughput (messages/s)	Throughput (MBs)	Latency(ms)	ack(=1)	batch_size	linger_ms
48049	45	608	Leader	1KB	10
160694	153	171	Leader	10KB	10
117187	111	268	Leader	100KB	10
111524	106	283	Leader	100KB	100

Analysis and findings

The impact of batch size: As expected, increasing batch size generally leads to higher throughput (messages/s and MBs). We see a significant jump in throughput as we move from 1KB to 10KB batch sizes. However, further increasing the batch size to 100KB does not show a significant improvement in throughput. This suggests that an optimal batch size exists beyond which further increases may not yield substantial throughput gains.
Impact of linger time: Increasing the linger time from 10ms to 100ms with a 100KB batch size slightly reduced throughput (from 117,187 to 111,524 messages/s). This indicates that, in this scenario, a longer linger time might not be much beneficial for maximizing throughput.
Latency considerations: Latency tends to increase with larger batch sizes. This is because messages wait longer to be included in a larger batch before being sent. This is clearly visible when batch_size is increased from 10KB to 100KB.

Together, these findings highlight the importance of careful tuning when configuring Kafka producers. Finding the optimal balance between batch.size and linger.ms is crucial for achieving desired throughput and latency goals.

Benchmarking for consumer

To assess consumer performance, we conducted a series of experiments using kafka-consumer-perf-test, systematically varying the fetch size.

Throughput(messages/sec)	Throughput(MBs)	fetch-size
2825	2.6951	10KB
3645	3.477	100KB
18086	17.8	1MB
49048	46	10MB
61334	58	100MB
62562	60	500MB

Analysis and findings

Impact of fetch size on throughput: The results clearly demonstrate a strong correlation between fetch.size and consumer throughput. As we increase the fetch size, both message throughput (messages/s) and data throughput (MBs) improve significantly. This is because larger fetch sizes allow the consumer to retrieve more messages in a single request, reducing the overhead of frequent requests and improving data transfer efficiency.
Diminishing returns: While increasing fetch.size generally improves throughput, we observe diminishing returns as we move beyond 100MB. The difference in throughput between 100MB and 500MB is not significant, suggesting that there’s a point where further increasing the fetch size provides minimal additional benefit.

Scaling the Google Managed Service for Apache Kafka

Based on some more experiments, we explored optimal configurations for the managed Kafka cluster. Please note that for this exercise, we kept message size as 1KB, batch size as 10KB, the topic has 1000 partitions, and the replication number is 3. The results were as follows.

Producer threads	cluster_bytes_in_count (MBs)	CPU Util	Memory Util	vCPU	Memory
1	56	98%	58%	3	12gb
1	61	24%	41%	12	48gb
2	104	56%	57%	12	48gb
4	199	64%	60%	12	48gb

Scaling your managed Kafka cluster effectively is crucial to ensure optimal performance as your requirements grow. To determine the right cluster configuration, we conducted experiments with varying numbers of producer threads, vCPUs, and memory. Our findings indicate that vertical scaling, by increasing vCPUs and memory from 3 vCPUs/12GB to 12 vCPUs/48GB, significantly improved resource utilization. With two producer threads, the cluster’s byte_in_count metric doubled and CPU utilization increased to 56% from 24%. Your throughput requirements play a vital role. With 12 vCPUs/48GB, moving from 2 to 4 producer threads nearly doubled the cluster’s bytes_in_count. You also need to monitor resource utilization to avoid bottlenecks, as increasing throughput can increase CPU and memory utilization. Ultimately, optimizing managed Kafka service performance requires a careful balance between vertical scaling of the cluster and your throughput requirements, tailored to your specific workload and resource constraints.

Build the Kafka cluster you need

In conclusion, optimizing your Google Cloud Managed Service for Apache Kafka deployment involves a thorough understanding of producer and consumer behavior, careful benchmarking, and strategic scaling. By actively monitoring resource utilization and adjusting your configurations based on your specific workload demands, you can ensure your managed Kafka clusters deliver the high throughput and low latency required for your real-time data streaming applications.

Interested in diving deeper? Explore the resources and documentations linked below:

Source Credit: https://cloud.google.com/blog/products/data-analytics/managed-service-for-kafka-benchmarking-and-scaling-guidance/