

Zero-knowledge proofs (ZKPs) are a revolutionary cryptographic tool that can reshape how we approach privacy and scalability in Web3. ZKPs enable one party, the prover, to prove the truth of a statement to another party, the verifier, without disclosing any details beyond the statement’s validity. This capability has significant consequences for blockchain applications, particularly in these areas:
- Scalability: ZKPs can generate succinct proofs for complex computations, enabling verification without re-executing the entire process, which is crucial for scaling blockchain networks.
- Privacy: ZKPs can verify transactions without revealing the underlying data, thereby enhancing user privacy and confidentiality.
- Identity: ZKPs can be used to prove identity or ownership of assets without disclosing sensitive information.
The demand for efficient and scalable ZK infrastructure is rising rapidly with the growth of the Web3 ecosystem. Google Cloud Platform (GCP) is dedicated to providing developers with the tools and resources needed to build and deploy ZK-powered applications capable of reaching a global audience.
Google has actively participated in the Web3 space through various investments and partnerships. Recognizing the transformative potential of ZK technology, Google has established a dedicated Web3 team focused on advancing ZK development. Google’s ZK Summit and ongoing research demonstrate this commitment.
Google Cloud Platform (GCP) provides a wide range of Web3 products that cater to the needs of developers and businesses in the blockchain space. This includes fundamental blockchain infrastructure such as RPC nodes, which allow users to interact with blockchain networks. Additionally, GCP offers a Web3 portal that serves as a gateway to Web3 resources and tools. To further support Web3 development, GCP provides blockchain analytics tools that enable users to track and analyze blockchain data.
By leveraging GCP’s secure, reliable, and scalable infrastructure, leading Web3 companies can focus on building and deploying their applications with confidence. GCP’s infrastructure ensures that Web3 applications can handle high levels of traffic and data while maintaining optimal performance and security. This allows Web3 companies to deliver a seamless user experience and meet the demands of the rapidly growing Web3 ecosystem.
Optimizing Zero-Knowledge (ZK) infrastructure is crucial due to the rapidly increasing demand for efficient and scalable solutions in the Web3 ecosystem. As ZK proofs handle complex computations, the infrastructure must be fine-tuned to manage the computational intensity effectively. Optimization ensures faster proof generation, reduced resource utilization, and lower operational costs, all of which are vital for scaling ZK-powered applications and meeting the demands of a global audience. Efficient optimization also supports privacy, identity, and scalability, key areas where ZK proofs play a transformative role.
Building and scaling ZK infrastructure requires careful optimization across different layers:
- Hardware Layer: Leveraging specialized hardware accelerators, including tailored machine configurations, GPUs, and TPUs, can considerably accelerate the computationally demanding process of proof generation.
- Prover Layer: Efficient prover algorithms and libraries are crucial for optimizing proof generation time and resource utilization.
- Orchestration Layer: An intelligent orchestration layer is essential for managing and coordinating the proof generation process across multiple provers and hardware resources.
This blog post will delve into optimization strategies at the Orchestration Layer for Zero-Knowledge Rollup (ZK-Rollup) infrastructure hosted on the Google Cloud Platform (GCP). We will explore the various services and tools offered by GCP that can enhance the performance and scalability of ZK-Rollup solutions.
An intelligent orchestrator plays a vital role in optimizing ZK infrastructure. It can:
- Right-size Resources: Dynamically allocate resources based on the complexity and size of the ZK proofs being generated. This ensures that the infrastructure is used efficiently and costs are minimized.
- Select the Right Worker: Choose the most suitable prover and hardware for a given task, optimizing for performance and cost-efficiency. This selection process considers factors such as the proof system used, the desired performance level, and the cost of different hardware options.
- Handle Exceptions: Implement robust exception handling mechanisms to ensure reliable operation even in the face of errors or failures. This includes retrying failed tasks, using alternative proof systems or hardware, and alerting administrators to potential issues.
- Prioritize Tasks: An intelligent orchestrator can prioritize tasks based on factors such as the importance of the transaction, the gas fees offered, or the user’s service level agreement. This ensures that the most critical tasks are completed first and that users receive the best possible experience (a Kubernetes sketch of this follows this section).
- Monitor Performance: Continuously monitor the performance of the ZK infrastructure, collecting metrics on proof generation time, resource utilization, and error rates. This data can identify bottlenecks, optimize resource allocation, and improve the overall efficiency of the system.
- Learn and Adapt: An intelligent orchestrator can use machine learning algorithms to learn from past experiences and adapt its behavior to changing conditions. This could involve automatically adjusting resource allocation based on historical usage patterns or predicting future demand to ensure sufficient capacity.
By intelligently managing and optimizing the proof generation process, the orchestrator can significantly improve the performance and cost-effectiveness of ZK infrastructure.
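To make the task-prioritization point concrete in Kubernetes terms, here is a minimal sketch using PriorityClass objects. The tier names, priority values, and descriptions are illustrative assumptions, not part of any standard setup:

```yaml
# Two illustrative priority tiers for proving work. A higher value means the
# scheduler places these pods first and may preempt lower-priority pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prover-critical          # hypothetical tier for SLA-backed proofs
value: 1000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Proofs for high-value transactions or strict SLAs."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prover-batch             # hypothetical tier for best-effort batch proofs
value: 1000
preemptionPolicy: Never          # batch work waits rather than evicting others
globalDefault: false
description: "Best-effort batch proof generation."
```

A prover pod opts into a tier by setting priorityClassName: prover-critical (or prover-batch) in its spec.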
Here are a few optimization strategies you can explore for the best cost-to-performance ratio.
Blended Infrastructure: a workload mix of low-cost and high-performance machines
By strategically allocating tasks to different types of machines based on their computational requirements, blended infrastructure can optimize the cost-to-performance ratio. This approach ensures that resources are used efficiently, maximizing performance where needed and minimizing costs where possible, leading to a more cost-effective and scalable prover infrastructure (see the scheduling sketch after the list below).
- High-Performance Machines: Certain stages of the proving process, or certain types of L2 transactions, are computationally intensive and require substantial resources. High-performance machines can expedite these stages, ensuring that proofs are generated quickly and efficiently. This speed is essential for maintaining the throughput of the ZK-Rollup and minimizing latency.
- Low-Cost Machines: Other stages of the proving process, and some L2 transactions, may be less demanding. Utilizing low-cost machines for these stages can significantly reduce operational costs without compromising the overall performance of the system.
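As a minimal sketch of the blended approach on GKE, assume two pre-created node pools with the hypothetical names high-perf-pool and low-cost-pool; pods can then be steered to the right tier with the built-in cloud.google.com/gke-nodepool node label:

```yaml
# Heavy proving stages target the high-performance pool (hypothetical pool name).
apiVersion: v1
kind: Pod
metadata:
  name: prover-heavy
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: high-perf-pool  # e.g. compute-optimized machines
  containers:
    - name: prover
      image: example.com/zk-prover:latest          # placeholder prover image
      resources:
        requests: {cpu: "30", memory: 120Gi}
---
# Lighter stages run on the cheaper pool, e.g. one built from Spot VMs.
apiVersion: v1
kind: Pod
metadata:
  name: prover-light
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: low-cost-pool   # hypothetical pool name
  containers:
    - name: prover
      image: example.com/zk-prover:latest
      resources:
        requests: {cpu: "4", memory: 16Gi}
```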
Leveraging GKE with horizontal scaling and parallel computing
GKE provides a managed Kubernetes environment that simplifies deployment and scaling, and combining it with horizontal scaling and parallel computing techniques is key to optimizing your prover performance.
Benefits of GKE:
- Managed Control Plane: GKE handles the complexities of managing the Kubernetes control plane, allowing you to focus on your application.
- Scalability: GKE makes it easy to scale your cluster up or down based on demand.
- Integration with GCP Services: GKE integrates seamlessly with other Google Cloud Platform (GCP) services, such as Cloud Storage, Cloud SQL, and Cloud Monitoring.
- Cost Optimization: You can choose node pools with different machine types to match the resource requirements of your provers and optimize costs. Preemptible VMs can also be considered for cost savings (with the caveat that they can be terminated with short notice).
Horizontal Scaling for Provers
Horizontal scaling involves increasing the number of prover pods to handle more workload. This is especially important for zk-Rollups, as the proving process can be parallelized. Here’s how you can achieve horizontal scaling in GKE:
- Horizontal Pod Autoscaler (HPA): The HPA automatically scales the number of prover pods based on resource utilization metrics, such as CPU utilization or memory usage. You can configure the HPA to target specific resource utilization levels. This is a good option if your workload fluctuates.
- Manual Scaling: You can manually scale the number of prover pods using the kubectl scale command. This is useful if you have predictable workload patterns.
- Batch Processing with Jobs: If you have a large number of proving tasks, you can use Kubernetes Jobs to manage them. Each Job can create multiple pods to process a subset of the tasks. This is a good approach for batch processing scenarios. Consider using a Job controller to manage the parallel execution of these Jobs.
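As a concrete sketch of the Jobs approach just described (the image name and task counts are illustrative assumptions), an Indexed Job fans a fixed batch of proving tasks out across parallel pods:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: proof-batch
spec:
  completions: 100         # total proving tasks in this batch (illustrative)
  parallelism: 10          # pods running at any one time
  completionMode: Indexed  # each pod receives a unique JOB_COMPLETION_INDEX
  backoffLimit: 4          # retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: prover
          # Placeholder image; the entrypoint is assumed to read
          # JOB_COMPLETION_INDEX and prove the corresponding sub-task.
          image: example.com/zk-prover:latest
          resources:
            requests: {cpu: "8", memory: 32Gi}
```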
Parallel Computing Techniques:
- Task Distribution: Divide the overall proving task into smaller sub-tasks that can be processed independently. This allows you to distribute the workload across multiple prover pods.
- Data Partitioning: If your proving process involves large datasets, partition the data so that each prover pod works on a subset of the data. This can significantly reduce processing time.
- Work Queues: Use a work queue (e.g., Redis, RabbitMQ) to distribute proving tasks to the available prover pods. This allows you to dynamically balance the workload and ensure that all pods are utilized efficiently.
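The work-queue pattern maps onto a Kubernetes Job as well: leave completions unset, run several parallel workers, and have each worker pull tasks from the queue until it is empty. The worker image and Redis service address below are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: prover-queue-workers
spec:
  parallelism: 10       # workers draining the same queue; with `completions`
                        # unset, the Job completes once a worker exits
                        # successfully after the queue is empty
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: example.com/zk-prover-worker:latest  # hypothetical worker that
          env:                                        # pops tasks until empty
            - name: QUEUE_ADDR   # assumed in-cluster Redis service
              value: "redis://redis.provers.svc.cluster.local:6379"
            - name: QUEUE_NAME
              value: "proof-tasks"
```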
Combining GKE, Horizontal Scaling, and Parallel Computing:
- Containerize Your Prover: Package your prover application and its dependencies into a Docker container.
- Create a Kubernetes Deployment: Define a Kubernetes Deployment to manage the desired number of prover pods. Specify the container image, resource requests and limits, and any other necessary configurations.
- Configure Horizontal Scaling: Use the HPA to automatically scale the number of prover pods based on resource utilization (see the combined manifest after this list). Alternatively, use manual scaling or Kubernetes Jobs for batch processing.
- Implement Parallel Computing: Distribute the proving tasks across the prover pods using task distribution, data partitioning, or a work queue.
- Monitor and Optimize: Monitor the resource utilization of your prover pods and adjust the scaling parameters and parallel computing strategy as needed. Use GKE’s monitoring tools or integrate with Prometheus and Grafana.
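Putting steps 2 and 3 of this list together, a minimal Deployment plus HPA pair might look like the following; the names, image, replica bounds, and utilization target are all assumptions to adapt to your own provers:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prover
spec:
  replicas: 3
  selector:
    matchLabels: {app: prover}
  template:
    metadata:
      labels: {app: prover}
    spec:
      containers:
        - name: prover
          image: example.com/zk-prover:latest    # placeholder prover image
          resources:
            requests: {cpu: "8", memory: 32Gi}   # the HPA scales on these requests
            limits: {cpu: "16", memory: 64Gi}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prover-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prover
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add pods once average CPU passes 70% of requests
```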
Key Considerations for GKE:
- Node Pool Configuration: Choose the appropriate machine types for your node pools based on the resource requirements of your provers. Consider using GPUs or FPGAs if necessary.
- Preemptible VMs: For cost savings, consider using preemptible VMs for your prover pods. However, be aware that these VMs can be terminated with short notice, so you need to design your application to be resilient to interruptions.
- Resource Quotas and Limits: Set resource quotas and limits to prevent any single pod from consuming excessive resources and impacting other pods (an example follows this list).
- Network Configuration: Ensure that your network configuration is optimized for the communication patterns of your provers. Consider using a flat network topology if low latency is critical.
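For the quotas-and-limits consideration, a namespace-scoped ResourceQuota caps the aggregate consumption of all prover pods, while a LimitRange supplies per-container defaults; the namespace and numbers are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prover-quota
  namespace: provers     # assumed namespace for prover workloads
spec:
  hard:
    requests.cpu: "256"  # total CPU all prover pods may request
    requests.memory: 1Ti
    limits.cpu: "512"
    limits.memory: 2Ti
---
apiVersion: v1
kind: LimitRange
metadata:
  name: prover-defaults
  namespace: provers
spec:
  limits:
    - type: Container
      default: {cpu: "16", memory: 64Gi}         # applied when limits are omitted
      defaultRequest: {cpu: "8", memory: 32Gi}   # applied when requests are omitted
```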
zk-Rollup provers are resource-intensive. They often require significant CPU, memory, and sometimes specialized hardware like GPUs or FPGAs. Efficiently packing these workloads onto your Kubernetes cluster is crucial for minimizing costs and maximizing throughput. “Bin packing” in this context means strategically scheduling your provers onto nodes to minimize resource fragmentation and maximize utilization.
Key Considerations for zk-Rollup Prover Bin Packing:
Resource Requirements: This is the most crucial factor. You need a very precise understanding of the resource demands of your provers. Consider:
- CPU: How many cores? What architecture (e.g., x86, ARM)? Are there specific CPU features required (e.g., AVX-512)?
- Memory: How much RAM? Are there any specific memory requirements (e.g., huge pages)? Are there memory leaks?
- GPU/FPGA (if applicable): Which models? How much memory on the devices? Are there driver dependencies?
- Disk I/O: Is the prover I/O intensive? If so, what type of storage is required (e.g., fast NVMe drives)?
- Network: How much network bandwidth does each prover require? Is low latency critical?
Prover Characteristics:
- Runtime: How long do the provers run? Are they short-lived or long-running?
- Interdependencies: Do provers need to communicate with each other?
- Priority: Are some provers more important than others?
Kubernetes Features:
- Node Labels and Taints/Tolerations: Use these to target specific hardware or node types. For example, label nodes with the type of GPU they have and then use tolerations in your prover pods to ensure they are scheduled on the correct nodes (a combined pod spec follows this section).
- Resource Requests and Limits: Accurately define resource requests and limits for your prover pods. This is essential for the Kubernetes scheduler to make informed decisions. Over-requesting can lead to underutilization, while under-requesting can lead to resource contention and instability.
- Pod Affinity and Anti-Affinity: Use these to control where pods are scheduled relative to each other. You might want to avoid scheduling too many provers on the same node (anti-affinity) to prevent a single node failure from impacting too many provers. Or, if certain provers need to communicate frequently, you might use affinity to keep them close together.
- Topology Spread Constraints: This is useful for distributing pods evenly across failure domains (like availability zones or racks). This improves resilience.
- Horizontal Pod Autoscaler (HPA): If your prover workloads are dynamic, the HPA can automatically scale the number of prover pods based on resource utilization.
- Descheduler: The descheduler can evict pods from nodes to improve resource utilization. This can be helpful if you have long-running provers and node resources become fragmented.
Monitoring and Analysis: You need good monitoring to track resource utilization and identify bottlenecks. Prometheus and Grafana are excellent tools for this.
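Several of these features combine naturally in a single prover pod spec. Here is a sketch assuming an A100 GPU node pool that has been tainted with a hypothetical workload=prover taint; the image and resource figures are also assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-prover
  labels: {app: prover}
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100  # GKE's GPU node label
  tolerations:
    - key: workload          # hypothetical taint applied to the GPU pool
      operator: Equal
      value: prover
      effect: NoSchedule
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone   # spread provers across zones
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels: {app: prover}
  affinity:
    podAntiAffinity:         # prefer not to pack every prover onto one node
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels: {app: prover}
  containers:
    - name: prover
      image: example.com/zk-prover:latest        # placeholder prover image
      resources:
        requests: {cpu: "12", memory: 85Gi, nvidia.com/gpu: 1}
        limits: {cpu: "12", memory: 85Gi, nvidia.com/gpu: 1}
```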
Let’s look at a few key differences that make Google Cloud stand out as a better choice for ZK infrastructure.
- Eliminate VM Squeezing for ZKP Services: Custom VMs allow for precise allocation of resources, ensuring that ZKP (Zero-Knowledge Proof) services have the exact amount of CPU, memory, and storage they require. This eliminates the need to squeeze ZKP processes into undersized VMs, which can lead to performance bottlenecks and stability issues.
- Prevent Overprovisioning: With custom VMs, you can avoid allocating excessive resources to ZKP services. This prevents overprovisioning, which can result in wasted cloud spend and inefficient resource utilization.
- Rightsizing for Specific Requirements: Custom VMs can be tailored to the specific requirements of your ZKP workloads. This ensures that each ZKP service has the optimal amount of resources, maximizing performance and cost-efficiency.
Google Kubernetes Engine (GKE) provides a managed environment for deploying, managing, and scaling containerized applications using Google infrastructure. GKE offers a 3x increase in scalability compared to its competitors and is rigorously tested on Open-Source Software (OSS) Kubernetes. This ensures seamless integration and optimal performance for all zkEVM (zero-knowledge Ethereum Virtual Machine) services deployed on the GKE platform.
To further simplify cluster management and optimize costs, consider leveraging GKE Autopilot. Autopilot is a mode of operation in GKE where Google manages the underlying infrastructure, including nodes, node pools, and their configuration. You simply deploy and manage your workloads.
Benefits of GKE Autopilot relevant to ZK Infrastructure:
- Simplified Operations: Autopilot reduces the operational burden of managing the Kubernetes infrastructure, allowing your team to focus more on developing and optimizing your ZK applications. Google handles node provisioning, scaling, upgrades, and security patching automatically.
- Optimized for Cost and Performance: Autopilot automatically right-sizes nodes for your workloads based on their resource requests. This helps to ensure efficient resource utilization and can lead to cost savings by avoiding over-provisioning. It intelligently scales resources up or down as your ZK infrastructure demands change.
- Enhanced Security: Autopilot enforces security best practices and automatically applies security updates to the underlying infrastructure, contributing to a more secure environment for your sensitive ZK workloads.
- Node Auto-Repair and Auto-Upgrade: Autopilot automatically detects and repairs unhealthy nodes and seamlessly upgrades nodes to the latest Kubernetes version, minimizing downtime and maintenance windows for your ZK infrastructure.
- Workload Isolation: Autopilot provides strong workload isolation by default, ensuring that different ZK components or even different tenants running on the same cluster are well-protected from each other.
Considerations for using Autopilot with ZK Infrastructure:
- Customization Limitations: While Autopilot offers significant benefits, it provides less fine-grained control over the underlying infrastructure compared to standard mode. If your ZK provers have very specific hardware requirements or need highly customized node configurations, standard mode might be more suitable. However, Autopilot does support selecting different compute classes optimized for different workload types (e.g., compute-optimized, memory-optimized); a selector sketch follows this section.
- Cost Model: Autopilot has a different pricing model than standard GKE. You are charged per pod CPU, memory, and persistent storage consumed, rather than per node. For ZK workloads with predictable and consistently high resource utilization, standard mode might be more cost-effective. It’s recommended to evaluate the cost implications based on your specific workload patterns.
By choosing between standard GKE with careful node pool configuration and optimization techniques or GKE Autopilot for a more hands-off and automated approach, you can tailor your Kubernetes environment on Google Cloud to best suit the specific needs and operational preferences of your scalable ZK infrastructure.
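In Autopilot, the compute-class selection mentioned above is expressed directly in the pod spec rather than through node pools. A minimal sketch using the Scale-Out compute class (available classes vary by region and GKE version):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prover-autopilot
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Scale-Out"  # Autopilot compute class selector
  containers:
    - name: prover
      image: example.com/zk-prover:latest        # placeholder prover image
      resources:
        requests: {cpu: "8", memory: 32Gi}       # Autopilot bills per pod resources
```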
AlloyDB provides significant performance enhancements compared to Amazon Aurora. AlloyDB delivers twice the throughput of Aurora, leading to faster and more efficient processing of transactions and data. Additionally, AlloyDB supports up to 20 read pool nodes, five more than Aurora’s 15 read replicas, allowing for greater scalability and improved performance under read-intensive workloads. To optimize database performance for zkEVM, AlloyDB is used to store the Hash, Pool, and State databases, ensuring efficient data management and retrieval for these critical components of the zkEVM infrastructure.
ZK technology holds tremendous potential for the future of Web3. At Google Cloud we are committed to providing developers with the necessary infrastructure and tools to build and scale ZK-powered applications. By focusing on optimization across different layers and leveraging intelligent orchestration, we can unlock the full potential of ZKPs and drive the next wave of Web3 innovation.
Source Credit: https://medium.com/google-cloud/building-scalable-zk-infrastructure-on-google-cloud-8b3dd478610b