
High-Performance Computing (HPC), AI, and Machine Learning (ML) are no longer reserved for organizations with massive on-premises data centers. The Cluster Toolkit (formerly the Cloud HPC Toolkit) is an open-source tool from Google Cloud that simplifies the deployment of these complex environments. Using YAML-based “blueprints,” you can provision turnkey, repeatable environments that follow Google Cloud’s best practices.
In this article, we’ll walk through setting up the Cluster Toolkit and deploying your first cluster.
Core Components of the Toolkit
Before we jump into the setup, it’s important to understand the three primary components that make the Cluster Toolkit function:
- Cluster Blueprint: A YAML file where you define which modules to use and how to customize them for your specific workload.
- Modules: These are the fundamental building blocks of a deployment, consisting of Terraform or Packer configuration files.
- gcluster engine: The command-line tool that processes your blueprint, combines the necessary modules, and generates a deployment folder ready for execution.
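To make these pieces concrete, here is a minimal blueprint sketch. The `modules/network/vpc` source path mirrors the examples shipped with the toolkit, but verify module paths against the version of the repository you check out; the project and deployment names are placeholders.

```yaml
# Minimal blueprint sketch: deployment-wide vars plus one module.
# The gcluster engine expands this into a Terraform deployment folder.
blueprint_name: my-first-cluster

vars:
  project_id: <PROJECT_ID>       # placeholder
  deployment_name: my-first-cluster
  region: us-central1
  zone: us-central1-a

deployment_groups:
  - group: primary
    modules:
      - id: network
        source: modules/network/vpc   # path as in the toolkit's examples
```

Each entry under `modules` is one of the Terraform/Packer building blocks described above; the engine wires their inputs and outputs together for you.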
Phase 1: Prerequisites and Environment Preparation
A successful deployment starts with a properly configured Google Cloud environment.
1. Project and Billing
Select or create a dedicated Google Cloud project and ensure that an active billing account is attached.
2. Enable Required APIs
You must enable several essential APIs within your project to allow the toolkit to manage resources:
- Compute Engine API
- Filestore API
- Cloud Storage API
- Service Usage API
- Resource Manager API
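The list above can be enabled in a single gcloud command; the service names shown are the standard identifiers for these APIs, and `<PROJECT_ID>` is a placeholder for your project. This assumes an authenticated gcloud CLI.

```shell
# Enable the APIs the toolkit needs to manage resources.
gcloud services enable \
  compute.googleapis.com \
  file.googleapis.com \
  storage.googleapis.com \
  serviceusage.googleapis.com \
  cloudresourcemanager.googleapis.com \
  --project=<PROJECT_ID>
```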
3. Manage Quotas
HPC workloads are resource-intensive. Verify that your project has sufficient quota in your target region for the resources you intend to deploy, such as specific machine families (C2, N2, or H3), GPUs (like A100 or L4), and storage services like Filestore or Persistent Disk.
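You can inspect current quota limits and usage for a region before deploying. A sketch, assuming an authenticated gcloud CLI and `us-central1` as the target region:

```shell
# List quota metrics, limits, and current usage for the target region.
gcloud compute regions describe us-central1 \
  --project=<PROJECT_ID> \
  --flatten=quotas \
  --format="table(quotas.metric,quotas.limit,quotas.usage)"
```

If a limit is too low for your planned cluster, request an increase through the Quotas page in the Cloud Console before deploying.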
Phase 2: Setting Up the Execution Platform
The gcluster binary is flexible and can be run from multiple environments:
- Google Cloud Shell: This is the recommended starting point as it comes with the Google Cloud CLI pre-installed and inherits your user credentials automatically.
- Local Linux or macOS: You can download pre-compiled binaries for Linux (amd64/arm64) or macOS. Note that if you are developing on a Mac, you may need to install GNU tooling like coreutils via Homebrew.
Installation
To get started, either download a pre-compiled binary for your platform or build from source. Building from source requires Go, Terraform, and Packer to be installed:
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make
./gcluster --version  # verify the build succeeded
Phase 3: Deploying Your Cluster
Deploying a cluster is a two-step process: creating the deployment folder and then applying it.
Step 1: Create the Deployment Folder
Choose a blueprint that fits your needs. The toolkit includes numerous examples, such as hpc-slurm.yaml for a standard Slurm cluster or ml-slurm.yaml for machine learning workloads.
Use the create command to expand the YAML blueprint into Terraform files:
./gcluster create examples/hpc-slurm.yaml --vars "project_id=<PROJECT_ID>,region=us-central1,zone=us-central1-a"
Note: Ensure project_id, zone, and region are correctly set under the vars section of your blueprint or passed via the CLI.
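If you prefer not to hard-code the project ID, you can pull it from your active gcloud configuration instead. A sketch of the same create command, assuming gcloud is authenticated and a default project is set:

```shell
# Fill project_id from the active gcloud config at invocation time.
./gcluster create examples/hpc-slurm.yaml \
  --vars "project_id=$(gcloud config get-value project),region=us-central1,zone=us-central1-a"
```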
Step 2: Provision Resources
Navigate to the newly created deployment folder and use the deploy command:
./gcluster deploy <deployment_name>
You will be prompted to review and approve the Terraform execution plan before any physical resources are created on GCP.
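For scripted or CI-driven runs, the interactive approval prompt can be skipped. The `--auto-approve` flag mirrors Terraform’s behavior; confirm it is available in your toolkit version before relying on it:

```shell
# Deploy without an interactive approval prompt; use with care,
# since resources are created (and billed) without a review step.
./gcluster deploy <deployment_name> --auto-approve
```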
Phase 4: Advanced Configuration (Optional)
For production environments, you may want to manage your Terraform state remotely rather than locally.
- Remote State: You can configure a Google Cloud Storage (GCS) bucket to store your state by adding a terraform_backend_defaults block to the top-level of your blueprint.
- Staging Files: The ghpc_stage function can be used within blueprints to copy local files or directories into the deployment directory, ensuring relative paths remain valid regardless of where the deployment is executed.
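The remote-state setting described above is a small top-level block in the blueprint. A sketch, where the bucket name is a placeholder and the bucket must already exist:

```yaml
# Store Terraform state in a GCS bucket instead of on the local disk.
terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: my-tf-state-bucket   # placeholder; create this bucket first
```

With this block in place, every deployment group generated from the blueprint writes its state to the bucket, so multiple operators can safely manage the same cluster.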
Management and Best Practices
- Modifying Clusters: The toolkit currently only supports creation and deletion. To change an active cluster’s hardware or software configuration, the recommended workflow is to delete the cluster, update the blueprint, and re-deploy.
- Cost Management: Always destroy your resources when they are no longer in use to avoid unnecessary costs:
./gcluster destroy <deployment_name>
- Slurm Versions: Note that Slurm-GCP v6 is the only version supported within the Toolkit; v5 has reached its end of life.
By following these steps, you can harness the full power of Google Cloud’s infrastructure for your most demanding computational tasks, all while maintaining a repeatable and automated workflow.
Harnessing the Power of HPC: A Comprehensive Guide to Setting Up Cluster Toolkit on GCP was originally published in Google Cloud – Community on Medium.
Source Credit: https://medium.com/google-cloud/harnessing-the-power-of-hpc-a-comprehensive-guide-to-setting-up-cluster-toolkit-on-gcp-1add95f2cbdc
