We’ve all been there. You’ve meticulously prepared your dataset and written your training script. You hit run, and your excitement builds, only to be crushed by the infamous error: CUDA out of memory.
This is one of the most common roadblocks in AI development. Your GPU’s High Bandwidth Memory (HBM) is the high-speed memory that holds everything needed for computation, and running out of it is a hard stop. But how do you know how much you need?
To build a clear foundation, we’ll start by breaking down what consumes HBM on a single GPU, then present key strategies to reduce that consumption. Later, we’ll explore advanced multi-GPU strategies like data and model parallelism that can help relieve memory pressure and scale your training in the cloud.
Understanding HBM: What’s using all the memory?
When you fine-tune a model, your HBM is primarily consumed by three things:
- Model Weights: This is the most straightforward. It’s the storage space required for the model’s parameters—the “brain” that it uses to make predictions. A 7-billion parameter model loaded in 16-bit precision will take up roughly 14 GB before you even process a single piece of data.
- Optimizer States and Gradients: This is the overhead that’s required for learning. To update the model’s weights, the training process needs to calculate gradients (the direction of learning), and the optimizer (like the popular AdamW) needs to store its own data to guide the training. In full fine-tuning, this can be the largest consumer of HBM.
- Activations and Batch Data: This is the most dynamic part. When your data (images, text, etc.) flows through the model’s layers, the intermediate calculations, or activations, are stored in HBM. The memory needed here is directly proportional to your batch size. A larger batch size means more activations are stored simultaneously, which leads to faster training but much higher memory usage.
Note: These calculations are theoretical minimums. Real-world frameworks can add as much as 30% overhead on top of them due to temporary buffers, kernel launches, and memory fragmentation.
Although it’s impossible to get a perfect number without experimentation, you can estimate your HBM needs with this general formula:
Total HBM ≈ (Model Size) + (Optimizer States) + (Gradients) + (Activations)
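If you want to play with these numbers, here’s a minimal Python sketch of that formula. It mirrors the assumptions used in the worked example below (bfloat16 weights and gradients, an AdamW-style optimizer with two states stored at the same precision as the weights); the activation and overhead terms are rough inputs you supply, not values it can compute for you.

```python
def estimate_hbm_gb(num_params: float,
                    bytes_per_param: int = 2,     # bfloat16 weights and gradients
                    optimizer_states: int = 2,    # AdamW keeps two states per parameter
                    activations_gb: float = 0.0,  # rough, batch-size-dependent guess
                    overhead: float = 0.0):       # real frameworks can add up to ~30%
    """Back-of-the-envelope HBM estimate in GB, assuming optimizer states
    are stored at the same precision as the weights (as in the example below)."""
    model_gb = num_params * bytes_per_param / 1e9
    gradients_gb = num_params * bytes_per_param / 1e9
    optimizer_gb = optimizer_states * num_params * bytes_per_param / 1e9
    return (model_gb + gradients_gb + optimizer_gb + activations_gb) * (1 + overhead)

# Full fine-tuning of a 7B-parameter model in bfloat16, before activations:
print(f"{estimate_hbm_gb(7e9):.0f} GB")  # 56 GB
```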
Further reading: See this excellent JAX e-book that covers these topics in great detail and even has some “try it out yourself” test questions.
Example: Why full fine-tuning is so demanding
To see why running out of memory is such a common problem, let’s walk through a real-world example that I recently worked on: fine-tuning the medgemma-4b-it model, which has 4 billion parameters. Our script loads it in bfloat16 precision (2 bytes per parameter).
First, let’s calculate the static HBM footprint. This is the memory that’s required just to load the model and prepare it for training, before you’ve even processed a single piece of data.
- Model Size: The memory that’s needed to simply hold the model on the GPU.
  4 billion parameters × 2 bytes/parameter = 8 GB
- Gradients and Optimizer States: The overhead for training every parameter with the AdamW optimizer.
  Gradients: 4 billion parameters × 2 bytes/parameter = 8 GB
  Optimizer States (AdamW): 2 × 4 billion parameters × 2 bytes/parameter = 16 GB
Note: While AdamW is a popular optimizer, other optimizers, such as Adafactor and Lion, have different memory footprints.
Adding these together gives us our baseline HBM cost for a full fine-tuning attempt:
8 GB (Model) + 8 GB (Gradients) + 16 GB (Optimizer) = 32 GB
This 32 GB is the baseline just to start the training process. On top of this, the GPU needs additional memory for activations, which is a dynamic cost that grows with your batch size and input data size. This is why full fine-tuning of large models is so demanding and often reserved for the most powerful hardware.
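Estimates like these are a starting point, but it’s worth checking what your framework actually allocates. Here’s a minimal PyTorch sketch, assuming a CUDA device and a Hugging Face checkpoint; the repo id below is an assumption, so substitute the model you’re actually using.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical repo id; substitute the checkpoint you are actually fine-tuning.
model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Memory held by tensors vs. memory reserved by PyTorch's caching allocator.
# Right after loading, this is roughly the model weights; gradients and
# optimizer states only appear after the first backward pass and optimizer step.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.1f} GB")
```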
Key strategies to reduce HBM consumption
The HBM requirement for a full fine-tune can seem impossibly high. But several powerful techniques can reduce memory consumption, making it feasible to train large models on consumer-grade or entry-level professional GPUs.
Parameter-Efficient Fine-Tuning (PEFT) with LoRA
Instead of training all the billions of parameters in a model, Parameter-Efficient Fine-Tuning (PEFT) methods focus on training only a small subset of parameters. The most popular of these is LoRA (Low-Rank Adaptation).
LoRA works by freezing the original model’s weights and injecting a tiny number of new, trainable adapter layers into the model architecture. This means the memory-hungry gradients and optimizer states are only needed for these few million new parameters, not the full 4 billion.
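In practice, this takes only a few lines with the Hugging Face peft library. The sketch below assumes a causal language model loaded in bfloat16; the repo id, rank, and target module names are illustrative choices, not settings from the original fine-tuning script.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# The frozen base model is still loaded in bfloat16 (~2 bytes/parameter of HBM).
base_model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",  # hypothetical repo id for the model discussed above
    torch_dtype=torch.bfloat16,
)

# Inject small trainable adapter matrices; only these receive gradients
# and optimizer states during training.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # which projections to adapt (illustrative)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

# Reports how few of the total parameters are actually trainable.
model.print_trainable_parameters()
```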
The math behind LoRA’s memory savings
LoRA doesn’t remove the base model from your GPU. The full 8 GB of the original model’s weights are still loaded and taking up HBM. They’re just frozen, which means that the GPU isn’t training them. All of the memory savings come from the fact that you no longer need to store the huge gradients and optimizer states for that massive, frozen part of the model.
Let’s recalculate the static HBM footprint with LoRA, assuming it adds 20 million trainable parameters:
- Model Size (unchanged): The base model is still loaded.
  4 billion parameters × 2 bytes/parameter = 8 GB
- LoRA Gradients & Optimizer States: We now only need overhead for the tiny set of new parameters.
  Gradients: 20 million parameters × 2 bytes/parameter = 40 MB
  Optimizer States: 2 × 20 million parameters × 2 bytes/parameter = 80 MB
The new static HBM footprint is now:
8 GB (Model) + 40 MB (Gradients) + 80 MB (Optimizer) ≈ 8.12 GB
The training overhead has shrunk from 24 GB to just 120 MB, and your baseline memory requirement is now just over 8 GB. This leaves much more room for the dynamic memory that’s needed for activations, which lets you use a reasonable batch size on a common 16 GB or 24 GB GPU without running out of memory.
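To make the comparison concrete, here’s the same arithmetic as a quick Python check; the 20 million trainable parameters are the same assumption as above, not a property of any specific adapter configuration.

```python
BYTES_BF16 = 2

def training_overhead_gb(trainable_params: float, optimizer_states: int = 2) -> float:
    """Gradients plus optimizer states, counted only for the trainable parameters."""
    gradients = trainable_params * BYTES_BF16
    optimizer = optimizer_states * trainable_params * BYTES_BF16
    return (gradients + optimizer) / 1e9

print(f"Full fine-tune: {training_overhead_gb(4e9):.0f} GB")          # 24 GB
print(f"LoRA adapters:  {training_overhead_gb(20e6) * 1000:.0f} MB")  # 120 MB
```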
Model quantization
Besides training fewer parameters, we can also shrink the ones that we have by using quantization, which involves reducing the numerical precision of the model’s weights. The standard precision for modern training is bfloat16 because it offers the dynamic range of float32 with half the memory footprint. But we can reduce HBM usage further by converting weights to lower-precision integer formats like int8 or int4.
Using lower-precision integer formats has a significant impact on HBM when compared to the standard bfloat16 baseline:
- bfloat16 (standard): The baseline size (e.g., a 7B model requires ~14 GB).
- 8-bit precision: Halves the model size (e.g., 14 GB becomes ~7 GB).
- 4-bit precision: Reduces the model size by a factor of 4 (e.g., 14 GB becomes ~3.5 GB).
The reduction in size lets you fit much larger models into memory, often with only minimal degradation in model quality.
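With the Hugging Face transformers and bitsandbytes libraries, loading a model in 4-bit precision is a configuration change rather than a rewrite. The sketch below is illustrative: the repo id and the NF4/double-quantization settings are assumptions, not settings from the original script.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the frozen base weights to 4-bit NF4; compute still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",  # hypothetical repo id; substitute your own model
    quantization_config=bnb_config,
    device_map="auto",
)
```

Because the quantized base weights stay frozen, this combines naturally with the LoRA approach described earlier.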
Source Credit: https://cloud.google.com/blog/topics/developers-practitioners/decoding-high-bandwidth-memory-a-practical-guide-to-gpu-memory-for-fine-tuning-ai-models/
