Artificial intelligence (AI) is revolutionizing healthcare, but how do you take a powerful, general-purpose AI model and teach it the specialized skills of a pathologist? This journey from prototype to production often begins in a notebook, which is exactly where we’ll start.
In this guide, we’ll take the crucial first step. We’ll walk through the complete process of fine-tuning MedGemma, a Gemma 3 variant from Google’s family of open models for the medical community, to classify breast cancer histopathology images. We’re using the full-precision MedGemma model because that’s what you’ll need to get maximum performance on many clinical tasks. If you’re concerned about compute costs, you can quantize and fine-tune by using MedGemma’s pre-configured fine-tuning notebook instead.
To complete our first step, we’ll use the Finetune Notebook. The notebook provides you with all of the code and a step-by-step explanation of the process, so it’s the perfect environment for experimentation. I’ll also share the key insights that I learned along the way, including a critical choice in data types that made all the difference.
After we’ve perfected our model in this prototyping phase, we’ll be ready for the next step. In an upcoming post, we’ll show you how to take this exact workflow and move it to a scalable, production-ready environment using Cloud Run jobs.
Setting the stage: Our goal, model, and data
Before we get to the code, let’s set the stage. Our goal is to classify microscope images of breast tissue into one of eight categories: four benign (non-cancerous) and four malignant (cancerous). This type of classification represents one of many crucial tasks that pathologists perform in order to make an accurate diagnosis, and we have a great set of tools for the job.
We’ll be using MedGemma, a powerful family of open models from Google that’s built on the same research and technology that powers our Gemini models. What makes MedGemma special is that it isn’t just a general model: it’s been specifically tuned for the medical domain.
The MedGemma vision component, MedSigLIP, was pre-trained on a vast amount of de-identified medical imagery, including the exact type of histopathology slides that we’re using. If you don’t need MedGemma’s full capabilities, MedSigLIP alone is a more cost-effective option for predictive tasks like image classification, and there are multiple MedSigLIP tutorial notebooks that you can use for fine-tuning.
The MedGemma language component was also trained on a diverse set of medical texts, making the google/medgemma-4b-it version that we’re using perfect for following our text-based prompts. Google provides MedGemma as a strong foundation, but it requires fine-tuning for specific use cases—which is exactly what we’re about to do.
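To make that concrete, here’s a hedged sketch of how a classification request to google/medgemma-4b-it can be phrased as a standard multimodal chat message. The image path and prompt wording are illustrative placeholders, not the notebook’s exact prompt:

```python
from PIL import Image

# Hypothetical path to a BreakHis image patch.
image = Image.open("sample_slide.png")

# A chat-style message the instruction-tuned model can follow; the wording is illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {
                "type": "text",
                "text": "Classify this breast histopathology image into one of "
                        "the eight BreakHis categories.",
            },
        ],
    }
]
```

The processor’s apply_chat_template method can then turn a structure like this into model-ready tensors for both fine-tuning and inference.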
To train our model, we’ll use the Breast Cancer Histopathological Image Classification (BreakHis) dataset, a public collection of thousands of microscope images of breast tumor tissue collected from 82 patients at four magnification factors (40X, 100X, 200X, and 400X). The dataset is publicly available for non-commercial research and is detailed in the paper by F. A. Spanhol, L. S. Oliveira, C. Petitjean, and L. Heutte, A dataset for breast cancer histopathological image classification.1
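To make the eight classes concrete, here’s a small sketch of how the labels can be derived from BreakHis filenames. The filename convention and subtype abbreviations reflect my reading of the dataset’s documentation rather than code from the notebook:

```python
# BreakHis filenames look like SOB_B_A-14-22549AB-40-001.png:
# SOB_<B|M>_<subtype>-<year>-<slide id>-<magnification>-<sequence>.png
SUBTYPES = {
    "A": "adenosis",              # benign
    "F": "fibroadenoma",          # benign
    "PT": "phyllodes_tumor",      # benign
    "TA": "tubular_adenoma",      # benign
    "DC": "ductal_carcinoma",     # malignant
    "LC": "lobular_carcinoma",    # malignant
    "MC": "mucinous_carcinoma",   # malignant
    "PC": "papillary_carcinoma",  # malignant
}

def label_from_filename(name: str) -> str:
    """Extract the eight-way class label from a BreakHis filename."""
    prefix = name.split("-")[0]      # e.g. "SOB_B_A"
    subtype = prefix.split("_")[-1]  # e.g. "A"
    return SUBTYPES[subtype]

print(label_from_filename("SOB_B_A-14-22549AB-40-001.png"))  # adenosis
```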
Handling a 4-billion-parameter model requires a capable GPU, so I used an NVIDIA A100 with 40 GB of VRAM on Vertex AI Workbench. This GPU has the necessary power, and it also features NVIDIA Tensor Cores that excel with modern data formats, which we’ll leverage for faster training. In an upcoming post, we’ll explain how to calculate the VRAM that’s required for your fine-tuning.
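As a rough back-of-envelope preview (the detailed math comes in that post), the weights alone for a 4-billion-parameter model in bfloat16 already occupy about 8 GB:

```python
params = 4e9          # ~4 billion parameters
bytes_per_param = 2   # bfloat16 uses 16 bits (2 bytes) per parameter

weights_gb = params * bytes_per_param / 1e9
print(f"Model weights alone: ~{weights_gb:.0f} GB")  # ~8 GB
```

Gradients, optimizer states, and activations add considerably more on top of that during training, which is why the 40 GB card gives us comfortable headroom.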
My float16 disaster: A crucial lesson in stability
My first attempt to load the model used the common float16 data type to save memory. It failed spectacularly. The model’s outputs were complete garbage, and a quick debugging check revealed that every internal value had collapsed into NaN (Not a Number).
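If you want to run that kind of check yourself, something like the following sketch works. It assumes you already have the loaded model and a prepared batch of inputs; it isn’t the exact snippet from my debugging session:

```python
import torch

# `model` and `inputs` are assumed to already exist from the failed float16 run.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# List every layer whose hidden states contain at least one NaN.
nan_layers = [i for i, h in enumerate(outputs.hidden_states) if torch.isnan(h).any()]
print(f"Layers containing NaN values: {nan_layers}")
```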
The culprit was a classic numerical overflow.
To understand why, you need to know the critical difference between these 16-bit formats:
- float16 (FP16): Has a tiny numerical range. It can’t represent any number greater than 65,504. During the millions of calculations in a transformer, intermediate values can easily exceed this limit, causing an overflow that creates a NaN. When a NaN appears, it contaminates every subsequent calculation.
- bfloat16 (BF16): This format, developed at Google Brain, makes a crucial trade-off. It sacrifices a little bit of precision to maintain the same massive numerical range as the full 32-bit float32 format (see the quick demonstration below).
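You can see the difference with a tiny, illustrative experiment in PyTorch:

```python
import torch

x = torch.tensor([60000.0], dtype=torch.float16)
print(x * 2)          # tensor([inf]) -- 120,000 exceeds float16's 65,504 limit
print(x * 2 - x * 2)  # tensor([nan]) -- inf - inf produces NaN, which poisons everything downstream

y = torch.tensor([60000.0], dtype=torch.bfloat16)
print(y * 2)          # a large but finite value -- bfloat16 shares float32's exponent range
```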
bfloat16’s massive range prevents overflows, which keeps the training process stable. The fix was a simple one-line change, but it was based on this critical concept.
The successful code:
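What follows is a sketch of that fix using the standard Hugging Face loading pattern from the MedGemma model card, rather than the notebook cell reproduced verbatim; the key line is torch_dtype=torch.bfloat16:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/medgemma-4b-it"

# bfloat16 keeps float32's numerical range, so intermediate values no longer overflow into NaN.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the one-line change: bfloat16 instead of float16
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```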
