
The intersection of artificial intelligence and medicine has promised transformative changes for years. Now, with the advent of powerful, domain-specific foundation models, we are moving from promise to practice. At the forefront of this movement is MedGemma, a new family of open models from Google, meticulously crafted for the healthcare and medical research communities.
MedGemma isn’t just another large language model. It is a new suite of open, medically-tuned, vision-language foundation models. Built upon the powerful and efficient architecture of Gemma 3 — which itself is derived from the research and technology behind the Gemini models — MedGemma is engineered to interpret and reason about complex medical images and text with a high degree of proficiency.
MedGemma’s advanced capabilities are rooted in a sophisticated architecture that pairs a state-of-the-art language model with a highly specialized vision component. This design is not a simple adaptation of a general-purpose model but a ground-up effort to build a system with medical fluency at its core. At its foundation, MedGemma is a variant of the Gemma 3 family of models, inheriting its modern and efficient design. It is built on a decoder-only transformer architecture, a design inherently optimized for generative tasks such as producing coherent text for medical reports or answering complex clinical questions. Key architectural features inherited from Gemma 3 include a long context window (128K tokens), which allows the model to process extensive medical histories or research papers in a single pass.
MedSigLIP: The Medical Vision Engine
The cornerstone of MedGemma’s multimodal prowess is MedSigLIP, a 400-million-parameter vision encoder specifically tuned for the medical domain. Derived from Google’s SigLIP (Sigmoid Loss for Language-Image Pre-training) model, MedSigLIP was not merely trained on generic images; it underwent extensive, specialized pre-training on a massive and diverse corpus of de-identified medical imagery. This dataset spans a wide range of modalities critical to clinical practice, including chest X-rays, dermatology photographs, ophthalmology fundus images, and digital histopathology slides. It is ideal for the following tasks (a minimal zero-shot sketch follows the list):
- Traditional image classification: Build performant models to classify medical images.
- Zero-shot image classification: Classify images without specific training examples by comparing image embeddings to the embeddings of textual class labels.
- Semantic image retrieval: Find visually or semantically similar images from large medical image databases.
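To make the zero-shot classification workflow above concrete, here is a minimal sketch using the Hugging Face transformers library. The checkpoint name google/medsiglip-448, the image path, and the label strings are illustrative assumptions; the actual checkpoint and preprocessing defaults may differ.
# Minimal sketch: zero-shot classification with a SigLIP-style medical encoder.
# Assumes a MedSigLIP checkpoint (here "google/medsiglip-448") that follows the
# standard SigLIP interface in the transformers library.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/medsiglip-448"  # assumed checkpoint name
model = AutoModel.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("chest_xray.png").convert("RGB")  # placeholder image path
labels = [
    "chest X-ray with pleural effusion",   # illustrative textual class labels
    "chest X-ray with no acute findings",
]

inputs = processor(text=labels, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores each image-label pair independently; a sigmoid turns the
# logits into per-label probabilities.
probs = torch.sigmoid(outputs.logits_per_image)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.3f}")
The same embeddings (via model.get_image_features and model.get_text_features) can be stored in a vector index to power the semantic retrieval use case.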
Model variants
The MedGemma collection is offered in several variants, allowing developers to select the optimal model for their specific needs in terms of performance, modality, and computational resources:
- MedGemma 4B: A 4-billion parameter multimodal model that strikes a balance between high performance and resource efficiency. It is the workhorse for many of the image analysis use cases and is available in two forms:
– medgemma-4b-it (Instruction-Tuned): This version has been fine-tuned to follow instructions and engage in dialogue, making it the recommended starting point for most applications (see the quick-start sketch after this list).
– medgemma-4b-pt (Pre-trained): This is the raw base model before instruction tuning. It is provided for researchers who require deeper control for advanced experimentation and custom fine-tuning pipelines.
- MedGemma 27B Text-Only (medgemma-27b-text-it): A 27-billion parameter model optimized exclusively for medical text comprehension and reasoning. Due to its focused training, it exhibits slightly higher performance on text-only benchmarks and is the ideal choice for tasks like summarizing EHRs, querying medical literature, or analyzing clinical notes.
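For a quick local test of the instruction-tuned 4B variant before any cloud deployment, the sketch below uses the Hugging Face transformers library. It assumes a recent transformers release with Gemma 3 support, a GPU with bfloat16, and a placeholder image file; it is not the application code used later in this post.
# Sketch: run google/medgemma-4b-it locally with the transformers library.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "google/medgemma-4b-it"
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are an expert radiologist."}]},
    {"role": "user", "content": [
        {"type": "image", "image": Image.open("chest_xray.png")},  # placeholder image
        {"type": "text", "text": "Describe the key findings in this X-ray."},
    ]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = generated[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))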
In this post, we will work with a sample application, “Cymbal MedBuddy”, to see how MedGemma can be used for medical image analysis. We will use medgemma-4b-it deployed on Vertex AI. The application is built with Python and Streamlit for the backend and UI. After uploading a medical image (e.g., an MRI slice or an X-ray), the user enters a short text prompt; the system composes a request containing the system prompt, the user prompt, and the uploaded image, and queries the model. The model’s response is then unpacked and displayed in the chat window.
Prerequisites
- A Google Cloud Account, with billing enabled.
- Google Cloud SDK: gcloud CLI installed and authenticated.
- APIs Enabled: Enable the Vertex AI API in your GCP project.
- Python 3.10+ and pip.
- Git: For cloning the repository.
Deploy MedGemma using Vertex AI
To use MedGemma, we first need to deploy the model. It is available both in Google Model Garden and on Hugging Face. For this post, we will use Model Garden. From the deploy options, select Vertex AI.
You will be prompted to select the MedGemma variant and the compute resources. For this post, select google/medgemma-4b-it, the Basic deployment settings, the us-central1 region, and the vLLM 128K context (2 NVIDIA_L4; g2-standard-24) machine spec. Press the Deploy button to start the deployment; it will take several minutes to complete.
Alternatively, you can deploy with the CLI command below. Set the --project flag to your project ID; the region used is us-central1.
gcloud auth login
gcloud ai model-garden models deploy \
--model="google/medgemma@medgemma-4b-it" \
--region="us-central1" \
--project="" \
--accept-eula \
--machine-type="g2-standard-24" \
--accelerator-type="NVIDIA_L4" \
--container-image-uri="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250430_0916_RC00_maas" \
--use-dedicated-endpoint \
--endpoint-display-name="google_medgemma-4b-it-mg-one-click-deploy"
Wait for the model to be successfully deployed and in the Active state. Note the endpoint ID, which is required later in this post.
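If you prefer to look up the endpoint ID programmatically rather than copying it from the console, here is a small sketch using the google-cloud-aiplatform SDK. The project ID is a placeholder, and the display name to look for is the default used by the one-click deploy.
# Sketch: list Vertex AI endpoints in the region and print their IDs.
# Assumes `pip install google-cloud-aiplatform` and Application Default
# Credentials (e.g., via `gcloud auth application-default login`).
from google.cloud import aiplatform

aiplatform.init(project="your-gcp-project-id", location="us-central1")

for endpoint in aiplatform.Endpoint.list():
    # endpoint.name holds the numeric endpoint ID the application needs.
    print(endpoint.display_name, "->", endpoint.name)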
Cymbal MedBuddy application
The application is hosted on GitHub in the following repo:
We will run this application in Google Cloud Shell, but you can also deploy it on Cloud Run or App Engine. Make sure the required IAM roles are granted correctly, following best practices. As a prerequisite to running the application, check and update the details in the .env file as required.
GCP_PROJECT_ID="your-gcp-project-id" # Change to your GCP project id
GCP_REGION="us-central1" # The region of your project, e.g., us-central1
MODEL_ENDPOINT_ID= # The ID of the Vertex AI Endpoint you deployed
MODEL_ENDPOINT_REGION="us-central1" # The region where your endpoint is deployed
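As a quick sanity check, the sketch below shows one way to read these settings in Python, assuming the python-dotenv package; the actual repo may load its configuration differently.
# Sketch: load the .env settings and fail fast if any required value is missing.
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from .env into the process environment

REQUIRED = ["GCP_PROJECT_ID", "GCP_REGION", "MODEL_ENDPOINT_ID", "MODEL_ENDPOINT_REGION"]
config = {key: os.getenv(key) for key in REQUIRED}

missing = [key for key, value in config.items() if not value]
if missing:
    raise RuntimeError(f"Missing required settings in .env: {missing}")
print("Configuration loaded:", config)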
The application code lives in the app.py file. The key components are listed below (a simplified sketch of the two helper functions follows this list):
- fn_initialize_model(): This function uses the @st.cache_resource decorator. This is a critical optimization that tells Streamlit to run this function only once, establishing the connection to the Vertex AI endpoint and caching the connection object. This prevents costly re-initialization on every user interaction, making the app significantly faster.
- fn_run_query(): This is the core AI interaction function.
– System Prompt: It constructs a detailed system_instruction that guides the MedGemma model to behave as a medical imaging expert. This is a key aspect of prompt engineering that ensures high-quality, consistent, and safe responses.
– Payload Formatting: It packages the system prompt, the user’s text prompt, and the base64-encoded image into the JSON payload format expected by the Vertex AI API.
- st.session_state: Streamlit’s session state is used extensively to maintain the application’s state across user interactions (reruns).
– st.session_state.messages: A list that stores the entire chat history.
– st.session_state.img_b64: Stores the uploaded image so it doesn’t need to be re-processed on every rerun.
– st.session_state.temperature & st.session_state.max_tokens: These are linked directly to the UI sliders, allowing user settings to persist.
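The sketch below is a simplified, hypothetical version of the two helper functions described above. The function names mirror the description, but the request schema is an assumption: it uses an OpenAI-style chat-completions payload, which the Model Garden vLLM serving container commonly accepts; the repo’s actual payload construction and response parsing may differ.
# Simplified, illustrative sketch of the app's model helpers (not the repo's exact code).
import os

import streamlit as st
from google.cloud import aiplatform


@st.cache_resource
def fn_initialize_model() -> aiplatform.Endpoint:
    """Connect to the deployed MedGemma endpoint once and cache the handle."""
    aiplatform.init(
        project=os.getenv("GCP_PROJECT_ID"),
        location=os.getenv("MODEL_ENDPOINT_REGION", "us-central1"),
    )
    return aiplatform.Endpoint(os.getenv("MODEL_ENDPOINT_ID"))


def fn_run_query(endpoint: aiplatform.Endpoint, user_prompt: str, img_b64: str) -> str:
    """Send the system prompt, user prompt, and base64-encoded image to MedGemma."""
    system_instruction = (
        "You are a medical imaging expert. Describe the findings in the image "
        "carefully, note any limitations, and avoid giving a definitive diagnosis."
    )
    # Assumed chat-completions-style payload; adjust to your serving container's schema.
    instance = {
        "@requestFormat": "chatCompletions",
        "messages": [
            {"role": "system", "content": system_instruction},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                ],
            },
        ],
        "max_tokens": st.session_state.get("max_tokens", 1024),
        "temperature": st.session_state.get("temperature", 0.2),
    }
    # use_dedicated_endpoint matches the --use-dedicated-endpoint deploy flag and
    # requires a recent google-cloud-aiplatform release.
    response = endpoint.predict(instances=[instance], use_dedicated_endpoint=True)
    prediction = response.predictions
    if isinstance(prediction, list):  # some containers wrap the payload in a list
        prediction = prediction[0]
    # Assumed chat-completions response shape.
    return prediction["choices"][0]["message"]["content"]
If the container rejects this schema, the serving container’s documentation and the repo’s app.py are the source of truth for the expected request and response formats.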
Various interactions are depicted in the following sequence diagram:
Source Credit: https://medium.com/google-cloud/analyze-medical-images-with-medgemma-a-technical-deep-dive-fee0be18e7e0