Imagine showing an AI a photo of a messy workshop and asking:
“How many hex bolts are in that bin?”
Traditionally, a multimodal LLM would make a best guess based on a single pass over the image — sometimes correct, sometimes confidently wrong.
But what if the AI could pause, reason about the task, write a Python script to zoom into the bin, apply a counting algorithm, draw bounding boxes around each bolt — and then verify its own work?
That’s no longer science fiction.
With Agentic Vision in Gemini 3 Flash, image understanding shifts from passive observation to active investigation, grounding answers in visual evidence rather than plausibility.
Agentic Vision: From Seeing to Investigating
Agentic Vision represents a fundamental change in how multimodal models reason about images. Instead of describing what they see, models can now interact with images as environments.
At its core is a rigorous Think → Act → Observe loop:
- Think: The model analyzes the user’s request and the initial image, then formulates a multi-step plan. For example: “The objects are small. I need to zoom in, isolate regions, and count them.”
- Act: The model generates and executes Python code to manipulate or analyze the image — cropping regions of interest, running calculations, counting objects, or drawing annotations (a minimal example of such a step follows below).
- Observe: The transformed image (for example, a high-resolution crop or annotated output) is appended back into the model’s context window, allowing it to inspect the results before producing a final answer.
This loop enables Gemini 3 Flash to reason with images, not just about them.
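To make the Act step concrete, here is a minimal sketch of the kind of Python the model might generate to zoom into a region before re-inspecting it. The file name and crop coordinates are illustrative, not taken from a real run.

from PIL import Image

# Hypothetical "Act" step: crop the bin region and upscale it so small objects
# (the hex bolts) are large enough to count reliably on the next pass.
img = Image.open("workshop.jpg")                        # illustrative file name
left, top, right, bottom = 820, 410, 1260, 760          # illustrative bin coordinates
bin_crop = img.crop((left, top, right, bottom))
bin_crop = bin_crop.resize((bin_crop.width * 2, bin_crop.height * 2))
bin_crop.save("bin_zoom.png")                           # fed back to the model as an Observation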

What Changed with Agentic Vision?
With Agentic Vision, Gemini 3 Flash can:
- Zoom into regions of interest
- Perform visual math and measurements
- Generate and execute code for image analysis
- Inspect intermediate visual outputs
- Ground answers in observable, inspectable evidence
Solving the “Black Box” Problem in Vision AI
One of the biggest challenges in vision AI has been trust. When a model says “There are 10 items,” you’re often forced to take it at face value.
Agentic Vision changes that.
Because Gemini 3 Flash can annotate images directly — drawing bounding boxes, arrows, and labels — developers can see exactly how the model arrived at its conclusion. Instead of trusting a number, you can inspect the actual Matplotlib or OpenCV output showing where each item was detected.
Images are no longer opaque inputs. They become auditable artifacts in the reasoning process.
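To give a sense of what such an annotation looks like in practice, here is a rough OpenCV sketch of the kind of code the model can generate; the file name, box coordinates, and label are placeholders, not the model’s actual output.

import cv2

# Hypothetical annotation step: draw a labeled bounding box so the evidence
# for "item detected here" is directly visible in the output image.
img = cv2.imread("work_desk.jpg")                       # illustrative file name
x, y, w, h = 350, 220, 90, 140                          # placeholder detection box
cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)
cv2.putText(img, "action figure", (x, y - 8),
            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
cv2.imwrite("work_desk_annotated.jpg", img)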
Key Capabilities Enabled by Agentic Vision
- Zoom & Inspect: Automatically crop and re-analyze small or distant details at higher fidelity.
- Visual Math: Count objects, sum values on receipts, or measure distances using pixel-to-ratio calculations (see the sketch after this list).
- Annotation & Grounding: Draw boxes, arrows, and labels directly on images to explain and justify conclusions.
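To make the pixel-to-ratio idea concrete, here is a small illustrative calculation: if an object of known physical size is visible in the image, its pixel length gives a scale factor for measuring everything else. All numbers below are invented for the sake of the example.

# Hypothetical pixel-to-ratio measurement.
# A credit card (85.6 mm wide) visible in the image serves as the reference scale.
card_width_px = 342                      # measured width of the card in pixels (illustrative)
card_width_mm = 85.6                     # known physical width of a credit card
mm_per_px = card_width_mm / card_width_px

bolt_length_px = 96                      # pixel length of a bolt in the same image (illustrative)
bolt_length_mm = bolt_length_px * mm_per_px
print(f"Estimated bolt length: {bolt_length_mm:.1f} mm")   # ~24.0 mm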
Let’s try it out!
You can find the complete code at https://github.com/laxmi-genai/gemini3-samples
Below is a real prompt used to demonstrate Agentic Vision in action.
Prompt
Count the items in this image and classify them and label it by its
specific color and object type.
Point to all the items in the image in the format of {"point": [y,x], label=""}.
Identify any objects that should not belong to a work desk and annotate them on
the image with boxes, arrows and make sure you label them.
Also provide the reasoning on why they do not belong in a work desk
Code Snippet
This example uses Gemini 3 Flash, enables code execution, and requests a high thinking level so the model can plan and verify its actions. The imports, client setup, and image loading are included below for completeness; the image file name is illustrative.
from google import genai
from google.genai import types
from PIL import Image

client = genai.Client()                  # assumes GEMINI_API_KEY is set in the environment
image = Image.open("work_desk.jpg")      # illustrative file name for the desk photo

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[image, prompt],            # prompt is the string shown above
    config=types.GenerateContentConfig(
        # Enable the Code Execution tool so the model can write and run Python
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
        # Ask the model to reason deeply before acting
        thinking_config=types.ThinkingConfig(
            thinking_level="HIGH",
        ),
    ),
)
The critical pieces here are:
- Enabling the Code Execution tool
- Allowing the model to reason deeply before acting
- Letting visual outputs flow back into the model’s context
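Because the generated code, its execution output, and any rendered images all come back as parts of the response, you can inspect each step directly. Here is a minimal sketch using the response object from the snippet above; the output file name is illustrative.

# Walk the response parts to see the model's reasoning artifacts:
# the Python it wrote, the output of running it, and any images it produced.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)                             # natural-language reasoning / final answer
    if part.executable_code:
        print(part.executable_code.code)             # the Python the model generated
    if part.code_execution_result:
        print(part.code_execution_result.output)     # stdout from executing that code
    if part.inline_data:                             # e.g. an annotated or cropped image
        with open("model_output.png", "wb") as f:    # illustrative file name
            f.write(part.inline_data.data)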
What the Model Identified as “Not Belonging on a Work Desk”
The following items were detected, annotated, and justified by the model:
Action Figure (Humanoid):
- This is a toy/collectible intended for recreation or display rather than for professional or educational tasks.
Monster Figure:
- Similar to the action figure, it is a recreational item and not a standard office supply.
Lip Butter Tube:
- This is a personal care/cosmetic item, not a functional desk tool.
Fabric Strap:
- This appears to be a personal accessory (likely a watch strap) and is not an office or desk-related tool.
Input and Output images

Why This Matters for Developers
Agentic Vision unlocks a new class of applications:
- Visual inspection systems that verify their own results
- Debuggable vision pipelines with inspectable outputs
- Multimodal agents that can reason, act, and validate
- AI systems that explain how they know — not just what they know
Images are no longer static prompts. They are interactive environments that models can explore, manipulate, and reason over.
Final Thoughts
Agentic Vision represents a shift from looking to reasoned seeing.
By combining multimodal reasoning with executable code, Gemini 3 Flash turns images into active participants in the decision-making loop — enabling AI systems that are more accurate, more trustworthy, and far more transparent.
And yes…
Maybe it really is time to organize and clean up my desk.
Thank you, Gemini 3!