The Gemini 2.5 Computer Use model can interact with user interfaces and perform actions on the user's behalf, such as operating a web application to order a product from an e-commerce site. It is both a model and a tool: as a model, it can suggest the next action, such as clicking at point (x, y) or entering text in the text box located at point (x, y), based on your current state; as a tool, it can interpret an instruction such as "Use Google Flights to find me the best flights from Tokyo to Singapore" and automatically figure out the next steps until the task is complete.
How to use Gemini 2.5 Computer Use Model
At the time of writing, the model is available in preview on Vertex AI and aistudio.google.com. The model is called
gemini-2.5-computer-use-preview-10-2025
Basic use case: Find where to click on the browser
You can use the Gemini 2.5 Computer Use model to determine the precise location of a button or text field, so you can click it or enter text. In this example we will use Playwright to drive a web browser and perform the actions; you will need the playwright and google-genai Python packages installed.
Step 1: Create a web browser instance
from playwright.async_api import async_playwright

async def create_playwright_session():
    playwright = await async_playwright().start()
    # Launch the browser in headed mode so you can watch the actions
    browser = await playwright.chromium.launch(headless=False)
    # Create a new page and set a fixed viewport size
    page = await browser.new_page()
    screen_width, screen_height = 1920, 1080
    await page.set_viewport_size({"width": screen_width, "height": screen_height})
    print("Playwright session started.")
    return page
Step 2: Navigate to https://pet-luxe-spa.web.app/ and take a screenshot of the browser
async def get_screenshot(page):
    screenshot_bytes = await page.screenshot()
    return screenshot_bytes

async def navigate_to(page, url):
    await page.goto(url)

async def main():
    page = await create_playwright_session()
    await navigate_to(page, "https://pet-luxe-spa.web.app/")
    screenshot_bytes = await get_screenshot(page)
Figure 1: Screenshot of Luxe Pet Spa Page
Step 3: Get Response from Gemini 2.5 Computer Use Model
from google import genai
from google.genai.types import (
    ComputerUse,
    Content,
    Environment,
    GenerateContentConfig,
    Part,
    Tool,
)

async def call_gemini(contents):
    PROJECT_ID = "[your-project-id]"  # Replace with your Google Cloud project id
    LOCATION = "global"
    client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
    MODEL_ID = "gemini-2.5-computer-use-preview-10-2025"
    # Configure the Computer Use tool with the browser environment
    config = GenerateContentConfig(
        tools=[
            Tool(
                computer_use=ComputerUse(
                    environment=Environment.ENVIRONMENT_BROWSER,
                    # Optional: exclude specific predefined functions
                    excluded_predefined_functions=["drag_and_drop", "open_web_browser"],
                )
            )
        ],
    )
    # Generate content with the configured settings
    response = client.models.generate_content(
        model=MODEL_ID,
        contents=contents,
        config=config,
    )
    return response

async def call_computer_use_model(screenshot_bytes, instructions):
    contents = [
        Content(
            role="user",
            parts=[
                Part(text=instructions),
                # Optional: include a screenshot of the initial state
                Part.from_bytes(data=screenshot_bytes, mime_type="image/png"),
            ],
        )
    ]
    response = await call_gemini(contents)
    return response
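To see what the model suggests for the Luxe Pet Spa page, pass the screenshot from Step 2 together with an instruction. The instruction below is a hypothetical example; the exact prompt used for the response in Step 4 is not shown in this post.

response = await call_computer_use_model(
    screenshot_bytes, "Click the Book Appointment button"  # hypothetical instruction
)
print(response.candidates[0].content.parts)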
Step 4: Confirm the response from the model
candidates=[Candidate(
avg_logprobs=-0.3403618335723877,
content=Content(
parts=[
Part(
function_call=FunctionCall(
args={
'x': 807,
'y': 33
},
name='click_at'
)
),
],
role='model'
),
Step 5: Click on the button
candidate = response.candidates[0]
function_call = candidate.content.parts[0].function_call
screen_width, screen_height = 1920, 1080
actual_x = normalize_x(function_call.args["x"], screen_width)
actual_y = normalize_y(function_call.args["y"], screen_height)
await page.mouse.click(actual_x, actual_y)
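The normalize_x and normalize_y helpers above are not part of the SDK. The model returns coordinates on a 1000×1000 grid (see the action list in the next section), so they must be scaled to the real viewport. A minimal sketch:

def normalize_x(x, screen_width):
    # The model's x is on a 0-999 grid; scale it to the actual viewport width
    return int(x / 1000 * screen_width)

def normalize_y(y, screen_height):
    # Same scaling for the vertical axis
    return int(y / 1000 * screen_height)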
Automate Computer Use: Accomplish a task
In the previous section you saw that by providing the current state (a screenshot) and an instruction to the Gemini 2.5 Computer Use model, we got back the next action (click_at, etc.) and the position at which to perform it. By executing the suggested action and adding the new state (a fresh browser screenshot) to the context, we can chain a series of actions suggested by the model and accomplish our goal.
The following actions are currently available:
open_web_browser
Opens the web browser.
wait_5_seconds
Pauses execution for 5 seconds to allow dynamic content to load or animations to complete.
go_back
Navigates to the previous page in the browser’s history.
go_forward
Navigates to the next page in the browser’s history.
search
Navigates to the default search engine’s homepage (e.g., Google). Useful for starting a new search task.
navigate
Navigates the browser directly to the specified URL.
click_at
Clicks at a specific coordinate on the webpage. The x and y values are based on a 1000×1000 grid and are scaled to the screen dimensions.
hover_at
Hovers the mouse at a specific coordinate on the webpage. Useful for revealing sub-menus. x and y are based on a 1000×1000 grid.
type_text_at
Types text at a specific coordinate, defaults to clearing the field first and pressing ENTER after typing, but these can be disabled. x and y are based on a 1000×1000 grid.
key_combination
Press keyboard keys or combinations, such as “Control+C” or “Enter”. Useful for triggering actions (like submitting a form with “Enter”) or clipboard operations.
scroll_document
Scrolls the entire webpage “up”, “down”, “left”, or “right”.
scroll_at
Scrolls a specific element or area at coordinate (x, y) in the specified direction by a certain magnitude. Coordinates and magnitude (default 800) are based on a 1000×1000 grid.
drag_and_drop
Drags an element from a starting coordinate (x, y) and drops it at a destination coordinate (destination_x, destination_y). All coordinates are based on a 1000×1000 grid.
These are only suggested actions; you need to implement each of them yourself with a tool of your choice, such as Playwright, as in the sketch below.
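As an illustration, here is a minimal sketch of such an executor that maps a few of the suggested actions onto Playwright calls. It reuses the normalize_x/normalize_y helpers from earlier; the argument names (text, press_enter, url, direction) are assumptions based on the action descriptions above, and only a subset of the actions is handled.

async def execute_function_calls(response, page, screen_width, screen_height):
    results = []
    for part in response.candidates[0].content.parts:
        fc = part.function_call
        if fc is None:
            continue
        args = fc.args
        if fc.name == "click_at":
            await page.mouse.click(normalize_x(args["x"], screen_width),
                                   normalize_y(args["y"], screen_height))
        elif fc.name == "type_text_at":
            # Click to focus the field, then type; press Enter unless disabled
            await page.mouse.click(normalize_x(args["x"], screen_width),
                                   normalize_y(args["y"], screen_height))
            await page.keyboard.type(args["text"])
            if args.get("press_enter", True):
                await page.keyboard.press("Enter")
        elif fc.name == "navigate":
            await page.goto(args["url"])
        elif fc.name == "scroll_document":
            # Simplified: map document scrolling to PageUp/PageDown key presses
            await page.keyboard.press("PageDown" if args.get("direction") == "down" else "PageUp")
        results.append(fc.name)
    return results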
Figure 2: How the whole process works. The user starts by feeding the current state (a screenshot) and an instruction to the model. The model provides a list of actions to execute; after the actions are executed, the new state is added to the context and sent back to the model. The process repeats until the objective is achieved.
You can choose how long to run this loop by setting the number of iterations.
import uuid

async def main():
    screen_width, screen_height = 1920, 1080
    page = await create_playwright_session()
    # Create the content with the user message and the initial screenshot
    screenshot_bytes = await page.screenshot()
    instructions = (
        "Find me a flight from SF to Hawaii on next Monday, coming back on "
        "next Friday. Start by navigating directly to flights.google.com, "
        "choose the round trip option. On start enter or select SF and on "
        "destination select Hawaii"
    )
    contents = [
        Content(
            role="user",
            parts=[
                Part(text=instructions),
                # Optional: include a screenshot of the initial state
                Part.from_bytes(data=screenshot_bytes, mime_type="image/png"),
            ],
        )
    ]
    # Create a unique session id
    session_id = str(uuid.uuid4())
    # 100 iterations max
    for i in range(100):
        try:
            response = await call_gemini(contents)
            contents.append(response.candidates[0].content)
            results = await execute_function_calls(response, page, screen_width, screen_height)
            # capture_state (see the linked repo) adds the new browser state to the context
            await capture_state(results, page, contents, session_id)
        except Exception as e:
            # Log the failure instead of silently swallowing it
            print(f"Iteration {i} failed: {e}")
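To run the loop end to end, a standard asyncio entry point is enough:

import asyncio

if __name__ == "__main__":
    asyncio.run(main())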
In this case the Gemini 2.5 Computer Use model automatically determines all the steps needed to find the flights from SF to Hawaii. You can find the complete code here:
https://github.com/haren-bh/webautomation
Get user confirmation
The process works better if you put a human in the loop and ask for the user's input whenever needed. Sometimes the model explicitly asks for user input:
{
  "content": {
    "parts": [
      {
        "text": "I have evaluated step 2. It seems Google detected unusual traffic and is asking me to verify I'm not a robot. I need to click the 'I'm not a robot' checkbox located near the top left (y=98, x=95)."
      },
      {
        "function_call": {
          "name": "click_at",
          "args": {
            "x": 60,
            "y": 100,
            "safety_decision": {
              "explanation": "I have encountered a CAPTCHA challenge that requires interaction. I need you to complete the challenge by clicking the 'I'm not a robot' checkbox and any subsequent verification steps.",
              "decision": "require_confirmation"
            }
          }
        }
      }
    ]
  }
}
When you get a decision like "require_confirmation", you should get confirmation from the user and send it back to the model. For more details, please check the Computer Use documentation:
https://ai.google.dev/gemini-api/docs/computer-use
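As a sketch of how you might gate a flagged action on user input (the field names follow the sample response above; how the acknowledgement is reported back to the model is covered in the linked docs, so treat this as an outline rather than the full contract):

def requires_confirmation(function_call):
    # The model attaches a safety_decision to the action's args when it wants sign-off
    decision = (function_call.args or {}).get("safety_decision")
    return decision is not None and decision.get("decision") == "require_confirmation"

async def confirm_and_execute(function_call, page, screen_width, screen_height):
    if requires_confirmation(function_call):
        explanation = function_call.args["safety_decision"]["explanation"]
        answer = input(f"The model asks: {explanation}\nProceed? [y/N] ")
        if answer.strip().lower() != "y":
            return False  # Skip the action; report the refusal back to the model
    await page.mouse.click(
        normalize_x(function_call.args["x"], screen_width),
        normalize_y(function_call.args["y"], screen_height),
    )
    return True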
Conclusion
In this blog you learned the basics of the Gemini 2.5 Computer Use model and how to use it to perform sophisticated browser automation. You can either control every step of the execution process or let Gemini 2.5 handle the entire task automatically; the best approach depends on your use case.
Source Credit: https://medium.com/google-cloud/getting-started-with-gemini-2-5-computer-use-79c525149966?source=rss—-e52cf94d98af—4
