The “Magic” Breaks When the Spinner Doesn’t Stop
We have all been there. You type a prompt into a sophisticated AI agent, expecting magic. You wait. You see a spinner. You wait some more. Thirty seconds later, you’ve checked your email or switched tabs. In a digital landscape where uptime standards routinely reach “five nines” (99.999%), latency has effectively become a new form of downtime. This challenge is particularly acute for Generative AI systems.

While Large Language Models (LLMs) have unlocked incredible reasoning capabilities, they have introduced a new bottleneck: the “Thinking Time.” Unlike traditional software, where a database query takes milliseconds, an agentic workflow involving planning, tool use, and reflection can easily take tens of seconds to even minutes. In consumer UX, that is an eternity.
As we build agentic systems, we often fall into one of two traps. We either assume latency is purely an engineering problem, something to be fixed by waiting for the next model upgrade, or we simply accept it as the “nature of the beast,” believing that intelligent agents are inherently slow and there is nothing we can do about it.
Of course, agentic systems will always carry some latency due to their multi-step nature, but with careful design we can slash those wait times dramatically. Let me share an example from my personal experience working on a deep research agent. In that use case I was able to reduce the execution time from over 4 minutes down to just 2 minutes, a roughly 50% reduction achieved purely through architecture and logic optimization, and, most importantly, without sacrificing the quality of the agent.
The reality is that latency can be reduced with architectural and design choices.
Reducing latency isn’t just about code optimization. It requires a holistic approach involving:
- The Brain: Selecting and/or tuning the right models.
- The Payload: Being ruthless about what data enters the context window.
- The Flow: Architecting paths that prioritize speed over unnecessary complexity.
- The Illusion: Using psychology to manage user perception when waits are unavoidable.
Here are 17 concrete strategies you can implement today to make your agents feel significantly faster.
The Brain: Optimize the Core Processing
The biggest contributor to latency is often the model itself. When we default to the “smartest” model available for every single step, we are essentially hiring a PhD to make coffee. It works, but it’s expensive and slow. (…though I must admit, we occasionally manage to get an unnecessarily amazing coffee)
1. Right-size the model for the task
We often default to frontier-class models (like Gemini 3 Pro) for every step in an agentic flow. However, asking a massive reasoning model to perform simple categorization or formatting tasks is inefficient. It introduces unnecessary overhead for tasks that require speed, not deep intellect.
- The fix: Dynamically route tasks based on complexity. If an agent needs to categorize a user query or format a JSON output, use a smaller, faster model (like Gemini Lite). These models can execute routine tasks in a fraction of the time. Reserve the heavy hitters only for complex reasoning and planning steps.
- Bonus tip: Ideally, more complex models should only be used where their performance benefits can be observed through evaluation metrics.
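As a rough illustration of this routing idea, here is a minimal sketch using the google-genai Python SDK; the task types and model names are placeholder assumptions you would adapt to your own workflow and evaluations.

```python
# Hypothetical routing table: trivial steps go to a small, fast model,
# the large model is reserved for planning and complex reasoning.
from google import genai

client = genai.Client()  # assumes an API key is configured in the environment

MODEL_BY_TASK = {
    "classify": "gemini-2.5-flash-lite",  # fast, cheap routine work
    "format": "gemini-2.5-flash-lite",
    "plan": "gemini-2.5-pro",             # heavy reasoning only where it pays off
}

def run_step(task_type: str, prompt: str) -> str:
    model = MODEL_BY_TASK.get(task_type, "gemini-2.5-flash")  # sensible default
    response = client.models.generate_content(model=model, contents=prompt)
    return response.text
```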
2. Fine-Tune to reduce “Instruction Overhead”
General-purpose models need detailed, lengthy instructions to understand your specific domain and constraints. This “instruction overhead” forces the model to process more input tokens and spend more inference time “understanding” the task before it even begins generating an answer.
- The fix: Fine-tune the model specifically on your domain and prompt structure. This allows the model to operate on “muscle memory,” requiring fewer steps and less “thought” processing to arrive at the correct answer. A tightly fine-tuned model generates tokens faster because the probability distribution for the correct answer is sharper.
3. Cap the “Thinking Budget”
Newer reasoning models (such as Gemini 2.5 and 3 with their thinking process) are powerful, but they can overthink a task, spending excessive time reflecting on simple queries.
- The fix: Strictly control the thinking mode. Use API controls to cap the thinking budget for intermediate steps or to limit the depth of reasoning. With Gemini 3, for example, you can set the thinking level to Low or High.
- Bonus tip: For additional control you can also use prompt instructions like “Answer immediately without reflection” or “do not overthink, just answer” to force a faster, heuristic-based response when deep thought isn’t required.
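As a minimal sketch with the google-genai SDK (assuming a Gemini 2.5 model, where a thinking budget of zero disables thinking for the call; newer models expose a thinking level instead):

```python
# Sketch: disable thinking for a simple intermediate classification step.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Classify this ticket as BUG or FEATURE: 'App crashes on login'",
    config=types.GenerateContentConfig(
        # 0 disables thinking for this call; a small budget allows brief reflection.
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)
print(response.text)
```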
4. Debug the “Thought Process”
When models expose their “chain of thought,” you will often see them spinning their wheels — debating definitions, handling edge cases, or trying to resolve logical conflicts in your prompt. This “internal debate” is wasted latency.
- The fix: Audit the model’s raw thoughts to identify confusion points. For example, if the model spends 10 seconds debating what “recent” means, clarify the prompt to define a specific timeline. By removing ambiguities in your prompt, you eliminate the need for the model to “debate” itself, cutting down latency at the source.
The Payload: Go on a Token Diet
Latency is directly correlated with the amount of data the model has to read (input tokens) and write (output tokens). Every unnecessary character is a millisecond of wasted time.
5. Be selective with Context
Agents love context, but they don’t always need all of it. If you blindly pass the entire conversation history or massive retrieved documents into every step of the chain, you are slowing down the “time to first token.”
- The fix: Dynamically prune context. If an agent is performing a specific sub-task (like formatting a date), it doesn’t need to know the user’s name or the previous three summaries. Send only what is strictly necessary for that specific inference call.
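As a trivial sketch of this idea, build a per-step payload instead of forwarding the full conversation state into every call (the field names here are purely illustrative):

```python
# Hypothetical per-step prompt builder: only the fields this sub-task needs.
def build_date_format_prompt(state: dict) -> str:
    # Conversation history, user profile and earlier summaries are deliberately omitted.
    return f"Format the following date as ISO 8601 and return only the date.\nDate: {state['raw_date']}"

print(build_date_format_prompt({"raw_date": "March 3rd, 2025", "user_name": "Ada", "history": ["..."]}))
```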
6. Pre-process Inputs (JSON vs. Markdown)
Structured data is great for computers, but it’s heavy for LLMs. JSON files, with their repeated keys, brackets, and quotes, are token-expensive.
- The fix: Convert heavy JSON objects into lightweight Markdown or simple text lists before sending them to the model. The model can understand the data just as well, but it has to process significantly fewer tokens, resulting in faster processing times.
- Bonus tip: Input prompts (i.e. the instructions themselves) can also be compressed. Prompt compression reduces the prompt size without losing its meaning.
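A minimal sketch of such pre-processing, with illustrative field names:

```python
# Flatten a token-heavy JSON list into compact Markdown bullets before prompting.
import json

def records_to_markdown(raw_json: str) -> str:
    records = json.loads(raw_json)
    # One bullet per record instead of repeated keys, quotes and braces.
    return "\n".join(f"- {r['name']} | {r['role']} | {r['location']}" for r in records)

raw = '[{"name": "Ada", "role": "Engineer", "location": "London"}]'
print(records_to_markdown(raw))
# - Ada | Engineer | London
```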
7. Control the Output Size
The model doesn’t know you only want a sentence if you don’t tell it. Without constraints, an LLM might generate a polite preamble (“Here is the information you requested…”) or a lengthy explanation.
- The fix: Set a strict limit on maximum generated tokens for internal agent steps. If a step only requires a “Yes/No” or a classification ID, enforce a low token limit via the API. This mechanically forces the API request to close faster.
- Bonus tip: If a hard limit on generated tokens is not appropriate, you can instead tell the LLM what a typical output size should be. This guides the model toward roughly the right length without forbidding it from generating a bit more when necessary.
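A minimal sketch of the hard limit, using the google-genai SDK (the prompt, model name, and limit are illustrative):

```python
# Enforce a tiny output budget for an internal yes/no step.
from google import genai
from google.genai import types

client = genai.Client()
ticket_text = "Customer asks whether they can get their money back for order #1234."

response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=f"Does this ticket mention a refund request? Answer only YES or NO.\n\n{ticket_text}",
    config=types.GenerateContentConfig(max_output_tokens=5),
)
print(response.text)  # "YES" or "NO"
```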
8. Utilise Context Caching
Agents often carry massive “static” payloads (long system instructions, 50-page documents, or a dozen few-shot examples) that are sent identically with every single user request. Re-processing these same thousands of tokens for every turn is redundant and significantly slows down the “Time to First Token” (TTFT).
- The fix: Use the Context Caching features provided by major APIs (Gemini Context Caching). By using “cached” context, the model processes it once and stores the state in memory. Subsequent calls reference this cached state instantly, bypassing the need to re-process it. This can reduce prompt processing time by up to 90% for context-heavy agents. In simple terms, it changes the workflow from “Reading the book every time you ask a question” to “Reading the book once, keeping it open on the desk, and just asking questions.”
- Bonus tip: For maximum benefit, ensure the cached context is truly static and rarely changes, as invalidating and recalculating the cache introduces its own latency cost.
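A minimal sketch with the google-genai SDK (the content, model name, and TTL are illustrative; note that explicit caching only kicks in once the static payload exceeds the model’s minimum cacheable token count):

```python
# Pay for the big static payload once, then reference the cache on every turn.
from google import genai
from google.genai import types

client = genai.Client()

LONG_SYSTEM_PROMPT = "You are a contracts analyst. Always cite the clause you rely on."
BIG_REFERENCE_DOC = "<the 50-page contract as plain text>"  # large, static context

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction=LONG_SYSTEM_PROMPT,
        contents=[BIG_REFERENCE_DOC],
        ttl="3600s",  # keep the "book open on the desk" for an hour
    ),
)

# Subsequent turns reference the cached state instead of re-sending it.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What does the termination clause say?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```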
The Flow: Optimize the Architecture
Once you have the right model and the right data, the next bottleneck is how your agents talk to each other and the outside world. This is where “Engineer Brain” meets “Product Strategy.”
9. Consolidate your Agents
Micro-services are a popular architecture pattern, but “Micro-Agents” can be a latency trap. If you have one agent to “Summarize” and another to “Evaluate,” you are paying the cost of network latency and context re-processing twice.
- The fix: Combine closely related tasks into a single prompt. As models become more capable, handling larger context windows and following complex instruction sets, the need to fragment tasks for accuracy is diminishing. Modern LLMs are perfectly capable of summarizing text, evaluating its sentiment, and formatting the output in a single pass. Reducing the number of handoffs not only simplifies the architecture but drastically cuts down the total round-trip time.
- Bonus tip: Defining a JSON schema for the model output can help generate responses with multiple attributes, effectively forcing the agent to complete multiple tasks in one call. For example, if I need to summarize some results, generate a score against the user question, and identify the topic, I can request a response in the following schema:
{
"summary": "…",
"score": "…",
"topic": "…"
}
LLM APIs support generating such structured outputs; a minimal sketch follows below.
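Here is a sketch of that consolidated call using the google-genai SDK with a Pydantic schema (the prompt and model name are illustrative):

```python
# One consolidated call returning summary, score and topic as structured output.
from google import genai
from google.genai import types
from pydantic import BaseModel

class Analysis(BaseModel):
    summary: str
    score: int
    topic: str

client = genai.Client()
source_text = "..."  # the results you want summarized, scored and tagged

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=f"Summarize, score relevance to the user question (1-5), and name the topic:\n{source_text}",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Analysis,
    ),
)
analysis = response.parsed  # an Analysis instance with all three fields from a single pass
```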
10. Hard-Code Deterministic Flows
LLMs are decision engines, but they shouldn’t decide everything. A common anti-pattern is including instructions like: “Always select the ‘save_to_file’ tool after the summary is generated.”
This forces the agent to make a completely separate LLM call just to decide what to do next (selecting the save_to_file tool). You are effectively paying for latency just to have the model agree with your own hard-coded instruction. Worse, every additional LLM hop is a roll of the dice. You introduce unnecessary points of failure such as network timeouts, API glitches, or the occasional refusal where the LLM simply decides not to follow instructions this time. It is a shame to fail on a deterministic step because you relied on a probabilistic engine.
- The fix: Hard-code the deterministic part of orchestration. Instead of the typical agentic loop (AGENT 1 → Tool Selection → Tool Execution → AGENT 2), force a deterministic path: AGENT 1 → PYTHON CODE → AGENT 2. By explicitly invoking the function in your application logic, you skip the entire “Tool Selection” inference step, saving time and eliminating the risk of errors.
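A minimal sketch of this deterministic path; summarizer_agent and evaluator_agent are hypothetical stand-ins for your LLM calls:

```python
# AGENT 1 -> plain Python -> AGENT 2: no LLM call is spent on "tool selection".
from pathlib import Path

def summarizer_agent(text: str) -> str:
    return f"Summary of: {text[:40]}..."   # placeholder for the AGENT 1 LLM call

def evaluator_agent(summary: str) -> str:
    return f"Evaluation of: {summary}"     # placeholder for the AGENT 2 LLM call

def run_pipeline(user_query: str) -> str:
    summary = summarizer_agent(user_query)
    Path("summary.md").write_text(summary)  # deterministic step, hard-coded in application logic
    return evaluator_agent(summary)
```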
11. Build Custom Tools (Avoid “Black Box” overhead)
Platform providers (like Google Vertex AI) often provide “one-click” retrieval tools to help you search documents or browse the web. While these are fantastic for getting a prototype up and running quickly, they often operate as “black boxes” optimized for general-purpose use cases rather than speed. They might be performing unseen steps, like generating summaries of every search result or re-ranking documents, that consume valuable seconds before you ever see a token. If you rely solely on these pre-built abstractions, you have no way to trim the fat.
- The fix: Roll your own tools. For example, in our research agent, we found the built-in search tool was forcing a summary generation step we didn’t need, triggering extra LLM calls. By building a custom integration that simply fetched chunk results, we consolidated everything into a single final LLM call. This cut out multiple intermediate wait times.
- Bonus tip: Start with the existing built-in tools, and after you profile your agentic workflow, evaluate which tools (if any) are worth replacing with custom implementations.
12. Trade Cost for Speed (Speculative Execution)
In traditional software engineering, we are taught to be efficient: don’t run a query unless you are sure you need the result. However, in the world of LLMs, sequential efficiency is the enemy of speed. If you wait for “Step A” (Initial Research) to be fully analyzed before you decide to trigger “Step B” (Deep Dive), you are serializing the latency. You are saving a few cents on tokens, but you are charging the user in seconds of waiting time.
- The fix: Run tasks in parallel, even if you might throw the results away. For example, you can trigger a second round of research while the agent is still evaluating the first round. If the first round passes the quality check, you discard the second round. Yes, you paid for the extra tokens, but you saved the user from waiting for a sequential process. This is a classic “Cost vs. Latency” strategic decision.
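A minimal asyncio sketch of this trade-off; run_research and passes_quality_check are hypothetical stand-ins for your research and evaluation steps:

```python
# Start round 2 speculatively while round 1 is still being judged.
import asyncio

async def run_research(query: str) -> str:
    await asyncio.sleep(5)        # stand-in for a slow research pass
    return f"Findings for {query}"

async def passes_quality_check(findings: str) -> bool:
    await asyncio.sleep(3)        # stand-in for an LLM-based evaluation
    return len(findings) > 20

async def research(query: str) -> str:
    round_one = await run_research(query)
    # Kick off the deep dive before we know whether we need it.
    round_two = asyncio.create_task(run_research(query + " (deep dive)"))
    if await passes_quality_check(round_one):
        round_two.cancel()        # discard the speculative work; we paid tokens, not time
        return round_one
    return await round_two        # already several seconds into its run

print(asyncio.run(research("agent latency benchmarks")))
```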
13. Parallelize Agent Workflows (The “Map-Reduce” Pattern)
For large, complex tasks (like “research 10 companies”), a single agent working sequentially is a bottleneck. It processes company 1, then company 2, etc., causing latency to scale linearly with the workload.
- The fix: Break the task into independent chunks and fan them out. Instead of one agent doing 10 tasks one-by-one, spin up 10 agents to do them all at once. This applies to reasoning as well: A sequential “Generate → Critique → Fix” loop is often slower than a parallel “Best-of-N” approach, where you generate 5 options simultaneously and simply have a lightweight model pick the best one.
- Bonus tip: Agentic frameworks like the Agent Development Kit (ADK) support parallel workflows out of the box, which makes implementing this fan-out pattern easy (see the sketch below).
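As a composition-only sketch, assuming ADK’s LlmAgent and ParallelAgent classes (wiring the workflow into a runner and session, and feeding each researcher its slice of the task, is omitted):

```python
# Fan the research out across sub-agents instead of processing companies one by one.
from google.adk.agents import LlmAgent, ParallelAgent

researchers = [
    LlmAgent(
        name=f"researcher_{i}",
        model="gemini-2.5-flash",
        instruction=f"Research company #{i} from the provided list and report the key facts.",
    )
    for i in range(10)
]

# All ten sub-agents execute concurrently; a downstream step can merge their outputs.
fan_out = ParallelAgent(name="company_research", sub_agents=researchers)
```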
14. Optimize Tool Concurrency (Parallel & Async)
A single agent step often requires multiple pieces of information or actions. A common mistake is letting the agent execute these sequentially: Get Weather → Wait → Get Calendar → Wait.
- The fix: Implement Async Tool Calling. Configure your agent loop to execute multiple tool calls simultaneously. If the model requests both get_weather and get_calendar in a single turn, run them asynchronously in parallel threads. This collapses the total wait time to the duration of the single longest request, rather than the sum of all requests.
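A minimal asyncio sketch with two hypothetical tools standing in for real API calls:

```python
# Run independent tool calls concurrently instead of one after another.
import asyncio

async def get_weather(city: str) -> str:
    await asyncio.sleep(1.0)      # stand-in for a real weather API call
    return f"Sunny in {city}"

async def get_calendar(day: str) -> str:
    await asyncio.sleep(1.0)      # stand-in for a real calendar API call
    return f"2 meetings {day}"

async def main() -> None:
    # Total wait is roughly the slowest call (~1s), not the sum of both (~2s).
    weather, calendar = await asyncio.gather(get_weather("Zurich"), get_calendar("today"))
    print(weather, "|", calendar)

asyncio.run(main())
```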
15. Fire and return (Trigger Long Running Operations)
Some actions, like generating a complex PDF report, processing a video, or sending a marketing email, simply take time. In a chat interface, this forces the user to stare at a spinner. In a backend workflow, it blocks the main thread, causing the entire pipeline to stall and drastically reducing throughput.
- The fix: Decouple the execution using asynchronous operations. In a chat setting, the agent starts the job and, instead of waiting, replies immediately (“I’ve started generating your report…”), freeing the user to continue the conversation. In a backend agentic workflow, the agent pushes the task to a queue and immediately proceeds to the next item. This prevents a single slow side effect from creating backpressure on your high-speed reasoning engine, allowing you to process thousands of items in parallel rather than sequentially.
- Bonus tip: Some agentic frameworks offer built-in handling for long-running operations. Explore how to build long-running tools with ADK.
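A minimal asyncio sketch of the chat variant; generate_report is a hypothetical stand-in for the slow job:

```python
# Fire the long-running job and return to the user immediately.
import asyncio

async def generate_report(request_id: str) -> None:
    await asyncio.sleep(30)       # stand-in for slow PDF generation
    print(f"Report {request_id} is ready")

async def handle_user_request(request_id: str) -> str:
    asyncio.create_task(generate_report(request_id))  # fire...
    # ...and return: the user keeps chatting while the job runs in the background.
    return "I've started generating your report. I'll let you know when it's ready."
```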
The Illusion: Manage the Wait
Sometimes, complex work simply takes time. No matter how much you optimize the model or the architecture, a deep reasoning task might still take 30 seconds. If you cannot reduce the actual latency any further, your job shifts to reducing the perceived latency. This is pure psychology.
16. Split the Architecture (The “Appetizer” Strategy)
A common mistake is treating the agent’s response as a single, monolithic block. We force the user to stare at a blank screen while the agent generates a perfect, comprehensive report in the background. By the time the answer arrives, the user has lost interest. It’s like a restaurant refusing to serve drinks or bread until the main course is ready.
- The fix: Split your architecture into “Fast” and “Slow” tracks. For a complex research query, use a fast model to generate a quick response: a high-level summary or a “plan of action”. Serve this to the user instantly. While they are reading this “appetizer,” the slower agent performs the deep dive in the background. By the time the user finishes reading the summary, the detailed results are ready to be streamed in.
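A minimal asyncio sketch of the two tracks; fast_summary and deep_research are hypothetical stand-ins for a small fast model and the slow multi-step agent:

```python
# Serve the "appetizer" from the fast track while the slow track cooks.
import asyncio

async def fast_summary(query: str) -> str:
    await asyncio.sleep(1)        # stand-in for a small, fast model
    return f"Quick take on '{query}' while I dig deeper..."

async def deep_research(query: str) -> str:
    await asyncio.sleep(20)       # stand-in for the slow multi-step agent
    return "Full, detailed report..."

async def answer(query: str) -> str:
    deep_task = asyncio.create_task(deep_research(query))  # slow track starts immediately
    print(await fast_summary(query))                        # fast track reaches the user first
    return await deep_task                                  # detailed results follow

print(asyncio.run(answer("agent latency")))
```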
17. Stream the “Thinking” (UI Transparency)
Standard loading spinners are anxiety-inducing because they are opaque. A spinner doesn’t tell you if the system is thinking hard or if it has silently crashed. When a user sees a static loading state for more than 10 seconds, their trust in the system plummets, and the wait feels twice as long.
- The fix: Stream intermediate steps to the UI. Instead of a generic loading state, show the user exactly what is happening by sharing the agent’s thoughts or where the agent is in the execution process: “Step 1/5: Searching database…”, “Step 2/5: Reading 4 documents…”, “Step 3/5: Consolidating results…” Sharing progress like this converts a 30-second wait from a frustrating delay into an engaging observation of work being done. It confirms the system is active and adds perceived value by showing the complexity of the task being handled.
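A minimal sketch of the idea: an async generator that yields progress events the UI can render as they arrive (the sleeps stand in for the real retrieval and analysis steps):

```python
# Stream progress events instead of leaving the user with a bare spinner.
import asyncio
from typing import AsyncIterator

async def run_with_progress(query: str) -> AsyncIterator[str]:
    yield "Step 1/3: Searching database..."
    await asyncio.sleep(2)        # stand-in for retrieval
    yield "Step 2/3: Reading 4 documents..."
    await asyncio.sleep(5)        # stand-in for analysis
    yield "Step 3/3: Consolidating results..."
    await asyncio.sleep(2)        # stand-in for the final consolidation call
    yield f"Final answer for: {query}"
```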
Conclusion
When building agentic systems, it is easy to accept sluggishness as the price of admission for intelligence. But as we have seen, significant speed gains, often 50% or more off your wait times, are hiding in plain sight. They are tucked away in oversized models, bloated context windows, and redundant agent handoffs.
Speed is not just about raw GPU power; it is about smart orchestration. By optimizing the Brain, cleaning the Payload, streamlining the Flow, and managing the Illusion, you can turn a sluggish tool into a snappy, production-ready product.
The models will get faster on their own eventually. But your users are waiting today.
The Art of Fast Agents: 17 Strategies to Fix Latency was originally published in Google Cloud – Community on Medium.
Source Credit: https://medium.com/google-cloud/the-art-of-fast-agents-14-strategies-to-fix-latency-07a1e1dfebf9?source=rss—-e52cf94d98af—4
