

The “agentic” paradigm is a defining trend for 2025, and everyone wants to build and interact with their own AI agents. The most significant evolution for chatbots is the shift from text-based interaction to voice, enabling real-time, natural conversations with human-like agents.
And just like humans, agents need a memory to recall past conversations within a session.
Google Cloud offers two key technologies to make this happen:
- Gemini’s voice-first, real-time, bi-directional (BiDi) streaming, which allows for fluid, conversational AI experiences (currently in preview).
- Agent Engine, which provides a robust service for storing session history and remembering key pieces of past discussions.
Google also offers a versatile, out-of-the-box framework, the Agent Development Kit (ADK), that makes it quick and easy to use these tools.
A working version of my implementation combining these technologies is available on GitHub.
For my initial attempt, I naively took my existing BiDi streaming agent and integrated Agent Engine to handle session management.
I ran the code, and the session was created successfully.
However, when I started the first voice interaction, the response was a disaster. I received disjointed audio fragments, resulting in a completely unintelligible and broken experience.
You can replicate this failure yourself: in my code, open the main.py file and uncomment this line and the ones that follow it.
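For reference, the naive wiring looks roughly like this. It is a minimal sketch, not the exact code from the repository: PROJECT_ID, LOCATION, AGENT_ENGINE_ID, and the model name are placeholders, and constructor arguments may differ slightly between ADK versions.
from google.adk.agents import Agent
from google.adk.runners import Runner
from google.adk.sessions import VertexAiSessionService

PROJECT_ID = "my-project"        # placeholder
LOCATION = "us-central1"         # placeholder
AGENT_ENGINE_ID = "1234567890"   # placeholder: the Agent Engine resource ID

# Any Live-API-capable Gemini model should work here; the name below is an example.
root_agent = Agent(name="voice_agent", model="gemini-2.0-flash-live-001")

# Every event of the live run is persisted through this service,
# which is exactly what turns out to be too slow for BiDi audio.
session_service = VertexAiSessionService(project=PROJECT_ID, location=LOCATION)

runner = Runner(
    agent=root_agent,
    app_name=AGENT_ENGINE_ID,    # Agent Engine uses its resource ID as the app_name
    session_service=session_service,
)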
After digging into the logs, I understood the bottleneck.
When you interact with an agent, the conversation consists of a series of input and output events (text, images, audio, etc.).
When Gemini answers a question using live streaming, it doesn’t send a single audio file with the complete response. Instead, it streams dozens of small audio fragments per second. When played back in real-time without gaps, these fragments create a smooth and fluent voice.
The problem arises when trying to persist the session history. Agent Engine is designed to save every single event to the session. While this is lightning-fast for an in-memory session, using the Agent Engine’s REST API introduces a significant delay. Each call to append an event to the session takes about 200ms.
Because the API call is synchronous (or blocking), the main application thread has to wait for the save operation to complete before processing the next audio fragment. With dozens of fragments arriving every second, the 200ms latency per fragment is far too slow. This delay causes the application to deliver audio chunks sporadically, creating the fragmented and unnatural speech I experienced.
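A quick back-of-the-envelope calculation makes the mismatch obvious (the numbers below are illustrative, based on the rough figures above):
fragments_per_second = 25        # "dozens" of audio chunks per second
save_latency_s = 0.200           # ~200 ms per blocking append to Agent Engine

# Time spent saving for every one second of generated audio:
blocking_time = fragments_per_second * save_latency_s
print(f"{blocking_time:.1f}s of blocking saves per 1s of audio")  # -> 5.0s

# The pipeline falls roughly 4 seconds further behind for every second of speech,
# so chunks reach the client late and playback breaks apart.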
Furthermore, storing raw audio fragments in the session history has other downsides:
- Inefficiency: It consumes far more storage space than text.
- Context: It’s inefficient for the LLM to process raw audio to understand the conversational context.
- Cost: The large size of audio data increases the number of tokens consumed with each query (and increases latency).
The solution is to decouple the real-time audio stream from the session-saving process and to perform the saves asynchronously, so they never block the audio stream.
In addition, we can save only the text transcripts of the conversation to the Agent Engine session, which is more efficient and cost-effective.
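Here is a minimal sketch of that decoupling, assuming an asyncio-based server and ADK's asynchronous session service API; the queue, task, and helper names are mine, not part of the library.
import asyncio

# Transcript events are parked here instead of being saved inline
# with the audio stream.
save_queue: asyncio.Queue = asyncio.Queue()

async def persist_transcripts(session_service, session):
    """Background task: drains the queue and appends text-only events
    to the Agent Engine session, off the real-time audio path."""
    while True:
        event = await save_queue.get()
        try:
            # ~200 ms per call no longer matters: nothing is waiting on it.
            await session_service.append_event(session, event)
        finally:
            save_queue.task_done()

# In the streaming loop, the audio goes straight to the client and only
# the transcript event is enqueued:
#   await send_audio_to_client(audio_chunk)     # real-time path (hypothetical helper)
#   save_queue.put_nowait(transcript_event)     # deferred persistence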
But how do we get the audio transcript?
The Agent Development Kit (ADK) provides two powerful parameters in the agent’s run_config:
- output_audio_transcription: Transcribes the agent’s spoken response into text.
- input_audio_transcription: Transcribes the user’s spoken input into text.
from google.adk.agents.run_config import RunConfig
from google.genai.types import AudioTranscriptionConfig

modality = "AUDIO"  # live voice responses; use "TEXT" for text-only replies
run_config = RunConfig(
    response_modalities=[modality],
    output_audio_transcription=AudioTranscriptionConfig(),  # transcribe the agent's audio
    input_audio_transcription=AudioTranscriptionConfig(),   # transcribe the user's audio
)
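With those flags set, the live event stream carries text parts alongside the audio, and it is that text which gets queued for saving. The loop might look roughly like this; it is a sketch only, where send_audio_to_client, USER_ID, and SESSION_ID are placeholders and the run_live arguments and event attributes can vary between ADK releases.
async def stream_agent(runner, live_request_queue, run_config):
    """Forward audio to the client in real time; queue transcripts for saving."""
    async for event in runner.run_live(
        user_id=USER_ID,
        session_id=SESSION_ID,
        live_request_queue=live_request_queue,
        run_config=run_config,
    ):
        if event.content and event.content.parts:
            for part in event.content.parts:
                if part.inline_data:      # raw audio chunk: play it, don't persist it
                    await send_audio_to_client(part.inline_data.data)
                elif part.text:           # transcription text: cheap to store
                    save_queue.put_nowait(event)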
Source Credit: https://medium.com/google-cloud/agent-engine-with-gemini-bidi-streaming-session-workaround-2384ecdbc040?source=rss—-e52cf94d98af—4