In Part 1, I argued that LLMs don’t have a memory problem; they have an architecture problem. Von Neumann solved this in 1945 by separating computation from memory and organising memory into layers. Most AI agents still haven’t learned that lesson. This article is about what I built instead.
The Problem With Every Agent Framework I Tried
Every agent framework gives you the same mental model: a pipeline.
User sends a message → retrieve documents or data (Databases or Tools or MCP) → inject into prompt → LLM generates response → save to database → repeat.
It works. Until it doesn’t. When the context grows, latency climbs. When memory updates block the response thread, the UX freezes. When two agents share state, race conditions appear. When the agent needs to learn from a conversation and respond at the same time, you have to choose one.
I decided to stop patching the pipeline and design something different. The result is the Agent Processor Unit (APU): an open-source architecture that applies hardware-level design principles to AI agent memory and task scheduling.
The APU: A Micro-Architecture for Agents
Just as a physical CPU splits responsibilities across specialised units (the ALU computes, the Control Unit orchestrates, the MMU manages memory), the APU does the same for an AI agent.

Four units. Each with a single responsibility. None of them blocking the others.
Unit 1: The Semantic MMU, a Memory Hierarchy for Agents
The most important insight from hardware architecture is that not all memory is equal. Fast memory is small. Slow memory is large. A good system manages the boundary between them automatically.
The APU implements this as a four-tier hierarchy:

The agent never touches all four tiers for every query. The MMU routes each request to the fastest tier that can answer it.
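The routing rule can be sketched in a few lines. This is an illustrative stand-in, not the real MMU: the tier names follow the article's hierarchy, but the dict-backed tiers and the `resolve` method are assumptions made for the example.

```python
# Hypothetical four-tier lookup, fastest tier first. Plain dicts stand in
# for the real stores (L1 cache, DLL context, Weaviate, Letta Cloud).
class TierRouter:
    def __init__(self):
        self.l1 = {}            # L1: in-process cache (~µs)
        self.dll = {}           # L2: active DLL context blocks
        self.vector_store = {}  # L3: Weaviate stand-in (~ms)
        self.archive = {}       # L4: Letta Cloud stand-in (slow, durable)

    def resolve(self, key):
        """Return the answer from the fastest tier that has it."""
        for tier_name, tier in [("L1", self.l1), ("L2", self.dll),
                                ("L3", self.vector_store), ("L4", self.archive)]:
            if key in tier:
                return tier_name, tier[key]
        return None, None  # cold miss: no tier knows this key

router = TierRouter()
router.vector_store["allergy"] = "peanuts"
print(router.resolve("allergy"))  # ('L3', 'peanuts')
```

A query stops at the first tier that answers; slower tiers are never consulted unless every faster one misses.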

The Living DLL, A Page Table for Context
Agent-Processor-Unit/apu/mmu/controller.py at main · EZFRICA/Agent-Processor-Unit
The MMU manages a Doubly Linked List of knowledge blocks. Each block is a discrete unit of memory: a travel preference, a user constraint, a learned fact. The DLL is the agent’s active working context.
Three block types, three positions in the list:
- temp → HEAD (most recent context, O(1) access)
- projet → mid-list (active planning, frequently referenced)
- fondamental → before TAIL (permanent knowledge, always reachable)
The DLL is not static. Every time a block is matched by a query, it moves to the front: the Move-to-Front heuristic from cache replacement theory. Frequently used knowledge stays warm. Rarely used knowledge drifts toward the tail and becomes a candidate for archival.
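The access-ordering behaviour can be sketched with an `OrderedDict`, whose `move_to_end(last=False)` is an O(1) move-to-front. The real DLL in apu/mmu/controller.py stores richer block metadata; this only shows the ordering described above.

```python
from collections import OrderedDict

# Illustrative Move-to-Front over the working context.
class WorkingContext:
    def __init__(self):
        self.blocks = OrderedDict()  # iteration order == HEAD..TAIL

    def add(self, block_id, content):
        self.blocks[block_id] = content
        self.blocks.move_to_end(block_id, last=False)  # new block at HEAD

    def touch(self, block_id):
        """A query matched this block: move it to HEAD in O(1)."""
        self.blocks.move_to_end(block_id, last=False)

ctx = WorkingContext()
ctx.add("b1", "prefers window seats")
ctx.add("b2", "allergic to peanuts")
ctx.touch("b1")                   # b1 becomes the warmest block again
print(list(ctx.blocks))           # ['b1', 'b2']
```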
LRU Paging and Page Faults
The context window has a hard limit: 12 active blocks. This is intentional. Prompt hygiene matters: an agent with 40 injected blocks is slower and less coherent than one with 12 well-chosen ones.
When a 13th block is needed, the MMU automatically pages out the least recently used block to L3 (Weaviate). When a query later references that paged-out block, the MMU detects a page fault and pages it back in, exactly like a virtual memory system.
```python
# A real glimpse into the paging mechanism
if dll["dynamic_block_count"] >= dll["dynamic_block_max"]:
    # Semantic MMU: page out the least recently used block to make room
    dynamic_nodes = [n for n in dll["nodes"].values() if not n.get("is_fixed")]
    lru_node = min(dynamic_nodes, key=lambda x: x.get("last_accessed", "1970-01-01T00:00:00"))
    await page_out_block(lru_node["id"], dll)
```
A typical page fault scenario:
1. Memory is full (12/12 blocks active)
2. A new block is needed → the LRU block is paged out to Weaviate
3. The user asks about the evicted topic
4. The MMU detects the miss → fetches from L3 → pages the block back in
5. The agent responds with zero visible interruption
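The page-in side of that scenario can be sketched as a miss handler. This is a hypothetical complement to the page-out snippet above: `dll` and `l3_store` are plain dicts standing in for the real MMU state and Weaviate.

```python
MAX_ACTIVE = 12  # the article's hard limit on active blocks

def handle_query(topic, dll, l3_store):
    if topic in dll:                        # hit: block is already active
        return dll[topic]
    if topic in l3_store:                   # page fault: block was evicted
        if len(dll) >= MAX_ACTIVE:          # make room before paging in
            lru = min(dll, key=lambda k: dll[k]["last_accessed"])
            l3_store[lru] = dll.pop(lru)    # page out the LRU block
        dll[topic] = l3_store.pop(topic)    # page the block back in
        return dll[topic]
    return None                             # cold miss: nothing known

dll = {f"b{i}": {"last_accessed": i} for i in range(12)}
l3 = {"trip_paris": {"last_accessed": 0}}
handle_query("trip_paris", dll, l3)         # forces a page fault
print("trip_paris" in dll, "b0" in dll)     # True False
```

The block with the smallest `last_accessed` (`b0`) is evicted to make room, and the evicted topic is active again without the caller noticing.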
You can test this directly in the repository. The logs show every page-out and page-in event in real time.
Unit 2: The Semantic Scheduler, Asynchronous Everything

Agent-Processor-Unit/apu/core/scheduler.py at main · EZFRICA/Agent-Processor-Unit
This is where the architecture diverges most sharply from the pipeline model.
In a standard agent, memory writes block the response. The user asks a question, the agent learns something new, saves it to the database, and then responds. That sequential dependency is the root cause of most latency issues.
The APU Scheduler breaks this dependency entirely, using an asyncio.PriorityQueue behind a thread-safe bridge.

The Scheduler assigns every task an explicit priority:
# Four priority levels
Priority 0 (Real-time) → User-facing LLM response
Priority 1 (High) → Tool execution (agent is waiting)
Priority 2 (Low) → Write-back to L4 (Letta Cloud)
Priority 3 (Idle) → GC, Move-to-Front reordering, block detection
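The ordering guarantee behind those four levels is just `asyncio.PriorityQueue` semantics: lower numbers drain first, so a user-facing response always preempts a queued write-back. A minimal, self-contained sketch (task names are the article's; the enqueue order is arbitrary):

```python
import asyncio

async def enqueue_and_drain():
    queue = asyncio.PriorityQueue()
    await queue.put((2, "SYNC_L4"))        # background write-back to Letta
    await queue.put((0, "LLM_RESPONSE"))   # the user is waiting on this
    await queue.put((3, "GC"))             # idle-time maintenance
    await queue.put((1, "TOOL_CALL"))      # agent blocked on a tool result
    # Drain: tuples come out sorted by their first element (the priority)
    return [(await queue.get())[1] for _ in range(4)]

order = asyncio.run(enqueue_and_drain())
print(order)  # ['LLM_RESPONSE', 'TOOL_CALL', 'SYNC_L4', 'GC']
```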
The interaction flow becomes:
User: "I am allergic to peanuts."
│
├─► L1/L3 write (immediate, ~10ms)
│
├─► Response to user: "Noted. I'll avoid peanuts." ← user reads this NOW
│
└─► Scheduler pushes SYNC_L4 task (background, invisible)
│
└─► Letta Cloud updated 2 seconds later
└─► User never saw this happen
Here is the exact code pushing the heavy API call to the background worker loop, without blocking the user interface:
```python
async def push(self, task_type: str, payload: dict, priority: int = 2):
    """Add a task to the queue (Thread-safe from any loop)."""
    task = APUTask(priority=priority, task_type=task_type, payload=payload)
    # Safely schedule the queue.put coroutine on the worker's thread loop
    asyncio.run_coroutine_threadsafe(self.queue.put(task), self._loop)
```
The user gets an instant response. Letta gets updated in the background. If Letta is down, the task retries with exponential backoff up to 3 times. The conversation is never interrupted.
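The retry behaviour can be sketched as follows. This is a hedged illustration of "exponential backoff up to 3 times", not the Scheduler's actual code: `flaky_sync` is a stand-in for the real L4 write-back call, and the delays are shortened for the example.

```python
import asyncio

async def sync_with_backoff(sync_fn, payload, retries=3, base_delay=0.01):
    """Retry a background sync with exponential backoff (1x, 2x, 4x...)."""
    for attempt in range(retries):
        try:
            return await sync_fn(payload)
        except ConnectionError:
            if attempt == retries - 1:
                raise                                   # give up after the last try
            await asyncio.sleep(base_delay * 2 ** attempt)

attempts = []
async def flaky_sync(payload):
    attempts.append(payload)
    if len(attempts) < 3:                               # fail twice, then succeed
        raise ConnectionError("Letta unreachable")
    return "synced"

result = asyncio.run(sync_with_backoff(flaky_sync, {"block": "b1"}))
print(result, len(attempts))  # synced 3
```

Because this runs on the background worker, the two failed attempts and the waits between them never touch the user-facing response.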
Hot Path vs. Cold Path
This separation defines the two paths in the system:
Hot path (interaction, real-time): BMJ selection → L1 Cache → Weaviate → LLM response
Cold path (persistence, background): Write-back → Letta Cloud
Weaviate is the source of truth during a conversation. Letta is a cold archive that is always eventually consistent but never on the critical path. This is exactly how modern databases separate read replicas from write-ahead logs.
Unit 3: The BMJ Algorithm, the TLB of the APU

Agent-Processor-Unit/apu/mmu/controller.py at main · EZFRICA/Agent-Processor-Unit
The Bidirectional Metadata Jump is the routing engine. It answers one question before every response: where is the answer in working memory, or in the archive?
The BMJ traverses the DLL from HEAD, scoring each block using keyword overlap and recency. Two thresholds govern its behavior:
- Above 0.80 confidence → answer is in the DLL, skip Weaviate entirely
- Below 0.80 → escalate to L3 vector search
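That routing decision can be sketched as a scoring pass over the DLL. The blend of keyword overlap and recency, the weights, and the half-life are assumptions made for this example; only the 0.80 threshold comes from the article.

```python
# Illustrative BMJ-style scoring: keyword overlap blended with recency.
def score_block(query_words, block, now, half_life=600.0):
    block_words = set(block["text"].lower().split())
    overlap = len(query_words & block_words) / max(len(query_words), 1)
    recency = 1.0 / (1.0 + (now - block["last_accessed"]) / half_life)
    return 0.7 * overlap + 0.3 * recency  # assumed weights

def route(query, dll_blocks, now, threshold=0.80):
    query_words = set(query.lower().split())
    best = max((score_block(query_words, b, now) for b in dll_blocks), default=0.0)
    return "DLL" if best >= threshold else "L3_VECTOR_SEARCH"

blocks = [{"text": "user allergic to peanuts", "last_accessed": 100.0}]
print(route("allergic to peanuts", blocks, now=110.0))  # DLL
print(route("weather in tokyo", blocks, now=110.0))     # L3_VECTOR_SEARCH
```

A confident match short-circuits to the working context; anything uncertain falls through to the vector search.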
In hardware terms, the BMJ is the Translation Lookaside Buffer (TLB): the fast-path resolver that avoids the full page table walk. Most queries are answered in L1 or the DLL. Weaviate is the fallback, not the default.
The result: 78% L1 cache hit rate in measured conversations, meaning 78% of queries never touch the vector database.
Unit 4: The TEU, a Cache for the Real World
Agent-Processor-Unit/apu/teu/controller.py at main · EZFRICA/Agent-Processor-Unit
Tool calls are expensive. A Google Search takes 300–800ms. If a user asks the same question twice in five minutes, making two API calls is wasteful.
The Tool Execution Unit (TEU) wraps every external call in an L1 IO Cache with a 15-minute TTL. The second identical query returns in under 1ms. The TEU also runs tools in isolation: a failing search never propagates an exception to the main agent thread.
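A minimal sketch of such a wrapper, assuming a simple `(tool, args)` cache key; the 15-minute TTL matches the article, but the class and the `google_search` stand-in are illustrative, not the real TEU.

```python
import time

class ToolCache:
    """TTL cache over tool calls, with failures swallowed in isolation."""
    def __init__(self, ttl=15 * 60):
        self.ttl = ttl
        self._store = {}  # key -> (expiry_timestamp, result)

    def call(self, tool_fn, *args):
        key = (tool_fn.__name__, args)
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]                  # cache hit: no API call
        try:
            result = tool_fn(*args)          # isolated tool execution
        except Exception:
            return None                      # failure never propagates
        self._store[key] = (time.monotonic() + self.ttl, result)
        return result

calls = []
def google_search(query):                    # stand-in for the real tool
    calls.append(query)
    return f"results for {query}"

cache = ToolCache()
cache.call(google_search, "flights to Abidjan")
cache.call(google_search, "flights to Abidjan")  # served from cache
print(len(calls))  # 1
```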
The Force Override, Human-in-the-Loop as a First-Class Feature
Agent-Processor-Unit/dashboard/app.py at main · EZFRICA/Agent-Processor-Unit
Every other agent memory system I have seen is a black box. You can observe the output but not the reasoning. You cannot inspect what the agent is about to use as context, and you cannot correct it without restarting the conversation.
The APU Dashboard exposes the full memory state in real time. An operator can:
- See every active block and its position in the DLL
- Disable a block before it reaches the LLM
- Promote a block to HEAD position
- Inject a correction mid-conversation
- Roll back a write-back that introduced incorrect information
This is the interrupt mechanism from the hardware analogy. A human operator can fire an interrupt at any point in the pipeline, preempt the current context, and take control. The agent does not restart. The memory does not reset.
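The disable/promote semantics can be sketched over a toy context. These helpers are hypothetical; the real dashboard in dashboard/app.py drives the MMU directly, and the flat `dll`/`order` structures here exist only for illustration.

```python
def disable_block(dll, block_id):
    """Operator masks a block so it is skipped at prompt-assembly time."""
    dll[block_id]["enabled"] = False

def promote_block(order, block_id):
    """Operator moves a block to HEAD so it is injected first."""
    order.remove(block_id)
    order.insert(0, block_id)

def assemble_context(dll, order):
    """Build the context the LLM will actually see."""
    return [dll[b]["text"] for b in order if dll[b].get("enabled", True)]

dll = {"b1": {"text": "stale hotel booking"}, "b2": {"text": "peanut allergy"}}
order = ["b1", "b2"]
disable_block(dll, "b1")          # mask the stale fact mid-conversation
promote_block(order, "b2")        # surface the safety-critical one
print(assemble_context(dll, order))  # ['peanut allergy']
```

The conversation continues with the corrected context; nothing restarts and nothing resets.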
The Architecture in One Table

What This Is Not
This is not a wrapper around LangChain. It is not a prompt engineering trick. It is not a bigger context window.
It is a memory architecture: one that treats the agent’s knowledge as a managed resource with explicit tiers, explicit eviction policies, explicit consistency guarantees, and explicit human override.
The context window bottleneck is real, but it is a symptom. The disease is treating memory as a flat, unstructured blob that you inject wholesale into every prompt. The APU treats memory the way hardware architects have treated it for 40 years: as a hierarchy where speed, size, and persistence are always in tension, and the system’s job is to manage that tension automatically.
Try It
The full architecture is open-source.
Repository: github.com/EZFRICA/Agent-Processor-Unit
Start with the Page Fault test: fill the memory to 12 blocks, add a 13th, then ask about the evicted topic and watch the MMU page it back in. It takes 5 minutes and makes the architecture immediately concrete.
The next article in this series will cover the Semantic Scheduler in depth: multi-agent concurrency, token-budget management, and real-time interrupt handling during tool execution.
Ezekias Bokove is a Google Developer Expert for Google Cloud and founder of GDG Cloud Abidjan & GDGs Afrique Francophone. He writes about Cloud, AI, agent architecture, and distributed systems.
From Agent OS to Agent Processor Unit was originally published in Google Cloud – Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
