In Part 1, I argued that LLMs don’t have a memory problem; they have an architecture problem. Von Neumann solved this in 1945 by separating computation from memory and organising memory into layers. Most AI agents still haven’t learned that lesson. This article is about what I built instead.
The Problem With Every Agent Framework I Tried
Every agent framework gives you the same mental model: a pipeline.
User sends a message → retrieve documents or data (Databases or Tools or MCP) → inject into prompt → LLM generates response → save to database → repeat.
It works. Until it doesn’t. When the context grows, latency climbs. When memory updates block the response thread, the UX freezes. When two agents share state, race conditions appear. When the agent needs to learn from a conversation and respond at the same time, you have to choose one.
I decided to stop patching the pipeline and design something different. The result is the Agent Processor Unit (APU): an open-source architecture that applies hardware-level design principles to AI agent memory and task scheduling.
The APU: A Micro-Architecture for Agents
Just as a physical CPU splits responsibilities across specialised units (the ALU computes, the Control Unit orchestrates, the MMU manages memory), the APU does the same for an AI agent.

Four units. Each with a single responsibility. None of them blocking the others.
Unit 1: The Semantic MMU, a Memory Hierarchy for Agents
The most important insight from hardware architecture is that not all memory is equal. Fast memory is small. Slow memory is large. A good system manages the boundary between them automatically.
The APU implements this as a four-tier hierarchy:

The agent never touches all four tiers for every query. The MMU routes each request to the fastest tier that can answer it.
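The routing rule can be sketched in a few lines. This is an illustrative stand-in, not the real MMU: the tier names follow the article's hierarchy, but the dict-backed tiers and the `resolve` method are assumptions made for the example.

```python
# Hypothetical four-tier lookup, fastest tier first. Plain dicts stand in
# for the real stores (L1 cache, DLL context, Weaviate, Letta Cloud).
class TierRouter:
    def __init__(self):
        self.l1 = {}            # L1: in-process cache (~µs)
        self.dll = {}           # L2: active DLL context blocks
        self.vector_store = {}  # L3: Weaviate stand-in (~ms)
        self.archive = {}       # L4: Letta Cloud stand-in (slow, durable)

    def resolve(self, key):
        """Return the answer from the fastest tier that has it."""
        for tier_name, tier in [("L1", self.l1), ("L2", self.dll),
                                ("L3", self.vector_store), ("L4", self.archive)]:
            if key in tier:
                return tier_name, tier[key]
        return None, None  # cold miss: no tier knows this key

router = TierRouter()
router.vector_store["allergy"] = "peanuts"
print(router.resolve("allergy"))  # ('L3', 'peanuts')
```

A query stops at the first tier that answers; slower tiers are never consulted unless every faster one misses.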

The Living DLL, A Page Table for Context
Agent-Processor-Unit/apu/mmu/controller.py at main · EZFRICA/Agent-Processor-Unit
The MMU manages a Doubly Linked List of knowledge blocks. Each block is a discrete unit of memory: a travel preference, a user constraint, a learned fact. The DLL is the agent’s active working context.
Three block types, three positions in the list:
- temp → HEAD (most recent context, O(1) access)
- projet → mid-list (active planning, frequently referenced)
- fondamental → before TAIL (permanent knowledge, always reachable)
The DLL is not static. Every time a block is matched by a query, it moves to the front: the Move-to-Front heuristic from cache replacement theory. Frequently used knowledge stays warm. Rarely used knowledge drifts toward the tail and becomes a candidate for archival.
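The access-ordering behaviour can be sketched with an `OrderedDict`, whose `move_to_end(last=False)` is an O(1) move-to-front. The real DLL in apu/mmu/controller.py stores richer block metadata; this only shows the ordering described above.

```python
from collections import OrderedDict

# Illustrative Move-to-Front over the working context.
class WorkingContext:
    def __init__(self):
        self.blocks = OrderedDict()  # iteration order == HEAD..TAIL

    def add(self, block_id, content):
        self.blocks[block_id] = content
        self.blocks.move_to_end(block_id, last=False)  # new block at HEAD

    def touch(self, block_id):
        """A query matched this block: move it to HEAD in O(1)."""
        self.blocks.move_to_end(block_id, last=False)

ctx = WorkingContext()
ctx.add("b1", "prefers window seats")
ctx.add("b2", "allergic to peanuts")
ctx.touch("b1")                   # b1 becomes the warmest block again
print(list(ctx.blocks))           # ['b1', 'b2']
```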
LRU Paging and Page Faults
The context window has a hard limit: 12 active blocks. This is intentional. Prompt hygiene matters: an agent with 40 injected blocks is slower and less coherent than one with 12 well-chosen ones.
When a 13th block is needed, the MMU automatically pages out the least recently used block to L3 (Weaviate). When a query later references that paged-out block, the MMU detects a page fault and pages it back in, exactly like a virtual memory system.
```python
# A real glimpse into the paging mechanism
if dll["dynamic_block_count"] >= dll["dynamic_block_max"]:
    # Semantic MMU: page out the least recently used block to make room
    dynamic_nodes = [n for n in dll["nodes"].values() if not n.get("is_fixed")]
    lru_node = min(dynamic_nodes, key=lambda x: x.get("last_accessed", "1970-01-01T00:00:00"))
    await page_out_block(lru_node["id"], dll)
```
A typical page fault scenario:
1. Memory is full (12/12 blocks active)
2. A new block is needed → the LRU block is paged out to Weaviate
3. The user asks about the evicted topic
4. The MMU detects the miss → fetches from L3 → pages the block back in
5. The agent responds with zero visible interruption
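The page-in side of that scenario can be sketched as a miss handler. This is a hypothetical complement to the page-out snippet above: `dll` and `l3_store` are plain dicts standing in for the real MMU state and Weaviate.

```python
MAX_ACTIVE = 12  # the article's hard limit on active blocks

def handle_query(topic, dll, l3_store):
    if topic in dll:                        # hit: block is already active
        return dll[topic]
    if topic in l3_store:                   # page fault: block was evicted
        if len(dll) >= MAX_ACTIVE:          # make room before paging in
            lru = min(dll, key=lambda k: dll[k]["last_accessed"])
            l3_store[lru] = dll.pop(lru)    # page out the LRU block
        dll[topic] = l3_store.pop(topic)    # page the block back in
        return dll[topic]
    return None                             # cold miss: nothing known

dll = {f"b{i}": {"last_accessed": i} for i in range(12)}
l3 = {"trip_paris": {"last_accessed": 0}}
handle_query("trip_paris", dll, l3)         # forces a page fault
print("trip_paris" in dll, "b0" in dll)     # True False
```

The block with the smallest `last_accessed` (`b0`) is evicted to make room, and the evicted topic is active again without the caller noticing.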
You can test this directly in the repository. The logs show every page-out and page-in event in real time.
Unit 2: The Semantic Scheduler, Asynchronous Everything

Agent-Processor-Unit/apu/core/scheduler.py at main · EZFRICA/Agent-Processor-Unit
This is where the architecture diverges most sharply from the pipeline model.
In a standard agent, memory writes block the response. The user asks a question, the agent learns something new, saves it to the database, and then responds. That sequential dependency is the root cause of most latency issues.
The APU Scheduler breaks this dependency entirely, using an asyncio.PriorityQueue behind a thread-safe bridge.

The Scheduler assigns every task an explicit priority:
# Four priority levels
Priority 0 (Real-time) → User-facing LLM response
Priority 1 (High) → Tool execution (agent is waiting)
Priority 2 (Low) → Write-back to L4 (Letta Cloud)
Priority 3 (Idle) → GC, Move-to-Front reordering, block detection
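The ordering guarantee behind those four levels is just `asyncio.PriorityQueue` semantics: lower numbers drain first, so a user-facing response always preempts a queued write-back. A minimal, self-contained sketch (task names are the article's; the enqueue order is arbitrary):

```python
import asyncio

async def enqueue_and_drain():
    queue = asyncio.PriorityQueue()
    await queue.put((2, "SYNC_L4"))        # background write-back to Letta
    await queue.put((0, "LLM_RESPONSE"))   # the user is waiting on this
    await queue.put((3, "GC"))             # idle-time maintenance
    await queue.put((1, "TOOL_CALL"))      # agent blocked on a tool result
    # Drain: tuples come out sorted by their first element (the priority)
    return [(await queue.get())[1] for _ in range(4)]

order = asyncio.run(enqueue_and_drain())
print(order)  # ['LLM_RESPONSE', 'TOOL_CALL', 'SYNC_L4', 'GC']
```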
The interaction flow becomes:
User: "I am allergic to peanuts."
│
├─► L1/L3 write (immediate, ~10ms)
│
├─► Response to user: "Noted. I'll avoid peanuts." ← user reads this NOW
│
└─► Scheduler pushes SYNC_L4 task (background, invisible)
│
└─► Letta Cloud updated 2 seconds later
└─► User never saw this happen
Here is the exact code pushing the heavy API call to the background worker loop, without blocking the user interface:
```python
async def push(self, task_type: str, payload: dict, priority: int = 2):
    """Add a task to the queue (Thread-safe from any loop)."""
    task = APUTask(priority=priority, task_type=task_type, payload=payload)
    # Safely schedule the queue.put coroutine on the worker's thread loop
    asyncio.run_coroutine_threadsafe(self.queue.put(task), self._loop)
```
The user gets an instant response. Letta gets updated in the background. If Letta is down, the task retries with exponential backoff up to 3 times. The conversation is never interrupted.
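The retry behaviour can be sketched as follows. This is a hedged illustration of "exponential backoff up to 3 times", not the Scheduler's actual code: `flaky_sync` is a stand-in for the real L4 write-back call, and the delays are shortened for the example.

```python
import asyncio

async def sync_with_backoff(sync_fn, payload, retries=3, base_delay=0.01):
    """Retry a background sync with exponential backoff (1x, 2x, 4x...)."""
    for attempt in range(retries):
        try:
            return await sync_fn(payload)
        except ConnectionError:
            if attempt == retries - 1:
                raise                                   # give up after the last try
            await asyncio.sleep(base_delay * 2 ** attempt)

attempts = []
async def flaky_sync(payload):
    attempts.append(payload)
    if len(attempts) < 3:                               # fail twice, then succeed
        raise ConnectionError("Letta unreachable")
    return "synced"

result = asyncio.run(sync_with_backoff(flaky_sync, {"block": "b1"}))
print(result, len(attempts))  # synced 3
```

Because this runs on the background worker, the two failed attempts and the waits between them never touch the user-facing response.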
Hot Path vs. Cold Path
This separation defines the two paths in the system:
Hot path (interaction, real-time): BMJ selection → L1 Cache → Weaviate → LLM response
Cold path (persistence, background): Write-back → Letta Cloud
Weaviate is the source of truth during a conversation. Letta is a cold archive that is always eventually consistent but never on the critical path. This is exactly how modern databases separate read replicas from write-ahead logs.
Unit 3: The BMJ Algorithm, the TLB of the APU

Agent-Processor-Unit/apu/mmu/controller.py at main · EZFRICA/Agent-Processor-Unit
The Bidirectional Metadata Jump is the routing engine. It answers one question before every response: where is the answer in working memory, or in the archive?
The BMJ traverses the DLL from HEAD, scoring each block using keyword overlap and recency. Two thresholds govern its behavior:
- Above 0.80 confidence → answer is in the DLL, skip Weaviate entirely
- Below 0.80 → escalate to L3 vector search
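That routing decision can be sketched as a scoring pass over the DLL. The blend of keyword overlap and recency, the weights, and the half-life are assumptions made for this example; only the 0.80 threshold comes from the article.

```python
# Illustrative BMJ-style scoring: keyword overlap blended with recency.
def score_block(query_words, block, now, half_life=600.0):
    block_words = set(block["text"].lower().split())
    overlap = len(query_words & block_words) / max(len(query_words), 1)
    recency = 1.0 / (1.0 + (now - block["last_accessed"]) / half_life)
    return 0.7 * overlap + 0.3 * recency  # assumed weights

def route(query, dll_blocks, now, threshold=0.80):
    query_words = set(query.lower().split())
    best = max((score_block(query_words, b, now) for b in dll_blocks), default=0.0)
    return "DLL" if best >= threshold else "L3_VECTOR_SEARCH"

blocks = [{"text": "user allergic to peanuts", "last_accessed": 100.0}]
print(route("allergic to peanuts", blocks, now=110.0))  # DLL
print(route("weather in tokyo", blocks, now=110.0))     # L3_VECTOR_SEARCH
```

A confident match short-circuits to the working context; anything uncertain falls through to the vector search.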
In hardware terms, the BMJ is the Translation Lookaside Buffer (TLB): the fast-path resolver that avoids the full page table walk. Most queries are answered in L1 or the DLL. Weaviate is the fallback, not the default.
The result: 78% L1 cache hit rate in measured conversations, meaning 78% of queries never touch the vector database.
Unit 4: The TEU, a Cache for the Real World
Agent-Processor-Unit/apu/teu/controller.py at main · EZFRICA/Agent-Processor-Unit
Tool calls are expensive. A Google Search takes 300–800ms. If a user asks the same question twice in five minutes, making two API calls is wasteful.
The Tool Execution Unit (TEU) wraps every external call in an L1 IO Cache with a 15-minute TTL. The second identical query returns in under 1ms. The TEU also runs tools in isolation: a failing search never propagates an exception to the main agent thread.
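A minimal sketch of such a wrapper, assuming a simple `(tool, args)` cache key; the 15-minute TTL matches the article, but the class and the `google_search` stand-in are illustrative, not the real TEU.

```python
import time

class ToolCache:
    """TTL cache over tool calls, with failures swallowed in isolation."""
    def __init__(self, ttl=15 * 60):
        self.ttl = ttl
        self._store = {}  # key -> (expiry_timestamp, result)

    def call(self, tool_fn, *args):
        key = (tool_fn.__name__, args)
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]                  # cache hit: no API call
        try:
            result = tool_fn(*args)          # isolated tool execution
        except Exception:
            return None                      # failure never propagates
        self._store[key] = (time.monotonic() + self.ttl, result)
        return result

calls = []
def google_search(query):                    # stand-in for the real tool
    calls.append(query)
    return f"results for {query}"

cache = ToolCache()
cache.call(google_search, "flights to Abidjan")
cache.call(google_search, "flights to Abidjan")  # served from cache
print(len(calls))  # 1
```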
The Force Override, Human-in-the-Loop as a First-Class Feature
Agent-Processor-Unit/dashboard/app.py at main · EZFRICA/Agent-Processor-Unit
Every other agent memory system I have seen is a black box. You can observe the output but not the reasoning. You cannot inspect what the agent is about to use as context, and you cannot correct it without restarting the conversation.
The APU Dashboard exposes the full memory state in real time. An operator can:
- See every active block and its position in the DLL
- Disable a block before it reaches the LLM
- Promote a block to HEAD position
- Inject a correction mid-conversation
- Roll back a write-back that introduced incorrect information
This is the interrupt mechanism from the hardware analogy. A human operator can fire an interrupt at any point in the pipeline, preempt the current context, and take control. The agent does not restart. The memory does not reset.
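The disable/promote semantics can be sketched over a toy context. These helpers are hypothetical; the real dashboard in dashboard/app.py drives the MMU directly, and the flat `dll`/`order` structures here exist only for illustration.

```python
def disable_block(dll, block_id):
    """Operator masks a block so it is skipped at prompt-assembly time."""
    dll[block_id]["enabled"] = False

def promote_block(order, block_id):
    """Operator moves a block to HEAD so it is injected first."""
    order.remove(block_id)
    order.insert(0, block_id)

def assemble_context(dll, order):
    """Build the context the LLM will actually see."""
    return [dll[b]["text"] for b in order if dll[b].get("enabled", True)]

dll = {"b1": {"text": "stale hotel booking"}, "b2": {"text": "peanut allergy"}}
order = ["b1", "b2"]
disable_block(dll, "b1")          # mask the stale fact mid-conversation
promote_block(order, "b2")        # surface the safety-critical one
print(assemble_context(dll, order))  # ['peanut allergy']
```

The conversation continues with the corrected context; nothing restarts and nothing resets.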
The Architecture in One Table

What This Is Not
This is not a wrapper around LangChain. It is not a prompt engineering trick. It is not a bigger context window.
It is a memory architecture: one that treats the agent’s knowledge as a managed resource with explicit tiers, explicit eviction policies, explicit consistency guarantees, and explicit human override.
The context window bottleneck is real, but it is a symptom. The disease is treating memory as a flat, unstructured blob that you inject wholesale into every prompt. The APU treats memory the way hardware architects have treated it for 40 years: as a hierarchy where speed, size, and persistence are always in tension, and the system’s job is to manage that tension automatically.
Try It
The full architecture is open-source.
Repository: github.com/EZFRICA/Agent-Processor-Unit
Start with the Page Fault test: fill the memory to 12 blocks, add a 13th, then ask about the evicted topic and watch the MMU page it back in. It takes 5 minutes and makes the architecture immediately concrete.
The next article in this series will cover the Semantic Scheduler in depth: multi-agent concurrency, token-budget management, and real-time interrupt handling during tool execution.
Ezekias Bokove is a Google Developer Expert for Google Cloud and founder of GDG Cloud Abidjan & GDGs Afrique Francophone. He writes about Cloud, AI, agent architecture, and distributed systems.
From Agent OS to Agent Processor Unit was originally published in Google Cloud – Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
