If you’ve spent any time around local LLMs, you’ve almost certainly heard about MCP. The Model Context Protocol started life as a Claude-only thing in late 2024, and it has since turned into the de facto way for any model, local or cloud, to talk to outside tools. Adding an MCP server to your stack is the closest thing the AI world has to installing a browser extension. You point your model at it, the model gets new capabilities, and it makes your stack that little bit better.
The catch is that the ecosystem moves fast enough that half the guides on the internet are already out of date. Many of the MCP servers previously listed in Anthropic’s repo have been quietly moved to the archive list, with new servers taking their place. Mem0 archived its first-party MCP server in favour of pushing users to its hosted cloud. Home Assistant, which had a community-led MCP integration for months, now ships one in core. I discovered all of this when I sat down to build out my Pi setup. The ecosystem looks to be in a transitional period, but the build went better than I expected.
For those not in the know, Pi is a minimal terminal harness, and the Pi MCP extension reads a single pi.mcp.json file to wire up every server. After configuring a couple of Docker containers, my Qwen 3.6 35B-A3B running through vLLM had access to a vector store, an LLM-curated memory layer, a headless browser, a clean web scraper, and my entire smart home. Oh, and the self-hosted memory layer is the same memory layer as one that costs $249 a month.
A quick refresher on MCP
A common language for models to talk
If you’re new to MCP, I’ll explain it briefly. It’s a protocol that lets your LLM call out to tools through a standardised interface, the same way a browser extension can add capabilities to Chrome without Chrome having to ship them. An MCP server exposes a list of tools, an MCP client (in my case, Pi) discovers them, and the model picks which ones to call when it needs to do something it can’t do with text alone.
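To make that discover-then-call flow concrete, here’s a minimal Python sketch of the two JSON-RPC 2.0 messages involved, `tools/list` and `tools/call`. It only builds the messages and doesn’t talk to a real server. The `web_search` tool, its schema, and the canned reply are invented for illustration, and a real session would begin with an `initialize` handshake that is skipped here.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so replies can be matched up


def rpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request of the kind an MCP client sends."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Step 1: the client asks the server which tools it offers.
list_request = rpc_request("tools/list")

# Hypothetical reply from a search server (tool name and schema are made up).
list_response = {
    "jsonrpc": "2.0",
    "id": list_request["id"],
    "result": {
        "tools": [
            {
                "name": "web_search",
                "description": "Search the web and return result snippets.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            }
        ]
    },
}

# Step 2: the client indexes the tool descriptions it will show the model.
available = {tool["name"]: tool for tool in list_response["result"]["tools"]}

# Step 3: when the model picks a tool, the client sends tools/call.
call_request = rpc_request(
    "tools/call",
    {"name": "web_search", "arguments": {"query": "MCP registry"}},
)

print(json.dumps(call_request, indent=2))
```

The model never sees the transport. It sees the tool names, descriptions, and input schemas, and the client turns its choice into the `tools/call` message.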
The reason this matters for local models specifically is that a model running on your own hardware will struggle to out-reason cloud-based models, but that gap is at least partially down to the web access the cloud models have. With the right MCP servers, a local model can read your Obsidian vault, talk to your Home Assistant instance, query your Postgres database, scrape a page that’s behind a corporate VPN, and remember what you told it last week, all without a single byte leaving your network.
The official MCP reference-server repo has archived most of what used to be in it. The MCP team is steering people toward the new registry and toward first-party servers maintained by the actual product teams (like GitHub, Microsoft, Qdrant, and Home Assistant). If you’ve followed an older tutorial and struggled, that might be why.
What I already had running
SearXNG and Context7
I’ve written about SearXNG MCP before, and it’s still the single most useful server in my stack. A self-hosted SearXNG instance plus the community MCP wrapper gives a local model real web search without a Perplexity Pro subscription or a Tavily API key. That one always stays loaded.
Context7 is the other carry-over. Upstash maintains it, and it gives any model fresh library documentation for whatever you’re coding against. There’s a small asterisk in that the docs database is Upstash-hosted, so it isn’t fully offline, but for the trade of “my local model writes against the actual current Next.js API instead of hallucinating from 2023” I’m fine with that.
Those two stay. I’ve since added a lot more on top of them.
Mem0 and Qdrant gave me persistent memory for free
Mem0 Pro is $249 a month, and this isn’t
I run a single Qdrant container as a general-purpose vector store, and I run mem0’s OpenMemory MCP server in front of it as a curated memory layer. Together they replace Mem0 Pro, which is $249 a month if you want graph memory, and they do it on my own hardware with my own embedding model.
The first thing to be clear about is that Qdrant and mem0 aren’t alternatives; they’re layers of the same stack. Qdrant is a vector database, meaning that you put text in, it stores an embedding, and you can query it by similarity later. There’s no intelligence or deduplication logic behind that, and all it’s really doing is model-agnostic vector math. Mem0 is a memory system that runs on top of a vector store, and in my case that vector store is the same Qdrant container I run for everything else. Mem0 is what turns raw similarity search into something that actually behaves like memory: it decides what’s worth remembering, it dedupes against what’s already there, and it updates old facts when new ones arrive. One container, two MCP servers in front of it, two different jobs.
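As a rough illustration of what “just vector math” means, here’s a toy in-memory store in plain Python. It is not Qdrant’s API. The `ToyVectorStore` class, the hand-written three-dimensional vectors, and the example notes are all made up, and a real setup would get its vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Stores (vector, text) pairs and ranks them by similarity. Nothing more."""

    def __init__(self):
        self.points = []

    def upsert(self, vector, text):
        # No dedup and no judgement: every write is simply appended.
        self.points.append((vector, text))

    def search(self, query_vector, limit=3):
        scored = [(cosine(query_vector, vec), text) for vec, text in self.points]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:limit]


store = ToyVectorStore()
store.upsert([0.90, 0.10, 0.0], "note about hardware")
store.upsert([0.88, 0.12, 0.0], "near-duplicate note about hardware")
store.upsert([0.00, 0.20, 0.9], "note about smart home")

hits = store.search([1.0, 0.0, 0.0], limit=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

Both hardware notes come back, near-duplicate included, because nothing in the store decides that two entries mean the same thing. That gap is what a memory layer on top is there to fill.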
How mem0 works is important to understand as well. On every add call, it runs the conversation through an LLM twice. The first pass extracts candidate facts from what you said. The second pass reconciles each candidate against semantically similar memories already in Qdrant, then emits an event of ADD, UPDATE, DELETE, or NONE. That’s how mem0 knows to overwrite similar memories rather than writing new ones. Reads are cheap by comparison, because a search or list is just a Qdrant nearest-neighbour lookup with no LLM involved. Reads can also go through mem0, and all writes to the memory store go through it.
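The two-pass write can be sketched in a few lines of Python. This is not mem0’s code: both LLM passes are replaced with deterministic stand-ins (a `key=value` parser and an exact key match) so the control flow is runnable, and the function names and the `Event` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    action: str  # "ADD", "UPDATE", "DELETE", or "NONE"
    key: str
    value: str


def extract_facts(message):
    """Pass 1 stand-in: an LLM would pull candidate facts out of free text.
    Here the message is already 'key=value' pairs separated by ';'."""
    facts = []
    for part in message.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            facts.append((key.strip(), value.strip()))
    return facts


def reconcile(fact, memory):
    """Pass 2 stand-in: an LLM would compare the candidate with similar stored
    memories. Here 'similar' just means 'same key'; an empty value retracts."""
    key, value = fact
    if value == "":
        return Event("DELETE" if key in memory else "NONE", key, value)
    if key not in memory:
        return Event("ADD", key, value)
    if memory[key] == value:
        return Event("NONE", key, value)
    return Event("UPDATE", key, value)


def add(message, memory):
    """Run both passes and apply the resulting events to the store."""
    events = []
    for fact in extract_facts(message):
        event = reconcile(fact, memory)
        if event.action in ("ADD", "UPDATE"):
            memory[event.key] = event.value
        elif event.action == "DELETE":
            memory.pop(event.key)
        events.append(event)
    return events


memory = {}
first = add("gpu=24GB card; editor=vim", memory)
second = add("gpu=48GB card; editor=vim", memory)
third = add("editor=", memory)
print([e.action for e in first + second + third], memory)
```

In the real system both stand-ins are model calls, which is why the quality of the curator model matters and why writes are slower than reads.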
“Why run both MCP servers and not just one?” I hear you ask. The answer is that they handle different things. Mem0 is for conversation facts: what I told the model about my hardware, what I prefer, what’s running where. Raw Qdrant is for everything else I want semantic search over, including my Obsidian vault, scraped docs, and code snippets. Dumping Obsidian into mem0 would be a disaster, because mem0’s LLM would try to “understand” every note as it ingested it, and if you have a lot of notes, that’s going to be messy. Putting conversation facts in raw Qdrant would lose the dedup and update logic that makes mem0 worth the trouble in the first place. Rough rule: if you’d want the store to notice that a new fact contradicts an old one, use mem0; if you just want fuzzy lookup over a pile of text you already trust, point it at Qdrant directly.
As of when I last tested mem0, you’ll still need to clean the data store every few weeks if you actually use it. There are other rough edges too. Writes are slow, because every add is an LLM round trip before anything hits Qdrant. Swapping the embedder means recreating the container, because mem0 carries the embedding model in its own image and ships pre-computed vectors to Qdrant rather than the other way around. And the curator’s judgement is only as good as the model behind it, so a tiny local model will dedupe worse than a frontier one. Mem0 is a clear win over no memory at all, but it’s a curated memory layer at its heart.
There’s one more thing to be aware of, too: the official mem0ai/mem0-mcp repo is archived. Mem0 is pushing users to its cloud MCP instead. The working path now is the newer OpenMemory MCP container, which I run against vLLM as the curator and local sentence-transformers as the embedder.
Replaces paid subscriptions too
Crawl4AI is the other tool in this stack that replaces something you’d normally pay for. Firecrawl charges $16 a month for 3,000 clean markdown-from-URL extractions that Crawl4AI gives you for free, locally, in a Docker container.
To give you an idea of what it can do, I asked my local LLM to fetch the XDA home page and tell me what it could see on the page. The model picked the Crawl4AI tool, got back clean markdown, and produced a summary of the articles currently visible. It’s similar to what SearXNG can do, but Crawl4AI has more features than a simple web fetch. For example, it has data extraction and cosine similarity, session management, dynamic crawling, hooks, and a whole lot more.
Paired with SearXNG, it turns my MCP stack into a local research engine. Paired with everything else so far, I can search a topic to retrieve a list of URLs, get Crawl4AI to pull each one down as markdown, mem0 remembers what was interesting, and Qdrant lets me semantically search across the whole pile later.
Playwright MCP lets small models drive a browser
All thanks to accessibility controls
Microsoft maintains the Playwright MCP server directly, and it drives browsers using accessibility-tree snapshots rather than screenshots. That difference is the primary reason it runs at all on smaller local models that can’t do vision, though it’s also faster.
In the past, a lot of models that could use a browser would take a screenshot, send the image to a multimodal model, ask it where to click, and translate the answer back into a mouse coordinate. That worked fine for frontier models, but used to fall apart on local models. Playwright MCP sidesteps the whole problem by giving the model the accessibility tree directly, the same structured representation a screen reader would use. The model gets element IDs, roles, and labels in plain text, picks one, and the server clicks it.
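Here’s a small Python sketch of why that representation suits a text-only model. It is not Playwright MCP’s actual snapshot format or API. The `Node` type, the `ref` naming, and the three page elements are invented purely to show the shape of the exchange: the model reads text, answers with a reference, and the server resolves it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    ref: str   # identifier the model passes back to act on the element
    role: str  # e.g. "link", "button", "textbox"
    name: str  # accessible label, the same text a screen reader would announce


def render_snapshot(nodes):
    """Flatten the nodes into plain text that a non-vision model can read."""
    return "\n".join(f'- {n.role} "{n.name}" [ref={n.ref}]' for n in nodes)


def click(nodes, ref):
    """Stand-in for the server side: resolve a ref back to an element."""
    for n in nodes:
        if n.ref == ref:
            return f"clicked {n.role} {n.name!r}"
    raise KeyError(f"no element with ref {ref}")


page = [
    Node("e1", "link", "Home"),
    Node("e2", "textbox", "Search"),
    Node("e3", "button", "Sign in"),
]

snapshot = render_snapshot(page)
print(snapshot)

# The model only has to answer with a ref; no pixel coordinates are involved.
result = click(page, "e3")
print(result)
```

Choosing `e3` from three lines of text is an easy job for a small model. Finding the same button in a screenshot needs a vision model.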
Playwright is also just extremely powerful. I was able to ask it to go to XDA, retrieve my interests from its memories, then fetch an article that fit what it knows about me. It opened the browser, navigated, read the page, then presented the article to me to read. That’s a local model running on my own GPU, picking articles, reading them, all from what the model could retrieve in its database about me.
Playwright and Crawl4AI are similar tools, like mem0 and Qdrant, but they’re not identical. I use Crawl4AI for read-only fetches, though it has features similar to Playwright as well. However, I use Playwright for anything that needs an actual browser session, including logging in or clicking through pagination. If a static HTML pull would have worked, use Crawl4AI; if you’d otherwise be sitting in front of a browser, use Playwright. Running both costs nothing because Playwright doesn’t fire up Chromium until the model actually calls it, and you can set it up so the MCP server isn’t in the model’s context until you need it, so it’s not wasting tokens, either.
They close the gap significantly
If you’re running a local model and treating it as a sealed text box, you’re doing it wrong. A stack like this closes the gap between a model that can only work with what’s in its weights and one that can search the web, scrape a page into clean markdown, drive a browser, recall last week’s conversation, and turn off your lights, except none of it leaves your network unless you want it to.
This won’t be enough to beat a frontier model at most tasks, but you’d be surprised by how much the gap closes once your local model has the tools the cloud ones lean on. The paid equivalents aren’t cheap: Mem0 Pro at $249 a month, Firecrawl at $16, a search API on top of that. The self-hosted version costs you electricity, and as I’ve already shown, local LLMs aren’t actually that expensive to run.
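For a sense of scale, the quoted prices add up as follows. The figures are the ones given in this article; the search API is left out because no price is stated for it.

```python
MEM0_PRO_MONTHLY = 249    # USD per month, as quoted above
FIRECRAWL_MONTHLY = 16    # USD per month, as quoted above
FIRECRAWL_PAGES = 3_000   # extractions included in that Firecrawl price

monthly = MEM0_PRO_MONTHLY + FIRECRAWL_MONTHLY
yearly = monthly * 12
per_page = FIRECRAWL_MONTHLY / FIRECRAWL_PAGES

print(f"${monthly}/month, ${yearly}/year, ${per_page:.4f} per extraction")
```

That comes to $265 a month, or $3,180 a year, before any search API is added.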
There are two things to keep in mind, though. The ecosystem is growing rapidly; repos get archived, community integrations get pulled into core, replacements come along that are better than the incumbent, and first-party servers replace last year’s wrappers. Whatever you build now, expect to rebuild at least some part of it within the next year. The other thing is that every server you add costs a small model some attention. The setup that works isn’t the one with the most tools loaded; it’s the one with the five or six you’ll actually use, with the rest a config edit away. With Pi, you can unload tools so they aren’t actively presented in context, which helps.
For me, that’s all my setup needs: a vector store, a memory layer, a scraper, a browser, and web search, all on the hardware I already own.
