
Running a Large Language Model (LLM) like Gemma 3N on an iPhone requires a fundamental shift in mindset. As a cloud developer, I’m used to infinite RAM and simple API calls to models like Gemini. But for this project, those luxuries were gone.
The goal was strict: build a mobile app with a bundled LLM, an embedding model, and a vector database — all operating fast enough to be usable, and entirely without an internet connection. Here is how we squeezed that much power into a pocket-sized device.
Just Your Typical RAG Chatbot, but Not
The goal was to create an iPhone RAG (retrieval augmented generation) chatbot app capable of answering incredibly complex, technical questions about the maintenance of industrial equipment. The source of truth for those questions was a single 350-page PDF reference document jammed with complicated tables, images, and detailed text.
To make this work, I needed a hybrid approach that combines keyword matching with semantic (vector) search. But here's the catch: semantic search requires an embedding model running locally. Suddenly, the limited memory budget isn't just for the LLM; it has to be shared with the embedding model, the vector database, and the app logic itself.
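To make the hybrid idea concrete, here is a minimal sketch of how keyword and semantic scores might be combined into one ranking. The SearchHit type, the scoring helpers it assumes, and the 50/50 weighting are all hypothetical illustrations, not the app's actual implementation.

// Hypothetical types for illustration only.
struct SearchHit {
    let chunkID: Int
    let keywordScore: Double   // e.g. a BM25-style score from keyword matching
    let vectorScore: Double    // e.g. cosine similarity from the embedding model
}

// Combine keyword and semantic scores into a single ranking.
// The equal weights are an assumption; in practice they would be tuned on sample queries.
func hybridRank(_ hits: [SearchHit],
                keywordWeight: Double = 0.5,
                vectorWeight: Double = 0.5) -> [SearchHit] {
    hits.sorted {
        (keywordWeight * $0.keywordScore + vectorWeight * $0.vectorScore) >
        (keywordWeight * $1.keywordScore + vectorWeight * $1.vectorScore)
    }
}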
So given the LLM, a hybrid RAG database, and a separate embedding model, the big question is this: how do you fit all of that into a single iPhone application, when most LLMs are very large (not just in parameter count, but in sheer gigabytes), and most phones have far less memory than desktops or cloud-based machines?
The first step is to choose a model.
Choosing a Model
Since my target hardware was an iPhone 16, I had a hard ceiling of 8 GB of RAM. The OS and the app code consume roughly 3 GB of that, leaving a very tight budget for the models and the database. Finding an effective LLM that can run in a reasonable 3 or 4 GB of memory can be a challenge.
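As a quick sanity check, you can confirm the device's physical RAM at runtime and log a rough budget. The 3 GB overhead figure below is just the estimate mentioned above, not a measured value.

import Foundation

// Rough memory budget check (illustrative only).
let totalRAM = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824 // bytes -> GiB
let assumedSystemAndAppOverhead = 3.0 // rough estimate, in GiB
let modelBudget = totalRAM - assumedSystemAndAppOverhead
print(String(format: "Physical RAM: %.1f GiB, rough budget for models + DB: %.1f GiB",
             totalRAM, modelBudget))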
To narrow down the candidates, I didn't just look at benchmarks. I used Ollama on my desktop to host multiple quantized small-scale models, feeding them specific questions related to the industry this app is for. Ollama made it easy to pose the same set of sample questions to multiple LLMs and compare their answers side by side. This gave me a sense of which models had decent built-in knowledge that would be helpful for this use case, and which ones I should skip.
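If you prefer to script that comparison rather than run it by hand, something along these lines works against Ollama's local REST API (the default http://localhost:11434/api/generate endpoint); the model tags and the sample question below are placeholders.

import Foundation

// Send one prompt to a locally hosted Ollama model and return its answer.
// Assumes Ollama is running on the desktop with its default REST API on port 11434.
func askOllama(model: String, prompt: String) async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/generate")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": model,
        "prompt": prompt,
        "stream": false
    ])
    let (data, _) = try await URLSession.shared.data(for: request)
    let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    return json?["response"] as? String ?? ""
}

// Compare several candidate models on the same domain-specific question (placeholder values).
let question = "How often should the hydraulic filter be replaced?"
for model in ["gemma3n:e2b", "gemma3n:e4b", "qwen2.5:3b"] {
    let answer = try await askOllama(model: model, prompt: question)
    print("\(model): \(answer.prefix(200))")
}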
This testing process highlighted that while several models were fast, Gemma 3N offered the best reasoning capabilities for our specific technical domain. Although I saw some good results from models like Gemma 3 (not 3N) and Qwen, ultimately I got the best answers from Gemma 3N. That’s good, because the 3N models are designed to be hosted on edge devices just like the iPhone.
At the highest level, there are two versions of Gemma 3N called E2B and E4B. “E2B” stands for “effectively 2 billion parameters”, and “E4B” means “effectively 4 billion parameters”. Normally you want to use the largest model that makes sense for your use case, because typically a 4B model gives better results than a 2B model, but in this case we need to think about memory usage.
By the way, the “effective” prefix highlights that the model can run with a reduced memory and compute footprint compared to its total number of parameters. For example, E2B actually contains over 5 billion parameters, but thanks to optimization techniques like Per-Layer Embedding (PLE) caching, conditional parameter loading, and the MatFormer architecture, the number of parameters loaded into memory is much closer to 2 billion.
Although both Gemma E2B and Gemma E4B work on the iPhone, the quality of answers from E2B wasn't significantly lower than from E4B in my case, and since E2B is smaller and faster, that tipped the scales toward the E2B variant.
How to Use an LLM on an iPhone
When writing an iPhone app in Swift, there are two obvious options for hosting an LLM: Google’s MediaPipe, and Apple’s MLX Swift.
MediaPipe is cross-platform (iOS, Android, and web) and supports TensorFlow Lite (TFLite) models, recently rebranded to LiteRT, where “RT” stands for Runtime. You can find LiteRT models on Hugging Face.
MLX, on the other hand, was written by Apple and runs only on Apple hardware. It supports models stored in Safetensors files, which can also be found on Hugging Face.
Based on my testing, MLX was much faster for certain operations, and the number of model variants available on Hugging Face for MLX was quite a bit larger than the number for MediaPipe. For these reasons, and because I had no need for cross-platform functionality, I went with MLX Swift.
Important note about Quantizing: When you browse the models available on Hugging Face, you’ll see many variants. Even narrowing down to Gemma 3N E2B, you’ll see several different versions of it. There are really two main things I look for in this case: instruction tuning, and the number of bits used for quantizing. (“Quantizing” is the process of taking each of the parameters in a model and shrinking them down in order to save space.)
Instruction tuning is often indicated by an “it” string in the model name. That means it was trained to follow instructions, which is a necessity when dealing with something like a RAG chatbot.
Think of quantization as compressing a high-resolution image. We take the massive, high-precision parameters of the model (usually 16-bit floating point numbers) and shrink them down to 4-bit integers. While this sounds like a drastic loss of data, it allows us to fit a massive brain into a tiny memory budget with surprisingly little loss in intelligence.
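The savings are easy to estimate with back-of-the-envelope math. The parameter count below is a rounded approximation, not an exact figure for Gemma 3N:

// Rough storage estimate: parameters × bytes per parameter (approximate figures).
let effectiveParams = 2.0e9            // ~2 billion effectively loaded parameters
let fp16Bytes = effectiveParams * 2.0  // 16-bit floats = 2 bytes each
let int4Bytes = effectiveParams * 0.5  // 4-bit values = half a byte each
print("fp16: \(fp16Bytes / 1e9) GB, 4-bit: \(int4Bytes / 1e9) GB")
// Roughly 4 GB shrinks to about 1 GB, before counting embeddings and runtime overhead.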
Bottom line — look for an instruction-tuned model that is quantized to 4 bits. The model I used is called gemma-3n-E2B-it-lm-4bit.
Including the Model in your App
When you download a model from Hugging Face, it comes as a set of files. Although the majority of the model is saved in .safetensors files, other files are included to configure the model and support the associated tokenizer.
The best way to include that in your app is to create a Folder reference in Xcode that points to the folder with the model files. This way you can update the folder as you need and don’t have to worry about adding or modifying individual files.
A Note on the App Store: The Apple App Store limits how big an app can be, both in terms of the initial download and the total installed size. You won’t be able to ship a very large app like this one through the App Store. Instead, this approach only works when you deploy to corporate devices using a Mobile Device Management (MDM) solution or something similar. Alternatively, you could leave the model files out of your app and download them on the first run (sketched below).
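If you go the download-on-first-run route, a minimal sketch looks like the following. The modelBaseURL, the file list, and the destination folder name are assumptions, and a production version would add checksums, resumable downloads, and progress reporting.

import Foundation

// Hypothetical first-run downloader: fetches the model files into Application Support.
// modelBaseURL is a placeholder for wherever you host the weights.
func downloadModelIfNeeded(from modelBaseURL: URL) async throws -> URL {
    let fm = FileManager.default
    let destination = try fm.url(for: .applicationSupportDirectory,
                                 in: .userDomainMask,
                                 appropriateFor: nil,
                                 create: true)
        .appendingPathComponent("gemma-3n-E2B-it-lm-4bit", isDirectory: true)

    // Skip the download if the weights are already present.
    if fm.fileExists(atPath: destination.appendingPathComponent("model.safetensors").path) {
        return destination
    }
    try fm.createDirectory(at: destination, withIntermediateDirectories: true)

    for fileName in ["model.safetensors", "tokenizer.json", "config.json"] {
        let remote = modelBaseURL.appendingPathComponent(fileName)
        let (tempURL, _) = try await URLSession.shared.download(from: remote)
        try fm.moveItem(at: tempURL, to: destination.appendingPathComponent(fileName))
    }
    return destination
}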
Loading and Calling the Model
From a code perspective, I created a single service called LocalLLMService.swift that handles loading the model and also sending back responses, either streamed or all-at-once.
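For context, here is roughly what the service's skeleton looks like. The property names and default values are my reconstruction from the snippets that follow, so treat them as illustrative rather than a copy of the actual file.

import Foundation
import MLXLMCommon

// Errors thrown while locating or loading the bundled model.
enum ModelError: Error {
    case modelNotFound(String)
    case tokenizerLoadingFailed(String)
    case modelLoadingFailed(String)
}

final class LocalLLMService {
    // Name of the bundled model folder, relative to the app bundle (assumed value).
    private let modelPath = "gemma-3n-E2B-it-lm-4bit"

    // Loaded MLX model container, populated by loadModel().
    private var modelContainer: ModelContainer?

    // Generation settings (example values).
    private let temperature: Float = 0.7
    private let topP: Float = 0.9
    private let repetitionPenalty: Float = 1.1
    private let maxTokens = 1024

    func loadModel() async throws { /* shown below */ }

    func generateStreamingResponse(prompt: String,
                                   onPartialResponse: @escaping (String) -> Void) async { /* shown below */ }
}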
Let’s start with code for loading the model from our embedded resources. For brevity, I’ll only include the most important parts. You can find the entire file in this GitHub Gist.
First, let’s include the packages we need.
import Foundation
import MLX
import MLXNN
import MLXLLM
import MLXLMCommon
import Tokenizers
Then we load the model in the loadModel() function:
// Get full path to model directory
guard let bundlePath = Bundle.main.resourcePath else {
    throw ModelError.modelNotFound("Unable to access app bundle")
}
let fullModelPath = (bundlePath as NSString).appendingPathComponent(modelPath)

// Verify model directory exists
let fileManager = FileManager.default
var isDirectory: ObjCBool = false
guard fileManager.fileExists(atPath: fullModelPath, isDirectory: &isDirectory),
      isDirectory.boolValue else {
    throw ModelError.modelNotFound(fullModelPath)
}

// Verify required model files exist
let modelFile = (fullModelPath as NSString).appendingPathComponent("model.safetensors")
let tokenizerFile = (fullModelPath as NSString).appendingPathComponent("tokenizer.json")
let configFile = (fullModelPath as NSString).appendingPathComponent("config.json")

guard fileManager.fileExists(atPath: modelFile) else {
    throw ModelError.modelNotFound("model.safetensors not found")
}
guard fileManager.fileExists(atPath: tokenizerFile) else {
    throw ModelError.tokenizerLoadingFailed("tokenizer.json not found")
}
guard fileManager.fileExists(atPath: configFile) else {
    throw ModelError.modelLoadingFailed("config.json not found")
}

// Load MLX model container with Metal acceleration
print("Loading MLX model container...")

// Create model configuration with local directory URL
let modelURL = URL(fileURLWithPath: fullModelPath)
let modelConfig = ModelConfiguration(
    directory: modelURL,
    defaultPrompt: "You are a helpful assistant."
)

// Load the model container using LLMModelFactory
self.modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: modelConfig
) { progress in
    print("Loading progress: \(Int(progress.fractionCompleted * 100))%")
}
Once that’s done, there’s a crucial step that helps with memory issues:
// Configure MLX GPU buffer cache limit to prevent memory accumulation
// MLX caches freed GPU memory for reuse, but this can cause OOM on repeated inferences
// Set limit to 50 MB to allow some caching while preventing excessive accumulation
let cacheLimit = 50 * 1024 * 1024 // 50 MB
MLX.GPU.set(cacheLimit: cacheLimit)
Memory use is the major issue when dealing with LLMs on mobile hardware. The MLX framework tends to hold on to freed GPU memory for reuse, which can accumulate and cause your app to crash after just a couple of queries. The code above explicitly caps how much of that memory MLX keeps cached, which fixes the problem.
One other note about memory: another important step is to add an entitlement (via Xcode) to indicate that your app needs more memory. This results in the com.apple.developer.kernel.increased-memory-limit entitlement being added to your app.
Now that the model is loaded, let’s look at how it’s called. The code supports both streaming and non-streaming responses; here is the streaming version:
guard let container = modelContainer else {
    print("✗ Model container not initialized")
    return
}

do {
    print("Generating streaming response with MLX...")
    print("Prompt: \"\(prompt.prefix(50))\(prompt.count > 50 ? "..." : "")\"")

    // Set up generation parameters
    let params = GenerateParameters(
        temperature: temperature,
        topP: topP,
        repetitionPenalty: repetitionPenalty
    )

    // Capture values to avoid retaining self in closure
    let maxTokensLimit = self.maxTokens

    // Generate with streaming callback
    let result = try await container.perform { context in
        // Prepare input with user messages using context processor
        let fullPrompt = prompt
        let input = try await context.processor.prepare(input: .init(prompt: fullPrompt))
        var localTokenCount = 0

        return try MLXLMCommon.generate(
            input: input,
            parameters: params,
            context: context
        ) { tokens in
            // tokens array is cumulative (all tokens so far), not incremental
            localTokenCount = tokens.count

            // Decode new tokens to text (synchronous decode)
            let newText = context.tokenizer.decode(tokens: tokens)

            // Stop if we've hit the EOS token ID (model is done) - check first for natural completion
            if let eosTokenId = context.tokenizer.eosTokenId,
               tokens.contains(eosTokenId) {
                return .stop
            }

            // Stop if we see end-of-turn markers in the decoded text
            if newText.contains("<end_of_turn>") || newText.contains("</s>") {
                return .stop
            }

            // Stop if we've hit max tokens (safety limit)
            if localTokenCount >= maxTokensLimit {
                print("⚠️ Max token limit reached (\(maxTokensLimit)) - appending truncation notice")
                // Send truncation notice to user
                Task { @MainActor in
                    onPartialResponse("\n\n[Response truncated - maximum length reached]")
                }
                return .stop
            }

            // Clean up EOS markers before sending to callback
            var cleanedText = newText
            cleanedText = cleanedText.replacingOccurrences(of: "<end_of_turn>", with: "")
            cleanedText = cleanedText.replacingOccurrences(of: "</s>", with: "")

            // Only send non-empty cleaned text to callback
            if !cleanedText.isEmpty {
                // Call the partial response callback on main thread
                Task { @MainActor in
                    onPartialResponse(cleanedText)
                }
            }

            return .more
        }
    }

    let generationTime = Date().timeIntervalSince(startTime)
    print("Generation time: \(String(format: "%.3f", generationTime))s")
    print("Tokens generated: \(result.tokens.count)")

    // Force MLX to evaluate computation graph and release GPU buffers
    // This triggers the cache limit policy, allowing old buffers to be freed
    MLX.eval()
There are a couple of key points to take into consideration. First, at the top of the function we set the LLM parameters like temperature, top-P, etc. Second, we can specify a maximum number of output tokens, and the code stops calling the LLM once that limit is reached.
Finally (and perhaps most importantly), this implementation differs from standard streaming: the callback delivers the total accumulated response so far, not just the new tokens. Your UI code should therefore replace the current text view entirely on every update, rather than keeping the answer so far and appending new tokens to it (see the sketch after these notes).
Lastly, note the final step (MLX.eval()), which forces MLX to release some internal buffers; this is another piece of the memory-saving approach.
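Here is a minimal sketch of what that looks like from the UI side, assuming a SwiftUI view and the service shape sketched earlier; note that the callback's text simply replaces the bound state instead of being appended to it.

import SwiftUI

// Minimal chat view wired to the streaming call (illustrative only).
struct ChatView: View {
    @State private var answer = ""
    let llm: LocalLLMService

    var body: some View {
        ScrollView {
            Text(answer).padding()
        }
        .task {
            await llm.generateStreamingResponse(prompt: "How do I replace the filter?") { fullTextSoFar in
                // The callback delivers the whole response so far, so replace, don't append.
                answer = fullTextSoFar
            }
        }
    }
}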
Conclusion: The Cloud in Your Pocket
A year ago, building a RAG system capable of answering complex maintenance questions required a cloud GPU cluster and an API key. Today, we have that same capability running offline on a phone.
By carefully selecting a capable, small model like Gemma 3N, utilizing the unified MLX ecosystem, and respecting the strict memory limits of iOS, we didn’t just build a chatbot — we built an entire RAG solution. We proved that the edge is no longer just for “toy” models. It is ready for real work.
The constraints of mobile development — battery, thermals, and RAM — force us to be better engineers. And honestly? Watching those tokens stream onto an iPhone screen feels a lot more satisfying than getting a JSON response from a server.
