
Building Private & Offline LLM-Powered Experiences: Client-Side AI Agents with Wasm

By Pallav · 15 min read

Learn how to build browser-based AI agents that run entirely on the client side using WebAssembly. This guide covers implementing a RAG system with local LLM inference for private, offline-capable applications.


AI is everywhere, but most of the powerful stuff runs on massive cloud servers. Cloud-based LLMs are impressive, yet they raise significant questions around data privacy, connectivity requirements, and ongoing API costs. What if we could run those same powerful language models directly in a user's browser—keeping their data on-device and enabling apps to work offline?

Client-Side AI Agents with WebAssembly (Wasm) make this possible. This approach lets us run complex AI models, including those using Retrieval-Augmented Generation (RAG), entirely in the browser. The applications are compelling: a chatbot that answers questions from local files without exposing your data, a coding assistant that works offline, or a personal knowledge manager that never ships your thoughts to a distant server.

In this post, I'll walk through how to build one of these agents. We'll explore the rationale and mechanics of client-side AI, focusing on a browser-based RAG system using Wasm for local LLM inference. By the end, you'll have both a conceptual understanding and concrete steps to start building truly private, offline LLM-powered applications.

The Unseen Revolution: Why Client-Side AI Matters

The shift to client-side AI addresses core issues with how AI is deployed today:

  1. Uncompromised Privacy: When inference happens on the client, user data—prompts, retrieved information—never leaves the device. There is no server-side breach surface, no server logs, and a significantly reduced compliance burden for sensitive data under regulations like GDPR or HIPAA.

  2. Offline Capability: Field workers in remote areas, students on long flights, or anyone with unreliable connectivity can still access intelligent assistance. Once the model downloads, the app works without a network.

  3. Cost Efficiency: Per-token API costs disappear. Every inference after the initial model download runs locally at zero marginal cost—a substantial saving for high-volume applications.

  4. Reduced Latency: Every round-trip to a cloud server adds network lag. Local inference removes that hop entirely, so responsiveness depends only on the device's own compute—typically a noticeably smoother user experience.

  5. Natural Scalability: Client-side AI scales with your user base without increasing server infrastructure costs. Each user's device handles its own compute.

  6. User Control & Customization: Users could swap models, integrate them deeply with local files and applications, or potentially fine-tune them locally—flexibility that cloud APIs simply can't offer.

WebAssembly (Wasm): The Engine for Browser AI

WebAssembly (Wasm) makes client-side AI practical. It's a binary instruction format for a stack-based virtual machine, designed as a portable compilation target for languages like C, C++, Rust, and Go.

Why Wasm matters for browser AI:

  • Near-Native Performance: For compute-heavy numeric work, Wasm typically runs much faster than equivalent JavaScript, approaching native speeds. That performance headroom is essential for neural network inference.

  • Safety and Sandboxing: Wasm runs in a secure, sandboxed browser environment with no access to system resources outside what's explicitly granted.

  • Portability: Wasm modules work consistently across browsers and operating systems, simplifying development and deployment.

  • Mature Ecosystem: Optimized inference libraries such as llama.cpp and ONNX Runtime can be compiled to Wasm, providing a substantial head start over building inference from scratch.

Wasm bridges the gap between server-side AI capabilities and web deployment, enabling complex models to run safely in any modern browser.

Understanding Retrieval-Augmented Generation (RAG)

Running an LLM in the browser is useful, but "raw" LLMs have limitations: they hallucinate, their knowledge is frozen at training time, and they can't access real-time or private information. Retrieval-Augmented Generation (RAG) addresses these gaps.

RAG enhances LLMs by connecting them to external knowledge bases. Instead of relying solely on training data, RAG systems retrieve relevant content from a document collection and augment the LLM's prompt with that context before generating a response.

Benefits of RAG for Client-Side AI:

  • Reduced Hallucination: Grounding responses in retrieved facts makes them more reliable.
  • Access to Private Data: The LLM can answer questions about personal documents, notes, or other local data.
  • Smaller Model Requirements: Retrieval offloads knowledge storage, allowing smaller, browser-friendly LLMs to perform well.
  • Current Information: The knowledge base can be updated independently of the model.

Core Components (a short sketch of how they fit together follows the list):

  1. Knowledge Base (KB): The document collection—text files, PDFs, notes—that the LLM draws from.
  2. Embedding Model: A neural network that converts text into numerical vectors where semantic similarity maps to vector proximity.
  3. Vector Database: A specialized store optimized for similarity search across embeddings.
  4. Large Language Model (LLM): The generative model that synthesizes retrieved context with the user query to produce answers.
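
Putting those four pieces together, the core loop looks roughly like the sketch below. It is purely conceptual: embedText, searchVectorStore, and generateAnswer are placeholders for whichever embedding model, vector store, and LLM you plug in (concrete versions of all three appear later in this post).

// Conceptual RAG loop — the helper functions are placeholders, not a specific library API
async function answerWithRAG(query) {
    // 1. Convert the query into a vector using the embedding model
    const queryEmbedding = await embedText(query);

    // 2. Retrieve the most relevant documents from the vector store
    const retrievedDocs = await searchVectorStore(queryEmbedding, 3);

    // 3. Augment the prompt with the retrieved context
    const prompt = `Context:\n${retrievedDocs.map(d => d.text).join("\n\n")}\n\nQuestion: ${query}`;

    // 4. Let the LLM generate an answer grounded in that context
    return generateAnswer(prompt);
}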

Architecting Our Browser-Based RAG Agent

Here's how our client-side RAG agent works:

Client-Side Components:

  • UI (HTML/CSS/JS): Standard interface for queries and responses. We'll use vanilla JS here, though React, Vue, or Svelte would work equally well.
  • Wasm Embedding Model: We'll use @xenova/transformers to run a dedicated sentence-transformer model for high-quality embeddings.
  • In-Memory Vector Store: A simple implementation for this demo. Production systems would use hnswlib-wasm or faiss-wasm with IndexedDB for persistence.
  • Wasm LLM: We'll use @mlc-ai/web-llm to load and run quantized LLMs in the browser.

Step-by-Step Setup Guide

1. Project Setup

Create two files: index.html and script.js.

index.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Client-Side RAG Agent</title>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            max-width: 800px;
            margin: 20px auto;
            padding: 0 20px;
            line-height: 1.6;
        }
        textarea {
            width: 100%;
            min-height: 100px;
            margin-bottom: 10px;
            padding: 10px;
            border: 1px solid #ccc;
            border-radius: 4px;
            font-family: inherit;
        }
        button {
            padding: 10px 15px;
            background-color: #007bff;
            color: white;
            border: none;
            border-radius: 4px;
            cursor: pointer;
        }
        button:disabled {
            background-color: #cccccc;
            cursor: not-allowed;
        }
        #output {
            background-color: #f9f9f9;
            border: 1px solid #eee;
            padding: 15px;
            border-radius: 4px;
            margin-top: 20px;
            white-space: pre-wrap;
        }
        #status {
            margin-top: 10px;
            font-style: italic;
            color: #555;
        }
        .error {
            color: #dc2626;
            background-color: #fef2f2;
            padding: 10px;
            border-radius: 4px;
        }
        .progress-bar {
            width: 100%;
            height: 4px;
            background: #e5e7eb;
            border-radius: 2px;
            margin-top: 8px;
            overflow: hidden;
        }
        .progress-bar-fill {
            height: 100%;
            background: #3b82f6;
            transition: width 0.3s ease;
        }
    </style>
</head>
<body>
    <h1>Private & Offline RAG Agent</h1>
    <p>Ask a question about the pre-loaded knowledge base.</p>

    <h2>Knowledge Base (Pre-loaded)</h2>
    <p>Sample content about WebAssembly and Client-Side AI is ready for querying.</p>

    <h2>Ask a Question</h2>
    <textarea id="queryInput" placeholder="Enter your question here..."></textarea>
    <button id="askButton" disabled>Ask</button>
    <div id="status">Checking browser compatibility...</div>
    <div class="progress-bar" id="progressBar" style="display: none;">
        <div class="progress-bar-fill" id="progressFill" style="width: 0%"></div>
    </div>
    <div id="output"></div>

    <script type="module" src="script.js"></script>
</body>
</html>

2. Dependencies

Install the required packages:

npm init -y
npm install @mlc-ai/web-llm @xenova/transformers

For production applications, use a bundler like Vite or Webpack for better asset management and code splitting:

npm install -D vite
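
A near-default Vite setup is usually enough for this demo. The configuration below is only a suggested starting point, not a requirement; raising the chunk-size warning limit is cosmetic, since the AI runtimes produce large bundles.

// vite.config.js — minimal sketch; Vite's defaults generally work for this project
import { defineConfig } from "vite";

export default defineConfig({
    build: {
        // web-llm and transformers.js ship large runtime chunks; raising the
        // warning threshold only silences build warnings, it changes no behavior.
        chunkSizeWarningLimit: 2000,
    },
});

Run npx vite for a development server and npx vite build for a production bundle.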

3. Loading Models and Building the Vector Store

script.js:

import * as webllm from "@mlc-ai/web-llm";
import { pipeline } from "@xenova/transformers";

// DOM Elements
const queryInput = document.getElementById('queryInput');
const askButton = document.getElementById('askButton');
const statusDiv = document.getElementById('status');
const outputDiv = document.getElementById('output');
const progressBar = document.getElementById('progressBar');
const progressFill = document.getElementById('progressFill');

// State
let chatEngine = null;
let embeddingPipeline = null;
let vectorStore = [];

// Knowledge Base - Example documents about WebAssembly and client-side AI
const knowledgeBase = [
    "WebAssembly, commonly abbreviated as Wasm, provides a binary instruction format designed for stack-based virtual machines. It serves as a compilation target for languages including C, C++, Rust, and Go. The format enables near-native execution speeds while maintaining browser security through sandboxing.",

    "Running AI models directly in the browser eliminates the need to send data to external servers. This architectural choice provides several advantages: user data remains on-device, applications function without network connectivity, and operational costs decrease since no API calls are required.",

    "Retrieval-Augmented Generation combines information retrieval with language model generation. The system first searches a knowledge base for relevant documents, then includes those documents as context when prompting the language model. This approach reduces hallucination and enables the model to reference specific, current information.",

    "A typical RAG implementation requires four components working together. The knowledge base stores source documents. An embedding model converts text into vector representations. A vector database enables efficient similarity search. Finally, a language model generates responses using retrieved context.",

    "Browser-based AI inference became practical with WebGPU, which provides low-level GPU access from JavaScript. Combined with WebAssembly for CPU-bound operations, modern browsers can now run neural networks that previously required dedicated server infrastructure.",

    "Privacy regulations like GDPR and HIPAA impose strict requirements on data handling. Client-side AI architectures simplify compliance because sensitive data never leaves the user's device—there are no server logs to secure and no cross-border data transfers to document.",

    "Quantization reduces model size by representing weights with fewer bits. A model using 4-bit quantization requires roughly one-quarter the memory of its full-precision equivalent. This technique makes it feasible to run capable language models within browser memory constraints.",

    "After the initial model download, client-side inference incurs no per-request costs. For applications with high query volumes, this can represent substantial savings compared to cloud API pricing models that charge per token processed."
];

// --- Utility Functions ---

function updateProgress(percent) {
    progressBar.style.display = 'block';
    progressFill.style.width = `${percent}%`;
}

function hideProgress() {
    progressBar.style.display = 'none';
}

function cosineSimilarity(vecA, vecB) {
    let dotProduct = 0;
    let magnitudeA = 0;
    let magnitudeB = 0;
    for (let i = 0; i < vecA.length; i++) {
        dotProduct += vecA[i] * vecB[i];
        magnitudeA += vecA[i] * vecA[i];
        magnitudeB += vecB[i] * vecB[i];
    }
    magnitudeA = Math.sqrt(magnitudeA);
    magnitudeB = Math.sqrt(magnitudeB);
    if (magnitudeA === 0 || magnitudeB === 0) return 0;
    return dotProduct / (magnitudeA * magnitudeB);
}

async function checkCompatibility() {
    if (!navigator.gpu) {
        return {
            supported: false,
            reason: "WebGPU is not available in this browser. Please use Chrome 113+, Edge 113+, or Safari 17+."
        };
    }

    try {
        const adapter = await navigator.gpu.requestAdapter();
        if (!adapter) {
            return {
                supported: false,
                reason: "No GPU adapter found. A dedicated graphics card is recommended for client-side AI."
            };
        }
        return { supported: true };
    } catch (e) {
        return {
            supported: false,
            reason: `GPU initialization failed: ${e.message}`
        };
    }
}

async function embedText(text) {
    const output = await embeddingPipeline(text, {
        pooling: 'mean',
        normalize: true
    });
    return Array.from(output.data);
}

async function addDocumentToVectorStore(text, index, total) {
    statusDiv.textContent = `Embedding document ${index + 1}/${total}...`;
    updateProgress(((index + 1) / total) * 100);
    const embedding = await embedText(text);
    vectorStore.push({ text, embedding });
}

async function searchVectorStore(queryEmbedding, topK = 3) {
    const similarities = vectorStore.map((doc, index) => ({
        index,
        similarity: cosineSimilarity(queryEmbedding, doc.embedding)
    }));
    similarities.sort((a, b) => b.similarity - a.similarity);
    return similarities.slice(0, topK).map(s => ({
        text: vectorStore[s.index].text,
        score: s.similarity
    }));
}

// --- Model Initialization ---

async function initializeModels() {
    // Check browser compatibility first
    const compatibility = await checkCompatibility();
    if (!compatibility.supported) {
        statusDiv.innerHTML = `<span class="error">${compatibility.reason}</span>`;
        outputDiv.textContent = "Client-side AI requires WebGPU support. Please try a different browser.";
        return;
    }

    try {
        // Step 1: Load embedding model first (smaller, faster)
        statusDiv.textContent = "Loading embedding model (~30MB)...";
        updateProgress(10);

        embeddingPipeline = await pipeline(
            'feature-extraction',
            'Xenova/all-MiniLM-L6-v2',
            {
                progress_callback: (progress) => {
                    if (progress.status === 'progress') {
                        updateProgress(10 + (progress.progress * 0.3));
                    }
                }
            }
        );

        // Step 2: Index knowledge base
        statusDiv.textContent = "Indexing knowledge base...";
        for (let i = 0; i < knowledgeBase.length; i++) {
            await addDocumentToVectorStore(knowledgeBase[i], i, knowledgeBase.length);
        }

        // Step 3: Load LLM (larger, takes longer)
        statusDiv.textContent = "Loading language model (~1.5GB on first visit)...";
        updateProgress(50);

        chatEngine = await webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC", {
            initProgressCallback: (progress) => {
                statusDiv.textContent = progress.text;
                // progress.progress is 0-1
                updateProgress(50 + (progress.progress * 50));
            }
        });

        hideProgress();
        statusDiv.textContent = `Ready. Knowledge base: ${vectorStore.length} documents indexed.`;
        askButton.disabled = false;
        queryInput.disabled = false;

    } catch (error) {
        hideProgress();
        statusDiv.innerHTML = `<span class="error">Model loading failed: ${error.message}</span>`;
        console.error("Model loading error:", error);
    }
}

// --- Query Handler ---

askButton.addEventListener('click', async () => {
    const query = queryInput.value.trim();
    if (!query) {
        alert("Please enter a question.");
        return;
    }

    askButton.disabled = true;
    queryInput.disabled = true;
    outputDiv.textContent = "Processing...";

    try {
        // Step 1: Embed the query
        statusDiv.textContent = "Embedding query...";
        const queryEmbedding = await embedText(query);

        // Step 2: Retrieve relevant documents
        statusDiv.textContent = "Searching knowledge base...";
        const retrievedDocs = await searchVectorStore(queryEmbedding, 3);

        // Step 3: Build prompt with context
        const context = retrievedDocs.map(d => d.text).join("\n\n");
        const messages = [
            {
                role: "system",
                content: "You are a helpful assistant. Answer questions based only on the provided context. If the answer cannot be found in the context, say so clearly."
            },
            {
                role: "user",
                content: `Context:\n${context}\n\nQuestion: ${query}`
            }
        ];

        // Step 4: Generate response
        statusDiv.textContent = "Generating response...";
        outputDiv.textContent = "";

        const response = await chatEngine.chat.completions.create({
            messages,
            stream: true,
        });

        // Stream the response
        for await (const chunk of response) {
            const delta = chunk.choices[0]?.delta?.content || "";
            outputDiv.textContent += delta;
        }

        statusDiv.textContent = "Done.";

    } catch (error) {
        outputDiv.innerHTML = `<span class="error">Error: ${error.message}</span>`;
        statusDiv.textContent = "An error occurred.";
        console.error("RAG error:", error);
    } finally {
        askButton.disabled = false;
        queryInput.disabled = false;
    }
});

// Initialize on load
initializeModels();

4. Memory Management for Constrained Devices

Running both an embedding model and an LLM simultaneously can stress systems with limited RAM. For devices with 8GB or less, consider sequential loading:

// Alternative: Sequential model loading for memory-constrained devices
async function initializeModelsSequential() {
    // Load embedding model, index documents, then release it
    const tempEmbedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

    for (const doc of knowledgeBase) {
        const embedding = await tempEmbedder(doc, { pooling: 'mean', normalize: true });
        vectorStore.push({ text: doc, embedding: Array.from(embedding.data) });
    }

    // Dispose embedding model to free memory before loading LLM
    await tempEmbedder.dispose();

    // Now load the LLM with more available memory
    chatEngine = await webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");

    // For queries, we'll need to reload the embedder temporarily
    // or pre-compute common query embeddings
}

The tradeoff: queries either require reloading the embedding model (adding roughly 2-3 seconds of latency) or rely on pre-computed embeddings for expected queries; a lazy-reload sketch follows below. For most users with 16GB+ RAM, running both models concurrently works fine.
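
One way to handle queries after disposing the embedder is to reload it lazily the first time a question arrives and keep it around for the rest of the session, paying the reload cost once rather than per query. This sketch reuses the same pipeline call as the sequential loader above:

// Lazily (re)load the embedding pipeline only when a query actually needs it
let lazyEmbedder = null;

async function getEmbedder() {
    if (!lazyEmbedder) {
        statusDiv.textContent = "Reloading embedding model...";
        lazyEmbedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
    }
    return lazyEmbedder;
}

async function embedQuery(query) {
    const embedder = await getEmbedder();
    const output = await embedder(query, { pooling: 'mean', normalize: true });
    return Array.from(output.data);
}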

5. Running the Application

Start a local development server:

# Option 1: Vite (recommended)
npx vite

# Option 2: http-server
npx http-server

# Option 3: Python
python -m http.server

Navigate to http://localhost:5173 (Vite), http://localhost:8080 (http-server), or http://localhost:8000 (Python). The status indicator shows model loading progress—the first load downloads approximately 1.5GB of model weights, but subsequent visits use cached files.

Try queries like:

  • "What is WebAssembly?"
  • "Why does privacy matter for client-side AI?"
  • "How does RAG reduce hallucination?"

Deployment Considerations

Browser Compatibility

Supported Browsers:

  • Chrome/Edge 113+ (best WebGPU support)
  • Firefox 118+ (WebGPU behind flag, improving)
  • Safari 17+ (partial WebGPU support)

Hardware Requirements:

  • Minimum 8GB system RAM (16GB recommended for concurrent model loading)
  • Dedicated GPU significantly improves performance
  • ~1.5GB storage for model weights (cached after first load)

Graceful Degradation:

Always check compatibility and provide clear messaging when requirements aren't met. The checkCompatibility() function in our implementation handles this, but consider also checking available memory:

// Check available memory (where supported)
if (navigator.deviceMemory && navigator.deviceMemory < 8) {
    console.warn(`Device reports ${navigator.deviceMemory}GB RAM. Performance may be limited.`);
}

For unsupported browsers, provide fallback options: a cloud API endpoint, a simpler non-AI experience, or clear instructions for switching browsers.
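
One way to wire that up is a thin router around the existing checkCompatibility() helper. In the sketch below, answerLocally stands for the RAG flow built earlier, and /api/chat is a hypothetical server endpoint you would have to provide yourself:

// Hypothetical fallback routing — answerLocally and /api/chat are placeholders, not library APIs
async function answerQuestion(query) {
    const compatibility = await checkCompatibility();

    if (compatibility.supported) {
        // Local, private path: the client-side RAG pipeline described above
        return answerLocally(query);
    }

    // Fallback path: a server-side endpoint (loses the privacy and offline benefits)
    const response = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query }),
    });
    if (!response.ok) throw new Error(`Cloud fallback failed: ${response.status}`);
    const { answer } = await response.json();
    return answer;
}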

Production Optimizations

For deployment beyond demo purposes:

  1. Use IndexedDB for vector persistence - Avoid re-embedding documents on every page load (a persistence sketch follows this list)
  2. Implement service workers - Cache model files for true offline capability
  3. Add loading states - First-time users face a significant download; communicate progress clearly
  4. Consider model preloading - If you know a user will need the AI, start loading before they click
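
For the first point, a thin wrapper around the raw IndexedDB API is enough to cache the embedded documents between visits. The database and store names below are arbitrary choices for this sketch:

// Persist and restore the vector store with IndexedDB (names are arbitrary for this sketch)
const DB_NAME = "rag-agent";
const STORE_NAME = "vectors";

function openDatabase() {
    return new Promise((resolve, reject) => {
        const request = indexedDB.open(DB_NAME, 1);
        request.onupgradeneeded = () => request.result.createObjectStore(STORE_NAME);
        request.onsuccess = () => resolve(request.result);
        request.onerror = () => reject(request.error);
    });
}

async function saveVectorStore(vectorStore) {
    const db = await openDatabase();
    return new Promise((resolve, reject) => {
        const tx = db.transaction(STORE_NAME, "readwrite");
        tx.objectStore(STORE_NAME).put(vectorStore, "default");
        tx.oncomplete = () => resolve();
        tx.onerror = () => reject(tx.error);
    });
}

async function loadVectorStore() {
    const db = await openDatabase();
    return new Promise((resolve, reject) => {
        const request = db.transaction(STORE_NAME).objectStore(STORE_NAME).get("default");
        request.onsuccess = () => resolve(request.result || null);
        request.onerror = () => reject(request.error);
    });
}

On startup, call loadVectorStore() first and only re-embed the knowledge base when it returns null; after indexing, call saveVectorStore(vectorStore) so the next visit skips the embedding step.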

Key Takeaways

  • Wasm enables a new class of web applications by bringing near-native AI inference to browsers safely and portably.

  • Privacy and offline capability are increasingly important differentiators, not just nice-to-haves.

  • RAG transforms small LLMs from novelties into genuinely useful tools by grounding them in specific, retrievable knowledge.

  • Real constraints exist: initial download sizes (1-2GB), memory requirements (8-16GB RAM), and the current limitation to smaller quantized models. Design your UX around these realities.

  • The ecosystem is maturing rapidly. Libraries like web-llm and transformers.js have simplified integration dramatically in the past year, and browser WebGPU support continues to improve.

What to Build Next

The foundation above is intentionally minimal. A natural extension: build a personal document assistant that indexes your own markdown notes or text files.

This would involve:

  1. File upload via the File API - Let users drag-and-drop their documents
  2. Document chunking - Split longer documents into retrievable segments. Aim for 200-500 token chunks with 50-100 token overlap between chunks; the overlap prevents losing context at chunk boundaries (a rough sketch follows this list).
  3. Persistent vector storage - Save embeddings to IndexedDB so documents don't need re-processing on each visit
  4. Smarter retrieval - Implement hybrid search combining vector similarity with keyword matching (BM25)
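
As a starting point for step 2, the sketch below splits on whitespace and treats words as a rough stand-in for tokens; a production version would count tokens with the tokenizer that ships with your embedding model.

// Word-based chunking with overlap — an approximation; real token counts depend on the tokenizer
function chunkDocument(text, chunkSize = 300, overlap = 75) {
    const words = text.split(/\s+/).filter(Boolean);
    const chunks = [];

    for (let start = 0; start < words.length; start += chunkSize - overlap) {
        chunks.push(words.slice(start, start + chunkSize).join(" "));
        if (start + chunkSize >= words.length) break; // final chunk reached
    }
    return chunks;
}

// Each chunk is then embedded and stored exactly like the documents above:
// for (const chunk of chunkDocument(fileText)) { vectorStore.push({ text: chunk, embedding: await embedText(chunk) }); }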

The core RAG loop stays the same—what changes is the knowledge base and how you populate it.

Client-side AI is shifting from experimental to practical. The infrastructure is ready. The question now is what you'll build with it.

Written by Pallav · 2025-10-29