Browser-Based AI Inpainting With WebGPU

The Inference Layer Is Moving to the Client

Browser-based AI model execution is no longer a novelty. The Moebius 0.2B image inpainting model — originally designed to run on PyTorch with NVIDIA CUDA — has been successfully ported to run entirely in the browser using WebGPU and the ONNX runtime. No server. No GPU rental. No API call. Just a user opening a webpage and running a neural network locally, with 1.24GB of model weights cached in the browser after the first load.

Simon Willison documented this porting process in detail, using Claude Code as the primary engineering agent throughout. The result reframes a question we hear often at NerdHeadz: "Do we need a backend for AI inference?" For small, well-structured models, the honest answer is increasingly no.

Why ONNX + WebGPU Is the Architecture That Makes This Work

Five connected geometric prisms of increasing size converging into a single glowing amber terminal slab

ONNX (Open Neural Network Exchange) is a portable, framework-neutral format for neural networks. An .onnx file packages two things together: a computation graph describing the forward pass as a directed sequence of operators, and the learned weight tensors those operators act on. Critically, ONNX describes *what* to compute without specifying *how* or on *what hardware* — which is precisely what makes it deployable to a WebGPU backend running in Chrome, Firefox, or Safari.

PyTorch has native export support for ONNX via torch.onnx.export, so the conversion path from a research model to a browser-runnable artifact is shorter than most engineers expect. The opset versioning system ensures operator semantics are pinned and reproducible across runtimes. Once the weights are converted and hosted — in this case on Hugging Face — the browser loads them directly, with the CacheStorage API handling the 1.3GB file after the first visit so subsequent loads are instant.

The browser is no longer a thin client for AI — it is becoming a legitimate inference runtime.

Working on a product that could benefit from client-side AI inference? Talk to our team about your project.

What Claude Code Actually Did (And What That Tells Builders)

A central deep purple hexagonal column with four amber geometric fragments converging toward it along dashed lines

The engineering work here was handled almost entirely by an AI coding agent. Claude Code converted the PyTorch model to ONNX, published the weights to Hugging Face, scaffolded the frontend application, wired up a progress bar for the large file download, and implemented CacheStorage to prevent repeated 1.3GB downloads on reload. The human's role was testing, directing, and pointing the agent at reference implementations like the Whisper Web demo.

This is the agentic development pattern we apply on our own client projects. Our AI development services team uses coding agents not to replace engineering judgment, but to compress the distance between "this model exists in a research repo" and "this model is running in production." The Moebius port went from concept to deployed demo in a single working session — that compression is real and measurable.

The key insight is that the agent needed rich context to succeed. Cloning the source repo, the weights repository, Transformers.js, and ONNX Runtime into a shared working directory before the session started gave the agent the reference material to make informed architectural decisions. Garbage in, garbage out — this applies as much to agents as to models.

The Product Implications of Client-Side Inference

Three concentric deep purple rings surrounding a compact central prism radiating amber light upward through stacked membranes

Three capabilities have now been confirmed viable at the 200M-parameter scale in a standard browser environment. First, PyTorch-trained models can be converted to ONNX and loaded by WebGPU runtimes in Chrome, Firefox, and Safari without browser-specific workarounds. Second, the CacheStorage API handles multi-gigabyte model files reliably, meaning the UX penalty is a one-time download, not a recurring one. Third, image manipulation tasks — including inpainting, which requires meaningful spatial reasoning — can run entirely on-device.

For product teams, this collapses the cost and complexity of a category of AI features. An inpainting tool built this way has no inference API costs, no cold-start latency, no user data leaving the device, and no server infrastructure to maintain. The tradeoff is a 1.3GB first-load download and a WebGPU capability requirement — real constraints, but manageable ones for the right use case.

We've written about how small, specialized models are consistently outperforming expectations — our breakdown of recent frontier model shifts covers related dynamics in the current model landscape. The Moebius result is another data point in the same direction: parameter efficiency, not raw scale, is what unlocks deployment flexibility.

When to Build Client-Side vs. Server-Side AI

Two opposing geometric masses flanking a central void gap with bifurcating translucent arcs rising from the divide

Client-side inference is the right call when privacy is a hard requirement, when per-query API costs would be prohibitive at scale, or when offline capability matters. It is the wrong call when the model size exceeds what users will tolerate downloading, when the task requires models larger than roughly 1-2B parameters at current WebGPU performance levels, or when the target audience uses hardware that lacks WebGPU support.

The architecture decision is not ideological — it is a product constraint question. A well-scoped AI feature shipped as a pure browser application can outperform a server-dependent equivalent on latency, cost, and user trust. The technical barrier to that choice has dropped substantially. What remains is knowing which problems fit the box.

Our AI agent development practice handles exactly these scoping decisions: matching model architecture to deployment environment before a line of production code is written.

Ready to build? NerdHeadz ships production AI in weeks, not months. Get a free estimate.

The successful browser-side deployment of the Moebius 0.2B model is a concrete proof point that client-side AI inference has crossed from experiment into viable product architecture. For builders, the practical question is no longer whether this is possible — it is whether the tradeoffs fit their specific use case. Teams that learn to make that call accurately will ship faster and spend less.

“The browser is no longer a thin client for AI—it is becoming a legitimate inference runtime.”

— NerdHeadz Engineering

Spotted via Simon Willison from Simon Willison’s Newsletter

Written by

NerdHeadz Team

Author at NerdHeadz

Frequently asked questions

Can a neural network model run entirely in the browser without a server?

Yes. Models converted to ONNX format can run in-browser via WebGPU runtimes supported by Chrome, Firefox, and Safari. The Moebius 0.2B inpainting model demonstrates this at production quality, with model weights cached locally using the CacheStorage API after the first download.

What is ONNX and why is it used for browser-based AI deployment?

ONNX (Open Neural Network Exchange) is a portable, framework-neutral file format that packages a neural network's computation graph and learned weights together. It separates what a model computes from how and where it runs, making it compatible with WebGPU runtimes that execute directly in the browser without requiring Python or NVIDIA hardware.

How large can a browser-based AI model be before it becomes impractical?

At current WebGPU performance levels, models in the 200M–500M parameter range are practical for browser deployment. The Moebius model required a one-time 1.3GB download, which is manageable for motivated users when paired with browser caching. Models significantly larger than 1–2B parameters currently exceed what most browser environments can load and execute within acceptable latency bounds.

Running a 200M-Parameter Inpainting Model Entirely in the Browser

The Inference Layer Is Moving to the Client

Why ONNX + WebGPU Is the Architecture That Makes This Work

What Claude Code Actually Did (And What That Tells Builders)

The Product Implications of Client-Side Inference

When to Build Client-Side vs. Server-Side AI

NerdHeadz Team

Frequently asked questions

Stay in the loop

Ready to ship something custom?

The Inference Layer Is Moving to the Client

Why ONNX + WebGPU Is the Architecture That Makes This Work

What Claude Code Actually Did (And What That Tells Builders)

The Product Implications of Client-Side Inference

When to Build Client-Side vs. Server-Side AI

NerdHeadz Team

Frequently asked questions

More essays

GLM-5.2: The Open-Weight Agent That Changes Everything

This Week in AI: GLM-5.2 Challenges the Frontier, Agents Mature, and Midjourney Goes Medical

Reasoning Models Explained: How o1, DeepSeek-R1 & RLMs Actually Work

Stay in the loop

Ready to ship something custom?