Mobile AI Field Notes

Local models on phones

Guidance for running on-device and in-browser models, with practical sizing and rollout considerations.

Highlights

What works well on phones

Smaller models are the most practical path for real-world mobile performance.

Small LLMs: ~0.5B–3B params (7B only with heavy quantization) for chat and summaries.
On-device vision: great for labeling, OCR-style tasks, and embeddings.
Hard constraints: large chat-class models and long contexts are memory-heavy; see the sizing sketch after this list.
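
As a sanity check on the memory point, the sketch below estimates the footprint of a quantized model plus its attention KV cache. The formulas are common planning heuristics, and the 32-layer / 3072-hidden-size figures are illustrative stand-ins for a 3B-class model, not measurements from these notes.

```ts
// Back-of-envelope RAM check for a quantized on-device model.
// Weights: params * bitsPerWeight / 8 bytes.
// KV cache: 2 (K and V) * layers * hiddenSize * 2 bytes (fp16) per cached token.

function weightsGB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1e9;
}

function kvCacheGB(layers: number, hiddenSize: number, contextTokens: number): number {
  return (2 * layers * hiddenSize * 2 * contextTokens) / 1e9;
}

// Illustrative 3B-class model at 4-bit quantization with a 4k-token context.
const weights = weightsGB(3e9, 4);        // ~1.5 GB of weights
const cache = kvCacheGB(32, 3072, 4096);  // ~1.6 GB of KV cache
console.log(`~${(weights + cache).toFixed(1)} GB before runtime overhead`);
```

At a 4k context the cache alone rivals the quantized weights, which is why long contexts land in the hard-constraints column.
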
Browser options

Local inference in the browser

WebGPU leads for performance, with WASM as a compatibility fallback.

WebGPU: fast GPU-backed inference when supported by the device and browser.
WASM fallback: works more broadly, but slower than GPU paths.
Capability checks: detect WebGPU, RAM, and iOS support before choosing a model; see the probe sketch after this list.
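
A minimal capability probe could look like the sketch below. The tier names and thresholds are placeholders; note that navigator.deviceMemory is Chromium-only and iOS detection via user agent is inherently brittle.

```ts
// Probe the browser, then pick a model tier. Tier names and the 4 GB / 3 GB
// cutoffs are illustrative assumptions, not values from these notes.
type ModelTier = "webgpu-small-llm" | "wasm-tiny-llm" | "cloud-only";

async function pickModelTier(): Promise<ModelTier> {
  // WebGPU: navigator.gpu is only defined where the API exists, and
  // requestAdapter() resolves to null when no suitable adapter is available.
  const gpu = (navigator as any).gpu;
  let hasWebGPU = false;
  if (gpu) {
    try {
      hasWebGPU = (await gpu.requestAdapter()) !== null;
    } catch {
      hasWebGPU = false;
    }
  }

  // Device memory: missing on Safari/Firefox; treat "unknown" as low.
  const deviceMemoryGB: number = (navigator as any).deviceMemory ?? 2;

  // iOS detection: user-agent sniffing is brittle; newer iPads report as Mac.
  const ua = navigator.userAgent;
  const isIOS =
    /iPhone|iPad|iPod/.test(ua) ||
    (ua.includes("Mac") && "ontouchend" in document);

  if (hasWebGPU && deviceMemoryGB >= 4) return "webgpu-small-llm";
  if (!isIOS && deviceMemoryGB >= 3) return "wasm-tiny-llm";
  return "cloud-only";
}
```
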
Native options

Android, iOS, and Flutter paths

Native runtimes unlock the best on-device performance and stability.

Android: quantized model + NNAPI/GPU delegate runtime for offline assistants.
iOS: Core ML or Metal-backed inference with tight memory budgets.
Flutter: FFI or platform channels to bridge into native runtimes.

Portal architecture

Suggested rollout plan

A mobile-first blend of on-device privacy and optional cloud power.

Local embeddings: private semantic search that stays fast and offline-friendly.
On-device chat: small helper model for quick answers and summaries.
Cloud escalation: opt-in upgrade for harder prompts with clear privacy controls; see the routing sketch after this list.
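
A minimal sketch of how those pieces could fit together, assuming an embedding model has already produced the vectors. EmbeddedDoc, the length-based difficulty heuristic, and the 2000-character threshold are placeholders, not part of the plan above.

```ts
// On-device semantic search plus an opt-in cloud escalation policy.

interface EmbeddedDoc { text: string; vector: number[]; }

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Local semantic search: rank on-device documents against the query vector.
function topK(query: number[], docs: EmbeddedDoc[], k = 3): EmbeddedDoc[] {
  return docs
    .map((doc) => ({ doc, score: cosine(query, doc.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.doc);
}

interface Route { target: "on-device" | "cloud"; reason: string; }

// Default to the private on-device path; escalate only when the prompt looks
// too hard for the small model AND the user has opted in to cloud processing.
function routeRequest(prompt: string, cloudOptIn: boolean): Route {
  const looksHard = prompt.length > 2000; // placeholder difficulty heuristic
  if (looksHard && cloudOptIn) {
    return { target: "cloud", reason: "long prompt, user opted in" };
  }
  return { target: "on-device", reason: "default: private, offline-friendly path" };
}
```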

Define the target devices

Confirm the Android/iOS split and how much of the experience must work fully offline.

Pick model sizes + runtimes

Shortlist quantized model sizes for pilot devices and decide on WebGPU vs. native paths.
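
One lightweight way to keep that shortlist honest is to record it as data the prototype can read. Every tier, size, and runtime value below is a placeholder to be replaced with the real pilot choices.

```ts
// Pilot shortlist: one row per device tier, filled in from actual benchmarks.
type Runtime = "webgpu" | "wasm" | "android-native" | "ios-coreml";

interface PilotTarget {
  tier: string;          // e.g. "mid-range Android, 6 GB RAM"
  modelParams: string;   // quantized model size shortlisted for this tier
  quantization: string;
  runtime: Runtime;
  offlineRequired: boolean;
}

const shortlist: PilotTarget[] = [
  { tier: "flagship Android (12 GB)", modelParams: "3B", quantization: "Q4", runtime: "android-native", offlineRequired: true },
  { tier: "mid-range Android (6 GB)", modelParams: "1B", quantization: "Q4", runtime: "android-native", offlineRequired: true },
  { tier: "recent iPhone",            modelParams: "1B", quantization: "Q4", runtime: "ios-coreml",     offlineRequired: true },
  { tier: "desktop browser",          modelParams: "3B", quantization: "Q4", runtime: "webgpu",         offlineRequired: false },
];
```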

Prototype and measure

Benchmark latency, heat, and battery impact across a few representative phones.
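
For the latency part of that measurement, a small browser-side probe can record time-to-first-token and tokens/sec. The streaming generate() callback below is a hypothetical interface; heat and battery still need platform profilers rather than JavaScript.

```ts
// Time a streaming generation call: time-to-first-token and tokens/sec.
async function benchmark(
  generate: (prompt: string) => AsyncIterable<string>, // assumed streaming API
  prompt: string
): Promise<{ firstTokenMs: number; tokensPerSec: number }> {
  const start = performance.now();
  let firstTokenMs = 0;
  let tokens = 0;

  for await (const _token of generate(prompt)) {
    if (tokens === 0) firstTokenMs = performance.now() - start;
    tokens++;
  }

  const totalSec = (performance.now() - start) / 1000;
  return { firstTokenMs, tokensPerSec: tokens / totalSec };
}
```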