Mobile AI Field Notes

Local models on phones

Guidance for running on-device and in-browser models, with practical sizing and rollout considerations.

Highlights

What works well on phones

Smaller models are the most practical path for real-world mobile performance.

Small LLMs: ~0.5B–3B params (7B only with heavy quantization) for chat and summaries.
On-device vision: great for labeling, OCR-style tasks, and embeddings.
Hard constraints: large chat-class models and long contexts are memory-heavy; see the sizing sketch after this list.
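
As a sanity check on the memory point, the sketch below estimates the footprint of a quantized model plus its attention KV cache. The formulas are common planning heuristics, and the 32-layer / 3072-hidden-size figures are illustrative stand-ins for a 3B-class model, not measurements from these notes.

```ts
// Back-of-envelope RAM check for a quantized on-device model.
// Weights: params * bitsPerWeight / 8 bytes.
// KV cache: 2 (K and V) * layers * hiddenSize * 2 bytes (fp16) per cached token.

function weightsGB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1e9;
}

function kvCacheGB(layers: number, hiddenSize: number, contextTokens: number): number {
  return (2 * layers * hiddenSize * 2 * contextTokens) / 1e9;
}

// Illustrative 3B-class model at 4-bit quantization with a 4k-token context.
const weights = weightsGB(3e9, 4);        // ~1.5 GB of weights
const cache = kvCacheGB(32, 3072, 4096);  // ~1.6 GB of KV cache
console.log(`~${(weights + cache).toFixed(1)} GB before runtime overhead`);
```

At a 4k context the cache alone rivals the quantized weights, which is why long contexts land in the hard-constraints column.
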
Browser options

Local inference in the browser

WebGPU leads for performance, with WASM as a compatibility fallback.

WebGPU: fast GPU-backed inference when supported by the device and browser.
WASM fallback: works more broadly, but slower than GPU paths.
Capability checks: detect WebGPU, RAM, and iOS support before choosing a model; see the probe sketch after this list.
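
A minimal capability probe could look like the sketch below. The tier names and thresholds are placeholders; note that navigator.deviceMemory is Chromium-only and iOS detection via user agent is inherently brittle.

```ts
// Probe the browser, then pick a model tier. Tier names and the 4 GB / 3 GB
// cutoffs are illustrative assumptions, not values from these notes.
type ModelTier = "webgpu-small-llm" | "wasm-tiny-llm" | "cloud-only";

async function pickModelTier(): Promise<ModelTier> {
  // WebGPU: navigator.gpu is only defined where the API exists, and
  // requestAdapter() resolves to null when no suitable adapter is available.
  const gpu = (navigator as any).gpu;
  let hasWebGPU = false;
  if (gpu) {
    try {
      hasWebGPU = (await gpu.requestAdapter()) !== null;
    } catch {
      hasWebGPU = false;
    }
  }

  // Device memory: missing on Safari/Firefox; treat "unknown" as low.
  const deviceMemoryGB: number = (navigator as any).deviceMemory ?? 2;

  // iOS detection: user-agent sniffing is brittle; newer iPads report as Mac.
  const ua = navigator.userAgent;
  const isIOS =
    /iPhone|iPad|iPod/.test(ua) ||
    (ua.includes("Mac") && "ontouchend" in document);

  if (hasWebGPU && deviceMemoryGB >= 4) return "webgpu-small-llm";
  if (!isIOS && deviceMemoryGB >= 3) return "wasm-tiny-llm";
  return "cloud-only";
}
```
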
Native options

Android, iOS, and Flutter paths

Native runtimes unlock the best on-device performance and stability.

Android: quantized model + NNAPI/GPU delegate runtime for offline assistants.
iOS: Core ML or Metal-backed inference with tight memory budgets.
Flutter: FFI or platform channels to bridge into native runtimes.

Portal architecture

Suggested rollout plan

A mobile-first blend of on-device privacy and optional cloud power.

Local embeddings: private semantic search that stays fast and offline-friendly.
On-device chat: small helper model for quick answers and summaries.
Cloud escalation: opt-in upgrade for harder prompts with clear privacy controls; see the routing sketch after this list.
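
A minimal sketch of how those pieces could fit together, assuming an embedding model has already produced the vectors. EmbeddedDoc, the length-based difficulty heuristic, and the 2000-character threshold are placeholders, not part of the plan above.

```ts
// On-device semantic search plus an opt-in cloud escalation policy.

interface EmbeddedDoc { text: string; vector: number[]; }

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Local semantic search: rank on-device documents against the query vector.
function topK(query: number[], docs: EmbeddedDoc[], k = 3): EmbeddedDoc[] {
  return docs
    .map((doc) => ({ doc, score: cosine(query, doc.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.doc);
}

interface Route { target: "on-device" | "cloud"; reason: string; }

// Default to the private on-device path; escalate only when the prompt looks
// too hard for the small model AND the user has opted in to cloud processing.
function routeRequest(prompt: string, cloudOptIn: boolean): Route {
  const looksHard = prompt.length > 2000; // placeholder difficulty heuristic
  if (looksHard && cloudOptIn) {
    return { target: "cloud", reason: "long prompt, user opted in" };
  }
  return { target: "on-device", reason: "default: private, offline-friendly path" };
}
```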

Define the target devices

Confirm the Android/iOS split and how much of the experience must work fully offline.

Pick model sizes + runtimes

Shortlist quantized model sizes for pilot devices and decide on WebGPU vs. native paths.
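
One lightweight way to keep that shortlist honest is to record it as data the prototype can read. Every tier, size, and runtime value below is a placeholder to be replaced with the real pilot choices.

```ts
// Pilot shortlist: one row per device tier, filled in from actual benchmarks.
type Runtime = "webgpu" | "wasm" | "android-native" | "ios-coreml";

interface PilotTarget {
  tier: string;          // e.g. "mid-range Android, 6 GB RAM"
  modelParams: string;   // quantized model size shortlisted for this tier
  quantization: string;
  runtime: Runtime;
  offlineRequired: boolean;
}

const shortlist: PilotTarget[] = [
  { tier: "flagship Android (12 GB)", modelParams: "3B", quantization: "Q4", runtime: "android-native", offlineRequired: true },
  { tier: "mid-range Android (6 GB)", modelParams: "1B", quantization: "Q4", runtime: "android-native", offlineRequired: true },
  { tier: "recent iPhone",            modelParams: "1B", quantization: "Q4", runtime: "ios-coreml",     offlineRequired: true },
  { tier: "desktop browser",          modelParams: "3B", quantization: "Q4", runtime: "webgpu",         offlineRequired: false },
];
```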

Prototype and measure

Benchmark latency, heat, and battery impact across a few representative phones.
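
For the latency part of that measurement, a small browser-side probe can record time-to-first-token and tokens/sec. The streaming generate() callback below is a hypothetical interface; heat and battery still need platform profilers rather than JavaScript.

```ts
// Time a streaming generation call: time-to-first-token and tokens/sec.
async function benchmark(
  generate: (prompt: string) => AsyncIterable<string>, // assumed streaming API
  prompt: string
): Promise<{ firstTokenMs: number; tokensPerSec: number }> {
  const start = performance.now();
  let firstTokenMs = 0;
  let tokens = 0;

  for await (const _token of generate(prompt)) {
    if (tokens === 0) firstTokenMs = performance.now() - start;
    tokens++;
  }

  const totalSec = (performance.now() - start) / 1000;
  return { firstTokenMs, tokensPerSec: tokens / totalSec };
}
```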