Oncillo Blog

Deep dives into on-device AI, inference optimization, and the engineering behind Oncillo.

Latest
Models · Applications

Gemma 4 on Oncillo: The first model you can talk to, show things, and trust to know when it needs help

Gemma 4 runs natively on your device with real-time voice, vision, and audio, and routes hard problems to the cloud when it should.

Henry Ndubuaku · 8 min read
Research · Quantization · Models

TurboQuant-H: Hadamard Rotation for 2-Bit Embedding Quantization

A simplified offline variant of TurboQuant using Hadamard rotation and per-group Lloyd-Max codebooks — 4× compression of per-layer embeddings in Gemma 4 E2B at +0.06 PPL.

Karen Mosoyan & Henry Ndubuaku · 12 min read
Models · Benchmarks

LFM-2.5-350m on Oncillo: 140 tok/sec, Single Core, 355 MB

Benchmarking Liquid's LFM-2.5-350m across seven devices with Oncillo. INT8 quantization, single-core CPU decode, zero-copy loading, and why this configuration makes on-device inference practical.

Henry Ndubuaku · 8 min read
Transcription · Hybrid AI

Sub-150ms Transcription with Cloud-Level Accuracy: Why We Built a Hybrid Engine

How Oncillo combines on-device and cloud inference for real-time speech transcription with sub-150ms latency and automatic cloud handoff for noisy audio.

Roman Shemet · 5 min read
Transcription · Models

Ridiculously Fast On-Device Transcription: Reviewing Parakeet CTC 1.1B with Oncillo

A review of NVIDIA's Parakeet-CTC-1.1B model running locally on Mac with Oncillo: architecture breakdown, benchmarks, and transcription use cases.

Satyajit Kumar & Henry Ndubuaku · 12 min read
Models · Applications

The Sweet Spot for Mac Code Use: Reviewing LFM2 24B MoE A2B with Oncillo

A review of LiquidAI's LFM2-24B-A2B mixture-of-experts model running locally on Mac with Oncillo: architecture breakdown, benchmarks, and coding-agent use cases.

Noah Cylich & Henry Ndubuaku · 10 min read