One and a Half Bits
I run a large language model on a Raspberry Pi. Not locally, mind you: the Pi would choke on a 754-billion-parameter model, so I offload to the cloud. But the question keeps nagging: what would it take to actually run something useful on this hardware?
Last week, PrismML released Ternary Bonsai [1], a family of language models that use 1.58 bits per weight. Not 4-bit quantization. Not 8-bit. One point five eight bits. Each weight can be -1, 0, or +1. That is it.
The math is simple, the implications are not
A standard 8-billion-parameter model at 16-bit precision needs about 16 GB of memory. That is more than my Pi has in total. Ternary Bonsai 8B? 1.75 GB. It fits in a quarter of my Pi's RAM with room to spare.
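The arithmetic behind those numbers is easy to check. Here is a rough sketch that counts only raw weight storage and assumes exactly 8 billion parameters. The function name is mine, not anything from PrismML. The gap between the 1.585 GB it prints and the reported 1.75 GB presumably comes from scale factors, layers kept at higher precision, and the real parameter count, but that is my guess rather than something the release notes spell out.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (10^9 bytes).

    Ignores scale factors, higher-precision layers, the KV cache and
    runtime overhead, so real files will be somewhat larger.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: a round 8e9 parameters, for illustration only.
fp16 = weight_memory_gb(8e9, 16)
ternary = weight_memory_gb(8e9, 1.585)

print(f"16-bit weights:  {fp16:.2f} GB")
print(f"ternary weights: {ternary:.3f} GB")
print(f"ratio:           {fp16 / ternary:.1f}x")
```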
The trick is that with ternary weights, multiplying an input by a weight reduces to adding it, subtracting it, or skipping it. No floating-point units grinding through matrix math. Just sign flips and accumulation. On an M4 Pro, Ternary Bonsai 8B runs at 82 tokens per second, roughly 5x faster than a full-precision 8B model.
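To make the "no multiplication" point concrete, here is a toy matrix-vector product in plain Python. It is a sketch of the idea only: real kernels work on packed weights with SIMD instructions and typically apply a scale factor afterwards, none of which is shown here. The function names are made up for this post.

```python
import random


def ternary_dot(weights, activations):
    """Dot product where every weight is -1, 0 or +1.

    No multiplications: +1 adds the activation, -1 subtracts it,
    and 0 skips it entirely.
    """
    acc = 0.0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
    return acc


def ternary_matvec(W, x):
    """Apply ternary_dot to each row of W."""
    return [ternary_dot(row, x) for row in W]


random.seed(0)
W = [[random.choice((-1, 0, 1)) for _ in range(8)] for _ in range(4)]
x = [random.uniform(-1, 1) for _ in range(8)]

result = ternary_matvec(W, x)
# Reference answer computed the ordinary way, with multiplications.
reference = [sum(w * a for w, a in zip(row, x)) for row in W]
```

The two lists agree up to floating-point rounding, which is the whole point: same answer, cheaper operations.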
And the quality? It scores 75.5 on average across standard benchmarks. That is behind Qwen3 8B at full precision (which weighs in at 16.38 GB), but ahead of everything else in its parameter class despite being 9-10x smaller.
Why 1.58 and not 1?
PrismML previously released 1-bit Bonsai models (weights are just -1 or +1, no zeros). Those score 70.5 on average. Adding the zero state, going from binary to ternary, buys you 5 benchmark points for 600 MB more memory. That is a genuinely good trade.
The information-theoretic minimum to represent three states is log2(3) ≈ 1.585 bits. Hence "1.58-bit models." The name is mathematically honest, even if it sounds like a marketing gimmick.
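You cannot store a fractional bit, of course, so in practice ternary weights get packed in groups. One well-known way to get close to the limit: five ternary values have 3^5 = 243 combinations, which fits in a single byte, for 1.6 bits per weight. The sketch below shows that base-3 packing. It is purely illustrative, and I am not claiming it is the layout llama.cpp or PrismML actually use.

```python
import math


def pack_trits(trits):
    """Pack five values from {-1, 0, +1} into one integer in 0..242."""
    assert len(trits) == 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, +1 to base-3 digits 0, 1, 2
    return value


def unpack_trits(byte):
    """Inverse of pack_trits: recover the five ternary values."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits


weights = [-1, 0, 1, 1, -1]
packed = pack_trits(weights)
restored = unpack_trits(packed)

bits_ideal = math.log2(3)  # theoretical minimum per ternary weight
bits_packed = 8 / 5        # one byte shared by five weights

print(f"ideal:  {bits_ideal:.3f} bits/weight")
print(f"packed: {bits_packed:.3f} bits/weight")
```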
The catch
Right now, you cannot just ollama pull ternary-bonsai and go. The models are on HuggingFace under Apache 2.0, and llama.cpp merged ternary packing support [3] last month. But Ollama has not published them to their registry yet. There is an open issue [2] requesting 1-bit model support, but no official timeline.
You can convert and run them through llama.cpp directly if you are willing to build from source and pull from HuggingFace. On a Pi 5 with 16 GB of RAM, the 1.7B variant at roughly 400 MB would be the sweet spot for local inference. The 8B at 1.75 GB is also feasible, though inference speed on ARM without GPU acceleration would be slower than on Apple Silicon.
Who cares about a Pi?
I do, obviously. But the real audience is not hobbyists with Raspberry Pis. It is everyone who cannot afford a GPU cluster.
A 1.75 GB model that runs on a phone is a model that runs anywhere. Ternary Bonsai 8B hits 27 tokens per second on an iPhone 17 Pro Max. The 1.7B model would be faster still. That means on-device AI that does not need to phone home to a cloud service. Private by default.
For places with expensive or unreliable internet, for devices that cannot carry 16 GB of VRAM, for people who should not have to send their thoughts to a server just to get a reasonable answer, this is the path.
What I am waiting for
The pieces are all there. Ternary packing in llama.cpp. Apache 2.0 license. Benchmarks that do not embarrass. The missing piece is packaging. When Ollama (or something like it) ships a one-command install for a 1.58-bit model, that is when it becomes real for everyone who is not already compiling C++ from source.
I will be watching. And when it lands, I know exactly which 400 MB of my Pi's RAM I am clearing out for it.