DeepSeek 4 Flash: Local Inference Engine for Metal and CUDA
If you've been following the AI space, you know DeepSeek's models have been making waves for their performance per dollar. But running them locally has always been a bit of a hassle. That's where ds4 comes in.
It's a lightweight inference engine built specifically for DeepSeek 4 Flash, targeting both Metal (Apple Silicon) and CUDA (NVIDIA GPUs). No cloud dependencies, no bloated frameworks. Just a clean executable that runs the model on your own hardware.
What It Does
ds4 is a standalone inference engine for DeepSeek 4 Flash, the latest 16B-parameter model from DeepSeek. It loads the model weights, runs inference on the GPU (Metal or CUDA), and lets you interact with the model locally. The repo provides both the C source for building the engine yourself and precompiled binaries for macOS and Linux.
Key features:
- Single-file model loading from Hugging Face (via huggingface-cli)
- Supports 4-bit and 8-bit quantization out of the box
- Uses Metal Performance Shaders (MPS) on Apple Silicon, CUDA on NVIDIA
- Minimal dependencies: just a modern C compiler + system GPU drivers
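To make the 4-bit and 8-bit options a little more concrete, here is a generic sketch of symmetric block quantization in Python. It is an illustration of the general idea only: the post doesn't describe ds4's actual on-disk format or kernels, and the function names and sample weights below are made up for the example.

```python
def quantize_block(values, bits):
    """Symmetric quantization of one block of floats to signed integer codes."""
    qmax = (1 << (bits - 1)) - 1              # 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0  # one float scale per block
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the scale and integer codes."""
    return [scale * c for c in codes]


def max_error(values, bits):
    """Largest absolute round-trip error for one block."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights, chosen only to show the precision trade-off.
weights = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48, -0.09, 0.01]
err4 = max_error(weights, 4)
err8 = max_error(weights, 8)
print(f"4-bit max round-trip error: {err4:.4f}")
print(f"8-bit max round-trip error: {err8:.4f}")
```

Fewer bits means a coarser grid and a larger round-trip error, which is the trade you make for the smaller memory footprint.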
Why It’s Cool
This isn’t another high-level Python wrapper. It’s written in C, and it shows. The code is lean, focused, and easy to understand if you're comfortable with C. The author (antirez, the creator of Redis) clearly values speed and simplicity over abstraction layers.
A few things stand out:
- Quantization is built in – You don't need separate tools to quantize the model. Just download the weights and run ds4 with --q4 or --q8. This saves a ton of GPU memory: 4-bit weights take about a quarter of the space of FP16, so roughly 8GB instead of 32GB for 16B parameters.
- Metal support is first-class – Many inference engines treat Metal as an afterthought. ds4 uses MPS kernels directly, so Apple Silicon Macs get true native performance.
- Minimal memory overhead – The engine itself uses <100MB of RAM. The model weights are the only real memory cost.
- No bloat – No Python runtime, no Docker, no pip installs. It's a single binary.
For a practical use case, imagine running a 16B model on a MacBook with 16GB of unified memory. With 4-bit quantization, the roughly 8GB of weights could fit in memory without swapping to disk. That's wild for a local model.
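Memory figures like these are easy to sanity-check with simple arithmetic. The sketch below assumes the 16B parameter count quoted in this post and counts the weights only. It ignores per-block quantization scales, the KV cache, and activations, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 16e9  # parameter count quoted in this post (assumption)

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

The function is just parameters times bits divided by eight, so you can plug in any other model size or bit width to see whether it has a chance of fitting on your machine.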
How to Try It
The repo has clear instructions. Here's the quick start: