Smart Routing for Local LLMs: Keep Quality High, Keep Costs Low
If you're building with local LLMs, you've probably felt the tension. You want the best possible response for complex tasks, but you also don't want to burn GPU cycles (or time) on simple queries. What if you could automatically send each request to the most appropriate model, balancing quality and efficiency? That's the idea behind UncommonRoute.
It's a lightweight router that sits between your application and your local LLMs. Think of it as a smart traffic director for your inference endpoints. You define your available models and their strengths, and it handles the rest, aiming to maintain high-quality outputs while optimizing resource usage and cost.
What It Does
UncommonRoute is a model router designed for local LLM setups. You configure it with the endpoints of your running models (like Ollama, LM Studio, or vLLM instances) and set some simple rules or priorities. When your app sends a prompt, the router analyzes it and decides which model should handle it, then returns that model's response seamlessly.
The goal is straightforward: use your most capable (and likely more expensive, resource-heavy) model only when necessary. Simpler, repetitive, or well-defined tasks can go to a smaller, faster model. This keeps your average response time and computational cost down without sacrificing quality where it matters.
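To make the "cheapest model that is good enough" idea concrete, here is a minimal sketch in TypeScript. It is not UncommonRoute's actual configuration format or API: the ModelEntry shape, the capability and relativeCost scores, and the model names are all assumptions made up for illustration. The two ports are the usual defaults for Ollama (11434) and vLLM (8000).

```typescript
// Hypothetical shape for a model entry; UncommonRoute's real config may differ.
interface ModelEntry {
  name: string;
  baseUrl: string;      // OpenAI-compatible endpoint of a local server
  capability: number;   // made-up 1-10 score that you assign yourself
  relativeCost: number; // made-up weight for GPU time, latency, and so on
}

const models: ModelEntry[] = [
  // Ollama's default port is 11434; the model name is illustrative.
  { name: "small-7b", baseUrl: "http://localhost:11434/v1", capability: 4, relativeCost: 1 },
  // vLLM's default port is 8000; the model name is illustrative.
  { name: "large-70b", baseUrl: "http://localhost:8000/v1", capability: 9, relativeCost: 10 },
];

// Pick the cheapest model whose capability meets the requirement.
// If none qualifies, fall back to the most capable model available.
function cheapestCapable(pool: ModelEntry[], required: number): ModelEntry {
  if (pool.length === 0) throw new Error("no models configured");
  const capable = pool.filter((m) => m.capability >= required);
  if (capable.length > 0) {
    return capable.reduce((a, b) => (b.relativeCost < a.relativeCost ? b : a));
  }
  return pool.reduce((a, b) => (b.capability > a.capability ? b : a));
}

// An easy task (required capability 3) lands on the small model.
console.log(cheapestCapable(models, 3).name);
// A demanding task (required capability 7) lands on the large model.
console.log(cheapestCapable(models, 7).name);
```

The hard part in practice is estimating the required capability for a given prompt; the scores here are placeholders you would tune against your own workloads.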
Why It's Cool
The clever part isn't just the routing; it's the simplicity and developer-centric approach. You're not locked into a complex ecosystem, because it works with the local models you already have running. The routing logic can be based on anything you can program: prompt classification, keyword matching, cost thresholds, or even dynamic performance metrics.
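As an illustration of programmable routing logic, the sketch below uses an ordered list of rules where the first match wins. This is an assumption about how such logic could be written, not the project's built-in rule engine: the Rule type, the keyword lists, and the 2000-character threshold are all hypothetical.

```typescript
type Tier = "small" | "large";

// Hypothetical rule shape: a predicate over the prompt plus a target tier.
interface Rule {
  description: string;
  matches: (prompt: string) => boolean;
  tier: Tier;
}

// Rules are checked in order, so put the most specific ones first.
const rules: Rule[] = [
  {
    description: "creative or open-ended writing",
    matches: (p) => /\b(story|poem|essay|brainstorm)\b/i.test(p),
    tier: "large",
  },
  {
    description: "formatting and classification chores",
    matches: (p) => /\b(format|classify|extract|convert)\b/i.test(p),
    tier: "small",
  },
  {
    description: "very long prompts",
    matches: (p) => p.length > 2000,
    tier: "large",
  },
];

// Return the tier of the first matching rule, or the fallback tier.
function routeByRules(prompt: string, fallback: Tier = "small"): Tier {
  for (const rule of rules) {
    if (rule.matches(prompt)) return rule.tier;
  }
  return fallback;
}

console.log(routeByRules("Write a short poem about autumn"));
console.log(routeByRules("Convert this CSV to JSON"));
```

Keyword matching is crude, but it is cheap and easy to debug. A predicate could just as well call a small classifier model or read recent latency figures, as long as it returns a boolean.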
This opens up clean use cases:
- Tiered Quality: Send creative writing to your 70B parameter model, but route simple data formatting or classification to a speedy 7B model.
- Fallback Handling: If your primary model is busy or fails, automatically reroute to a backup.
- Cost-Aware Development: Experiment and prototype with smaller, cheaper models, and only scale up for production or final outputs.
- Hybrid Clouds: In theory, you could even mix local and paid API models, using local models for most tasks and calling GPT-4 or Claude only for specific, high-stakes prompts.
It turns your collection of models into a coordinated team, rather than a set of isolated tools.
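The Fallback Handling case in the list above can be sketched as a loop over backends with a timeout on each attempt. Everything here is an assumption for illustration: the Backend type, the withFallback helper, and the stand-in backends are hypothetical, not part of UncommonRoute. In a real setup each backend would wrap an HTTP call to one of your model servers.

```typescript
// A backend is any async function that turns a prompt into a completion.
type Backend = (prompt: string) => Promise<string>;

// Try each backend in order. Move on if one throws or exceeds timeoutMs.
async function withFallback(
  prompt: string,
  backends: Backend[],
  timeoutMs = 30_000,
): Promise<string> {
  let lastError: unknown = new Error("no backends configured");
  for (const backend of backends) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      // A promise that only ever rejects, used to cap the waiting time.
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error("backend timed out")), timeoutMs);
      });
      return await Promise.race([backend(prompt), timeout]);
    } catch (err) {
      lastError = err; // remember the failure and try the next backend
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}

// Stand-in backends so the sketch runs without any model server.
const primary: Backend = async () => {
  throw new Error("primary is down");
};
const backup: Backend = async (p) => `backup answered: ${p}`;

withFallback("ping", [primary, backup]).then((reply) => console.log(reply));
```

Note that the timeout only stops waiting. It does not cancel the slow request, so a production version would probably pass an AbortSignal down to the HTTP client as well.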
How to Try It
The project is on GitHub, and getting started looks familiar. It's a Node.js/TypeScript project.
- Clone the repo:
    git clone https://github.com/CommonstackAI/UncommonRoute.git
    cd UncommonRoute