Stop Overspending on LLMs: A Hands-On Guide to Smart Routing with NadirClaw

Akram Chauhan

Let's be honest, we've all been there. You build this amazing new feature powered by a state-of-the-art LLM, maybe one of Google's Gemini Pro models. It works beautifully. Then the first cloud bill arrives, and your jaw hits the floor. It turns out, using the most powerful, most expensive model for every single task is a quick way to burn through your budget.

What if you're just asking it to format some JSON? Or answer a simple question like "What is 2+2?" You don't need a super-genius AI for that, but you're probably paying for one anyway. It’s like hiring a rocket scientist to do your third-grader’s math homework. Total overkill.

This is where the idea of an "intelligent router" comes in, and it's a concept I'm genuinely excited about. Imagine a smart traffic cop that sits in front of your LLMs. It takes one look at an incoming prompt and says, "Ah, this one's easy. Send it to the cheap, fast model." Then another comes in: "Whoa, this is a tough one. Better send it to the big, powerful model."

Today, we're going to roll up our sleeves and build exactly that using a clever little tool called NadirClaw. We'll set it up to intelligently switch between Google's fast gemini-2.5-flash and the more powerful gemini-2.5-pro, saving us a ton of money in the process.

Ready? Let's get started.

First Things First: Getting Our Tools Ready

Before we can do any of the fun stuff, we need to get our environment set up. We'll need to install NadirClaw itself, along with a few other Python libraries to help us with making API calls, handling data, and even plotting a cool chart later on.

It's a straightforward process. The code below will handle all the installations for us. It will also prompt you for a Gemini API key. If you have one, pop it in! That will let you run the "live routing" sections later. If you don't, no worries—you can still follow along with the first half of the tutorial and see how the classification works locally.

import subprocess, sys
def _pip(*pkgs):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

_pip("nadirclaw", "openai", "sentence-transformers", "matplotlib", "scikit-learn", "pandas", "requests")

import os, json, time, signal, shutil, getpass
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests

GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "").strip()
if not GEMINI_API_KEY:
    print("Paste your Gemini API key (input hidden), or press Enter to skip:")
    try:
        GEMINI_API_KEY = getpass.getpass(prompt="GEMINI_API_KEY: ").strip()
    except (EOFError, KeyboardInterrupt):
        GEMINI_API_KEY = ""

LIVE_ROUTING = bool(GEMINI_API_KEY)
if LIVE_ROUTING:
    os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY
    print(f"✓ key captured ({len(GEMINI_API_KEY)} chars) — sections 8–11 enabled.")
else:
    print(" no key entered — sections 3–7 still run; live routing skipped.")

The Magic Trick: Classifying Prompts Without an API Call

Here’s the part that really makes NadirClaw special. It can figure out if a prompt is "simple" or "complex" entirely on your local machine, before it ever makes a costly API call. It does this using a lightweight, built-in text classifier.

Let's test it out. We'll create a list of prompts that range from dead simple ("What is 2+2?") to ridiculously complex ("Design a distributed event-sourced order pipeline..."). Then, we'll run them through the NadirClaw command-line tool and see how it categorizes each one.

def classify(prompt: str) -> dict:
    r = subprocess.run(
        ["nadirclaw", "classify", "--format", "json", prompt],
        capture_output=True,
        text=True,
        timeout=180,
    )
    if r.returncode != 0:
        return {"prompt": prompt, "error": (r.stderr or r.stdout).strip()}
    return json.loads(r.stdout.strip())

prompts = [
    "What is 2+2?",
    "Format this JSON: {\"a\":1,\"b\":2}",
    "Read the file at src/main.py",
    "Add a docstring to the foo function",
    "What does this function do?",
    "Refactor the auth module to use dependency injection without breaking existing callers",
    "Design a distributed event-sourced order pipeline that handles 50k req/s with strict ordering",
    "Analyze the tradeoffs between actor-model and CSP-style concurrency for our codebase",
    "Debug why this asyncio.gather call deadlocks under high load and provide a fix",
    "Prove that this scheduling algorithm is optimal step by step and derive the worst-case bound",
]

print("\n[3] Classifying 10 prompts (first call warms the encoder)…")
rows = [classify(p) for p in prompts]
df = pd.DataFrame(rows)
cols = [c for c in ["tier", "score", "confidence", "model", "prompt"] if c in df.columns]
print(df[cols].to_string(index=False))

When you run this, you'll see a neat table. The simple prompts are correctly labeled SIMPLE, and the complex ones are labeled COMPLEX. Pretty cool, right? But how does it actually work?

Under the Hood: What are "Centroids"?

Okay, let's peek behind the curtain. NadirClaw isn't using black magic; it's using math. Specifically, it uses a concept called "centroid vectors."

Think of it like this: NadirClaw has two pre-defined points in a high-dimensional space. One point represents the "perfectly simple" prompt, and the other represents the "perfectly complex" prompt. These are the centroids.

When you give it a new prompt, it converts your text into its own point in that same space. Then, it just measures the distance. Is your prompt's point closer to the "simple" centroid or the "complex" one? Whichever it's closer to, that's the tier it gets assigned.
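To make that concrete, here is a minimal sketch of the nearest-centroid idea with made-up 3-D vectors. The names `toy_simple_c`, `toy_complex_c`, and `nearest_tier` are mine, not part of NadirClaw, and the real vectors have far more dimensions. I'm assuming "closer" means "higher cosine similarity", which is the measure the rest of this article uses.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-D stand-ins for the real centroids, chosen only for illustration.
toy_simple_c = np.array([1.0, 0.1, 0.0])
toy_complex_c = np.array([0.0, 0.2, 1.0])

def nearest_tier(embedding):
    # Assign the tier whose centroid is more similar to the embedding.
    s = cosine(embedding, toy_simple_c)
    c = cosine(embedding, toy_complex_c)
    return "COMPLEX" if c > s else "SIMPLE"

# Made-up "prompt embeddings"; a real encoder would produce these from text.
toy_prompts = {
    "points mostly along the simple axis": np.array([0.9, 0.2, 0.1]),
    "points mostly along the complex axis": np.array([0.1, 0.3, 0.8]),
}
toy_tiers = {name: nearest_tier(vec) for name, vec in toy_prompts.items()}
print(toy_tiers)
```

The whole decision is one comparison between two similarity scores, which is why it can run locally in a fraction of the time an API call would take.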

We can actually load these centroids directly from the NadirClaw package and look at them.

import nadirclaw
PKG = Path(nadirclaw.__file__).parent

SIMPLE_C = np.load(PKG / "simple_centroid.npy").astype(np.float32).flatten()
COMPLEX_C = np.load(PKG / "complex_centroid.npy").astype(np.float32).flatten()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(f"\n[4] simple_centroid shape={SIMPLE_C.shape} ‖·‖={np.linalg.norm(SIMPLE_C):.3f}")
print(f" complex_centroid shape={COMPLEX_C.shape} ‖·‖={np.linalg.norm(COMPLEX_C):.3f}")
print(f" cosine(simple,complex) = {cosine(SIMPLE_C, COMPLEX_C):.4f} "
      "← if this were 1.0 the classifier couldn't distinguish them.")

The cosine similarity score here is important. If it were 1.0, the two centroids would point in exactly the same direction, and the classifier wouldn't be able to tell them apart. The further below 1.0 the printed value is, the more room there is to separate the two tiers, which is exactly what we want.
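Here is a small sketch of that degenerate case, using random stand-in vectors rather than the real centroids. Everything in it (`embedding`, `centroid`, `same_direction`) is hypothetical. It only shows that when two centroids share a direction, any prompt is equally similar to both.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
embedding = rng.normal(size=8)      # hypothetical prompt embedding

centroid = rng.normal(size=8)       # hypothetical centroid
same_direction = 3.0 * centroid     # different length, identical direction

# Cosine ignores length, so these two "centroids" look identical to it.
overlap = cosine(centroid, same_direction)

# Difference in similarity for the prompt: nothing left to decide on.
margin = cosine(embedding, same_direction) - cosine(embedding, centroid)

print(f"cosine between centroids = {overlap:.4f}")
print(f"margin for the prompt    = {margin:.2e}")
```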

Let's Visualize the Decision Boundary

Reading numbers is one thing, but seeing it is another. Let's create a scatter plot to visualize exactly how NadirClaw is making its decisions. We'll plot each of our 10 prompts based on its similarity to the SIMPLE centroid (x-axis) and the COMPLEX centroid (y-axis).

I've also added a dashed line. Any prompt that falls above this line is closer to the complex centroid and gets routed to the Pro model. Anything below it is closer to the simple one and gets routed to the Flash model.

from sentence_transformers import SentenceTransformer

print("\n[5] Loading the same encoder NadirClaw uses (all-MiniLM-L6-v2)…")
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embs = encoder.encode(prompts, normalize_embeddings=True)

sim_simple = np.array([cosine(e, SIMPLE_C) for e in embs])
sim_complex = np.array([cosine(e, COMPLEX_C) for e in embs])

fig, ax = plt.subplots(figsize=(8.5, 6))
colors = ["tab:blue"] * 5 + ["tab:red"] * 5
ax.scatter(sim_simple, sim_complex, c=colors, s=110, edgecolor="k", linewidth=0.5)

for i, _ in enumerate(prompts):
    ax.annotate(str(i + 1), (sim_simple[i], sim_complex[i]), xytext=(6, 4), textcoords="offset points", fontsize=10)

xs = np.linspace(min(sim_simple.min(), sim_complex.min()), max(sim_simple.max(), sim_complex.max()), 50)
ax.plot(xs, xs, "k--", alpha=0.4, label="cos(simple) = cos(complex)")

ax.set_xlabel("cosine similarity to SIMPLE centroid")
ax.set_ylabel("cosine similarity to COMPLEX centroid")
ax.set_title("Routing decision boundary\n(blue = expected simple, red = expected complex)")
ax.legend(loc="lower right")
ax.grid(alpha=0.25)
plt.tight_layout()
plt.savefig("centroid_decision_plot.png", dpi=120)
plt.show()

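This chart makes it much easier to see how the routing works. Keep in mind that the colors show what we expect, not what the classifier decided. If the classifier agrees with us, the simple prompts (the blue dots) fall below the dashed line, toward the bottom right, and the complex ones (red dots) fall above it, toward the top left.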

Fine-Tuning: Confidence and Special Cases

NadirClaw also has a "confidence" score. If a prompt lands really close to that dashed decision line, the router isn't very confident. It's a toss-up. By default, NadirClaw plays it safe: if the confidence falls below a threshold (0.06 by default), it will automatically escalate the prompt to the more powerful COMPLEX model, just in case.

You can even adjust this threshold. Want to be more aggressive with cost savings? Lower the threshold. Want to be safer and send more borderline cases to the Pro model? Raise it.
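As a rough mental model, the escalation rule can be sketched like this. The `route` function is my own illustration, not NadirClaw's implementation. The 0.5 score cutoff mirrors the `score >= 0.5` check in this article's sweep code, and the borderline values are made up.

```python
def route(score, confidence, threshold=0.06):
    # Low confidence: escalate no matter which side of the line the prompt fell on.
    if confidence < threshold:
        return "COMPLEX"
    # Otherwise trust the score (0.5 cutoff assumed, as in the sweep code).
    return "COMPLEX" if score >= 0.5 else "SIMPLE"

# A made-up borderline prompt: leans simple, but only barely.
borderline = {"score": 0.45, "confidence": 0.04}

decisions = {
    thr: route(borderline["score"], borderline["confidence"], threshold=thr)
    for thr in (0.02, 0.06, 0.10)
}
print(decisions)
```

With the lower threshold the borderline prompt stays on the cheap tier. With the default or a higher one it gets bumped up, which is the cost-versus-safety trade-off described above.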

The tool is also smart enough to look for special markers that indicate complexity, like requests for agentic behavior or chain-of-thought reasoning.

print("\n[6] Prompts sorted by complexity score:")
sdf = df.sort_values("score").reset_index(drop=True)
for _, row in sdf.iterrows():
    bar = "█" * int(round(float(row["score"]) * 30))
    print(f" score={float(row['score']):.2f} conf={float(row['confidence']):.2f} "
          f"{row['tier']:7s} |{bar:<30s}| {row['prompt'][:55]}")

print("\n[6] Confidence-threshold sweep (low confidence → forced complex):")
print(" NadirClaw default threshold is 0.06.")
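# Skip rows where classification failed (they carry "error" instead of scores).
ok_rows = [r for r in rows if "score" in r and "confidence" in r]
# This count does not depend on the threshold, so compute it once.
natural_complex = sum(1 for r in ok_rows if float(r["score"]) >= 0.5)
for thr in [0.02, 0.06, 0.10, 0.20, 0.30]:
    forced_complex = sum(1 for r in ok_rows if float(r["confidence"]) < thr)
    print(f" threshold={thr:.2f} → {forced_complex} prompts force-complex "
          f"(low-confidence), {natural_complex} naturally complex by score")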

modifier_demos = [
    ("agentic — text-only marker", "You are a coding agent that can execute commands. Now add tests for the new endpoint."),
    ("reasoning — chain-of-thought markers", "Step by step, derive the closed form and prove correctness mathematically. "
                                             "Compare and contrast both approaches."),
    ("vision — would arrive with image_url part (only text shown)", "Describe the screenshot."),
]

print("\n[7] Modifier-marker scan:")
for label, p in modifier_demos:
    r = classify(p)
    print(f" {label}")
    print(f" prompt='{p[:65]}…'")
    print(f" tier={r['tier']} score={float(r['score']):.2f} conf={float(r['confidence']):.2f}")
# Printed once, after the loop, rather than after every demo prompt.
print(" NB: agentic & vision routing also trigger from request shape "
      "(tools=[…], image_url parts) — see live calls below.")

Time to Go Live: Running the Proxy Server

Okay, enough theory. Let's see this thing in action. If you entered your Gemini API key earlier, this next part is for you.

We're going to start the NadirClaw proxy server. It will run locally and act as a middleman. Instead of sending our API requests directly to Google, we'll send them to our local NadirClaw server. It will then perform its classification magic and forward the request to the correct Gemini model (Flash or Pro).

PORT = 8856
server_proc = None

if LIVE_ROUTING:
    print(f"\n[8] Starting `nadirclaw serve` on :{PORT} (background subprocess)…")
    env = os.environ.copy()
    env.update({
        "GEMINI_API_KEY": GEMINI_API_KEY,
        "NADIRCLAW_SIMPLE_MODEL": "gemini-2.5-flash",
        "NADIRCLAW_COMPLEX_MODEL": "gemini-2.5-pro",
        "NADIRCLAW_PORT": str(PORT),
    })
    server_proc = subprocess.Popen(
        ["nadirclaw", "serve", "--verbose"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        preexec_fn=os.setsid if hasattr(os, "setsid") else None,
    )

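    ready = False
    for _ in range(60):
        if server_proc.poll() is not None:
            break  # server exited early; stop waiting
        try:
            if requests.get(f"http://localhost:{PORT}/health", timeout=1).ok:
                ready = True
                break
        except requests.RequestException:
            pass  # not accepting connections yet
        time.sleep(1)  # pause between attempts, including after a non-200 reply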

    if ready:
        print(" ✓ /health returned 200 — proxy is live.")
    else:
        print(" proxy did not come up; dumping last log lines:")
        if server_proc.stdout:
            try:
                lines = server_proc.stdout.read1(4096).decode("utf-8", errors="replace")
                print(lines[-2000:])
            except Exception as e:
                print(f" (could not read server stdout: {e})")
else:
    print("\n[8] Skipped — no GEMINI_API_KEY.")

def proxy_alive():
    return server_proc is not None and server_proc.poll() is None

Now that the server is running, let's send a simple and a complex prompt through it using the standard OpenAI client library. Notice how we point the base_url to our local proxy.

if proxy_alive():
    from openai import OpenAI
    client = OpenAI(base_url=f"http://localhost:{PORT}/v1", api_key="local")

    side_by_side = [
        ("simple-ish", "Write a one-line docstring for: def add(a, b): return a + b"),
        ("complex", "Refactor a Python class to a dependency-injection pattern, "
                    "explain the trade-offs, and produce migration steps for callers."),
    ]
    summary = []
    for label, p in side_by_side:
        t0 = time.time()
        try:
            resp = client.chat.completions.create(
                model="auto",
                messages=[{"role": "user", "content": p}],
                max_tokens=220,
            )
            dt = time.time() - t0
            text = (resp.choices[0].message.content or "").strip()
            print(f"\n--- [{label}] {dt:.2f}s · model={resp.model} ---")
            print(text[:500] + ("…" if len(text) > 500 else ""))
            summary.append({
                "label": label,
                "model_used": resp.model,
                "latency_s": round(dt, 2),
                "tokens": getattr(resp.usage, "total_tokens", None),
            })
        except Exception as e:
            summary.append({"label": label, "model_used": "ERROR", "latency_s": None, "tokens": str(e)[:80]})
            print(f" [{label}] failed: {e}")

    print("\n[9] Summary:")
    print(pd.DataFrame(summary).to_string(index=False))

Check out the output! You should see that the simple-ish prompt was handled by the faster, cheaper gemini-2.5-flash model, while the complex refactoring task was sent to gemini-2.5-pro. It works!
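If you're not using the OpenAI client, it helps to see what actually goes over the wire. The sketch below only builds the request and doesn't send it. `build_chat_request` is a hypothetical helper of mine. The URL path is the one the OpenAI client appends to `base_url` for chat completions, and the `"local"` token is the same placeholder key used above.

```python
import json

def build_chat_request(port, prompt, max_tokens=220):
    # Hypothetical helper: returns what would be POSTed, without sending it.
    url = f"http://localhost:{port}/v1/chat/completions"
    headers = {
        "Authorization": "Bearer local",   # placeholder key, as in the client above
        "Content-Type": "application/json",
    }
    payload = {
        "model": "auto",                   # let the proxy pick the tier
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return url, headers, payload

url, headers, payload = build_chat_request(8856, "What is 2+2?")
print(url)
print(json.dumps(payload, indent=2))

# To send it for real (needs the proxy running):
# requests.post(url, headers=headers, json=payload, timeout=60)
```

The only thing that changes compared with a direct call is the host and the model name. That is why existing code can usually be pointed at a proxy like this without a rewrite.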

The Payoff: Calculating Our Savings

This is the moment we've been waiting for. Let's throw a mixed workload of 10 prompts at our router and then calculate how much money we saved compared to just using the expensive Pro model for everything.

if proxy_alive():
    print("\n[10] Mixed 10-prompt workload…")
    workload = [
        "Capital of France?",
        "Read foo.py",
        "Type hint for a list of dicts",
        "Lowercase: HELLO",
        "One-sentence summary of REST",
        "Refactor a callback chain into async/await with proper error handling",
        "Design a sharded multi-region key-value store with linearizable reads",
        "Analyze the asymptotic complexity of this code and prove the bound rigorously",
        "Debug why our gRPC stream stalls when the client TCP window saturates",
        "Compare and contrast B-trees and LSM-trees for write-heavy workloads",
    ]
    runs = []
    client = OpenAI(base_url=f"http://localhost:{PORT}/v1", api_key="local")
    for p in workload:
        t0 = time.time()
        try:
            r = client.chat.completions.create(
                model="auto",
                messages=[{"role": "user", "content": p}],
                max_tokens=140,
            )
            usage = getattr(r, "usage", None)
            runs.append({
                "prompt": p[:55],
                "model": r.model,
                "latency_s": round(time.time() - t0, 2),
                "in_tok": getattr(usage, "prompt_tokens", 0) if usage else 0,
                "out_tok": getattr(usage, "completion_tokens", 0) if usage else 0,
            })
        except Exception as e:
            runs.append({"prompt": p[:55], "model": "ERROR", "latency_s": None, "in_tok": 0, "out_tok": 0, "error": str(e)[:80]})
    rdf = pd.DataFrame(runs)
    print(rdf.to_string(index=False))

    PRICE = {
        "flash": {"in": 0.30 / 1e6, "out": 2.50 / 1e6},
        "pro":   {"in": 1.25 / 1e6, "out": 10.0 / 1e6},
    }

    def price_for(model_str, in_t, out_t):
        m = (model_str or "").lower()
        tier = "flash" if "flash" in m else "pro"
        return in_t * PRICE[tier]["in"] + out_t * PRICE[tier]["out"]

    cost_routed = sum(price_for(r["model"], r["in_tok"], r["out_tok"]) for r in runs)
    cost_no_route = sum(price_for("gemini-2.5-pro", r["in_tok"], r["out_tok"]) for r in runs)

    print(f"\n[10] Cost (NadirClaw routed) : ${cost_routed:.6f}")
    print(f" Cost (always-Pro baseline) : ${cost_no_route:.6f}")
    if cost_no_route > 0:
        print(f" Estimated savings on this run : "
              f"{(1 - cost_routed/cost_no_route) * 100:.1f}%")

    print("\n[11] `nadirclaw report` (parses the JSONL request log):")
    rep = subprocess.run(["nadirclaw", "report"], capture_output=True, text=True, timeout=60)
    print(rep.stdout or rep.stderr)

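The savings percentage is the real headline here. Keep in mind that it's an estimate: the always-Pro baseline reuses the token counts from the routed run, and Pro might have produced responses of a different length. Still, under the prices above, every prompt that lands on Flash is billed at roughly a quarter of the Pro rate. Imagine that running on thousands or millions of API calls. We're talking about real money.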
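To see the arithmetic without running the proxy, here is a worked example. The prices are the same figures as the `PRICE` table above, but the token counts and the five-and-five split are invented for illustration, so don't read the resulting percentage as a measured result.

```python
PRICE = {  # USD per token, same figures as the PRICE table in this article
    "flash": {"in": 0.30 / 1e6, "out": 2.50 / 1e6},
    "pro":   {"in": 1.25 / 1e6, "out": 10.0 / 1e6},
}

def cost(tier, in_tok, out_tok):
    return in_tok * PRICE[tier]["in"] + out_tok * PRICE[tier]["out"]

# Hypothetical workload: (tier chosen by the router, input tokens, output tokens).
calls = [("flash", 20, 100)] * 5 + [("pro", 40, 140)] * 5

# What we pay with routing versus sending every call to Pro.
routed = sum(cost(t, i, o) for t, i, o in calls)
always_pro = sum(cost("pro", i, o) for _, i, o in calls)
savings_pct = (1 - routed / always_pro) * 100

print(f"routed=${routed:.6f} always_pro=${always_pro:.6f} savings={savings_pct:.1f}%")
```

Notice that the savings come entirely from the calls routed to Flash. The more of your traffic is genuinely simple, the bigger the number gets.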

Finally, let's be good citizens and shut down our proxy server.

if proxy_alive():
    print("\n[12] Stopping the proxy…")
    try:
        if hasattr(os, "killpg"):
            os.killpg(os.getpgid(server_proc.pid), signal.SIGTERM)
        else:
            server_proc.terminate()
        server_proc.wait(timeout=10)
    except Exception:
        try:
            server_proc.kill()
        except Exception:
            pass
    print(" ✓ proxy stopped.")

print("\nDone. ")

And there you have it. We've gone from a basic idea to a fully functional, cost-aware routing system. We saw how the local classification works, visualized the decision logic, and proved the cost savings with a live proxy.

Tools like NadirClaw represent a much-needed shift towards more practical, efficient, and cost-conscious AI development. It's not always about using the biggest, baddest model; it's about using the right model for the job. And now, you know exactly how to do that.

© 2026 Aicosoft. All rights reserved.