Running AI Locally: The Brutal Math of Self-Hosting

The narrative in the Nostr ecosystem is predictable. You either sleepwalk into the cloud API trap, paying a premium for low latency, or you bite the bullet and self-host. Neither path is free; each is a spectrum of compromises. I’ve been running local inference for two years now, trading hardware upgrades against compute efficiency. The goal isn’t just to run a model; it’s to get a predictable edge in a market where APIs treat you like a commodity.

The Hardware Reality Check

Forget the marketing. If you want a model that actually thinks, you need hardware that doesn’t bottleneck. The GPU is king, but the VRAM is the constraint. I currently run on a dual-GPU rig, not because it’s cheap, but because it allows for batching.

Here is the hard truth about the current landscape:

  • The 24GB Sweet Spot: A single 24GB card (like the RTX 3090/4090) is the entry ticket. With 4-bit quantization it handles 20B parameter models at 4K context windows without stuttering: the weights take roughly 10GB, which leaves headroom for the KV cache.
  • The Multi-GPU Hack: If you have 48GB+ of VRAM across two cards, you unlock 40B-class models. The trick? Use an inference engine that splits the model across GPUs, either layer by layer or with tensor parallelism.
  • The RAM Fallback: If you are on a single 16GB card, you are running on the edge of your VRAM budget. Larger models force you to offload layers to system RAM, which can add on the order of 20-30ms to every token, depending on how many layers are offloaded.
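The sizing rules above come down to simple arithmetic: weight memory is parameters times bits per weight, and the KV cache grows linearly with context. Here is a rough sketch of that estimate. The function names, the 1.5 GiB overhead figure, and the layer/head numbers in the usage note are my own assumptions for illustration, so check them against the config of the model you actually run.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size for a single sequence, in GiB.

    The leading factor of 2 covers keys and values, stored per layer.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_value)
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float, kv_gib: float,
         vram_gib: float, overhead_gib: float = 1.5) -> bool:
    """True if weights + KV cache + a guessed runtime overhead fit in VRAM."""
    needed = weight_gib(params_billion, bits_per_weight) + kv_gib + overhead_gib
    return needed <= vram_gib
```

As a usage example, assuming an 8B model with 32 layers, 8 KV heads and a head dimension of 128, kv_cache_gib(32, 8, 128, 4096) gives the cache cost of a 4K context, and fits(70, 4, 0.5, 24) tells you quickly that a 4-bit 70B model will not squeeze onto a single 24GB card. Treat the result as a lower bound; real runtimes add their own buffers.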

I learned this the hard way. I started with a 16GB card, ran Llama 3, and watched my inference time double as the context window widened. I upgraded immediately.

Model Selection: The Token Economy

Choosing a model isn’t about picking the “smartest” one; it’s about picking the one that fits your use case. The market is flooded with overhyped 70B parameter beasts that feel sluggish on local hardware.

  1. Llama 3 (8B & 70B): The workhorse. The 8B variant is currently the king of efficiency. It punches above its weight class.
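  2. Mistral / Mixtral: If you need deeper reasoning, Mixtral 8x7B is the choice. It is a sparse mixture-of-experts model: a router activates only 2 of its 8 expert blocks per token, so roughly 13B of its roughly 47B parameters do work on any given token. On local hardware, this speeds up throughput, but it does not save VRAM, because every expert still has to be loaded.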
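  3. The “New” Giants: Labs like xAI have published open weights for very large models (Grok-1, for example). Be careful. Models at that scale only fit on consumer hardware after heavy quantization, if at all, and closed-model vendors such as Anthropic do not ship local versions, so treat anything claiming otherwise with suspicion.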
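Mixtral’s efficiency comes from routing each token to a small subset of expert networks. The toy sketch below shows top-2 routing in plain Python: pick the two highest router scores, softmax over just those two, and blend the selected experts’ outputs. The function names and the scalar “experts” are hypothetical stand-ins; a real model does this per token, per layer, on tensors.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are a softmax computed over only the selected logits.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only.

    Experts that are not selected are never called, which is where the
    compute saving comes from. They still occupy memory.
    """
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

With 8 experts and k=2, only a quarter of the expert compute runs per token, yet all 8 experts must sit in VRAM because the router may pick any of them for the next token.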

The Quantization Dilemma

This is where most self-hosters get stuck. To make models fit on your card, you quantize them. You chop the 16-bit floating point precision down to 4-bit integers.

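  • The Tradeoff: Going from 16-bit to 4-bit cuts weight memory roughly 4x, and because token generation is mostly memory-bandwidth-bound, it usually speeds up decoding too. But you lose nuance. The model becomes “sharper” but less empathetic.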
  • My Strategy: I run the base model in 4-bit for speed. If I need high-fidelity output for analysis, I switch to 6-bit. The difference is palpable. 4-bit feels like a telegram; 6-bit feels like a letter.
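To see why 6-bit keeps more detail than 4-bit, here is a minimal sketch of symmetric round-to-nearest quantization with a single shared scale. This is a simplification I’m using for illustration: real formats (GGUF K-quants, GPTQ, AWQ and friends) use per-block scales and smarter rounding, and the function names here are my own.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 31 for 6-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]


def max_error(values, bits):
    """Largest absolute round-trip error at the given bit width."""
    q, scale = quantize_symmetric(values, bits)
    return max(abs(a - b) for a, b in zip(values, dequantize(q, scale)))
```

The round-trip error is bounded by half the scale, and the scale shrinks as the bit width grows: 4-bit has 15 usable levels, 6-bit has 63. Running max_error on the same list of weights at bits=4 and bits=6 makes the telegram-versus-letter gap visible as a number.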

Latency vs. Throughput: The API Comparison

The main reason people ditch local hosting is latency. Cloud APIs can offer 50-100ms response times, while local inference often lands at 150-300ms. But here is the kicker: throughput.

When you hit a cloud API, you are limited by their queue. They have 10,000 users fighting for GPU slices. When you hit your local rig, you are the only user in the building.

  • Cloud API: Best for fire-and-forget tasks. I use them for tagging tweets or quick sentiment checks.
  • Local Host: Best for iterative reasoning. If I need a model to analyze 500 lines of code or synthesize a trading thesis, I need the model to “chew” on the data.
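Latency and throughput are different measurements, and it helps to log both before deciding where a workload belongs. A small sketch, assuming you record a timestamp when the request is sent and one for each streamed token; the function names are hypothetical.

```python
def stream_stats(request_time, token_times):
    """Return (time_to_first_token_s, tokens_per_second) for one streamed reply.

    request_time: timestamp when the request was sent.
    token_times:  timestamps at which each token arrived, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_time
    if len(token_times) < 2:
        return ttft, 0.0
    decode_span = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_span if decode_span > 0 else float("inf")
    return ttft, tps


def total_time(ttft, tokens_per_second, n_tokens):
    """Estimated wall-clock time for a reply of n_tokens."""
    return ttft + (n_tokens - 1) / tokens_per_second
```

For a short reply the first-token latency dominates, which favors whichever side answers first. For a long reply the tokens-per-second term dominates: with made-up numbers, a backend that starts in 0.25s but streams at 60 tokens/s finishes a 500-token answer before one that starts in 0.08s but streams at 40 tokens/s.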

The Software Stack

Don’t overcomplicate the backend. The ecosystem is noisy. I prefer Ollama for its simplicity. It abstracts the hardware, serves pre-quantized model builds by default, and lets you fetch a new model with a single ollama pull command.
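Once a model is pulled, Ollama exposes a local HTTP API. The sketch below posts a prompt to the /api/generate endpoint using only the standard library. It assumes Ollama is listening on its default port 11434 and that the model tag you pass has already been pulled; the helper names are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model, prompt):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

After ollama pull llama3, a call like generate("llama3", "Summarize this thread in one line.") returns the full reply as a string. Setting "stream" to False trades responsiveness for simplicity; for interactive use you would read the streamed JSON lines instead.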

For raw control, I dip into vLLM. It’s faster but requires more tuning. The goal is to keep the software layer thin so the hardware does the heavy lifting.

The Verdict

Running AI locally is a capital expenditure, not just a subscription fee. You are paying for ownership. You stop worrying about API rate limits and sudden price hikes. You build a deterministic engine that can handle the erratic data flow of crypto markets.
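The capex-versus-subscription question has a break-even point you can estimate. A rough sketch follows; every input (hardware price, power draw, electricity rate, API price) is something you supply, and the function names are hypothetical. It ignores resale value, your time, and idle power.

```python
def local_cost_per_million(power_watts, tokens_per_second, price_per_kwh):
    """Electricity cost of generating one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts / 1000 * seconds / 3600
    return kwh * price_per_kwh


def break_even_million_tokens(hardware_cost, api_price_per_million,
                              local_price_per_million):
    """Millions of tokens after which owning the rig beats paying the API.

    Returns None when local running costs are not lower than the API price,
    in which case the hardware never pays for itself.
    """
    saving = api_price_per_million - local_price_per_million
    if saving <= 0:
        return None
    return hardware_cost / saving
```

Plug in your own numbers: a rig’s wattage under load, its measured tokens per second, your electricity rate, and the API price you would otherwise pay. If the break-even volume is more than you will realistically generate over the hardware’s life, the cloud is the cheaper option on cost alone, and the argument for local rests on ownership and control.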

The hardware is there. The software is stable. The only variable is you. Are you going to accept the cloud’s convenience, or will you grind through the setup to own your compute?

Active Position: 100% Local. Current Model: Mixtral 8x7B (Quantized 4-bit). Next Upgrade: A dedicated NPU card for the 80% of data that doesn’t need raw power.

