# I Bought the GPU Anyway

I bought a DGX Spark to fine-tune models. That's the real reason — a Grace-Blackwell GB10 with ~120 GB of unified memory is a training rig, not an inference box, and fine-tuning is where all of this is headed.

But the last thing I published was "What Free LLMs Actually Cost", an inference benchmark that ended with a blunt conclusion: for a personal assistant, don't host inference, just pay for the API. Local on a CPU NUC was free in dollars and unusable in latency. The fast-and-free quadrant was empty: pick one dimension to spend (dollars, latency, or hardware), because there is no fourth option.

So before I touched training, I owed that post a rematch on real hardware. Same benchmark, same prompt sizes, same table — the only thing that changed is the local leg: from a 12-core CPU to an actual GPU.

Here's what a GPU bought me on the inference side, the ceiling I didn't see coming — and why none of it is actually why the Spark is on my desk.

## The Hardware


| Spec | Value |
| --- | --- |
| Machine | NVIDIA DGX Spark (GB10) |
| GPU | Grace-Blackwell, CUDA compute 12.1 |
| Unified memory | ~120 GB usable for inference (measured: 118.6 GiB) |
| Memory type | LPDDR5X, unified CPU+GPU |
| Driver / CUDA | 580.159.03 / CUDA 13.0 |
| Idle power | 4 W |
| Serving | Ollama, Q4 models |

The headline number is that 120 GB of unified memory. A consumer GPU tops out at 24–32 GB of VRAM; the Spark can hold a 70B model and a 131K-token context without blinking. That capability is real and it matters — and it matters most for training, which is the reason I bought it. Hold onto it — it's also where the inference twist lives.

## The Benchmark

Same three prompt classes as last time: a ~50-token chat question, a ~400-token code review, and a ~1500-token agent prompt with full tool definitions. Five runs per cell, median wall-clock. The local models moved from Polaris, the CPU-only NUC from the last post, to the GB10.

> **Note:** One honesty note carried over from the first post: I never saved the literal prompt text back then, only the sizes. These are freshly written prompts matched to the same token counts and shapes, so the comparison is fair by size, but they aren't byte-identical to the May run. The cloud rows below are quoted from that post (same methodology), not re-measured.
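For anyone who wants to reproduce the "five runs per cell, median wall-clock" method, here is a minimal sketch of how one cell could be measured against a local Ollama server. It is not the harness used for the table below. It assumes Ollama's default endpoint on port 11434 and its non-streaming `/api/generate` response, which reports `eval_count` (output tokens) and `eval_duration` (nanoseconds spent decoding). The function names `ollama_generate` and `bench_cell` are made up for this example.

```python
import json
import statistics
import time
import urllib.request

# ASSUMPTION: Ollama is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(model, prompt):
    """Send one non-streaming generate request and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def bench_cell(model, prompt, runs=5, generate=ollama_generate):
    """Run one (model, prompt) cell `runs` times and report medians.

    `generate` is injectable so the timing logic can be exercised
    without a live server.
    """
    walls, tokens, rates = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        reply = generate(model, prompt)
        walls.append(time.perf_counter() - start)
        tokens.append(reply["eval_count"])
        # eval_duration is in nanoseconds; convert to seconds for tok/s.
        rates.append(reply["eval_count"] / (reply["eval_duration"] / 1e9))
    return {
        "wall_s": statistics.median(walls),
        "output_tok": statistics.median(tokens),
        "tok_per_s": statistics.median(rates),
    }


# Example (needs a running Ollama server with the model pulled):
# print(bench_cell("qwen2.5:7b", "Why is the sky blue?"))
```

Taking the decode rate from `eval_duration` rather than from wall-clock time keeps model load and prompt processing out of the tok/s figure, which is one reason wall seconds and tok/s need not divide evenly.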

| Backend | Model | Prompt | Wall (s) | Output tok | Tok/s |
| --- | --- | --- | --- | --- | --- |
| Local (GB10) | qwen2.5:7b | short | 5 | 239 | 49 |
| Local (GB10) | qwen2.5:7b | medium | 19 | 890 | 48 |
| Local (GB10) | qwen2.5:7b | long | 5 | 197 | 48 |
| Local (GB10) | qwen2.5:32b | short | 15 | 159 | 11 |
| Local (GB10) | qwen2.5:32b | medium | 81 | 841 | 10 |
| Local (GB10) | qwen2.5:32b | long | 22 | 207 | 10 |
| Local (GB10) | llama3.1:70b | short | 47 | 219 | 5 |
| Local (GB10) | llama3.1:70b | medium | 151 | 726 | 5 |
| Local (GB10) | llama3.1:70b | long | 68 | 327 | 5 |
| Local (CPU NUC, May) | qwen2.5:7b | long | 123 | 818 | 6 |
| Paid cloud (May) | deepseek-v4-flash | long | 34 | 2,243 | 64 |
| Free cloud (May) | nemotron-3-super | long | 120 | 3,755 | 21 |

## The NUC's Fatal Flaw, Fixed

Read the qwen2.5:7b rows against the same model on the NUC.

- CPU NUC, last post: ~7 tok/s. A short chat reply took 20+ seconds.
- GB10, this post: ~48 tok/s.

That's a 7× jump on the identical model, and it crosses the line that actually matters. 7 tok/s is a phone tree. 48 tok/s is chat. For the first time, a local, private, offline-capable model is answering at a speed I'd tolerate for ambient assistant work — the exact thing I told you two months ago the NUC could never do.

On its own that would be the whole post: the empty quadrant isn't empty anymore, you just have to pay for it in hardware. But then I ran the big models.

## Bandwidth Is the New Ceiling

The Spark's party trick is that it can run a 70B model at all. It can. It runs llama3.1:70b — a model that wouldn't even load on the NUC — at a steady ~5 tok/s.

Five. On a machine that does the 7B at forty-eight.

The 32B lands exactly where you'd interpolate: ~10.5 tok/s. Notice the pattern — throughput scales almost perfectly inversely with model size:

| Model | Q4 weights | Tok/s | Weights × tok/s |
| --- | --- | --- | --- |
| qwen2.5:7b | 4.7 GB | 48 | ~226 GB/s |
| qwen2.5:32b | 19 GB | 10.5 | ~200 GB/s |
| llama3.1:70b | 42 GB | 4.8 | ~202 GB/s |

That last column is the tell. Multiply each model's weight size by its decode speed and you get nearly the same number every time: roughly 200 GB/s of effective memory bandwidth (the 7B comes in a little higher, at ~226 GB/s). Token generation is bandwidth-bound: for every token, the GPU streams the entire model out of memory. Double the model, halve the speed. The GB10's LPDDR5X unified memory is enormous but not especially fast, and ~200 GB/s effective is the wall every model decodes against.
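The arithmetic behind that column is small enough to check by hand, but a short sketch makes the relationship explicit. The weight sizes and decode speeds are the measured values from the table. The 200 GB/s default in `predicted_tok_per_s` is the rounded figure from the text, used here as an assumption to show how closely a single bandwidth number predicts each model's speed.

```python
# (model, Q4 weight size in GB, measured decode tok/s) from the table above
MEASURED = [
    ("qwen2.5:7b", 4.7, 48.0),
    ("qwen2.5:32b", 19.0, 10.5),
    ("llama3.1:70b", 42.0, 4.8),
]


def effective_bandwidth_gbs(weights_gb, tok_per_s):
    """Each decoded token reads every weight once, so GB/token * tok/s = GB/s."""
    return weights_gb * tok_per_s


def predicted_tok_per_s(weights_gb, bandwidth_gbs=200.0):
    """Invert the relation: decode speed a given bandwidth allows for a model.

    ASSUMPTION: 200 GB/s is the rounded effective figure, not a spec value.
    """
    return bandwidth_gbs / weights_gb


for name, gb, tps in MEASURED:
    bw = effective_bandwidth_gbs(gb, tps)
    pred = predicted_tok_per_s(gb)
    print(f"{name:13s} {bw:6.1f} GB/s effective, {pred:5.1f} tok/s predicted, {tps:5.1f} measured")
```

The prediction is a first-order model only: it ignores KV-cache reads and any compute-bound phases, which may be part of why the 7B lands a little above the line.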

> **Important:** The Spark's 120 GB buys you capacity, not speed. It can hold models nothing else its size can, but it reads them at unified-memory bandwidth, so the bigger the model, the slower it runs. Fast or big. Still not both.

## The Quadrant, Revisited

So did the DGX Spark fill the empty quadrant?

Partly — and the "partly" is the entire story.

- Fast + local, small model: yes. 7B at 48 tok/s locally is a quadrant that genuinely didn't exist for me before. The Spark filled it.
- Fast + local, big model: no. The 70B the Spark uniquely fits runs at 5 tok/s. The bandwidth ceiling replaced the compute ceiling.

The first post said: pick one of dollars, latency, or hardware. I picked hardware. What I got wasn't a free lunch — it was a better menu. Inside the hardware dimension there's now its own "pick two": fast, big, local — choose any two. Fast + local means small. Big + local means slow. The Spark moved the wall; it didn't remove it.

## What About the Money

The uncomfortable part. Last post's conclusion — paid cloud is the cheapest viable option for a personal assistant — is still true. DeepSeek V4 Flash at ~64 tok/s for a few dollars a year still beats a $3,000–4,000 box on pure economics, and it isn't close. A Spark will not pay for itself in inference you'd otherwise have bought.
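To put "it isn't close" into numbers, here is a back-of-the-envelope break-even sketch. The hardware prices are the $3,000 to $4,000 range quoted above. The $5 per year API figure is an assumption standing in for "a few dollars a year", and electricity for the box is ignored, which only flatters the hardware.

```python
def breakeven_years(hardware_dollars, api_dollars_per_year):
    """Years of API spend needed to equal the up-front hardware price."""
    return hardware_dollars / api_dollars_per_year


# ASSUMPTION: $5/year is a placeholder for "a few dollars a year".
API_DOLLARS_PER_YEAR = 5.0

for price in (3000, 4000):
    years = breakeven_years(price, API_DOLLARS_PER_YEAR)
    print(f"${price} box breaks even after {years:.0f} years of API spend")
```

Even if the real API bill were ten times the placeholder, the break-even point would still be measured in decades.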

What changed isn't the cost answer. It's why you'd go local at all:

- Before, local meant accepting phone-tree latency to get privacy and offline. A real tradeoff most people shouldn't take for a chat assistant.
- Now, for small-to-mid models, local means privacy and offline at usable speed. The tradeoff mostly evaporates: you stop giving up latency to keep your data home.
- And you can run models cloud-free that you simply couldn't host any other way: slowly, but at all.

That's the Spark's actual pitch. Not "cheaper tokens." It's "the privacy/offline use cases from the first post, minus the latency penalty — plus a ceiling high enough to run the big stuff when you're willing to wait."

## What It's For (on the Inference Side)

Setting training aside for a second — purely as an inference box, refining the list from last time now that the hardware is real:

- Interactive chat, small models: 7B/8B-class models at 40–50 tok/s. The thing the NUC couldn't do, now local and private.
- Big models, batch/async: run a 70B overnight for summarisation or eval, where 5 tok/s is fine because nobody's waiting.
- Embeddings & indexing: functionally instant at personal volumes, same as before.
- Privacy-required interactive work: journaling, financial parsing, anything you'd never send to an API, now without the latency tax.
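For the overnight batch case, a rough sizing sketch shows what "5 tok/s is fine" buys in practice. The 4.8 tok/s decode rate is the measured 70B figure. The 8-hour window and the 327-token reply length (the 70B's output on the long prompt) are illustrative assumptions, and prompt processing time is ignored, so treat the result as an upper bound.

```python
def replies_per_window(tok_per_s, hours, tokens_per_reply):
    """Upper bound on completed replies in a time window, decode time only."""
    total_tokens = tok_per_s * hours * 3600
    return int(total_tokens // tokens_per_reply)


# ASSUMPTIONS: 8-hour overnight window, 327 output tokens per reply.
print(replies_per_window(tok_per_s=4.8, hours=8, tokens_per_reply=327))
```

A few hundred mid-length summaries per night is plenty for personal-scale summarisation or eval, which is the point of the batch bullet above.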

What it's still not for: beating the cloud on price, or running a 70B as a snappy interactive assistant. The bandwidth wall is real and it's physics, not a driver update.

## The GPU Didn't Come Up at First: A Debugging Aside

Before any of this worked, nvidia-smi greeted me with "No devices were found" — despite the driver modules being loaded and /dev/nvidia* existing. The kernel log had the real story:

    NVRM: ksec2PrepareBootCommands_GB20B: SEC2 secure boot partition timed out.
    NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
    NVRM: GPU 000f:01:00.0: RmInitAdapter failed! (0x62:0x65:2028)

The GPU's security engine (SEC2) was timing out during secure boot, so the GSP firmware never bootstrapped, so the compute stack never initialised. A warm reboot didn't fix it and neither did reloading the kernel modules — SEC2 runs at power-on, below the software layer. The fix was a genuine cold power cycle: full shutdown, pull power for a minute, boot. nvidia-smi came back listing the GB10.
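If this failure recurs, the check can be scripted. The sketch below scans kernel log text for the three NVRM messages quoted above and flags the SEC2 timeout, the one that needed a cold power cycle here. The signature names and the functions `diagnose` and `needs_cold_power_cycle` are invented for this example, and the patterns come only from this one incident, so other failure modes will not be recognised.

```python
import re

# Failure signatures taken verbatim from the kernel log lines quoted above.
SIGNATURES = {
    "sec2_timeout": re.compile(r"SEC2 secure boot partition timed out"),
    "gsp_init_failed": re.compile(r"Cannot initialize GSP firmware RM"),
    "adapter_init_failed": re.compile(r"RmInitAdapter failed!"),
}


def diagnose(kernel_log):
    """Return the set of known NVRM failure signatures found in the log text."""
    hits = set()
    for line in kernel_log.splitlines():
        if "NVRM:" not in line:
            continue  # only NVIDIA resource-manager messages are of interest
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.add(name)
    return hits


def needs_cold_power_cycle(kernel_log):
    """True when the SEC2 timeout is present (warm reboots did not clear it here)."""
    return "sec2_timeout" in diagnose(kernel_log)
```

Feed it the text of the kernel log, for example the captured output of `dmesg`, and act on the boolean.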

Then a second gotcha: Ollama still ran on CPU, because its daemon had started while the GPU was wedged and cached a CPU-only decision. One `snap restart ollama` later, it re-discovered the GPU: 100% GPU, ~80 GB of VRAM in use, and the numbers above.
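A silent CPU fallback like this is easy to miss, so it can be worth checking before trusting any benchmark run. The sketch below is one way to do that, assuming Ollama's default port and its `/api/ps` endpoint, which lists loaded models with their total `size` and the portion resident in GPU memory as `size_vram`. The helper names are hypothetical.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is listening on its default local port.
OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def fetch_running_models(url=OLLAMA_PS_URL):
    """Return the list of currently loaded models reported by Ollama."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp).get("models", [])


def gpu_fraction(model_entry):
    """Share of a loaded model's bytes that sit in GPU memory (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    vram = model_entry.get("size_vram", 0)
    return vram / size if size else 0.0


def cpu_only_models(models):
    """Names of loaded models with nothing offloaded to the GPU."""
    return [m["name"] for m in models if gpu_fraction(m) == 0.0]


# Example (needs a running Ollama server with a model loaded):
# print(cpu_only_models(fetch_running_models()))
```

An empty list from `cpu_only_models` after loading a model corresponds to the "100% GPU" state described above. A non-empty list is the cue to restart the daemon.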

## Up Next: The Actual Reason

Everything above is a side-quest. I didn't buy a DGX Spark to run inference — a few dollars a year of API credit does that better, as the last post argued and this one confirms. I bought it to fine-tune models.

And here's the thing the inference benchmark accidentally proved: the Spark's 120 GB of unified memory is overkill for inference — it lets you hold a 70B you can only decode at a crawl. But memory capacity isn't a nice-to-have for training, it's the whole game. Fitting the model, the gradients, the optimizer state, and a batch of activations in memory at once is the constraint that normally shoves you onto rented multi-GPU cloud. The Spark folds that into a box on my desk that idles at 4 W. The feature that's wasted on inference bandwidth is exactly the feature that makes local fine-tuning viable.
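A rough estimate shows why capacity dominates for training. The sketch below uses a common rule of thumb for full fine-tuning with Adam in mixed precision: about 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 4 for an fp32 master copy, and 4 + 4 for the two Adam moments). That per-parameter figure is an assumption, it leaves out activations entirely, and it says nothing about LoRA, so treat the output as a lower bound for a full run, not a measurement on the Spark.

```python
# ASSUMPTION: ~16 bytes/parameter for mixed-precision Adam, activations excluded.
BYTES_PER_PARAM_FULL_FT = 16.0


def full_finetune_gb(params_billion, bytes_per_param=BYTES_PER_PARAM_FULL_FT):
    """Approximate GB for weights + gradients + optimizer state.

    1e9 parameters * N bytes = N GB, so the billions and the GB cancel.
    """
    return params_billion * bytes_per_param


def fits(params_billion, memory_gb):
    """Whether the weights/gradients/optimizer footprint fits in memory_gb."""
    return full_finetune_gb(params_billion) <= memory_gb


for size in (7, 32, 70):
    need = full_finetune_gb(size)
    verdict = "fits" if fits(size, 120) else "does not fit"
    print(f"{size}B full fine-tune: ~{need:.0f} GB before activations, {verdict} in ~120 GB")
```

Under these assumptions a 7B full fine-tune already needs about 112 GB before a single activation is stored, which illustrates why memory capacity, not decode bandwidth, is the constraint that matters for training.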

That's the next post: a real fine-tune on the Spark — a base model, a dataset, LoRA versus a full run, and what the throughput, memory footprint, and loss curves actually look like on hardware that costs less than a month of the cloud instance you'd otherwise rent.

The first post ended with "if you want a chat assistant, just use the API — the math is openly on your side." Still true, and this post doesn't overturn it. The inference rematch was a debt I owed that post. The fine-tuning is why the Spark is here — and that's where I'm headed next.