
Experiment: I ran a datacenter-class 120B parameter model for just $800

You can build a computer for around $800 to run the 120 billion-parameter AI model GPT-OSS-120B locally, privately, securely, and at a decent speed – well above 10 tokens per second. We’re talking about a datacenter-class, state-of-the-art AI model.

Ernestas Naprys, Senior Journalist
Oct 2, 2025. Updated: Oct 3, 2025
Testing: the GPU was unused, and the model was running on the CPU.

Why won’t a 70B parameter model run well on a CPU, but a 120B one will?

Experiment 1: having more cores doesn’t automatically mean more performance

  • A single assigned thread averaged 3.05 tok/sec. During token generation, the CPU was partially using 4-6 threads at a time, likely because the OS scheduler distributes the load across several cores
  • Rerunning the test with two threads assigned netted 7.05 tokens per second.
  • Three threads delivered 9.73 tok/sec. CPU usage was around 20 percent, still more than the assigned 3 of 32 threads would suggest
  • With four threads, the performance was 11.70 tok/sec
  • Five threads added roughly one more token per second, lifting performance to 12.66 tok/sec
  • Six threads, still improving: 13.72 tok/sec
  • Seven threads: 14.32 tok/sec
  • Eight threads: 14.75 tok/sec
  • Nine threads: 13.93 tok/sec. I ran the same query again just to be sure: 14.35 tok/sec. Again: 14.07 tok/sec.
  • Ten threads: further performance degradation to 13.5 tok/sec
  • Eleven threads delivered a similar 13.73 tok/sec
  • Raising the CPU Thread Pool Size beyond that eroded performance further. Twelve threads: only 11.05 tok/sec.
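
The scaling in the list above can be summarized with a few lines of Python. This is only a sketch for reading the numbers: the figures are copied from the bullets, the three nine-thread runs are averaged (my choice, not the author's), and "speedup" and "efficiency" are the usual ratios against the single-thread result rather than anything the test tool reports.

```python
# Tokens per second by assigned thread count, copied from the list above.
# The three nine-thread runs (13.93, 14.35, 14.07) are averaged here.
results = {
    1: 3.05, 2: 7.05, 3: 9.73, 4: 11.70, 5: 12.66, 6: 13.72,
    7: 14.32, 8: 14.75, 9: (13.93 + 14.35 + 14.07) / 3,
    10: 13.5, 11: 13.73, 12: 11.05,
}


def scaling_table(results):
    """Return (threads, tok/sec, speedup, efficiency) rows, sorted by thread count."""
    base = results[1]  # single-thread throughput is the reference point
    rows = []
    for threads in sorted(results):
        speedup = results[threads] / base   # how many times faster than one thread
        efficiency = speedup / threads      # 1.0 would mean perfect linear scaling
        rows.append((threads, results[threads], speedup, efficiency))
    return rows


# Thread count with the highest measured throughput.
best_threads = max(results, key=results.get)

for threads, tps, speedup, efficiency in scaling_table(results):
    print(f"{threads:>2} threads: {tps:5.2f} tok/sec, "
          f"{speedup:4.2f}x speedup, {efficiency:4.0%} efficiency")
print("best thread count:", best_threads)
```

On these figures the peak sits at eight threads, at roughly 4.8 times the single-thread rate, and every thread count after that is slower.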

Experiment 2: 64GB of RAM is a very “tight” fit

  • One thread assigned: 4.9 tok/sec
  • Two threads: 8.51 tok/sec
  • Three threads: 11.26 tok/sec
  • Four threads: 13.08 tok/sec
  • Five threads: 14.12 tok/sec
  • Six threads: 14.39 tok/sec
  • Seven threads: 14.37 tok/sec
  • Eight threads: 14.24 tok/sec
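
To see where extra threads stop paying off in this run, one can look at the gain from each added thread. The sketch below uses the figures from the list above. The 2 percent tolerance for "close enough to peak" is an arbitrary assumption for illustration, and both function names are made up for this example.

```python
# Tokens per second by assigned thread count, copied from the list above.
results = {1: 4.9, 2: 8.51, 3: 11.26, 4: 13.08,
           5: 14.12, 6: 14.39, 7: 14.37, 8: 14.24}


def marginal_gains(results):
    """Map each thread count to the tok/sec gained over the previous count."""
    threads = sorted(results)
    return {n: results[n] - results[prev] for prev, n in zip(threads, threads[1:])}


def smallest_near_peak(results, tolerance=0.02):
    """Smallest thread count whose throughput is within `tolerance` of the peak.

    The default tolerance of 2 percent is an assumption, not a measured threshold.
    """
    peak = max(results.values())
    return min(n for n, tps in results.items() if tps >= peak * (1 - tolerance))


for threads, gain in marginal_gains(results).items():
    print(f"{threads} threads: {gain:+.2f} tok/sec over {threads - 1}")
print("smallest thread count near peak:", smallest_near_peak(results))
```

With these numbers, each added thread helps less than the one before, and from the seventh thread on the change is slightly negative.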

The builds: how to choose parts somewhat optimally

[Images: AMD build, memory prices, Intel build. Source: pcpartpicker.com.]

How does this compare to other systems?

Lessons learned

  • Memory frequency is extremely important for getting the best performance when running an LLM on a CPU. On a dual-channel platform, this limits the choice to two higher-capacity sticks of RAM. I had to learn the hard way that four sticks of RAM are very taxing on the memory controller and result in lower frequency and bandwidth.
  • Running on a CPU is quite power-efficient, as the chip uses far less power than GPUs.
  • A CPU-only system is not suitable for running LLMs universally. While GPT-OSS-120B can work with compromises, others won’t. The smaller the “experts,” the better the mixture of them will run on a CPU alone, but there aren’t many such models. And small LLMs will always run better on a GPU.
  • Another important parameter that affects the model quality, performance, and size is quantization. It's about the size of the numbers used to store model weights. GPT-OSS-120B was already released quantized at 4.25-bit weights. Running models with more precision can be even more taxing.
  • There is no question: a GPU, with its massively higher memory bandwidth and parallel compute, will be much faster. However, multi-GPU setups or professional GPUs with plenty of memory are much more expensive.
  • In my tests, the Vulkan llama.cpp runtime always delivered better performance on a GPU than Nvidia's CUDA or AMD's frameworks.
  • I wouldn’t choose AMD GPUs. While they’re fine for LMStudio, the support for many frameworks on Linux seems to be worse and more complicated.
  • If I were to build a budget PC for running LLMs universally, I would still add a GPU, prioritizing its VRAM amount. This would give the flexibility to run small models like Gemma or Mistral Small very fast while still having the option to launch GPT-OSS-120B.
  • I would choose a 128GB RAM kit “just in case” OpenAI or other vendors release even larger LLMs with many small experts. The smaller the experts, the better they run on a CPU. LMStudio allows activation of even fewer experts than intended, but this leads to considerable answer quality loss.
  • LMStudio’s assignment of threads is weird. It doesn’t match the physical cores and threads of the CPU.
  • If a model doesn’t completely fit in the VRAM, its performance will always tank to a similar level as if it were only running on the CPU.
  • Macs seem to have better support when porting new models to their native MLX framework. For example, the new Qwen3-Next-80B model, which ranks 17th on LMArena, beating many proprietary models, was quickly ported to MLX after launch. Meanwhile, weeks later, the community is still waiting for a way to run it on LMStudio (llama.cpp).
  • Prosumer quad-channel DDR5 workstations (Xeon W series or Threadripper) might be another good platform, providing entry-level performance and expandability for multi-GPU setups. However, they're a lot more expensive.
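
The points above about quantization, memory frequency, and small experts can be tied together with some back-of-the-envelope arithmetic. The sketch below rests on several assumptions that are not in the article: every weight is treated as 4.25-bit, the memory is a hypothetical dual-channel DDR5-6000 kit with a 64-bit (8-byte) bus per channel, and about 5.1 billion parameters are active per token (the figure OpenAI has cited for this model). It also ignores compute time, caches, and the KV cache, so the result is a rough ceiling and not a prediction.

```python
def weights_size_gb(params, bits_per_weight):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def dual_channel_bandwidth_gbs(mega_transfers_per_sec, channels=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s.

    Assumes each channel moves 8 bytes (64 bits) per transfer.
    """
    return mega_transfers_per_sec * 1e6 * channels * bytes_per_transfer / 1e9


def max_tokens_per_sec(bandwidth_gbs, active_params, bits_per_weight):
    """Upper bound on tok/sec if each token requires reading all active weights once."""
    return bandwidth_gbs / weights_size_gb(active_params, bits_per_weight)


# Whole model: 120 billion parameters at 4.25 bits per weight.
model_gb = weights_size_gb(120e9, 4.25)

# Hypothetical memory kit: dual-channel DDR5-6000 (an assumption, not the tested build).
bandwidth = dual_channel_bandwidth_gbs(6000)

# Assumed 5.1 billion active parameters per token.
ceiling = max_tokens_per_sec(bandwidth, 5.1e9, 4.25)

print(f"model weights: {model_gb:.2f} GB")
print(f"peak bandwidth: {bandwidth:.0f} GB/s")
print(f"bandwidth-bound ceiling: {ceiling:.1f} tok/sec")
```

Under these assumptions the weights alone come to just under 64 GB, which is consistent with 64GB of RAM being a tight fit. The ceiling scales linearly with memory speed, so a drop in frequency from running four sticks lowers it by the same proportion.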
