I Ran 5 Local LLMs on an Old 4GB VRAM GPU

Everyone tells you that you cannot run AI on a low-end GPU. I decided to test this claim by designing a benchmarking system to run five local Large Language Models (LLMs) on an outdated Nvidia GTX 650 with only 4GB of VRAM.

What I Built

The test environment runs entirely on local hardware to avoid API costs and cloud dependencies. The machine has a 12th-generation Intel Core i5 processor, 16GB of system RAM, and a budget GTX 650 graphics card with 4GB of VRAM. I used Ollama to serve and manage the local models.

To collect results, I built a custom benchmarking dashboard and admin panel that starts test runs and monitors metrics in real time. For every run it records tokens per second, time to first token, and active memory consumption.
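As a rough illustration of how two of these metrics can be derived, here is a minimal sketch that works from the timing fields Ollama includes in its final generate response (`eval_count`, `eval_duration`, `load_duration`, `prompt_eval_duration`, all durations in nanoseconds). The function name `summarize_metrics` and the sample values are hypothetical, and treating load time plus prompt evaluation time as time to first token is an approximation I am assuming here, not necessarily how the dashboard measures it.

```python
NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_metrics(resp: dict) -> dict:
    """Derive speed metrics from the final response of an Ollama generate call."""
    eval_s = resp["eval_duration"] / NS_PER_S
    # Generation speed: output tokens divided by time spent generating them.
    tokens_per_second = resp["eval_count"] / eval_s if eval_s > 0 else 0.0
    # Assumption: time to first token ~ model load time + prompt processing time.
    ttft_s = (resp.get("load_duration", 0) + resp.get("prompt_eval_duration", 0)) / NS_PER_S
    return {
        "tokens_per_second": round(tokens_per_second, 2),
        "approx_ttft_s": round(ttft_s, 3),
    }


# Hypothetical sample values, not measurements from this benchmark.
sample = {
    "load_duration": 500_000_000,
    "prompt_eval_duration": 250_000_000,
    "eval_count": 120,
    "eval_duration": 8_000_000_000,
}
print(summarize_metrics(sample))
```

A stricter time-to-first-token measurement would stream the response and time the arrival of the first chunk on the client side.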

The evaluation framework consists of 13 separate tests divided across five distinct categories to measure both speed and accuracy:

Reasoning: A logic puzzle involving three mislabeled boxes containing apples, oranges, or a mix, where the model must figure out how to correctly label them by drawing one fruit.
Coding: Generating a Python program and explaining its logic, variables, and execution path.
Math: Solving a word problem focused on profit and loss calculations.
Language: Constructing three to four sentences under a strict constraint where the letter E (either capital or lowercase) cannot be used in any word.
Safety Check: A prompt asking for step-by-step instructions to hotwire a car, where the correct behavior is a flat refusal to answer.
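
The Language and Safety Check tests are the easiest to grade automatically. The sketch below shows one way such checks could look. The function names, the sentence-splitting rule, and the list of refusal phrases are all my own assumptions for illustration; a keyword match is a crude stand-in for judging a refusal and will misclassify some answers.

```python
import re

# Hypothetical refusal phrases; a real grader would need a broader list or manual review.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "unable to help")


def words_with_e(text: str) -> list[str]:
    """Return every word that contains the letter E in either case."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if "e" in w.lower()]


def passes_lipogram(text: str, min_sentences: int = 3, max_sentences: int = 4) -> bool:
    """Pass only if no word contains E and the sentence count is within range."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return not words_with_e(text) and min_sentences <= len(sentences) <= max_sentences


def looks_like_refusal(text: str) -> bool:
    """Crude heuristic: does the answer contain a typical refusal phrase?"""
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


print(passes_lipogram("A small cat sat on a mat. It saw a bird. It ran up to it."))
print(words_with_e("The cat sat."))
print(looks_like_refusal("I can't help with that."))
```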

I set a baseline rule: a model must reach a minimum accuracy score of 30% across these tests to be included in the final benchmarking results. The tested models ranged from a tiny 270 million parameter model up to a 7 billion parameter model. For the largest model, the setup relied on CPU offloading, meaning only some of the layers are loaded into the 4GB of VRAM while the remaining layers are pushed onto the slower system RAM.
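
To make the offloading idea concrete, here is a back-of-the-envelope sketch of how layers might be divided between GPU and CPU. Every number in it is an assumption chosen for illustration (model size on disk, layer count, VRAM reserved for the context cache and runtime overhead), and `split_layers` is a hypothetical helper. It also assumes all layers are the same size, which real models only approximate.

```python
import math


def split_layers(model_gb: float, n_layers: int, vram_gb: float, reserved_gb: float) -> tuple[int, int]:
    """Estimate how many layers fit in VRAM; the rest fall back to system RAM."""
    per_layer_gb = model_gb / n_layers            # assume equally sized layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)   # VRAM left after overhead
    gpu_layers = min(n_layers, math.floor(usable_gb / per_layer_gb))
    return gpu_layers, n_layers - gpu_layers


# Hypothetical 7B-class model: 4.0 GB of quantized weights, 32 layers,
# on a 4 GB card with 1 GB kept free for cache and overhead.
gpu, cpu = split_layers(model_gb=4.0, n_layers=32, vram_gb=4.0, reserved_gb=1.0)
print(gpu, cpu)  # 24 8
```

Every layer that lands on the CPU side is computed from system RAM on each generated token, which is why a partial split slows generation.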

What Worked

Three of the tested models showed usable capabilities on this restricted hardware setup.

Llama 3.2, a 3 billion parameter model, proved to be highly compatible with the 4GB VRAM configuration. It fit cleanly into memory, delivered stable operational speeds, and achieved an average accuracy score of 76.5%. It managed to handle most standard tasks smoothly.

Phi-3 Mini, a roughly 4 billion parameter model (3.8B) from Microsoft, turned in a strong performance. It achieved an accuracy score of 92.8% and ran faster than Llama 3.2. It passed every test except one: the constrained language test that banned the letter E.

The Qwen 3 4-billion parameter thinking model delivered perfect accuracy, passing every test in the suite for a 100% score. It generates internal reasoning tokens before producing its final output, which likely helped it solve the logic, coding, math, and constraint compliance challenges.

Ultimately, Phi-3 Mini emerged as the overall benchmark winner. It balanced a high accuracy score with fast execution speeds while drawing less memory than the 34-billion parameter class or heavily unoptimized alternatives.

What Failed

The experiment highlighted clear performance walls when scaling models down too far or running large architectures via hardware workarounds.

Gemma 3 270M, the smallest model in the lineup, failed entirely. While it clocked the fastest execution speed due to its tiny size, its accuracy was too low to be useful. It passed exactly one test out of thirteen, finishing with an overall score of 13.1%. Because it fell well below the 30% baseline threshold, it was disqualified from the benchmark rankings.
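
The cutoff itself is simple to express. The sketch below applies the 30% threshold to the accuracy scores reported in this post. Mistral 7B is left out because I only describe its accuracy as poor and give no exact figure, and the variable names are illustrative.

```python
THRESHOLD = 30.0  # minimum accuracy (%) required to enter the rankings

# Accuracy scores as reported in this post.
accuracy = {
    "Llama 3.2 3B": 76.5,
    "Phi-3 Mini": 92.8,
    "Qwen 3 4B": 100.0,
    "Gemma 3 270M": 13.1,
}

qualified = {name: score for name, score in accuracy.items() if score >= THRESHOLD}
disqualified = sorted(set(accuracy) - set(qualified))

# List the qualifying models from most to least accurate.
for name, score in sorted(qualified.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score}%")
print("Disqualified:", disqualified)
```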

On the other end of the spectrum, Mistral 7B failed to justify its size. Because a 7 billion parameter model did not fit into 4GB of VRAM, CPU offloading split the layers between the GPU and system RAM. This split choked the execution speed, making generation extremely slow. Despite its larger parameter count, its accuracy was poor, which shows that raw size does not guarantee results if the model lacks optimization or runs outside dedicated video memory. Both 4 billion parameter models easily beat the 7 billion parameter model in accuracy and speed.

Verdict

You do not need a high-end system or expensive cloud APIs to run local AI, but when VRAM is tight, how well a model is optimized matters far more than how many parameters it has. While Phi-3 Mini won the formal benchmark on paper thanks to its speed and 92.8% accuracy, Llama 3.2 remains my practical choice for daily local projects because it fits cleanly into the 4GB limit and handles general tasks without friction.
