
VRAM Requirements for AI: How Much Do You Need?

Calculate exactly how much VRAM you need for AI models. Complete 2026 guide with requirements for Llama, Mistral, and more. VRAM tables and formulas included.

vram · gpu · ai hardware · llm · local ai · gpu memory · quantization

I ran out of VRAM three minutes into my first local AI experiment.

The model loaded fine. I typed my prompt. Then—nothing. My GPU fans spun up to jet-engine levels, my system stuttered, and eventually crashed with an “out of memory” error. The 7B model I was trying to run needed 14GB of VRAM. My GPU had 8GB.

That’s when I learned the most important lesson in local AI: VRAM is everything. Not clock speed. Not CUDA cores. Not tensor cores. VRAM—video memory—is the single factor that determines what AI models you can run locally.

This guide gives you the exact formulas and tables to calculate your VRAM requirements before you waste time (or money) on hardware that can’t run what you need.

Why VRAM Is the Bottleneck for AI

When you run a Large Language Model on your GPU, the entire model needs to fit in VRAM. Not “most of it.” Not the “active parts.” The whole thing—every parameter, every weight.

Here’s why: GPUs are extremely fast at the parallel math required for AI inference, but only when the data is in their local memory (VRAM). If even part of the model spills over to system RAM, performance falls off a cliff. We’re talking 10-100x slower, sometimes more.

Think of it like a chef’s workspace. System RAM is the pantry down the hall—you can store a lot there, but every trip takes time. VRAM is the counter in front of you—limited space, but everything is instantly accessible. For AI inference, you need your entire recipe (the model) on that counter.

VRAM vs System RAM:

  • VRAM (GPU memory): High-speed memory on your graphics card. Directly accessible by GPU cores. This is what matters for AI.
  • System RAM: Your computer’s main memory. Much slower for AI workloads. Used as fallback when VRAM runs out.

Some frameworks support “CPU offloading,” where portions of the model run on system RAM. This works in a pinch, but expect a 5-20x performance penalty. It’s a last resort, not a strategy.

The Simple Formula for VRAM Requirements

Here’s the formula that will save you hours of confusion:

VRAM Required = (Model Parameters × Bytes per Parameter) + Overhead

The “bytes per parameter” depends on the precision format:

| Precision Format | Bytes per Parameter | VRAM per Billion Parameters |
|------------------|---------------------|-----------------------------|
| FP32 (32-bit)    | 4 bytes             | ~4 GB                       |
| FP16 (16-bit)    | 2 bytes             | ~2 GB                       |
| INT8 (8-bit)     | 1 byte              | ~1 GB                       |
| INT4 (4-bit)     | 0.5 bytes           | ~0.5 GB                     |

Real example calculations (from Hugging Face model specs):

A Llama 3 8B model at different precisions:

  • FP16 (standard): 8B × 2 bytes = 16 GB VRAM
  • INT8 (quantized): 8B × 1 byte = 8 GB VRAM
  • INT4 (heavily quantized): 8B × 0.5 bytes = 4 GB VRAM

A Llama 3 70B model:

  • FP16: 70B × 2 bytes = 140 GB VRAM (not happening on consumer hardware)
  • INT4: 70B × 0.5 bytes = 35 GB VRAM (possible on RTX 5090 or multiple GPUs)

These are baseline requirements. In practice, you need 10-20% extra for:

  • KV cache (grows with context length)
  • Framework overhead
  • CUDA memory management

Rule of thumb: Add 15% to your calculated VRAM need for safety margin.
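If you’d rather not do the arithmetic by hand, here’s a minimal Python sketch of the same calculation. The bytes-per-parameter values and the 15% overhead margin come straight from the numbers above; the function name and structure are just illustrative:

# Rough VRAM estimate: parameters x bytes per parameter, plus ~15% overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 0.15) -> float:
    """Return an approximate VRAM requirement in GB for a dense model."""
    base_gb = params_billion * BYTES_PER_PARAM[precision]
    return base_gb * (1 + overhead)

# Example: Llama 3 8B at INT4 -> roughly 4.6 GB including overhead
print(f"{estimate_vram_gb(8, 'int4'):.1f} GB")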

Complete VRAM Requirements Table (2026 Models)

Here’s a comprehensive table of popular models and their VRAM requirements. These are real-world numbers including typical overhead.

Llama Family (Meta)

| Model          | Parameters | Q4_K_M | Q5_K_M | Q8_0   | FP16   |
|----------------|------------|--------|--------|--------|--------|
| Llama 3.2 1B   | 1B         | 0.8 GB | 1 GB   | 1.5 GB | 2.5 GB |
| Llama 3.2 3B   | 3B         | 2 GB   | 2.5 GB | 3.5 GB | 7 GB   |
| Llama 3.1 8B   | 8B         | 5 GB   | 6 GB   | 9 GB   | 17 GB  |
| Llama 3.1 70B  | 70B        | 40 GB  | 48 GB  | 75 GB  | 145 GB |
| Llama 3.1 405B | 405B       | 230 GB | 275 GB | 430 GB | 850 GB |

Mistral Family

| Model           | Parameters        | Q4_K_M | Q5_K_M | Q8_0   | FP16   |
|-----------------|-------------------|--------|--------|--------|--------|
| Mistral 7B      | 7B                | 4.5 GB | 5.5 GB | 8 GB   | 15 GB  |
| Mixtral 8x7B    | 47B (active: 13B) | 27 GB  | 33 GB  | 50 GB  | 100 GB |
| Mistral Large 2 | 123B              | 70 GB  | 85 GB  | 130 GB | 260 GB |

Qwen Family (Alibaba)

| Model       | Parameters | Q4_K_M | Q5_K_M | Q8_0  | FP16   |
|-------------|------------|--------|--------|-------|--------|
| Qwen 2.5 7B | 7B         | 4.5 GB | 5.5 GB | 8 GB  | 15 GB  |
| Qwen 2.5 14B| 14B        | 9 GB   | 11 GB  | 16 GB | 30 GB  |
| Qwen 2.5 72B| 72B        | 42 GB  | 50 GB  | 78 GB | 150 GB |

Smaller/Efficient Models

| Model        | Parameters | Q4_K_M | Q5_K_M | Q8_0   | FP16  |
|--------------|------------|--------|--------|--------|-------|
| Phi-3.5 Mini | 3.8B       | 2.5 GB | 3 GB   | 4.5 GB | 8 GB  |
| Phi-3 Medium | 14B        | 9 GB   | 11 GB  | 16 GB  | 30 GB |
| Gemma 2 9B   | 9B         | 6 GB   | 7 GB   | 10 GB  | 19 GB |
| Gemma 2 27B  | 27B        | 16 GB  | 19 GB  | 29 GB  | 56 GB |

Reading the table: Q4_K_M and Q5_K_M are the most common quantization formats for daily use—good balance of quality and size. Q8_0 offers near-original quality. FP16 is the unquantized format.

For most users, Q4_K_M or Q5_K_M is the sweet spot. You’ll barely notice quality differences from FP16 for typical use cases.

Context Length: The Hidden VRAM Cost

Here’s what catches many people off guard: VRAM requirements grow with context length.

When an LLM processes text, it maintains a “KV cache”—a record of all previous tokens it needs to reference. This cache consumes VRAM, and it scales linearly with context length.

The additional VRAM math:

KV Cache VRAM ≈ 2 × Layers × Heads × (Head Dimension) × (Context Length) × Precision Bytes

That’s complex, so here’s a practical table for a typical 7B model:

| Context Length  | Additional KV Cache VRAM (FP16) |
|-----------------|---------------------------------|
| 2,048 tokens    | ~0.5 GB                         |
| 4,096 tokens    | ~1 GB                           |
| 8,192 tokens    | ~2 GB                           |
| 16,384 tokens   | ~4 GB                           |
| 32,768 tokens   | ~8 GB                           |
| 65,536 tokens   | ~16 GB                          |
| 131,072 tokens  | ~32 GB                          |

What this means in practice:

That 8B model that “fits” in 5GB at Q4 with 4K context? Try to use its full 128K context window, and you suddenly need 37GB+. This is why you might load a model successfully but crash when you paste in a long document.
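To sanity-check numbers like these for a specific model, here’s a minimal Python sketch of the KV cache formula above. One caveat worth flagging: the head count that matters is the number of KV heads, so models with grouped-query attention (Llama 3 8B, for example, has 32 layers, 8 KV heads, and a head dimension of 128) cache noticeably less per token than the generic estimates in the table:

# KV cache bytes = 2 (keys + values) x layers x kv_heads x head_dim x context x bytes per value
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB (FP16 cache by default)."""
    total_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / (1024 ** 3)

# Llama 3 8B-style config (32 layers, 8 KV heads, head_dim 128) at 8K context: ~1 GiB
print(f"{kv_cache_gb(32, 8, 128, 8192):.2f} GiB")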

Practical guidelines:

  • For 8GB VRAM: Stick to 4K-8K context reliably
  • For 16GB VRAM: Use up to 16K context comfortably
  • For 24GB VRAM: 32K context is reasonable
  • For 32GB+ VRAM: Extended context windows become practical

How Quantization Changes Everything

Quantization compresses model weights by reducing their numerical precision. It’s the magic that makes local AI practical.

How it works (simplified):

  • Original models use 16-bit floating-point numbers (FP16)
  • Quantization converts these to 8-bit, 4-bit, or even lower
  • Each step roughly halves the VRAM requirement
  • Quality degrades slightly, but often imperceptibly
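Here’s a toy Python sketch of the core idea: store a per-block scale plus small integer codes instead of the original 16-bit weights. Real GGUF formats like Q4_K_M are cleverer about block structure and outliers, so treat this as an illustration, not the actual algorithm:

import numpy as np

def quantize_block_int4(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block of weights to 4-bit codes plus a single scale."""
    scale = np.abs(weights).max() / 7.0           # map the block onto [-7, 7]
    codes = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return codes, scale                           # 4 bits per weight + one scale

def dequantize_block(codes: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate weights from the codes and scale."""
    return codes.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)    # one 32-weight block
codes, scale = quantize_block_int4(block)
error = np.abs(block - dequantize_block(codes, scale)).mean()
print(f"mean absolute error after 4-bit round trip: {error:.4f}")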

Common quantization formats:

| Format | Quality Impact            | Use Case                   |
|--------|---------------------------|----------------------------|
| Q8_0   | Minimal (~1% degradation) | When quality is paramount  |
| Q6_K   | Very slight               | High-quality default       |
| Q5_K_M | Slight                    | Balanced choice            |
| Q4_K_M | Moderate                  | Best value for most users  |
| Q4_0   | Noticeable                | When VRAM is very tight    |
| Q3_K   | Significant               | Last resort                |
| Q2_K   | Heavy                     | Don’t use for serious work |

My recommendations:

  • Primary workhorse: Q4_K_M or Q5_K_M. Best balance of quality and VRAM.
  • Quality-critical tasks: Q8_0 if you have the VRAM
  • VRAM-constrained: Q4_K_M is the floor for usable quality
  • Avoid: Q3 and below for anything you care about

The “_K_M” suffix indicates K-quants with medium precision—a good middle ground. The “_S” variants save a bit more VRAM with slightly lower quality.

GPU VRAM Options in 2026

Let me map VRAM requirements to actual GPU options. For the latest GPU specs, check NVIDIA’s GeForce page.

Consumer NVIDIA GPUs

| GPU                | VRAM  | Best Model Size   | Street Price (Jan 2026) |
|--------------------|-------|-------------------|-------------------------|
| RTX 4060           | 8 GB  | 7B Q4             | ~$300                   |
| RTX 4060 Ti 16GB   | 16 GB | 13B Q4, 7B Q8     | ~$450                   |
| RTX 4070           | 12 GB | 7B Q8, 13B Q4     | ~$550                   |
| RTX 4070 Ti Super  | 16 GB | 13B Q5            | ~$800                   |
| RTX 4080 Super     | 16 GB | 13B Q5            | ~$1,000                 |
| RTX 4090           | 24 GB | 34B Q4, 13B FP16  | ~$1,600                 |
| RTX 5090           | 32 GB | 70B Q4            | ~$2,000                 |
| RTX 3090 (used)    | 24 GB | 34B Q4, 13B FP16  | ~$650                   |

AMD GPUs

| GPU         | VRAM  | Best Model Size | Street Price |
|-------------|-------|-----------------|--------------|
| RX 7600     | 8 GB  | 7B Q4           | ~$250        |
| RX 7800 XT  | 16 GB | 13B Q4          | ~$450        |
| RX 7900 XTX | 24 GB | 34B Q4          | ~$900        |

Apple Silicon

| Chip    | Unified Memory Options | Notes                   |
|---------|------------------------|-------------------------|
| M3      | 8-24 GB                | Shared with system      |
| M3 Pro  | 18-36 GB               | Shared with system      |
| M3 Max  | 36-128 GB              | Best for large models   |
| M4 Pro  | 24-48 GB               | Good mid-range          |
| M4 Max  | 36-128 GB              | Comparable to RTX 5090  |

Apple Silicon uses unified memory shared between CPU, GPU, and system. It’s not directly comparable to dedicated VRAM, but for AI inference, a 64GB M3 Max can run models that would require a $2,000+ NVIDIA GPU.

Checking Your Current VRAM Usage

Before you run out of memory, monitor your usage:

NVIDIA (nvidia-smi)

# Check current VRAM usage
nvidia-smi

# Continuous monitoring (updates every 1 second)
watch -n 1 nvidia-smi

# Just the memory info
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

Reading the output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Memory-Usage         | GPU-Util             |
|===============================+======================+======================|
|   0  NVIDIA GeForce RTX 4090  | 18432MiB / 24564MiB  |      45%             |
+-------------------------------+----------------------+----------------------+

This shows 18.4GB used of 24.5GB total. You have ~6GB headroom.
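If you want to script that check, here’s a small sketch that asks nvidia-smi for free memory and compares it against a model’s estimated size. It assumes nvidia-smi is on your PATH and that you only care about the first GPU; the 15% margin is the rule of thumb from earlier in this guide:

import subprocess

def free_vram_gb() -> float:
    """Return free VRAM on the first GPU in GiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.splitlines()[0]) / 1024      # MiB -> GiB (first GPU only)

def fits(model_gb: float, margin: float = 0.15) -> bool:
    """Check whether a model of the given size fits with a safety margin."""
    return model_gb * (1 + margin) <= free_vram_gb()

print(fits(5.0))   # e.g., an 8B model at Q4_K_M (~5 GB)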

During Ollama Inference

Ollama reports how much memory each loaded model is actually using:

# List loaded models, their in-memory size, and CPU/GPU split
ollama ps

The SIZE column shows how much memory the model occupies once loaded, and the PROCESSOR column shows how much of it sits on the GPU versus the CPU. Anything other than “100% GPU” means part of the model has spilled into system RAM.

Using nvitop (Better Monitoring)

pip install nvitop
nvitop

This gives a beautiful, real-time dashboard of GPU usage.

What To Do When You Run Out of VRAM

Already hit an out-of-memory error? Here’s your troubleshooting checklist:

1. Use More Aggressive Quantization

If you’re running Q8, try Q5_K_M or Q4_K_M. The quality difference is usually acceptable.

# In Ollama, pull a smaller quantization
ollama pull llama3:8b-q4_K_M

2. Reduce Context Length

Most tools let you limit context length. Shorter context = less KV cache = less VRAM.

# Ollama example: set the context window inside the interactive session
ollama run llama3
>>> /set parameter num_ctx 4096

3. Choose a Smaller Model

An 8B model that runs outperforms a 70B model that doesn’t. Sometimes smaller is better.

Quality rankings within similar sizes:

  • Llama 3.1 8B > Mistral 7B > older models
  • For code: DeepSeek Coder performs above its size class

4. Close Other GPU Applications

Check what else is using your GPU:

nvidia-smi

Browsers, video players, and even some desktop effects use VRAM. Close them before loading large models.

5. Enable CPU Offloading (Last Resort)

Some tools let you offload layers to CPU. It’s slow but works.

# llama.cpp example: offload 20 layers to GPU, rest to CPU
./main -m model.gguf -ngl 20

Expect 5-20x slower inference for offloaded layers.

6. Consider the Upgrade

If you’re constantly fighting VRAM limits, a hardware upgrade may be the real solution. See our GPU buying guide for recommendations.

Frequently Asked Questions

Can I run AI with only 4GB VRAM?

Barely. You can run very small models (1-3B parameters) or heavily quantized 7B models with short context. It’s usable for experimentation but frustrating for real work. For serious local AI, 8GB is the minimum; 16GB is comfortable.

Does faster VRAM (GDDR6X vs GDDR6) matter?

For inference, memory bandwidth matters more than capacity once you have enough VRAM. Higher-speed memory (GDDR6X, GDDR7) improves token generation speed. But if you don’t have enough VRAM to load the model, speed is irrelevant. Capacity first, bandwidth second.

Can I combine CPU and GPU memory?

Technically yes, but with severe performance penalties. When a model exceeds VRAM, frameworks like llama.cpp can offload layers to CPU. Expect each offloaded layer to slow things down significantly. It’s a workaround, not a solution.

How does Apple Silicon unified memory compare?

Apple’s unified memory is shared between CPU and GPU, making direct comparisons tricky. A 64GB M3 Max can run models requiring ~40GB of dedicated VRAM on NVIDIA, but memory bandwidth is lower, so generation speed is typically 30-50% slower. The advantage is that you can actually access that much memory on a laptop.

What about Intel Arc GPUs?

Intel Arc GPUs (like the A770 with 16GB) are budget-friendly and increasingly well supported. Performance is lower than NVIDIA equivalents, and software compatibility is still maturing. They’re viable for experimentation but not my first recommendation for serious work.

Get Enough VRAM, Then Everything Else

Local AI becomes dramatically easier once you have enough VRAM. Models load instantly. No more cryptic crashes. No more mental math about what will fit.

Here’s my summary recommendation:

| Your Goal               | Minimum VRAM | Recommended VRAM |
|-------------------------|--------------|------------------|
| Experiment with AI      | 8 GB         | 12 GB            |
| Regular development     | 12 GB        | 16 GB            |
| Run quality 13B models  | 16 GB        | 24 GB            |
| Run 34B+ models         | 24 GB        | 32 GB+           |

Calculate your needs using the formulas above. Check the model table. Then buy the GPU that fits—or find ways to make your current GPU work with quantization and context limits.

For GPU buying advice, check our complete guide to the best GPUs for AI. For setting up local AI once you have the hardware, see our Ollama tutorial.

Your VRAM is your runway. Make sure it’s long enough for takeoff.
