Fine-Tune Llama: Customize AI For Your Use Case (2026)
Learn how to fine-tune Llama 4 with LoRA to create a custom AI model. Step-by-step tutorial covering dataset prep, training, and deployment.
Last month I built a code review assistant that actually understands my team’s coding standards. Not just generic best practices—our specific conventions, our preferred patterns, our exact feedback style.
The secret wasn’t clever prompting. It was fine-tuning. I took Meta’s Llama 4 model and trained it on examples of how we do code reviews. Now it gives feedback that sounds like a senior developer on our team, not a generic AI.
Fine-tuning sounds intimidating, but modern techniques like LoRA make it surprisingly accessible. You don’t need a data center or a PhD. With the right approach, you can create genuinely custom AI models on consumer hardware using tools from Hugging Face.
Let me walk you through exactly how to do it.
When to Fine-Tune vs Prompt Engineer
Before investing in fine-tuning, ask whether you actually need it.
Prompt Engineering Limits
Prompt engineering is powerful. With good prompts, you can make AI write in specific styles, follow complex formats, and behave in particular ways. See our prompt engineering guide for fundamentals.
But prompting has limits:
- Token costs: Long system prompts use tokens every time
- Context limits: You can only fit so many examples in a prompt
- Consistency: Models can drift from prompt instructions over long outputs
- Deep behavior: Some behaviors are hard to describe but easy to demonstrate
If you find yourself copying the same 500-word prompt into every conversation, or the AI frequently “forgets” your instructions mid-response, fine-tuning might help.
Fine-Tuning Benefits
Fine-tuning bakes behavior into the model itself:
- No prompt overhead: The model just knows how to behave
- Better consistency: Learned behavior is more stable than prompted behavior
- Unlimited context for training: You can show thousands of examples
- Proprietary knowledge: Train on data you can’t share in prompts
Decision Framework
Use prompt engineering when:
- Your needs can be described in a few paragraphs
- You’re still experimenting with what you want
- You need flexibility to change behavior easily
- Your requirements are similar to the model’s default behavior
Use fine-tuning when:
- You have many examples of desired behavior
- You need very consistent, specific output style
- Prompt overhead is costing you tokens
- The behavior is hard to describe but easy to show
- You’re building a product with specific requirements
Prerequisites and Requirements
Fine-tuning requires some computational resources. Let’s be realistic about what you need.
Hardware Requirements
Minimum viable setup:
- GPU with 12+ GB VRAM (RTX 3080, RTX 4070 Ti, etc.)
- 32 GB system RAM
- 50+ GB free disk space
- CUDA support (NVIDIA GPUs)
Recommended setup:
- GPU with 24+ GB VRAM (RTX 4090, A5000, etc.)
- 64 GB system RAM
- 200+ GB SSD storage
- Fast internet for downloading models
Cloud alternatives: If your hardware is insufficient, cloud options work well:
- Google Colab Pro+ (~$50/month) has adequate GPUs
- RunPod, Lambda Labs offer hourly GPU rental
- Many tutorials run fine on cloud platforms
For more on hardware considerations, see our GPU guide for AI.
Software Setup
You’ll need:
- Python 3.10+
- CUDA toolkit (version matching your GPU driver)
- Hugging Face account (free, for model access)
- Meta’s Llama license acceptance
Install the core libraries:
pip install torch transformers datasets accelerate peft bitsandbytes trl
These packages provide:
- torch: PyTorch deep learning framework
- transformers: Hugging Face model library
- datasets: Dataset handling
- accelerate: Training optimization
- peft: LoRA and adapter methods
- bitsandbytes: Quantization for lower memory
- trl: Fine-tuning helpers
Understanding the Compute Cost
Let’s be honest about costs:
Time:
- Small dataset (100 examples): 30 min - 2 hours
- Medium dataset (1,000 examples): 2 - 6 hours
- Large dataset (10,000+ examples): 12+ hours
Electricity: Running GPUs at full load uses significant power. Budget accordingly if using personal hardware.
Cloud costs: At ~$1-3/hour for capable GPUs, a training run might cost $5-50 depending on dataset size.
Alternatives for Limited Hardware
If you can’t access a high-end GPU:
- Google Colab: Free tier sometimes has T4 GPUs (limited but usable)
- Kaggle Notebooks: 30 hours/week of free GPU
- Smaller models: Fine-tune smaller Llama variants (8B instead of 70B)
- Aggressive quantization: Use 4-bit training to reduce memory
Understanding LoRA and QLoRA
You don’t need to fully understand these techniques to use them, but a basic grasp helps.
Full Fine-Tuning vs LoRA
Full fine-tuning updates every parameter in the model. For a 7B parameter model, that means adjusting 7 billion numbers. This requires enormous memory and storage: you end up with a complete modified copy of the model, just as large as the original.
LoRA (Low-Rank Adaptation) takes a clever shortcut. Instead of modifying all parameters, it trains small adapter matrices that modify the model’s behavior. For a 7B model, the adapters might be only 20-100 MB.
The result is:
- 99%+ fewer trainable parameters, which slashes optimizer memory during training
- Tiny storage (adapters only)
- Similar quality to full fine-tuning for most use cases
- Can merge adapters into base model later
How LoRA Works
Without getting too mathematical: neural networks have layers that transform information. Each layer has a weight matrix. LoRA adds small “side matrices” to these weights that adjust behavior.
During training, only these small matrices are updated. The original model weights never change. This is why it’s so efficient.
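To make the efficiency concrete, here's a back-of-the-envelope sketch. The layer size is illustrative, not Llama's exact shape:

d = 4096   # hidden size of one projection layer (illustrative)
r = 16     # LoRA rank

full_update = d * d        # parameters touched by full fine-tuning of this layer
lora_update = 2 * d * r    # two small matrices: (d x r) and (r x d)

print(f"Full fine-tuning: {full_update:,} params")        # 16,777,216
print(f"LoRA adapters:    {lora_update:,} params")        # 131,072
print(f"Reduction:        {full_update // lora_update}x")  # 128x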
After training, you can either:
- Keep adapters separate (load base model + adapter when running)
- Merge adapters into the base model (a one-time merge step, simpler to deploy)
QLoRA for Memory Efficiency
QLoRA (Quantized LoRA) goes further by loading the base model in 4-bit precision, shrinking the base weights to roughly a quarter of their 16-bit footprint.
The trade-off is slight quality reduction and slower training. But it enables fine-tuning models that otherwise wouldn’t fit in memory.
When to use each:
- LoRA: When you have enough VRAM for the base model in 16-bit
- QLoRA: When memory is tight, or for larger models
Preparing Your Dataset
Dataset quality matters more than quantity. A small, high-quality dataset often outperforms a large, messy one.
Dataset Formats
For instruction fine-tuning (the most common type), your data should be formatted as instruction-response pairs:
{
  "instruction": "How would you describe the character development in this passage?",
  "input": "[passage text here]",
  "output": "The passage shows Dorothy developing from a passive observer to an active participant..."
}
Or in conversational format:
{
  "conversations": [
    {"role": "user", "content": "Review this code for issues: [code]"},
    {"role": "assistant", "content": "I found three issues: First, ..."}
  ]
}
Common formats (a conversion sketch follows this list):
- Alpaca format: instruction, input (optional), output
- ShareGPT format: List of turns with roles
- OpenAI format: Messages array with role and content
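These formats are easy to convert between. Here's a minimal sketch that maps Alpaca-style records (the keys shown above) to a chat-style messages list; adapt the field names to whatever your data actually uses:

def alpaca_to_messages(example):
    # Fold the optional "input" field into the user turn
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }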
Quality Over Quantity
100 excellent examples often beat 10,000 mediocre ones.
What makes a good training example:
- Representative: Shows the behavior you actually want
- Consistent: All examples follow the same style
- Complete: Full responses, not cut off
- Correct: No errors in the outputs you’re training toward
What to avoid:
- Contradictory examples (one says “use passive voice,” another uses active)
- Low-quality outputs (typos, unclear writing)
- Ambiguous examples where good behavior isn’t clear
Cleaning and Formatting
Before training, clean your data:
import json

def clean_example(example):
    # Remove excessive whitespace
    example['instruction'] = ' '.join(example['instruction'].split())
    example['output'] = example['output'].strip()
    # Ensure minimum quality
    if len(example['output']) < 50:
        return None  # Skip too-short examples
    return example

# Load and clean
with open('raw_data.json', 'r') as f:
    data = json.load(f)

cleaned = [clean_example(ex) for ex in data]
cleaned = [ex for ex in cleaned if ex is not None]

# Save cleaned data
with open('cleaned_data.json', 'w') as f:
    json.dump(cleaned, f)
Train/Validation Split
Always hold out some data for validation:
from sklearn.model_selection import train_test_split
train_data, val_data = train_test_split(cleaned, test_size=0.1, random_state=42)
print(f"Training examples: {len(train_data)}")
print(f"Validation examples: {len(val_data)}")
Validation data helps you detect overfitting—when the model memorizes your training data rather than learning generalizable patterns.
Step-by-Step Fine-Tuning with Hugging Face
Let’s build a complete fine-tuning pipeline.
Environment Setup
Create a fresh virtual environment:
python -m venv finetune-env
source finetune-env/bin/activate # Windows: finetune-env\Scripts\activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets accelerate peft bitsandbytes trl
Verify GPU access:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0)}")
print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Loading the Base Model
For this tutorial, we’ll use Llama 4 8B—large enough to be capable, small enough to fine-tune on consumer hardware:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "meta-llama/Llama-4-8B-Instruct"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Load model in 4-bit for memory efficiency (QLoRA)
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
You’ll need to have accepted Meta’s Llama license and be logged into Hugging Face:
huggingface-cli login
For more on accessing models, see our Hugging Face tutorial.
Configuring LoRA
Set up the LoRA configuration:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# Prepare model for training
model = prepare_model_for_kbit_training(model)
# LoRA configuration
lora_config = LoraConfig(
    r=16,                    # Rank - higher = more capacity, more memory
    lora_alpha=32,           # Scaling factor
    target_modules=[         # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.1,        # Dropout for regularization
    bias="none",             # Don't train biases
    task_type="CAUSAL_LM"    # Task type
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
The output will show something like:
trainable params: 19,922,944 || all params: 8,030,261,248 || trainable%: 0.2481
Only 0.25% of parameters are being trained—that’s the LoRA efficiency.
Training Parameters
Configure the training run:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./llama4-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    fp16=True,
    optim="paged_adamw_32bit",
    max_grad_norm=0.3,
    lr_scheduler_type="cosine",
)
Key parameters explained:
- num_train_epochs: How many times to go through the data (2-5 usually)
- batch_size × gradient_accumulation: Effective batch size (16 here; see the arithmetic sketch after this list)
- learning_rate: 1e-4 to 3e-4 works well for LoRA
- warmup_ratio: Gradual learning rate increase at start
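To plan run time, it helps to work out how many optimizer steps these settings imply. A quick sketch with an illustrative dataset size:

num_examples = 1000    # illustrative training set size
per_device_batch = 4
grad_accum = 4
epochs = 3

effective_batch = per_device_batch * grad_accum      # 16
steps_per_epoch = num_examples // effective_batch    # 62
total_steps = steps_per_epoch * epochs               # 186

print(f"Effective batch: {effective_batch}, steps/epoch: {steps_per_epoch}, total steps: {total_steps}")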
Running the Training
Prepare your data and start training:
from datasets import load_dataset
from trl import SFTTrainer
# Load your dataset
dataset = load_dataset("json", data_files={
    "train": "train_data.json",
    "validation": "val_data.json"
})

# Format function for your data (match this format exactly at inference time)
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""

# Create trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=format_instruction,
    max_seq_length=2048,
)
# Train!
trainer.train()
Training progress will be logged:
{'loss': 2.1453, 'learning_rate': 0.0002, 'epoch': 0.5}
{'loss': 1.8234, 'learning_rate': 0.00019, 'epoch': 1.0}
...
Watch for decreasing loss. If validation loss starts increasing while training loss decreases, you’re overfitting.
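One way to catch this automatically is early stopping on validation loss. A minimal sketch, assuming you also set load_best_model_at_end=True and metric_for_best_model="eval_loss" in the TrainingArguments above:

from transformers import EarlyStoppingCallback

# Add before calling trainer.train(): stop if eval loss fails to improve
# for 2 consecutive evaluations
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=2))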
Saving the Adapter
After training, save your LoRA adapter:
# Save adapter weights
model.save_pretrained("./llama4-lora-adapter")
# Also save tokenizer
tokenizer.save_pretrained("./llama4-lora-adapter")
This creates a small directory (tens of MB) with just your fine-tuned adapter weights.
Evaluating Your Fine-Tuned Model
Before deploying, test your model thoroughly.
Qualitative Testing
Load your fine-tuned model and test manually:
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./llama4-lora-adapter")

# Test generation
def generate(prompt, max_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        do_sample=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test with examples similar to training data
print(generate("### Instruction:\n[Your test instruction]\n\n### Response:\n"))
Compare outputs to:
- The original base model without fine-tuning (see the comparison sketch after this list)
- Your training examples
- Your actual use case requirements
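Because the LoRA weights are just an adapter on top of the base model, peft lets you toggle them off for a quick side-by-side comparison without reloading anything:

prompt = "### Instruction:\n[Your test instruction]\n\n### Response:\n"

# Fine-tuned output (adapter active)
print("Fine-tuned:", generate(prompt))

# Base model output (adapter temporarily disabled)
with model.disable_adapter():
    print("Base model:", generate(prompt))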
Quantitative Metrics
For more rigorous evaluation, compute metrics on held-out data:
import numpy as np
# Run inference on validation set
predictions = []
references = []
for example in dataset["validation"]:
    prompt = format_instruction(example).split("### Response:")[0] + "### Response:\n"
    pred = generate(prompt)
    predictions.append(pred)
    references.append(example["output"])
# Compute metrics (using your preferred metric)
# Options: BLEU, ROUGE, custom similarity scores, LLM-as-judge
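As one concrete option, Hugging Face's evaluate library computes ROUGE in a couple of lines (assumes pip install evaluate rouge_score; treat the numbers as a rough signal rather than ground truth):

import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1 / rouge2 / rougeL F-measures between 0 and 1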
For many fine-tuning use cases, qualitative assessment by humans is more meaningful than automatic metrics.
Overfitting Detection
Signs of overfitting:
- Validation loss increasing while training loss decreases
- Model outputs training examples nearly verbatim
- Good performance on training-like inputs, poor on novel inputs
If overfitting:
- Use less training data
- Train for fewer epochs
- Increase dropout
- Reduce LoRA rank
Deploying Your Custom Model
Once you’re happy with your model, deploy it.
Merging LoRA Weights
For simpler deployment, merge adapters into the base model:
# Merge adapter into base model
merged_model = model.merge_and_unload()
# Save the merged model
merged_model.save_pretrained("./llama4-merged")
tokenizer.save_pretrained("./llama4-merged")
This creates a full model ready to run without loading adapters separately.
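The merged folder then loads like any other Hugging Face model. A minimal sketch using the transformers pipeline (keep the prompt format identical to the one used in training):

from transformers import pipeline

pipe = pipeline("text-generation", model="./llama4-merged", device_map="auto")
result = pipe(
    "### Instruction:\n[Your test instruction]\n\n### Response:\n",
    max_new_tokens=200,
)
print(result[0]["generated_text"])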
Using with Ollama
To use your fine-tuned model with Ollama for easy local deployment:
- Convert to GGUF format (Ollama’s preferred format)
- Create a Modelfile
- Create the Ollama model
# Convert to GGUF (using llama.cpp's convert script)
python convert.py ./llama4-merged --outtype f16 --outfile llama4-finetuned.gguf
# Quantize for efficiency
./quantize llama4-finetuned.gguf llama4-finetuned-q4.gguf Q4_K_M
Create a Modelfile:
FROM ./llama4-finetuned-q4.gguf
SYSTEM "You are a helpful assistant fine-tuned for [your use case]."
Create in Ollama:
ollama create my-custom-llama -f Modelfile
ollama run my-custom-llama
See our Ollama guide for more details on local deployment.
API Deployment Options
For production APIs:
- vLLM: High-throughput inference server
- Text Generation Inference (TGI): Hugging Face’s server
- Replicate: Cloud deployment with simple API
Each option has trade-offs in complexity, cost, and performance; a minimal vLLM sketch follows below.
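For example, once the LoRA weights are merged you can serve the model behind vLLM's OpenAI-compatible API and call it with the standard openai client. A sketch, assuming the server was started with python -m vllm.entrypoints.openai.api_server --model ./llama4-merged --port 8000:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.completions.create(
    model="./llama4-merged",   # vLLM registers the model under the path you served
    prompt="### Instruction:\n[Your test instruction]\n\n### Response:\n",
    max_tokens=200,
)
print(response.choices[0].text)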
Common Mistakes and Fixes
Problem: Loss doesn’t decrease
- Learning rate too low or too high
- Dataset formatting issues
- Try learning rates: 1e-4, 2e-4, 3e-4
Problem: Model outputs garbage
- Tokenizer mismatch (using wrong tokenizer)
- Training data format doesn’t match inference format
- Not enough training data
Problem: Model memorizes but doesn’t generalize
- Classic overfitting
- Reduce epochs, increase dropout
- Add more diverse training examples
Problem: Out of memory during training
- Reduce batch size
- Use gradient checkpointing: model.gradient_checkpointing_enable()
- Use a smaller LoRA rank (r=8 instead of r=16)
Problem: Training is too slow
- Enable bf16/fp16 training
- Use flash attention if available
- Reduce max sequence length if your data is shorter
Frequently Asked Questions
How much training data do I need?
Quality matters more than quantity. 50-100 high-quality examples can produce a noticeable improvement, and 500-1,000 examples are usually enough for good results. More data helps for complex tasks, but diminishing returns set in.
Can I fine-tune on copyrighted material?
Technically possible, legally questionable. The training process doesn’t “contain” the original data, but you may face legal challenges if your model reproduces copyrighted content. Consult legal advice for commercial use.
How long does fine-tuning take?
With QLoRA on a capable GPU: 1-2 hours for small datasets (100-500 examples), 4-8 hours for medium datasets, 12+ hours for large datasets. Cloud GPUs are often faster than consumer hardware.
What’s the quality difference between LoRA and full fine-tuning?
For most use cases, nearly identical. LoRA might slightly underperform for very complex behavior changes, but the efficiency gain is almost always worth it.
Can I fine-tune GPT-5 or Claude?
No—those are closed models. You can only fine-tune open source models like Llama, Mistral, and Phi. Some closed providers offer fine-tuning services (OpenAI), but you don’t get the model weights.
Conclusion
Fine-tuning isn’t magic, but it is powerful. With LoRA and modern tools, you can create genuinely custom AI models that understand your specific needs—not just generic capabilities twisted through prompting.
The key takeaways:
- Fine-tune when prompting isn’t enough (consistency, efficiency, proprietary knowledge)
- Quality training data matters more than quantity
- LoRA makes fine-tuning accessible on consumer hardware
- Test thoroughly before deploying
My code review assistant has saved my team countless hours. The investment in fine-tuning paid for itself in the first week. Your custom model might do the same for your use case.
For more on working with open source AI, explore our guides on Llama models and running AI locally with Ollama.
Now go build something that’s actually yours.
Last updated: January 11, 2026