How to Fine-Tune an LLM: Complete Guide to Custom Model Training (2026)
Author: ZTABS Team
Fine-tuning a large language model means taking a pre-trained model and training it further on your specific data so it performs better on your tasks. It's the difference between a general assistant and a domain expert.
But fine-tuning isn't always the right answer. It's expensive, time-consuming, and unnecessary for many use cases. This guide helps you decide when fine-tuning makes sense, walks you through the process step by step, and helps you avoid the mistakes that waste time and money.
When to Fine-Tune vs. Other Approaches
Before investing in fine-tuning, understand the alternatives and when each makes sense.
Prompt Engineering
What it is: Crafting better prompts (including system messages, few-shot examples, and chain-of-thought instructions) to get the behavior you want from a base model.
When to use it:
- You need the model to follow specific formatting
- Your task is well-defined and can be demonstrated with examples
- You're working with a capable base model (GPT-4o, Claude 3.5 Sonnet)
- You need to iterate quickly (no training required)
Limitations:
- Uses tokens on every request for instructions and examples (higher per-request cost)
- Context window limits how many examples you can include
- Inconsistent behavior despite careful prompting
- Can't teach truly new knowledge or behaviors
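The mechanics are simple: demonstrations go into the message list before the real query. A minimal sketch (the ticket-classification task and examples here are illustrative, not from a real system):

```python
def build_few_shot_prompt(system, examples, query):
    """Assemble a chat message list: system prompt, worked examples, then the real query."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

messages = build_few_shot_prompt(
    "Classify support tickets as 'billing', 'bug', or 'how-to'. Reply with the label only.",
    [("I was charged twice this month.", "billing"),
     ("The export button crashes the app.", "bug")],
    "How do I reset my password?",
)
# messages now holds 6 turns: system, two demonstration pairs, and the query
```

Note that every request re-sends all of these turns, which is exactly the per-request token cost listed above.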
RAG (Retrieval-Augmented Generation)
What it is: Giving the model access to external knowledge by retrieving relevant documents and including them in the prompt context.
When to use it:
- The model needs access to specific, up-to-date information
- Your knowledge base changes frequently
- You need citations and source attribution
- You want to avoid hallucination about factual information
Limitations:
- Doesn't change the model's behavior or style
- Retrieval quality directly affects output quality
- Adds latency (retrieval step before generation)
- Complex to maintain at scale
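The retrieval step can be anything from keyword search to a vector database. A toy sketch using plain word overlap (real systems use embedding similarity, and the documents here are made up):

```python
import re

def retrieve(query, docs, k=2):
    """Rank documents by naive word overlap with the query (stand-in for embedding search)."""
    q = set(re.findall(r"\w+", query.lower()))
    scored = sorted(docs,
                    key=lambda d: len(q & set(re.findall(r"\w+", d.lower()))),
                    reverse=True)
    return scored[:k]

docs = [
    "Exports are available under Settings -> Data Management.",
    "Two-factor authentication can be enabled in Security settings.",
    "Billing plans can be changed at any time under Settings -> Billing.",
]
context = retrieve("How do I export my data?", docs, k=1)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: How do I export my data?"
```

The retrieved text is then stuffed into the prompt, which is why retrieval quality directly bounds output quality.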
Fine-Tuning
What it is: Training the model on your data to internalize patterns, styles, domain knowledge, and task-specific behaviors.
When to use it:
- You need consistent style, tone, or formatting that prompting can't achieve
- You need to reduce per-request cost by eliminating long system prompts
- You have domain-specific terminology the model handles poorly
- You need the model to follow complex, multi-step procedures consistently
- You need to run the model locally for privacy or cost reasons
- You have hundreds+ examples of ideal input-output pairs
Decision Framework
| Scenario | Best Approach | Why |
|----------|--------------|-----|
| Model needs your company's data | RAG | Knowledge retrieval, not behavior change |
| Model needs to match your writing style | Fine-tune | Style is learned behavior |
| Model gives inconsistent formats | Fine-tune | Consistent output structure is trained behavior |
| Model needs current information | RAG | Dynamic knowledge retrieval |
| Model needs domain jargon | Fine-tune (or RAG) | Terminology can be trained |
| Quick prototype needed | Prompt engineering | Fastest iteration cycle |
| Cost reduction at scale | Fine-tune | Eliminates long prompts |
| Model needs to follow complex procedures | Fine-tune + RAG | Combine behavior training with knowledge |
Data Preparation: The Foundation
Fine-tuning quality is directly proportional to data quality. "Garbage in, garbage out" holds truer for fine-tuning than for almost any other ML task.
Data Format
Most fine-tuning approaches use a conversational format:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a medical coding assistant that assigns ICD-10 codes to clinical notes."
    },
    {
      "role": "user",
      "content": "Patient presents with acute lower back pain radiating to left leg, worsening over 2 weeks. History of lumbar disc herniation."
    },
    {
      "role": "assistant",
      "content": "Primary: M54.5 (Low back pain)\nSecondary: M51.16 (Intervertebral disc degeneration, lumbar region)\nAdditional: M54.4 (Lumbago with sciatica, left side)"
    }
  ]
}
```
How Much Data Do You Need?
| Quality Level | Minimum Examples | Ideal Range | Notes |
|--------------|-----------------|-------------|-------|
| Basic improvement | 50-100 | 200-500 | Noticeable style/format changes |
| Good quality | 200-500 | 500-2,000 | Reliable domain performance |
| High quality | 500-2,000 | 2,000-10,000 | Strong domain expertise |
| Production grade | 2,000+ | 10,000+ | Consistent, reliable, edge-case handling |
Quality beats quantity. 200 carefully curated, high-quality examples will outperform 2,000 noisy, inconsistent examples every time.
Data Quality Checklist
Before starting fine-tuning, validate your dataset against these criteria:
- Consistency: Do similar inputs produce similar outputs? Contradictory examples confuse the model.
- Correctness: Are all outputs factually accurate and properly formatted? The model will learn your mistakes.
- Diversity: Does the dataset cover the range of inputs the model will see in production?
- Balance: Are different categories/types represented proportionally?
- Length: Are responses the right length? The model will learn to match the average response length.
- Edge cases: Are uncommon but important scenarios included?
Data Cleaning Pipeline
```python
import json

def validate_dataset(filepath: str) -> dict:
    """Scan a JSONL training file and report structural issues and length stats."""
    issues = []
    examples = []

    with open(filepath) as f:
        for i, line in enumerate(f):
            example = json.loads(line)
            messages = example.get("messages", [])

            # Every example needs at least a user turn and an assistant turn
            if len(messages) < 2:
                issues.append(f"Line {i}: Too few messages")
                continue

            roles = [m["role"] for m in messages]
            if roles[-1] != "assistant":
                issues.append(f"Line {i}: Last message should be assistant")

            for msg in messages:
                if not msg.get("content", "").strip():
                    issues.append(f"Line {i}: Empty content for {msg['role']}")

            assistant_msgs = [m for m in messages if m["role"] == "assistant"]
            for msg in assistant_msgs:
                if len(msg["content"]) < 10:
                    issues.append(f"Line {i}: Very short assistant response")

            examples.append(example)

    lengths = [
        len(m["content"])
        for ex in examples
        for m in ex["messages"]
        if m["role"] == "assistant"
    ]

    return {
        "total_examples": len(examples),
        "issues": issues,
        "avg_response_length": sum(lengths) / len(lengths) if lengths else 0,
        "min_response_length": min(lengths) if lengths else 0,
        "max_response_length": max(lengths) if lengths else 0,
    }
```
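Structural validation won't catch duplicated examples, which quietly skew training toward whatever got copied. A small companion check (a sketch; it compares examples in memory rather than streaming a file):

```python
import hashlib

def find_duplicates(examples):
    """Return (first_index, dup_index) pairs where non-system content is identical."""
    seen, dupes = {}, []
    for i, ex in enumerate(examples):
        key = hashlib.sha256(
            "".join(m["content"] for m in ex["messages"] if m["role"] != "system").encode()
        ).hexdigest()
        if key in seen:
            dupes.append((seen[key], i))
        else:
            seen[key] = i
    return dupes

examples = [
    {"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]},
    {"messages": [{"role": "user", "content": "Bye"}, {"role": "assistant", "content": "Goodbye!"}]},
    {"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]},
]
dupes = find_duplicates(examples)  # → [(0, 2)]
```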
Fine-Tuning Methods
There are several approaches to fine-tuning, each with different trade-offs in cost, performance, and complexity.
Full Fine-Tuning
What it is: Updating all parameters of the model during training.
Pros:
- Maximum flexibility and performance potential
- Can significantly alter model behavior
Cons:
- Requires enormous GPU memory (model size × 4-8 for gradients and optimizer states)
- High risk of catastrophic forgetting (losing general capabilities)
- Expensive and slow
- Impractical for models >7B parameters without significant infrastructure
When to use: Almost never in 2026. LoRA and QLoRA achieve comparable results at a fraction of the cost.
LoRA (Low-Rank Adaptation)
What it is: Instead of updating all parameters, LoRA adds small trainable matrices to specific layers. Only these matrices are trained, while the original model weights stay frozen.
Pros:
- 90-99% fewer trainable parameters than full fine-tuning
- Dramatically lower GPU memory requirements
- Fast training (hours instead of days)
- Can maintain multiple LoRA adapters for different tasks
- Easy to merge back into the base model
Cons:
- Slightly lower theoretical maximum performance than full fine-tuning
- Rank selection requires tuning
When to use: This is the default choice for most fine-tuning tasks in 2026.
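The parameter savings are easy to verify: a rank-r adapter on a d_out × d_in weight adds two small matrices, A (r × d_in) and B (d_out × r), so r·(d_in + d_out) trainable parameters per adapted projection. A back-of-envelope check (4096 is a hidden size typical of a 7-8B model; the numbers are illustrative):

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one rank-r adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

d = 4096                        # typical hidden size for a 7-8B model (illustrative)
full = d * d                    # parameters in one full-rank projection matrix
lora = lora_params(d, d, r=16)  # parameters the rank-16 adapter actually trains
print(lora, full, f"{lora / full:.2%}")  # 131072 16777216 0.78%
```

Multiplied across the adapted layers, this is where the "90-99% fewer trainable parameters" figure comes from.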
QLoRA (Quantized LoRA)
What it is: Combines LoRA with 4-bit quantization of the base model. The base model is loaded in 4-bit precision (saving ~75% memory), while LoRA adapters train in full precision.
Pros:
- Fine-tune a 70B model on a single 48GB GPU
- Even lower memory requirements than LoRA
- Training quality nearly matches full LoRA
- Makes fine-tuning accessible on consumer hardware
Cons:
- Slightly slower training due to quantization overhead
- Small quality trade-off compared to full-precision LoRA
- Requires compatible libraries (bitsandbytes)
When to use: When hardware is limited or you're fine-tuning larger models (13B+).
Comparison Table
| Method | GPU Memory (7B model) | Training Time | Quality | Cost |
|--------|----------------------|---------------|---------|------|
| Full fine-tuning | 60-120 GB | Days | Highest | $$$$ |
| LoRA | 12-24 GB | Hours | High | $$ |
| QLoRA | 6-12 GB | Hours | High | $ |
Step-by-Step: Fine-Tuning with OpenAI
OpenAI's fine-tuning API is the simplest way to create a custom model. No GPU management, no infrastructure — just data and an API call.
Step 1: Prepare Your Data
Create a JSONL file with your training examples:
```jsonl
{"messages": [{"role": "system", "content": "You are a customer support agent for Acme Corp. Be helpful, concise, and always suggest relevant product features."}, {"role": "user", "content": "How do I export my data?"}, {"role": "assistant", "content": "To export your data:\n\n1. Go to Settings → Data Management\n2. Click 'Export Data'\n3. Choose your format (CSV, JSON, or PDF)\n4. Select the date range\n5. Click 'Start Export'\n\nYour export will be ready in 2-5 minutes. You'll receive an email with a download link.\n\nTip: For recurring exports, check out our Scheduled Exports feature under Settings → Automations."}]}
{"messages": [{"role": "system", "content": "You are a customer support agent for Acme Corp. Be helpful, concise, and always suggest relevant product features."}, {"role": "user", "content": "Can I change my billing plan?"}, {"role": "assistant", "content": "Yes, you can change your plan anytime:\n\n1. Go to Settings → Billing\n2. Click 'Change Plan'\n3. Select your new plan\n4. Confirm the change\n\nUpgrades take effect immediately. Downgrades take effect at the end of your current billing cycle — you keep access to premium features until then.\n\nNote: If you're on an annual plan and want to switch to monthly, contact us and we'll prorate your remaining balance."}]}
```
Step 2: Upload and Validate
```python
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)
print(f"File ID: {training_file.id}")
```
Step 3: Start Fine-Tuning
```python
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 1.8,
        "batch_size": 4
    },
    suffix="acme-support-v1"
)

print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
```
Step 4: Monitor Training
```python
events = client.fine_tuning.jobs.list_events(
    fine_tuning_job_id=job.id,
    limit=20
)

for event in events.data:
    print(f"{event.created_at}: {event.message}")
```
Step 5: Use Your Fine-Tuned Model
```python
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org:acme-support-v1:abc123",
    messages=[
        {"role": "system", "content": "You are a customer support agent for Acme Corp."},
        {"role": "user", "content": "How do I set up two-factor authentication?"}
    ]
)
print(response.choices[0].message.content)
```
OpenAI Fine-Tuning Costs
| Model | Training Cost (per 1M tokens) | Inference Input | Inference Output |
|-------|------------------------------|----------------|-----------------|
| gpt-4o-mini | $3.00 | $0.30 | $1.20 |
| gpt-4o | $25.00 | $3.75 | $15.00 |
For most use cases, fine-tuning gpt-4o-mini is the best value. A fine-tuned mini model often matches or exceeds a prompted gpt-4o for domain-specific tasks at 10x lower inference cost.
Step-by-Step: Fine-Tuning with Hugging Face and LoRA
For more control, fine-tune open-source models using Hugging Face Transformers and PEFT (Parameter-Efficient Fine-Tuning).
Step 1: Install Dependencies
```bash
pip install transformers datasets peft accelerate bitsandbytes trl
```
Step 2: Load Model and Tokenizer
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
```
Step 3: Configure LoRA
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
Step 4: Prepare Dataset
```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

def format_chat(example):
    messages = example["messages"]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

dataset = dataset.map(format_chat)
dataset = dataset.train_test_split(test_size=0.1)
```
Step 5: Train
```python
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
)

# Note: newer trl releases move dataset_text_field, max_seq_length, and
# packing into an SFTConfig; this is the older direct-kwarg style.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,
)

trainer.train()
```
Step 6: Save and Merge
```python
trainer.save_model("./lora-adapter")

from peft import AutoPeftModelForCausalLM

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-adapter",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
```
Evaluation: How to Know If Fine-Tuning Worked
Evaluation is where most fine-tuning projects fail. Without proper evaluation, you're flying blind.
Quantitative Metrics
| Metric | What It Measures | How to Compute |
|--------|-----------------|---------------|
| Training loss | Model fit to training data | Logged during training |
| Validation loss | Generalization quality | Computed on held-out data |
| Task-specific accuracy | Correctness on your task | Custom evaluation script |
| ROUGE / BLEU | Text similarity to reference outputs | Standard NLP libraries |
| Latency | Response time | Timed inference calls |
Evaluation Best Practices
Always hold out test data. Never evaluate on data the model trained on. Split your data 80/10/10 (train/validation/test) before training begins.
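A seeded shuffle-and-slice is enough to sketch that split reproducibly:

```python
import random

def split_dataset(examples, seed=42):
    """Shuffle once with a fixed seed, then carve out 80/10/10 train/validation/test."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the seed matters: re-splitting with a different shuffle leaks yesterday's test examples into today's training set.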
Create a golden evaluation set. Hand-craft 50-100 examples that represent the full range of your use case, including edge cases. Evaluate every model version against this set.
Compare against baselines. Always benchmark against:
- The base model with no prompting
- The base model with your best prompt
- The base model with RAG
- Your fine-tuned model
Human evaluation. For subjective quality (tone, helpfulness, accuracy), have domain experts rate a random sample of outputs. Automated metrics don't capture everything.
```python
import json

def evaluate_model(model, tokenizer, test_file: str):
    """Compare model generations against reference outputs on a held-out test file."""
    results = []

    with open(test_file) as f:
        for line in f:
            example = json.loads(line)
            messages = example["messages"]
            expected = messages[-1]["content"]
            input_messages = messages[:-1]

            inputs = tokenizer.apply_chat_template(
                input_messages,
                return_tensors="pt",
                add_generation_prompt=True
            ).to(model.device)

            # do_sample=True is required for temperature to take effect
            outputs = model.generate(
                inputs, max_new_tokens=512, do_sample=True, temperature=0.1
            )
            generated = tokenizer.decode(
                outputs[0][inputs.shape[1]:], skip_special_tokens=True
            )

            results.append({
                "input": input_messages[-1]["content"],
                "expected": expected,
                "generated": generated,
                "exact_match": generated.strip() == expected.strip()
            })

    accuracy = sum(r["exact_match"] for r in results) / len(results)
    return {"accuracy": accuracy, "total": len(results), "results": results}
```
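Exact match is a harsh metric for free-form text, since two correct answers rarely match character-for-character. A token-level F1 (one common softer alternative, not part of the script above; adapt it to your task) gives partial credit:

```python
def token_f1(generated: str, expected: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over overlapping tokens."""
    gen = generated.lower().split()
    exp = expected.lower().split()
    if not gen or not exp:
        return float(gen == exp)

    # Count token overlap, respecting multiplicity
    exp_counts = {}
    for t in exp:
        exp_counts[t] = exp_counts.get(t, 0) + 1
    common = 0
    for t in gen:
        if exp_counts.get(t, 0) > 0:
            common += 1
            exp_counts[t] -= 1

    if common == 0:
        return 0.0
    precision = common / len(gen)
    recall = common / len(exp)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Primary: M54.5 (Low back pain)", "Primary: M54.5 (Low back pain)"))  # 1.0
```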
Deployment Options
After fine-tuning, you need to serve your model. Here are the main options:
OpenAI (Hosted Fine-Tuned Models)
If you fine-tuned through OpenAI's API, deployment is automatic — use the model ID in your API calls. This is the simplest option but ties you to OpenAI's infrastructure and pricing.
Self-Hosted with Ollama
For open-source models, Ollama makes local deployment straightforward:
```bash
# Create a Modelfile
cat << 'EOF' > Modelfile
FROM ./merged-model
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a customer support agent for Acme Corp."
EOF

# Create the model
ollama create acme-support -f Modelfile

# Run it
ollama run acme-support "How do I export my data?"
```
Self-Hosted with vLLM
For production deployments that need high throughput:
```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model ./merged-model \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9
```
vLLM provides an OpenAI-compatible API, so your existing code works with minimal changes.
Cloud Inference Providers
| Provider | Best For | GPU Options | Pricing Model |
|----------|---------|-------------|---------------|
| AWS SageMaker | Enterprise, existing AWS | Full range | Per-hour |
| Google Vertex AI | GCP users, Gemini ecosystem | A100, H100 | Per-hour or per-token |
| Azure ML | Enterprise, OpenAI fine-tunes | Full range | Per-hour |
| Replicate | Fast deployment, API serving | A40, A100 | Per-second |
| Modal | Serverless GPU, burst workloads | A100, H100 | Per-second |
| Together AI | Inference optimization, cost | Various | Per-token |
Cost Analysis
Training Costs
| Model Size | Method | GPU Required | Training Time (1K examples) | Approximate Cost |
|-----------|--------|-------------|---------------------------|-----------------|
| 7-8B | QLoRA | 1× A100 40GB | 1-2 hours | $2-5 |
| 7-8B | LoRA | 1× A100 80GB | 1-2 hours | $3-8 |
| 13B | QLoRA | 1× A100 40GB | 2-4 hours | $5-12 |
| 70B | QLoRA | 1× A100 80GB | 6-12 hours | $15-40 |
| 70B | LoRA | 2-4× A100 80GB | 4-8 hours | $30-80 |
Inference Cost Comparison
| Approach | Cost per 1M tokens | Latency | Quality |
|----------|-------------------|---------|---------|
| GPT-4o (prompted) | $2.50 / $10.00 | ~500ms | Highest |
| GPT-4o-mini (prompted) | $0.15 / $0.60 | ~200ms | High |
| GPT-4o-mini (fine-tuned) | $0.30 / $1.20 | ~200ms | High (domain) |
| Llama 3.1 8B (self-hosted) | ~$0.05 / $0.05 | ~100ms | Good |
| Llama 3.1 8B (fine-tuned, self-hosted) | ~$0.05 / $0.05 | ~100ms | Good (domain) |
Use our LLM Cost Calculator to model costs for your specific volume and use case.
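For a quick back-of-envelope check against the table above (the traffic numbers here are hypothetical):

```python
def monthly_cost(requests, in_tokens, out_tokens, in_price, out_price):
    """Monthly inference cost; prices are USD per 1M tokens."""
    return requests * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# 100k requests/month, ~500 input and ~300 output tokens each,
# at the fine-tuned gpt-4o-mini rates ($0.30 in / $1.20 out per 1M tokens)
cost = monthly_cost(100_000, 500, 300, 0.30, 1.20)
print(f"${cost:.2f}/month")  # $51.00/month
```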
Common Mistakes and How to Avoid Them
1. Fine-Tuning When You Should Be Prompting
Many teams jump to fine-tuning before exhausting prompt engineering. A well-crafted prompt with few-shot examples often gets you 80-90% of the way there at zero training cost.
Fix: Always benchmark your best prompt against your fine-tuned model. If the difference isn't significant, stick with prompting.
2. Insufficient or Low-Quality Data
Fine-tuning on 20 examples or on noisy, inconsistent data produces a model that's worse than the base model with good prompting.
Fix: Invest in data quality. 200 perfect examples beat 2,000 mediocre ones. Have domain experts review and correct every training example.
3. Not Holding Out Test Data
Training on all your data means you can't properly evaluate the model. You might think it's performing well when it's just memorized the training set.
Fix: Always split your data. 80% train, 10% validation, 10% test — and never peek at the test set until final evaluation.
4. Overfitting
Training for too many epochs or with too high a learning rate causes the model to memorize training examples instead of learning general patterns.
Fix: Monitor validation loss during training. When it starts increasing while training loss decreases, you're overfitting. Stop training and use the checkpoint with the lowest validation loss.
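Checkpoint selection then reduces to picking the epoch where validation loss bottomed out:

```python
def best_checkpoint(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

# Training loss keeps falling, but validation loss turns upward after epoch index 2:
val_losses = [1.82, 1.41, 1.29, 1.35, 1.48]
print(best_checkpoint(val_losses))  # 2 -> keep that checkpoint, discard later epochs
```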
5. Ignoring the Base Model's Capabilities
Fine-tuning can cause "catastrophic forgetting" — the model loses general capabilities while gaining domain-specific ones.
Fix: Use LoRA (which preserves base model weights) and periodically evaluate on general benchmarks to ensure you haven't degraded core capabilities.
6. Not Planning for Model Updates
Base models get updated regularly. When Llama 4 comes out, your Llama 3 fine-tune becomes outdated. You need a reproducible training pipeline.
Fix: Version your training data, scripts, and hyperparameters. Automate the training pipeline so re-running with a new base model is straightforward.
7. Skipping Human Evaluation
Automated metrics (loss, BLEU, ROUGE) don't tell the full story. A model with low loss can still produce outputs that are technically correct but unhelpful or awkward.
Fix: Budget for human evaluation. Have domain experts rate a random sample of outputs on a rubric (accuracy, helpfulness, tone, completeness).
The Decision Checklist
Before starting a fine-tuning project, answer these questions:
- Have you tried prompt engineering thoroughly? If not, do that first.
- Do you have at least 200 high-quality examples? If not, invest in data before training.
- Is your goal to change behavior or add knowledge? For knowledge, use RAG. For behavior, fine-tune.
- Do you have a clear evaluation plan? If you can't measure improvement, you can't know if fine-tuning worked.
- Do you have budget for iteration? First fine-tuning attempts rarely produce the final model. Budget for 3-5 training runs.
- Do you have a deployment plan? Fine-tuning is useless without serving the model to users.
If you answered yes to all six questions, you're ready to fine-tune. If you need help architecting your fine-tuning pipeline or choosing the right approach for your use case, our AI development team specializes in LLM fine-tuning for production systems.
Need Help Building Your Project?
From web apps and mobile apps to AI solutions and SaaS platforms — we ship production software for 300+ clients.