Self-Hosted LLMs: How to Run Llama, Mistral & Open-Source Models On-Premise
By the ZTABS Team
Running your own large language models on-premise gives you full control over data privacy, latency, costs, and customization. With open-source models like Llama 3.1, Mistral, and Qwen matching or exceeding GPT-3.5-level performance, self-hosting has become a viable production strategy — not just an experiment.
This guide covers everything you need to deploy and run open-source LLMs on your own infrastructure: why to self-host, hardware requirements, deployment frameworks, model selection, fine-tuning, performance optimization, and a detailed cost comparison against API providers.
Why Self-Host an LLM?
Data privacy and compliance
When you use OpenAI, Anthropic, or Google APIs, your data leaves your infrastructure. For companies in healthcare, finance, legal, government, or any industry with strict data handling requirements, this is often a non-starter. Self-hosting keeps all data within your network perimeter.
| Concern | API Provider | Self-Hosted |
|---------|--------------|-------------|
| Data leaves your network | Yes | No |
| Vendor has access to prompts | Potentially (varies by provider) | No |
| HIPAA compliance | Requires BAA + careful configuration | Full control |
| GDPR data residency | Depends on provider regions | You choose the location |
| SOC 2 audit trail | Limited visibility | Full logging |
| Air-gapped deployment | Not possible | Fully supported |
Cost at scale
API pricing is usage-based. At low volume, APIs are cheaper. At high volume, self-hosting wins dramatically.
| Daily Requests | GPT-4o-mini (API) Monthly Cost | Self-Hosted Llama 3.1 8B Monthly Cost |
|----------------|--------------------------------|----------------------------------------|
| 1,000 | ~$45 | $200–$500 (GPU server) |
| 10,000 | ~$450 | $200–$500 |
| 50,000 | ~$2,250 | $200–$500 |
| 100,000 | ~$4,500 | $500–$1,000 |
| 500,000 | ~$22,500 | $1,000–$3,000 |
The crossover point where self-hosting becomes cheaper than API calls typically occurs between 5,000 and 20,000 requests per day, depending on the model size and hardware choice.
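The crossover is simple arithmetic, so it is worth sanity-checking against your own numbers. The sketch below assumes a blended price of $0.50 per million tokens (roughly what the GPT-4o-mini figures above imply across input and output) and a fixed monthly server cost; both are illustrative placeholders, not quotes:

```python
def monthly_api_cost(req_per_day, tokens_per_req=3_000, usd_per_mtok=0.50):
    """Approximate monthly API spend at a blended per-million-token price."""
    return req_per_day * 30 * tokens_per_req * usd_per_mtok / 1e6

def crossover_req_per_day(server_monthly_usd, tokens_per_req=3_000, usd_per_mtok=0.50):
    """Daily request volume at which a fixed-cost GPU server matches API spend."""
    return server_monthly_usd * 1e6 / (30 * tokens_per_req * usd_per_mtok)

print(monthly_api_cost(10_000))           # 450.0 — matches the table above
print(round(crossover_req_per_day(350)))  # 7778 requests/day at a $350/mo server
```

At a $350/month GPU server, the break-even lands near 8,000 requests/day, comfortably inside the 5,000–20,000 range quoted above; heavier traffic or a cheaper server pulls it lower.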
Latency
Self-hosted models eliminate network round-trips. For real-time applications — code completion, in-app suggestions, interactive chat — this can cut latency in half.
| Metric | API (GPT-4o-mini) | Self-Hosted (Llama 3.1 8B on A100) |
|--------|-------------------|-------------------------------------|
| Time to first token | 200–500ms | 30–100ms |
| Tokens per second | 80–120 | 100–200 |
| Total latency (100 tokens) | 1–2s | 0.5–1s |
| Availability | 99.9% (OpenAI SLA) | Depends on your infra |
Customization
Self-hosting unlocks capabilities that API providers do not offer:
- Fine-tuning with your proprietary data on your own terms
- Custom tokenizers for domain-specific vocabulary
- Modified inference parameters beyond what APIs expose
- Model merging to combine strengths of multiple models
- Speculative decoding and other advanced inference techniques
Hardware Requirements
GPU comparison for LLM inference
The GPU is the most critical (and expensive) component. LLM inference requires large amounts of GPU memory (VRAM) to hold the model weights.
| GPU | VRAM | FP16 Performance | New Price (est.) | Used/Cloud Hourly | Best For |
|-----|------|------------------|------------------|-------------------|----------|
| NVIDIA A100 80GB | 80GB | 312 TFLOPS | $15,000 | $1.50–$3.00/hr | Production workloads, large models |
| NVIDIA H100 80GB | 80GB | 989 TFLOPS | $30,000 | $2.50–$5.00/hr | Maximum throughput, fine-tuning |
| NVIDIA A10G | 24GB | 125 TFLOPS | $3,500 | $0.50–$1.00/hr | Small to medium models, cost-effective |
| NVIDIA L4 | 24GB | 121 TFLOPS | $2,500 | $0.30–$0.80/hr | Inference-optimized, power-efficient |
| NVIDIA RTX 4090 | 24GB | 165 TFLOPS | $1,600 | N/A (consumer) | Development, small-scale production |
| NVIDIA RTX 3090 | 24GB | 71 TFLOPS | $800 (used) | N/A (consumer) | Budget development |
| Apple M2 Ultra | 192GB unified | ~32 TFLOPS | $5,000 (Mac Studio) | N/A | Large models without NVIDIA |
VRAM requirements by model size
| Model Size | FP16 VRAM | INT8 VRAM | INT4 (GPTQ/AWQ) VRAM | Example Models |
|------------|-----------|-----------|----------------------|----------------|
| 1–3B | 4–6GB | 2–3GB | 1–2GB | Phi-3 Mini, Qwen2 1.5B |
| 7–8B | 14–16GB | 7–8GB | 4–5GB | Llama 3.1 8B, Mistral 7B |
| 13–14B | 26–28GB | 13–14GB | 7–8GB | Llama 2 13B, Qwen 14B |
| 34–35B | 68–70GB | 34–35GB | 18–20GB | CodeLlama 34B, Yi 34B |
| 70–72B | 140GB+ | 70GB+ | 36–40GB | Llama 3.1 70B, Qwen2 72B |
| 405B | 810GB+ | 405GB+ | 200GB+ | Llama 3.1 405B |
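These figures follow from simple arithmetic: each parameter stores one weight at the chosen precision, so weight memory is parameters times bits divided by 8. A back-of-envelope estimator (weights only; KV cache and activations add several GB on top, which is why the table's ranges run slightly high):

```python
def weight_vram_gb(params_billion, bits_per_weight):
    """Weight memory only: parameters x bits / 8, in GB.
    KV cache and activation memory come on top of this."""
    return params_billion * bits_per_weight / 8

print(weight_vram_gb(8, 16))  # 16.0 — Llama 3.1 8B at FP16
print(weight_vram_gb(70, 4))  # 35.0 — Llama 3.1 70B at INT4
```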
Recommended configurations
| Use Case | GPU Setup | Budget (Hardware Only) | Can Run |
|----------|-----------|------------------------|---------|
| Development/prototyping | 1x RTX 4090 (24GB) | $1,600 | 7–8B FP16, 13B INT4 |
| Small production | 1x A10G or L4 (24GB) | $2,500–$3,500 | 7–8B FP16, 13B INT4 |
| Medium production | 1x A100 80GB | $15,000 | 70B INT4, 34B FP16 |
| Large production | 2x A100 80GB | $30,000 | 70B FP16 |
| Maximum scale | 4x H100 80GB | $120,000 | 405B INT4 |
Deployment Frameworks
Ollama
Ollama is the simplest way to run LLMs locally. It packages models with their runtime into a single binary, similar to how Docker packages applications.
Best for: Development, prototyping, small-scale production, desktop deployment.
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b

# Use the API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Explain Kubernetes in 3 sentences."}],
  "stream": true
}'
```
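With `"stream": true`, Ollama returns newline-delimited JSON rather than a single response: one object per line, each carrying a `message.content` fragment, ending with `"done": true`. A minimal sketch of reassembling the text (the sample chunks below are fabricated for illustration):

```python
import json

def collect_stream(ndjson_lines):
    """Reassemble assistant text from Ollama's streaming /api/chat
    response: one JSON object per line, ending with {"done": true}."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        parts.append(chunk["message"]["content"])
    return "".join(parts)

sample = [
    '{"message": {"role": "assistant", "content": "Kubernetes "}, "done": false}',
    '{"message": {"role": "assistant", "content": "orchestrates containers."}, "done": false}',
    '{"done": true}',
]
print(collect_stream(sample))  # Kubernetes orchestrates containers.
```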
OpenAI-compatible API: Ollama exposes an API that is compatible with the OpenAI SDK, making migration from API to self-hosted straightforward.
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const response = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
});
```
| Pros | Cons |
|------|------|
| Easiest setup (single command) | Lower throughput than vLLM/TGI |
| Built-in model management | Limited batching capabilities |
| OpenAI-compatible API | No tensor parallelism (multi-GPU limited to layer offload) |
| Works on Mac (Apple Silicon), Linux, Windows | Less tuning control |
| Supports GGUF quantized models | Not optimized for high-concurrency production |
vLLM
vLLM is a high-throughput inference engine built for production serving. Its key innovation is PagedAttention, which manages GPU memory like an operating system manages RAM — dramatically improving throughput.
Best for: High-throughput production serving, multi-GPU setups, batch processing.
```bash
# Install
pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --tensor-parallel-size 1
```
For multi-GPU serving of a 70B model:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```
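Because vLLM speaks the OpenAI chat-completions protocol, any OpenAI client can talk to it. Here is a stdlib-only sketch assuming the server above is listening locally on port 8000; the payload builder is split out so the request shape can be inspected without a running server:

```python
import json
import urllib.request

def build_chat_payload(model, user_msg, max_tokens=256):
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, user_msg):
    """Send one chat turn to a vLLM (or other OpenAI-compatible) server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, user_msg)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
```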
| Pros | Cons |
|------|------|
| Highest throughput (2–4x over naive serving) | More complex setup |
| PagedAttention for efficient memory | Linux + NVIDIA only |
| Continuous batching | Heavier resource requirements |
| Multi-GPU tensor parallelism | Requires HuggingFace model format |
| OpenAI-compatible API | Steeper learning curve |
| Speculative decoding support | |
Text Generation Inference (TGI)
TGI is Hugging Face's production inference server. It integrates tightly with the Hugging Face ecosystem and supports many optimization techniques out of the box.
Best for: Teams already in the Hugging Face ecosystem, Docker-based deployments.
```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize gptq \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

Note that `--quantize gptq` expects a GPTQ-quantized checkpoint; for FP16 weights like the model ID above, quantize on the fly with `eetq` or `bitsandbytes` instead, or drop the flag entirely.
| Pros | Cons |
|------|------|
| Docker-native deployment | Slightly lower throughput than vLLM |
| Built-in quantization (GPTQ, AWQ, EETQ) | Fewer configuration options |
| Flash Attention support | HuggingFace-centric |
| Token streaming | Less community momentum than vLLM |
| Production-ready logging and metrics | |
Framework comparison
| Feature | Ollama | vLLM | TGI |
|---------|--------|------|-----|
| Setup difficulty | Very easy | Medium | Easy (Docker) |
| Throughput (tokens/sec) | Good | Excellent | Very good |
| Multi-GPU | Limited (layer offload only) | Yes | Yes |
| Quantization support | GGUF | GPTQ, AWQ, FP8 | GPTQ, AWQ, EETQ |
| Batching | Basic | Continuous (PagedAttention) | Continuous |
| OpenAI-compatible API | Yes | Yes | Yes (Messages API) |
| Platform support | Mac, Linux, Windows | Linux (NVIDIA) | Linux (NVIDIA) |
| Production readiness | Small scale | High scale | High scale |
Model Selection Guide
Top open-source models in 2026
| Model | Parameters | Context Length | Strengths | License |
|-------|------------|----------------|-----------|---------|
| Llama 3.1 8B | 8B | 128K | Best quality at small size, multilingual | Llama 3.1 Community |
| Llama 3.1 70B | 70B | 128K | Approaches GPT-4-level, strong reasoning | Llama 3.1 Community |
| Llama 3.1 405B | 405B | 128K | Best open-source model overall | Llama 3.1 Community |
| Mistral 7B | 7B | 32K | Efficient, strong for its size | Apache 2.0 |
| Mixtral 8x22B | 141B (active: 39B) | 64K | MoE architecture, excellent reasoning | Apache 2.0 |
| Qwen2.5 72B | 72B | 128K | Strong multilingual, competitive with Llama 70B | Apache 2.0 |
| Phi-3 Medium | 14B | 128K | Microsoft, strong reasoning for size | MIT |
| DeepSeek V3 | 671B (active: 37B) | 128K | MoE, strong coding and math | DeepSeek |
| CodeLlama 70B | 70B | 100K | Specialized for code generation | Llama 2 Community |
Model selection by use case
| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| General assistant (budget) | Llama 3.1 8B | Best quality-per-VRAM at small size |
| General assistant (quality) | Llama 3.1 70B | Approaches GPT-4 quality |
| Code generation | DeepSeek Coder or CodeLlama | Specialized training data |
| Multilingual | Qwen2.5 72B | Strong across many languages |
| Function calling | Llama 3.1 70B or Mistral Large | Trained for tool use |
| Long documents | Llama 3.1 (128K context) | Native long context support |
| Edge/mobile deployment | Phi-3 Mini (3.8B) | Small footprint, strong quality |
| Maximum quality | Llama 3.1 405B | Best open-source overall |
Deployment Step-by-Step
Option A: Ollama (simplest path)
Step 1: Provision a server with a GPU.
```bash
# AWS example: g5.xlarge (1x A10G, 24GB VRAM, ~$1.01/hr on-demand)
# Or any machine with an NVIDIA GPU and 24GB+ VRAM
```
Step 2: Install Ollama.
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Step 3: Pull your model.
```bash
ollama pull llama3.1:8b
```
Step 4: Create a custom Modelfile (optional, for system prompts and parameters).
```
FROM llama3.1:8b

SYSTEM "You are a helpful customer support agent for Acme Corp. Answer questions about our products using the provided context. If you don't know, say so."

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
```

```bash
ollama create acme-support -f Modelfile
ollama run acme-support
```
Step 5: Set up as a systemd service for production.
```ini
[Unit]
Description=Ollama LLM Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama serve
Restart=always
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_NUM_PARALLEL=4"

[Install]
WantedBy=multi-user.target
```
Option B: vLLM (maximum throughput)
Step 1: Provision a GPU server (A100 or H100 recommended).
Step 2: Install vLLM.
```bash
pip install vllm
```
Step 3: Download the model from Hugging Face.
```bash
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
```
Step 4: Launch the server.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```
Step 5: Put behind a reverse proxy (nginx or Caddy) with TLS and authentication.
Fine-Tuning On-Premise
Fine-tuning adapts a pre-trained model to your specific domain, terminology, or output format. With self-hosted models, you can fine-tune entirely on your own infrastructure.
When to fine-tune
| Scenario | Fine-Tune? |
|----------|------------|
| Domain-specific terminology (medical, legal, financial) | Yes |
| Consistent output format that prompting cannot achieve | Yes |
| Company-specific tone or style | Yes |
| Need to reduce prompt length (lower inference cost) | Yes |
| General knowledge questions | No (use RAG instead) |
| Data changes frequently | No (RAG is more flexible) |
| Limited training data (fewer than 50 examples) | No (improve prompts first) |
Fine-tuning approaches
| Method | VRAM Required | Training Data | Training Time | Quality |
|--------|---------------|---------------|---------------|---------|
| LoRA (Low-Rank Adaptation) | ~1.5x model FP16 VRAM | 100–10,000 examples | Hours | Good |
| QLoRA | ~0.4–0.5x model FP16 VRAM (4-bit base weights) | 100–10,000 examples | Hours | Good (slightly less than LoRA) |
| Full fine-tuning | 3–4x model FP16 VRAM | 1,000–100,000 examples | Days | Best |
QLoRA example with Unsloth
```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit (QLoRA) from Unsloth's pre-quantized checkpoint
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projection layers
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
)

from trl import SFTTrainer
from transformers import TrainingArguments

# `dataset` is a HuggingFace Dataset of your prepared training examples
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=200,
        learning_rate=2e-4,
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()

# Merge the LoRA adapters into the base weights and save for serving
model.save_pretrained_merged("my-fine-tuned-model", tokenizer)
```
For production fine-tuning projects, explore our LLM fine-tuning services.
Performance Optimization
1. Quantization
Quantization reduces model precision from FP16 (16-bit) to INT8 or INT4, cutting memory usage by 2–4x with modest quality loss.
| Quantization | Memory Reduction | Quality Impact | Speed Impact |
|--------------|------------------|----------------|--------------|
| FP16 (baseline) | 1x | None | Baseline |
| INT8 (bitsandbytes) | 2x | Minimal (under 1% loss) | ~Same speed |
| INT4 GPTQ | 4x | Small (1–3% loss) | Faster (less memory bandwidth) |
| INT4 AWQ | 4x | Small (1–2% loss) | Faster |
| GGUF Q4_K_M | ~4x | Small (1–3% loss) | Good (CPU + GPU) |
2. KV cache optimization
The key-value (KV) cache stores attention states for previously processed tokens. For long contexts it can consume as much VRAM as the model weights themselves. Use PagedAttention (vLLM) or cap the context window conservatively (e.g. Ollama's num_ctx) to keep it in check.
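The cache size can be computed directly: keys and values for every layer, KV head, and token. The sketch below plugs in Llama 3.1 8B's published shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128); swap in your model's numbers:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: keys + values (factor of 2) for every
    layer, stored at FP16 (2 bytes per element) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
print(kv_cache_gib(32, 8, 128, 8192))  # 1.0 GiB per 8K-token sequence
```

At a batch of 16 concurrent 8K-token sequences, that is 16 GiB of cache, which is exactly the kind of pressure PagedAttention was built to manage.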
3. Batching
Serving multiple requests simultaneously is far more efficient than processing them sequentially. vLLM's continuous batching can increase throughput by 2–5x compared to naive sequential inference.
4. Speculative decoding
Use a smaller "draft" model to predict multiple tokens, then verify with the main model in a single forward pass. This can increase generation speed by 2–3x for large models.
Cost Comparison: Self-Hosted vs API
Monthly cost comparison (10,000 requests/day, ~3,000 tokens each)
| Approach | Model | Monthly Cost | Latency (p50) |
|----------|-------|--------------|---------------|
| OpenAI API | GPT-4o | ~$22,500 | 1–2s |
| OpenAI API | GPT-4o-mini | ~$1,350 | 0.5–1s |
| Anthropic API | Claude 3.5 Sonnet | ~$27,000 | 1–2s |
| Self-hosted (cloud GPU) | Llama 3.1 8B (1x A10G) | ~$750 | 0.3–0.8s |
| Self-hosted (cloud GPU) | Llama 3.1 70B (1x A100) | ~$2,200 | 0.5–1.5s |
| Self-hosted (on-prem) | Llama 3.1 8B (1x RTX 4090) | ~$100 (electricity) | 0.3–0.8s |
| Self-hosted (on-prem) | Llama 3.1 70B (2x A100) | ~$200 (electricity) | 0.5–1.5s |
On-premise hardware has a high upfront cost but extremely low operating cost. Cloud GPU instances offer a middle ground. API providers are the most expensive per-request but require zero infrastructure management.
Break-even analysis
| Setup | Upfront Cost | Monthly Operating Cost | Break-Even vs GPT-4o-mini API (at 10K req/day) |
|-------|--------------|------------------------|------------------------------------------------|
| 1x RTX 4090 (on-prem) | $3,000 | ~$100 | ~2.4 months |
| 1x A10G (cloud) | $0 | ~$750 | Immediately cheaper |
| 1x A100 (cloud, for 70B) | $0 | ~$2,200 | Immediately cheaper vs GPT-4o |
| 2x A100 (on-prem) | $30,000 | ~$200 | ~1.5 months vs GPT-4o |
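Break-even is upfront cost divided by monthly savings. A minimal helper, using the table's figures ($3,000 RTX 4090 build, ~$100/month electricity, ~$1,350/month GPT-4o-mini bill at this volume):

```python
def break_even_months(upfront_usd, self_hosted_monthly, api_monthly):
    """Months until cumulative API spend exceeds upfront hardware
    plus the self-hosted running cost."""
    monthly_saving = api_monthly - self_hosted_monthly
    if monthly_saving <= 0:
        return float("inf")  # self-hosting never pays back at this volume
    return upfront_usd / monthly_saving

# RTX 4090 on-prem vs GPT-4o-mini API at 10K req/day (table values)
print(round(break_even_months(3000, 100, 1350), 1))  # 2.4 months
```

The same function shows the cloud-GPU rows immediately: with zero upfront cost, any positive monthly saving means a break-even of zero months.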
Use our LLM Cost Calculator to model the cost comparison for your specific usage patterns.
When NOT to Self-Host
Self-hosting is not always the right choice:
- Low volume (under 1,000 requests/day) — API costs are minimal; infrastructure overhead is not worth it
- No GPU expertise — Managing GPU servers requires specific knowledge
- Need GPT-4o quality — Open-source models are good but still behind GPT-4o on complex reasoning
- Rapid experimentation — APIs let you switch models instantly; self-hosting requires redeployment
- Compliance with cloud-first mandates — Some organizations require managed services for auditability
Next Steps
Self-hosting LLMs is a spectrum. You can start with Ollama on a single GPU for development, graduate to vLLM on a cloud GPU for production, and eventually move to on-premise hardware for maximum cost efficiency and control.
For help with self-hosted LLM deployment, fine-tuning, or building applications on top of open-source models, explore our AI development services and LLM fine-tuning services. To compare self-hosted costs against API providers for your specific use case, try our LLM Cost Calculator.