Self-Hosted LLMs: How to Run Llama, Mistral & Open-Source Models On-Premise
By the ZTABS Team
Running your own large language models on-premise gives you full control over data privacy, latency, costs, and customization. With open-source models like Llama 3.1, Mistral, and Qwen matching or exceeding GPT-3.5-level performance, self-hosting has become a viable production strategy — not just an experiment.
This guide covers everything you need to deploy and run open-source LLMs on your own infrastructure: why to self-host, hardware requirements, deployment frameworks, model selection, fine-tuning, performance optimization, and a detailed cost comparison against API providers.
Why Self-Host an LLM?
Data privacy and compliance
When you use OpenAI, Anthropic, or Google APIs, your data leaves your infrastructure. For companies in healthcare, finance, legal, government, or any industry with strict data handling requirements, this is often a non-starter. Self-hosting keeps all data within your network perimeter.
| Concern | API Provider | Self-Hosted |
|---------|--------------|-------------|
| Data leaves your network | Yes | No |
| Vendor has access to prompts | Potentially (varies by provider) | No |
| HIPAA compliance | Requires BAA + careful configuration | Full control |
| GDPR data residency | Depends on provider regions | You choose the location |
| SOC 2 audit trail | Limited visibility | Full logging |
| Air-gapped deployment | Not possible | Fully supported |
Cost at scale
API pricing is usage-based. At low volume, APIs are cheaper. At high volume, self-hosting wins dramatically.
| Daily Requests | GPT-4o-mini (API) Monthly Cost | Self-Hosted Llama 3.1 8B Monthly Cost |
|----------------|--------------------------------|----------------------------------------|
| 1,000 | ~$45 | $200–$500 (GPU server) |
| 10,000 | ~$450 | $200–$500 |
| 50,000 | ~$2,250 | $200–$500 |
| 100,000 | ~$4,500 | $500–$1,000 |
| 500,000 | ~$22,500 | $1,000–$3,000 |
The crossover point where self-hosting becomes cheaper than API calls typically occurs between 5,000 and 20,000 requests per day, depending on the model size and hardware choice.
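The crossover is simple arithmetic, so it is worth sanity-checking against your own numbers. The sketch below assumes a blended price of $0.50 per million tokens (roughly what the GPT-4o-mini figures above imply across input and output) and a fixed monthly server cost; both are illustrative placeholders, not quotes:

```python
def monthly_api_cost(req_per_day, tokens_per_req=3_000, usd_per_mtok=0.50):
    """Approximate monthly API spend at a blended per-million-token price."""
    return req_per_day * 30 * tokens_per_req * usd_per_mtok / 1e6

def crossover_req_per_day(server_monthly_usd, tokens_per_req=3_000, usd_per_mtok=0.50):
    """Daily request volume at which a fixed-cost GPU server matches API spend."""
    return server_monthly_usd * 1e6 / (30 * tokens_per_req * usd_per_mtok)

print(monthly_api_cost(10_000))           # 450.0 — matches the table above
print(round(crossover_req_per_day(350)))  # 7778 requests/day at a $350/mo server
```

At a $350/month GPU server, the break-even lands near 8,000 requests/day, comfortably inside the 5,000–20,000 range quoted above; heavier traffic or a cheaper server pulls it lower.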
Latency
Self-hosted models eliminate network round-trips. For real-time applications — code completion, in-app suggestions, interactive chat — this can cut latency in half.
| Metric | API (GPT-4o-mini) | Self-Hosted (Llama 3.1 8B on A100) |
|--------|-------------------|-------------------------------------|
| Time to first token | 200–500ms | 30–100ms |
| Tokens per second | 80–120 | 100–200 |
| Total latency (100 tokens) | 1–2s | 0.5–1s |
| Availability | 99.9% (OpenAI SLA) | Depends on your infra |
Customization
Self-hosting unlocks capabilities that API providers do not offer:
- Fine-tuning with your proprietary data on your own terms
- Custom tokenizers for domain-specific vocabulary
- Modified inference parameters beyond what APIs expose
- Model merging to combine strengths of multiple models
- Speculative decoding and other advanced inference techniques
Hardware Requirements
GPU comparison for LLM inference
The GPU is the most critical (and expensive) component. LLM inference requires large amounts of GPU memory (VRAM) to hold the model weights.
| GPU | VRAM | FP16 Performance | New Price (est.) | Used/Cloud Hourly | Best For |
|-----|------|------------------|------------------|-------------------|----------|
| NVIDIA A100 80GB | 80GB | 312 TFLOPS | $15,000 | $1.50–$3.00/hr | Production workloads, large models |
| NVIDIA H100 80GB | 80GB | 989 TFLOPS | $30,000 | $2.50–$5.00/hr | Maximum throughput, fine-tuning |
| NVIDIA A10G | 24GB | 125 TFLOPS | $3,500 | $0.50–$1.00/hr | Small to medium models, cost-effective |
| NVIDIA L4 | 24GB | 121 TFLOPS | $2,500 | $0.30–$0.80/hr | Inference-optimized, power-efficient |
| NVIDIA RTX 4090 | 24GB | 165 TFLOPS | $1,600 | N/A (consumer) | Development, small-scale production |
| NVIDIA RTX 3090 | 24GB | 71 TFLOPS | $800 (used) | N/A (consumer) | Budget development |
| Apple M2 Ultra | 192GB unified | ~32 TFLOPS | $5,000 (Mac Studio) | N/A | Large models without NVIDIA |
VRAM requirements by model size
| Model Size | FP16 VRAM | INT8 VRAM | INT4 (GPTQ/AWQ) VRAM | Example Models |
|------------|-----------|-----------|----------------------|----------------|
| 1–3B | 4–6GB | 2–3GB | 1–2GB | Phi-3 Mini, Qwen2 1.5B |
| 7–8B | 14–16GB | 7–8GB | 4–5GB | Llama 3.1 8B, Mistral 7B |
| 13–14B | 26–28GB | 13–14GB | 7–8GB | Llama 2 13B, Qwen 14B |
| 34–35B | 68–70GB | 34–35GB | 18–20GB | CodeLlama 34B, Yi 34B |
| 70–72B | 140GB+ | 70GB+ | 36–40GB | Llama 3.1 70B, Qwen2 72B |
| 405B | 810GB+ | 405GB+ | 200GB+ | Llama 3.1 405B |
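These figures follow from simple arithmetic: each parameter stores one weight at the chosen precision, so weight memory is parameters times bits divided by 8. A back-of-envelope estimator (weights only; KV cache and activations add several GB on top, which is why the table's ranges run slightly high):

```python
def weight_vram_gb(params_billion, bits_per_weight):
    """Weight memory only: parameters x bits / 8, in GB.
    KV cache and activation memory come on top of this."""
    return params_billion * bits_per_weight / 8

print(weight_vram_gb(8, 16))  # 16.0 — Llama 3.1 8B at FP16
print(weight_vram_gb(70, 4))  # 35.0 — Llama 3.1 70B at INT4
```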
Recommended configurations
| Use Case | GPU Setup | Budget (Hardware Only) | Can Run |
|----------|-----------|------------------------|---------|
| Development/prototyping | 1x RTX 4090 (24GB) | $1,600 | 7–8B FP16, 13B INT4 |
| Small production | 1x A10G or L4 (24GB) | $2,500–$3,500 | 7–8B FP16, 13B INT4 |
| Medium production | 1x A100 80GB | $15,000 | 70B INT4, 34B FP16 |
| Large production | 2x A100 80GB | $30,000 | 70B FP16 |
| Maximum scale | 4x H100 80GB | $120,000 | 405B INT4 |
Deployment Frameworks
Ollama
Ollama is the simplest way to run LLMs locally. It packages models with their runtime into a single binary, similar to how Docker packages applications.
Best for: Development, prototyping, small-scale production, desktop deployment.
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b

# Use the API
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Explain Kubernetes in 3 sentences."}],
  "stream": true
}'
```
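With `"stream": true`, Ollama returns newline-delimited JSON rather than a single response: one object per line, each carrying a `message.content` fragment, ending with `"done": true`. A minimal sketch of reassembling the text (the sample chunks below are fabricated for illustration):

```python
import json

def collect_stream(ndjson_lines):
    """Reassemble assistant text from Ollama's streaming /api/chat
    response: one JSON object per line, ending with {"done": true}."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        parts.append(chunk["message"]["content"])
    return "".join(parts)

sample = [
    '{"message": {"role": "assistant", "content": "Kubernetes "}, "done": false}',
    '{"message": {"role": "assistant", "content": "orchestrates containers."}, "done": false}',
    '{"done": true}',
]
print(collect_stream(sample))  # Kubernetes orchestrates containers.
```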
OpenAI-compatible API: Ollama exposes an API that is compatible with the OpenAI SDK, making migration from API to self-hosted straightforward.
```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',
});

const response = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
});
```
| Pros | Cons |
|------|------|
| Easiest setup (single command) | Lower throughput than vLLM/TGI |
| Built-in model management | Limited batching capabilities |
| OpenAI-compatible API | No tensor parallelism (multi-GPU limited to layer offload) |
| Works on Mac (Apple Silicon), Linux, Windows | Less tuning control |
| Supports GGUF quantized models | Not optimized for high-concurrency production |
vLLM
vLLM is a high-throughput inference engine built for production serving. Its key innovation is PagedAttention, which manages GPU memory like an operating system manages RAM — dramatically improving throughput.
Best for: High-throughput production serving, multi-GPU setups, batch processing.
```bash
# Install
pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --tensor-parallel-size 1
```
For multi-GPU serving of a 70B model:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```
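Because vLLM speaks the OpenAI chat-completions protocol, any OpenAI client can talk to it. Here is a stdlib-only sketch assuming the server above is listening locally on port 8000; the payload builder is split out so the request shape can be inspected without a running server:

```python
import json
import urllib.request

def build_chat_payload(model, user_msg, max_tokens=256):
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, user_msg):
    """Send one chat turn to a vLLM (or other OpenAI-compatible) server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, user_msg)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
```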
| Pros | Cons |
|------|------|
| Highest throughput (2–4x over naive serving) | More complex setup |
| PagedAttention for efficient memory | Linux + NVIDIA only |
| Continuous batching | Heavier resource requirements |
| Multi-GPU tensor parallelism | Requires HuggingFace model format |
| OpenAI-compatible API | Steeper learning curve |
| Speculative decoding support | |
Text Generation Inference (TGI)
TGI is Hugging Face's production inference server. It integrates tightly with the Hugging Face ecosystem and supports many optimization techniques out of the box.
Best for: Teams already in the Hugging Face ecosystem, Docker-based deployments.
```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize gptq \
  --max-input-length 4096 \
  --max-total-tokens 8192
```

Note that `--quantize gptq` expects a GPTQ-quantized checkpoint; for FP16 weights like the model ID above, quantize on the fly with `eetq` or `bitsandbytes` instead, or drop the flag entirely.
| Pros | Cons |
|------|------|
| Docker-native deployment | Slightly lower throughput than vLLM |
| Built-in quantization (GPTQ, AWQ, EETQ) | Fewer configuration options |
| Flash Attention support | HuggingFace-centric |
| Token streaming | Less community momentum than vLLM |
| Production-ready logging and metrics | |
Framework comparison
| Feature | Ollama | vLLM | TGI |
|---------|--------|------|-----|
| Setup difficulty | Very easy | Medium | Easy (Docker) |
| Throughput (tokens/sec) | Good | Excellent | Very good |
| Multi-GPU | Limited (layer offload only) | Yes | Yes |
| Quantization support | GGUF | GPTQ, AWQ, FP8 | GPTQ, AWQ, EETQ |
| Batching | Basic | Continuous (PagedAttention) | Continuous |
| OpenAI-compatible API | Yes | Yes | Yes (Messages API) |
| Platform support | Mac, Linux, Windows | Linux (NVIDIA) | Linux (NVIDIA) |
| Production readiness | Small scale | High scale | High scale |
Model Selection Guide
Top open-source models in 2026
| Model | Parameters | Context Length | Strengths | License |
|-------|------------|----------------|-----------|---------|
| Llama 3.1 8B | 8B | 128K | Best quality at small size, multilingual | Llama 3.1 Community |
| Llama 3.1 70B | 70B | 128K | Approaches GPT-4-level, strong reasoning | Llama 3.1 Community |
| Llama 3.1 405B | 405B | 128K | Best open-source model overall | Llama 3.1 Community |
| Mistral 7B | 7B | 32K | Efficient, strong for its size | Apache 2.0 |
| Mixtral 8x22B | 141B (active: 39B) | 64K | MoE architecture, excellent reasoning | Apache 2.0 |
| Qwen2.5 72B | 72B | 128K | Strong multilingual, competitive with Llama 70B | Apache 2.0 |
| Phi-3 Medium | 14B | 128K | Microsoft, strong reasoning for size | MIT |
| DeepSeek V3 | 671B (active: 37B) | 128K | MoE, strong coding and math | DeepSeek |
| CodeLlama 70B | 70B | 100K | Specialized for code generation | Llama 2 Community |
Model selection by use case
| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| General assistant (budget) | Llama 3.1 8B | Best quality-per-VRAM at small size |
| General assistant (quality) | Llama 3.1 70B | Approaches GPT-4 quality |
| Code generation | DeepSeek Coder or CodeLlama | Specialized training data |
| Multilingual | Qwen2.5 72B | Strong across many languages |
| Function calling | Llama 3.1 70B or Mistral Large | Trained for tool use |
| Long documents | Llama 3.1 (128K context) | Native long context support |
| Edge/mobile deployment | Phi-3 Mini (3.8B) | Small footprint, strong quality |
| Maximum quality | Llama 3.1 405B | Best open-source overall |
Deployment Step-by-Step
Option A: Ollama (simplest path)
Step 1: Provision a server with a GPU.
```bash
# AWS example: g5.xlarge (1x A10G, 24GB VRAM, ~$1.01/hr on-demand)
# Or any machine with an NVIDIA GPU and 24GB+ VRAM
```
Step 2: Install Ollama.
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Step 3: Pull your model.
```bash
ollama pull llama3.1:8b
```
Step 4: Create a custom Modelfile (optional, for system prompts and parameters).
```
FROM llama3.1:8b

SYSTEM "You are a helpful customer support agent for Acme Corp. Answer questions about our products using the provided context. If you don't know, say so."

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
```

```bash
ollama create acme-support -f Modelfile
ollama run acme-support
```
Step 5: Set up as a systemd service for production.
```ini
[Unit]
Description=Ollama LLM Service
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama serve
Restart=always
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_NUM_PARALLEL=4"

[Install]
WantedBy=multi-user.target
```
Option B: vLLM (maximum throughput)
Step 1: Provision a GPU server (A100 or H100 recommended).
Step 2: Install vLLM.
```bash
pip install vllm
```
Step 3: Download the model from Hugging Face.
```bash
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
```
Step 4: Launch the server.
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```
Step 5: Put behind a reverse proxy (nginx or Caddy) with TLS and authentication.
Fine-Tuning On-Premise
Fine-tuning adapts a pre-trained model to your specific domain, terminology, or output format. With self-hosted models, you can fine-tune entirely on your own infrastructure.
When to fine-tune
| Scenario | Fine-Tune? |
|----------|------------|
| Domain-specific terminology (medical, legal, financial) | Yes |
| Consistent output format that prompting cannot achieve | Yes |
| Company-specific tone or style | Yes |
| Need to reduce prompt length (lower inference cost) | Yes |
| General knowledge questions | No (use RAG instead) |
| Data changes frequently | No (RAG is more flexible) |
| Limited training data (fewer than 50 examples) | No (improve prompts first) |
Fine-tuning approaches
| Method | VRAM Required | Training Data | Training Time | Quality |
|--------|---------------|---------------|---------------|---------|
| LoRA (Low-Rank Adaptation) | ~1.5x model FP16 VRAM | 100–10,000 examples | Hours | Good |
| QLoRA | ~0.4–0.5x model FP16 VRAM (4-bit base weights) | 100–10,000 examples | Hours | Good (slightly less than LoRA) |
| Full fine-tuning | 3–4x model FP16 VRAM | 1,000–100,000 examples | Days | Best |
QLoRA example with Unsloth
```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit (QLoRA) from Unsloth's pre-quantized checkpoint
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projection layers
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
)

from trl import SFTTrainer
from transformers import TrainingArguments

# `dataset` is a HuggingFace Dataset of your prepared training examples
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=200,
        learning_rate=2e-4,
        fp16=True,
        output_dir="outputs",
    ),
)
trainer.train()

# Merge the LoRA adapters into the base weights and save for serving
model.save_pretrained_merged("my-fine-tuned-model", tokenizer)
```
For production fine-tuning projects, explore our LLM fine-tuning services.
Performance Optimization
1. Quantization
Quantization reduces model precision from FP16 (16-bit) to INT8 or INT4, cutting memory usage by 2–4x with modest quality loss.
| Quantization | Memory Reduction | Quality Impact | Speed Impact |
|--------------|------------------|----------------|--------------|
| FP16 (baseline) | 1x | None | Baseline |
| INT8 (bitsandbytes) | 2x | Minimal (under 1% loss) | ~Same speed |
| INT4 GPTQ | 4x | Small (1–3% loss) | Faster (less memory bandwidth) |
| INT4 AWQ | 4x | Small (1–2% loss) | Faster |
| GGUF Q4_K_M | ~4x | Small (1–3% loss) | Good (CPU + GPU) |
2. KV cache optimization
The key-value (KV) cache stores attention states for previously processed tokens. For long contexts it can consume as much VRAM as the model weights themselves. Use PagedAttention (vLLM) or cap the context window conservatively (e.g. Ollama's num_ctx) to keep it in check.
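The cache size can be computed directly: keys and values for every layer, KV head, and token. The sketch below plugs in Llama 3.1 8B's published shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128); swap in your model's numbers:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: keys + values (factor of 2) for every
    layer, stored at FP16 (2 bytes per element) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
print(kv_cache_gib(32, 8, 128, 8192))  # 1.0 GiB per 8K-token sequence
```

At a batch of 16 concurrent 8K-token sequences, that is 16 GiB of cache, which is exactly the kind of pressure PagedAttention was built to manage.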
3. Batching
Serving multiple requests simultaneously is far more efficient than processing them sequentially. vLLM's continuous batching can increase throughput by 2–5x compared to naive sequential inference.
4. Speculative decoding
Use a smaller "draft" model to predict multiple tokens, then verify with the main model in a single forward pass. This can increase generation speed by 2–3x for large models.
Cost Comparison: Self-Hosted vs API
Monthly cost comparison (10,000 requests/day, ~3,000 tokens each)
| Approach | Model | Monthly Cost | Latency (p50) |
|----------|-------|--------------|---------------|
| OpenAI API | GPT-4o | ~$22,500 | 1–2s |
| OpenAI API | GPT-4o-mini | ~$1,350 | 0.5–1s |
| Anthropic API | Claude 3.5 Sonnet | ~$27,000 | 1–2s |
| Self-hosted (cloud GPU) | Llama 3.1 8B (1x A10G) | ~$750 | 0.3–0.8s |
| Self-hosted (cloud GPU) | Llama 3.1 70B (1x A100) | ~$2,200 | 0.5–1.5s |
| Self-hosted (on-prem) | Llama 3.1 8B (1x RTX 4090) | ~$100 (electricity) | 0.3–0.8s |
| Self-hosted (on-prem) | Llama 3.1 70B (2x A100) | ~$200 (electricity) | 0.5–1.5s |
On-premise hardware has a high upfront cost but extremely low operating cost. Cloud GPU instances offer a middle ground. API providers are the most expensive per-request but require zero infrastructure management.
Break-even analysis
| Setup | Upfront Cost | Monthly Operating Cost | Break-Even vs GPT-4o-mini API (at 10K req/day) |
|-------|--------------|------------------------|------------------------------------------------|
| 1x RTX 4090 (on-prem) | $3,000 | ~$100 | ~2.4 months |
| 1x A10G (cloud) | $0 | ~$750 | Immediately cheaper |
| 1x A100 (cloud, for 70B) | $0 | ~$2,200 | Immediately cheaper vs GPT-4o |
| 2x A100 (on-prem) | $30,000 | ~$200 | ~1.5 months vs GPT-4o |
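Break-even is upfront cost divided by monthly savings. A minimal helper, using the table's figures ($3,000 RTX 4090 build, ~$100/month electricity, ~$1,350/month GPT-4o-mini bill at this volume):

```python
def break_even_months(upfront_usd, self_hosted_monthly, api_monthly):
    """Months until cumulative API spend exceeds upfront hardware
    plus the self-hosted running cost."""
    monthly_saving = api_monthly - self_hosted_monthly
    if monthly_saving <= 0:
        return float("inf")  # self-hosting never pays back at this volume
    return upfront_usd / monthly_saving

# RTX 4090 on-prem vs GPT-4o-mini API at 10K req/day (table values)
print(round(break_even_months(3000, 100, 1350), 1))  # 2.4 months
```

The same function shows the cloud-GPU rows immediately: with zero upfront cost, any positive monthly saving means a break-even of zero months.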
Use our LLM Cost Calculator to model the cost comparison for your specific usage patterns.
When NOT to Self-Host
Self-hosting is not always the right choice:
- Low volume (under 1,000 requests/day) — API costs are minimal; infrastructure overhead is not worth it
- No GPU expertise — Managing GPU servers requires specific knowledge
- Need GPT-4o quality — Open-source models are good but still behind GPT-4o on complex reasoning
- Rapid experimentation — APIs let you switch models instantly; self-hosting requires redeployment
- Compliance with cloud-first mandates — Some organizations require managed services for auditability
Next Steps
Self-hosting LLMs is a spectrum. You can start with Ollama on a single GPU for development, graduate to vLLM on a cloud GPU for production, and eventually move to on-premise hardware for maximum cost efficiency and control.
For help with self-hosted LLM deployment, fine-tuning, or building applications on top of open-source models, explore our AI development services and LLM fine-tuning services. To compare self-hosted costs against API providers for your specific use case, try our LLM Cost Calculator.