My Local LLM Stack: Running AI Without the Cloud
I run language models, speech-to-text, embeddings, vector search and RAG entirely on hardware I own. Here's the exact setup, what it costs and what it can do.
Why I run it locally?
I work with sensitive documents. Contracts, financial records, personal notes. Sending those to OpenAI’s API or Google’s Gemini means they touch servers I don’t control, under terms that can change quarterly. For personal use and for clients in regulated industries, that’s unacceptable.
But I don’t want to give up the capabilities. Chat with my documents, transcription, semantic search — I want all of it.
So I built a local stack that does it.
The hardware
The AI inference stack runs on a single Windows PC with an Nvidia RTX 4060 (8GB VRAM). That’s a $300 consumer gaming GPU — not a datacenter card, not a cluster.
The vector database and orchestration run on a separate TrueNAS box (Intel N95, 16GB RAM). Two machines, both under $1,000 total.
AI Processing (Windows PC)
GPU: NVIDIA RTX 4060 (8GB VRAM)
RAM: 32GB DDR5
Role: LLM inference, embeddings, speech-to-text
Services: LM Studio, WhisperX
Storage and Orchestration (TrueNAS)
CPU: Intel N95
RAM: 16GB
Role: Vector database, RAG, document processing, automation
Services: Qdrant, AnythingLLM, n8n
The stack
LM Studio (Local LLM Inference)
LM Studio runs on the Windows PC and serves models through an OpenAI-compatible API. Any tool that speaks the OpenAI API format can connect — n8n, AnythingLLM, custom scripts, other AI tools.
I run Qwen3.5-9B for general tasks. It fits comfortably in 8GB VRAM with int8 quantization and responds in 1-3 seconds. For embeddings I use nomic-embed-text v2, a 768-dimensional model that works well for semantic search.
You don’t need a 70B parameter model for most tasks. An 8B model handles summarization, classification, question-answering and structured extraction well enough for daily use. Save the big models for tricky or ambiguous questions.
WhisperX (speech-to-text transcription)
WhisperX runs locally via a Docker container with CUDA acceleration. Upload any audio file, get accurate transcription with timestamps and speaker diarization. I built SolScribe on top of this — a full transcription management platform with speaker diarization, AI chat and webhook automation.
Performance: a 30-minute recording transcribes in about 2 minutes on the RTX 4060 using the “base” model. The “large” model is slower but handles accented speech and technical jargon better.
Qdrant (vector database)
Qdrant stores embeddings for semantic search. Unlike keyword search, vector search finds results by meaning. Searching for "Docker networking issues" will find a note titled "Container can't reach host network" even though the words don't match.
I maintain three collections: documents (Paperless-ngx content), pkm-knowledge (Obsidian notes), and bookstack_embeddings (wiki pages). New content gets embedded automatically through n8n workflows.
AnythingLLM (RAG chat)
AnythingLLM connects to LM Studio for inference and Qdrant for retrieval. Upload documents into workspaces, then chat with them. Ask questions, get answers with citations pointing to the exact source.
I use it to search hundreds of technical documents, find specific clauses in documents and query my personal knowledge base conversationally.
n8n (the glue)
n8n orchestrates everything. Three shared sub-workflows form the foundation:
1. LM Studio Call — a reusable webhook that formats and sends requests to LM Studio, with retry logic and timeout handling.
2. OpenRouter Fallback — tries a cloud LLM first (for tasks that benefit from larger models), falls back to local LM Studio if the cloud is unavailable.
3. Qdrant Embedder — takes text, generates an embedding via nomic-embed-text and upserts it into Qdrant with metadata.
Every other AI workflow in my stack calls one of these three. A new note in Obsidian triggers the embedder. A document uploaded to Paperless triggers the embedder. A voice memo gets transcribed, summarized with LM Studio, then embedded.
What it can do
Document Q&A
Upload a 50-page PDF. Ask "What are the payment terms?" and get an answer in 3 seconds with a citation to page 12, paragraph 3. All local. No API calls. No data leaving the network.
Voice-to-Knowledge Pipeline
Record a voice memo on my phone. SolScribe transcribes it. n8n summarizes the transcript with LM Studio, creates an Obsidian note and embeds it in Qdrant. Under a minute, with zero manual steps.
Semantic Search Across Everything
One search query hits Qdrant and returns ranked results from my documents, notes and wiki — ranked by meaning, not keyword frequency. Finding a half-remembered concept from three months ago takes seconds.
Automated Document Processing
Drop a file in the Paperless-ngx consumption folder. It gets OCR'd, tagged by an LLM that reads the content and suggests categories, embedded for semantic search. A notification hits my phone when it's done.
The cost
GPU: ~$300 (RTX 4060, one-time)
Monthly electricity: ~$8-12 (GPU PC runs when needed)
Cloud API costs: $0
Subscription fees: $0
Equivalent cloud services: $100-200/month
Break-even: ~3 months.
Honest limitations
This isn't a cloud replacement for everyone. Some real constraints:
Model quality: A 9B parameter model is not GPT-4. For complex reasoning, creative writing, or nuanced analysis, cloud models are still better. I use OpenRouter or Claude as a fallback for tasks that need it.
Context window: Local models typically handle 4K-8K context well. Larger contexts need more VRAM or quantization tricks.
Concurrent users: This is a single-user setup. Serving a team would need a beefier GPU or multiple inference servers.
Maintenance: Models need updating. Docker containers need monitoring. It's not zero-maintenance, so budget 2-3 hours per week.
But for personal use, sensitive work, and daily AI-assisted operations? Local inference is not just viable, it's preferable.
Getting Started
If you want to build a similar stack, start small:
Install LM Studio (or Ollama) on any machine with an NVIDIA GPU. Download a small model (Phi-3 Mini or Qwen 2.5 3B). Play with the chat interface.
Set up Qdrant with a single Docker command. Create a collection. Try embedding and searching a few documents.
Add n8n and build your first automation: webhook → LM Studio → notification. Once that works, everything else is iteration.
You can run AI privately right now on a regular PC. The hardware is cheap. The software is free. The rest is preference.

