Private AI Installations

I configure your private AI stack. At home, on company premises, or in your cloud.

Local inference, RAG, agents, vector databases, observability. Software selection, hardware sizing, installation, hardening, maintenance. The model runs where you decide, on data that stays yours.

Where the model runs is your decision, not the provider's.

Request an assessment See the concept

Stack

Software, curated and integrated for your context.

Not a single package. A targeted combination of mature open-source components, chosen based on data, constraints, and expected load.

Local inference

Runtimes optimized for running LLMs on your own hardware.

Ollama
vLLM
llama.cpp
LocalAI

Conversational interfaces

UIs to interact with local models — end users and teams.

Open WebUI
AnythingLLM
LM Studio
Text Generation WebUI

RAG and knowledge

Retrieval-Augmented Generation to query documentation, knowledge bases, archives.

PrivateGPT
AnythingLLM
Khoj
Continue.dev
Danswer / Onyx

Agents and automation

AI agents that operate on controlled environments, flows, and data.

Dify
Flowise
Langflow
n8n

Vector database

Semantic indices for RAG, search, similarity matching.

Qdrant
Chroma
Weaviate
Milvus
pgvector

Observability

Traceability of prompts, responses, latency, cost, and quality drift.

Langfuse
Phoenix (Arize)

Infrastructure

Containerization, orchestration, private networking, and lifecycle management.

Docker / Docker Compose
K3s / Kubernetes
Tailscale / Headscale
Portainer
Coolify

Deployment

At home, in the office, on company servers, or in your private cloud.

The "where" is not secondary. It's a data governance choice that precedes every other architectural decision.

On-premise

Workstations, office servers, corporate datacenters. Your hardware, full data control, no data ever leaves the perimeter.

European private cloud

Hetzner, OVH, Scaleway, and EU-sovereign providers. Data in Europe, clear contracts, predictable cost, GDPR and NIS2 compliance.

Hybrid

Heavy compute on-premise, ancillary services in cloud. The best of both worlds: controlled capex, opportunistic scaling.

Edge

Intel NUC, mini-PCs, ARM servers. Inference at the edge — per-device, branch offices, constrained or offline contexts.

Three starting configurations

Where to begin: three packages, each with a specific use case.

These are not rigid offerings: they are coherent starting points, calibrated on the three most common scenarios. From there the rest is sized against real data, workload, and constraints.

01 · Starter

Private AI Starter

For small teams that want an internal ChatGPT, without sending conversations to external providers.

OpenWebUI + Ollama on a single server or workstation
User and role management, authenticated access
Local model sized to expected workload
Backup of configurations and user data
Minimum hardening: firewall, TLS access, separated secrets
Operational documentation, knowledge transfer

02 · Department

Private RAG Department

For departments that want to query their own documents with answers traceable back to sources.

Starter stack + AnythingLLM or OpenWebUI with RAG
Vector database (Qdrant or equivalent)
Document ingestion, chunking calibrated to the domain
Source citations — no answers without a reference
Workspace-level permissions, data separation
Observability: query tracing, latency, output quality (Langfuse)

03 · Production

Private AI Production Stack

For systems that enter real operational processes: SLAs, recovery, structured maintenance.

Everything in Department, redesigned for production
Container orchestration (Docker Compose or Kubernetes)
Periodic backups, tested disaster recovery
Dedicated networking, network isolation, audit logs
Coordinated maintenance: runtime, models, CVE patches
Quarterly quality and throughput reporting

Exact sizing (hardware, model, perimeter) is set after the initial scoping, not before.

Output

What lands at your premises is a working system, not a kit to assemble.

What's included

Hardware audit: GPU compatibility, thermal envelope, estimated throughput
Model sizing against use case and budget
Complete installation of the selected stack
Security hardening and network isolation
Backup, restore, and disaster recovery strategy
Monitoring and observability configured
Operational documentation
Knowledge transfer to the internal team

Optional maintenance

Coordinated runtime and model updates
Periodic thermal and throughput health checks
Security patches and CVE management
Tuning for new use cases
Quarterly quality reporting

Why not install it yourself

Installing Ollama is the easy part. The rest is engineering.

You open the browser, download the binary, it runs. And there you think you're done. In reality you're just starting.

What isn't visible at first

GPU thermal and mechanical behavior under sustained load
CUDA driver / runtime version / kernel conflicts
Model selection against context window and real load
Semantic chunking and retrieval strategy for RAG
Network hardening, secret management, audit logs
Backup of vector indices and training data
Updates and silent regressions
Observability of output quality, not just system metrics

What experience brings

Preventive hardware validation, before spending
Stack chosen on real constraints, not on hype
Documented and reproducible configuration
Security designed in, not bolted on
Operability verified under load, not on demo
Predictable maintenance, not emergencies

The model is one variable. The environment that hosts it is the rest.

Want to understand each tool before deciding?

Every component of the stack has a dedicated page: how it works, what it does for the business, when it fits, how much it costs to install. Written for the decision-maker, not the technician.

Explore the tools catalog →

Want a working Private AI system, not an experiment?

The initial assessment clarifies use case, data, constraints, available or required hardware, and delivery path.

Request an assessment