Local inference
Runtimes optimized for running LLMs on your own hardware.
- Ollama
- vLLM
- llama.cpp
- LocalAI
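As an illustration of what "local" means in practice, here is a minimal sketch of a request to an Ollama server on the same machine. It assumes Ollama's default address (`localhost:11434`) and uses `llama3` as an example model name; the helper names `build_request` and `generate` are hypothetical, not part of any library.

```python
import json
import urllib.request

# Assumptions for illustration: Ollama's default local address and an example model name.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"


def build_request(prompt: str, model: str = MODEL, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generation request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a running Ollama instance with the model already pulled):
# print(generate("Summarize our data retention policy in one sentence."))
```

The request never leaves the machine: the prompt and the answer travel only over the loopback interface.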
Private AI Installations
Local inference, RAG, agents, vector databases, observability. Software selection, hardware sizing, installation, hardening, maintenance. The model runs where you decide, on data that stays yours.
Where the model runs is your decision, not the provider's.
Stack
Not a single package. A targeted combination of mature open-source components, chosen based on data, constraints, and expected load.
- Runtimes optimized for running LLMs on your own hardware.
- UIs for end users and teams to interact with local models.
- Retrieval-Augmented Generation to query documentation, knowledge bases, and archives.
- AI agents that operate within controlled environments, workflows, and data.
- Semantic indices for RAG, search, and similarity matching.
- Traceability of prompts, responses, latency, cost, and quality drift.
- Containerization, orchestration, private networking, and lifecycle management.
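To make the retrieval and semantic-index components more concrete, here is a toy sketch of the idea behind them: documents are turned into vectors, the closest ones to a question are retrieved, and they are placed in the prompt with their ids so the answer can cite its sources. This is a simplification under stated assumptions: word counts stand in for a real embedding model, an in-memory list stands in for a vector database, and the names `TinyIndex` and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.items = []  # list of (doc_id, text, vector)

    def add(self, doc_id: str, text: str) -> None:
        self.items.append((doc_id, text, embed(text)))

    def search(self, query: str, k: int = 2):
        """Return the k most similar documents as (score, doc_id, text)."""
        q = embed(query)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, text, vec in self.items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]


def build_prompt(question: str, index: TinyIndex) -> str:
    """Assemble a prompt that carries the retrieved sources and their ids."""
    hits = index.search(question)
    context = "\n".join(f"[{doc_id}] {text}" for _, doc_id, text in hits)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n\nQuestion: {question}"
    )
```

A production setup would swap the word counts for a proper embedding model and the list for a dedicated vector store, but the flow stays the same: embed, retrieve, then generate with the sources attached.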
Deployment
The "where" is not secondary. It's a data governance choice that precedes every other architectural decision.
- Workstations, office servers, corporate datacenters. Your hardware, full data control, no data ever leaves the perimeter.
- Hetzner, OVH, Scaleway, and EU-sovereign providers. Data in Europe, clear contracts, predictable cost, GDPR and NIS2 compliance.
- Heavy compute on-premise, ancillary services in the cloud. The best of both worlds: controlled capex, opportunistic scaling.
- Intel NUC, mini-PCs, ARM servers. Inference at the edge: per device, in branch offices, in constrained or offline contexts.
Three starting configurations
These are not rigid offerings: they are coherent starting points, calibrated to the three most common scenarios. From there, everything else is sized against real data, workload, and constraints.
01 · Starter
For small teams that want an internal ChatGPT, without sending conversations to external providers.
02 · Department
For departments that want to query their own documents with answers traceable back to sources.
03 · Production
For systems that enter real operational processes: SLAs, recovery, structured maintenance.
Exact sizing (hardware, model, perimeter) is set after the initial scoping, not before.
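As a rough idea of the arithmetic behind hardware sizing, the sketch below estimates the memory needed just to hold a model's weights: number of parameters times bytes per parameter. The `overhead` factor for context cache and runtime buffers is a hypothetical placeholder, not a measured value, and real sizing also depends on context length, concurrency, and the chosen runtime.

```python
def weights_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    # (params_billion * 1e9 weights) * (bits / 8 bytes per weight) / 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def estimated_memory_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Weights plus a hypothetical overhead factor for cache and runtime buffers."""
    return weights_memory_gb(params_billion, bits_per_weight) * overhead


# A 7-billion-parameter model:
print(weights_memory_gb(7, 16))  # 16-bit weights -> 14.0
print(weights_memory_gb(7, 4))   # 4-bit quantized weights -> 3.5
```

The same model can therefore call for very different hardware depending on the precision it runs at, which is one reason sizing follows scoping rather than preceding it.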
Output
Why not install it yourself
You open the browser, download the binary, and it runs. At that point you think you're done. In reality, you're just getting started.
The model is one variable. The environment that hosts it is the rest.
Read more
Every component of the stack has a dedicated page: how it works, what it does for the business, when it fits, how much it costs to install. Written for the decision-maker, not the technician.
The initial assessment clarifies use case, data, constraints, available or required hardware, and delivery path.