Running Qwen Locally: A Practical Guide to Open-Source AI on Mini PC Hardware

What Ollama, Qwen, and Open WebUI are, how they work together, and what you need to know before recommending local AI to customers.

“Can I run AI on this?” is becoming one of the most common questions IT resellers hear when selling mini PCs. The answer is yes — but the details matter, and getting them wrong will cost you credibility with technically savvy customers.

This article is the technical primer. It walks through the stack — Ollama, Qwen, and Open WebUI — explains what each component does, clarifies the hardware requirements, and gives you the practical knowledge to recommend and deploy local AI on CloudGate hardware with confidence.

The Stack: Three Components, One AI Assistant

Running a local AI chatbot requires three pieces:

1. Ollama — The Runtime

Ollama is an open-source tool that makes it easy to download, manage, and run large language models locally. Think of it as Docker for AI models — you pull a model with a single command (ollama pull qwen2.5:7b-instruct), and Ollama handles loading it into memory, managing the inference, and exposing an API for other tools to connect to.

Ollama runs on Windows, macOS, and Linux. It auto-detects available hardware (CPU, GPU) and optimises accordingly. On a CloudGate running Windows, it will default to CPU inference. On Linux, it can potentially leverage the AMD integrated GPU for faster performance, though this requires configuration.

Installation takes under five minutes. Model download varies by model size — Qwen 2.5 7B is approximately 4.5GB.

2. Qwen 2.5 7B Instruct — The Model

The model is the AI’s “brain” — the trained neural network that generates text responses. Qwen 2.5 is Alibaba’s open-source LLM family, and the 7B Instruct variant is one of the strongest models in its size class.

Why Qwen 2.5 7B specifically:

7 billion parameters — large enough to be genuinely useful, small enough to fit in 16GB RAM
Strong at coding and reasoning — outperforms many larger models from previous generations on technical benchmarks
Good multilingual support — relevant for South Africa’s multilingual environment
Apache 2.0 compatible licence — safe for commercial use, demos, and customer deployments
Q4_K_M quantisation — compressed to ~4.5GB while retaining approximately 95% of full-precision quality. This is the standard quantisation level that balances quality, size, and speed

Other models worth knowing about:

Qwen 2.5 3B — lighter, faster, less capable. Good when you need more headroom for other applications running alongside
Llama 3.2 3B — Meta’s small model. Versatile and well-documented
Phi-3.5 Mini (3.8B) — Microsoft’s small model. Excellent for long-context tasks
Qwen 3 8B — the newest Qwen generation. Slightly larger but improved reasoning

3. Open WebUI — The Interface

Ollama provides the AI engine, but it’s a command-line tool by default. Open WebUI adds a browser-based chat interface that looks and feels like ChatGPT. Users access it through their web browser — Chrome, Edge, Firefox — and interact with the AI through a familiar chat window.

Open WebUI supports conversation history, multiple chat sessions, system prompts (to customise the AI’s behaviour), and multi-user access. It connects to Ollama’s local API, meaning all processing still happens on the CloudGate — Open WebUI just provides the visual layer.

For deployments where the CloudGate serves multiple users on the same network, Open WebUI can be accessed from other devices on the LAN. The AI runs on the CloudGate; users connect from their own browsers.

Hardware Requirements: What Actually Matters

RAM is everything. This is the single most important specification for local AI. The model needs to be loaded into RAM to generate responses. Ollama’s guidelines are clear:

8GB RAM → can run 3B models (basic, limited)
16GB RAM → can run 7B–8B models (the sweet spot for CloudGate)
32GB RAM → can run up to 14B models (significantly more capable)
64GB RAM → can run up to 32B models (approaching cloud-quality responses)

A 7B model in Q4_K_M quantisation uses approximately 5–6GB of RAM. On a 16GB CloudGate, that leaves roughly 10GB for the operating system and other applications. This is workable for a single-user chatbot but doesn’t leave room for running demanding applications alongside the AI.

CPU matters for speed, not capability. Both the CloudGate R7 (Ryzen 7 5825U) and R9 (Ryzen 7 6800H) have 8 cores and 16 threads — more than adequate for AI inference. The R9’s higher clock speed (4.7GHz vs 4.5GHz) and Zen 3+ architecture give it a noticeable speed advantage.

DDR4 vs DDR5 makes a real difference. Memory bandwidth affects how fast the model can generate tokens. The R9’s DDR5 delivers higher bandwidth than the R7’s DDR4, resulting in meaningfully faster response times — the difference between “slow but usable” and “comfortably conversational.”

GPU acceleration is a bonus, not a requirement. Ollama can use AMD integrated graphics for inference acceleration, but on CloudGate hardware, this requires Linux and environment variable configuration (specifically HSA_OVERRIDE_GFX_VERSION). On Windows, CPU inference is the default and works without any GPU configuration. For most customer deployments on Windows, CPU inference is the path of least resistance.

Storage is a non-issue. Models range from 2GB to 40GB+ depending on size. The CloudGate’s 512GB NVMe SSD provides ample space for multiple models plus normal OS and application use.

CloudGate R7 vs R9 for Local AI: The Honest Comparison

	CloudGate R7	CloudGate R9
CPU	Ryzen 7 5825U (up to 4.5GHz)	Ryzen 7 6800H (up to 4.7GHz, Zen 3+)
RAM	16GB DDR4	16GB DDR5
Max Model Size	7B parameters (Q4_K_M)	7B–8B parameters (Q4_K_M)
Expected Speed (CPU)	~5–8 tokens/second	~8–12 tokens/second
With iGPU Accel (Linux)	Limited (Vega, older ROCm)	Better potential (RDNA2)
User Experience	Usable but slower — brief pauses between sentences	Comfortably conversational for single-user chat
Best For	Demos, occasional use, proof of concept	Daily-driver private AI assistant

Both models run the same models at the same quality — the R9 just generates responses faster. For a demo or proof-of-concept, the R7 is perfectly adequate. For a customer who’ll use the AI assistant daily, the R9’s speed advantage justifies the step-up.

How to Set It Up (The 30-Minute Version)

This isn’t a full deployment guide, but here’s the overview so resellers know what’s involved:

Step 1: Install Ollama (~2 minutes). Download from ollama.com, run the installer. On Windows, it’s a standard .exe install. On Linux, it’s a one-line curl command.

Step 2: Pull the model (~5–15 minutes depending on internet speed). Open a terminal/command prompt. Run: ollama pull qwen2.5:7b-instruct. Ollama downloads the ~4.5GB model file.

Step 3: Test it (~1 minute). Run: ollama run qwen2.5:7b-instruct. Type a prompt. If you get a response, the backend is working.

Step 4: Install Open WebUI (~10 minutes). The recommended method is via Docker: docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main. Alternatively, install via pip for a lighter setup.

Step 5: Connect and configure (~5 minutes). Open a browser to http://localhost:3000. Create an admin account. Open WebUI auto-detects the local Ollama instance. Select the Qwen model, and start chatting.

Total time: under 30 minutes for a competent IT person. No specialist AI knowledge required.

What Resellers Should Know

Set up a demo unit. Install the full stack on a CloudGate R9 in your office. When a customer asks about AI, open the browser and let them try it. The experience of typing a question and seeing a local AI respond — with no cloud, no subscription, no internet — is genuinely impressive and tangible.

Manage expectations honestly. A 7B local model is useful, but it’s not GPT-4 or Claude Opus. It will occasionally get things wrong, produce verbose responses, or struggle with highly complex reasoning. Position it as a private assistant for everyday tasks, not a replacement for the most powerful cloud AI services.

The 16GB ceiling is the key talking point. If a customer wants more capable AI, the answer is more RAM (32GB opens up 14B models) or a cloud API (which brings the best models to the CloudGate via OpenClaw or similar tools). Be upfront about this — it builds trust and sets up the 32GB conversation for when that SKU becomes available.

The recurring value is in the privacy story. The technical capability is the hook, but the lasting value proposition is data sovereignty. Every conversation with a local AI is a conversation that never touched a cloud server. For regulated industries, that’s not a feature — it’s a requirement.

The Bottom Line

Running a local LLM on a mini PC is no longer a hobbyist experiment. The tools are mature (Ollama), the models are capable (Qwen 2.5 7B), and the interface is polished (Open WebUI). A CloudGate R7 or R9 can have a working private AI assistant in under 30 minutes, with no subscription, no cloud dependency, and no specialist knowledge required.

For resellers, this is a new capability to add to your CloudGate pitch — one that differentiates the product, creates setup and support opportunities, and positions you at the forefront of the private AI conversation.

The AI isn’t in the cloud. It’s in the palm of your customer’s hand.

CloudGate mini PCs are available with full local AI capability. Contact info@cloudgate.co.za or call 010 140 4400 for demo units, setup documentation, and reseller pricing. Visit www.cloudgate.co.za.

Tags:

Open-Source AI