Offline AI Desktop App with Electron + llama.cpp

1. Why I built it

Every month I was spending 20–30 minutes manually typing data from PDF invoices into a spreadsheet. Not because I couldn't afford a tool — there are plenty — but because every cloud-based OCR product I evaluated stored my financial documents on their servers, required a monthly subscription, and introduced a privacy risk I wasn't comfortable with.

My local machine already has more than enough compute to run a 1.5B quantised language model. The only thing stopping me from solving this problem locally was the engineering. So I built jaklens.ai.

2. The tech stack

The final architecture uses these layers:

Electron — the desktop shell, handles window management and native OS integration
Angular 17 — the renderer process UI, standalone components with signals
Electron IPC — the bridge between the Electron main process and the Angular renderer
llama.cpp — the inference engine, compiled for Windows x64, runs as a child process
Qwen2.5 1.5B GGUF — the quantised model file, bundled inside the installer
SQLite via better-sqlite3 — the local database, stores extracted invoice data

The deliberate choice to use llama.cpp instead of Ollama was about control. With llama.cpp I can bundle the binary directly, set exact context lengths, and read the process's stdout as it is generated. Ollama would have required the user to install and run a separate service.

3. The hardest part: bundling the model

The GGUF model file is 418 MB. Getting it into the installer correctly — and resolving its path at runtime — is surprisingly easy to get wrong.

In electron-builder.yml, the model is declared as an extra resource:

extraResources:
  - from: "models/qwen2.5-1.5b-instruct-q4_k_m.gguf"
    to: "models/qwen2.5-1.5b-instruct-q4_k_m.gguf"

At runtime, the path differs between development and production. In dev, __dirname points to the main-process file's directory inside the project, so the model is reached with a relative path. In the packaged app, extra resources live under process.resourcesPath. The safe pattern is:

import { app } from 'electron';
import path from 'path';

const modelPath = app.isPackaged
  ? path.join(process.resourcesPath, 'models', 'qwen2.5-1.5b-instruct-q4_k_m.gguf')
  : path.join(__dirname, '../../models', 'qwen2.5-1.5b-instruct-q4_k_m.gguf');

Similarly, the llama-cli binary itself is shipped as an extra resource and resolved with the same pattern. On Windows there is no executable bit to worry about, so the .exe runs as shipped. If you target macOS or Linux as well, check that the binary keeps its executable permission after packaging.
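
Since the model and the binary follow the same rule, the dev/prod switch can live in one helper. The following is a minimal sketch, not the app's actual code: the RuntimeEnv shape, the resolveResource name, and the example directories are assumptions made for illustration. In Electron the three values would come from app.isPackaged, process.resourcesPath, and a path derived from __dirname.

```typescript
import * as path from 'node:path';

// Hypothetical description of where the app is running from.
interface RuntimeEnv {
  isPackaged: boolean;   // app.isPackaged in Electron
  resourcesPath: string; // process.resourcesPath in the packaged app
  devRoot: string;       // project root in dev, e.g. path.join(__dirname, '../..')
}

// Resolve a bundled file the same way in dev and in the packaged app.
function resolveResource(env: RuntimeEnv, ...segments: string[]): string {
  const base = env.isPackaged ? env.resourcesPath : env.devRoot;
  return path.join(base, ...segments);
}

const MODEL_FILE = 'qwen2.5-1.5b-instruct-q4_k_m.gguf';

// Example with made-up directories: the same call works in both modes.
const devEnv: RuntimeEnv = { isPackaged: false, resourcesPath: '', devRoot: 'project' };
const devModelPath = resolveResource(devEnv, 'models', MODEL_FILE);
```

Keeping the environment as an explicit argument means the function can be unit-tested without launching Electron, which helps catch path bugs before a clean-machine test.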

4. Streaming tokens through IPC

The inference call spawns llama-cli as a child process and reads its stdout line by line using a readline interface:

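import { spawn } from 'child_process';
import readline from 'readline';
import { BrowserWindow } from 'electron';

// llamaBinaryPath and modelPath are resolved as described in section 3.
function runInference(prompt: string, win: BrowserWindow) {
  const proc = spawn(llamaBinaryPath, [
    '-m', modelPath,
    '-p', prompt,
    '--ctx-size', '2048',
    '--temp', '0.1',
    '-n', '512',
    '--no-display-prompt',
  ]);

  // readline emits one complete line of stdout at a time, without its trailing newline.
  const rl = readline.createInterface({ input: proc.stdout! });

  rl.on('line', (line) => {
    win.webContents.send('llm-token', line);
  });

  // Signal completion exactly once, whether the process exits or fails to start.
  let done = false;
  const finish = () => {
    if (done) return;
    done = true;
    win.webContents.send('llm-done');
  };

  // Without an 'error' listener, a missing binary surfaces as an uncaught exception.
  proc.on('error', (err) => {
    console.error('Failed to run llama-cli:', err);
    finish();
  });

  proc.on('close', finish);
}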

On the Angular side, the renderer listens for these events and appends each line to a reactive signal, re-adding the newline that readline strips. The UI updates as output arrives, which gives the streaming effect users expect from modern AI UIs:

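// Importing ipcRenderer directly only works when the window is created with
// nodeIntegration enabled and contextIsolation disabled. With context isolation
// on, expose these listeners from a preload script via contextBridge instead.
import { ipcRenderer } from 'electron';
import { signal } from '@angular/core';

const streamBuffer = signal('');

// Each 'llm-token' event carries one line of model output.
ipcRenderer.on('llm-token', (_event, line: string) => {
  streamBuffer.update(prev => prev + line + '\n');
});

ipcRenderer.on('llm-done', () => {
  // parse JSON from streamBuffer()
});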

5. Prompt engineering for structured output

Getting a 1.5B model to reliably emit valid JSON requires a precise system prompt and a defensive parsing layer. The system prompt looks like this:

You are an invoice data extractor. Extract the following fields from the
invoice image and return ONLY a valid JSON object. Do not include any
explanation or text outside the JSON.

Required schema:
{
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD or null",
  "subtotal": number,
  "tax": number,
  "total": number,
  "currency": "string (ISO 4217)"
}
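
The schema in the prompt can be mirrored on the application side so that a parsed response is checked before it reaches the database. This is a sketch under assumptions: the InvoiceData interface and isInvoiceData guard are illustrative names, and the date and currency checks only validate shape (four-two-two digits, three uppercase letters), not whether the date exists or the code is a real ISO 4217 entry.

```typescript
// Mirrors the fields listed in the prompt's schema.
interface InvoiceData {
  vendor_name: string;
  invoice_number: string;
  invoice_date: string;    // YYYY-MM-DD
  due_date: string | null; // YYYY-MM-DD or null
  subtotal: number;
  tax: number;
  total: number;
  currency: string;        // three-letter code such as "EUR"
}

const isText = (x: unknown): boolean => typeof x === 'string';
const isDate = (x: unknown): boolean =>
  typeof x === 'string' && /^\d{4}-\d{2}-\d{2}$/.test(x);
const isAmount = (x: unknown): boolean =>
  typeof x === 'number' && Number.isFinite(x);
const isCurrency = (x: unknown): boolean =>
  typeof x === 'string' && /^[A-Z]{3}$/.test(x);

// Runtime guard: true only if every required field has the expected shape.
function isInvoiceData(value: unknown): value is InvoiceData {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    isText(v['vendor_name']) &&
    isText(v['invoice_number']) &&
    isDate(v['invoice_date']) &&
    (v['due_date'] === null || isDate(v['due_date'])) &&
    isAmount(v['subtotal']) &&
    isAmount(v['tax']) &&
    isAmount(v['total']) &&
    isCurrency(v['currency'])
  );
}
```

A small model will sometimes return numbers as strings or dates in a local format, so a guard like this treats "valid JSON" and "valid invoice" as two separate checks.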

Even with a strict prompt, the model occasionally wraps the JSON in markdown code fences or adds a brief explanation. The parser handles this with a regex extraction step before JSON.parse(), and if parsing fails, the whole inference call is retried up to three times with a slightly rephrased prompt. On the third failure, the user sees a graceful fallback UI rather than a crash.
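
The extract-then-retry flow described above could look roughly like the sketch below. It is not the app's actual parser: extractJsonObject and extractWithRetry are hypothetical names, generate stands in for the llama-cli call, and the "first brace to last brace" heuristic is one simple way to do the extraction step, assumed here for illustration.

```typescript
// Three backticks, built at runtime so this snippet contains no literal fence.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE, 'i');

// Pull a JSON object out of raw model output, tolerating markdown code
// fences and stray explanation text before or after the object.
function extractJsonObject(raw: string): Record<string, unknown> | null {
  // Prefer the contents of a fenced block if the model produced one.
  const fenced = raw.match(FENCED_BLOCK);
  const candidate = fenced ? fenced[1] : raw;

  // Keep everything from the first "{" to the last "}".
  const start = candidate.indexOf('{');
  const end = candidate.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  try {
    return JSON.parse(candidate.slice(start, end + 1));
  } catch {
    return null; // malformed JSON: let the caller retry
  }
}

// Try one prompt variant per attempt and stop at the first parseable answer.
// Passing three variants gives the "up to three times" behaviour.
async function extractWithRetry(
  generate: (prompt: string) => Promise<string>, // hypothetical inference call
  promptVariants: string[],
): Promise<Record<string, unknown> | null> {
  for (const prompt of promptVariants) {
    const parsed = extractJsonObject(await generate(prompt));
    if (parsed !== null) return parsed;
  }
  return null; // every attempt failed: show the fallback UI
}
```

Returning null instead of throwing keeps the failure path explicit, so the caller decides when to show the fallback UI.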

6. Lessons learned

Start with Q4_K_M quantisation. It's the sweet spot for a 1.5B model: accurate enough for structured extraction, small enough to fit in a reasonable installer, and fast enough on a mid-range CPU (2–4 seconds per invoice).
Test path resolution on a clean machine early. The dev/prod path divergence is the most common source of "works on my machine" bugs in Electron apps that bundle native binaries or model files. Automate a clean-build test in CI before the first release.
Sampling temperature matters more than model size for extraction tasks. Running at --temp 0.1 made structured JSON output dramatically more reliable than the default 0.8. Determinism beats creativity for data extraction.
