1. Why I built it
Every month I was spending 20–30 minutes manually typing data from PDF invoices into a spreadsheet. Not because I couldn't afford a tool — there are plenty — but because every cloud-based OCR product I evaluated stored my financial documents on their servers, required a monthly subscription, and introduced a privacy risk I wasn't comfortable with.
My local machine already has more than enough compute to run a 1.5B quantised language model. The only thing stopping me from solving this problem locally was the engineering. So I built jaklens.ai.
2. The tech stack
The final architecture uses these layers:
- Electron — the desktop shell, handles window management and native OS integration
- Angular 17 — the renderer process UI, standalone components with signals
- Node.js IPC — the bridge between the Electron main process and the Angular renderer
- llama.cpp — the inference engine, compiled for Windows x64, runs as a child process
- Qwen2.5 1.5B GGUF — the quantised model file, bundled inside the installer
- SQLite via better-sqlite3 — the local database, stores extracted invoice data
The deliberate choice to use llama.cpp instead of Ollama was about control. With llama.cpp I can bundle the binary directly, set exact context lengths, and stream the process's stdout into the app as it is produced. Ollama would have required the user to install a separate service.
3. The hardest part: bundling the model
The GGUF model file is 418 MB. Getting it into the installer correctly — and resolving its path at runtime — is surprisingly easy to get wrong.
In electron-builder.yml, the model is declared as an extra resource:
extraResources:
  - from: "models/qwen2.5-1.5b-instruct-q4_k_m.gguf"
    to: "models/qwen2.5-1.5b-instruct-q4_k_m.gguf"
At runtime, the path differs between development and production. In dev, __dirname points into the project source. In the packaged app, extra resources live under process.resourcesPath. The safe pattern is:
import { app } from 'electron';
import path from 'path';

const modelPath = app.isPackaged
  ? path.join(process.resourcesPath, 'models', 'qwen2.5-1.5b-instruct-q4_k_m.gguf')
  : path.join(__dirname, '../../models', 'qwen2.5-1.5b-instruct-q4_k_m.gguf');
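Similarly, the llama-cli binary itself is shipped as an extra resource and resolved with the same pattern. Windows has no executable permission bit, so the bundled binary runs as shipped. A macOS or Linux build would additionally need the binary's execute permission preserved or restored (for example with chmod) before it can be spawned.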
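To keep the dev/production branching in one place, the same pattern can be wrapped in a small helper. This is a minimal sketch rather than the app's actual code: the RuntimeEnv shape, the resolveBundled name, and the devRoot field are hypothetical. In Electron, isPackaged and resourcesPath would be filled from app.isPackaged and process.resourcesPath.

```typescript
import * as path from 'node:path';

// Hypothetical description of where the app is running from.
// In Electron: isPackaged = app.isPackaged, resourcesPath = process.resourcesPath.
interface RuntimeEnv {
  isPackaged: boolean;
  resourcesPath: string; // directory that holds extraResources in a packaged app
  devRoot: string;       // assumed: project root used during development
}

// Resolve a bundled file against the resources directory when packaged,
// or against the project root in development.
function resolveBundled(env: RuntimeEnv, ...segments: string[]): string {
  const base = env.isPackaged ? env.resourcesPath : env.devRoot;
  return path.join(base, ...segments);
}
```

Passing the environment in as a plain object, instead of reading app.isPackaged inside the helper, means both branches can be unit-tested without launching Electron.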
4. Streaming tokens through IPC
The inference call spawns llama-cli as a child process and reads its stdout line by line using a readline interface:
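import { spawn } from 'child_process';
import readline from 'readline';
import { BrowserWindow } from 'electron';

// llamaBinaryPath and modelPath are resolved as described in section 3.
function runInference(prompt: string, win: BrowserWindow) {
  const proc = spawn(llamaBinaryPath, [
    '-m', modelPath,
    '-p', prompt,
    '--ctx-size', '2048',
    '--temp', '0.1',
    '-n', '512',
    '--no-display-prompt',
  ]);

  // readline emits one event per completed line of stdout, not per token.
  const rl = readline.createInterface({ input: proc.stdout! });
  rl.on('line', (line) => {
    win.webContents.send('llm-token', line);
  });

  // Without an 'error' listener, a failed spawn (e.g. a missing binary)
  // raises an unhandled 'error' event in the main process.
  // 'llm-error' is a hypothetical channel name.
  proc.on('error', (err) => {
    win.webContents.send('llm-error', err.message);
  });

  proc.on('close', () => {
    win.webContents.send('llm-done');
  });
}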
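On the Angular side, the renderer listens for these events and appends each line to a reactive signal, producing the streaming effect users expect from modern AI UIs. Note that importing ipcRenderer directly in the renderer only works with nodeIntegration enabled and contextIsolation disabled; with Electron's default settings, the listeners would be exposed through a preload script instead.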
import { ipcRenderer } from 'electron';
import { signal } from '@angular/core';

const streamBuffer = signal('');

ipcRenderer.on('llm-token', (_event, token: string) => {
  streamBuffer.update(prev => prev + token + '\n');
});

ipcRenderer.on('llm-done', () => {
  // parse JSON from streamBuffer()
});
5. Prompt engineering for structured output
Getting a 1.5B model to reliably emit valid JSON requires a precise system prompt and a defensive parsing layer. The system prompt looks like this:
You are an invoice data extractor. Extract the following fields from the
invoice image and return ONLY a valid JSON object. Do not include any
explanation or text outside the JSON.
Required schema:
{
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD or null",
  "subtotal": number,
  "tax": number,
  "total": number,
  "currency": "string (ISO 4217)"
}
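One way to make the downstream parsing stricter is to mirror this schema as a TypeScript type with a runtime guard. The sketch below is an assumption about how that could look, not the app's actual code: the names InvoiceData and isInvoiceData are hypothetical, and the currency test only checks the three-letter shape of an ISO 4217 code, not whether the code really exists.

```typescript
// Mirrors the fields requested in the prompt's schema.
interface InvoiceData {
  vendor_name: string;
  invoice_number: string;
  invoice_date: string;      // YYYY-MM-DD
  due_date: string | null;   // YYYY-MM-DD or null
  subtotal: number;
  tax: number;
  total: number;
  currency: string;          // three-letter code, e.g. "EUR"
}

const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

const isIsoDate = (x: unknown): boolean =>
  typeof x === 'string' && ISO_DATE.test(x);

const isFiniteNumber = (x: unknown): boolean =>
  typeof x === 'number' && Number.isFinite(x);

const isCurrencyCode = (x: unknown): boolean =>
  typeof x === 'string' && /^[A-Z]{3}$/.test(x);

// Runtime check that a parsed value has the shape the prompt asked for.
function isInvoiceData(value: unknown): value is InvoiceData {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v['vendor_name'] === 'string' &&
    typeof v['invoice_number'] === 'string' &&
    isIsoDate(v['invoice_date']) &&
    (v['due_date'] === null || isIsoDate(v['due_date'])) &&
    isFiniteNumber(v['subtotal']) &&
    isFiniteNumber(v['tax']) &&
    isFiniteNumber(v['total']) &&
    isCurrencyCode(v['currency'])
  );
}
```

A guard like this catches the case where the output is valid JSON but a field has the wrong type, for instance a total returned as the string "120.00".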
Even with a strict prompt, the model occasionally wraps the JSON in markdown code fences or adds a brief explanation. The parser handles this with a regex extraction step before JSON.parse(), and if parsing fails, the whole inference call is retried up to three times with a slightly rephrased prompt. On the third failure, the user sees a graceful fallback UI rather than a crash.
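The extract-then-retry flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the app's real parser: extractJson and extractWithRetries are hypothetical names, the regex simply grabs everything from the first "{" to the last "}", and infer stands in for whatever function runs the model and resolves with its full output.

```typescript
// Pull the outermost {...} span out of raw model output, tolerating
// markdown fences or stray prose around it. Returns null if nothing parses.
function extractJson(raw: string): Record<string, unknown> | null {
  const match = raw.match(/\{[\s\S]*\}/); // greedy: first "{" to last "}"
  if (match === null) return null;
  try {
    return JSON.parse(match[0]) as Record<string, unknown>;
  } catch {
    return null;
  }
}

// Try each prompt variant in order and return the first parseable result.
// `prompts` holds one (re)phrasing per attempt; `infer` is a stand-in for
// the real inference call.
async function extractWithRetries(
  prompts: string[],
  infer: (prompt: string) => Promise<string>,
): Promise<Record<string, unknown> | null> {
  for (const prompt of prompts) {
    const parsed = extractJson(await infer(prompt));
    if (parsed !== null) return parsed;
  }
  return null; // every attempt failed: the caller shows the fallback UI
}
```

Passing three prompt variants gives the three-attempt behaviour described above. Returning null instead of throwing keeps the "all attempts failed" case an ordinary value the UI can branch on.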
6. Lessons learned
- Start with q4_K_M quantisation. It's the sweet spot for a 1.5B model — accurate enough for structured extraction, small enough to fit in a reasonable installer, and fast enough on a mid-range CPU (2–4 seconds per invoice).
- Test path resolution on a clean machine early. The dev/prod path divergence is the most common source of "works on my machine" bugs in Electron apps that bundle native binaries or model files. Automate a clean-build test in CI before the first release.
- Prompt temperature matters more than model size for extraction tasks. Running at --temp 0.1 made structured JSON output dramatically more reliable than the default 0.8. Determinism beats creativity for data extraction.