1. Why local LLMs in .NET
Three reasons I reach for a local LLM rather than a cloud API in .NET projects: privacy (sensitive business data stays on your infrastructure), latency (no round trip to an external API; for short completions, a warmed-up 7B model on a server GPU can be competitive with a hosted model such as GPT-4o), and cost (once the hardware is provisioned, the marginal cost of inference is little more than power).
The pattern I'll describe works for document processing, internal chatbots, code review assistants, and any feature where streaming output and data privacy both matter.
2. llama.cpp HTTP server setup
llama.cpp ships a built-in HTTP server — llama-server — that exposes an OpenAI-compatible REST API on localhost. Start it like this:
./llama-server \
  --model ./models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 0 \
  --host 127.0.0.1
With --n-gpu-layers 0 the model runs on the CPU only. If you have a CUDA GPU and a CUDA-enabled build of llama.cpp, raise this value to offload layers to the GPU for a significant speedup. The server now accepts requests at http://localhost:8080/v1/chat/completions using the same request/response schema as the OpenAI API, which keeps the .NET client straightforward.
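Before wiring anything into ASP.NET Core, it can help to confirm that the server answers a plain, non-streaming request. The following is a minimal console sketch, assuming llama-server is running with the command above; the prompt text is only a placeholder, and the response is read as generic JSON rather than through any typed model.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;

// Assumes llama-server is listening on the port used in the command above.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:8080") };

var payload = new
{
    model = "local",
    messages = new[] { new { role = "user", content = "Say hi." } } // placeholder prompt
};

using var response = await http.PostAsJsonAsync("/v1/chat/completions", payload);
response.EnsureSuccessStatusCode();

// In the OpenAI-style schema the reply text sits at choices[0].message.content.
using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
var reply = doc.RootElement
    .GetProperty("choices")[0]
    .GetProperty("message")
    .GetProperty("content")
    .GetString();

Console.WriteLine(reply);
```

If this prints a short greeting, the endpoint and schema are working and the remaining steps are about streaming.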
3. Calling it from ASP.NET Core
Register a named HttpClient pointed at the local server in Program.cs:
builder.Services.AddHttpClient("llama", client =>
{
    client.BaseAddress = new Uri("http://localhost:8080");
    client.Timeout = TimeSpan.FromMinutes(5);
});
The streaming call reads the SSE (Server-Sent Events) response from llama-server as an IAsyncEnumerable<string>, yielding each token as it arrives without allocating the full response:
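// Requires: using System.Linq; using System.Net.Http.Json;
//           using System.Runtime.CompilerServices; using System.Text.Json;
// _httpClientFactory is an injected IHttpClientFactory.
public async IAsyncEnumerable<string> StreamCompletionAsync(
    string userMessage,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    var httpClient = _httpClientFactory.CreateClient("llama");
    var payload = new
    {
        model = "local",
        stream = true,
        messages = new[] { new { role = "user", content = userMessage } }
    };
    using var request = new HttpRequestMessage(HttpMethod.Post, "/v1/chat/completions")
    {
        Content = JsonContent.Create(payload)
    };
    // Return as soon as the headers arrive so the body can be read incrementally.
    using var response = await httpClient.SendAsync(
        request,
        HttpCompletionOption.ResponseHeadersRead,
        ct);
    response.EnsureSuccessStatusCode();
    await using var stream = await response.Content.ReadAsStreamAsync(ct);
    using var reader = new StreamReader(stream);
    // ReadLineAsync returns null at end of stream; reader.EndOfStream is avoided
    // because it can block synchronously inside async code.
    while (await reader.ReadLineAsync(ct) is { } line)
    {
        // SSE payload lines start with "data: "; blank separator lines are skipped.
        if (!line.StartsWith("data: ", StringComparison.Ordinal)) continue;
        var data = line["data: ".Length..];
        if (data == "[DONE]") yield break;
        // The JSON keys are lowercase, so ChatCompletionChunk has to map them
        // (for example with [JsonPropertyName]) or be deserialized case-insensitively.
        var chunk = JsonSerializer.Deserialize<ChatCompletionChunk>(data);
        // FirstOrDefault avoids an out-of-range exception when "choices" is empty.
        var token = chunk?.Choices?.FirstOrDefault()?.Delta?.Content;
        if (!string.IsNullOrEmpty(token)) yield return token;
    }
}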
Using HttpCompletionOption.ResponseHeadersRead is essential. Without it, HttpClient buffers the entire response before SendAsync returns, which defeats streaming; the client timeout then also covers the whole download, so long completions can time out.
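The streaming method deserializes each line into a ChatCompletionChunk type that is not shown here. A minimal sketch of what it could look like follows, assuming only the fields the loop reads are needed. The record names ChunkChoice and ChunkDelta are hypothetical, and the sample JSON is a hand-written chunk in the OpenAI streaming shape, not captured server output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Text.Json.Serialization;

// One streamed line with the "data: " prefix already removed (hand-written sample).
const string sample =
    "{\"choices\":[{\"index\":0,\"delta\":{\"content\":\"Hel\"},\"finish_reason\":null}]}";

var chunk = JsonSerializer.Deserialize<ChatCompletionChunk>(sample);
var token = chunk?.Choices?.FirstOrDefault()?.Delta?.Content;
Console.WriteLine(token); // prints: Hel

// Hypothetical DTOs: only the fields the streaming loop reads are mapped,
// and any other JSON properties are ignored by System.Text.Json.
public sealed record ChatCompletionChunk(
    [property: JsonPropertyName("choices")] IReadOnlyList<ChunkChoice>? Choices);

public sealed record ChunkChoice(
    [property: JsonPropertyName("delta")] ChunkDelta? Delta,
    [property: JsonPropertyName("finish_reason")] string? FinishReason);

public sealed record ChunkDelta(
    [property: JsonPropertyName("content")] string? Content);
```

The explicit [JsonPropertyName] attributes matter because System.Text.Json matches property names case-sensitively by default, so a property named Choices would otherwise never bind to the lowercase "choices" key.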
4. WebSocket streaming to the Angular frontend
For the frontend, WebSockets are the simplest streaming transport for this use case. SignalR is excellent but adds a dependency and negotiation overhead that isn't needed for a one-way token stream. A plain WebSocket endpoint in .NET handles it cleanly:
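// In Program.cs (requires: using System.Net.WebSockets; using System.Text;)
// ChatService must be registered in DI, e.g. builder.Services.AddScoped<ChatService>();
app.UseWebSockets();
app.Map("/ws/chat", async context =>
{
    if (!context.WebSockets.IsWebSocketRequest)
    {
        context.Response.StatusCode = StatusCodes.Status400BadRequest;
        return;
    }
    using var ws = await context.WebSockets.AcceptWebSocketAsync();
    // RequestAborted is signalled when the connection drops, which also cancels generation.
    var ct = context.RequestAborted;
    // Reads a single frame; prompts larger than 4 KB of UTF-8 would need a loop on EndOfMessage.
    var buffer = new byte[4096];
    var received = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), ct);
    if (received.MessageType != WebSocketMessageType.Text) return;
    var userMessage = Encoding.UTF8.GetString(buffer, 0, received.Count);
    var chatService = context.RequestServices.GetRequiredService<ChatService>();
    await foreach (var token in chatService.StreamCompletionAsync(userMessage, ct))
    {
        var bytes = Encoding.UTF8.GetBytes(token);
        // Each token goes out as one complete text message (endOfMessage: true).
        await ws.SendAsync(new ArraySegment<byte>(bytes), WebSocketMessageType.Text, true, ct);
    }
    await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "Done", CancellationToken.None);
});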
On the Angular side, connect to ws://localhost:5000/ws/chat, send the user message on open, and append each incoming message to a signal. The result is a real-time streaming chat UI with no polling and no third-party libraries.
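The wire protocol is small enough to exercise without a browser: one text message in, one text message per token out, then a normal close from the server. The console client below is a sketch of that exchange and stands in for the Angular client for smoke-testing only. It assumes the API listens on ws://localhost:5000 as in the paragraph above, and the prompt is a placeholder.

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;

using var ws = new ClientWebSocket();
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));

await ws.ConnectAsync(new Uri("ws://localhost:5000/ws/chat"), cts.Token);

// The endpoint expects exactly one text message: the user prompt (placeholder here).
var prompt = Encoding.UTF8.GetBytes("Summarise the benefits of local inference.");
await ws.SendAsync(new ArraySegment<byte>(prompt), WebSocketMessageType.Text, true, cts.Token);

// Each incoming message is one token; a single token is assumed to fit in the buffer.
var buffer = new byte[4096];
while (ws.State == WebSocketState.Open)
{
    var result = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), cts.Token);
    if (result.MessageType == WebSocketMessageType.Close)
    {
        // The server closes first when generation is done; complete the handshake.
        await ws.CloseOutputAsync(WebSocketCloseStatus.NormalClosure, "Done", cts.Token);
        break;
    }
    Console.Write(Encoding.UTF8.GetString(buffer, 0, result.Count));
}
Console.WriteLine();
```

An Angular client follows the same sequence with the browser WebSocket API: send on open, append on each message, stop on close.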
5. Production considerations
A few things to address before deploying this to production:
- Process management. Run llama-server as a system service (systemd on Linux, a Windows Service, or a Docker sidecar). Implement a health-check endpoint in your .NET API that pings http://localhost:8080/health and returns 503 if the model server is down.
- Memory limits. A 7B Q4_K_M model uses roughly 4–5 GB of RAM at full context. Set hard memory limits on the process and monitor it with a tool like Prometheus + Grafana. If the server is killed for running out of memory, your .NET API should return a graceful error rather than hang.
- Model warm-up. The first request after startup can be slow, since the model weights may not be fully resident in memory yet; this can add 3–10 seconds. Send a warm-up prompt (e.g., "Hello.") on application startup so the first real user request isn't slow. Add this to your IHostedService startup logic.
- Request queuing. llama-server serves one request per slot and queues the rest; the slot count is set with --parallel and has historically defaulted to 1. For high concurrency, raise the slot count or run multiple server instances on different ports and round-robin across them. For most enterprise internal tools, a single instance is fine.
- Rate limiting. Add ASP.NET Core's built-in rate limiter to the /ws/chat endpoint to prevent a single user from monopolising the inference engine.
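Three of the items above (health check, warm-up, rate limiting) map onto built-in ASP.NET Core features. The Program.cs sketch below shows one possible wiring, assuming .NET 7 or later. The class names LlamaHealthCheck and LlamaWarmUpService, the /healthz route, the "chat" policy name, and the choice of one concurrent chat per client IP are all illustrative assumptions rather than part of the setup described in this article.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.RateLimiting;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.RateLimiting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient("llama", c => c.BaseAddress = new Uri("http://localhost:8080"));
builder.Services.AddHealthChecks().AddCheck<LlamaHealthCheck>("llama-server");
builder.Services.AddHostedService<LlamaWarmUpService>();

// Assumed policy: one in-flight chat per client IP, further attempts get HTTP 429.
builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
    options.AddPolicy("chat", http => RateLimitPartition.GetConcurrencyLimiter(
        http.Connection.RemoteIpAddress?.ToString() ?? "unknown",
        _ => new ConcurrencyLimiterOptions { PermitLimit = 1, QueueLimit = 0 }));
});

var app = builder.Build();
app.UseRateLimiter();
app.MapHealthChecks("/healthz"); // an Unhealthy result is reported as HTTP 503 by default
// Attach the policy to the chat endpoint: app.Map("/ws/chat", ...).RequireRateLimiting("chat");
app.Run();

// Reports Unhealthy when llama-server's /health endpoint fails or is unreachable.
public sealed class LlamaHealthCheck : IHealthCheck
{
    private readonly IHttpClientFactory _factory;
    public LlamaHealthCheck(IHttpClientFactory factory) => _factory = factory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken ct = default)
    {
        try
        {
            using var response = await _factory.CreateClient("llama").GetAsync("/health", ct);
            return response.IsSuccessStatusCode
                ? HealthCheckResult.Healthy()
                : HealthCheckResult.Unhealthy($"llama-server returned {(int)response.StatusCode}");
        }
        catch (HttpRequestException ex)
        {
            return HealthCheckResult.Unhealthy("llama-server unreachable", ex);
        }
    }
}

// Sends one tiny completion in the background at startup so the first user request is warm.
public sealed class LlamaWarmUpService : BackgroundService
{
    private readonly IHttpClientFactory _factory;
    private readonly ILogger<LlamaWarmUpService> _logger;

    public LlamaWarmUpService(IHttpClientFactory factory, ILogger<LlamaWarmUpService> logger)
    {
        _factory = factory;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var payload = new
        {
            model = "local",
            max_tokens = 1, // a single token is enough to touch the weights
            messages = new[] { new { role = "user", content = "Hello." } }
        };
        try
        {
            using var response = await _factory.CreateClient("llama")
                .PostAsJsonAsync("/v1/chat/completions", payload, stoppingToken);
            response.EnsureSuccessStatusCode();
            _logger.LogInformation("llama-server warm-up completed");
        }
        catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
        {
            _logger.LogWarning(ex, "llama-server warm-up failed; the first request may be slow");
        }
    }
}
```

A concurrency limiter is used here instead of a fixed time window because the scarce resource is an inference slot held for the duration of a chat, not a number of requests per minute. A fixed-window policy via RateLimitPartition.GetFixedWindowLimiter would be the alternative if request frequency is the concern.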