
Calling a local LLM from ASP.NET Core — the right way

Streaming tokens from llama.cpp through a .NET WebSocket endpoint without blocking the thread pool.

Jaks April 2026 7 min read

1. Why local LLMs in .NET

Three reasons I reach for a local LLM rather than a cloud API in .NET projects: privacy (sensitive business data stays on your infrastructure), latency (no round-trip to an external API; a well-warmed 7B model on a server GPU can be competitive with a hosted GPT-4-class model on latency for short completions), and cost (once the hardware is provisioned, the marginal cost of inference is little more than electricity).

The pattern I'll describe works for document processing, internal chatbots, code review assistants, and any feature where streaming output and data privacy both matter.

2. llama.cpp HTTP server setup

llama.cpp ships a built-in HTTP server — llama-server — that exposes an OpenAI-compatible REST API on localhost. Start it like this:

./llama-server \
  --model ./models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 0 \
  --host 127.0.0.1

With --n-gpu-layers 0 it runs on CPU only. If you have a CUDA GPU, increase this to offload layers for a significant speedup. The server is now listening at http://localhost:8080/v1/chat/completions with the same request/response schema as the OpenAI API — which makes the .NET client straightforward.
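Before wiring anything into ASP.NET Core, it can help to confirm the server answers a plain, non-streaming request. The following is a minimal sketch, assuming the server from the command above is running on port 8080. The names LlamaSmokeTest, AskOnceAsync and ExtractContent are hypothetical helpers for illustration, not part of llama.cpp or .NET; the response shape (choices[0].message.content) is the standard OpenAI chat completion schema.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper for a one-off check against llama-server.
public static class LlamaSmokeTest
{
    // Pulls choices[0].message.content out of a non-streaming response body.
    // Returns null when the response has no choices or no content.
    public static string? ExtractContent(string responseJson)
    {
        using var doc = JsonDocument.Parse(responseJson);
        if (!doc.RootElement.TryGetProperty("choices", out var choices) ||
            choices.ValueKind != JsonValueKind.Array ||
            choices.GetArrayLength() == 0)
        {
            return null;
        }

        var first = choices[0];
        return first.TryGetProperty("message", out var message) &&
               message.TryGetProperty("content", out var content)
            ? content.GetString()
            : null;
    }

    // Sends a single prompt with stream = false and returns the reply text.
    // Assumes llama-server is listening on http://localhost:8080.
    public static async Task<string?> AskOnceAsync(string prompt)
    {
        using var http = new HttpClient { BaseAddress = new Uri("http://localhost:8080") };
        var payload = new
        {
            model = "local",
            stream = false,
            messages = new[] { new { role = "user", content = prompt } }
        };

        using var response = await http.PostAsJsonAsync("/v1/chat/completions", payload);
        response.EnsureSuccessStatusCode();
        return ExtractContent(await response.Content.ReadAsStringAsync());
    }
}
```

Calling await LlamaSmokeTest.AskOnceAsync("Say hello") from a console app should print a short reply if the model loaded correctly; a connection error usually means the server is not running or is bound to a different port.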

3. Calling it from ASP.NET Core

Register a named HttpClient pointed at the local server in Program.cs:

builder.Services.AddHttpClient("llama", client =>
{
    client.BaseAddress = new Uri("http://localhost:8080");
    client.Timeout = TimeSpan.FromMinutes(5);
});

The streaming call reads the SSE (Server-Sent Events) response from llama-server as an IAsyncEnumerable<string>, yielding each token as it arrives without allocating the full response:

public async IAsyncEnumerable<string> StreamCompletionAsync(
    string userMessage,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    var httpClient = _httpClientFactory.CreateClient("llama");

    var payload = new
    {
        model = "local",
        stream = true,
        messages = new[] { new { role = "user", content = userMessage } }
    };

    var request = new HttpRequestMessage(HttpMethod.Post, "/v1/chat/completions")
    {
        Content = JsonContent.Create(payload)
    };

    using var response = await httpClient.SendAsync(
        request,
        HttpCompletionOption.ResponseHeadersRead,
        ct);

    response.EnsureSuccessStatusCode();

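    using var stream = await response.Content.ReadAsStreamAsync(ct);
    using var reader = new StreamReader(stream);

    // Avoid reader.EndOfStream in async code: it can block synchronously on the network read.
    // ReadLineAsync returns null at end of stream, which ends the loop;
    // cancellation surfaces as an OperationCanceledException from ReadLineAsync.
    while (await reader.ReadLineAsync(ct) is { } line)
    {
        // Skip blank separator lines and anything that is not an SSE data field.
        if (!line.StartsWith("data: ", StringComparison.Ordinal)) continue;
        var data = line["data: ".Length..];
        if (data == "[DONE]") yield break;

        // ChatCompletionChunk must map the lowercase JSON names ("choices", "delta", "content"),
        // for example via [JsonPropertyName] or case-insensitive serializer options.
        var chunk = JsonSerializer.Deserialize<ChatCompletionChunk>(data);
        // FirstOrDefault avoids an out-of-range exception when "choices" is empty.
        var token = chunk?.Choices?.FirstOrDefault()?.Delta?.Content;
        if (!string.IsNullOrEmpty(token)) yield return token;
    }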
}

Using HttpCompletionOption.ResponseHeadersRead is essential. With the default (ResponseContentRead), SendAsync does not return until the entire response body has been buffered, which defeats streaming and can run into the client timeout on long completions.
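The streaming method deserializes each line into a ChatCompletionChunk, which the article does not show. Here is one possible shape, as a sketch: only the fields needed to read a token are modelled, and the names ChunkChoice, ChunkDelta and SseChunkParser are my own for illustration. The [JsonPropertyName] attributes matter because System.Text.Json matches property names case-sensitively by default, while the server sends lowercase names.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Minimal model of one streamed chunk; fields not needed here are omitted.
public sealed class ChatCompletionChunk
{
    [JsonPropertyName("choices")]
    public ChunkChoice[]? Choices { get; set; }
}

// Hypothetical name; mirrors one entry of the "choices" array.
public sealed class ChunkChoice
{
    [JsonPropertyName("delta")]
    public ChunkDelta? Delta { get; set; }

    [JsonPropertyName("finish_reason")]
    public string? FinishReason { get; set; }
}

// Hypothetical name; mirrors the "delta" object that carries the new text.
public sealed class ChunkDelta
{
    [JsonPropertyName("content")]
    public string? Content { get; set; }
}

public static class SseChunkParser
{
    private const string Prefix = "data: ";

    // Returns the token carried by one SSE line, or null for blank lines,
    // non-data lines, the [DONE] sentinel, and chunks without content.
    public static string? TryGetToken(string line)
    {
        if (!line.StartsWith(Prefix, StringComparison.Ordinal)) return null;

        var data = line[Prefix.Length..];
        if (data == "[DONE]") return null;

        var chunk = JsonSerializer.Deserialize<ChatCompletionChunk>(data);
        var choices = chunk?.Choices;
        return choices is { Length: > 0 } ? choices[0]?.Delta?.Content : null;
    }
}
```

Keeping the line parsing in a small static method like this makes it easy to unit test against captured server output without starting llama-server.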

4. WebSocket streaming to the Angular frontend

For the frontend, WebSockets are the simplest streaming transport for this use case. SignalR is excellent but adds a dependency and negotiation overhead that isn't needed for a one-way token stream. A plain WebSocket endpoint in .NET handles it cleanly:

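// In Program.cs:
app.UseWebSockets();

app.Map("/ws/chat", async context =>
{
    if (!context.WebSockets.IsWebSocketRequest)
    {
        context.Response.StatusCode = 400;
        return;
    }

    // Signalled when the client connection is aborted, so generation can stop early
    // instead of running to completion for a client that has gone away.
    var ct = context.RequestAborted;

    using var ws = await context.WebSockets.AcceptWebSocketAsync();

    // Read the first message. It may arrive in several frames, so loop until EndOfMessage.
    var buffer = new byte[4096];
    using var message = new MemoryStream();
    WebSocketReceiveResult received;
    do
    {
        received = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), ct);
        if (received.MessageType == WebSocketMessageType.Close)
        {
            await ws.CloseOutputAsync(WebSocketCloseStatus.NormalClosure, null, CancellationToken.None);
            return;
        }
        message.Write(buffer, 0, received.Count);
    } while (!received.EndOfMessage);
    var userMessage = Encoding.UTF8.GetString(message.GetBuffer(), 0, (int)message.Length);

    var chatService = context.RequestServices.GetRequiredService<ChatService>();

    // Each token is sent as its own complete text message.
    await foreach (var token in chatService.StreamCompletionAsync(userMessage, ct))
    {
        var bytes = Encoding.UTF8.GetBytes(token);
        await ws.SendAsync(new ArraySegment<byte>(bytes), WebSocketMessageType.Text, true, ct);
    }

    await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "Done", CancellationToken.None);
});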

On the Angular side, connect to ws://localhost:5000/ws/chat, send the user message on open, and append each incoming message to a signal. The result is a real-time streaming chat UI with no polling and no third-party libraries.
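The same connect, send, and append flow can be exercised from a .NET console client before any frontend exists. This is a rough sketch that assumes the ASP.NET Core app is listening on port 5000 as in the URL above; ChatSmokeClient and RunAsync are hypothetical names. It sends one prompt, writes each incoming token to the console, and stops when the server closes the socket.

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical console-side client for smoke-testing the /ws/chat endpoint.
public static class ChatSmokeClient
{
    public static async Task RunAsync(string prompt, CancellationToken ct = default)
    {
        using var ws = new ClientWebSocket();
        // Assumed address: adjust to wherever the ASP.NET Core app is hosted.
        await ws.ConnectAsync(new Uri("ws://localhost:5000/ws/chat"), ct);

        // Send the user message as a single text message.
        var outgoing = Encoding.UTF8.GetBytes(prompt);
        await ws.SendAsync(new ArraySegment<byte>(outgoing), WebSocketMessageType.Text, true, ct);

        var buffer = new byte[4096];
        // A stateful decoder copes with multi-byte characters split across frames.
        var decoder = Encoding.UTF8.GetDecoder();
        var chars = new char[Encoding.UTF8.GetMaxCharCount(buffer.Length)];

        while (ws.State == WebSocketState.Open)
        {
            var result = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), ct);
            if (result.MessageType == WebSocketMessageType.Close)
            {
                // Complete the close handshake started by the server.
                await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "Bye", ct);
                break;
            }

            var count = decoder.GetChars(buffer, 0, result.Count, chars, 0, flush: result.EndOfMessage);
            Console.Write(chars, 0, count);
        }

        Console.WriteLine();
    }
}
```

An Angular client follows the same steps with the browser's WebSocket API: open the socket, send the prompt in the open handler, and append each message event's data to the signal.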

5. Production considerations

A few things to address before deploying this to production:
