Instrumentation Middleware: Usage & Extension Guide

Overview

The async instrumentation middleware provides non-blocking, streaming-capable instrumentation for all API calls handled by the LLM Proxy. It captures request/response metadata and emits events to a pluggable event bus for downstream processing (e.g., file, cloud, analytics).

Note: Instrumentation middleware and audit logging serve different purposes:

  • Instrumentation: Captures API request/response metadata for observability and analytics
  • Audit Logging: Records security-sensitive operations for compliance and investigations (see Audit Events)

Both systems operate independently and can be configured separately.

Event Bus & Dispatcher Architecture

  • The async event bus is now always enabled and handles all API instrumentation events.
  • The event bus supports multiple subscribers (fan-out), batching, retry logic, and graceful shutdown.
  • Both in-memory and Redis backends are available for local and distributed event delivery.
  • Persistent event logging is handled by a dispatcher CLI or the --file-event-log flag on the server, which writes events to a JSONL file.
  • The middleware captures and restores the request body for all events and attaches additional request context to each event for diagnostics and debugging.

Relationship to Audit Logging

The instrumentation event bus is separate from the audit logging system:

  • Event Bus: Captures API request/response data for observability (instrumentation middleware)
  • Audit Logger: Records security events directly to file/database (audit middleware)

Both systems can run simultaneously:

  • Instrumentation events flow through the event bus to dispatchers
  • Audit events are written directly to audit logs (file and/or database)
  • No overlap in captured data: instrumentation focuses on API performance; audit focuses on security events

Audit Events

The proxy emits audit events for security-sensitive operations:

Proxy Request Audit Events

  • Project Inactive (403): When a request is denied due to inactive project status
    • Action: proxy_request, Result: denied, Reason: project_inactive
    • Includes: project ID, token ID, client IP, user agent, HTTP method, endpoint
  • Service Unavailable (503): When project status check fails due to database errors
    • Action: proxy_request, Result: error, Reason: service_unavailable
    • Includes: error details, project ID, request metadata

Management API Audit Events

  • Project Lifecycle: Create, update (including is_active changes), delete operations
  • Token Management: Create, update, revoke (single and batch operations)
  • All events include actor identification, request IDs, and operation metadata

Audit events are stored in the database and written to audit log files for compliance and security investigations.

For complete system observability, both should be enabled in production environments.

Persistent Event Logging

  • To persist all events to a file, use the --file-event-log flag when running the server:
llm-proxy server --file-event-log ./data/events.jsonl
  • Alternatively, use the standalone dispatcher CLI to subscribe to the event bus and write events to a file or other backends:
llm-proxy dispatcher --service file --endpoint ./data/events.jsonl

Configuration Reference

  • OBSERVABILITY_ENABLED: Deprecated; the async event bus is always enabled.
  • OBSERVABILITY_BUFFER_SIZE (int): Buffer size for event bus (default: 1000)
  • OBSERVABILITY_MAX_REQUEST_BODY_BYTES (int64): Max bytes of request body captured into observability events (default: 65536). Does not affect proxying.
  • OBSERVABILITY_MAX_RESPONSE_BODY_BYTES (int64): Max bytes of response body captured into observability events (default: 262144). Does not affect proxying.
  • FILE_EVENT_LOG: Path to persistent event log file (enables file event logging via dispatcher)

Hot-Path Performance Tuning (Non-Observability)

These settings primarily affect hot-path performance characteristics rather than core observability semantics:

  • LLM_PROXY_API_KEY_CACHE_TTL (duration): TTL for per-project upstream API key cache (default: 30s).
  • LLM_PROXY_API_KEY_CACHE_MAX (int): Max entries for per-project upstream API key cache (default: 10000).
  • OBSERVABILITY_MAX_RESPONSE_BODY_BYTES (int64): Cap bytes captured from response bodies for observability events (also bounds OpenAI metadata extraction). Default: 262144.

How It Works

  • The middleware wraps all proxy requests and responses.
  • Captures request ID, method, path, status, duration, headers, and full (streamed) response body.
  • Emits an event to the async event bus (in-memory or Redis).
  • Event delivery is fully async, non-blocking, batched, and resilient to failures.
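
A minimal sketch of this flow is shown below, assuming the EventBus interface is named eventbus.EventBus and exposes the Publish(ctx, event) method shown in the extension example later on this page. The import path, the X-Request-ID header, and the statusRecorder helper are illustrative only; see internal/middleware/instrumentation.go for the real implementation.

package middleware

import (
  "net/http"
  "time"

  "github.com/your-org/llm-proxy/internal/eventbus" // hypothetical import path
)

// Instrumentation wraps a handler and emits one event per request to the bus.
func Instrumentation(bus eventbus.EventBus) func(http.Handler) http.Handler {
  return func(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
      start := time.Now()
      rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
      next.ServeHTTP(rec, r)

      // Publishing is async and non-blocking, so the hot path is not delayed.
      bus.Publish(r.Context(), eventbus.Event{
        RequestID: r.Header.Get("X-Request-ID"), // assumption: request ID travels in this header
        Method:    r.Method,
        Path:      r.URL.Path,
        Status:    rec.status,
        Duration:  time.Since(start),
      })
    })
  }
}

// statusRecorder captures the status code written by the downstream handler.
type statusRecorder struct {
  http.ResponseWriter
  status int
}

func (s *statusRecorder) WriteHeader(code int) {
  s.status = code
  s.ResponseWriter.WriteHeader(code)
}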

Event Bus Backends

  • Redis Streams (redis-streams): Recommended for production. Provides consumer groups, acknowledgment, at-least-once delivery, and crash recovery. See Redis Streams Backend.
  • In-Memory (in-memory): Fast, simple, for local/dev use. Single process only. No durability or delivery guarantees.
  • Custom: Implement the EventBus interface for other backends (Kafka, HTTP, etc.).

Event Schema Example

// eventbus.Event
type Event struct {
  RequestID       string
  Method          string
  Path            string
  Status          int
  Duration        time.Duration
  ResponseHeaders http.Header
  ResponseBody    []byte
}

Example: Enabling Persistent Logging in Docker

docker run -d \
  -e FILE_EVENT_LOG=./data/events.jsonl \
  ...

Extending the Middleware

  • Custom Event Schema: Extend eventbus.Event or create your own struct. Update the middleware to emit your custom event type.
  • New Event Bus Backends: Implement the EventBus interface (see internal/eventbus/eventbus.go). Plug in your backend (e.g., Redis, Kafka, HTTP, etc.).
  • New Consumers/Dispatchers: Write a dispatcher that subscribes to the event bus and delivers events to your backend (file, cloud, analytics, etc.).

Example: Custom EventBus Backend

// MyEventBus implements the eventbus.EventBus interface.
type MyEventBus struct { /* ... */ }

// Publish delivers an event to the bus; it should not block the request path.
func (b *MyEventBus) Publish(ctx context.Context, evt eventbus.Event) { /* ... */ }

// Subscribe returns the channel from which consumers receive events.
func (b *MyEventBus) Subscribe() <-chan eventbus.Event { /* ... */ }
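
A consumer is then a loop over the subscription channel. The sketch below shows a minimal file dispatcher that appends each event as a JSONL line; the eventbus import path and the on-disk field names are assumptions, and the real dispatchers (with batching and retries) live in internal/dispatcher/.

package dispatcher

import (
  "context"
  "encoding/json"
  "log"
  "os"

  "github.com/your-org/llm-proxy/internal/eventbus" // hypothetical import path
)

// runFileDispatcher consumes events from the bus and appends them to a JSONL file.
func runFileDispatcher(ctx context.Context, bus eventbus.EventBus, path string) error {
  f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
  if err != nil {
    return err
  }
  defer f.Close()

  events := bus.Subscribe() // channel of eventbus.Event, as in the example above
  enc := json.NewEncoder(f) // Encode writes one JSON object per line (JSONL)
  for {
    select {
    case <-ctx.Done():
      return ctx.Err()
    case evt, ok := <-events:
      if !ok {
        return nil // bus shut down
      }
      if err := enc.Encode(evt); err != nil {
        log.Printf("write event: %v", err)
      }
    }
  }
}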

Dispatcher CLI Commands

The LLM Proxy includes a pluggable dispatcher system for sending observability events to external services. The dispatcher supports multiple backends and can be run as a separate service.

Available Backends

  • file: Write events to JSONL file
  • lunary: Send events to Lunary.ai platform
  • helicone: Send events to Helicone platform

Basic Usage

# File output
llm-proxy dispatcher --service file --endpoint events.jsonl

# Lunary integration
llm-proxy dispatcher --service lunary --api-key $LUNARY_API_KEY

# Helicone integration  
llm-proxy dispatcher --service helicone --api-key $HELICONE_API_KEY

# Custom endpoint for Lunary
llm-proxy dispatcher --service lunary --api-key $LUNARY_API_KEY --endpoint https://custom.lunary.ai/v1/runs/ingest

Configuration Options

| Flag | Default | Description |
|------|---------|-------------|
| --service | file | Backend service (file, lunary, helicone) |
| --endpoint | service-specific | API endpoint or file path |
| --api-key | - | API key for external services |
| --buffer | 1000 | Event bus buffer size |
| --batch-size | 100 | Batch size for sending events |
| --detach | false | Run in background (daemon mode) |

Environment Variables

  • LLM_PROXY_API_KEY: API key for the selected service
  • LLM_PROXY_ENDPOINT: Default endpoint URL

Event Format

The dispatcher transforms internal events into a rich format suitable for external services:

{
  "type": "llm",
  "event": "start",
  "runId": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2023-12-01T10:00:00Z",
  "input": {"model": "gpt-4", "messages": [...]},
  "output": {"choices": [...]},
  "metadata": {
    "method": "POST",
    "path": "/v1/chat/completions", 
    "status": 200,
    "duration_ms": 1234,
    "request_id": "req-123"
  }
}

Advanced Features

  • Automatic Retry: Exponential backoff for failed requests
  • Batching: Configurable batch sizes for efficiency
  • Graceful Shutdown: SIGINT/SIGTERM handling
  • Extensible: Easy to add new backends
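
The retry behavior follows the usual exponential-backoff pattern. The sketch below illustrates it in isolation; the attempt count, the starting delay, and the sendBatch signature are illustrative and not the dispatcher's actual values.

package dispatcher

import (
  "context"
  "fmt"
  "time"

  "github.com/your-org/llm-proxy/internal/eventbus" // hypothetical import path
)

// sendWithRetry sends one batch, doubling the delay after each failed attempt.
func sendWithRetry(ctx context.Context, batch []eventbus.Event,
  sendBatch func(context.Context, []eventbus.Event) error) error {

  backoff := 500 * time.Millisecond // illustrative starting delay
  var lastErr error
  for attempt := 0; attempt < 5; attempt++ {
    if lastErr = sendBatch(ctx, batch); lastErr == nil {
      return nil
    }
    select {
    case <-ctx.Done():
      return ctx.Err()
    case <-time.After(backoff):
      backoff *= 2 // 0.5s, 1s, 2s, 4s, ...
    }
  }
  return fmt.Errorf("batch send failed after retries: %w", lastErr)
}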

Helicone Manual Logger Integration

The Helicone dispatcher plugin transforms LLM Proxy events into Helicone’s Manual Logger format. This enables detailed cost tracking, analytics, and monitoring of custom model endpoints through Helicone.

Payload Mapping Details

Our implementation maps LLM Proxy events to the Helicone Manual Logger format as follows:

{
  "providerRequest": {
    "url": "/v1/chat/completions",
    "json": { "model": "gpt-4", "messages": [...] },
    "meta": {
      "Helicone-Provider": "openai",
      "Helicone-User-Id": "user-123",
      "request_id": "req-456",
      "provider": "openai"
    }
  },
  "providerResponse": {
    "status": 200,
    "headers": {},
    "json": { "choices": [...], "usage": {...} },
    "base64": "..." // for non-JSON responses
  },
  "timing": {
    "startTime": { "seconds": 1640995200, "milliseconds": 0 },
    "endTime": { "seconds": 1640995201, "milliseconds": 250 }
  }
}

Key Features

Provider Detection: Automatically sets Helicone-Provider header to prevent categorization as “CUSTOM” model, enabling proper cost calculation.

Usage Injection: Injects computed token usage into response JSON when available:

{
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 25,
    "total_tokens": 35
  }
}

Request ID Propagation: Preserves request_id from middleware context for correlation.

Non-JSON Response Handling: For binary or non-JSON responses:

  • Sets providerResponse.json to an empty object with an explanatory note
  • Includes base64 field for binary data when available

Metadata Enrichment: Forwards relevant metadata fields and user properties to Helicone headers.
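
The timing object in the payload above splits each timestamp into whole seconds plus a millisecond remainder. A small sketch of that conversion from a Go time.Time (the struct and function names here are illustrative, not the plugin's actual code):

package helicone

import "time"

// heliconeTime mirrors the {seconds, milliseconds} shape used in the timing object.
type heliconeTime struct {
  Seconds      int64 `json:"seconds"`
  Milliseconds int64 `json:"milliseconds"`
}

// toHeliconeTime splits a timestamp into epoch seconds and the sub-second remainder in ms.
func toHeliconeTime(t time.Time) heliconeTime {
  return heliconeTime{
    Seconds:      t.Unix(),
    Milliseconds: int64(t.Nanosecond()) / int64(time.Millisecond),
  }
}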

Configuration

# Basic usage
llm-proxy dispatcher --service helicone --api-key $HELICONE_API_KEY

# Custom endpoint (e.g., for EU region)
llm-proxy dispatcher --service helicone \
  --api-key $HELICONE_API_KEY \
  --endpoint https://eu.api.helicone.ai/custom/v1/log

HTTP Response Caching Integration

The proxy includes HTTP response caching that integrates with the instrumentation and observability system. Caching behavior affects both response headers and event publishing.

Cache Response Headers

When caching is enabled (HTTP_CACHE_ENABLED=true), the proxy adds observability headers to all responses:

  • X-PROXY-CACHE: Indicates cache result
    • hit: Response served from cache
    • miss: Response not in cache, fetched from upstream
  • X-PROXY-CACHE-KEY: The cache key used for the request (useful for debugging cache behavior)
  • Cache-Status: Standard HTTP cache status header
    • hit: Cache hit, response served from cache
    • miss: Cache miss, response fetched from upstream
    • bypass: Caching bypassed (e.g., due to Cache-Control: no-store)
    • stored: Response was stored in cache after fetch
    • conditional-hit: Conditional request (e.g., If-None-Match) resulted in 304

Cache Metrics

The proxy keeps lightweight, provider-agnostic counters to assess cache effectiveness:

  • cache_hits_total: Number of requests served from cache (including conditional hits)
  • cache_misses_total: Number of requests that missed the cache
  • cache_bypass_total: Number of requests where caching was bypassed (e.g., no-store)
  • cache_store_total: Number of responses stored in cache after upstream fetch

Notes:

  • Counters are in-memory and surfaced via the existing metrics endpoint when enabled.
  • No external metrics provider is required; Prometheus export is optional and not a core dependency.
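
A counter set like the one above needs nothing more than atomic integers. A generic sketch (not the proxy's actual implementation) using the metric names listed above:

package cache

import "sync/atomic"

// cacheMetrics holds the four in-memory counters.
type cacheMetrics struct {
  hits, misses, bypass, stores atomic.Int64
}

func (m *cacheMetrics) Hit()    { m.hits.Add(1) }
func (m *cacheMetrics) Miss()   { m.misses.Add(1) }
func (m *cacheMetrics) Bypass() { m.bypass.Add(1) }
func (m *cacheMetrics) Store()  { m.stores.Add(1) }

// Snapshot exposes the current values under the metric names used by the endpoint.
func (m *cacheMetrics) Snapshot() map[string]int64 {
  return map[string]int64{
    "cache_hits_total":   m.hits.Load(),
    "cache_misses_total": m.misses.Load(),
    "cache_bypass_total": m.bypass.Load(),
    "cache_store_total":  m.stores.Load(),
  }
}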

Event Bus Behavior with Caching

The caching system integrates with the instrumentation middleware to optimize performance:

  • Cache Hits: Events are not published to the event bus for cache hits (including conditional hits). This prevents duplicate instrumentation data and reduces event bus load.
  • Cache Misses and Stores: Events are published normally when responses are fetched from upstream, whether they get cached or not.

This behavior ensures that:

  • Each unique API call is instrumented exactly once (when first fetched)
  • Cache performance doesn’t impact event bus throughput
  • Downstream analytics systems receive clean, non-duplicated data
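
In code terms the rule is simply "publish only when the response came from upstream". A hedged sketch of that decision, assuming the cache status value carries the same strings as the Cache-Status header; the real middleware's wiring differs.

package middleware

import (
  "context"

  "github.com/your-org/llm-proxy/internal/eventbus" // hypothetical import path
)

// maybePublish skips instrumentation for responses served from cache.
func maybePublish(ctx context.Context, bus eventbus.EventBus, cacheStatus string, evt eventbus.Event) {
  switch cacheStatus {
  case "hit", "conditional-hit":
    // Served from cache: skip publishing to avoid duplicate events.
  default:
    bus.Publish(ctx, evt) // miss, stored, or bypass: instrument normally
  }
}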

Example Headers

# Cache hit response
HTTP/1.1 200 OK
X-PROXY-CACHE: hit
X-PROXY-CACHE-KEY: llmproxy:cache:proj123:GET:/v1/models:accept-application/json
Cache-Status: hit
Content-Type: application/json

# Cache miss response
HTTP/1.1 200 OK
X-PROXY-CACHE: miss
X-PROXY-CACHE-KEY: llmproxy:cache:proj123:POST:/v1/chat/completions:accept-application/json:body-hash-abc123
Cache-Status: stored
Content-Type: application/json

Debugging Cache Behavior

Use the benchmark tool with the --debug flag to inspect cache headers:

llm-proxy benchmark \
  --base-url "http://localhost:8080" \
  --endpoint "/v1/chat/completions" \
  --token "$PROXY_TOKEN" \
  --requests 10 --concurrency 1 \
  --cache \
  --debug \
  --json '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"test"}]}'

This will show sample responses with all headers, making it easy to verify cache behavior.

Prometheus Metrics Endpoint

The proxy provides an additional Prometheus-compatible metrics endpoint for monitoring and alerting. This endpoint complements the existing JSON metrics endpoint without replacing it.

Endpoints

  • /metrics: Provider-agnostic JSON metrics (default format)
  • /metrics/prometheus: Prometheus text exposition format

Both endpoints are available when ENABLE_METRICS=true (default).

Available Metrics

The Prometheus endpoint exposes the following metrics:

Application Metrics

| Metric | Type | Description |
|--------|------|-------------|
| llm_proxy_uptime_seconds | gauge | Time since the server started |
| llm_proxy_requests_total | counter | Total number of proxy requests |
| llm_proxy_errors_total | counter | Total number of proxy errors |
| llm_proxy_cache_hits_total | counter | Total number of cache hits |
| llm_proxy_cache_misses_total | counter | Total number of cache misses |
| llm_proxy_cache_bypass_total | counter | Total number of cache bypasses |
| llm_proxy_cache_stores_total | counter | Total number of cache stores |

Go Runtime Metrics

| Metric | Type | Description |
|--------|------|-------------|
| llm_proxy_goroutines | gauge | Number of goroutines currently running |
| llm_proxy_memory_heap_alloc_bytes | gauge | Number of heap bytes allocated and currently in use |
| llm_proxy_memory_heap_sys_bytes | gauge | Number of heap bytes obtained from the OS |
| llm_proxy_memory_heap_idle_bytes | gauge | Number of heap bytes waiting to be used |
| llm_proxy_memory_heap_inuse_bytes | gauge | Number of heap bytes that are in use |
| llm_proxy_memory_heap_released_bytes | gauge | Number of heap bytes released to the OS |
| llm_proxy_memory_total_alloc_bytes | counter | Total number of bytes allocated (cumulative) |
| llm_proxy_memory_sys_bytes | gauge | Total number of bytes obtained from the OS |
| llm_proxy_memory_mallocs_total | counter | Total number of malloc operations |
| llm_proxy_memory_frees_total | counter | Total number of free operations |
| llm_proxy_gc_runs_total | counter | Total number of GC runs |
| llm_proxy_gc_pause_total_seconds | counter | Total GC pause time in seconds |
| llm_proxy_gc_next_bytes | gauge | Target heap size for next GC cycle |
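
These gauges come straight from the Go runtime. The sketch below shows how a few of them can be rendered in the Prometheus text exposition format; it is a minimal illustration, not the proxy's actual handler.

package metrics

import (
  "fmt"
  "net/http"
  "runtime"
)

// promHandler writes a handful of runtime metrics in Prometheus text format.
func promHandler(w http.ResponseWriter, r *http.Request) {
  var m runtime.MemStats
  runtime.ReadMemStats(&m)

  w.Header().Set("Content-Type", "text/plain; version=0.0.4")

  fmt.Fprintf(w, "# HELP llm_proxy_goroutines Number of goroutines currently running\n")
  fmt.Fprintf(w, "# TYPE llm_proxy_goroutines gauge\n")
  fmt.Fprintf(w, "llm_proxy_goroutines %d\n", runtime.NumGoroutine())

  fmt.Fprintf(w, "# HELP llm_proxy_memory_heap_alloc_bytes Number of heap bytes allocated and currently in use\n")
  fmt.Fprintf(w, "# TYPE llm_proxy_memory_heap_alloc_bytes gauge\n")
  fmt.Fprintf(w, "llm_proxy_memory_heap_alloc_bytes %d\n", m.HeapAlloc)

  fmt.Fprintf(w, "# HELP llm_proxy_gc_runs_total Total number of GC runs\n")
  fmt.Fprintf(w, "# TYPE llm_proxy_gc_runs_total counter\n")
  fmt.Fprintf(w, "llm_proxy_gc_runs_total %d\n", m.NumGC)
}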

Example Output

# HELP llm_proxy_uptime_seconds Time since the server started
# TYPE llm_proxy_uptime_seconds gauge
llm_proxy_uptime_seconds 3542.12
# HELP llm_proxy_requests_total Total number of proxy requests
# TYPE llm_proxy_requests_total counter
llm_proxy_requests_total 1523
# HELP llm_proxy_errors_total Total number of proxy errors
# TYPE llm_proxy_errors_total counter
llm_proxy_errors_total 12
# HELP llm_proxy_cache_hits_total Total number of cache hits
# TYPE llm_proxy_cache_hits_total counter
llm_proxy_cache_hits_total 842
# HELP llm_proxy_cache_misses_total Total number of cache misses
# TYPE llm_proxy_cache_misses_total counter
llm_proxy_cache_misses_total 681
# HELP llm_proxy_cache_bypass_total Total number of cache bypasses
# TYPE llm_proxy_cache_bypass_total counter
llm_proxy_cache_bypass_total 0
# HELP llm_proxy_cache_stores_total Total number of cache stores
# TYPE llm_proxy_cache_stores_total counter
llm_proxy_cache_stores_total 681
# HELP llm_proxy_goroutines Number of goroutines currently running
# TYPE llm_proxy_goroutines gauge
llm_proxy_goroutines 12
# HELP llm_proxy_memory_heap_alloc_bytes Number of heap bytes allocated and currently in use
# TYPE llm_proxy_memory_heap_alloc_bytes gauge
llm_proxy_memory_heap_alloc_bytes 2097152
# HELP llm_proxy_memory_total_alloc_bytes Total number of bytes allocated (cumulative)
# TYPE llm_proxy_memory_total_alloc_bytes counter
llm_proxy_memory_total_alloc_bytes 104857600
# HELP llm_proxy_gc_runs_total Total number of GC runs
# TYPE llm_proxy_gc_runs_total counter
llm_proxy_gc_runs_total 42

Prometheus Scrape Configuration

Add the following to your Prometheus configuration:

scrape_configs:
  - job_name: 'llm-proxy'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics/prometheus'
    scrape_interval: 15s

Example Queries

# Request rate (per second)
rate(llm_proxy_requests_total[5m])

# Error rate
rate(llm_proxy_errors_total[5m]) / rate(llm_proxy_requests_total[5m])

# Cache hit ratio
llm_proxy_cache_hits_total / (llm_proxy_cache_hits_total + llm_proxy_cache_misses_total)

# Total uptime in hours
llm_proxy_uptime_seconds / 3600

# Memory usage trend
rate(llm_proxy_memory_total_alloc_bytes[5m])

# Heap allocation
llm_proxy_memory_heap_alloc_bytes

# GC frequency
rate(llm_proxy_gc_runs_total[5m])

# Active goroutines
llm_proxy_goroutines

Testing

# Check Prometheus metrics
curl http://localhost:8080/metrics/prometheus

# Compare with JSON format
curl http://localhost:8080/metrics | jq .

Grafana Dashboard

A ready-to-import Grafana dashboard is available for visualizing LLM Proxy metrics. The dashboard includes panels for:

  • Request rate, error rate, and uptime
  • Cache performance (hits, misses, bypass, stores)
  • Memory usage and Go runtime metrics
  • Garbage collection statistics

Import the dashboard into Grafana and configure it to use your Prometheus datasource.

Notes

  • The Prometheus endpoint is lightweight and has no external dependencies
  • Metrics are in-memory and reset on server restart
  • Both JSON and Prometheus endpoints can be used simultaneously
  • No secrets are exposed in metrics output

Important: In-Memory vs. Redis Event Bus

  • The in-memory event bus only works within a single process. If you run the proxy and dispatcher as separate processes or containers, they will not share events.
  • For distributed, multi-process, or containerized setups, Redis is required as the event bus backend.

Local Redis Setup for Manual Testing

Add the following to your docker-compose.yml to run Redis locally:

redis:
  image: redis:7
  container_name: llm-proxy-redis
  ports:
    - "6379:6379"
  restart: unless-stopped

Configure both the proxy and dispatcher to use Redis Streams:

LLM_PROXY_EVENT_BUS=redis-streams llm-proxy server ...
LLM_PROXY_EVENT_BUS=redis-streams llm-proxy dispatcher ...

This enables full async event delivery and observability pipeline testing across processes.

For production deployments requiring guaranteed delivery and at-least-once semantics, use the Redis Streams backend. It provides:

  • Consumer Groups: Multiple dispatcher instances can share the workload
  • Acknowledgment: Messages are only removed after successful processing
  • Crash Recovery: Pending messages from crashed consumers are automatically claimed
  • Durable Storage: Messages persist until acknowledged, surviving restarts

Enabling Redis Streams

Set the event bus backend to redis-streams:

LLM_PROXY_EVENT_BUS=redis-streams llm-proxy server ...

Configuration Options

| Environment Variable | Description | Default |
|----------------------|-------------|---------|
| LLM_PROXY_EVENT_BUS | Event bus backend | redis-streams |
| REDIS_ADDR | Redis server address | localhost:6379 |
| REDIS_DB | Redis database number | 0 |
| REDIS_STREAM_KEY | Stream key name | llm-proxy-events |
| REDIS_CONSUMER_GROUP | Consumer group name | llm-proxy-dispatchers |
| REDIS_CONSUMER_NAME | Consumer name (unique per instance) | Auto-generated |
| REDIS_STREAM_MAX_LEN | Max stream length (0 = unlimited) | 10000 |
| REDIS_STREAM_BLOCK_TIME | Block timeout for reading | 5s |
| REDIS_STREAM_CLAIM_TIME | Min idle time before claiming pending messages | 30s |
| REDIS_STREAM_BATCH_SIZE | Batch size for reading messages | 100 |

Example Configuration

# Full Redis Streams configuration
export LLM_PROXY_EVENT_BUS=redis-streams
export REDIS_ADDR=redis.example.com:6379
export REDIS_DB=0
export REDIS_STREAM_KEY=llm-proxy-events
export REDIS_CONSUMER_GROUP=dispatchers
export REDIS_CONSUMER_NAME=dispatcher-1  # Set unique name per instance
export REDIS_STREAM_MAX_LEN=50000
export REDIS_STREAM_BLOCK_TIME=5s
export REDIS_STREAM_CLAIM_TIME=30s
export REDIS_STREAM_BATCH_SIZE=100

llm-proxy server

How It Works

  1. Publishing: Events are added to the stream via XADD with automatic ID generation
  2. Consumer Groups: Dispatchers join a consumer group and read via XREADGROUP
  3. Acknowledgment: After successful processing, messages are acknowledged via XACK
  4. Recovery: If a consumer crashes, its pending messages are claimed by other consumers after REDIS_STREAM_CLAIM_TIME
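
A compressed sketch of that loop using the go-redis client (github.com/redis/go-redis/v9): the stream, group, and consumer names match the defaults above, while the process function and the error handling are simplified placeholders rather than the real dispatcher code.

package main

import (
  "context"
  "log"
  "time"

  "github.com/redis/go-redis/v9"
)

// process stands in for delivering a decoded event to the configured backend.
func process(values map[string]interface{}) { log.Printf("event: %v", values) }

func main() {
  ctx := context.Background()
  rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

  // Ensure the consumer group exists; MKSTREAM creates the stream if missing.
  _ = rdb.XGroupCreateMkStream(ctx, "llm-proxy-events", "llm-proxy-dispatchers", "$").Err()

  for {
    streams, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
      Group:    "llm-proxy-dispatchers",
      Consumer: "dispatcher-1",
      Streams:  []string{"llm-proxy-events", ">"}, // ">" = only new, undelivered messages
      Count:    100,
      Block:    5 * time.Second,
    }).Result()
    if err == redis.Nil {
      continue // no new messages within the block window
    }
    if err != nil {
      log.Printf("read: %v", err)
      continue
    }
    for _, s := range streams {
      for _, msg := range s.Messages {
        process(msg.Values)
        // Acknowledge only after successful processing (at-least-once delivery).
        rdb.XAck(ctx, "llm-proxy-events", "llm-proxy-dispatchers", msg.ID)
      }
    }
  }
}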

Multiple Dispatcher Instances

Redis Streams supports running multiple dispatcher instances that share the workload:

# Instance 1
REDIS_CONSUMER_NAME=dispatcher-1 llm-proxy dispatcher --service lunary

# Instance 2 (on another host or container)
REDIS_CONSUMER_NAME=dispatcher-2 llm-proxy dispatcher --service lunary

Each message is delivered to exactly one consumer in the group. If a consumer fails, its pending messages are automatically reassigned.

Multiple Dispatcher Services (Fan-out)

If you want multiple backends (e.g. file and helicone) to each receive 100% of events, do not run them in the same consumer group.

  • Same REDIS_CONSUMER_GROUP across multiple dispatcher services = load balancing (each event goes to only one service)
  • Different REDIS_CONSUMER_GROUP per service = fan-out (each service reads the full stream independently)

Example:

# File logger consumes all events
REDIS_CONSUMER_GROUP=llm-proxy-dispatchers-file \
  llm-proxy dispatcher --service file --endpoint events.jsonl

# Helicone logger also consumes all events
REDIS_CONSUMER_GROUP=llm-proxy-dispatchers-helicone \
  llm-proxy dispatcher --service helicone --api-key $HELICONE_API_KEY

Redis Streams vs In-Memory

| Feature | In-Memory | Redis Streams |
|---------|-----------|---------------|
| Delivery guarantee | None (buffer overflow drops events) | At-least-once |
| Processes | Single process only | Distributed across multiple processes/hosts |
| Consumer groups | No | Yes |
| Multiple dispatchers | No | Yes (events distributed via consumer groups) |
| Crash recovery | No | Yes (pending message claiming) |
| Acknowledgment | No | Yes |
| Recommended for | Development, local testing | Production, high reliability |

Redis Streams Rollout Checklist

Use this checklist when enabling Redis Streams in new environments:

Prerequisites

  • Redis server accessible from all proxy and dispatcher instances
  • MANAGEMENT_TOKEN configured for admin operations

Configuration

  • Set LLM_PROXY_EVENT_BUS=redis-streams on proxy and dispatcher
  • Set REDIS_ADDR to your Redis server address
  • Set REDIS_STREAM_KEY (default: llm-proxy-events)
  • Set REDIS_CONSUMER_GROUP (default: llm-proxy-dispatchers)
  • Configure REDIS_STREAM_MAX_LEN based on expected throughput (default: 10000)

Verification

  • Verify consumer group exists: redis-cli XINFO GROUPS llm-proxy-events
  • Check stream length: redis-cli XLEN llm-proxy-events
  • Monitor pending count: redis-cli XPENDING llm-proxy-events llm-proxy-dispatchers
  • Verify dispatcher is consuming: check logs for “Using Redis Streams event bus”
  • Confirm events are being acknowledged: pending count should remain stable or decrease

Monitoring

  • Set up alerts for pending count > 1000 (indicates dispatcher lag)
  • Monitor stream length to ensure it stays below max length
  • Track dispatcher health endpoint for lag warnings
  • Monitor dispatcher logs for claim/recovery messages

Troubleshooting

High Pending Count:

  • Increase REDIS_STREAM_BATCH_SIZE (default: 100)
  • Reduce REDIS_STREAM_CLAIM_TIME to claim stuck messages faster (default: 30s)
  • Scale horizontally: add more dispatcher instances (they share workload via consumer group)
  • Check dispatcher logs for errors or slow backend API calls

Stream Length Growing:

  • Increase REDIS_STREAM_MAX_LEN if losing events due to trimming
  • Ensure dispatchers are running and healthy
  • Check that dispatchers are acknowledging messages (XACK)

References

  • See internal/middleware/instrumentation.go for the middleware implementation.
  • See internal/eventbus/eventbus.go for the event bus interface and in-memory backend.
  • See internal/dispatcher/ for the pluggable dispatcher architecture.
  • See docs/issues/done/phase-5-generic-async-middleware.md for the original design issue.
  • See docs/issues/done/phase-5-event-dispatcher-service.md for the dispatcher design.

For questions or advanced integration, open an issue or see the code comments for extension points.

