HTTP Response Caching Strategy

This document outlines the HTTP response caching system implemented in the LLM proxy. The system is a Redis-backed shared cache that follows standard HTTP caching semantics to reduce latency and upstream load.

Overview

The proxy implements a Redis-backed caching system that reduces latency and load on target APIs. The cache respects HTTP cache control headers and provides an optional in-memory fallback for development environments.

Goals

  1. Reduce Latency: Serve cached responses when appropriate to minimize response time
  2. Reduce API Costs: Minimize redundant API calls to save on usage costs
  3. Improve Reliability: Provide responses even during brief target API outages
  4. Balance Freshness: Allow fine-grained control over cache TTL per endpoint and request type

Cache Architecture

Redis as Cache Store

Redis serves as the primary cache store for the following reasons:

  1. Performance: In-memory operation with sub-millisecond response times
  2. Distributed: Supports clustered proxy deployments seamlessly
  3. TTL Support: Built-in time-based expiration
  4. Data Structures: Rich data structures for flexible caching patterns
  5. Persistence: Optional persistence for cache warming after restarts

Cache Keys

Cache keys are constructed with a deterministic algorithm based on:

  1. Request Path: The API endpoint being called
  2. Request Method: GET, POST, etc.
  3. Request Parameters: Query parameters and/or request body (normalized)
  4. Project ID: To isolate caches between different projects

Example key format:

cache:v1:{project_id}:{endpoint}:{method}:{hash_of_parameters}
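
A minimal sketch of key generation under this scheme. The normalizedParams argument stands in for the canonicalized query/body bytes, and SHA-256 is an assumed choice of hash, not one the document specifies:

package cache

import (
    "crypto/sha256"
    "fmt"
)

// generateCacheKey assembles the key format shown above. The caller is
// expected to pass parameters in a canonical form (e.g. sorted query
// keys, stable JSON encoding) so equivalent requests hash identically.
func generateCacheKey(projectID, endpoint, method string, normalizedParams []byte) string {
    sum := sha256.Sum256(normalizedParams)
    return fmt.Sprintf("cache:v1:%s:%s:%s:%x", projectID, endpoint, method, sum)
}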

Cache Values

Cache entries store:

  1. Response Body: The serialized response
  2. Response Headers: Relevant headers from the original response
  3. Cache Metadata:
    • Timestamp of original request
    • TTL information
    • Hit count
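
One plausible Go shape for an entry. The field names are illustrative, not the proxy's actual struct, and the JSON tags assume entries are serialized as JSON in Redis, which is an implementation guess:

package cache

import (
    "net/http"
    "time"
)

// CacheEntry bundles the response body, the relevant upstream headers,
// and the metadata listed above.
type CacheEntry struct {
    Body       []byte      `json:"body"`        // serialized response body
    Headers    http.Header `json:"headers"`     // relevant upstream headers
    StoredAt   time.Time   `json:"stored_at"`   // timestamp of original request
    TTLSeconds int         `json:"ttl_seconds"` // TTL information
    HitCount   int64       `json:"hit_count"`   // hit count
}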

Caching Policies

Cacheable Endpoints

Not all endpoints are suitable for caching. By default:

  1. Cacheable:
    • /v1/models (list models)
    • /v1/embeddings (vector embeddings)
    • Other read-only, deterministic endpoints
  2. Not Cacheable by Default:
    • /v1/chat/completions (unless explicitly enabled)
    • Any streaming endpoints
    • Endpoints with side effects
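
Expressed as code, the default policy might look like the sketch below. The helper name and hard-coded paths are illustrative; real logic would also consult configuration and method-specific rules:

package cache

import "net/http"

// isCacheableByDefault reflects the defaults above: read-only,
// deterministic endpoints are candidates, while chat completions and
// streaming or side-effecting requests are skipped unless explicitly
// enabled elsewhere.
func isCacheableByDefault(r *http.Request) bool {
    switch r.URL.Path {
    case "/v1/models", "/v1/embeddings":
        return true
    case "/v1/chat/completions":
        return false // cacheable only when explicitly enabled
    default:
        return false
    }
}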

TTL Configuration

TTL (Time-To-Live) is determined through a precedence hierarchy:

  1. Response Headers (highest precedence):
    • s-maxage directive (shared cache specific)
    • max-age directive (general cache TTL)
  2. Default TTL (fallback):
    • HTTP_CACHE_DEFAULT_TTL environment variable
    • Used when upstream permits caching but doesn’t specify TTL
    • Default: 300 seconds (5 minutes)
  3. Client-Forced Caching:
    • When client sends Cache-Control: public, max-age=N on request
    • Used for POST requests or when upstream lacks cache directives
    • Enables cache testing and benchmarking scenarios
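
This precedence can be captured in a small resolver. The sketch below parses only the directives it needs (it is not a full Cache-Control parser) and assumes no-store/private have already been handled:

package cache

import (
    "strconv"
    "strings"
    "time"
)

// directiveSeconds extracts "name=N" from a Cache-Control value,
// returning -1 when the directive is absent or malformed.
func directiveSeconds(cacheControl, name string) int {
    for _, part := range strings.Split(cacheControl, ",") {
        if v, ok := strings.CutPrefix(strings.TrimSpace(part), name+"="); ok {
            if n, err := strconv.Atoi(v); err == nil {
                return n
            }
        }
    }
    return -1
}

// resolveTTL applies the hierarchy above: response s-maxage, then
// response max-age, then the configured default when the upstream
// permits caching without a TTL, then a client-forced max-age.
func resolveTTL(respCC, reqCC string, defaultTTL time.Duration) time.Duration {
    if n := directiveSeconds(respCC, "s-maxage"); n >= 0 {
        return time.Duration(n) * time.Second
    }
    if n := directiveSeconds(respCC, "max-age"); n >= 0 {
        return time.Duration(n) * time.Second
    }
    if strings.Contains(respCC, "public") {
        return defaultTTL // upstream permits caching but gave no TTL
    }
    // Client-forced caching: request carried "Cache-Control: public, max-age=N".
    if strings.Contains(reqCC, "public") {
        if n := directiveSeconds(reqCC, "max-age"); n >= 0 {
            return time.Duration(n) * time.Second
        }
    }
    return 0 // not cacheable
}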

Cache Invalidation

Current invalidation mechanisms:

  1. Time-Based Expiration: Automatic expiration via Redis TTL
  2. Manual Purge: Planned for a future management API
  3. Size Limits: Objects exceeding HTTP_CACHE_MAX_OBJECT_BYTES are not cached
  4. Cache Control Directives: no-store and private bypass caching entirely
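
The directive bypass and the size limit combine naturally into the store-side check. A sketch, with the function name chosen for illustration:

package cache

import (
    "net/http"
    "strings"
)

// shouldStore enforces two of the mechanisms above: no-store and
// private bypass caching entirely, and bodies larger than
// HTTP_CACHE_MAX_OBJECT_BYTES are never cached.
func shouldStore(h http.Header, bodyLen, maxObjectBytes int) bool {
    cc := strings.ToLower(h.Get("Cache-Control"))
    if strings.Contains(cc, "no-store") || strings.Contains(cc, "private") {
        return false
    }
    return bodyLen <= maxObjectBytes
}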

Implementation Approach

Cache Middleware

A caching middleware sits in the proxy middleware chain and is responsible for:

  1. Cache Lookup: Check if a valid cache entry exists
  2. Cache Serving: Return cached response if available
  3. Cache Storage: Store new responses in cache
  4. Headers Management: Handle cache-related headers

A sketch of the middleware (helpers such as lookupCache and storeInCache are elided here):

func CachingMiddleware(cacheClient *redis.Client, config CacheConfig) Middleware {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Skip caching for non-cacheable requests
            if !isCacheable(r, config) {
                next.ServeHTTP(w, r)
                return
            }

            // Generate cache key
            cacheKey := generateCacheKey(r)
            
            // Check cache
            if cacheEntry, found := lookupCache(cacheClient, cacheKey); found {
                // Serve from cache
                serveFromCache(w, cacheEntry)
                recordCacheHit(cacheKey)
                return
            }
            
            // Cache miss - proceed with the request, using a response
            // recorder that writes through to the client while capturing
            // the response for possible storage
            recorder := newResponseRecorder(w)
            next.ServeHTTP(recorder, r)
            
            // Store response in cache if appropriate
            if shouldCache(recorder, config) {
                storeInCache(cacheClient, cacheKey, recorder, config)
            }
            
            recordCacheMiss(cacheKey)
        })
    }
}
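
Note that recordCacheMiss fires even when the response is subsequently stored: the request itself was a miss, and the stored entry only benefits later requests.

For wiring, assuming go-redis as the client library and that Middleware is the usual func(http.Handler) http.Handler alias, usage might look like this (the upstream target is a placeholder):

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"

    "github.com/redis/go-redis/v9"
)

func main() {
    // Placeholder upstream; the real proxy routes to configured targets.
    target, err := url.Parse("https://api.example.com")
    if err != nil {
        log.Fatal(err)
    }
    upstream := httputil.NewSingleHostReverseProxy(target)

    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    handler := CachingMiddleware(rdb, CacheConfig{})(upstream)

    log.Fatal(http.ListenAndServe(":8080", handler))
}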

Streaming Response Handling

Streaming responses require special consideration:

  1. No Caching by Default: Streaming responses won’t be cached by default
  2. Optional Caching: Configuration option to cache the complete aggregated response
  3. Partial Caching: Cache initial parts of responses if appropriate
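
A tee-style writer is one way to implement the optional aggregated caching above. This is a sketch; a real implementation would also respect the size cap and abandon capture on error:

package cache

import (
    "bytes"
    "net/http"
)

// streamRecorder forwards every chunk to the client immediately while
// accumulating a copy, so the complete response can be cached once the
// stream finishes. The middleware would read buf after ServeHTTP returns.
type streamRecorder struct {
    http.ResponseWriter
    buf    bytes.Buffer
    status int
}

func (s *streamRecorder) WriteHeader(code int) {
    s.status = code
    s.ResponseWriter.WriteHeader(code)
}

func (s *streamRecorder) Write(p []byte) (int, error) {
    s.buf.Write(p) // capture for the cache
    n, err := s.ResponseWriter.Write(p)
    if f, ok := s.ResponseWriter.(http.Flusher); ok {
        f.Flush() // preserve streaming semantics for the client
    }
    return n, err
}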

Cache Control Headers

The proxy respects and emits standard cache control headers:

  1. Support for Client Headers:
    • Cache-Control: no-cache - Verify with origin before using cached copy
    • Cache-Control: no-store - Skip caching entirely
  2. Adding Response Headers:
    • X-Cache: HIT/MISS - Indicate cache result (exposed as X-PROXY-CACHE in the current implementation; see Implementation Status below)
    • Age - Time since response was generated
    • Standard Cache-Control headers
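
Serving a hit might then decorate the response roughly as follows. CacheEntry is the illustrative struct from the Cache Values section, and the X-Cache name follows this section; the shipped implementation reports X-PROXY-CACHE and Cache-Status instead:

package cache

import (
    "net/http"
    "strconv"
    "time"
)

// writeCachedResponse serves a hit with the headers described above:
// the preserved upstream headers, an Age computed from the entry's
// store time, and the cache-result indicator.
func writeCachedResponse(w http.ResponseWriter, e *CacheEntry) {
    for name, values := range e.Headers {
        for _, v := range values {
            w.Header().Add(name, v)
        }
    }
    w.Header().Set("Age", strconv.Itoa(int(time.Since(e.StoredAt).Seconds())))
    w.Header().Set("X-Cache", "HIT")
    w.WriteHeader(http.StatusOK)
    w.Write(e.Body)
}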

Performance Considerations

  1. Cache Size Limits: Configurable maximum cache size with LRU eviction
  2. Memory Pressure: Monitoring of Redis memory usage
  3. Cache Warmup: Option to pre-populate the cache for frequently used requests
  4. Compression: Optional compression for large responses

Metrics and Monitoring

Comprehensive metrics will be collected:

  1. Hit Rate: Overall and per-endpoint cache hit percentage
  2. Latency Improvement: Time saved by serving from cache
  3. Cache Size: Current cache size and item count
  4. Evictions: Count of cache entries evicted due to memory pressure
  5. TTL Distribution: Histogram of remaining TTL for cached entries
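
If Prometheus were the metrics backend (an assumption; the document does not name one), declarations for a subset of these metrics might look like:

package cache

import "github.com/prometheus/client_golang/prometheus"

// Illustrative declarations covering hit rate and evictions; the
// remaining metrics above would follow the same pattern.
var (
    cacheHits = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "proxy_cache_hits_total", Help: "Cache hits served."},
        []string{"endpoint"},
    )
    cacheMisses = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "proxy_cache_misses_total", Help: "Cache misses."},
        []string{"endpoint"},
    )
    cacheEvictions = prometheus.NewCounter(
        prometheus.CounterOpts{Name: "proxy_cache_evictions_total", Help: "Entries evicted under memory pressure."},
    )
)

func init() {
    prometheus.MustRegister(cacheHits, cacheMisses, cacheEvictions)
}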

Security Considerations

  1. Isolation: Strict isolation between projects’ cache entries
  2. Sensitive Data: Option to exclude sensitive data from caching
  3. Redis Authentication: Required Redis authentication
  4. Transport Security: Encrypted communication with Redis
  5. Response Validation: Validation of cached responses before serving

Implementation Status

The core caching system has been implemented and is available in the proxy:

✅ Implemented Features

  1. Redis-backed Caching
    • Primary Redis backend with in-memory fallback
    • Configurable via HTTP_CACHE_BACKEND environment variable
    • Redis connection and key prefix configuration
  2. HTTP Standards Compliance
    • Respects Cache-Control directives (no-store, private, public, max-age, s-maxage)
    • Honors Authorization header behavior for shared cache semantics
    • Supports conditional requests with ETag and Last-Modified validators
    • TTL derivation with s-maxage precedence over max-age
  3. Request Method Support
    • GET/HEAD caching by default when upstream permits
    • Optional POST caching when client opts in via request Cache-Control
    • Conservative Vary handling with subset of request headers
  4. Streaming Response Support
    • Captures streaming responses while serving to client
    • Stores complete response after streaming completion
    • Subsequent requests serve from cache immediately
  5. Observability Integration
    • Response headers: X-PROXY-CACHE, X-PROXY-CACHE-KEY, Cache-Status
    • Event bus bypass for cache hits (performance optimization)
    • Cache misses and stores published to event bus
  6. Configuration and Tooling
    • Environment variable configuration
    • Benchmark CLI with cache testing flags (--cache, --cache-ttl, --method)
    • Size limits and TTL controls

🔄 Future Enhancements

  1. Advanced Cache Control
    • stale-while-revalidate and stale-if-error support
    • Full per-response Vary header handling
    • Upstream conditional revalidation (If-None-Match/If-Modified-Since)
  2. Management Features
    • Cache purge endpoints and CLI commands
    • Metrics for hits/misses/bypass/store rates
    • Cache warming strategies
  3. Performance Optimizations
    • Response compression for large objects
    • Bounded L1 in-memory cache layer
    • Cache key optimization

Configuration

Enable caching with environment variables:

HTTP_CACHE_ENABLED=true
HTTP_CACHE_BACKEND=redis
REDIS_ADDR=localhost:6379
REDIS_DB=0
REDIS_CACHE_KEY_PREFIX=llmproxy:cache:
HTTP_CACHE_MAX_OBJECT_BYTES=1048576
HTTP_CACHE_DEFAULT_TTL=300
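
For illustration, these variables might map onto the middleware's configuration roughly as follows; the struct shape is a guess, not the proxy's actual CacheConfig:

package cache

import (
    "os"
    "strconv"
    "time"
)

// envCacheConfig is an illustrative mirror of the variables above.
type envCacheConfig struct {
    Enabled        bool
    Backend        string // "redis" or the in-memory fallback
    KeyPrefix      string
    MaxObjectBytes int
    DefaultTTL     time.Duration
}

func loadCacheConfigFromEnv() envCacheConfig {
    ttl, _ := strconv.Atoi(os.Getenv("HTTP_CACHE_DEFAULT_TTL"))           // seconds
    maxBytes, _ := strconv.Atoi(os.Getenv("HTTP_CACHE_MAX_OBJECT_BYTES")) // bytes
    return envCacheConfig{
        Enabled:        os.Getenv("HTTP_CACHE_ENABLED") == "true",
        Backend:        os.Getenv("HTTP_CACHE_BACKEND"),
        KeyPrefix:      os.Getenv("REDIS_CACHE_KEY_PREFIX"),
        MaxObjectBytes: maxBytes,
        DefaultTTL:     time.Duration(ttl) * time.Second,
    }
}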

See the API Configuration Guide for complete configuration details.