HTTP Response Caching Strategy
This document outlines the HTTP response caching system implemented in the LLM proxy. The caching system provides significant performance improvements through Redis-backed shared caching with HTTP standards compliance.
Overview
The proxy implements a Redis-backed caching system that improves performance, reduces latency, and decreases load on target APIs. The caching system respects HTTP cache control headers and provides optional in-memory fallback for development environments.
Goals
- Reduce Latency: Serve cached responses when appropriate to minimize response time
- Reduce API Costs: Minimize redundant API calls to save on usage costs
- Improve Reliability: Provide responses even during brief target API outages
- Balance Freshness: Allow fine-grained control over cache TTL per endpoint and request type
Cache Architecture
Redis as Cache Store
Redis will be used as the primary cache store for the following reasons:
- Performance: In-memory operation with sub-millisecond response times
- Distributed: Supports clustered proxy deployments seamlessly
- TTL Support: Built-in time-based expiration
- Data Structures: Rich data structures for flexible caching patterns
- Persistence: Optional persistence for cache warming after restarts
Cache Keys
Cache keys will be constructed using a deterministic algorithm based on:
- Request Path: The API endpoint being called
- Request Method: GET, POST, etc.
- Request Parameters: Query parameters and/or request body (normalized)
- Project ID: To isolate caches between different projects
Example key format:
cache:v1:{project_id}:{endpoint}:{method}:{hash_of_parameters}
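For illustration, a key in this format could be produced by hashing the normalized parameters; the function and argument names below are hypothetical sketches, not the proxy's actual code.

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// generateCacheKey builds a deterministic key from the request attributes
// listed above; the normalized parameters are hashed so keys stay bounded
// in size regardless of request body length.
func generateCacheKey(projectID, endpoint, method string, normalizedParams []byte) string {
    sum := sha256.Sum256(normalizedParams)
    return fmt.Sprintf("cache:v1:%s:%s:%s:%s",
        projectID, endpoint, method, hex.EncodeToString(sum[:]))
}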
Cache Values
Cache entries will store:
- Response Body: The serialized response
- Response Headers: Relevant headers from the original response
- Cache Metadata:
  - Timestamp of original request
  - TTL information
  - Hit count
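A cache value along these lines might be modeled as a small struct that is serialized (for example to JSON) before being written to Redis; the field names below are illustrative assumptions.

import (
    "net/http"
    "time"
)

// cacheEntry is one possible shape for a stored cache value.
type cacheEntry struct {
    Body     []byte        `json:"body"`      // serialized response body
    Headers  http.Header   `json:"headers"`   // relevant upstream headers
    CachedAt time.Time     `json:"cached_at"` // timestamp of the original request
    TTL      time.Duration `json:"ttl"`       // TTL applied when stored
    HitCount int64         `json:"hit_count"` // times served from cache
}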
Caching Policies
Cacheable Endpoints
Not all endpoints are suitable for caching. By default:
- Cacheable:
  - /v1/models (list models)
  - /v1/embeddings (vector embeddings)
  - Other read-only, deterministic endpoints
- Not Cacheable by Default:
  - /v1/chat/completions (unless explicitly enabled)
  - Any streaming endpoints
  - Endpoints with side effects
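As a rough sketch of this default policy (the paths come from the list above, but the function name and the opt-in flag are assumptions):

// isCacheableEndpoint expresses the default policy above: read-only,
// deterministic endpoints are cacheable; chat completions only when caching
// is explicitly enabled; everything else is skipped.
func isCacheableEndpoint(path string, cacheChatCompletions bool) bool {
    switch path {
    case "/v1/models", "/v1/embeddings":
        return true
    case "/v1/chat/completions":
        return cacheChatCompletions // opt-in only
    default:
        return false
    }
}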
TTL Configuration
TTL (Time-To-Live) is determined through a precedence hierarchy:
- Response Headers (highest precedence):
  - s-maxage directive (shared cache specific)
  - max-age directive (general cache TTL)
- Default TTL (fallback):
  - HTTP_CACHE_DEFAULT_TTL environment variable
  - Used when upstream permits caching but doesn't specify a TTL
  - Default: 300 seconds (5 minutes)
- Client-Forced Caching:
  - When the client sends Cache-Control: public, max-age=N on the request
  - Used for POST requests or when the upstream lacks cache directives
  - Enables cache testing and benchmarking scenarios
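A minimal sketch of this precedence, assuming the TTL is read from the upstream response's Cache-Control header; the helper names are invented for illustration.

import (
    "net/http"
    "strconv"
    "strings"
    "time"
)

// deriveTTL applies the precedence above: s-maxage first, then max-age,
// then the configured default (HTTP_CACHE_DEFAULT_TTL).
func deriveTTL(respHeaders http.Header, defaultTTL time.Duration) time.Duration {
    cc := respHeaders.Get("Cache-Control")
    if secs, ok := directiveSeconds(cc, "s-maxage"); ok {
        return time.Duration(secs) * time.Second
    }
    if secs, ok := directiveSeconds(cc, "max-age"); ok {
        return time.Duration(secs) * time.Second
    }
    return defaultTTL
}

// directiveSeconds extracts an integer directive value such as "max-age=300"
// from a Cache-Control header.
func directiveSeconds(cc, name string) (int, bool) {
    for _, part := range strings.Split(cc, ",") {
        part = strings.TrimSpace(part)
        if strings.HasPrefix(part, name+"=") {
            if secs, err := strconv.Atoi(strings.TrimPrefix(part, name+"=")); err == nil {
                return secs, true
            }
        }
    }
    return 0, false
}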
Cache Invalidation
Current invalidation mechanisms:
- Time-Based Expiration: Automatic expiration via Redis TTL
- Manual Purge: Planned for future management API
- Size Limits: Objects exceeding HTTP_CACHE_MAX_OBJECT_BYTES are not cached
- Cache Control Directives: no-store and private bypass caching entirely
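The store-side guards implied by these rules might look roughly like this (function and parameter names are assumptions):

import (
    "net/http"
    "strings"
)

// shouldStore applies the guards above: never cache responses marked
// no-store or private, and skip objects larger than the configured
// HTTP_CACHE_MAX_OBJECT_BYTES limit.
func shouldStore(respHeaders http.Header, bodySize, maxObjectBytes int) bool {
    cc := strings.ToLower(respHeaders.Get("Cache-Control"))
    if strings.Contains(cc, "no-store") || strings.Contains(cc, "private") {
        return false
    }
    return bodySize <= maxObjectBytes
}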
Implementation Approach
Cache Middleware
A caching middleware will be added to the proxy middleware chain, responsible for:
- Cache Lookup: Check if a valid cache entry exists
- Cache Serving: Return cached response if available
- Cache Storage: Store new responses in cache
- Headers Management: Handle cache-related headers
func CachingMiddleware(cacheClient *redis.Client, config CacheConfig) Middleware {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Skip caching for non-cacheable requests
            if !isCacheable(r, config) {
                next.ServeHTTP(w, r)
                return
            }

            // Generate cache key
            cacheKey := generateCacheKey(r)

            // Check cache
            if cacheEntry, found := lookupCache(cacheClient, cacheKey); found {
                // Serve from cache
                serveFromCache(w, cacheEntry)
                recordCacheHit(cacheKey)
                return
            }

            // Cache miss - proceed with request.
            // Use a response recorder to capture the response.
            recorder := newResponseRecorder(w)
            next.ServeHTTP(recorder, r)

            // Store response in cache if appropriate
            if shouldCache(recorder, config) {
                storeInCache(cacheClient, cacheKey, recorder, config)
            }
            recordCacheMiss(cacheKey)
        })
    }
}
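The response recorder used above is not defined in this document; one plausible shape is a pass-through http.ResponseWriter that keeps a copy of whatever it forwards (a sketch, not the proxy's actual type):

import (
    "bytes"
    "net/http"
)

// responseRecorder forwards writes to the real ResponseWriter while keeping
// a copy of the status code and body for later storage in the cache.
type responseRecorder struct {
    http.ResponseWriter
    status int
    body   bytes.Buffer
}

func newResponseRecorder(w http.ResponseWriter) *responseRecorder {
    return &responseRecorder{ResponseWriter: w, status: http.StatusOK}
}

func (r *responseRecorder) WriteHeader(status int) {
    r.status = status
    r.ResponseWriter.WriteHeader(status)
}

func (r *responseRecorder) Write(p []byte) (int, error) {
    r.body.Write(p) // keep a copy for the cache entry
    return r.ResponseWriter.Write(p)
}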
Streaming Response Handling
Streaming responses require special consideration:
- No Caching by Default: Streaming responses won’t be cached by default
- Optional Caching: Configuration option to cache the complete aggregated response
- Partial Caching: Cache initial parts of responses if appropriate
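To keep streamed chunks flowing to the client while a complete copy accumulates for optional caching, a recorder like the one sketched above could also pass through Flush calls (an illustrative extension of that sketch):

// Flush forwards streaming flushes to the client so chunks are delivered
// immediately, while the recorder's buffer keeps accumulating the complete
// response for storage once streaming finishes.
func (r *responseRecorder) Flush() {
    if f, ok := r.ResponseWriter.(http.Flusher); ok {
        f.Flush()
    }
}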
Cache Control Headers
The proxy will respect and implement standard cache control headers:
- Support for Client Headers:
  - Cache-Control: no-cache - Verify with origin before using the cached copy
  - Cache-Control: no-store - Skip caching entirely
- Adding Response Headers:
  - X-Cache: HIT/MISS - Indicates the cache result
  - Age - Time since the response was generated
  - Standard Cache-Control headers
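Serving a hit could attach these headers roughly as follows; this sketch reuses the cacheEntry shape sketched earlier and the X-Cache header named in this section (the implemented proxy reports X-PROXY-CACHE headers, as noted under Implementation Status):

import (
    "net/http"
    "strconv"
    "time"
)

// serveFromCache writes a stored entry back to the client, marking the
// response as a cache hit and reporting its age in seconds.
func serveFromCache(w http.ResponseWriter, entry *cacheEntry) {
    for name, values := range entry.Headers {
        for _, v := range values {
            w.Header().Add(name, v)
        }
    }
    w.Header().Set("X-Cache", "HIT")
    w.Header().Set("Age", strconv.Itoa(int(time.Since(entry.CachedAt).Seconds())))
    w.WriteHeader(http.StatusOK)
    w.Write(entry.Body)
}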
Performance Considerations
- Cache Size Limits: Configurable maximum cache size with LRU eviction
- Memory Pressure: Monitoring of Redis memory usage
- Cache Warmup: Option to pre-populate cache for frequently used requests
- Compression: Optional compression for large responses
Metrics and Monitoring
Comprehensive metrics will be collected:
- Hit Rate: Overall and per-endpoint cache hit percentage
- Latency Improvement: Time saved by serving from cache
- Cache Size: Current cache size and item count
- Evictions: Count of cache entries evicted due to memory pressure
- TTL Distribution: Histogram of remaining TTL for cached entries
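If these were exported as Prometheus counters (an assumption; the document does not name a metrics backend), per-endpoint hit and miss counts might be defined like this, with the hit rate derived at query time:

import "github.com/prometheus/client_golang/prometheus"

// Per-endpoint hit and miss counters; the hit rate is hits / (hits + misses),
// computed at query time. Metric names are illustrative.
var (
    cacheHits = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "proxy_cache_hits_total", Help: "Cache hits by endpoint."},
        []string{"endpoint"},
    )
    cacheMisses = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "proxy_cache_misses_total", Help: "Cache misses by endpoint."},
        []string{"endpoint"},
    )
)

func init() {
    prometheus.MustRegister(cacheHits, cacheMisses)
}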
Security Considerations
- Isolation: Strict isolation between projects’ cache entries
- Sensitive Data: Option to exclude sensitive data from caching
- Redis Authentication: Required Redis authentication
- Transport Security: Encrypted communication with Redis
- Response Validation: Validation of cached responses before serving
Implementation Status
The core caching system has been implemented and is available in the proxy:
✅ Implemented Features
- Redis-backed Caching
  - Primary Redis backend with in-memory fallback
  - Configurable via HTTP_CACHE_BACKEND environment variable
  - Redis connection and key prefix configuration
- HTTP Standards Compliance
  - Respects Cache-Control directives (no-store, private, public, max-age, s-maxage)
  - Honors Authorization header behavior for shared cache semantics
  - Supports conditional requests with ETag and Last-Modified validators
  - TTL derivation with s-maxage precedence over max-age
- Request Method Support
  - GET/HEAD caching by default when upstream permits
  - Optional POST caching when the client opts in via request Cache-Control
  - Conservative Vary handling with a subset of request headers
- Streaming Response Support
  - Captures streaming responses while serving to the client
  - Stores the complete response after streaming completes
  - Subsequent requests serve from cache immediately
- Observability Integration
  - Response headers: X-PROXY-CACHE, X-PROXY-CACHE-KEY, Cache-Status
  - Event bus bypass for cache hits (performance optimization)
  - Cache misses and stores published to the event bus
- Configuration and Tooling
  - Environment variable configuration
  - Benchmark CLI with cache testing flags (--cache, --cache-ttl, --method)
  - Size limits and TTL controls
🔄 Future Enhancements
- Advanced Cache Control
  - stale-while-revalidate and stale-if-error support
  - Full per-response Vary header handling
  - Upstream conditional revalidation (If-None-Match/If-Modified-Since)
- Management Features
  - Cache purge endpoints and CLI commands
  - Metrics for hits/misses/bypass/store rates
  - Cache warming strategies
- Performance Optimizations
  - Response compression for large objects
  - Bounded L1 in-memory cache layer
  - Cache key optimization
Configuration
Enable caching with environment variables:
HTTP_CACHE_ENABLED=true
HTTP_CACHE_BACKEND=redis
REDIS_ADDR=localhost:6379
REDIS_DB=0
REDIS_CACHE_KEY_PREFIX=llmproxy:cache:
HTTP_CACHE_MAX_OBJECT_BYTES=1048576
HTTP_CACHE_DEFAULT_TTL=300
See API Configuration Guide for complete configuration details.