Autoscaling
Omnia supports two autoscaling mechanisms: standard Kubernetes HPA and KEDA. This document explains when to use each and how they work with AI agents.
Choosing an Autoscaler
Section titled “Choosing an Autoscaler”flowchart TD A{Need scale-to-zero?} -->|Yes| B{Accept cold starts?} A -->|No| C{Need custom metrics?} B -->|Yes| KEDA[Use KEDA] B -->|No| MIN1[Use KEDA with minReplicas: 1] C -->|Yes| KEDA C -->|No| D{Need cron-based scaling?} D -->|Yes| KEDA D -->|No| HPA[Use HPA]Agent Workload Characteristics
Section titled “Agent Workload Characteristics”AI agents have unique scaling characteristics compared to typical web services:
- I/O Bound: Agents spend most time waiting for LLM API responses, not computing
- Long-Running Connections: WebSocket connections persist for entire conversations
- Memory-Heavy: Each connection/session consumes memory for context
- Variable Latency: LLM responses can take 1-30+ seconds
These characteristics inform Omnia’s default autoscaling settings.
HPA (Standard Autoscaling)
Section titled “HPA (Standard Autoscaling)”The Horizontal Pod Autoscaler is Kubernetes’ built-in scaling mechanism.
How It Works
Section titled “How It Works”- HPA periodically queries metrics from the Metrics Server
- Calculates desired replicas based on target utilization
- Scales the Deployment up or down
Default Configuration
Section titled “Default Configuration”Omnia’s HPA defaults are optimized for agent workloads:
autoscaling: enabled: true type: hpa minReplicas: 1 maxReplicas: 10 targetMemoryUtilizationPercentage: 70 # Primary metric targetCPUUtilizationPercentage: 90 # Safety valve scaleDownStabilizationSeconds: 300 # 5 minute cooldownWhy memory is primary:
- Each WebSocket connection uses memory for buffers
- Session context (conversation history) consumes memory
- CPU usage is low while waiting for LLM responses
Why 90% CPU as secondary:
- Acts as a safety valve for unexpected CPU spikes
- Prevents CPU exhaustion during message parsing/serialization
Why 5-minute scale-down stabilization:
- Prevents thrashing during bursty traffic
- WebSocket connections may reconnect shortly after disconnecting
- LLM calls have high latency; recent scale-up may still be needed
When to Use HPA
Section titled “When to Use HPA”Use standard HPA when:
- You don’t need scale-to-zero
- You want simplicity (no additional dependencies)
- Resource-based metrics (CPU/memory) are sufficient
- You’re running in an environment without KEDA
KEDA (Advanced Autoscaling)
Section titled “KEDA (Advanced Autoscaling)”KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with custom metrics and scale-to-zero.
How It Works
Section titled “How It Works”- KEDA polls external metric sources (Prometheus, queues, etc.)
- Creates/manages an HPA behind the scenes
- Can scale to zero when no events/metrics trigger scaling
- Scales from zero when triggers fire
Default Configuration
Section titled “Default Configuration”autoscaling: enabled: true type: keda minReplicas: 1 # Set to 0 for scale-to-zero maxReplicas: 20 keda: pollingInterval: 30 cooldownPeriod: 300Default Trigger
Section titled “Default Trigger”If no custom triggers are specified, Omnia configures a Prometheus trigger:
triggers: - type: prometheus metadata: serverAddress: "http://omnia-prometheus-server.omnia-system.svc.cluster.local/prometheus" query: 'sum(omnia_agent_connections_active{agent="<name>",namespace="<ns>"}) or vector(0)' threshold: "10"This scales based on active WebSocket connections:
- Below 10 connections per pod: consider scaling down
- Above 10 connections per pod: scale up
When to Use KEDA
Section titled “When to Use KEDA”Use KEDA when:
- You want custom metrics (connections, queue depth, etc.)
- You need scale-to-zero (with caveats, see below)
- You want cron-based scaling (business hours)
- You need to scale on external events
Available Triggers
Section titled “Available Triggers”Prometheus (recommended for agents):
triggers: - type: prometheus metadata: serverAddress: "http://prometheus:9090" query: 'sum(omnia_agent_connections_active{agent="my-agent"})' threshold: "10"Cron (business hours scaling):
triggers: - type: cron metadata: timezone: "America/New_York" start: "0 8 * * 1-5" # 8am Mon-Fri end: "0 18 * * 1-5" # 6pm Mon-Fri desiredReplicas: "5"CPU/Memory (like HPA):
triggers: - type: cpu metadata: type: Utilization value: "70"Scale-to-Zero Considerations
Section titled “Scale-to-Zero Considerations”KEDA supports scaling to zero replicas, but there are important tradeoffs:
The Cold Start Problem
Section titled “The Cold Start Problem”sequenceDiagram participant C as Client participant G as Gateway participant K as KEDA participant P as Agent Pod
Note over P: Scaled to 0
C->>G: WebSocket connect G->>G: No backend pods! G-->>C: Connection failed
Note over K: Detects trigger K->>P: Scale to 1 Note over P: Pod starting... Note over P: Pod ready (30s later)
C->>G: Retry connect G->>P: Forward P-->>C: ConnectedWhen scaled to zero:
- New connection arrives at the gateway
- No pods exist to handle it
- Connection fails or times out
- KEDA scales up (too late for that request)
Solutions
Section titled “Solutions”1. Don’t scale to zero (recommended):
autoscaling: minReplicas: 1 # Always have one pod readyYou still benefit from scale-out (1→N) and scale-in (N→1).
2. Accept cold starts: For non-interactive workloads or where occasional failures are acceptable.
3. Use a proxy/queue: The KEDA HTTP Add-on can buffer requests during scale-up, but it’s not production-ready.
Recommendation
Section titled “Recommendation”For most production deployments, use minReplicas: 1. The cost of one always-running pod is usually worth the reliability.
Metrics for Scaling
Section titled “Metrics for Scaling”Omnia agents expose these Prometheus metrics:
| Metric | Description | Good for Scaling? |
|---|---|---|
omnia_agent_connections_active | Current WebSocket connections | Yes - direct load indicator |
omnia_agent_requests_inflight | Pending LLM requests | Yes - indicates processing load |
omnia_agent_request_duration_seconds | Request latency | No - trailing indicator |
omnia_agent_sessions_active | Active sessions | Maybe - depends on session lifetime |
Recommended Scaling Query
Section titled “Recommended Scaling Query”sum(omnia_agent_connections_active{agent="my-agent"}) or vector(0)The or vector(0) ensures the query returns 0 when no metrics exist (new deployments).
Tuning Guidelines
Section titled “Tuning Guidelines”For Latency-Sensitive Workloads
Section titled “For Latency-Sensitive Workloads”Scale up aggressively, scale down conservatively:
autoscaling: type: hpa targetMemoryUtilizationPercentage: 50 # Lower threshold scaleDownStabilizationSeconds: 600 # 10 minutesFor Cost-Sensitive Workloads
Section titled “For Cost-Sensitive Workloads”Scale down more aggressively:
autoscaling: type: keda minReplicas: 1 keda: cooldownPeriod: 60 # 1 minute triggers: - type: prometheus metadata: threshold: "20" # Higher threshold per podFor Predictable Traffic
Section titled “For Predictable Traffic”Use cron-based scaling:
autoscaling: type: keda keda: triggers: # Base capacity during business hours - type: cron metadata: timezone: "UTC" start: "0 8 * * 1-5" end: "0 18 * * 1-5" desiredReplicas: "5" # Connection-based scaling for spikes - type: prometheus metadata: query: 'sum(omnia_agent_connections_active{agent="my-agent"})' threshold: "10"Comparison Summary
Section titled “Comparison Summary”| Feature | HPA | KEDA |
|---|---|---|
| Built-in | Yes | Requires installation |
| Scale to zero | No | Yes (with caveats) |
| Custom metrics | Limited | Extensive |
| Cron scaling | No | Yes |
| Complexity | Low | Medium |
| Dependencies | Metrics Server | KEDA + metric source |
Choose HPA for simplicity, KEDA for advanced scaling scenarios.