# Configure Realtime Evals
This guide walks through enabling and configuring realtime evals on an AgentRuntime so that live conversations are continuously evaluated against the eval definitions in your PromptPack.
## Prerequisites

Before enabling realtime evals, ensure:

- **Session-api is running with PostgreSQL storage** — eval results are stored in the `eval_results` table managed by session-api
- **Redis is available** — used for event publishing between session-api and the eval worker (Pattern A)
- **Provider CRDs exist for any LLM judges you plan to use** — these supply the credentials for judge API calls
## Enable Evals

Add `evals.enabled: true` to your AgentRuntime spec:

```yaml
apiVersion: omnia.altairalabs.ai/v1alpha1
kind: AgentRuntime
metadata:
  name: my-agent
spec:
  promptPackRef:
    name: my-prompts
  providers:
    - name: default
      providerRef:
        name: claude-sonnet
  facade:
    type: websocket
  evals:
    enabled: true
```

With just `enabled: true` and no other settings, evals use these defaults:
| Setting | Default |
|---|---|
| `sampling.defaultRate` | `100` (all evals run) |
| `sampling.extendedRate` | `10` (10% of extended evals run) |
| `rateLimit.maxEvalsPerSecond` | `50` |
| `rateLimit.maxConcurrentJudgeCalls` | `5` |
| `sessionCompletion.inactivityTimeout` | `5m` |
Evals will only execute if the referenced PromptPack contains eval definitions.
## Configure Judge Providers

LLM judge evals need an LLM to act as the judge. Create a Provider CRD for the judge model and add it to the AgentRuntime’s providers list.
### 1. Create a Provider CRD

```yaml
apiVersion: omnia.altairalabs.ai/v1alpha1
kind: Provider
metadata:
  name: claude-haiku
spec:
  type: claude
  model: claude-haiku-4-5-20251001
  secretRef:
    name: anthropic-api-key
```

### 2. Add the Judge Provider to AgentRuntime
Add a named provider entry for the judge alongside your default provider:

```yaml
spec:
  providers:
    - name: default
      providerRef:
        name: claude-sonnet   # Primary LLM for the agent
    - name: judge
      providerRef:
        name: claude-haiku    # Cheap/fast model for eval judging
  evals:
    enabled: true
```

The eval worker resolves provider credentials from the AgentRuntime’s `spec.providers` list. The provider name (e.g., `judge`) can be referenced in PromptPack eval definitions.
## Define Evals in PromptPack

Eval definitions live in your PromptPack’s `pack.json`. Add an `evals` array to the prompt that should be evaluated:

```json
{
  "prompts": {
    "customer-support": {
      "system": "You are a helpful customer support agent...",
      "evals": [
        {
          "id": "helpfulness",
          "type": "llm_judge_turn",
          "trigger": "every_turn",
          "params": {
            "judge": "fast-judge",
            "criteria": "Is the response helpful, accurate, and on-topic?",
            "rubric": "1-5 scale"
          }
        },
        {
          "id": "no-competitor-mentions",
          "type": "content_includes",
          "trigger": "every_turn",
          "params": {
            "pattern": "competitor-name",
            "should_match": false
          }
        },
        {
          "id": "resolution-check",
          "type": "llm_judge_turn",
          "trigger": "on_session_complete",
          "params": {
            "judge": "strong-judge",
            "criteria": "Did the agent fully resolve the customer's issue?"
          }
        }
      ]
    }
  }
}
```

### Available Eval Types
| Type | What it does | Cost |
|---|---|---|
| `llm_judge_turn` | LLM evaluates the response against criteria | LLM API call |
| `content_includes` | Regex/string match on response content | Free |
| `guardrail_triggered` | Checks if a specific validator fired | Free |
### Available Triggers

| Trigger | When it fires |
|---|---|
| `every_turn` | After each assistant message |
| `on_session_complete` | When session ends or times out |
| `on_n_turns` | Every N assistant messages |
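Per-turn trigger dispatch can be pictured as a filter over the eval list after each assistant message. The sketch below is illustrative only; in particular, it assumes `on_n_turns` carries an `n` parameter, which is not spelled out in this guide:

```python
def evals_for_turn(evals: list, turn_index: int) -> list:
    """Illustrative: pick the evals that fire after the turn_index-th
    assistant message (1-based). on_session_complete evals run later,
    when the session ends or times out."""
    selected = []
    for e in evals:
        if e["trigger"] == "every_turn":
            selected.append(e)
        elif e["trigger"] == "on_n_turns" and turn_index % e["params"]["n"] == 0:
            selected.append(e)
    return selected

evals = [
    {"id": "helpfulness", "trigger": "every_turn", "params": {}},
    {"id": "spot-check", "trigger": "on_n_turns", "params": {"n": 3}},
    {"id": "resolution", "trigger": "on_session_complete", "params": {}},
]
print([e["id"] for e in evals_for_turn(evals, 3)])  # ['helpfulness', 'spot-check']
```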
## Control Costs with Sampling

For high-traffic agents, you may not want to run expensive LLM judge evals on every session. Configure sampling rates to control cost:

```yaml
spec:
  evals:
    sampling:
      defaultRate: 100    # Run all lightweight evals (fast, free)
      extendedRate: 10    # Only run extended evals on 10% of eligible turns
```

Sampling is deterministic — the same `sessionID:turnIndex` combination always produces the same sampling decision. This means results are consistent across retries and you get an evenly distributed sample.
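A deterministic decision like this is typically made by hashing the `sessionID:turnIndex` key into a fixed bucket. The sketch below illustrates the idea (the actual hash function the eval worker uses is not specified in this guide):

```python
import hashlib

def should_sample(session_id: str, turn_index: int, rate_percent: int) -> bool:
    """Illustrative deterministic sampling: hash sessionID:turnIndex into
    one of 100 buckets and compare against the sampling rate."""
    key = f"{session_id}:{turn_index}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < rate_percent

# The same inputs always yield the same decision, so retries are consistent.
assert should_sample("sess-42", 3, 10) == should_sample("sess-42", 3, 10)
```

Because the buckets are fixed per key rather than drawn at random, roughly `rate_percent` of turns are sampled and a retried turn never flips its decision.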
Cost estimation example:
| Traffic | LLM Judge Rate | Judge Calls/Day | Estimated Cost/Day |
|---|---|---|---|
| 500 sessions/day | 10% | ~100 | ~$0.05 (Haiku) |
| 5,000 sessions/day | 10% | ~1,000 | ~$0.50 (Haiku) |
| 50,000 sessions/day | 5% | ~5,000 | ~$2.50 (Haiku) |
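The table's figures follow from simple multiplication. The sketch below reproduces them under two explicit assumptions (roughly two judged turns per session and about $0.0005 per Haiku judge call; both are illustrative, not measured values):

```python
def daily_judge_cost(sessions_per_day: int, sample_rate: float,
                     judged_turns_per_session: float = 2.0,   # assumption
                     cost_per_call: float = 0.0005):          # assumption (Haiku)
    """Estimate judge calls/day and cost/day from traffic and sampling rate."""
    calls = sessions_per_day * sample_rate * judged_turns_per_session
    return calls, calls * cost_per_call

calls, cost = daily_judge_cost(5_000, 0.10)
print(f"{calls:.0f} calls/day, ${cost:.2f}/day")  # 1000 calls/day, $0.50/day
```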
## Set Rate Limits

Rate limits provide a hard ceiling on eval throughput, protecting against unexpected traffic spikes:

```yaml
spec:
  evals:
    rateLimit:
      maxEvalsPerSecond: 50         # Overall eval throughput limit
      maxConcurrentJudgeCalls: 5    # Concurrent LLM API calls
```

If the rate limit is reached, evals are queued rather than dropped. Increase these values for high-throughput agents where eval latency matters.
## Configure Session Completion

The `inactivityTimeout` controls how long the system waits after the last message before considering a session complete and running `on_session_complete` evals:

```yaml
spec:
  evals:
    sessionCompletion:
      inactivityTimeout: 10m    # Wait 10 minutes of silence
```

Set this based on your expected conversation patterns:
- Chatbots with quick exchanges: `2m` to `5m`
- Complex support conversations: `10m` to `15m`
- Long-running async workflows: `30m` or more
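Conceptually, the timeout is a timer that resets on every message; only when the quiet period elapses does the session count as complete. A minimal sketch of that bookkeeping (illustrative, not session-api's implementation):

```python
from datetime import datetime, timedelta

INACTIVITY_TIMEOUT = timedelta(minutes=5)  # mirrors sessionCompletion.inactivityTimeout

class SessionTracker:
    """Illustrative: mark a session complete after a quiet period."""
    def __init__(self):
        self.last_message_at = datetime.now()

    def on_message(self):
        self.last_message_at = datetime.now()  # any message resets the clock

    def is_complete(self, now=None) -> bool:
        now = now or datetime.now()
        return now - self.last_message_at >= INACTIVITY_TIMEOUT

t = SessionTracker()
print(t.is_complete(t.last_message_at + timedelta(minutes=6)))  # True
```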
## View Eval Results

### Dashboard

The dashboard provides two views:

- **Session detail** — open any session to see eval scores inline next to each assistant message
- **Quality view** — aggregate pass rates and score trends across agents, viewable from the agent list
Query eval results directly via session-api:

```shell
# Get eval results for a specific session
curl http://session-api:8080/api/v1/sessions/SESSION_ID/eval-results

# List eval results for an agent
curl "http://session-api:8080/api/v1/eval-results?agentName=my-agent&namespace=default"

# Get aggregate statistics
curl "http://session-api:8080/api/v1/eval-results/summary?agentName=my-agent"
```

## Verify Evals Are Running
Section titled “Verify Evals Are Running”Check the Eval Worker Pod
Section titled “Check the Eval Worker Pod”For non-PromptKit agents (Pattern A), the eval worker must be deployed via Helm (see Eval Worker Helm values):
# Check if the eval worker is runningkubectl get deploy -l app.kubernetes.io/component=eval-worker
# View eval worker logskubectl logs -l app.kubernetes.io/component=eval-worker --tail=50In multi-namespace mode, a single eval worker watches multiple namespaces. Check its logs to verify all namespaces are being consumed.
### Check Eval Results

Verify that results are being written:

```shell
# Query recent eval results via the API
curl "http://session-api:8080/api/v1/eval-results?limit=5"
```

### Check Agent Configuration
Verify the AgentRuntime has evals enabled:

```shell
kubectl get agentruntime my-agent -o jsonpath='{.spec.evals}'
```

## Complete Example
```yaml
apiVersion: omnia.altairalabs.ai/v1alpha1
kind: AgentRuntime
metadata:
  name: customer-support
  namespace: production
spec:
  promptPackRef:
    name: customer-support-pack
    track: stable

  providers:
    - name: default
      providerRef:
        name: claude-sonnet
    - name: judge
      providerRef:
        name: claude-haiku

  facade:
    type: websocket

  session:
    type: postgres
    storeRef:
      name: session-db

  evals:
    enabled: true
    sampling:
      defaultRate: 100
      extendedRate: 10
    rateLimit:
      maxEvalsPerSecond: 50
      maxConcurrentJudgeCalls: 5
    sessionCompletion:
      inactivityTimeout: 5m

  runtime:
    replicas: 3
```