Configure Realtime Evals

This guide walks through enabling and configuring realtime evals on an AgentRuntime so that live conversations are continuously evaluated against the eval definitions in your PromptPack.

Before enabling realtime evals, ensure:

  • session-api is running with PostgreSQL storage — eval results are stored in the eval_results table it manages
  • Redis is available — used for event publishing between session-api and the eval worker (Pattern A)
  • Provider CRDs exist for any LLM judges you plan to use — these supply the credentials for judge API calls

Add evals.enabled: true to your AgentRuntime spec:

apiVersion: omnia.altairalabs.ai/v1alpha1
kind: AgentRuntime
metadata:
  name: my-agent
spec:
  promptPackRef:
    name: my-prompts
  providers:
    - name: default
      providerRef:
        name: claude-sonnet
  facade:
    type: websocket
  evals:
    enabled: true

With just enabled: true and no other settings, evals use these defaults:

Setting                              Default
sampling.defaultRate                 100 (all evals run)
sampling.extendedRate                10 (10% of extended evals run)
rateLimit.maxEvalsPerSecond          50
rateLimit.maxConcurrentJudgeCalls    5
sessionCompletion.inactivityTimeout  5m

Evals will only execute if the referenced PromptPack contains eval definitions.

LLM judge evals need an LLM to act as the judge. Create a Provider CRD for the judge model and add it to the AgentRuntime’s providers list.

apiVersion: omnia.altairalabs.ai/v1alpha1
kind: Provider
metadata:
  name: claude-haiku
spec:
  type: claude
  model: claude-haiku-4-5-20251001
  secretRef:
    name: anthropic-api-key

Add a named provider entry for the judge alongside your default provider:

spec:
  providers:
    - name: default
      providerRef:
        name: claude-sonnet   # Primary LLM for the agent
    - name: judge
      providerRef:
        name: claude-haiku    # Cheap/fast model for eval judging
  evals:
    enabled: true

The eval worker resolves provider credentials from the AgentRuntime’s spec.providers list. The provider name (e.g., "judge") can be referenced in PromptPack eval definitions.

Eval definitions live in your PromptPack’s pack.json. Add an evals array to the prompt that should be evaluated:

{
  "prompts": {
    "customer-support": {
      "system": "You are a helpful customer support agent...",
      "evals": [
        {
          "id": "helpfulness",
          "type": "llm_judge_turn",
          "trigger": "every_turn",
          "params": {
            "judge": "fast-judge",
            "criteria": "Is the response helpful, accurate, and on-topic?",
            "rubric": "1-5 scale"
          }
        },
        {
          "id": "no-competitor-mentions",
          "type": "content_includes",
          "trigger": "every_turn",
          "params": {
            "pattern": "competitor-name",
            "should_match": false
          }
        },
        {
          "id": "resolution-check",
          "type": "llm_judge_turn",
          "trigger": "on_session_complete",
          "params": {
            "judge": "strong-judge",
            "criteria": "Did the agent fully resolve the customer's issue?"
          }
        }
      ]
    }
  }
}
Eval types:

Type                 What it does                                  Cost
llm_judge_turn       LLM evaluates the response against criteria   LLM API call
content_includes     Regex/string match on response content        Free
guardrail_triggered  Checks if a specific validator fired          Free

Triggers:

Trigger              When it fires
every_turn           After each assistant message
on_session_complete  When session ends or times out
on_n_turns           Every N assistant messages
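How the triggers map onto a conversation can be pictured with a small dispatch function. This is an illustrative sketch, not the worker's actual code, and it assumes `on_n_turns` carries an `n` parameter in `params`:

```python
def evals_to_run(evals, turn_index, session_ended=False):
    """Return ids of evals whose trigger fires at this point.

    evals: eval definitions as in pack.json (id, trigger, params).
    turn_index: 1-based count of assistant messages so far.
    session_ended: True once the session completes or times out.
    """
    due = []
    for ev in evals:
        trig = ev["trigger"]
        if trig == "every_turn" and not session_ended:
            due.append(ev["id"])
        elif trig == "on_session_complete" and session_ended:
            due.append(ev["id"])
        elif (trig == "on_n_turns" and not session_ended
              and turn_index % ev["params"]["n"] == 0):
            due.append(ev["id"])
    return due
```

For the pack.json example above, `every_turn` evals would fire after each assistant message, while `resolution-check` would fire only once the session completes.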

For high-traffic agents, you may not want to run expensive LLM judge evals on every session. Configure sampling rates to control cost:

spec:
  evals:
    sampling:
      defaultRate: 100   # Run all lightweight evals (fast, free)
      extendedRate: 10   # Only run extended evals on 10% of eligible turns

Sampling is deterministic — the same sessionID:turnIndex combination always produces the same sampling decision. This means results are consistent across retries and you get an evenly distributed sample.
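One common way to get this behavior (a sketch of the technique, not the worker's actual implementation) is to hash the `sessionID:turnIndex` key into a bucket and compare it to the rate:

```python
import hashlib

def should_sample(session_id: str, turn_index: int, rate_percent: int) -> bool:
    """Deterministically decide whether extended evals run for this turn.

    The same session_id:turn_index key always hashes to the same bucket,
    so retries reach the same decision, and a good hash spreads sampled
    turns evenly across traffic.
    """
    key = f"{session_id}:{turn_index}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < rate_percent
```

Because no randomness is involved, re-running an eval job for the same turn can never flip its sampling decision.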

Cost estimation example:

Traffic              LLM Judge Rate  Judge Calls/Day  Estimated Cost/Day
500 sessions/day     10%             ~100             ~$0.05 (Haiku)
5,000 sessions/day   10%             ~1,000           ~$0.50 (Haiku)
50,000 sessions/day  5%              ~5,000           ~$2.50 (Haiku)
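The figures above follow from simple arithmetic. The turns-per-session and per-call price below are illustrative assumptions chosen to match the table, not published rates:

```python
def estimate_daily_cost(sessions_per_day: int, rate_percent: int,
                        judge_turns_per_session: float = 2.0,
                        cost_per_call: float = 0.0005):
    """Estimate daily judge calls and spend.

    judge_turns_per_session and cost_per_call are assumed values,
    roughly consistent with the Haiku figures in the table above.
    """
    calls = sessions_per_day * judge_turns_per_session * rate_percent / 100
    return calls, calls * cost_per_call

calls, cost = estimate_daily_cost(500, 10)
print(f"~{calls:.0f} calls/day, ~${cost:.2f}/day")  # ~100 calls/day, ~$0.05/day
```

Plugging in your own traffic, sampling rate, and judge pricing gives a quick budget check before raising extendedRate.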

Rate limits provide a hard ceiling on eval throughput, protecting against unexpected traffic spikes:

spec:
  evals:
    rateLimit:
      maxEvalsPerSecond: 50        # Overall eval throughput limit
      maxConcurrentJudgeCalls: 5   # Concurrent LLM API calls

If the rate limit is reached, evals are queued rather than dropped. Increase these values for high-throughput agents where eval latency matters.
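Conceptually, the two settings behave like a pacing delay plus a concurrency semaphore. A minimal sketch of that queue-don't-drop behavior (not the worker's actual implementation):

```python
import asyncio

async def run_judge_calls(jobs, max_concurrent=5, max_per_second=50):
    """Run eval jobs behind a concurrency semaphore and simple pacing.

    Jobs beyond the limits wait in line; none are dropped.
    """
    sem = asyncio.Semaphore(max_concurrent)   # ~ maxConcurrentJudgeCalls
    interval = 1.0 / max_per_second           # ~ maxEvalsPerSecond pacing

    async def run(job):
        async with sem:          # at most max_concurrent calls in flight
            return await job()

    tasks = []
    for job in jobs:
        tasks.append(asyncio.create_task(run(job)))
        await asyncio.sleep(interval)  # space out job starts
    return await asyncio.gather(*tasks)
```

The trade-off is latency: queued evals still run, just later, which is why the limits should be raised for high-throughput agents where eval freshness matters.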

The inactivityTimeout controls how long the system waits after the last message before considering a session complete and running on_session_complete evals:

spec:
  evals:
    sessionCompletion:
      inactivityTimeout: 10m   # Wait 10 minutes of silence

Set this based on your expected conversation patterns:

  • Chatbots with quick exchanges: 2m to 5m
  • Complex support conversations: 10m to 15m
  • Long-running async workflows: 30m or more

The dashboard provides two views:

  1. Session detail — open any session to see eval scores inline next to each assistant message
  2. Quality view — aggregate pass rates and score trends across agents, viewable from the agent list

Query eval results directly via session-api:

# Get eval results for a specific session
curl http://session-api:8080/api/v1/sessions/SESSION_ID/eval-results
# List eval results for an agent
curl "http://session-api:8080/api/v1/eval-results?agentName=my-agent&namespace=default"
# Get aggregate statistics
curl "http://session-api:8080/api/v1/eval-results/summary?agentName=my-agent"

For non-PromptKit agents (Pattern A), the eval worker must be deployed via Helm (see Eval Worker Helm values):

# Check if the eval worker is running
kubectl get deploy -l app.kubernetes.io/component=eval-worker
# View eval worker logs
kubectl logs -l app.kubernetes.io/component=eval-worker --tail=50

In multi-namespace mode, a single eval worker watches multiple namespaces. Check its logs to verify all namespaces are being consumed.

Verify that results are being written:

# Query recent eval results via the API
curl "http://session-api:8080/api/v1/eval-results?limit=5"

Verify the AgentRuntime has evals enabled:

kubectl get agentruntime my-agent -o jsonpath='{.spec.evals}'

Putting it all together, a complete AgentRuntime with evals fully configured:

apiVersion: omnia.altairalabs.ai/v1alpha1
kind: AgentRuntime
metadata:
  name: customer-support
  namespace: production
spec:
  promptPackRef:
    name: customer-support-pack
    track: stable
  providers:
    - name: default
      providerRef:
        name: claude-sonnet
    - name: judge
      providerRef:
        name: claude-haiku
  facade:
    type: websocket
  session:
    type: postgres
    storeRef:
      name: session-db
  evals:
    enabled: true
    sampling:
      defaultRate: 100
      extendedRate: 10
    rateLimit:
      maxEvalsPerSecond: 50
      maxConcurrentJudgeCalls: 5
    sessionCompletion:
      inactivityTimeout: 5m
  runtime:
    replicas: 3