Realtime Evals

Realtime evals extend Omnia’s evaluation engine to run continuously against live production sessions. Eval definitions live in the PromptPack alongside validators and guardrails, and execution is triggered automatically by session events.

Key principle: same rubric, two contexts. A PromptPack author defines “what good looks like” once. The enterprise eval system uses those definitions pre-deploy against synthetic scenarios (via Arena batch jobs) and post-deploy against live traffic (via realtime evals). Same assertion engine, same eval definitions, different input data.

Evals are:

  • Non-blocking — they run asynchronously after the response is sent, with zero latency impact on the conversation
  • Event-driven — triggered by session events, not polling
  • Per-agent — each AgentRuntime configures its own eval settings (judges, sampling, rate limits)

Evals are defined in the PromptPack’s pack.json as a sibling to validators:

```json
{
  "prompts": {
    "customer-support": {
      "system": "You are a helpful customer support agent...",
      "validators": [
        { "type": "banned_words", "params": { "words": ["competitor"] } }
      ],
      "evals": [
        {
          "id": "helpfulness",
          "type": "llm_judge_turn",
          "trigger": "every_turn",
          "params": {
            "judge": "fast-judge",
            "criteria": "Is the response helpful, accurate, and on-topic?",
            "rubric": "1-5 scale"
          }
        },
        {
          "id": "resolution",
          "type": "llm_judge_turn",
          "trigger": "on_session_complete",
          "params": {
            "judge": "strong-judge",
            "criteria": "Did the agent resolve the customer's issue?"
          }
        },
        {
          "id": "no-pii-leak",
          "type": "content_includes",
          "trigger": "every_turn",
          "params": {
            "pattern": "\\b\\d{3}-\\d{2}-\\d{4}\\b",
            "should_match": false
          }
        }
      ]
    }
  }
}
```
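When an event arrives, the eval system selects the definitions whose trigger matches. A minimal sketch of that lookup, assuming the pack structure above (the helper name `evals_for_trigger` is illustrative, not Omnia's API):

```python
import json

# Trimmed pack.json matching the structure above (params omitted for brevity)
PACK_JSON = """
{
  "prompts": {
    "customer-support": {
      "evals": [
        {"id": "helpfulness", "type": "llm_judge_turn", "trigger": "every_turn"},
        {"id": "resolution", "type": "llm_judge_turn", "trigger": "on_session_complete"},
        {"id": "no-pii-leak", "type": "content_includes", "trigger": "every_turn"}
      ]
    }
  }
}
"""

def evals_for_trigger(pack: dict, prompt: str, trigger: str) -> list[dict]:
    """Return the eval definitions for a prompt that fire on the given trigger."""
    return [e for e in pack["prompts"][prompt].get("evals", []) if e["trigger"] == trigger]

pack = json.loads(PACK_JSON)
print([e["id"] for e in evals_for_trigger(pack, "customer-support", "every_turn")])
# ['helpfulness', 'no-pii-leak']
```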
| Trigger | Fires when | Use case |
|---|---|---|
| `every_turn` | After each assistant message is recorded | Per-turn quality scoring, safety checks |
| `on_session_complete` | Session ends or times out | Conversation-level judgments, resolution checks |
| `on_n_turns` | Every N assistant messages | Periodic checks during long sessions |
| Type | Description | Requires LLM judge |
|---|---|---|
| `llm_judge_turn` | LLM evaluates a single turn against criteria | Yes |
| `content_includes` | Check whether the response contains (or doesn't contain) a pattern | No |
| `guardrail_triggered` | Check whether a specific validator fired during the turn | No |

Non-LLM eval types (content_includes, guardrail_triggered) are free to run and can evaluate every session. LLM judge evals incur API costs and are typically sampled.
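A `content_includes` check like the `no-pii-leak` eval above reduces to a regex assertion; a minimal sketch (the function name is illustrative, not Omnia's API):

```python
import re

def content_includes(text: str, pattern: str, should_match: bool) -> bool:
    """Pass if the pattern's presence matches the expectation.
    With should_match=False this acts as a leak detector: any match fails."""
    matched = re.search(pattern, text) is not None
    return matched == should_match

# SSN-shaped pattern from the no-pii-leak eval above
ssn = r"\b\d{3}-\d{2}-\d{4}\b"
print(content_includes("Your case number is 42.", ssn, should_match=False))  # True (pass)
print(content_includes("SSN: 123-45-6789", ssn, should_match=False))         # False (fail)
```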

Realtime evals use two execution patterns depending on the agent’s framework type:

```mermaid
flowchart TB
    AR["AgentRuntime<br/><i>spec.evals.enabled: true</i>"]
    AR --> NonPK["Non-PromptKit<br/>(langchain, autogen, custom)"]
    AR --> PK["PromptKit Agents"]
    NonPK --> PA1["<b>Pattern A only</b><br/>Facade → session-api<br/>→ Redis Streams → eval worker"]
    PK --> PA2["<b>Pattern A</b><br/>(Facade path — same as non-PK)"]
    PK --> PC["<b>Pattern C</b><br/>(EventBus path)<br/>RecordingStage → EventBus<br/>→ in-process evals"]
```

Every AgentRuntime uses the facade’s recordingResponseWriter, which captures assistant messages, tool calls, token counts, and cost. This data flows through session-api to PostgreSQL. Session-api then publishes lightweight events to Redis Streams. The eval worker subscribes and runs evals (either per-namespace or across multiple namespaces depending on configuration).

```mermaid
flowchart LR
    F[Facade] --> RW[recordingResponseWriter]
    RW --> SA[session-api]
    SA --> PG[(PostgreSQL)]
    SA -.->|async publish| RS[Redis Streams]
    RS --> EW[eval-worker]
    EW --> ER[(eval_results)]
```

Pattern A works with every framework type — PromptKit, LangChain, AutoGen, or custom runtimes.
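The lightweight event session-api publishes might look like the following sketch; the stream naming and field schema are illustrative assumptions, not Omnia's actual wire format:

```python
import time

def turn_recorded_event(namespace: str, session_id: str, turn_index: int) -> dict:
    """Build the lightweight event session-api might publish to a per-namespace
    Redis stream (field names are illustrative, not Omnia's actual schema)."""
    return {
        "stream": f"omnia:evals:{namespace}",   # hypothetical stream naming
        "fields": {
            "type": "turn.recorded",
            "session_id": session_id,
            "turn_index": str(turn_index),       # stream field values are strings
            "ts": str(int(time.time() * 1000)),
        },
    }

evt = turn_recorded_event("prod", "sess-123", 4)
print(evt["stream"])          # omnia:evals:prod
print(evt["fields"]["type"])  # turn.recorded
# With redis-py this would be published via:
#   r.xadd(evt["stream"], evt["fields"])
```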

Pattern C: EventBus-Driven (PromptKit Agents)

For PromptKit agents, an additional path wires PromptKit’s RecordingStage and EventBus into the pipeline. This provides richer event data including provider call metadata, validation events, and pipeline stage timings. An in-process EventBusEvalListener triggers evals with lower latency.

```mermaid
flowchart TB
    subgraph Pipeline["PromptKit Pipeline"]
        RS[RecordingStage]
    end
    RS -->|rich events| EB[EventBus]
    EB --> OES[OmniaEventStore]
    OES --> SS[Session Store]
    EB --> EBL[EventBusEvalListener]
    EBL --> Runner[EvalRunner]
    Runner --> RW[Result Writer]
```

For PromptKit agents, Pattern C is the primary eval path. Pattern A events still fire, but the eval worker skips agents whose evals Pattern C already handles in-process, so no turn is evaluated twice.

| Data | Pattern A (Facade) | Pattern C (EventBus) |
|---|---|---|
| Assistant message content | Yes | Yes |
| Tool calls (name, args) | Yes | Yes (+ schema validation) |
| Token counts and cost | Yes | Yes (+ per-call breakdown) |
| Latency | Total only | Per-provider-call |
| Provider call metadata | No | Yes (model, temperature) |
| Validation/guardrail events | No | Yes |
| Pipeline stage timings | No | Yes |

The eval worker (eval-worker) is a long-running cluster singleton Deployment that subscribes to Redis Streams and runs evals for agents using Pattern A. It is deployed automatically when enterprise features are enabled.

By default, the worker watches its deployment namespace. To watch additional namespaces, configure the enterprise.evalWorker.namespaces Helm value with the list of namespaces to monitor. The worker reads from multiple Redis streams concurrently via XREADGROUP. See Eval Worker Helm values for details.

The worker:

  1. Subscribes to Redis Streams events (one stream per namespace) using a consumer group for horizontal scaling
  2. Looks up the agent’s AgentRuntime to check eval config and PromptPack reference
  3. Loads eval definitions from the PromptPack ConfigMap (cached with a Kubernetes watcher)
  4. Fetches session data from session-api
  5. Runs assertions using the PromptKit eval engine
  6. Writes results to the eval_results table via session-api
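The consume-ack cycle in steps 1 and 6 can be sketched as follows; `run_evals` and `ack` stand in for the real eval engine and redis-py client, and the loop shape is an assumption:

```python
def handle_event(fields: dict, run_evals, ack) -> None:
    """Process one Redis Streams entry: run evals for the turn, then XACK.
    Acking only after evals complete means a crashed worker leaves the entry
    pending, so another consumer in the group can reclaim it."""
    if fields.get("type") != "turn.recorded":
        ack()  # ignore unrelated events, but still ack them
        return
    run_evals(fields["session_id"], int(fields["turn_index"]))
    ack()

# The surrounding loop would look roughly like (redis-py):
#   while True:
#       resp = r.xreadgroup("eval-workers", consumer, {stream: ">"}, count=10, block=5000)
#       for _, entries in resp:
#           for entry_id, fields in entries:
#               handle_event(fields, run_evals,
#                            lambda: r.xack(stream, "eval-workers", entry_id))

ran, acked = [], []
handle_event({"type": "turn.recorded", "session_id": "s1", "turn_index": "3"},
             lambda s, t: ran.append((s, t)), lambda: acked.append(True))
print(ran)  # [('s1', 3)]
```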

Why a long-running Deployment instead of Kubernetes Jobs? Enterprise batch evaluation (Arena) uses Jobs because each ArenaJob is a discrete unit of work. Realtime evals are continuous — spinning up a Job per session event would be too slow (pod scheduling takes 5-15 seconds) and wasteful. A persistent worker pool is the right model.

LLM judge evals need an LLM provider for judging. The eval worker resolves provider specs from the AgentRuntime’s spec.providers list. Add a named provider (e.g., "judge") to supply the judge model:

```yaml
spec:
  providers:
    - name: default
      providerRef:
        name: claude-sonnet   # Primary LLM for the agent
    - name: judge
      providerRef:
        name: claude-haiku    # Cheap/fast model for eval judging
```

The eval worker (Pattern A) and the facade’s eval listener (Pattern C) both resolve these Provider CRDs to obtain credentials for making LLM judge calls. This allows different agents to use different judge models based on their quality requirements and cost constraints.
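Resolving a named provider from `spec.providers` is essentially a keyed lookup; a sketch of that logic (the fallback-to-default behavior is an assumption, not documented Omnia semantics):

```python
def resolve_provider(spec: dict, name: str, fallback: str = "default") -> str:
    """Return the providerRef name for a named provider entry, falling back
    to the default provider when the requested name is absent (illustrative)."""
    by_name = {p["name"]: p["providerRef"]["name"] for p in spec["providers"]}
    return by_name.get(name, by_name[fallback])

spec = {"providers": [
    {"name": "default", "providerRef": {"name": "claude-sonnet"}},
    {"name": "judge",   "providerRef": {"name": "claude-haiku"}},
]}
print(resolve_provider(spec, "judge"))    # claude-haiku
print(resolve_provider(spec, "missing"))  # claude-sonnet
```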

For on_session_complete evals, the system needs to detect when a session has ended. Two mechanisms are used:

  1. Explicit close — when a session’s status is set to "completed" via the session-api, a session.completed event is published immediately.

  2. Inactivity timeout — the eval system tracks the last message time per session. After the configured inactivityTimeout (default 5 minutes), the session is considered complete and session-level evals are triggered.

The timeout is configurable per agent:

```yaml
spec:
  evals:
    sessionCompletion:
      inactivityTimeout: 10m   # Wait 10 minutes of inactivity
```

LLM judge evals cost money. At scale, uncontrolled eval execution can produce significant spend. Omnia provides three layers of cost control:

Sampling controls what percentage of sessions/turns are evaluated. It uses deterministic hashing on sessionID:turnIndex, so the same session/turn always produces the same sampling decision. This ensures consistent behavior across retries.

```yaml
spec:
  evals:
    sampling:
      defaultRate: 100   # 100% for lightweight evals (fast, free)
      extendedRate: 10   # 10% for extended evals (model-powered, costs money)
```
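The deterministic hashing described above can be sketched as follows (SHA-256 and the bucket math are assumptions; the document does not specify the actual hash function):

```python
import hashlib

def should_sample(session_id: str, turn_index: int, rate_percent: int) -> bool:
    """Deterministic sampling: hash sessionID:turnIndex into [0, 100) and
    compare to the configured rate, so retries always decide the same way."""
    key = f"{session_id}:{turn_index}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < rate_percent

# Same input always yields the same decision (idempotent on retry)
assert should_sample("sess-1", 3, 10) == should_sample("sess-1", 3, 10)
print(should_sample("sess-1", 3, 100))  # True — rate 100% samples everything
print(should_sample("sess-1", 3, 0))    # False — rate 0% samples nothing
```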

Rate limits use a token bucket algorithm for overall throughput and a semaphore for concurrent judge calls:

```yaml
spec:
  evals:
    rateLimit:
      maxEvalsPerSecond: 50        # Token bucket
      maxConcurrentJudgeCalls: 5   # Semaphore for LLM API calls
```
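A minimal sketch of the two mechanisms (the refill math and burst size are assumptions; a real implementation would also need locking):

```python
import threading

class TokenBucket:
    """maxEvalsPerSecond as a token bucket: refill at `rate` tokens/sec,
    cap at `burst`, spend one token per eval."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, now: float) -> bool:
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=50, burst=50)  # maxEvalsPerSecond: 50
judge_slots = threading.Semaphore(5)     # maxConcurrentJudgeCalls: 5
# An LLM judge call would run inside:  with judge_slots: ...

print(bucket.allow(now=0.0))                            # True — bucket starts full
print(sum(bucket.allow(now=0.0) for _ in range(100)))   # 49 — then drained until refill
```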

Use cheaper, faster models for high-volume per-turn evals and reserve more capable models for session-level evaluations:

| Judge | Model | Cost per eval | Use for |
|---|---|---|---|
| `fast-judge` | Claude Haiku | ~$0.0005 | `every_turn` evals |
| `strong-judge` | Claude Sonnet | ~$0.005 | `on_session_complete` evals |
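Back-of-the-envelope spend at these per-eval prices, using hypothetical traffic volumes and the 10% extended sampling rate configured above:

```python
# Illustrative daily spend — traffic numbers are hypothetical
turns_per_day = 1_000_000
sessions_per_day = 50_000

fast_cost   = turns_per_day    * 0.10 * 0.0005  # 10% sampled, fast-judge per turn
strong_cost = sessions_per_day * 0.10 * 0.005   # 10% sampled, strong-judge per session

print(f"${fast_cost:.2f}/day")    # $50.00/day
print(f"${strong_cost:.2f}/day")  # $25.00/day
```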

Eval results are stored in the eval_results table in PostgreSQL (managed by session-api). Each result records:

  • The session and message that was evaluated
  • The eval definition ID, type, and trigger
  • Pass/fail status and optional numeric score (0.0-1.0)
  • Execution details (duration, judge tokens used, judge cost)
  • Whether it was executed by the eval worker (Pattern A) or in-process (Pattern C)

Results are accessed through session-api endpoints:

| Method | Path | Description | Notes |
|---|---|---|---|
| POST | `/api/v1/eval-results` | Write eval result(s) | Called by eval worker or in-process listener |
| GET | `/api/v1/sessions/{id}/eval-results` | Get results for a session | Used by dashboard session detail |
| GET | `/api/v1/eval-results` | List/query results | Filter by agent, eval ID, passed, time range |
| GET | `/api/v1/eval-results/summary` | Aggregate statistics | Pass rates, score distributions, trends |
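The summary endpoint's pass-rate aggregation can be sketched as follows (field names are illustrative, not the actual response schema):

```python
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    """Aggregate per-eval pass rates, roughly what the /summary endpoint
    computes (input/output shapes are assumptions)."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["eval_id"]] += 1
        passed[r["eval_id"]] += r["passed"]  # bool counts as 0/1
    return {e: passed[e] / totals[e] for e in totals}

results = [
    {"eval_id": "helpfulness", "passed": True},
    {"eval_id": "helpfulness", "passed": False},
    {"eval_id": "no-pii-leak", "passed": True},
]
print(summarize(results))  # {'helpfulness': 0.5, 'no-pii-leak': 1.0}
```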

The dashboard provides two views for eval results:

  1. Session detail — inline eval scores displayed next to each assistant message, showing which evals passed/failed and their scores.

  2. Agent quality view — aggregate pass rates, score trends over time, and comparison across agents and PromptPack versions. Drill down by eval type to identify specific quality dimensions.

Data flows from the dashboard through the operator proxy to session-api’s eval-results endpoints.

Eval configuration lives on the AgentRuntime (not a global setting or separate CRD) because:

  • Evals are tied to the agent’s PromptPack, which is already on AgentRuntime
  • Judge providers may differ per agent (cheap judges for low-stakes agents, strong judges for critical agents)
  • Sampling rates vary by agent traffic volume
  • Operators enable/disable evals per agent, not globally

Pattern A (platform events) provides universal coverage for any framework. Pattern C (EventBus) provides a better experience for PromptKit agents with richer data and lower latency. Supporting both means evals work regardless of framework choice while PromptKit users get enhanced capabilities.

Kubernetes Jobs have 5-15 seconds of pod scheduling overhead. For realtime evals triggered on every assistant message, this latency is unacceptable. A persistent worker pool processes events in milliseconds and scales horizontally via consumer groups.

Deterministic hashing on sessionID:turnIndex ensures:

  • The same turn always gets the same sampling decision (idempotent on retry)
  • Sampling is evenly distributed across sessions
  • No need for external state to track what has been sampled