Architecture Overview

This document explains the architecture of Omnia and the design decisions behind it.

High-Level Architecture

Omnia consists of three main components:

graph TB
    subgraph cluster["Kubernetes Cluster"]
        subgraph operator["Omnia Operator"]
            op[Controller Manager]
        end

        subgraph pod["Agent Pod"]
            facade[Facade Container]
            runtime[Runtime Container]
            facade <-->|gRPC| runtime
        end

        op -->|creates| pod
        op -->|watches| pp[PromptPack ConfigMap]

        subgraph storage["Storage Layer"]
            session[(Session Store<br/>Redis)]
            tools[Tool Services]
        end

        facade --> session
        runtime --> tools
    end

    clients((Clients)) -->|WebSocket| facade

Components

Omnia Operator

The operator is a Kubernetes controller that:

Watches for AgentRuntime, PromptPack, ToolRegistry, and Provider resources
Creates and manages Deployments for agent pods
Generates ConfigMaps for tools configuration
Creates Services for agent access
Monitors referenced resources and updates agents accordingly

The operator follows the standard Kubernetes controller pattern:

Watch - Monitor custom resources for changes
Reconcile - Bring actual state to desired state
Status - Report current state back to the resource

Agent Pod (Sidecar Architecture)

Each agent pod runs two containers in a sidecar pattern:

Facade Container

The facade container handles external client communication:

WebSocket Server - Manages client connections and message routing
Session Management - Creates and tracks conversation sessions
Protocol Translation - Converts WebSocket messages to gRPC calls
Connection Lifecycle - Handles connect, disconnect, and heartbeat
Media Storage (optional) - Handles file uploads for multi-modal messages

Optional Media Storage

The facade can optionally provide media storage for runtimes that don’t have built-in media externalization. When enabled, clients can upload files via HTTP before referencing them in WebSocket messages.

This is useful when:

Using a custom runtime without media handling
Need a runtime-agnostic upload endpoint
Want to avoid base64-encoding large files in WebSocket messages

Runtimes like PromptKit have built-in media externalization, so facade media storage can remain disabled (the default).

Supported Storage Backends:

Backend	Description	Authentication
`local`	Local filesystem	N/A
`s3`	Amazon S3, MinIO, LocalStack	IAM roles, IRSA, access keys
`gcs`	Google Cloud Storage	Workload Identity, service accounts
`azure`	Azure Blob Storage	Managed Identity, account keys

Cloud backends use presigned URLs for direct uploads, bypassing the facade for better performance:

sequenceDiagram
    participant C as Client
    participant F as Facade
    participant CS as Cloud Storage

    C->>F: POST /media/request-upload
    F-->>C: {uploadId, presignedUrl}
    C->>CS: PUT presignedUrl (direct)
    CS-->>C: 200 OK
    C->>F: POST /media/confirm-upload/{id}
    F-->>C: {mediaInfo}

See Configure Media Storage for detailed setup instructions.

Runtime Container

The runtime container handles LLM interactions and tool execution:

PromptKit Integration - Uses PromptKit SDK for LLM communication
Tool Manager - Loads and manages tool adapters (HTTP, gRPC, MCP, OpenAPI)
State Persistence - Saves conversation state to the session store
Tracing - OpenTelemetry instrumentation for observability

The containers communicate via gRPC on localhost, providing clean separation between client-facing logic and LLM processing.

Custom Resource Definitions

AgentRuntime

The primary resource for deploying agents. It references:

Provider configuration (which LLM to use)
PromptPack (what prompts to use)
ToolRegistry (what tools are available)
Session configuration
Evals configuration (judges, sampling, rate limits)
Runtime resources and scaling

PromptPack

Defines versioned prompt configurations following the PromptPack specification. Supports:

Structured prompt definitions with variables, parameters, and validators
ConfigMap-based storage of compiled PromptPack JSON
Canary rollouts for safe prompt updates
Automatic agent notification on changes

ToolRegistry

Defines tool handlers available to agents:

HTTP handlers - REST endpoints with explicit schemas
gRPC handlers - gRPC services using the Tool protocol
MCP handlers - Self-describing Model Context Protocol servers
OpenAPI handlers - Self-describing services with OpenAPI specs
Service discovery via label selectors

Provider

Configures LLM provider settings:

Provider type (claude, openai, gemini, etc.)
Model selection
API credentials
Custom base URLs

Tool Execution Flow

sequenceDiagram
    participant C as Client
    participant F as Facade
    participant R as Runtime
    participant TM as Tool Manager
    participant T as Tool Service

    C->>F: WebSocket message
    F->>R: gRPC request
    R->>R: Send to LLM
    R-->>R: LLM returns tool_call
    R->>TM: Execute tool
    TM->>T: Route to adapter (HTTP/gRPC/MCP/OpenAPI)
    T-->>TM: Tool result
    TM-->>R: Return result
    R->>R: Send result to LLM
    R-->>F: Stream response
    F-->>C: WebSocket chunks

The Tool Manager routes calls to the appropriate adapter based on handler type:

graph LR
    TM[Tool Manager] --> HTTP[HTTP Adapter]
    TM --> GRPC[gRPC Adapter]
    TM --> MCP[MCP Adapter]
    TM --> OA[OpenAPI Adapter]

    HTTP --> HS[REST Service]
    GRPC --> GS[gRPC Service]
    MCP --> MS[MCP Server]
    OA --> OS[OpenAPI Service]

Client sends message via WebSocket
Facade creates/resumes session and forwards to Runtime
Runtime sends message to LLM via PromptKit
LLM returns tool call request
Tool Manager routes call to appropriate adapter
Adapter executes tool and returns result
Result sent back to LLM for final response
Response streamed back through Facade to client

Observability

Omnia provides comprehensive observability through OpenTelemetry:

Tracing

The runtime container creates spans for:

Conversation turns - End-to-end request processing
LLM calls - Time spent in provider API calls
Tool executions - Individual tool call latency

Traces include:

Session ID for correlation
Token usage (input/output)
Cost information
Tool results (success/error)

Metrics

The operator and agent containers expose Prometheus metrics:

Request latency histograms
Tool call counts and durations
Session counts
LLM token usage

Configuration

Enable tracing via environment variables:

env:
  - name: OMNIA_TRACING_ENABLED
    value: "true"
  - name: OMNIA_TRACING_ENDPOINT
    value: "otel-collector.observability:4317"
  - name: OMNIA_TRACING_SAMPLE_RATE
    value: "1.0"

Realtime Evals

Omnia includes a realtime evaluation system that continuously assesses the quality of live agent conversations. Eval definitions are authored in the PromptPack (alongside validators/guardrails) and executed automatically as sessions progress.

The system uses a dual-pattern architecture based on the agent’s framework type:

flowchart LR
    subgraph patternA["Pattern A — All Agents"]
        F[Facade] --> SA[session-api]
        SA -.->|event| RS[Redis Streams]
        RS --> EW[eval worker]
    end

    subgraph patternC["Pattern C — PromptKit Agents"]
        EB[EventBus] --> EBL[EventBusEvalListener]
        EBL --> Runner[in-process evals]
    end

Pattern A (Platform Events) uses the eval-worker Deployment. The facade records sessions through session-api, which publishes lightweight events to Redis Streams. A per-namespace eval worker subscribes, loads the PromptPack’s eval definitions, and runs assertions against the session data. By default the worker runs the long-running and external eval groups — LLM judges and external API checks.
Pattern C (EventBus-Driven) runs in-process inside PromptKit agents. PromptKit’s RecordingStage and EventBus provide richer event data (provider call metadata, validation events, pipeline timings). An in-process EventBusEvalListener triggers evals synchronously during the turn. By default the inline path runs the fast-running group — deterministic handlers (contains, regex) that are cheap enough to gate on.

Both paths run concurrently for PromptKit agents, split by eval group. The defaults are disjoint so a given eval runs on exactly one path; operators can override the routing per agent. Eval configuration — routing, judges, sampling rates, rate limits — is defined per-agent on the AgentRuntime CRD. Results land in the eval_results table tagged source="worker" (Pattern A) or source="runtime-inline" (Pattern C), and are surfaced in the dashboard’s quality view.

For the complete explanation, see Realtime Evals.

Design Decisions

Why Kubernetes Operator?

We chose the operator pattern because:

Native integration - Agents are first-class Kubernetes citizens
Declarative configuration - Define desired state, not procedures
Self-healing - Automatic recovery from failures
Scalability - Leverage Kubernetes scaling mechanisms

Why Sidecar Architecture?

Separating facade and runtime enables:

Separation of concerns - Client handling vs LLM processing
Independent scaling - Different resource requirements
Protocol flexibility - Easy to add new client protocols
Testability - Components can be tested in isolation
Language flexibility - Containers can use different languages

Why WebSocket?

WebSocket was chosen for the client facade because:

Streaming - Essential for LLM response streaming
Bidirectional - Enables tool calls and results
Persistent - Maintains connection for multi-turn conversations
Efficient - Lower overhead than HTTP polling

Why Separate PromptPack?

Separating prompts from agents allows:

Reusability - Same prompts across multiple agents
Versioning - Track prompt changes independently
Safe rollouts - Canary deployments for prompts
Separation of concerns - Prompt engineers vs DevOps

Why Handler-Based Tools?

The handler abstraction enables:

Self-describing services - MCP and OpenAPI discover tools automatically
Explicit schemas - HTTP and gRPC tools define their interface
Unified management - All tool types in one registry
Dynamic updates - Add/remove tools without redeploying agents

Resource Relationships

graph LR
    AR[AgentRuntime] -->|references| PP[PromptPack]
    AR -->|references| TR[ToolRegistry]
    AR -->|references| PR[Provider]
    AR -->|creates| D[Deployment]
    AR -->|creates| S[Service]

    PP -->|source| CM1[ConfigMap]
    TR -->|discovers| SVC[Services]
    TR -->|generates| CM2[Tools ConfigMap]
    PR -->|credentials| SEC[Secret]

    D -->|contains| FC[Facade Container]
    D -->|contains| RC[Runtime Container]

Reconciliation Flow

When an AgentRuntime is created or updated:

Validate the referenced PromptPack exists
Optionally validate the referenced ToolRegistry
Fetch Provider configuration
Generate tools ConfigMap from ToolRegistry
Build the pod spec with facade and runtime containers
Create or update the Deployment
Create or update the Service
Update the AgentRuntime status

When a ToolRegistry changes:

Process handlers (HTTP, gRPC, MCP, OpenAPI)
Discover tools from self-describing handlers
Update discovered tools in status
Find all AgentRuntimes referencing this ToolRegistry
Regenerate tools ConfigMaps for affected agents

Security Considerations

Secrets Management

API keys are stored in Kubernetes Secrets
Secrets are mounted as environment variables, not files
Secrets can be from the same or different namespace

Network Policies

Consider implementing NetworkPolicies to:

Restrict agent egress to allowed LLM providers
Limit tool access to specific services
Isolate agent namespaces

RBAC

The operator requires specific permissions:

Full access to Omnia CRDs
Read access to ConfigMaps and Secrets
Create/Update access to Deployments and Services

Multi-Tenancy

For team isolation, Omnia provides Workspaces:

Namespace isolation - Each workspace gets a dedicated namespace
Role-based access - Owner, editor, viewer roles with scoped permissions
Resource quotas - Limits on compute, objects, and Omnia resources
IdP integration - Map identity provider groups to workspace roles

See Multi-Tenancy Architecture for details.