Controller Reconciliation

This document explains the reconciliation logic used by Omnia’s Kubernetes controllers.

Reconciliation Pattern

Omnia follows the Kubernetes controller reconciliation pattern:

Watch - Monitor resources for changes
Queue - Add changed resources to work queue
Reconcile - Process each resource to achieve desired state
Requeue - Retry on transient failures

AgentRuntime Controller

Reconciliation Flow

flowchart TD
    A[AgentRuntime Change Detected] --> B{Validate Spec}
    B -->|Valid| C[Fetch PromptPack]
    B -->|Invalid| F1[Set Failed Status]
    C -->|Found| D{ToolRegistry Referenced?}
    C -->|Not Found| F1
    D -->|Yes| E[Fetch ToolRegistry]
    D -->|No| G[Build Deployment Spec]
    E -->|Found| G
    E -->|Not Found| F1
    G --> H[Create/Update Deployment]
    H --> I[Create/Update Service]
    I --> J[Update Status: Running]

Deployment Building

The controller builds the Deployment spec with:

Container image - From operator configuration
Environment variables:
- OMNIA_AGENT_NAME - AgentRuntime name
- OMNIA_NAMESPACE - Namespace
- OMNIA_PROVIDER_* - Provider configuration
- OMNIA_SESSION_* - Session configuration
Volume mounts - PromptPack ConfigMap
Resource limits - From spec
Labels - For identification and selection

Status Updates

The controller updates status with:

phase - Current lifecycle phase
replicas - Desired and ready counts
conditions - Detailed state information

Watched Resources

The AgentRuntime controller watches:

AgentRuntime resources (primary)
PromptPack resources (to detect changes)
ToolRegistry resources (to detect changes)

PromptPack Controller

Reconciliation Flow

flowchart TD
    A[PromptPack Change Detected] --> B[Fetch ConfigMap]
    B -->|Found| C{Validate Content}
    B -->|Not Found| F[Set Failed Status]
    C -->|Valid| D[Find Referencing AgentRuntimes]
    C -->|Invalid| F
    D --> E{Rollout Strategy?}
    E -->|Immediate| G[Update activeVersion]
    E -->|Canary| H[Update canaryVersion]
    G --> I[Notify Agents]
    H --> J{Weight = 100%?}
    J -->|Yes| K[Promote to Active]
    J -->|No| I
    K --> I
    I --> L[Update Status: Active]

Rollout Strategies

Immediate

Changes apply immediately:

Validate new content
Update activeVersion
Agents pick up changes on next request

Canary

Gradual rollout:

Validate new content
Set canaryVersion
Route percentage of traffic to canary
Monitor for issues
Promote when weight reaches 100%

Watched Resources

The PromptPack controller watches:

PromptPack resources (primary)
ConfigMaps (to detect content changes)

ToolRegistry Controller

Reconciliation Flow

flowchart TD
    A[ToolRegistry Change Detected] --> B[Process Handlers]
    B --> C{Handler Type}
    C -->|HTTP/gRPC| D[Use Explicit Schema]
    C -->|MCP/OpenAPI| E[Discover Tools from Service]
    D --> F[Find Matching Services]
    E --> F
    F --> G{Services Available?}
    G -->|All| H[Status: Ready]
    G -->|Some| I[Status: Degraded]
    G -->|None| J[Status: Failed]
    H --> K[Update discoveredTools]
    I --> K
    J --> K

Tool Discovery

For selector-based tools:

Find Services matching labels
Parse tool metadata from annotations
Determine endpoint URL from Service
Check Service has ready endpoints

Status Phases

Phase	Condition
`Ready`	All tools available
`Degraded`	Some tools unavailable
`Failed`	No tools available

Watched Resources

The ToolRegistry controller watches:

ToolRegistry resources (primary)
Services (to detect tool availability changes)

Error Handling

Transient Errors

On transient errors (network issues, API rate limits):

Log the error
Set status condition to reflect issue
Requeue with exponential backoff
Retry reconciliation

Permanent Errors

On permanent errors (invalid spec, missing resources):

Log the error
Set phase to Failed
Set condition with error message
Do not requeue (wait for spec change)

Concurrency

Controllers process resources concurrently:

Default: 1 worker per controller
Configurable via operator flags
Safe: Each resource reconciled by one worker at a time

Requeuing

Controllers requeue resources to:

Retry after transient failures
Check status of dependent resources
Implement polling for external state

Requeue intervals:

Scenario	Interval
Success	Not requeued
Transient error	5s - 5m (exponential)
Waiting for dependency	30s