Monitor Arena Jobs
This guide covers how to monitor Arena Fleet jobs in real-time, track progress, and access evaluation results.
Checking Job Status
Basic Status
View the current state of an ArenaJob:

    kubectl get arenajob my-eval

Output:

    NAME      PHASE     PROGRESS   WORKERS   AGE
    my-eval   Running   45/100     3/3       2m

Detailed Status
Get full status details:

    kubectl get arenajob my-eval -o yaml

Key status fields:

    status:
      phase: Running
      progress:
        total: 100        # Total scenarios to evaluate
        completed: 45     # Successfully completed
        failed: 2         # Failed evaluations
        pending: 53       # Waiting to run
      activeWorkers: 3
      startTime: "2025-01-18T10:00:00Z"
      conditions:
        - type: Ready
          status: "True"
        - type: Progressing
          status: "True"
          message: "45/100 scenarios completed"

Watch Progress in Real-Time
Monitor job progress as it runs:

    kubectl get arenajob my-eval -w

Or use watch for periodic updates:

    watch -n 5 kubectl get arenajob my-eval

Understanding Job Phases
| Phase | Description |
|---|---|
| Pending | Job created, waiting to start |
| Running | Workers are actively processing scenarios |
| Succeeded | All scenarios completed successfully |
| Failed | Job failed (threshold exceeded or error) |
| Cancelled | Job was manually cancelled |
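
The phase table lends itself to a small polling helper for scripts and CI pipelines. The following is a minimal POSIX shell sketch, not part of Arena Fleet: the function names `is_terminal_phase` and `wait_for_arenajob` are hypothetical, and it assumes that Succeeded, Failed, and Cancelled are the only phases a job never leaves, which the table suggests but does not state outright.

```shell
# Hypothetical helpers; only the kubectl command and .status.phase come from this guide.

# Succeeds (exit 0) if the given phase is assumed to be final.
is_terminal_phase() {
  case "$1" in
    Succeeded|Failed|Cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Polls the job until it reaches a final phase.
# Usage: wait_for_arenajob NAME [INTERVAL_SECONDS]
# Returns 0 on Succeeded, 1 on Failed or Cancelled, 2 if kubectl fails.
# Variables are prefixed with wait_ because POSIX sh has no local scope.
wait_for_arenajob() {
  wait_name=$1
  wait_interval=${2:-5}
  while :; do
    wait_phase=$(kubectl get arenajob "$wait_name" -o jsonpath='{.status.phase}') || return 2
    echo "$wait_name: ${wait_phase:-<no phase yet>}"
    if is_terminal_phase "$wait_phase"; then
      [ "$wait_phase" = "Succeeded" ] && return 0
      return 1
    fi
    sleep "$wait_interval"
  done
}
```

For example, `wait_for_arenajob my-eval 10 && echo "evaluation passed"` would block until the job finishes and print the message only if it succeeded.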
Viewing Worker Status
List Worker Pods

    kubectl get pods -l arena.omnia.altairalabs.ai/job=my-eval

Output:

    NAME                   READY   STATUS    RESTARTS   AGE
    my-eval-worker-abc12   1/1     Running   0          2m
    my-eval-worker-def34   1/1     Running   0          2m
    my-eval-worker-ghi56   1/1     Running   0          2m

View Worker Logs
Stream logs from all workers:

    kubectl logs -l arena.omnia.altairalabs.ai/job=my-eval -f

Logs from a specific worker:

    kubectl logs my-eval-worker-abc12 -f

Check Worker Resource Usage

    kubectl top pods -l arena.omnia.altairalabs.ai/job=my-eval

Accessing Results
From Job Status
For completed jobs, results are summarized in the status:

    kubectl get arenajob my-eval -o jsonpath='{.status.result.summary}'

Result URL
If output storage is configured, get the result location:

    kubectl get arenajob my-eval -o jsonpath='{.status.result.url}'

Download Results
For S3 storage:

    # Get the result prefix
    RESULT_URL=$(kubectl get arenajob my-eval -o jsonpath='{.status.result.url}')
    aws s3 cp "$RESULT_URL/results.json" ./results.json

For PVC storage:

    # Port-forward or exec into a pod to access PVC
    kubectl cp <pod>:/path/to/results ./results

Prometheus Metrics
Section titled “Prometheus Metrics”Arena Fleet exposes metrics for monitoring with Prometheus.
Key Metrics
| Metric | Description |
|---|---|
| arena_job_phase | Current job phase (gauge) |
| arena_job_progress_total | Total scenarios in job |
| arena_job_progress_completed | Completed scenarios |
| arena_job_progress_failed | Failed scenarios |
| arena_job_duration_seconds | Job execution duration |
| arena_scenario_latency_seconds | Per-scenario LLM latency |
| arena_scenario_tokens_total | Token usage per scenario |
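
To spot-check one of these metrics without opening a dashboard, you can send an instant query to the Prometheus HTTP API. The sketch below is illustrative only: `PROM_URL` is an assumed address for your Prometheus server, the helper names `completion_query` and `prom_query` are hypothetical, and the `job_name` label is taken from the example queries elsewhere in this guide.

```shell
# Hypothetical helpers; PROM_URL is an assumed address, adjust it for your cluster.
PROM_URL=${PROM_URL:-http://localhost:9090}

# Prints a PromQL expression for the completed fraction of one job.
completion_query() {
  printf 'arena_job_progress_completed{job_name="%s"} / arena_job_progress_total{job_name="%s"}\n' "$1" "$1"
}

# Runs an instant query against the Prometheus HTTP API
# and prints the raw JSON response.
prom_query() {
  curl -sS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

A call such as `prom_query "$(completion_query my-eval)"` would return the JSON result for the job named my-eval, provided the server is reachable at `PROM_URL`.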
Example Prometheus Queries
Total running jobs:

    count(arena_job_phase{phase="Running"} == 1)

Job completion rate:

    arena_job_progress_completed / arena_job_progress_total

Average scenario latency:

    avg(arena_scenario_latency_seconds) by (job_name, provider)

Failed scenario rate:

    rate(arena_job_progress_failed[5m])

Grafana Dashboard
If Grafana is enabled, Arena metrics are available for visualization.
Sample Dashboard Panels
Job Progress:

    arena_job_progress_completed{job_name="$job"}

Scenario Latency Histogram:

    histogram_quantile(0.95, sum(rate(arena_scenario_latency_seconds_bucket[5m])) by (le))

Token Usage Over Time:

    sum(rate(arena_scenario_tokens_total[5m])) by (provider)

Event Monitoring
View events related to Arena jobs:

    kubectl get events --field-selector involvedObject.name=my-eval

Key events to watch for:
| Event | Meaning |
|---|---|
| JobStarted | Job execution began |
| WorkersCreated | Worker pods created |
| ScenarioCompleted | Individual scenario finished |
| JobSucceeded | Job completed successfully |
| JobFailed | Job failed |
| RetryScheduled | Failed scenario being retried |
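
On a busy job, ScenarioCompleted events can drown out the ones you care about. One option is to narrow the field selector to a single reason. The sketch below assumes the names in the Event column are emitted as the event `reason`, which this guide does not confirm; `event_selector` and `arena_events` are hypothetical helper names.

```shell
# Hypothetical helpers; "reason" is a standard field selector for Kubernetes events.

# Prints a field selector for events on one job, optionally limited to one reason.
# Usage: event_selector NAME [REASON]
event_selector() {
  sel="involvedObject.name=$1"
  if [ -n "${2:-}" ]; then
    sel="$sel,reason=$2"
  fi
  printf '%s\n' "$sel"
}

# Lists matching events sorted by last timestamp.
# Usage: arena_events NAME [REASON]
arena_events() {
  kubectl get events \
    --field-selector "$(event_selector "$@")" \
    --sort-by=.lastTimestamp
}
```

For example, `arena_events my-eval RetryScheduled` would show only the retry events for my-eval, while `arena_events my-eval` behaves like the unfiltered command above.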
Setting Up Alerts
Alert on Job Failure

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: arena-alerts
    spec:
      groups:
        - name: arena
          rules:
            - alert: ArenaJobFailed
              expr: arena_job_phase{phase="Failed"} == 1
              for: 1m
              labels:
                severity: warning
              annotations:
                summary: "Arena job {{ $labels.job_name }} failed"
                description: "Job has been in Failed state for more than 1 minute"

Alert on High Failure Rate

    - alert: ArenaHighFailureRate
      expr: |
        (arena_job_progress_failed / arena_job_progress_total) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Arena job {{ $labels.job_name }} has >10% failure rate"

Alert on Slow Evaluations

    - alert: ArenaSlowEvaluation
      expr: |
        avg(arena_scenario_latency_seconds) by (job_name) > 60
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "Arena job {{ $labels.job_name }} has slow evaluations (>60s avg)"

Cancelling a Job
Stop a running job:

    kubectl delete arenajob my-eval

Or patch to cancel while preserving the resource:

    kubectl patch arenajob my-eval --type=merge -p '{"spec":{"suspend":true}}'

Debugging Failed Jobs
Check Job Conditions

    kubectl describe arenajob my-eval | grep -A 10 Conditions

View Failed Scenarios
Check worker logs for failures:

    kubectl logs -l arena.omnia.altairalabs.ai/job=my-eval | grep -i error

Common Failure Reasons
| Reason | Resolution |
|---|---|
| ConfigNotReady | Check ArenaConfig status |
| SourceFetchFailed | Verify ArenaSource can fetch bundle |
| ProviderError | Check provider credentials and limits |
| Timeout | Increase evaluation timeout |
| AssertionFailed | Expected behavior - check test assertions |
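
The table above can be turned into a quick triage helper. The following sketch prints each condition on a job next to the matching resolution. It assumes that every entry in `.status.conditions` carries a `reason` field, as is conventional for Kubernetes conditions but not shown in the status example in this guide; `resolution_hint` and `explain_arenajob` are hypothetical names.

```shell
# Hypothetical helpers. Assumes conditions expose a "reason" field;
# confirm this against the ArenaJob reference before relying on it.

# Prints the resolution from the table above for a given failure reason.
resolution_hint() {
  case "$1" in
    ConfigNotReady)    echo "Check ArenaConfig status" ;;
    SourceFetchFailed) echo "Verify ArenaSource can fetch bundle" ;;
    ProviderError)     echo "Check provider credentials and limits" ;;
    Timeout)           echo "Increase evaluation timeout" ;;
    AssertionFailed)   echo "Expected behavior - check test assertions" ;;
    '')                echo "(no reason reported)" ;;
    *)                 echo "No hint for '$1'; inspect conditions and worker logs" ;;
  esac
}

# Prints one "type<TAB>reason<TAB>hint" line per condition on the job.
# Usage: explain_arenajob NAME
explain_arenajob() {
  kubectl get arenajob "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.reason}{"\n"}{end}' |
  while IFS="$(printf '\t')" read -r cond_type cond_reason; do
    printf '%s\t%s\t%s\n' "$cond_type" "$cond_reason" "$(resolution_hint "$cond_reason")"
  done
}
```

Running `explain_arenajob my-eval` after a failure gives a one-line-per-condition summary that points at the next thing to check.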
Related Resources
- Troubleshoot Arena: Debug common issues
- ArenaJob Reference: Complete status field documentation
- Set Up Observability: Configure Prometheus and Grafana