
Monitor Arena Jobs

This guide covers how to monitor Arena Fleet jobs in real time, track progress, and access evaluation results.

View the current state of an ArenaJob:

kubectl get arenajob my-eval

Output:

NAME      PHASE     PROGRESS   WORKERS   AGE
my-eval   Running   45/100     3/3       2m
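
Because this output is plain text, a script can pick out individual columns with ordinary shell word splitting. A minimal sketch using the sample row above (the variable name is an assumption):

```shell
#!/bin/sh
# Split one row of `kubectl get arenajob` output into positional fields
row='my-eval Running 45/100 3/3 2m'
set -- $row
# $1=NAME  $2=PHASE  $3=PROGRESS  $4=WORKERS  $5=AGE
echo "phase=$2 progress=$3"   # -> phase=Running progress=45/100
```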

Get full status details:

kubectl get arenajob my-eval -o yaml

Key status fields:

status:
  phase: Running
  progress:
    total: 100       # Total scenarios to evaluate
    completed: 45    # Successfully completed
    failed: 2        # Failed evaluations
    pending: 53      # Waiting to run
  activeWorkers: 3
  startTime: "2025-01-18T10:00:00Z"
  conditions:
    - type: Ready
      status: "True"
    - type: Progressing
      status: "True"
      message: "45/100 scenarios completed"
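
The progress counters are easy to turn into a completion percentage in a script. A minimal sketch (the helper name is an assumption; the commented kubectl lines assume a job named my-eval):

```shell
#!/bin/sh
# percent_complete COMPLETED TOTAL -> integer percentage of scenarios done
percent_complete() {
  echo $((100 * $1 / $2))
}

# On a live cluster, the counters come from the ArenaJob status, e.g.:
#   completed=$(kubectl get arenajob my-eval -o jsonpath='{.status.progress.completed}')
#   total=$(kubectl get arenajob my-eval -o jsonpath='{.status.progress.total}')
percent_complete 45 100   # -> 45
```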

Monitor job progress as it runs:

kubectl get arenajob my-eval -w

Or use watch for periodic updates:

watch -n 5 kubectl get arenajob my-eval

An ArenaJob moves through the following phases:

PHASE       DESCRIPTION
Pending     Job created, waiting to start
Running     Workers are actively processing scenarios
Succeeded   All scenarios completed successfully
Failed      Job failed (failure threshold exceeded or error)
Cancelled   Job was manually cancelled
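
Scripts that wait for a job to finish only need to distinguish terminal phases (Succeeded, Failed, Cancelled) from the rest. A minimal sketch (the helper name is an assumption; the commented polling loop assumes a job named my-eval):

```shell
#!/bin/sh
# is_terminal PHASE -> exit 0 if the phase will not change again
is_terminal() {
  case "$1" in
    Succeeded|Failed|Cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# On a live cluster, poll until the job reaches a terminal phase, e.g.:
#   until is_terminal "$(kubectl get arenajob my-eval -o jsonpath='{.status.phase}')"; do
#     sleep 5
#   done
is_terminal Running && echo "terminal" || echo "still running"   # -> still running
```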

List the worker pods for a job:

kubectl get pods -l arena.omnia.altairalabs.ai/job=my-eval

Output:

NAME                   READY   STATUS    RESTARTS   AGE
my-eval-worker-abc12   1/1     Running   0          2m
my-eval-worker-def34   1/1     Running   0          2m
my-eval-worker-ghi56   1/1     Running   0          2m

Stream logs from all workers:

kubectl logs -l arena.omnia.altairalabs.ai/job=my-eval -f

Logs from a specific worker:

kubectl logs my-eval-worker-abc12 -f

Check worker resource usage:

kubectl top pods -l arena.omnia.altairalabs.ai/job=my-eval

For completed jobs, results are summarized in the status:

kubectl get arenajob my-eval -o jsonpath='{.status.result.summary}'

If output storage is configured, get the result location:

kubectl get arenajob my-eval -o jsonpath='{.status.result.url}'

For S3 storage:

# Get the result prefix
RESULT_URL=$(kubectl get arenajob my-eval -o jsonpath='{.status.result.url}')
aws s3 cp "$RESULT_URL/results.json" ./results.json
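
The result URL may be empty if output storage is not configured or the job has not finished, so it is worth guarding before copying. A small sketch (the helper name and the sample URL are assumptions):

```shell
#!/bin/sh
# require_s3_url URL -> exit 0 only for a non-empty s3:// URL
require_s3_url() {
  case "$1" in
    s3://*) return 0 ;;
    *) echo "no S3 result URL yet: '$1'" >&2; return 1 ;;
  esac
}

RESULT_URL="s3://arena-results/my-eval"   # illustrative value only
require_s3_url "$RESULT_URL" && echo "ok: $RESULT_URL"
```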

For PVC storage:

# Port-forward or exec into a pod to access PVC
kubectl cp <pod>:/path/to/results ./results

Arena Fleet exposes metrics for monitoring with Prometheus.

METRIC                           DESCRIPTION
arena_job_phase                  Current job phase (gauge)
arena_job_progress_total         Total scenarios in job
arena_job_progress_completed     Completed scenarios
arena_job_progress_failed        Failed scenarios
arena_job_duration_seconds       Job execution duration
arena_scenario_latency_seconds   Per-scenario LLM latency
arena_scenario_tokens_total      Token usage per scenario

Total running jobs:

count(arena_job_phase{phase="Running"})

Job completion rate:

arena_job_progress_completed / arena_job_progress_total

Average scenario latency:

avg(arena_scenario_latency_seconds) by (job_name, provider)

Failed scenario rate:

rate(arena_job_progress_failed[5m])
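
Queries like the completion rate above can also be pre-computed as Prometheus recording rules, so dashboards and alerts share one definition. A sketch (the rule and group names are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arena-recording
spec:
  groups:
    - name: arena-recording
      rules:
        # Fraction of scenarios completed per job
        - record: arena:job_completion_ratio
          expr: arena_job_progress_completed / arena_job_progress_total
        # Rate of newly failed scenarios
        - record: arena:scenario_failure_rate
          expr: rate(arena_job_progress_failed[5m])
```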

If Grafana is enabled, Arena metrics are available for visualization.

Job Progress:

arena_job_progress_completed{job_name="$job"}

Scenario Latency Histogram:

histogram_quantile(0.95, arena_scenario_latency_seconds_bucket)

Token Usage Over Time:

sum(rate(arena_scenario_tokens_total[5m])) by (provider)

View events related to Arena jobs:

kubectl get events --field-selector involvedObject.name=my-eval

Key events to watch for:

EVENT               MEANING
JobStarted          Job execution began
WorkersCreated      Worker pods created
ScenarioCompleted   Individual scenario finished
JobSucceeded        Job completed successfully
JobFailed           Job failed
RetryScheduled      Failed scenario being retried

Create a PrometheusRule to alert on failure conditions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: arena-alerts
spec:
  groups:
    - name: arena
      rules:
        - alert: ArenaJobFailed
          expr: arena_job_phase{phase="Failed"} == 1
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Arena job {{ $labels.job_name }} failed"
            description: "Job has been in Failed state for more than 1 minute"
        - alert: ArenaHighFailureRate
          expr: |
            (arena_job_progress_failed / arena_job_progress_total) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Arena job {{ $labels.job_name }} has >10% failure rate"
        - alert: ArenaSlowEvaluation
          expr: |
            avg(arena_scenario_latency_seconds) by (job_name) > 60
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Arena job {{ $labels.job_name }} has slow evaluations (>60s avg)"

Stop a running job:

kubectl delete arenajob my-eval

Or patch to cancel while preserving the resource:

kubectl patch arenajob my-eval --type=merge -p '{"spec":{"suspend":true}}'

If a job is stuck or failing, inspect its conditions:

kubectl describe arenajob my-eval | grep -A 10 Conditions

Check worker logs for failures:

kubectl logs -l arena.omnia.altairalabs.ai/job=my-eval | grep -i error

Common failure reasons and their resolutions:

REASON              RESOLUTION
ConfigNotReady      Check the ArenaConfig status
SourceFetchFailed   Verify the ArenaSource can fetch the bundle
ProviderError       Check provider credentials and limits
Timeout             Increase the evaluation timeout
AssertionFailed     Expected behavior; check test assertions
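
When triaging, it can help to tally error lines from worker logs saved locally (e.g. with `kubectl logs ... > worker.log`). A minimal end-to-end sketch; the sample log content here is illustrative:

```shell
#!/bin/sh
# Write an illustrative sample log, then count error lines the same way
# you would against a real saved worker log
cat <<'EOF' > /tmp/arena-worker.log
scenario-001 completed
scenario-002 ERROR provider timeout
scenario-003 completed
scenario-004 ERROR assertion failed
EOF

# -c counts matching lines, -i matches case-insensitively
grep -ci error /tmp/arena-worker.log   # -> 2
```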