
# Troubleshoot Arena Fleet

This guide helps you diagnose and resolve common issues with Arena Fleet evaluations.

## Source Stuck in Pending

**Symptoms:** ArenaSource stays in the `Pending` phase.

```bash
kubectl get arenasource my-source
# NAME        TYPE   PHASE     AGE
# my-source   git    Pending   5m
```

**Diagnosis:**

```bash
kubectl describe arenasource my-source
```

**Common Causes:**

1. **Invalid Git URL:** `failed to clone: repository not found`
   - Verify the URL is correct and accessible
   - Check whether the repository is private and needs credentials
2. **Missing credentials:** `authentication required`
   - Create and reference a credentials secret
   - Verify the secret exists in the same namespace
3. **Network issues:** `dial tcp: lookup github.com: no such host`
   - Check cluster DNS resolution
   - Verify network policies allow egress

**Resolution:**

```yaml
# For private repositories, add secretRef
spec:
  git:
    url: https://github.com/org/private-repo
    secretRef:
      name: git-credentials
```
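A minimal sketch of creating that secret, assuming the operator reads standard `username`/`password` keys (the key names are an assumption; confirm them against the ArenaSource CRD reference):

```bash
# Key names here are hypothetical — match them to what the operator expects
kubectl create secret generic git-credentials \
  --from-literal=username=<git-user> \
  --from-literal=password=<personal-access-token>
```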

## Source in Error Phase

**Symptoms:** Source shows the `Error` phase.

```bash
kubectl get arenasource my-source -o jsonpath='{.status.conditions}'
```

**Common Causes:**

1. **Invalid path:** the specified path doesn’t exist in the repository
2. **Invalid ref:** the branch, tag, or commit doesn’t exist
3. **Timeout:** the source fetch took too long

**Resolution:**

```yaml
# Verify the path and ref exist
spec:
  git:
    url: https://github.com/org/repo
    ref:
      branch: main   # Verify this branch exists
    path: prompts/   # Verify this path exists
    timeout: 120s    # Increase if needed
```
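Before editing the spec, you can confirm the ref actually exists on the remote with plain Git:

```bash
# Prints the ref if it exists; empty output means the branch is missing
git ls-remote https://github.com/org/repo refs/heads/main
```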

## ConfigMap Changes Not Detected

**Symptoms:** ConfigMap changes don’t trigger source updates.

**Cause:** ArenaSource watches the ConfigMap’s `resourceVersion`. It only changes when the object is actually modified, so an apply that leaves the content identical triggers nothing.

**Resolution:** Ensure the ConfigMap is actually being modified:

```bash
# Check the ConfigMap resourceVersion
kubectl get configmap my-prompts -o jsonpath='{.metadata.resourceVersion}'

# Force an update by touching the ConfigMap
kubectl patch configmap my-prompts -p '{"metadata":{"annotations":{"updated":"'$(date +%s)'"}}}'
```
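Then watch the source to confirm it picks up the new revision:

```bash
# Watch for a phase/revision change after the patch
kubectl get arenasource my-source -w
```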

## Config Validation Fails

**Symptoms:** ArenaConfig phase is `Invalid`.

```bash
kubectl describe arenaconfig my-config
```

**Common Causes:**

1. **Source not ready:** `ArenaSource "my-source" is not ready`
   - Fix the ArenaSource first
2. **Provider not found:** `Provider "claude-provider" not found`
   - Verify the Provider exists in the referenced namespace
3. **Invalid scenario filters:** `no scenarios match the specified filters`
   - Check that include/exclude patterns match your bundle

**Resolution:**

```bash
# Check that the referenced resources exist
kubectl get arenasource my-source
kubectl get provider claude-provider

# Verify scenario patterns
kubectl get arenaconfig my-config -o jsonpath='{.spec.scenarios}'
```
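For orientation, a minimal sketch of the references an ArenaConfig must resolve. `sourceRef` appears elsewhere in this guide; `providerRef` is a hypothetical field name, so confirm it against the ArenaConfig CRD reference:

```yaml
# providerRef is an assumed field name — check the CRD reference
spec:
  sourceRef:
    name: my-source        # must be ready
  providerRef:
    name: claude-provider  # must exist in the referenced namespace
```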

## No Scenarios Matched

**Symptoms:** Config shows `scenarioCount: 0`.

**Causes:**

- Include patterns don’t match any scenarios
- Exclude patterns filter out all scenarios
- The bundle doesn’t contain scenarios

**Resolution:**

```bash
# Check the bundle content
kubectl get configmap my-prompts -o jsonpath='{.data.pack\.json}' | jq '.scenarios'
```

```yaml
# Adjust filters or use a wildcard
spec:
  scenarios:
    include:
      - "*"   # Include all scenarios
```

## Job Stuck in Pending

**Symptoms:** Job stays in the `Pending` phase.

**Diagnosis:**

```bash
kubectl describe arenajob my-job
kubectl get events --field-selector involvedObject.name=my-job
```

**Common Causes:**

1. **Config not ready:** `ArenaConfig "my-config" is not ready`
   - Fix the ArenaConfig first
2. **Insufficient resources:** `0/3 nodes are available: insufficient cpu`
   - Reduce worker replicas or add cluster capacity
3. **Image pull errors:** `Failed to pull image "ghcr.io/altairalabs/arena-worker"`
   - Check image pull secrets
   - Verify the image exists

**Resolution:**

```yaml
# Reduce resource requirements
spec:
  workers:
    replicas: 1   # Start with fewer workers
```
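If the scheduler reports insufficient CPU, check what the nodes can actually fit before scaling back up:

```bash
# Allocated vs. allocatable resources per node
kubectl describe nodes | grep -A 5 "Allocated resources"

# Live usage (requires metrics-server)
kubectl top nodes
```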

## Worker Pods Crashing

**Symptoms:** Worker pods show `CrashLoopBackOff` or frequent restarts.

**Diagnosis:**

```bash
kubectl logs -l arena.omnia.altairalabs.ai/job=my-job --previous
kubectl describe pod <worker-pod-name>
```

**Common Causes:**

1. **Out of memory:** `OOMKilled`
   - Increase worker memory limits
2. **Provider errors:** `Error: rate limit exceeded`
   - Reduce concurrency
   - Check your provider quota
3. **Invalid bundle:** `Error: failed to parse pack.json`
   - Validate your PromptKit bundle

**Resolution:**

```yaml
# Increase resources in Helm values
arena:
  worker:
    resources:
      limits:
        memory: 1Gi
      requests:
        memory: 512Mi
```
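To confirm a restart was memory-related, inspect the last terminated state of the worker container:

```bash
# Prints "OOMKilled" if the container exceeded its memory limit
kubectl get pod <worker-pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```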

## High Scenario Failure Rate

**Symptoms:** Many scenarios fail during evaluation.

**Diagnosis:**

```bash
# Check job progress
kubectl get arenajob my-job -o jsonpath='{.status.progress}'

# View worker logs for errors
kubectl logs -l arena.omnia.altairalabs.ai/job=my-job | grep -i "error\|failed"
```

**Common Causes:**

1. **Assertion failures (expected):**
   - Review your assertion definitions
   - Adjust expected values
2. **Provider rate limits:** `Error: 429 Too Many Requests`
   - Reduce concurrency
   - Add delays between requests
3. **Timeouts:** `Error: context deadline exceeded`
   - Increase the evaluation timeout
   - Check for slow providers

Resolution:

spec:
sourceRef:
name: my-source
# Reduce concurrency and increase timeout
evaluation:
timeout: "10m"
concurrency: 1
maxRetries: 5
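After applying the change, watch the job to see whether the failure rate levels off (the exact columns shown depend on the CRD’s printer columns):

```bash
# Re-prints the job row on each status update
kubectl get arenajob my-job -w
```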

## Results Not Uploaded

**Symptoms:** The job succeeds but no results appear in S3 or on the PVC.

**Diagnosis:**

```bash
# Check the job status for a result URL
kubectl get arenajob my-job -o jsonpath='{.status.result}'

# Check worker logs for storage errors
kubectl logs -l arena.omnia.altairalabs.ai/job=my-job | grep -i "s3\|storage\|upload"
```

**Common Causes:**

1. **Missing credentials:** the S3 secret is not found or is invalid
2. **Bucket doesn’t exist:** the bucket must be created in advance
3. **Permission denied:** the IAM policy doesn’t allow writes

**Resolution:**

```bash
# Test S3 access from within the cluster
kubectl run s3-test --rm -it --image=amazon/aws-cli -- \
  s3 ls s3://my-bucket/

# Check that the secret exists
kubectl get secret arena-s3-credentials -o yaml
```
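The test pod above has no credentials of its own. A sketch that injects them from the secret, assuming `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` data keys (key names are an assumption; match them to your secret):

```bash
# Data key names are assumed — inspect the secret to confirm them
kubectl run s3-test --rm -it --image=amazon/aws-cli \
  --env="AWS_ACCESS_KEY_ID=$(kubectl get secret arena-s3-credentials -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d)" \
  --env="AWS_SECRET_ACCESS_KEY=$(kubectl get secret arena-s3-credentials -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d)" \
  -- s3 ls s3://my-bucket/
```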

## Job Not Making Progress

**Symptoms:** The job is running but progress is not advancing.

**Diagnosis:**

```bash
# Check Redis connectivity
kubectl exec -it <operator-pod> -- redis-cli -h omnia-redis-master ping

# Check queue depth
kubectl exec -it <operator-pod> -- redis-cli -h omnia-redis-master llen arena:queue:my-job
```

**Common Causes:**

1. **Redis not reachable:** check the Redis service and pods
2. **Queue configuration mismatch:** workers and controller are using different queues
3. **Redis authentication:** password mismatch

Resolution:

# Verify Redis configuration in Helm values
arena:
queue:
type: redis
redis:
host: "omnia-redis-master"
port: 6379
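If workers and the controller might disagree on queue names, list the Arena keys Redis actually holds and compare them against your job names:

```bash
# KEYS is fine for one-off debugging, but avoid it on busy production instances
kubectl exec -it <operator-pod> -- redis-cli -h omnia-redis-master keys 'arena:queue:*'
```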

## Operator Not Reconciling

**Symptoms:** CRDs are created but nothing happens.

**Diagnosis:**

```bash
# Check operator logs
kubectl logs -n omnia-system deployment/omnia-controller-manager | grep -i arena

# Check whether the Arena controllers are enabled
kubectl get deployment omnia-controller-manager -o yaml | grep -i arena
```

**Common Causes:**

1. **Arena not enabled:** the feature is disabled in Helm values
2. **RBAC issues:** the controller is missing permissions
3. **CRDs not installed:** the Arena CRDs are not present

**Resolution:**

```yaml
# Enable Arena in Helm values
arena:
  enabled: true
```
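Then roll the change out with Helm. The release and chart names below are placeholders; use the ones from your installation:

```bash
# <release> and <chart> are placeholders for your actual Helm release/chart
helm upgrade <release> <chart> -n omnia-system --reuse-values --set arena.enabled=true
```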
```bash
# Verify the CRDs exist
kubectl get crd arenasources.omnia.altairalabs.ai
kubectl get crd arenaconfigs.omnia.altairalabs.ai
kubectl get crd arenajobs.omnia.altairalabs.ai
```
## General Debugging

```bash
# Check all Arena resources
kubectl get arenasource,arenaconfig,arenajob -A

# Check operator logs for errors
kubectl logs -n omnia-system deployment/omnia-controller-manager --tail=100 | grep -i "error\|arena"
```
```bash
# Enable debug logging (requires an operator restart)
kubectl set env -n omnia-system deployment/omnia-controller-manager LOG_LEVEL=debug

# Stream all Arena-related logs
kubectl logs -n omnia-system deployment/omnia-controller-manager -f | grep -i arena
```
## Cleaning Up Stuck Jobs

```bash
# Delete stuck jobs
kubectl delete arenajob --all

# Force delete with finalizer removal (use with caution)
kubectl patch arenajob my-job -p '{"metadata":{"finalizers":null}}' --type=merge
kubectl delete arenajob my-job
```
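Note that removing finalizers skips the controller’s cleanup logic, so resources the job created (worker pods, queue entries, uploaded artifacts) may be left behind and need manual deletion.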

## Getting Help

If you’re still experiencing issues:

1. Check the Arena Fleet Architecture guide for a conceptual overview
2. Review the CRD references: ArenaSource, ArenaConfig, ArenaJob
3. Search existing issues or open a new one on GitHub