# Troubleshoot Arena Fleet
This guide helps you diagnose and resolve common issues with Arena Fleet evaluations.
## ArenaSource Issues

### Source Stuck in Pending

**Symptoms:** ArenaSource stays in the `Pending` phase.

```bash
kubectl get arenasource my-source
# NAME        TYPE   PHASE     AGE
# my-source   git    Pending   5m
```

**Diagnosis:**

```bash
kubectl describe arenasource my-source
```

**Common Causes:**
1. **Invalid Git URL:** `failed to clone: repository not found`
   - Verify the URL is correct and accessible
   - Check if the repository is private and needs credentials
2. **Missing credentials:** `authentication required`
   - Create and reference a credentials secret
   - Verify the secret exists in the same namespace
3. **Network issues:** `dial tcp: lookup github.com: no such host`
   - Check cluster DNS resolution
   - Verify network policies allow egress
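For the missing-credentials case, the ArenaSource needs a secret it can reference. A minimal sketch of such a secret, assuming basic-auth over HTTPS; the username and token values are placeholders, and the name matches the `secretRef` used in the resolution below:

```yaml
# Hypothetical basic-auth secret for a private Git repository.
# Uses the standard kubernetes.io/basic-auth secret type;
# replace the placeholder values with real credentials.
apiVersion: v1
kind: Secret
metadata:
  name: git-credentials
type: kubernetes.io/basic-auth
stringData:
  username: git
  password: <personal-access-token>
```

Create it in the same namespace as the ArenaSource, since cross-namespace secret references are not resolved.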
**Resolution:**

```yaml
# For private repositories, add secretRef
spec:
  git:
    url: https://github.com/org/private-repo
    secretRef:
      name: git-credentials
```

### Source Fetch Errors

**Symptoms:** Source shows the `Error` phase.
```bash
kubectl get arenasource my-source -o jsonpath='{.status.conditions}'
```

**Common Causes:**
- **Invalid path:** the specified path doesn't exist in the repository
- **Invalid ref:** the branch, tag, or commit doesn't exist
- **Timeout:** the source fetch took too long

**Resolution:**
```yaml
# Verify the path and ref exist
spec:
  git:
    url: https://github.com/org/repo
    ref:
      branch: main   # Verify this branch exists
    path: prompts/   # Verify this path exists
    timeout: 120s    # Increase if needed
```

### ConfigMap Source Not Updating

**Symptoms:** ConfigMap changes don't trigger source updates.

**Cause:** ArenaSource watches the ConfigMap's `resourceVersion`, which only changes when the ConfigMap is actually modified; an apply that results in no change triggers no update.

**Resolution:** Ensure the ConfigMap is actually being modified:
```bash
# Check ConfigMap resourceVersion
kubectl get configmap my-prompts -o jsonpath='{.metadata.resourceVersion}'

# Force update by touching the ConfigMap
kubectl patch configmap my-prompts -p '{"metadata":{"annotations":{"updated":"'$(date +%s)'"}}}'
```

## ArenaConfig Issues

### Config Shows Invalid

**Symptoms:** ArenaConfig phase is `Invalid`.
```bash
kubectl describe arenaconfig my-config
```

**Common Causes:**
1. **Source not ready:** `ArenaSource "my-source" is not ready`
   - Fix the ArenaSource first
2. **Provider not found:** `Provider "claude-provider" not found`
   - Verify the Provider exists in the referenced namespace
3. **Invalid scenario filters:** `no scenarios match the specified filters`
   - Check that include/exclude patterns match your bundle
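When checking filters, it helps to compare the patterns against the actual scenario IDs in your bundle. A hypothetical filter block; glob-style matching and the scenario names are assumptions, so substitute IDs from your own bundle:

```yaml
# Hypothetical include/exclude patterns; scenario names are placeholders
spec:
  scenarios:
    include:
      - "chat-*"          # Must match scenario IDs present in the bundle
    exclude:
      - "chat-legacy-*"   # Removes matching scenarios from the included set
```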
**Resolution:**

```bash
# Check referenced resources exist
kubectl get arenasource my-source
kubectl get provider claude-provider

# Verify scenario patterns
kubectl get arenaconfig my-config -o jsonpath='{.spec.scenarios}'
```

### Zero Scenarios Resolved

**Symptoms:** Config shows `scenarioCount: 0`.

**Causes:**
- Include patterns don’t match any scenarios
- Exclude patterns filter out all scenarios
- Bundle doesn’t contain scenarios
**Resolution:**

```bash
# Check the bundle content
kubectl get configmap my-prompts -o jsonpath='{.data.pack\.json}' | jq '.scenarios'
```

```yaml
# Adjust filters or use a wildcard
spec:
  scenarios:
    include:
      - "*"  # Include all scenarios
```

## ArenaJob Issues

### Job Stuck in Pending

**Symptoms:** Job stays in the `Pending` phase.

**Diagnosis:**
```bash
kubectl describe arenajob my-job
kubectl get events --field-selector involvedObject.name=my-job
```

**Common Causes:**
1. **Config not ready:** `ArenaConfig "my-config" is not ready`
   - Fix the ArenaConfig first
2. **Insufficient resources:** `0/3 nodes are available: insufficient cpu`
   - Reduce worker replicas or add cluster capacity
3. **Image pull errors:** `Failed to pull image "ghcr.io/altairalabs/arena-worker"`
   - Check image pull secrets
   - Verify the image exists
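For the image-pull case, the worker pods need a registry pull secret if the image is private. A sketch of such a secret for ghcr.io; the secret name and credentials are placeholders, and how Arena attaches it to worker pods depends on your Helm values:

```yaml
# Hypothetical pull secret for ghcr.io; username and token are placeholders
apiVersion: v1
kind: Secret
metadata:
  name: ghcr-pull-secret
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {"auths":{"ghcr.io":{"username":"<user>","password":"<token>"}}}
```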
**Resolution:**

```yaml
# Reduce resource requirements
spec:
  workers:
    replicas: 1  # Start with fewer workers
```

### Workers Crash or Restart

**Symptoms:** Worker pods show `CrashLoopBackOff` or frequent restarts.

**Diagnosis:**

```bash
kubectl logs -l arena.omnia.altairalabs.ai/job=my-job --previous
kubectl describe pod <worker-pod-name>
```

**Common Causes:**
1. **Out of memory:** `OOMKilled`
   - Increase worker memory limits
2. **Provider errors:** `Error: rate limit exceeded`
   - Reduce concurrency
   - Check provider quota
3. **Invalid bundle:** `Error: failed to parse pack.json`
   - Validate your PromptKit bundle
**Resolution:**

```yaml
# Increase resources in Helm values
arena:
  worker:
    resources:
      limits:
        memory: 1Gi
      requests:
        memory: 512Mi
```

### High Failure Rate

**Symptoms:** Many scenarios fail during evaluation.

**Diagnosis:**
```bash
# Check job progress
kubectl get arenajob my-job -o jsonpath='{.status.progress}'

# View worker logs for errors
kubectl logs -l arena.omnia.altairalabs.ai/job=my-job | grep -i "error\|failed"
```

**Common Causes:**
1. **Assertion failures (expected):**
   - Review assertion definitions
   - Adjust expected values
2. **Provider rate limits:** `Error: 429 Too Many Requests`
   - Reduce concurrency
   - Add delays between requests
3. **Timeouts:** `Error: context deadline exceeded`
   - Increase the evaluation timeout
   - Check for slow providers
**Resolution:**

```yaml
spec:
  sourceRef:
    name: my-source
  # Reduce concurrency and increase timeout
  evaluation:
    timeout: "10m"
    concurrency: 1
    maxRetries: 5
```

### Results Not Stored

**Symptoms:** Job succeeds but no results appear in S3/PVC.

**Diagnosis:**
```bash
# Check job status for result URL
kubectl get arenajob my-job -o jsonpath='{.status.result}'

# Check worker logs for storage errors
kubectl logs -l arena.omnia.altairalabs.ai/job=my-job | grep -i "s3\|storage\|upload"
```

**Common Causes:**
- **Missing credentials:** S3 secret not found or invalid
- **Bucket doesn't exist:** the bucket must be pre-created
- **Permission denied:** the IAM policy doesn't allow writes
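For the permission-denied case, the identity the workers use must be allowed to write to the bucket. A minimal IAM policy sketch; the bucket name and action list are assumptions for illustration, and note that `s3:ListBucket` applies to the bucket ARN while object actions apply to `<bucket>/*`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```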
**Resolution:**
```bash
# Test S3 access from within the cluster
kubectl run s3-test --rm -it --image=amazon/aws-cli -- \
  s3 ls s3://my-bucket/

# Check the secret exists
kubectl get secret arena-s3-credentials -o yaml
```

## Queue Issues (Redis)

### Workers Not Processing

**Symptoms:** Job is running but progress is not advancing.

**Diagnosis:**
```bash
# Check Redis connectivity
kubectl exec -it <operator-pod> -- redis-cli -h omnia-redis-master ping

# Check queue depth
kubectl exec -it <operator-pod> -- redis-cli -h omnia-redis-master llen arena:queue:my-job
```

**Common Causes:**
- **Redis not reachable:** check the Redis service and pods
- **Queue configuration mismatch:** workers and controller are using different queues
- **Redis authentication:** password mismatch

**Resolution:**
```yaml
# Verify Redis configuration in Helm values
arena:
  queue:
    type: redis
    redis:
      host: "omnia-redis-master"
      port: 6379
```

## Controller Issues

### Controllers Not Reconciling

**Symptoms:** CRDs are created but nothing happens.

**Diagnosis:**
```bash
# Check operator logs
kubectl logs -n omnia-system deployment/omnia-controller-manager | grep -i arena

# Check if Arena controllers are enabled
kubectl get deployment omnia-controller-manager -o yaml | grep -i arena
```

**Common Causes:**
- **Arena not enabled:** the feature is disabled in Helm values
- **RBAC issues:** the controller is missing permissions
- **CRD not installed:** the Arena CRDs are not present

**Resolution:**
```yaml
# Enable Arena in Helm values
arena:
  enabled: true
```

```bash
# Verify CRDs exist
kubectl get crd arenasources.omnia.altairalabs.ai
kubectl get crd arenaconfigs.omnia.altairalabs.ai
kubectl get crd arenajobs.omnia.altairalabs.ai
```

## Debugging Commands Reference

### Quick Health Check
```bash
# Check all Arena resources
kubectl get arenasource,arenaconfig,arenajob -A

# Check operator logs for errors
kubectl logs -n omnia-system deployment/omnia-controller-manager --tail=100 | grep -i "error\|arena"
```

### Verbose Debugging
```bash
# Enable debug logging (requires an operator restart)
kubectl set env -n omnia-system deployment/omnia-controller-manager LOG_LEVEL=debug

# Stream all Arena-related logs
kubectl logs -n omnia-system deployment/omnia-controller-manager -f | grep -i arena
```

### Resource Cleanup
```bash
# Delete stuck jobs
kubectl delete arenajob --all

# Force delete with finalizer removal (use with caution)
kubectl patch arenajob my-job -p '{"metadata":{"finalizers":null}}' --type=merge
kubectl delete arenajob my-job
```

## Getting Help

If you're still experiencing issues:
- Check the Arena Fleet Architecture for conceptual understanding
- Review the CRD references: ArenaSource, ArenaConfig, ArenaJob
- Search or open an issue on GitHub