Troubleshooting Guide
This guide covers common issues encountered when running the DevOps AI Toolkit Controller and their solutions.
Common Issues and Solutions
1. Controller Pod Not Starting
Symptoms:
kubectl get pods --namespace dot-ai
# Shows controller pod in CrashLoopBackOff or ImagePullBackOff
Diagnosis:
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai
kubectl describe pod --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai
Common Causes:
- RBAC Issues: Missing leader election permissions (we encountered this during testing)
- Image Issues: Wrong architecture or missing image
- Resource Constraints: Insufficient memory/CPU limits
Solution:
# Check if leader election RBAC is missing (error we fixed during testing):
# "leases.coordination.k8s.io is forbidden"
kubectl get clusterrole dot-ai-controller-manager-role --output yaml
# Add missing leader election permissions if needed:
kubectl patch clusterrole dot-ai-controller-manager-role --type='json' \
--patch='[{"op": "add", "path": "/rules/-", "value": {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ["create", "get", "list", "update"]}}]'
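To narrow down which of the causes above applies, the pod STATUS column can be mapped to a likely cause. This is a minimal sketch: triage_pod_status is a hypothetical helper, not part of the toolkit, and it assumes the default kubectl column order (NAME READY STATUS).

```shell
# Hypothetical helper: map each pod's STATUS to one of the causes listed above.
# Reads `kubectl get pods --no-headers` output on stdin.
triage_pod_status() {
  while read -r name _ready status _rest; do
    case "$status" in
      ImagePullBackOff|ErrImagePull)
        hint="image issue: check image name and architecture" ;;
      CrashLoopBackOff)
        hint="crash loop: check logs for RBAC errors" ;;
      OOMKilled)
        hint="resource constraints: raise memory limit" ;;
      *)
        hint="no known failure state ($status)" ;;
    esac
    printf '%s: %s\n' "$name" "$hint"
  done
}
```

Usage: kubectl get pods --namespace dot-ai --no-headers | triage_pod_status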
2. Events Not Being Processed
Symptoms:
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 50
# Shows: "No RemediationPolicies found - event will not be processed"
Diagnosis:
# Check if RemediationPolicies exist
kubectl get remediationpolicies --all-namespaces
# Check policy selectors
kubectl get remediationpolicies --namespace dot-ai --output yaml
Common Causes:
- No RemediationPolicy created
- Event doesn't match policy selectors
- Policy in wrong namespace
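To tell "no policy at all" apart from "policy in the wrong namespace", it can help to count policies per namespace. A small sketch, assuming the first column of the --all-namespaces listing is the namespace; policies_by_namespace is a hypothetical helper name.

```shell
# Hypothetical helper: count RemediationPolicies per namespace.
# Reads `kubectl get remediationpolicies --all-namespaces --no-headers` on stdin.
policies_by_namespace() {
  # NF skips blank lines; $1 is the NAMESPACE column.
  awk 'NF { count[$1]++ } END { for (ns in count) print ns, count[ns] }' | sort
}
```

Usage: kubectl get remediationpolicies --all-namespaces --no-headers | policies_by_namespace

Empty output means no RemediationPolicy exists anywhere in the cluster. Output that lists only unexpected namespaces points to a policy created in the wrong place.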
3. MCP Connection Failures
Symptoms:
# Controller logs show:
# "❌ HTTP request failed" or "Failed to send MCP request"
Diagnosis:
# Check MCP pod status
kubectl get pods --namespace dot-ai --selector app.kubernetes.io/name=dot-ai
# Test MCP connectivity from controller
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/health
Common Causes:
- MCP pod not running
- Wrong MCP endpoint URL in RemediationPolicy
- Network policies blocking communication
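A wrong endpoint URL is easy to miss by eye. The sketch below checks whether a URL targets a given Service using the standard in-cluster DNS forms. It assumes the default cluster domain cluster.local, and endpoint_targets_service is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if URL points at SERVICE in NAMESPACE on PORT.
# Usage: endpoint_targets_service URL SERVICE NAMESPACE PORT
endpoint_targets_service() {
  local url=$1 service=$2 namespace=$3 port=$4
  # Strip the scheme and any path, leaving host:port.
  local hostport=${url#*://}
  hostport=${hostport%%/*}
  case "$hostport" in
    "$service.$namespace:$port") return 0 ;;
    "$service.$namespace.svc:$port") return 0 ;;
    # Assumes the default cluster domain, cluster.local.
    "$service.$namespace.svc.cluster.local:$port") return 0 ;;
  esac
  return 1
}
```

For example, with the Service name and port taken from the curl command above:

endpoint_targets_service "http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/health" dot-ai-mcp dot-ai 3456 && echo "endpoint matches"

Compare the result against the actual Service name and port reported by kubectl get services --namespace dot-ai.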
4. Slack Notifications Not Working
Symptoms:
# Controller logs show:
# "failed to send Slack start notification"
Diagnosis:
# Check Slack webhook configuration
kubectl get remediationpolicies --namespace dot-ai --output yaml | grep --after-context 5 slack
# Test webhook manually
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"Test message"}' \
YOUR_SLACK_WEBHOOK_URL
Common Causes:
- Invalid Slack webhook URL
- Slack webhook disabled (enabled: false)
- Network connectivity issues
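Before testing the webhook over the network, a quick shape check can catch a truncated URL or an unreplaced placeholder. This sketch relies on the usual Slack incoming-webhook form, https://hooks.slack.com/services/ followed by three path segments. looks_like_slack_webhook is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the argument has the usual shape of a
# Slack incoming-webhook URL (three non-empty segments after /services/).
looks_like_slack_webhook() {
  case "$1" in
    https://hooks.slack.com/services/?*/?*/?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: looks_like_slack_webhook "$WEBHOOK_URL" || echo "webhook URL looks malformed"

A URL that passes this check can still be revoked or wrong. The curl test above is the real confirmation: a working webhook answers with HTTP 200 and the body ok.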
5. Rate Limiting Active
Symptoms:
# Controller logs show:
# "Event processing rate limited" and "cooldown active for Xm Ys more"
This is expected behavior: rate limiting keeps the controller from repeatedly processing duplicate events. The default settings are:
eventsPerMinute: 5
cooldownMinutes: 15
To Adjust: Modify your RemediationPolicy:
rateLimiting:
eventsPerMinute: 10 # Increase if needed
cooldownMinutes: 5 # Decrease if needed
6. MCP Analysis Failures
Symptoms:
# Controller logs show:
# "MCP remediation failed" or "McpRemediationFailed" events
Diagnosis:
# Check MCP logs for detailed error messages
kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50
# Check RemediationPolicy status
kubectl describe remediationpolicies --namespace dot-ai
Common Causes:
- Invalid Anthropic API key
- API rate limits exceeded
- Network connectivity to Anthropic services
- Malformed event data
7. ResourceSyncConfig Not Syncing
Symptoms:
# ResourceSyncConfig status shows syncErrors or not active
kubectl get resourcesyncconfigs --output yaml
Diagnosis:
# Check ResourceSyncConfig status
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status}'
# Check controller logs for sync errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "sync"
# Verify MCP endpoint is reachable
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync
Common Causes:
- MCP resource sync endpoint not available
- Wrong mcpEndpointURL in ResourceSyncConfig
- Network policies blocking communication
- RBAC permissions missing for resource discovery
Solution:
# Verify the MCP endpoint URL is correct
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].spec.mcpEndpoint}'
# Check if watcher is active
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.active}'
# Check watched resource types count
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.watchedResourceTypes}'
8. CapabilityScanConfig Not Scanning
Symptoms:
# CapabilityScanConfig status shows errors or not ready
kubectl get capabilityscanconfigs --output yaml
Diagnosis:
# Check CapabilityScanConfig status
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status}'
# Check controller logs for scan errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "capabilityscan"
# Verify auth secret exists
kubectl get secret dot-ai-secrets --namespace dot-ai
Common Causes:
- MCP endpoint not available
- Wrong mcp.endpointURL in CapabilityScanConfig
- Missing or invalid mcp.authSecretRef secret
- Resource filters excluding all resources
Solution:
# Verify the MCP endpoint URL is correct
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.mcp.endpoint}'
# Check if initial scan completed
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.initialScanComplete}'
# Check last error
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.lastError}'
# Verify include/exclude filters aren't too restrictive
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.includeResources}'
9. GitKnowledgeSource Not Syncing
Symptoms:
# GitKnowledgeSource status shows errors or Synced condition is False
kubectl get gitknowledgesources --output yaml
Diagnosis:
# Check GitKnowledgeSource status
kubectl get gitknowledgesources -n dot-ai -o jsonpath='{.items[*].status}'
# Check controller logs for sync errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "gitknowledge\|clone"
# Verify MCP endpoint is reachable
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai.dot-ai.svc:3456/health
Common Causes:
- CloneError with "read-only file system": Controller deployment missing /tmp volume mount
- Authentication failure: Invalid or missing token for private repositories
- MCP unreachable: Wrong MCP server URL or network issues
- Invalid path patterns: Glob patterns not matching any files
Solution:
# Check for read-only filesystem error (needs /tmp volume)
kubectl get gitknowledgesources -n dot-ai -o jsonpath='{.items[*].status.lastError}'
# Verify the controller has /tmp volume mounted
kubectl get deployment dot-ai-controller-manager -n dot-ai -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}'
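# If missing, patch to add /tmp volume.
# Note: a JSON patch "add" at an existing path replaces the whole list. If the
# deployment already has volumes or volumeMounts, append instead by using the paths
# /spec/template/spec/volumes/- and /spec/template/spec/containers/0/volumeMounts/-
# with a single object (not a list) as the value.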
kubectl patch deployment dot-ai-controller-manager -n dot-ai --type='json' -p='[
{"op": "add", "path": "/spec/template/spec/volumes", "value": [{"name": "tmp-dir", "emptyDir": {}}]},
{"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts", "value": [{"name": "tmp-dir", "mountPath": "/tmp"}]}
]'
# For private repo auth issues, verify secret exists
kubectl get secret <secret-name> -n dot-ai -o jsonpath='{.data.<key>}' | base64 -d
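The volumeMounts check above prints raw JSON, which is easy to misread. The following sketch turns it into a yes/no answer. It assumes kubectl prints the mounts as JSON, and has_tmp_mount is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the volumeMounts JSON on stdin contains a
# mount at exactly /tmp (a path such as /tmp-cache does not count).
has_tmp_mount() {
  grep -Eq '"mountPath":[[:space:]]*"/tmp"'
}
```

Usage:

kubectl get deployment dot-ai-controller-manager -n dot-ai -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}' | has_tmp_mount || echo "/tmp mount missing, apply the patch above"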
10. ResourceSync High Traffic or Performance Issues
Symptoms:
- High CPU/memory usage on controller
- Frequent sync requests to MCP
- Slow cluster performance
Diagnosis:
# Check sync frequency and resource counts
kubectl get resourcesyncconfigs --output yaml | grep -A5 status
# Check debounce and resync settings
kubectl get resourcesyncconfigs --output yaml | grep -E "debounceWindowSeconds|resyncIntervalMinutes"
Solution:
Adjust debounce and resync intervals in your ResourceSyncConfig:
spec:
debounceWindowSeconds: 30 # Increase to batch more changes
resyncIntervalMinutes: 120 # Increase to reduce full resyncs
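The same change can be applied without editing the manifest by sending a JSON merge patch. The sketch below only builds the patch body from the two spec fields shown above; resync_tuning_patch is a hypothetical helper name.

```shell
# Hypothetical helper: build a JSON merge patch for the two tuning fields.
# Usage: resync_tuning_patch DEBOUNCE_SECONDS RESYNC_MINUTES
resync_tuning_patch() {
  printf '{"spec":{"debounceWindowSeconds":%d,"resyncIntervalMinutes":%d}}\n' "$1" "$2"
}
```

Usage, with the resource name and namespace filled in for your cluster:

kubectl patch resourcesyncconfigs <name> --namespace <namespace> --type merge --patch "$(resync_tuning_patch 30 120)"

The --type merge flag is needed because custom resources do not support kubectl's default strategic merge patch.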
Getting Help
Collect Diagnostic Information
When reporting issues, include this diagnostic information:
# Controller status and logs
kubectl get pods --namespace dot-ai
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 100
# MCP status and logs
kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50
# RemediationPolicy configuration
kubectl get remediationpolicies --namespace dot-ai --output yaml
# ResourceSyncConfig configuration and status
kubectl get resourcesyncconfigs --all-namespaces --output yaml
# CapabilityScanConfig configuration and status
kubectl get capabilityscanconfigs --all-namespaces --output yaml
# Recent events
kubectl get events --namespace dot-ai --sort-by='.lastTimestamp' --field-selector type=Warning
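To attach everything to a single report, the commands above can be wrapped in a function that writes each output to its own file. This is a sketch only: collect_diagnostics, the default directory name, and the file names are arbitrary choices, not toolkit conventions.

```shell
# Hypothetical helper: run the diagnostic commands above and save each
# output to a file under the given directory (default: dot-ai-diagnostics).
collect_diagnostics() {
  outdir=${1:-dot-ai-diagnostics}
  ns=dot-ai
  mkdir -p "$outdir" || return 1

  # "> file 2>&1" keeps kubectl error messages next to the normal output.
  kubectl get pods --namespace "$ns" > "$outdir/pods.txt" 2>&1
  kubectl logs --selector app.kubernetes.io/name=dot-ai-controller \
    --namespace "$ns" --tail 100 > "$outdir/controller.log" 2>&1
  kubectl logs --namespace "$ns" --selector app.kubernetes.io/name=dot-ai \
    --tail 50 > "$outdir/mcp.log" 2>&1
  kubectl get remediationpolicies --namespace "$ns" --output yaml \
    > "$outdir/remediationpolicies.yaml" 2>&1
  kubectl get resourcesyncconfigs --all-namespaces --output yaml \
    > "$outdir/resourcesyncconfigs.yaml" 2>&1
  kubectl get capabilityscanconfigs --all-namespaces --output yaml \
    > "$outdir/capabilityscanconfigs.yaml" 2>&1
  kubectl get events --namespace "$ns" --sort-by='.lastTimestamp' \
    --field-selector type=Warning > "$outdir/events.txt" 2>&1

  echo "$outdir"
}
```

Usage: collect_diagnostics ./dot-ai-report

Review the files before sharing them. The RemediationPolicy output can include Slack webhook URLs, which should be redacted.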
Enable Debug Logging
For more detailed troubleshooting, you can increase log verbosity:
# Patch the controller deployment to add debug flags
# (this replaces the manager container's existing args list)
kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --patch='
{
"spec": {
"template": {
"spec": {
"containers": [
{
"name": "manager",
"args": ["--leader-elect", "--health-probe-bind-address=:8081", "-v=2"]
}
]
}
}
}
}'
Resource Requirements
The default resource limits are:
Controller:
- Limits: 500m CPU, 128Mi memory
- Requests: 10m CPU, 64Mi memory
MCP:
- Limits: 1 CPU, 2Gi memory
- Requests: 200m CPU, 512Mi memory
These should be sufficient for most use cases, but may need adjustment for high-volume environments.
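One way to judge whether a limit needs adjusting is to compare live usage against it. The sketch below flags pods whose memory usage is at or above a given percentage of a limit. It assumes metrics-server is installed so that kubectl top works, and that usage is reported in Mi; flag_memory_pressure is a hypothetical helper name.

```shell
# Hypothetical helper: print pods using at least PERCENT of LIMIT_MI memory.
# Reads `kubectl top pod --no-headers` output (NAME CPU MEMORY) on stdin.
# Usage: flag_memory_pressure LIMIT_MI PERCENT
flag_memory_pressure() {
  awk -v limit="$1" -v pct="$2" '
    $3 ~ /Mi$/ {
      used = $3 + 0   # numeric conversion drops the "Mi" suffix
      if (used * 100 >= limit * pct)
        printf "%s %dMi of %dMi\n", $1, used, limit
    }'
}
```

Usage, checking the controller against its default 128Mi limit at an 80% threshold:

kubectl top pod --namespace dot-ai --selector app.kubernetes.io/name=dot-ai-controller --no-headers | flag_memory_pressure 128 80

If a pod is flagged, the limit can be raised, for example with kubectl set resources deployment dot-ai-controller-manager --namespace dot-ai --limits=memory=256Mi. The 256Mi value is only an illustration.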