Troubleshooting Guide

This guide covers common issues encountered when running the DevOps AI Toolkit Controller and their solutions.

Common Issues and Solutions

1. Controller Pod Not Starting

Symptoms:

kubectl get pods --namespace dot-ai
# Shows controller pod in CrashLoopBackOff or ImagePullBackOff

Diagnosis:

kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai
kubectl describe pod --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai

Common Causes:

  • RBAC Issues: Missing leader election permissions (we encountered this during testing)
  • Image Issues: Wrong architecture or missing image
  • Resource Constraints: Insufficient memory/CPU limits

Solution:

# Check if leader election RBAC is missing (error we fixed during testing):
# "leases.coordination.k8s.io is forbidden"
kubectl get clusterrole dot-ai-controller-manager-role --output yaml

# Add missing leader election permissions if needed:
kubectl patch clusterrole dot-ai-controller-manager-role --type='json' \
--patch='[{"op": "add", "path": "/rules/-", "value": {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ["create", "get", "list", "update"]}}]'

2. Events Not Being Processed

Symptoms:

kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 50
# Shows: "No RemediationPolicies found - event will not be processed"

Diagnosis:

# Check if RemediationPolicies exist
kubectl get remediationpolicies --all-namespaces

# Check policy selectors
kubectl get remediationpolicies --namespace dot-ai --output yaml

Common Causes:

  • No RemediationPolicy created
  • Event doesn't match policy selectors
  • Policy in wrong namespace
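
If no RemediationPolicy exists yet, or you are unsure which selector fields your installed CRD version exposes, inspect the schema directly rather than guessing field names:

# Show the fields available under the RemediationPolicy spec
kubectl explain remediationpolicies.spec
kubectl explain remediationpolicies.spec --recursive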

3. MCP Connection Failures

Symptoms:

# Controller logs show:
# "❌ HTTP request failed" or "Failed to send MCP request"

Diagnosis:

# Check MCP pod status
kubectl get pods --namespace dot-ai --selector app.kubernetes.io/name=dot-ai

# Test MCP connectivity from controller
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/health

Common Causes:

  • MCP pod not running
  • Wrong MCP endpoint URL in RemediationPolicy
  • Network policies blocking communication
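
For the network-policy cause, listing policies in the namespace is a quick first check; an empty result rules it out:

# Look for NetworkPolicies that could block controller-to-MCP traffic
kubectl get networkpolicies --namespace dot-ai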

4. Slack Notifications Not Working

Symptoms:

# Controller logs show:
# "failed to send Slack start notification"

Diagnosis:

# Check Slack webhook configuration
kubectl get remediationpolicies --namespace dot-ai --output yaml | grep --after-context 5 slack

# Test webhook manually
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"Test message"}' \
YOUR_SLACK_WEBHOOK_URL

Common Causes:

  • Invalid Slack webhook URL
  • Slack webhook disabled (enabled: false)
  • Network connectivity issues
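
Because notifications are sent from the controller pod, it can also help to repeat the webhook test from inside the cluster. This assumes curl is available in the controller image, as in the connectivity checks earlier in this guide:

kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -sS -X POST -H 'Content-type: application/json' \
--data '{"text":"Test from controller pod"}' \
YOUR_SLACK_WEBHOOK_URL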

5. Rate Limiting Active

Symptoms:

# Controller logs show:
# "Event processing rate limited" and "cooldown active for Xm Ys more"

This is Expected Behavior: Rate limiting prevents the controller from repeatedly processing the same or duplicate events. The default settings are:

  • eventsPerMinute: 5
  • cooldownMinutes: 15

To Adjust: Modify your RemediationPolicy:

rateLimiting:
  eventsPerMinute: 10   # Increase if needed
  cooldownMinutes: 5    # Decrease if needed

6. MCP Analysis Failures

Symptoms:

# Controller logs show:
# "MCP remediation failed" or "McpRemediationFailed" events

Diagnosis:

# Check MCP logs for detailed error messages
kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50

# Check RemediationPolicy status
kubectl describe remediationpolicies --namespace dot-ai

Common Causes:

  • Invalid Anthropic API key
  • API rate limits exceeded
  • Network connectivity to Anthropic services
  • Malformed event data
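
For the API-key cause, confirm the secret holding the Anthropic key exists and contains the expected entries. The secret name dot-ai-secrets below is taken from the capability scan section of this guide and may differ in your installation; describe lists the key names without printing values:

kubectl describe secret dot-ai-secrets --namespace dot-ai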

7. ResourceSyncConfig Not Syncing

Symptoms:

# ResourceSyncConfig status shows syncErrors or not active
kubectl get resourcesyncconfigs --output yaml

Diagnosis:

# Check ResourceSyncConfig status
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status}'

# Check controller logs for sync errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "resourcesync\|sync"

# Verify MCP endpoint is reachable
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync

Common Causes:

  • MCP resource sync endpoint not available
  • Wrong mcpEndpoint URL in ResourceSyncConfig
  • Network policies blocking communication
  • RBAC permissions missing for resource discovery

Solution:

# Verify the MCP endpoint URL is correct
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].spec.mcpEndpoint}'

# Check if watcher is active
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.active}'

# Check watched resource types count
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.watchedResourceTypes}'
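
For the RBAC cause, you can ask the API server whether the controller's service account is allowed to list the resources being synced. The service account name below is assumed to match the deployment name; adjust it if your installation uses a different one:

kubectl auth can-i list deployments --all-namespaces \
--as=system:serviceaccount:dot-ai:dot-ai-controller-manager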

8. CapabilityScanConfig Not Scanning

Symptoms:

# CapabilityScanConfig status shows errors or not ready
kubectl get capabilityscanconfigs --output yaml

Diagnosis:

# Check CapabilityScanConfig status
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status}'

# Check controller logs for scan errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "capabilityscan"

# Verify auth secret exists
kubectl get secret dot-ai-secrets --namespace dot-ai

Common Causes:

  • MCP endpoint not available
  • Wrong mcp.endpoint URL in CapabilityScanConfig
  • Missing or invalid mcp.authSecretRef secret
  • Resource filters excluding all resources

Solution:

# Verify the MCP endpoint URL is correct
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.mcp.endpoint}'

# Check if initial scan completed
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.initialScanComplete}'

# Check last error
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.lastError}'

# Verify include/exclude filters aren't too restrictive
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.includeResources}'
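
For the auth-secret cause, print the secret reference configured in each CapabilityScanConfig and compare it against the secret checked in the Diagnosis step above (the field path follows the mcp.authSecretRef name used in this section):

kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.mcp.authSecretRef}'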

9. GitKnowledgeSource Not Syncing

Symptoms:

# GitKnowledgeSource status shows errors or Synced condition is False
kubectl get gitknowledgesources --output yaml

Diagnosis:

# Check GitKnowledgeSource status
kubectl get gitknowledgesources --namespace dot-ai --output jsonpath='{.items[*].status}'

# Check controller logs for sync errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "gitknowledge\|clone"

# Verify MCP endpoint is reachable
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai.dot-ai.svc:3456/health

Common Causes:

  • CloneError with "read-only file system": Controller deployment missing /tmp volume mount
  • Authentication failure: Invalid or missing token for private repositories
  • MCP unreachable: Wrong MCP server URL or network issues
  • Invalid path patterns: Glob patterns not matching any files

Solution:

# Check for read-only filesystem error (needs /tmp volume)
kubectl get gitknowledgesources --namespace dot-ai --output jsonpath='{.items[*].status.lastError}'

# Verify the controller has /tmp volume mounted
kubectl get deployment dot-ai-controller-manager --namespace dot-ai --output jsonpath='{.spec.template.spec.containers[0].volumeMounts}'

# If missing, patch to add /tmp volume:
kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --type='json' --patch='[
  {"op": "add", "path": "/spec/template/spec/volumes", "value": [{"name": "tmp-dir", "emptyDir": {}}]},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts", "value": [{"name": "tmp-dir", "mountPath": "/tmp"}]}
]'

# For private repo auth issues, verify secret exists
kubectl get secret <secret-name> --namespace dot-ai --output jsonpath='{.data.<key>}' | base64 -d
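
For the path-pattern cause, dump the full spec and compare the configured glob patterns against the repository layout; printing the whole spec avoids guessing at individual field names:

kubectl get gitknowledgesources --namespace dot-ai --output jsonpath='{.items[*].spec}'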

10. ResourceSync High Traffic or Performance Issues

Symptoms:

  • High CPU/memory usage on controller
  • Frequent sync requests to MCP
  • Slow cluster performance

Diagnosis:

# Check sync frequency and resource counts
kubectl get resourcesyncconfigs --output yaml | grep --after-context 5 status

# Check debounce and resync settings
kubectl get resourcesyncconfigs --output yaml | grep -E "debounceWindowSeconds|resyncIntervalMinutes"

Solution:

Adjust debounce and resync intervals in your ResourceSyncConfig:

spec:
  debounceWindowSeconds: 30    # Increase to batch more changes
  resyncIntervalMinutes: 120   # Increase to reduce full resyncs
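
As a sketch, the same change can be applied with a merge patch; replace <name> with your ResourceSyncConfig and add --namespace if yours lives outside the current context:

kubectl patch resourcesyncconfig <name> --type merge \
--patch '{"spec": {"debounceWindowSeconds": 30, "resyncIntervalMinutes": 120}}'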

Getting Help

Collect Diagnostic Information

When reporting issues, include this diagnostic information:

# Controller status and logs
kubectl get pods --namespace dot-ai
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 100

# MCP status and logs
kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50

# RemediationPolicy configuration
kubectl get remediationpolicies --namespace dot-ai --output yaml

# ResourceSyncConfig configuration and status
kubectl get resourcesyncconfigs --all-namespaces --output yaml

# CapabilityScanConfig configuration and status
kubectl get capabilityscanconfigs --all-namespaces --output yaml

# Recent events
kubectl get events --namespace dot-ai --sort-by='.lastTimestamp' --field-selector type=Warning
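
If you prefer a single archive over running the commands individually, kubectl can dump the state of the dot-ai namespace (including pod logs) into a local directory in one pass:

kubectl cluster-info dump --namespaces dot-ai --output-directory ./dot-ai-diagnostics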

Enable Debug Logging

For more detailed troubleshooting, you can increase log verbosity:

# Edit the controller deployment to add debug flags
kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --patch='
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "manager",
            "args": ["--leader-elect", "--health-probe-bind-address=:8081", "-v=2"]
          }
        ]
      }
    }
  }
}'
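
After patching, confirm the rollout completed and the new flags are in place:

kubectl rollout status deployment/dot-ai-controller-manager --namespace dot-ai
kubectl get deployment dot-ai-controller-manager --namespace dot-ai \
--output jsonpath='{.spec.template.spec.containers[0].args}'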

Resource Requirements

The default resource limits are:

Controller:

  • Limits: 500m CPU, 128Mi memory
  • Requests: 10m CPU, 64Mi memory

MCP:

  • Limits: 1 CPU, 2Gi memory
  • Requests: 200m CPU, 512Mi memory

These should be sufficient for most use cases, but may need adjustment for high-volume environments.
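
If you do need to raise the controller limits, a JSON patch such as the sketch below works for a quick test (the 256Mi value is only an example). If the toolkit was installed from a chart or other package manager, prefer changing its values there so the adjustment is not reverted on upgrade:

kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --type='json' \
--patch='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "256Mi"}]'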