Troubleshooting Guide

This guide covers common issues encountered when running the DevOps AI Toolkit Controller and their solutions.

Common Issues and Solutions

1. Controller Pod Not Starting

Symptoms:

kubectl get pods --namespace dot-ai
# Shows controller pod in CrashLoopBackOff or ImagePullBackOff

Diagnosis:

kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai
kubectl describe pod --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai

Common Causes:

  • RBAC Issues: Missing leader election permissions (we encountered this during testing)
  • Image Issues: Wrong architecture or missing image
  • Resource Constraints: Insufficient memory/CPU limits

Solution:

# Check if leader election RBAC is missing (error we fixed during testing):
# "leases.coordination.k8s.io is forbidden"
kubectl get clusterrole dot-ai-controller-manager-role --output yaml

# Add missing leader election permissions if needed:
kubectl patch clusterrole dot-ai-controller-manager-role --type='json' \
--patch='[{"op": "add", "path": "/rules/-", "value": {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ["create", "get", "list", "update"]}}]'

2. Events Not Being Processed

Symptoms:

kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 50
# Shows: "No RemediationPolicies found - event will not be processed"

Diagnosis:

# Check if RemediationPolicies exist
kubectl get remediationpolicies --all-namespaces

# Check policy selectors
kubectl get remediationpolicies --namespace dot-ai --output yaml

Common Causes:

  • No RemediationPolicy created
  • Event doesn't match policy selectors
  • Policy in wrong namespace
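
If no RemediationPolicy exists yet, or you are unsure which selector fields your installed CRD version exposes, inspect the schema directly rather than guessing field names:

# Show the fields available under the RemediationPolicy spec
kubectl explain remediationpolicies.spec
kubectl explain remediationpolicies.spec --recursive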

3. MCP Connection Failures

Symptoms:

# Controller logs show:
# "❌ HTTP request failed" or "Failed to send MCP request"

Diagnosis:

# Check MCP pod status
kubectl get pods --namespace dot-ai --selector app.kubernetes.io/name=dot-ai

# Test MCP connectivity from controller
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/health

Common Causes:

  • MCP pod not running
  • Wrong MCP endpoint URL in RemediationPolicy
  • Network policies blocking communication
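
For the network-policy cause, listing policies in the namespace is a quick first check; an empty result rules it out:

# Look for NetworkPolicies that could block controller-to-MCP traffic
kubectl get networkpolicies --namespace dot-ai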

4. Slack Notifications Not Working

Symptoms:

# Controller logs show:
# "failed to send Slack start notification"

Diagnosis:

# Check Slack webhook configuration
kubectl get remediationpolicies --namespace dot-ai --output yaml | grep --after-context 5 slack

# Test webhook manually
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"Test message"}' \
YOUR_SLACK_WEBHOOK_URL

Common Causes:

  • Invalid Slack webhook URL
  • Slack webhook disabled (enabled: false)
  • Network connectivity issues
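
Because notifications are sent from the controller pod, it can also help to repeat the webhook test from inside the cluster. This assumes curl is available in the controller image, as in the connectivity checks earlier in this guide:

kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -sS -X POST -H 'Content-type: application/json' \
--data '{"text":"Test from controller pod"}' \
YOUR_SLACK_WEBHOOK_URL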

5. Rate Limiting Active

Symptoms:

# Controller logs show:
# "Event processing rate limited" and "cooldown active for Xm Ys more"

This is Expected Behavior: Rate limiting prevents the controller from repeatedly processing the same or duplicate events. The default settings are:

  • eventsPerMinute: 5
  • cooldownMinutes: 15

To Adjust: Modify your RemediationPolicy:

rateLimiting:
  eventsPerMinute: 10   # Increase if needed
  cooldownMinutes: 5    # Decrease if needed

6. MCP Analysis Failures

Symptoms:

# Controller logs show:
# "MCP remediation failed" or "McpRemediationFailed" events

Diagnosis:

# Check MCP logs for detailed error messages
kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50

# Check RemediationPolicy status
kubectl describe remediationpolicies --namespace dot-ai

Common Causes:

  • Invalid Anthropic API key
  • API rate limits exceeded
  • Network connectivity to Anthropic services
  • Malformed event data
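
For the API-key cause, confirm the secret holding the Anthropic key exists and contains the expected entries. The secret name dot-ai-secrets below is taken from the capability scan section of this guide and may differ in your installation; describe lists the key names without printing values:

kubectl describe secret dot-ai-secrets --namespace dot-ai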

7. ResourceSyncConfig Not Syncing

Symptoms:

# ResourceSyncConfig status shows syncErrors or not active
kubectl get resourcesyncconfigs --output yaml

Diagnosis:

# Check ResourceSyncConfig status
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status}'

# Check controller logs for sync errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "resourcesync\|sync"

# Verify MCP endpoint is reachable
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync

Common Causes:

  • MCP resource sync endpoint not available
  • Wrong mcpEndpoint URL in ResourceSyncConfig
  • Network policies blocking communication
  • RBAC permissions missing for resource discovery

Solution:

# Verify the MCP endpoint URL is correct
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].spec.mcpEndpoint}'

# Check if watcher is active
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.active}'

# Check watched resource types count
kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.watchedResourceTypes}'
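
For the RBAC cause, you can ask the API server whether the controller's service account is allowed to list the resources being synced. The service account name below is assumed to match the deployment name; adjust it if your installation uses a different one:

kubectl auth can-i list deployments --all-namespaces \
--as=system:serviceaccount:dot-ai:dot-ai-controller-manager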

8. CapabilityScanConfig Not Scanning

Symptoms:

# CapabilityScanConfig status shows errors or not ready
kubectl get capabilityscanconfigs --output yaml

Diagnosis:

# Check CapabilityScanConfig status
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status}'

# Check controller logs for scan errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "capabilityscan"

# Verify auth secret exists
kubectl get secret dot-ai-secrets --namespace dot-ai

Common Causes:

  • MCP endpoint not available
  • Wrong mcp.endpoint URL in CapabilityScanConfig
  • Missing or invalid mcp.authSecretRef secret
  • Resource filters excluding all resources

Solution:

# Verify the MCP endpoint URL is correct
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.mcp.endpoint}'

# Check if initial scan completed
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.initialScanComplete}'

# Check last error
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.lastError}'

# Verify include/exclude filters aren't too restrictive
kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.includeResources}'
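
For the auth-secret cause, print the secret reference configured in each CapabilityScanConfig and compare it against the secret checked in the Diagnosis step above (the field path follows the mcp.authSecretRef name used in this section):

kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.mcp.authSecretRef}'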

9. GitKnowledgeSource Not Syncing

Symptoms:

# GitKnowledgeSource status shows errors or Synced condition is False
kubectl get gitknowledgesources --output yaml

Diagnosis:

# Check GitKnowledgeSource status
kubectl get gitknowledgesources --namespace dot-ai --output jsonpath='{.items[*].status}'

# Check controller logs for sync errors
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "gitknowledge\|clone"

# Verify MCP endpoint is reachable
kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \
curl -v http://dot-ai.dot-ai.svc:3456/health

Common Causes:

  • CloneError with "read-only file system": Controller deployment missing /tmp volume mount
  • Authentication failure: Invalid or missing token for private repositories
  • MCP unreachable: Wrong MCP server URL or network issues
  • Invalid path patterns: Glob patterns not matching any files

Solution:

# Check for read-only filesystem error (needs /tmp volume)
kubectl get gitknowledgesources --namespace dot-ai --output jsonpath='{.items[*].status.lastError}'

# Verify the controller has /tmp volume mounted
kubectl get deployment dot-ai-controller-manager --namespace dot-ai --output jsonpath='{.spec.template.spec.containers[0].volumeMounts}'

# If missing, patch to add /tmp volume:
kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --type='json' --patch='[
  {"op": "add", "path": "/spec/template/spec/volumes", "value": [{"name": "tmp-dir", "emptyDir": {}}]},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts", "value": [{"name": "tmp-dir", "mountPath": "/tmp"}]}
]'

# For private repo auth issues, verify secret exists
kubectl get secret <secret-name> --namespace dot-ai --output jsonpath='{.data.<key>}' | base64 -d
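
For the path-pattern cause, dump the full spec and compare the configured glob patterns against the repository layout; printing the whole spec avoids guessing at individual field names:

kubectl get gitknowledgesources --namespace dot-ai --output jsonpath='{.items[*].spec}'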

10. ResourceSync High Traffic or Performance Issues

Symptoms:

  • High CPU/memory usage on controller
  • Frequent sync requests to MCP
  • Slow cluster performance

Diagnosis:

# Check sync frequency and resource counts
kubectl get resourcesyncconfigs --output yaml | grep --after-context 5 status

# Check debounce and resync settings
kubectl get resourcesyncconfigs --output yaml | grep -E "debounceWindowSeconds|resyncIntervalMinutes"

Solution:

Adjust debounce and resync intervals in your ResourceSyncConfig:

spec:
  debounceWindowSeconds: 30    # Increase to batch more changes
  resyncIntervalMinutes: 120   # Increase to reduce full resyncs
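
As a sketch, the same change can be applied with a merge patch; replace <name> with your ResourceSyncConfig and add --namespace if yours lives outside the current context:

kubectl patch resourcesyncconfig <name> --type merge \
--patch '{"spec": {"debounceWindowSeconds": 30, "resyncIntervalMinutes": 120}}'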

Getting Help

Collect Diagnostic Information

When reporting issues, include this diagnostic information:

# Controller status and logs
kubectl get pods --namespace dot-ai
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 100

# MCP status and logs
kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50

# RemediationPolicy configuration
kubectl get remediationpolicies --namespace dot-ai --output yaml

# ResourceSyncConfig configuration and status
kubectl get resourcesyncconfigs --all-namespaces --output yaml

# CapabilityScanConfig configuration and status
kubectl get capabilityscanconfigs --all-namespaces --output yaml

# Recent events
kubectl get events --namespace dot-ai --sort-by='.lastTimestamp' --field-selector type=Warning
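
If you prefer a single archive over running the commands individually, kubectl can dump the state of the dot-ai namespace (including pod logs) into a local directory in one pass:

kubectl cluster-info dump --namespaces dot-ai --output-directory ./dot-ai-diagnostics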

Enable Debug Logging

For more detailed troubleshooting, you can increase log verbosity:

# Edit the controller deployment to add debug flags
kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --patch='
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "manager",
            "args": ["--leader-elect", "--health-probe-bind-address=:8081", "-v=2"]
          }
        ]
      }
    }
  }
}'
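
After patching, confirm the rollout completed and the new flags are in place:

kubectl rollout status deployment/dot-ai-controller-manager --namespace dot-ai
kubectl get deployment dot-ai-controller-manager --namespace dot-ai \
--output jsonpath='{.spec.template.spec.containers[0].args}'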

Resource Requirements

The default resource limits are:

Controller:

  • Limits: 500m CPU, 128Mi memory
  • Requests: 10m CPU, 64Mi memory

MCP:

  • Limits: 1 CPU, 2Gi memory
  • Requests: 200m CPU, 512Mi memory

These should be sufficient for most use cases, but may need adjustment for high-volume environments.
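
If you do need to raise the controller limits, a JSON patch such as the sketch below works for a quick test (the 256Mi value is only an example). If the toolkit was installed from a chart or other package manager, prefer changing its values there so the adjustment is not reverted on upgrade:

kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --type='json' \
--patch='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "256Mi"}]'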