Skip to main content

DevOps AI Toolkit Controller

A Kubernetes controller that provides resource tracking, event-driven remediation, and resource visibility capabilities for your cluster.

Mission

The DevOps AI Toolkit Controller bridges the gap between Kubernetes resources and intelligent operations. It enables:

  • Resource awareness through logical grouping and health aggregation
  • Proactive remediation through AI-powered event analysis
  • Resource discoverability through semantic search integration

Who Should Use This

  • Platform Engineers building self-service Kubernetes platforms
  • DevOps Teams looking to automate incident response
  • SREs who want intelligent monitoring and remediation
  • Developers deploying applications and needing resource visibility

Scope

In Scope

  • Kubernetes resource tracking and lifecycle management
  • Event-driven remediation with AI analysis
  • Resource synchronization for semantic search
  • Integration with DevOps AI Toolkit MCP

Out of Scope

  • Direct AI/LLM processing (delegated to MCP)
  • Application-level monitoring
  • Multi-cluster management
  • GitOps workflows

Features

The DevOps AI Toolkit Controller provides three main capabilities:

1. Solution CRD - Resource Tracking

Track and manage deployed Kubernetes resources as logical solutions:

  • Resource Grouping: Links all resources (Deployments, Services, etc.) that compose a logical solution
  • Intent Preservation: Stores the original user intent and deployment context
  • Metadata Storage: Captures deployment rationale, patterns, policies, and documentation links
  • Health Monitoring: Aggregates health status across all tracked resources
  • Automatic Cleanup: Deleting a Solution CR automatically deletes all child resources via ownerReferences

Works standalone - No external dependencies required.

2. RemediationPolicy CRD - Event-Driven Remediation

Monitor Kubernetes events and automatically remediate issues using the DevOps AI Toolkit:

  • Event Watching: Configurable filtering of Kubernetes events by type, reason, and involved objects
  • Automatic Mode: System detects, analyzes, and fixes issues without human intervention
  • Manual Mode: System provides remediation recommendations via Slack for human execution
  • Slack Notifications: Rich notifications with remediation results and next steps
  • Rate Limiting: Prevents event storms with configurable cooldowns
  • Status Reporting: Comprehensive observability through status updates

Requires DevOps AI Toolkit MCP for AI-powered analysis.

3. ResourceSyncConfig CRD - Resource Visibility

Enable semantic search and resource discovery across your cluster:

  • Resource Discovery: Automatically discovers all resource types in your cluster
  • Change Tracking: Watches for resource changes (create, update, delete)
  • Semantic Search: Syncs resource metadata to MCP for natural language queries
  • Debounced Sync: Batches changes to reduce API calls
  • Periodic Resync: Full state sync catches any missed events

Requires DevOps AI Toolkit MCP for semantic search capabilities.

Quick Start

Note: If you're using DevOps AI Toolkit (dot-ai), the controller is automatically installed into your cluster. Skip to step 2.

1. Install Controller

# Set the version from https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller
export DOT_AI_CONTROLLER_VERSION="..."

helm install dot-ai-controller oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller \
--version $DOT_AI_CONTROLLER_VERSION \
--namespace dot-ai \
--create-namespace \
--wait

This installs all three CRDs (Solution, RemediationPolicy, and ResourceSyncConfig) and the controller.

2. Choose Your Feature

For Resource Tracking:

# Create a Solution CR to track your deployed resources
kubectl apply --filename - <<'EOF'
apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: Solution
metadata:
name: my-app
namespace: default
spec:
intent: "Production web application with database"
resources:
- apiVersion: apps/v1
kind: Deployment
name: web-app
- apiVersion: v1
kind: Service
name: web-app-service
- apiVersion: apps/v1
kind: StatefulSet
name: postgresql
EOF

See the Solution Guide for complete examples and usage patterns.

For Event Remediation:

First, install the DevOps AI Toolkit MCP, then:

# Create a RemediationPolicy to handle events
kubectl apply --filename - <<'EOF'
apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: RemediationPolicy
metadata:
name: auto-remediate
namespace: dot-ai
spec:
eventSelectors:
- type: Warning
reason: FailedScheduling
mode: automatic
mcpEndpoint: http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/remediate
mode: manual
EOF

See the Remediation Guide for complete examples, configuration options, and best practices.

For Resource Visibility:

First, install the DevOps AI Toolkit MCP, then:

# Create a ResourceSyncConfig to enable semantic search
kubectl apply --filename - <<'EOF'
apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: ResourceSyncConfig
metadata:
name: default-sync
spec:
mcpEndpoint: http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync
debounceWindowSeconds: 10
resyncIntervalMinutes: 60
EOF

See the Resource Sync Guide for complete examples and semantic search usage.

Documentation

Architecture

┌─────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────────────┐ │
│ │ Solution CR │ (Parent Resource) │
│ │ ───────────── │ │
│ │ metadata: │ │
│ │ intent: "..." │ │
│ │ resources: [...] │ │
│ └──────────────────────┘ │
│ ▲ │
│ │ ownerReferences │
│ │ │
│ ┌────────┴──────────┬──────────────┬─────────┐ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Deployment Service PVC ConfigMap│
│ (child) (child) (child) (child) │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Controller │ │
│ │ ─────────── │ │
│ │ • Watches Solution CRs │ │
│ │ • Manages ownerReferences │ │
│ │ • Tracks resource health │ │
│ │ • Processes events (RemediationPolicy) │ │
│ │ • Syncs resources to MCP (ResourceSync) │ │
│ └──────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────┘