
Remediate Feature Architecture

This document provides a detailed architecture overview of the Remediate feature in the DevOps AI Toolkit.

Overview

The Remediate feature provides AI-powered Kubernetes issue analysis and remediation. It investigates problems using kubectl tools, identifies root causes with confidence scoring, and executes verified fixes with optional post-execution validation.

High-Level Architecture

Remediation Workflow

The remediate tool operates as a multi-phase workflow with persistent session management: investigation, analysis, optional execution, and post-execution validation.

Component Details

MCP Server (dot-ai)

The MCP server is the core remediation engine:

| Component | File | Description |
| --- | --- | --- |
| remediate tool | src/tools/remediate.ts | Entry point; orchestrates investigation and execution |
| System Prompt | prompts/remediate-system.md | AI instructions for investigation behavior |
| GenericSessionManager | src/core/generic-session-manager.ts | File-based session persistence |
| AIProvider | src/core/ai-provider.interface.ts | AI abstraction with tool loop support |
| AIProviderFactory | src/core/ai-provider-factory.ts | Multi-provider factory (Anthropic, OpenAI, etc.) |
| kubectl-tools | src/core/kubectl-tools.ts | Kubectl investigation tools |
| visualization | src/core/visualization.ts | URL generation for web UI |

Kubectl Investigation Tools

Tools available during AI investigation:

| Tool | Description |
| --- | --- |
| kubectl_api_resources | Discover available resources in the cluster |
| kubectl_get | List resources in compact table format |
| kubectl_describe | Detailed resource information with events |
| kubectl_logs | Container logs (supports --previous for crashes) |
| kubectl_events | Cluster events for understanding state changes |
| kubectl_patch_dryrun | Validate patches before actual execution |

Controller (dot-ai-controller)

The Kubernetes controller provides event-driven remediation:

| Component | File | Description |
| --- | --- | --- |
| RemediationPolicy CRD | config/crd/bases/ | Custom resource for remediation rules |
| Policy Controller | internal/controller/remediationpolicy_controller.go | Event matching and MCP dispatch |
| Rate Limiter | internal/controller/remediationpolicy_ratelimit.go | Per-object cooldowns and rate limits |
| MCP Client | internal/controller/remediationpolicy_mcp.go | HTTP client for the remediate tool |
| Cooldown State | ConfigMaps | Persistent cooldown state across restarts |

Web UI (dot-ai-ui)

Provides visualization for remediation results:

| Component | File | Description |
| --- | --- | --- |
| Visualization Page | src/pages/Visualization.tsx | Main page for /v/{sessionId} |
| MermaidRenderer | src/components/renderers/MermaidRenderer.tsx | Interactive flowcharts (collapsible) |
| CardRenderer | src/components/renderers/CardRenderer.tsx | Issue/solution cards |
| CodeRenderer | src/components/renderers/CodeRenderer.tsx | Commands and logs with syntax highlighting |
| InsightsPanel | src/components/InsightsPanel.tsx | AI observations display |
| API Client | src/api/client.ts | Data fetching from the MCP server |

Integration Points

MCP Server ↔ AI Provider

  • Tool Loop: AI iteratively calls kubectl tools (max 30 iterations)
  • Investigation: Gathers cluster data to understand the issue
  • Analysis: Parses JSON response with root cause, confidence, and remediation steps
  • Validation: Optional recursive investigation after command execution
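The tool loop described above can be sketched as follows. This is a minimal illustration; the function and type names are assumptions, not the actual implementation in src/tools/remediate.ts:

```typescript
// Illustrative sketch of the AI tool loop: the AI either requests another
// kubectl tool call or returns a final analysis; iterations are capped at 30.
const MAX_ITERATIONS = 30;

interface ToolCall { tool: string; args: string[] }
interface Analysis { rootCause: string; confidence: number }
type AiStep = { toolCall: ToolCall } | { analysis: Analysis };

async function runToolLoop(
  step: (observations: string[]) => Promise<AiStep>,   // one AI turn
  execTool: (call: ToolCall) => Promise<string>,       // runs a kubectl tool
): Promise<Analysis> {
  const observations: string[] = [];
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const next = await step(observations);
    if ('analysis' in next) return next.analysis;      // investigation finished
    observations.push(await execTool(next.toolCall));  // gather more cluster data
  }
  throw new Error(`Investigation exceeded ${MAX_ITERATIONS} iterations`);
}
```

The cap on iterations is what prevents an inconclusive investigation from looping forever.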

MCP Server ↔ Kubernetes API

  • Read Operations: kubectl get, describe, logs, events
  • Validation: kubectl patch --dry-run=server
  • Execution: child_process.exec() for remediation commands
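The validate-then-execute pattern can be sketched like this; helper names are illustrative assumptions, though kubectl's --dry-run=server flag and child_process.exec() are real:

```typescript
import { exec } from 'node:child_process';
import { promisify } from 'node:util';

const execAsync = promisify(exec);

// kubectl accepts --dry-run=server on mutating verbs such as patch and apply,
// so the API server validates the change without persisting it.
function withServerDryRun(command: string): string {
  return `${command} --dry-run=server`;
}

// Hypothetical wrapper: validate server-side first, then run the real command.
async function validateThenExecute(command: string): Promise<string> {
  await execAsync(withServerDryRun(command)); // throws if the server rejects it
  const { stdout } = await execAsync(command); // actual execution
  return stdout;
}
```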

Controller ↔ MCP Server

  • Event-Driven: Controller watches Kubernetes events
  • Policy Matching: Events matched against RemediationPolicy selectors
  • HTTP Dispatch: POST to MCP /api/v1/tools/remediate
  • Rate Limiting: Per-object cooldowns prevent remediation storms
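A sketch of the dispatch request the controller builds. The real client is Go (internal/controller/remediationpolicy_mcp.go); the request body fields here are illustrative assumptions, while the URL shape follows the mcpEndpoint and mcpTool settings described below:

```typescript
// Hypothetical request shape sent to the MCP remediate tool.
interface RemediateRequest { issue: string; mode: 'manual' | 'automatic' }

// Builds the POST request: {mcpEndpoint}/{mcpTool} with a bearer token from
// the mcpAuthSecretRef secret.
function buildDispatch(endpoint: string, token: string, req: RemediateRequest) {
  return {
    url: `${endpoint}/remediate`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${token}` },
      body: JSON.stringify(req),
    },
  };
}
```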

MCP Server ↔ Web UI

  • Session Storage: Remediation data stored with session IDs
  • Visualization API: /api/v1/visualize/{sessionId} endpoint
  • URL Generation: WEB_UI_BASE_URL/v/{sessionId}
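The URL generation above is simple enough to sketch directly (the helper name is assumed; the format WEB_UI_BASE_URL/v/{sessionId} is from the source):

```typescript
// Builds the shareable visualization URL for a remediation session.
function visualizationUrl(baseUrl: string, sessionId: string): string {
  return `${baseUrl.replace(/\/$/, '')}/v/${sessionId}`; // tolerate trailing slash
}
```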

Controller ↔ Notifications

  • Slack Webhooks: Controller sends remediation events to Slack channels
  • Google Chat Webhooks: Controller sends remediation events to Google Chat spaces
  • Secret References: Webhook URLs stored securely in Kubernetes Secrets
  • Event Types: Notifications sent on remediation start, success, and failure
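A notification payload for the three event types might look like the following. This is purely illustrative: the real sender lives in the Go controller, and the message format is an assumption:

```typescript
// Event types from the source: start, success, failure.
type EventType = 'start' | 'success' | 'failure';

// Hypothetical Slack webhook payload; Slack incoming webhooks accept a
// JSON object with a "text" field.
function slackPayload(event: EventType, policy: string, summary: string) {
  return { text: `[${policy}] remediation ${event}: ${summary}` };
}
```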

Session Management

Sessions persist workflow state across tool calls:

Session ID Format: rem-{timestamp}-{uuid8}
Example: rem-1767465086590-11029192

Session Data:
├── toolName: 'remediate'
├── issue: "Pod my-app is crashing with OOMKilled"
├── mode: 'manual' | 'automatic'
├── interaction_id: (for evaluation dataset)
├── status: 'investigating' | 'analysis_complete' | 'executed_*' | ...
├── finalAnalysis:
│   ├── rootCause: "Container memory limit too low"
│   ├── confidence: 0.92
│   ├── factors: ["High memory usage", "No HPA"]
│   ├── remediation:
│   │   ├── summary: "Increase memory limit"
│   │   ├── actions: [{description, command, risk, rationale}]
│   │   └── risk: 'low' | 'medium' | 'high'
│   └── validationIntent: "Verify pod is running"
├── executionResults: [{command, success, output, error}]
└── timestamp: ISO date
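The session shape above maps naturally onto a TypeScript interface. The field names mirror the tree; makeSessionId is a hypothetical helper, assuming {uuid8} means the first eight hex characters of a UUID:

```typescript
import { randomUUID } from 'node:crypto';

// Sketch of the persisted session, following the tree above.
interface RemediateSession {
  toolName: 'remediate';
  issue: string;
  mode: 'manual' | 'automatic';
  interaction_id?: string; // for the evaluation dataset
  status: string; // 'investigating' | 'analysis_complete' | 'executed_*' | ...
  finalAnalysis?: {
    rootCause: string;
    confidence: number; // 0-1
    factors: string[];
    remediation: {
      summary: string;
      actions: { description: string; command: string; risk: string; rationale: string }[];
      risk: 'low' | 'medium' | 'high';
    };
    validationIntent: string;
  };
  executionResults?: { command: string; success: boolean; output?: string; error?: string }[];
  timestamp: string; // ISO date
}

// Session ID format: rem-{timestamp}-{uuid8}
function makeSessionId(now: number = Date.now()): string {
  return `rem-${now}-${randomUUID().replace(/-/g, '').slice(0, 8)}`;
}
```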

Session States

| State | Description |
| --- | --- |
| investigating | AI is gathering data via kubectl tools |
| analysis_complete | Analysis done; awaiting user approval |
| failed | Investigation failed (error) |
| executed_successfully | All commands succeeded |
| executed_with_errors | Some commands failed |
| cancelled | User cancelled the remediation |
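The session states can be sketched as a small transition table. The allowed transitions are an assumption inferred from the state descriptions, not a verified list:

```typescript
type SessionState =
  | 'investigating'
  | 'analysis_complete'
  | 'failed'
  | 'executed_successfully'
  | 'executed_with_errors'
  | 'cancelled';

// Assumed transitions: investigation ends in analysis, failure, or cancellation;
// an approved analysis ends in one of the executed states or cancellation.
const transitions: Record<SessionState, SessionState[]> = {
  investigating: ['analysis_complete', 'failed', 'cancelled'],
  analysis_complete: ['executed_successfully', 'executed_with_errors', 'cancelled'],
  failed: [],
  executed_successfully: [],
  executed_with_errors: [],
  cancelled: [],
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return transitions[from].includes(to);
}
```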

RemediationPolicy CRD

The controller uses a CRD for event-driven remediation:

apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: RemediationPolicy
metadata:
  name: oom-killer-policy
spec:
  eventSelectors:
    - type: Warning
      reason: OOMKilled
      involvedObjectKind: Pod
      namespace: production
      message: ".*memory.*"     # Regex support
      mode: automatic           # Override per selector
      confidenceThreshold: 0.9
      maxRiskLevel: low

  mcpEndpoint: https://mcp.example.com/api/v1/tools
  mcpAuthSecretRef:
    name: mcp-auth
    key: token
  mcpTool: remediate

  mode: manual                  # Default mode
  confidenceThreshold: 0.8
  maxRiskLevel: low

  rateLimiting:
    eventsPerMinute: 10
    cooldownMinutes: 5

  notifications:
    slack:
      webhookSecretRef:
        name: slack-webhook
        key: url
      channel: "#alerts"
    googleChat:
      webhookSecretRef:
        name: gchat-webhook
        key: url

status:
  totalEventsProcessed: 150
  successfulRemediations: 142
  failedRemediations: 8
  rateLimitedEvents: 25
  lastProcessedEvent: "2025-01-07T10:30:00Z"
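The per-object cooldown check behind rateLimiting.cooldownMinutes can be sketched as follows. The real logic is Go and persists state in ConfigMaps; the semantics here are an assumption:

```typescript
// Returns true while an object is still inside its cooldown window, i.e.
// less than cooldownMinutes have passed since its last remediation.
function inCooldown(
  lastRemediatedAtMs: number | undefined, // undefined: never remediated
  nowMs: number,
  cooldownMinutes: number,
): boolean {
  if (lastRemediatedAtMs === undefined) return false;
  return nowMs - lastRemediatedAtMs < cooldownMinutes * 60_000;
}
```

Skipping events inside the window is what prevents remediation storms on a repeatedly failing object.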

Output Formats

The remediate tool returns structured output:

| Field | Description |
| --- | --- |
| status | success, failed, or awaiting_user_approval |
| sessionId | Session ID for continuation or visualization |
| investigation.iterations | Number of AI tool loop iterations |
| investigation.dataGathered | List of kubectl tools called |
| analysis.rootCause | Identified root cause |
| analysis.confidence | Confidence score (0-1) |
| analysis.factors | Contributing factors |
| remediation.summary | Human-readable summary |
| remediation.actions | Commands with risk levels |
| remediation.risk | Overall risk level |
| validationIntent | Post-execution validation instructions |
| executionChoices | Available execution options |
| results | Execution results (if executed) |

Error Handling

The remediation workflow includes robust error handling:

  1. Session Not Found: Clear guidance to start new investigation
  2. AI Service Errors: Logged with request IDs for debugging
  3. JSON Parsing Failures: Original AI response logged for analysis
  4. Command Execution Failures: Individual command results tracked
  5. Validation Failures: Recursive investigation continues despite errors
  6. Investigation Timeouts: Max 30 iterations prevents infinite loops

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| AI_PROVIDER | AI provider selection | anthropic |
| ANTHROPIC_API_KEY | Anthropic API key | Required if using Anthropic |
| OPENAI_API_KEY | OpenAI API key | Required if using OpenAI |
| KUBECONFIG | Kubernetes config path | Auto-detected |
| DOT_AI_SESSION_DIR | Session storage directory | ./tmp/sessions |
| WEB_UI_BASE_URL | Web UI base URL | Optional |
| DEBUG_DOT_AI | Enable debug logging | false |
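Resolving these variables with the defaults from the table might look like this (the function name and return shape are assumptions; the variable names and defaults are from the table):

```typescript
// Resolve runtime configuration from environment variables, applying the
// documented defaults where a variable is unset.
function resolveConfig(env: Record<string, string | undefined>) {
  return {
    aiProvider: env.AI_PROVIDER ?? 'anthropic',
    sessionDir: env.DOT_AI_SESSION_DIR ?? './tmp/sessions',
    webUiBaseUrl: env.WEB_UI_BASE_URL, // optional; no visualization URL if unset
    debug: env.DEBUG_DOT_AI === 'true',
  };
}
```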

Supported AI Providers

| Provider | Models | Notes |
| --- | --- | --- |
| Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default; 1M token context |
| OpenAI | GPT-5.1-codex | |
| Google | Gemini 3 Pro, Flash | |
| xAI | Grok-4 | |
| Amazon Bedrock | Various | Uses AWS credential chain |
| OpenRouter | Multi-model | Proxy to multiple providers |
| Custom | Ollama, vLLM, LocalAI | Via baseURL config |

See Also