
Operate Feature Architecture

This document provides a detailed architecture overview of the Operate feature in the DevOps AI Toolkit.

Overview

The Operate feature provides AI-powered Day 2 operations for Kubernetes applications. It handles updates, scaling, enhancements, rollbacks, and deletions through natural language intents while applying organizational patterns and policies, validating changes via dry-run, and executing approved operations safely.

High-Level Architecture

Operation Workflow

The operate tool implements a three-phase workflow (analysis, user approval, execution) with persistent session management.

Component Details

MCP Server (dot-ai)

The MCP server provides the core operations engine:

| Component | File | Description |
|---|---|---|
| operate tool | src/tools/operate.ts | Entry point, routing, context embedding, formatting |
| Analysis workflow | src/tools/operate-analysis.ts | Intent analysis, AI tool loop, response parsing |
| Execution workflow | src/tools/operate-execution.ts | Command execution, post-validation, results |
| System Prompt | prompts/operate-system.md | AI instructions for operation behavior |
| User Prompt | prompts/operate-user.md | Handlebars template with context injection |
| GenericSessionManager | src/core/generic-session-manager.ts | File-based session persistence |
| AIProvider | src/core/ai-provider.interface.ts | AI abstraction with tool loop support |
| kubectl-tools | src/core/kubectl-tools.ts | Kubectl investigation and validation tools |
| Vector Services | src/services/*-vector-service.ts | Pattern, policy, capability search |
| visualization | src/core/visualization.ts | URL generation for web UI |

Kubectl Investigation & Validation Tools

Tools available during AI analysis:

| Tool | Description |
|---|---|
| kubectl_api_resources | Discover available resources in cluster |
| kubectl_get | List resources with table format |
| kubectl_describe | Detailed resource information with events |
| kubectl_logs | Container logs for debugging |
| kubectl_patch_dryrun | Validate patch operations before execution |
| kubectl_apply_dryrun | Validate apply operations before execution |
| kubectl_delete_dryrun | Validate delete operations before execution |
| kubectl_get_crd_schema | Get CRD schema for custom resources |

Controller (dot-ai-controller)

The Kubernetes controller provides capability scanning:

| Component | File | Description |
|---|---|---|
| Capability Scanner | internal/controller/capability_scanner.go | Discovers cluster resources and capabilities |
| Embedding Service | internal/controller/embedding_service.go | Generates embeddings for semantic search |
| Qdrant Client | internal/controller/qdrant_client.go | Stores capabilities in vector database |

Web UI (dot-ai-ui)

Provides visualization for operation analysis and execution:

| Component | File | Description |
|---|---|---|
| Visualization Page | src/pages/Visualization.tsx | Main page for /v/{sessionId} |
| MermaidRenderer | src/components/renderers/MermaidRenderer.tsx | Interactive flowcharts |
| CardRenderer | src/components/renderers/CardRenderer.tsx | Current state and proposed changes |
| CodeRenderer | src/components/renderers/CodeRenderer.tsx | Commands with syntax highlighting |
| InsightsPanel | src/components/InsightsPanel.tsx | AI observations and risk assessment |

Integration Points

MCP Server ↔ AI Provider

  • Tool Loop: AI iteratively calls kubectl tools (max 30 iterations)
  • Investigation: Gathers current cluster state to understand resources
  • Dry-Run Validation: Validates all commands before proposing
  • Analysis: Generates JSON response with changes, commands, and risk assessment
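
The bounded tool loop described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Model`, `ToolRunner`, and `ModelTurn` shapes and the `runToolLoop` name are hypothetical, and the real `AIProvider` interface may differ. Only the 30-iteration cap comes from this document.

```typescript
// Hypothetical shapes; the real AIProvider interface lives in
// src/core/ai-provider.interface.ts and may look quite different.
type ToolCall = { tool: string; args: Record<string, string> };
type ModelTurn =
  | { kind: "tool_call"; call: ToolCall }
  | { kind: "final"; json: string };

type Model = (transcript: string[]) => Promise<ModelTurn>;
type ToolRunner = (call: ToolCall) => Promise<string>;

const MAX_ITERATIONS = 30; // cap stated in this document

async function runToolLoop(
  model: Model,
  runTool: ToolRunner,
  intent: string,
  maxIterations: number = MAX_ITERATIONS,
): Promise<{ json: string; iterations: number }> {
  const transcript: string[] = ["intent: " + intent];
  for (let i = 1; i <= maxIterations; i++) {
    const turn = await model(transcript);
    if (turn.kind === "final") {
      // The model has produced its JSON analysis; stop investigating.
      return { json: turn.json, iterations: i };
    }
    // Feed the tool output back so the next turn can build on it.
    const output = await runTool(turn.call);
    transcript.push(turn.call.tool + " -> " + output);
  }
  throw new Error("Investigation did not finish within " + maxIterations + " iterations");
}
```

Raising an error when the cap is reached is one possible design; returning a partial analysis would be another.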

MCP Server ↔ Vector Database

  • Patterns: Organizational patterns for operational best practices
  • Policies: Policy intents for validation and compliance
  • Capabilities: Cluster resource capabilities for intelligent recommendations
  • Capabilities are mandatory; patterns/policies are optional

MCP Server ↔ Kubernetes API

  • Read Operations: kubectl get, describe, logs
  • Validation: kubectl patch/apply/delete --dry-run=server
  • Execution: Sequential command execution via child_process.exec()
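
As a rough sketch of the validation step, the helper below builds the argument list for a server-side dry run. The `dryRunArgs` name and its signature are assumptions made for illustration; only the `--dry-run=server` flag and the patch/apply/delete verbs come from the list above.

```typescript
type DryRunVerb = "patch" | "apply" | "delete";

// Builds the kubectl argument vector for a server-side dry run.
// Hypothetical helper, not taken from src/core/kubectl-tools.ts.
function dryRunArgs(verb: DryRunVerb, rest: string[]): string[] {
  // Drop any dry-run flag the caller already supplied so it is not duplicated.
  const cleaned = rest.filter((arg) => arg.indexOf("--dry-run") !== 0);
  return ["kubectl", verb].concat(cleaned, ["--dry-run=server"]);
}

// Example: validate a strategy patch without changing the cluster.
const patchCheck = dryRunArgs("patch", [
  "deployment/my-api",
  "-n",
  "prod",
  "-p",
  '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":0}}}}',
]);
```

Keeping the arguments as an array rather than one string avoids shell-quoting problems with the JSON patch body.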

MCP Server ↔ Remediate Tool

  • Post-Execution Validation: Internally calls remediate with validationIntent
  • Verification: Confirms operations completed successfully
  • Error Detection: Identifies issues introduced by operations

MCP Server ↔ Web UI

  • Session Storage: Operation data stored with session IDs
  • Visualization API: /api/v1/visualize/{sessionId} endpoint
  • URL Generation: WEB_UI_BASE_URL/v/{sessionId}
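
The URL scheme above can be illustrated with a small helper. This is a sketch under assumptions: the `visualizationUrl` name, the trailing-slash handling, and returning `undefined` when no base URL is configured are illustrative choices, not confirmed behavior of src/core/visualization.ts.

```typescript
// Returns WEB_UI_BASE_URL/v/{sessionId}, or undefined when no base URL is set
// (WEB_UI_BASE_URL is listed as optional in the configuration).
function visualizationUrl(
  baseUrl: string | undefined,
  sessionId: string,
): string | undefined {
  if (!baseUrl) return undefined;
  const trimmed = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return trimmed + "/v/" + encodeURIComponent(sessionId);
}
```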

Session Management

Sessions persist workflow state across tool calls:

Session ID Format: opr-{timestamp}-{uuid8}
Example: opr-1704067200000-a1b2c3d4
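
A session ID in this format could be built and parsed along these lines. The helper names are hypothetical, and the sketch assumes that `uuid8` means the first eight hexadecimal characters of a UUID, which the document does not state explicitly.

```typescript
const SESSION_ID_PATTERN = /^opr-(\d+)-([0-9a-f]{8})$/;

// Assumption: uuid8 is the first 8 hex characters of a UUID.
function makeSessionId(timestampMs: number, uuid: string): string {
  const uuid8 = uuid.replace(/-/g, "").slice(0, 8).toLowerCase();
  return "opr-" + timestampMs + "-" + uuid8;
}

// Returns the parts of a well-formed ID, or undefined if it does not match.
function parseSessionId(
  id: string,
): { timestampMs: number; uuid8: string } | undefined {
  const match = SESSION_ID_PATTERN.exec(id);
  if (!match) return undefined;
  return { timestampMs: Number(match[1]), uuid8: match[2] };
}
```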

Session Data:
├── toolName: 'operate'
├── intent: "Update my-api to v2.0 with zero downtime"
├── context:
│ ├── patterns: OrganizationalPattern[]
│ ├── policies: PolicyIntent[]
│ └── capabilities: ResourceCapability[]
├── proposedChanges:
│ ├── create: ResourceChange[]
│ ├── update: ResourceChange[]
│ └── delete: ResourceChange[]
├── commands: ["kubectl set image...", "kubectl patch..."]
├── dryRunValidation:
│ ├── status: 'success' | 'failed'
│ └── details: string
├── patternsApplied: ["Zero-Downtime Rolling Update"]
├── capabilitiesUsed: ["metrics-server", "KEDA"]
├── policiesChecked: ["Production Update Policy"]
├── risks: { level: 'low', description: "..." }
├── validationIntent: "Verify deployment rollout complete"
├── status: 'analyzing' | 'analysis_complete' | 'executing' | 'executed_successfully' | 'executed_with_errors' | 'failed'
└── executionResults: [{command, success, output, error}]
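
The session layout above maps naturally onto a TypeScript interface. The sketch below is an approximation: the element types are left opaque because this document does not define them, which fields are optional is a guess, and `newSession` is a hypothetical factory.

```typescript
// Element types are not specified in this document, so they stay opaque here.
type OrganizationalPattern = unknown;
type PolicyIntent = unknown;
type ResourceCapability = unknown;
type ResourceChange = unknown;

type SessionStatus =
  | "analyzing"
  | "analysis_complete"
  | "executing"
  | "executed_successfully"
  | "executed_with_errors"
  | "failed";

interface ExecutionResult {
  command: string;
  success: boolean;
  output?: string;
  error?: string;
}

interface OperateSessionData {
  toolName: "operate";
  intent: string;
  context: {
    patterns: OrganizationalPattern[];
    policies: PolicyIntent[];
    capabilities: ResourceCapability[];
  };
  proposedChanges: {
    create: ResourceChange[];
    update: ResourceChange[];
    delete: ResourceChange[];
  };
  commands: string[];
  // Assumed optional: these are only known once analysis has finished.
  dryRunValidation?: { status: "success" | "failed"; details: string };
  risks?: { level: string; description: string };
  validationIntent?: string;
  patternsApplied: string[];
  capabilitiesUsed: string[];
  policiesChecked: string[];
  status: SessionStatus;
  executionResults: ExecutionResult[];
}

// Hypothetical factory: a fresh session starts in the 'analyzing' state.
function newSession(
  intent: string,
  capabilities: ResourceCapability[],
): OperateSessionData {
  return {
    toolName: "operate",
    intent: intent,
    context: { patterns: [], policies: [], capabilities: capabilities },
    proposedChanges: { create: [], update: [], delete: [] },
    commands: [],
    patternsApplied: [],
    capabilitiesUsed: [],
    policiesChecked: [],
    status: "analyzing",
    executionResults: [],
  };
}
```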

Session States

| State | Description |
|---|---|
| analyzing | AI is gathering data and generating commands |
| analysis_complete | Analysis done, awaiting user approval |
| executing | Commands are being executed |
| executed_successfully | All commands succeeded |
| executed_with_errors | Some commands failed |
| failed | Analysis or execution failed |
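
One way to reason about these states is as a small state machine. The transition table below is inferred from the state descriptions and is an assumption, as are the `canTransition` and `statusAfterExecution` helper names.

```typescript
type SessionStatus =
  | "analyzing"
  | "analysis_complete"
  | "executing"
  | "executed_successfully"
  | "executed_with_errors"
  | "failed";

// Assumed transitions, inferred from the state descriptions.
const ALLOWED_TRANSITIONS: Record<SessionStatus, SessionStatus[]> = {
  analyzing: ["analysis_complete", "failed"],
  analysis_complete: ["executing"],
  executing: ["executed_successfully", "executed_with_errors", "failed"],
  executed_successfully: [],
  executed_with_errors: [],
  failed: [],
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Maps per-command results to the final session state.
// An empty result list counts as success here because every() is vacuously true.
function statusAfterExecution(results: { success: boolean }[]): SessionStatus {
  return results.every((r) => r.success)
    ? "executed_successfully"
    : "executed_with_errors";
}
```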

Organizational Context Integration

The operate tool integrates organizational knowledge via vector database search:

Context Priority

  1. Capabilities (Mandatory): What the cluster can do
  2. Patterns (Optional): Organizational best practices
  3. Policies (Optional): Compliance and validation rules
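
The mandatory/optional split can be expressed as a simple guard. In this sketch the `buildContext` function and its string-array inputs are hypothetical stand-ins for the real vector search results; the error message wording is also illustrative.

```typescript
interface OperateContext {
  capabilities: string[];
  patterns: string[];
  policies: string[];
}

// Capabilities are mandatory; patterns and policies default to empty lists.
function buildContext(found: {
  capabilities: string[];
  patterns?: string[];
  policies?: string[];
}): OperateContext {
  if (found.capabilities.length === 0) {
    throw new Error(
      "No capabilities found. Run a capability scan before using operate.",
    );
  }
  return {
    capabilities: found.capabilities,
    patterns: found.patterns ?? [],
    policies: found.policies ?? [],
  };
}
```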

Output Formats

The operate tool returns structured output at different stages:

Analysis Response

| Field | Description |
|---|---|
| status | awaiting_user_approval |
| sessionId | Session ID for continuation |
| visualizationUrl | URL to view analysis in web UI |
| currentState | Current cluster resource state |
| proposedChanges | Create, update, delete operations |
| commands | Pre-validated kubectl commands |
| dryRunValidation | Dry-run validation results |
| patternsApplied | Applied organizational patterns |
| capabilitiesUsed | Used cluster capabilities |
| policiesChecked | Checked policies |
| risks | Risk assessment (level + description) |
| validationIntent | Post-execution validation instructions |

Execution Response

| Field | Description |
|---|---|
| status | success or failed |
| sessionId | Session ID for reference |
| results | Per-command execution results |
| validation | Post-execution validation summary |
| message | Human-readable summary |

Error Handling

The operation workflow handles the following error cases:

  1. No Capabilities Found: Clear guidance to run capability scan first
  2. Session Not Found: Guidance to start new operation
  3. Dry-Run Failures: AI iterates to fix commands before proposing
  4. Command Execution Failures: Continue-on-error, capture all results
  5. Validation Failures: Report issues via remediate tool integration
  6. AI Service Errors: Logged with request IDs for debugging
  7. Investigation Limits: A cap of 30 tool-loop iterations prevents infinite loops
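
The continue-on-error behavior in item 4 might look like the sketch below. It is not the project's code: `executeSequentially` and the injected `RunCommand` function are hypothetical. In Node.js, `run` could wrap `child_process.exec`; it is injected here so the control flow can be exercised without a cluster.

```typescript
interface ExecutionResult {
  command: string;
  success: boolean;
  output?: string;
  error?: string;
}

// Hypothetical command runner: resolves with output, rejects on failure.
type RunCommand = (command: string) => Promise<string>;

async function executeSequentially(
  commands: string[],
  run: RunCommand,
): Promise<ExecutionResult[]> {
  const results: ExecutionResult[] = [];
  for (const command of commands) {
    try {
      const output = await run(command);
      results.push({ command: command, success: true, output: output });
    } catch (err) {
      // Record the failure and keep going so every command gets a result.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ command: command, success: false, error: message });
    }
  }
  return results;
}
```

Because each command is awaited before the next one starts, results are recorded in the same order as the commands.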

Configuration

Environment Variables

| Variable | Description | Default |
|---|---|---|
| AI_PROVIDER | AI provider selection | anthropic |
| ANTHROPIC_API_KEY | Anthropic API key | Required if using Anthropic |
| OPENAI_API_KEY | OpenAI API key | Required if using OpenAI |
| QDRANT_URL | Qdrant vector database URL | http://localhost:6333 |
| QDRANT_API_KEY | Qdrant API key | Optional |
| QDRANT_CAPABILITIES_COLLECTION | Capabilities collection name | capabilities |
| KUBECONFIG | Kubernetes config path | Auto-detected |
| DOT_AI_SESSION_DIR | Session storage directory | ~/.dot-ai/sessions |
| WEB_UI_BASE_URL | Web UI base URL | Optional |
| DEBUG_DOT_AI | Enable debug logging | false |
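
The defaults in this table could be applied along these lines. The `loadConfig` function and `OperateConfig` shape are illustrative only. The sketch also assumes that `openai` is the `AI_PROVIDER` value for OpenAI and that `DEBUG_DOT_AI` is enabled by the literal string `true`; neither is confirmed here. `KUBECONFIG` is omitted because its auto-detection is not described.

```typescript
type Env = Record<string, string | undefined>;

interface OperateConfig {
  aiProvider: string;
  qdrantUrl: string;
  qdrantApiKey?: string;
  capabilitiesCollection: string;
  sessionDir: string;
  webUiBaseUrl?: string;
  debug: boolean;
}

function loadConfig(env: Env): OperateConfig {
  const aiProvider = env["AI_PROVIDER"] ?? "anthropic";
  // Assumed provider identifiers; only 'anthropic' appears in the table.
  if (aiProvider === "anthropic" && !env["ANTHROPIC_API_KEY"]) {
    throw new Error("ANTHROPIC_API_KEY is required when using Anthropic");
  }
  if (aiProvider === "openai" && !env["OPENAI_API_KEY"]) {
    throw new Error("OPENAI_API_KEY is required when using OpenAI");
  }
  return {
    aiProvider: aiProvider,
    qdrantUrl: env["QDRANT_URL"] ?? "http://localhost:6333",
    qdrantApiKey: env["QDRANT_API_KEY"],
    capabilitiesCollection: env["QDRANT_CAPABILITIES_COLLECTION"] ?? "capabilities",
    sessionDir: env["DOT_AI_SESSION_DIR"] ?? "~/.dot-ai/sessions",
    webUiBaseUrl: env["WEB_UI_BASE_URL"],
    debug: env["DEBUG_DOT_AI"] === "true", // assumed truthy spelling
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)`.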

Supported AI Providers

| Provider | Models | Notes |
|---|---|---|
| Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default, 1M token context |
| OpenAI | GPT-5.1-codex | |
| Google | Gemini 3 Pro, Flash | |
| xAI | Grok-4 | |
| Amazon Bedrock | Various | Uses AWS credential chain |
| OpenRouter | Multi-model | Proxy to multiple providers |
| Custom | Ollama, vLLM, LocalAI | Via baseURL config |

Workflow Example

User Intent: "Update my-api deployment in prod to v2.0 with zero downtime"

1. CONTEXT EMBEDDING
   └─ embedContext(intent)
      ├─ Search patterns → "Zero-Downtime Rolling Update"
      ├─ Search policies → "Production Update Requirements"
      └─ Search capabilities → "metrics-server", "KEDA Operator"

2. AI INVESTIGATION LOOP
   └─ AI Tool Loop (30 iterations max)
      ├─ kubectl_get deployment/my-api -n prod
      ├─ kubectl_describe deployment/my-api -n prod
      ├─ kubectl_patch_dryrun (test maxUnavailable: 0)
      └─ kubectl_set_image (test v2.0 image --dry-run=server)

3. ANALYSIS GENERATION
   └─ Session created: opr-1704067200000-a1b2c3d4
      ├─ Status: analysis_complete
      ├─ Current: 3 replicas, my-api:v1.5, maxUnavailable: 1
      ├─ Proposed: image v2.0, maxUnavailable: 0
      ├─ Commands: set image + patch strategy
      ├─ Risk: LOW
      └─ Visualization URL: https://dot-ai-ui/v/opr-1704067200000-a1b2c3d4

4. USER APPROVAL
   └─ User reviews analysis in terminal or web UI
      └─ Calls: operate({ sessionId: 'opr-...', executeChoice: 1 })

5. COMMAND EXECUTION
   └─ executeOperations()
      ├─ Load session (status: analysis_complete)
      ├─ Update status to executing
      ├─ Execute commands sequentially
      │  ├─ kubectl set image deployment/my-api my-api=my-api:v2.0 -n prod
      │  └─ kubectl patch deployment/my-api -n prod -p '{"spec":...}'
      ├─ Call remediate internally for validation
      └─ Update status to executed_successfully

6. RETURN RESULTS
   └─ Results: 2 commands succeeded
      ├─ Validation: "Rollout complete, all pods running v2.0"
      └─ Status: success
