# DevOps AI Toolkit Stack

> AI-powered Kubernetes operations for DevOps teams. Deploy, query, remediate, and operate Kubernetes resources through natural language using MCP, CLI, a Kubernetes controller, and a web UI.

**Deploy the complete DevOps AI Toolkit stack with a single Helm command.**

## Overview

The dot-ai-stack umbrella chart installs all DevOps AI Toolkit components with a single command:

- **DevOps AI Toolkit** - MCP server for AI-powered Kubernetes operations
- **DevOps AI Toolkit Controller** - Kubernetes controller for intelligent resource management and autonomous operations
- **DevOps AI Toolkit Web UI** - Web interface for visual cluster management

> **Note:** This guide covers Kubernetes deployment using the umbrella Helm chart. For other installation options (Docker, NPX, individual charts, etc.), see [devopstoolkit.ai](/).

## Prerequisites

- **Kubernetes cluster** with an ingress controller
- **Helm 3.x** installed
- **kubectl** configured with cluster access
- **AI API keys** for AI-powered features (Anthropic and/or OpenAI)

## Step 1: Create a Local Cluster (Optional)

> Skip this step if you already have a Kubernetes cluster with an ingress controller.

Create a Kind cluster with ingress port mappings. The configuration below is the standard ingress-ready Kind setup, mapping host ports 80 and 443 to the node:

```bash
kind create cluster --name dot-ai-stack --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
EOF
```

If the cluster does not yet have an ingress controller, install one (for example, ingress-nginx) before continuing; the stack's ingress resources need it to serve traffic.

## Step 2: Configure Credentials

Export your AI provider API keys so they can be passed to the chart:

```bash
export ANTHROPIC_API_KEY=<your Anthropic API key>
export OPENAI_API_KEY=<your OpenAI API key>
```

> **Note:** Multiple AI providers are supported. See [AI Model Configuration](/docs/ai-engine/setup/deployment#ai-model-configuration) for all options including Google Gemini, AWS Bedrock, Azure OpenAI, and others.
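Because the Helm install in Step 3 interpolates these variables directly, an unset variable fails silently (Helm just receives an empty string). The sketch below is a hypothetical pre-flight guard, not part of the toolkit; the variable names are the ones this guide uses:

```shell
# Hypothetical pre-flight helper (not part of the toolkit): warn about any
# variable the install command expects that is unset or empty.
require_env() {
  missing=0
  for v in "$@"; do
    if [ -z "$(printenv "$v")" ]; then
      echo "Missing required environment variable: $v" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the provider keys before continuing with the install.
require_env ANTHROPIC_API_KEY OPENAI_API_KEY ||
  echo "Set the variables above before installing the stack." >&2
```

Running this before `helm upgrade --install` turns a confusing mid-deploy failure into an immediate, readable message.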
Generate random authentication tokens for the MCP server and Web UI:

```bash
export DOT_AI_AUTH_TOKEN=$(openssl rand -base64 32)
export DOT_AI_UI_AUTH_TOKEN=$(openssl rand -base64 32)
```

## Step 3: Install the Stack

Install the complete dot-ai stack with a single Helm command:

```bash
helm upgrade --install dot-ai-stack \
  oci://ghcr.io/vfarcic/dot-ai-stack/charts/dot-ai-stack \
  --namespace dot-ai --create-namespace \
  --set dot-ai.secrets.anthropic.apiKey=$ANTHROPIC_API_KEY \
  --set dot-ai.secrets.openai.apiKey=$OPENAI_API_KEY \
  --set dot-ai.secrets.auth.token=$DOT_AI_AUTH_TOKEN \
  --set dot-ai.ingress.enabled=true \
  --set dot-ai.ingress.className=nginx \
  --set dot-ai.ingress.host=dot-ai.127.0.0.1.nip.io \
  --set dot-ai.webUI.baseUrl=http://dot-ai-ui.127.0.0.1.nip.io \
  --set dot-ai-ui.uiAuth.token=$DOT_AI_UI_AUTH_TOKEN \
  --set dot-ai-ui.ingress.enabled=true \
  --set dot-ai-ui.ingress.host=dot-ai-ui.127.0.0.1.nip.io \
  --wait
```

> **Note:** Replace the ingress hosts with your actual domain names for production deployments.
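The long list of `--set` flags can also be kept in a values file, which is easier to review and version (mind that it contains secrets, so keep it out of version control). This is a sketch; the keys mirror the `--set` flags in this guide, but confirm them against the chart's own `values.yaml` before relying on the layout:

```shell
# Write the overrides from this guide to a values file. Keys are assumed from
# the --set flags shown above; verify against the chart's values.yaml.
cat > dot-ai-values.yaml << EOF
dot-ai:
  secrets:
    anthropic:
      apiKey: ${ANTHROPIC_API_KEY}
    openai:
      apiKey: ${OPENAI_API_KEY}
    auth:
      token: ${DOT_AI_AUTH_TOKEN}
  ingress:
    enabled: true
    className: nginx
    host: dot-ai.127.0.0.1.nip.io
  webUI:
    baseUrl: http://dot-ai-ui.127.0.0.1.nip.io
dot-ai-ui:
  uiAuth:
    token: ${DOT_AI_UI_AUTH_TOKEN}
  ingress:
    enabled: true
    host: dot-ai-ui.127.0.0.1.nip.io
EOF

# Then install with:
#   helm upgrade --install dot-ai-stack \
#     oci://ghcr.io/vfarcic/dot-ai-stack/charts/dot-ai-stack \
#     --namespace dot-ai --create-namespace \
#     -f dot-ai-values.yaml --wait
```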
This installs:

- **dot-ai** - MCP server with ingress at `dot-ai.127.0.0.1.nip.io`
- **dot-ai-controller** - Kubernetes controller for autonomous operations
- **dot-ai-ui** - Web interface at `dot-ai-ui.127.0.0.1.nip.io`
- **Qdrant** - Vector database for pattern and policy storage
- **ResourceSyncConfig** - Enables resource discovery
- **CapabilityScanConfig** - Enables cluster capability scanning

## Step 4: Verify Installation

Check that all pods are running:

```bash
kubectl get pods --namespace dot-ai
```

You should see all pods in `Running` status:

```
NAME                                        READY   STATUS    RESTARTS   AGE
dot-ai-577db5b4fc-j8kgf                     1/1     Running   0          50s
dot-ai-controller-manager-c898b5697-dqk2m   1/1     Running   0          50s
dot-ai-stack-qdrant-0                       1/1     Running   0          50s
dot-ai-ui-69d586db8b-ccqrm                  1/1     Running   0          50s
```

Test the MCP server health:

```bash
curl -H "Authorization: Bearer $DOT_AI_AUTH_TOKEN" \
  http://dot-ai.127.0.0.1.nip.io/healthz
```

Expected output:

```json
{"status":"ok"}
```

## Step 5: Choose Your Client

The DevOps AI Toolkit can be accessed through two client options - **MCP** or **CLI**. Both provide AI agent integration with full feature parity.

### Option A: MCP Client

**Best for:** Curated high-level operations designed to minimize context window usage.

Create the MCP client configuration file:

```bash
cat > .mcp.json << EOF
{
  "mcpServers": {
    "dot-ai": {
      "type": "http",
      "url": "http://dot-ai.127.0.0.1.nip.io",
      "headers": {
        "Authorization": "Bearer $DOT_AI_AUTH_TOKEN"
      }
    }
  }
}
EOF
```

> **Note:** This example creates `.mcp.json` in the current directory for Claude Code. Other MCP-enabled agents may expect the configuration in a different location (e.g., `~/.config/` or within the agent's settings). Consult your agent's documentation for the correct path.

**Learn more:** [MCP Setup Documentation](/docs/ai-engine/setup/deployment)

### Option B: CLI Client

**Best for:** Comprehensive API access with lower token overhead for AI agents, plus scripting and automation support.
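Scripts built on either client typically need to wait for the server to become healthy before issuing requests. The `retry` function below is illustrative, not part of the toolkit; the commented usage assumes the health endpoint and token from Step 4:

```shell
# Illustrative retry helper for automation that calls the MCP server or CLI:
# run a command until it succeeds or the attempt budget is exhausted.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example: block until the health endpoint from Step 4 responds.
# retry 30 curl -fsS -H "Authorization: Bearer $DOT_AI_AUTH_TOKEN" \
#   http://dot-ai.127.0.0.1.nip.io/healthz
```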
Install the CLI:

**macOS (Homebrew):**

```bash
brew install vfarcic/tap/dot-ai
```

**Windows (Scoop):**

```powershell
scoop bucket add dot-ai https://github.com/vfarcic/scoop-dot-ai
scoop install dot-ai
```

**Other platforms:** Download from [releases](https://github.com/vfarcic/dot-ai-cli/releases) or see the [installation guide](/docs/cli/setup/installation/).

Configure the CLI:

```bash
export DOT_AI_URL="http://dot-ai.127.0.0.1.nip.io"
```

Generate skills for your AI agent:

```bash
# For Claude Code
dot-ai skills generate --agent claude-code

# For Cursor
dot-ai skills generate --agent cursor

# For Windsurf
dot-ai skills generate --agent windsurf
```

**Learn more:** [CLI Quick Start](/docs/cli/quick-start/) | [Installation](/docs/cli/setup/installation/) | [Agent Integration](/docs/cli/guides/skills-generation/)

### Choosing Between MCP and CLI

- **Use MCP** for simpler, high-level operations with minimal tool descriptions
- **Use CLI** for comprehensive API access with lower token costs and better economy for agents executing multiple commands

## Step 6: Start Using

Launch your AI agent:

```bash
claude
```

> **Note:** If your agent doesn't automatically detect the client, explicitly invoke it with "Use dot-ai MCP" or "Use dot-ai CLI" depending on which client you configured.

Try these example prompts:

| What You Want | Example Prompt |
|---------------|----------------|
| Check system status | "Show dot-ai status" |
| Query cluster | "What pods are running in the dot-ai namespace?" |
| List capabilities | "List all capabilities" |
| Deploy an app | "I want to deploy a web application" |
| Fix issues | "Something is wrong with my database" |

## Configuration

Override any component value by prefixing it with the chart name:

```bash
--set dot-ai.resources.limits.memory=4Gi
--set dot-ai-controller.resources.limits.memory=1Gi
--set dot-ai-ui.ingress.host=ui.example.com
```

For available options, see each component's documentation:

- [DevOps AI Toolkit values](/docs/ai-engine/setup/deployment)
- [Controller values](/docs/controller/)
- [Web UI values](/docs/ui/)

## Next Steps

- [MCP Tools Overview](/docs/ai-engine/tools/overview) - Complete feature reference
- [Pattern Management](/docs/ai-engine/organizational-data/patterns) - Create organizational patterns
- [Policy Management](/docs/ai-engine/organizational-data/policies) - Define governance policies

## Cleanup

To remove the stack:

```bash
helm uninstall dot-ai-stack --namespace dot-ai
kubectl delete namespace dot-ai
```

To delete the Kind cluster:

```bash
kind delete cluster --name dot-ai-stack
```

# Operate Feature Architecture

This document provides a detailed architecture overview of the Operate feature in the DevOps AI Toolkit.

## Overview

The Operate feature provides AI-powered Day 2 operations for Kubernetes applications. It handles updates, scaling, enhancements, rollbacks, and deletions through natural language intents while applying organizational patterns and policies, validating changes via dry-run, and executing approved operations safely.

## High-Level Architecture

```mermaid
flowchart TB
    subgraph Users["User / AI Agent"]
        Agent["Claude Code, Cursor,
VS Code, etc."]
    end
    subgraph MCP["MCP Server (dot-ai)"]
        Operate["operate Tool"]
        AI["AI Provider"]
        Session["Session
Manager"]
        Kubectl["Kubectl Tools"]
        VectorClient["Vector DB
Client"]
    end
    subgraph External["External Services"]
        LLM["Claude, OpenAI,
Gemini, etc."]
        Qdrant["Qdrant
(Patterns, Policies,
Capabilities)"]
    end
    subgraph K8s["Kubernetes Cluster"]
        API["Kubernetes API"]
        Controller["Controller
(dot-ai-controller)"]
        Resources["Cluster Resources
Deployments, StatefulSets,
CRDs, etc."]
    end
    subgraph WebUI["Web UI (dot-ai-ui)"]
        Viz["Visualization Dashboard
- Current State
- Proposed Changes
- Risk Assessment"]
    end

    Agent <-->|MCP Protocol| Operate
    Operate --> AI
    Operate --> Session
    Operate --> Kubectl
    Operate --> VectorClient
    AI --> LLM
    AI <-->|Tool Loop| Kubectl
    VectorClient --> Qdrant
    Kubectl --> API
    Operate -->|Execute Commands| API
    Controller -->|Scan & Embed| Qdrant
    Controller -->|Watch| Resources
    Agent -.->|User opens
Visualization URL| WebUI
```

## Operation Workflow

The operate tool implements a three-phase workflow with persistent session management:

```mermaid
flowchart TD
    subgraph Phase1["Phase 1: Intent Analysis"]
        Intent["User Intent
(natural language)"]
        CreateSession["Create Session
(opr-{ts}-{uuid})"]
        EmbedContext["Embed Context:
- Search Patterns
- Search Policies
- Search Capabilities"]
        ContextCheck{"Capabilities
Found?"}
        NoCapError["Error: Run
capability scan first"]
        Investigation["AI Investigation Loop
(max 30 iterations)"]
        KubectlTools["kubectl Tools:
get, describe, logs,
patch_dryrun, apply_dryrun,
delete_dryrun, get_crd_schema"]

        Intent --> CreateSession --> EmbedContext
        EmbedContext --> ContextCheck
        ContextCheck -->|No| NoCapError
        ContextCheck -->|Yes| Investigation
        Investigation <-->|Tool Calls| KubectlTools
    end
    subgraph Phase2["Phase 2: Analysis & Validation"]
        ParseResponse["Parse AI Response"]
        DryRunCheck{"Dry-Run
Validated?"}
        RetryAnalysis["AI Iterates with
Corrected Commands"]
        Analysis["Generate Analysis:
- Current State
- Proposed Changes
- Commands
- Risk Assessment"]
        SaveSession["Save Session
(status: analysis_complete)"]
        ReturnAnalysis["Return with
Visualization URL"]

        Investigation --> ParseResponse --> DryRunCheck
        DryRunCheck -->|No| RetryAnalysis --> Investigation
        DryRunCheck -->|Yes| Analysis
        Analysis --> SaveSession --> ReturnAnalysis
    end
    subgraph Phase3["Phase 3: User Approval & Execution"]
        UserChoice["User Approval
(executeChoice=1)"]
        LoadSession["Load Session"]
        ExecuteCommands["Execute Commands
(continue-on-error)"]
        PostValidation{"Has
validationIntent?"}
        CallRemediate["Call remediate
Internally"]
        SaveResults["Save Results
(status: executed_*)"]
        ReturnResults["Return Execution
Results"]

        ReturnAnalysis -.->|"User reviews
and approves"| UserChoice
        UserChoice --> LoadSession --> ExecuteCommands
        ExecuteCommands --> PostValidation
        PostValidation -->|Yes| CallRemediate --> SaveResults
        PostValidation -->|No| SaveResults
        SaveResults --> ReturnResults
    end
```

## Component Details

### MCP Server (dot-ai)

The MCP server provides the core operations engine:

| Component | File | Description |
|-----------|------|-------------|
| `operate` tool | `src/tools/operate.ts` | Entry point, routing, context embedding, formatting |
| Analysis workflow | `src/tools/operate-analysis.ts` | Intent analysis, AI tool loop, response parsing |
| Execution workflow | `src/tools/operate-execution.ts` | Command execution, post-validation, results |
| System Prompt | `prompts/operate-system.md` | AI instructions for operation behavior |
| User Prompt | `prompts/operate-user.md` | Handlebars template with context injection |
| `GenericSessionManager` | `src/core/generic-session-manager.ts` | File-based session persistence |
| `AIProvider` | `src/core/ai-provider.interface.ts` | AI abstraction with tool loop support |
| `kubectl-tools` | `src/core/kubectl-tools.ts` | Kubectl investigation and validation tools |
| Vector Services | `src/services/*-vector-service.ts` | Pattern, policy, capability search |
| `visualization` | `src/core/visualization.ts` | URL generation for web UI |

### Kubectl Investigation & Validation Tools

Tools available during AI analysis:

| Tool | Description |
|------|-------------|
| `kubectl_api_resources` | Discover available resources in cluster |
| `kubectl_get` | List resources with table format |
| `kubectl_describe` | Detailed resource information with events |
| `kubectl_logs` | Container logs for debugging |
| `kubectl_patch_dryrun` | Validate patch operations before execution |
| `kubectl_apply_dryrun` | Validate apply operations before execution |
| `kubectl_delete_dryrun` | Validate delete operations before execution |
| `kubectl_get_crd_schema` | Get CRD schema for custom resources |

### Controller (dot-ai-controller)

The Kubernetes controller provides capability scanning:

| Component | File | Description |
|-----------|------|-------------|
| Capability Scanner | `internal/controller/capability_scanner.go` | Discovers cluster resources and capabilities |
| Embedding Service | `internal/controller/embedding_service.go` | Generates embeddings for semantic search |
| Qdrant Client | `internal/controller/qdrant_client.go` | Stores capabilities in vector database |

### Web UI (dot-ai-ui)

Provides visualization for operation analysis and execution:

| Component | File | Description |
|-----------|------|-------------|
| Visualization Page | `src/pages/Visualization.tsx` | Main page for `/v/{sessionId}` |
| MermaidRenderer | `src/components/renderers/MermaidRenderer.tsx` | Interactive flowcharts |
| CardRenderer | `src/components/renderers/CardRenderer.tsx` | Current state and proposed changes |
| CodeRenderer | `src/components/renderers/CodeRenderer.tsx` | Commands with syntax highlighting |
| InsightsPanel | `src/components/InsightsPanel.tsx` | AI observations and risk assessment |

## Integration Points

```mermaid
flowchart LR
    subgraph MCP["MCP Server"]
        Operate["operate tool"]
        AIProvider["AI Provider"]
        KubectlTools["Kubectl Tools"]
        SessionMgr["Session Manager"]
        VectorSvc["Vector Services"]
    end
    subgraph AI["AI Providers"]
        Anthropic["Claude API"]
        OpenAI["OpenAI API"]
        Google["Gemini API"]
        Others["xAI, Bedrock,
OpenRouter, etc."]
    end
    subgraph VectorDB["Vector Database"]
        Patterns["Patterns
Collection"]
        Policies["Policies
Collection"]
        Capabilities["Capabilities
Collection"]
    end
    subgraph K8s["Kubernetes"]
        API["API Server"]
        Controller["Capability
Scanner"]
    end
    subgraph Storage["Session Storage"]
        Files["File System
~/.dot-ai/sessions/opr-sessions/"]
    end
    subgraph UI["Web UI"]
        Viz["Visualization
Dashboard"]
    end

    AIProvider <-->|Tool Loop| KubectlTools
    AIProvider --> Anthropic
    AIProvider --> OpenAI
    AIProvider --> Google
    AIProvider --> Others
    VectorSvc --> Patterns
    VectorSvc --> Policies
    VectorSvc --> Capabilities
    KubectlTools -->|Investigation| API
    Operate -->|Execute Commands| API
    Controller -->|Scan & Embed| Capabilities
    SessionMgr --> Files
    Operate -.->|Session URL| Viz
```

### MCP Server ↔ AI Provider

- **Tool Loop**: AI iteratively calls kubectl tools (max 30 iterations)
- **Investigation**: Gathers current cluster state to understand resources
- **Dry-Run Validation**: Validates all commands before proposing
- **Analysis**: Generates JSON response with changes, commands, and risk assessment

### MCP Server ↔ Vector Database

- **Patterns**: Organizational patterns for operational best practices
- **Policies**: Policy intents for validation and compliance
- **Capabilities**: Cluster resource capabilities for intelligent recommendations
- Capabilities are **mandatory**; patterns/policies are optional

### MCP Server ↔ Kubernetes API

- **Read Operations**: `kubectl get`, `describe`, `logs`
- **Validation**: `kubectl patch/apply/delete --dry-run=server`
- **Execution**: Sequential command execution via `child_process.exec()`

### MCP Server ↔ Remediate Tool

- **Post-Execution Validation**: Internally calls remediate with `validationIntent`
- **Verification**: Confirms operations completed successfully
- **Error Detection**: Identifies issues introduced by operations

### MCP Server ↔ Web UI

- **Session Storage**: Operation data stored with session IDs
- **Visualization API**: `/api/v1/visualize/{sessionId}` endpoint
- **URL Generation**: `WEB_UI_BASE_URL/v/{sessionId}`

## Session Management

Sessions persist workflow state across tool calls:

```
Session ID Format: opr-{timestamp}-{uuid8}
Example: opr-1704067200000-a1b2c3d4

Session Data:
├── toolName: 'operate'
├── intent: "Update my-api to v2.0 with zero downtime"
├── context:
│   ├── patterns: OrganizationalPattern[]
│   ├── policies: PolicyIntent[]
│   └── capabilities: ResourceCapability[]
├── proposedChanges:
│   ├── create: ResourceChange[]
│   ├── update: ResourceChange[]
│   └── delete: ResourceChange[]
├── commands: ["kubectl set image...", "kubectl patch..."]
├── dryRunValidation:
│   ├── status: 'success' | 'failed'
│   └── details: string
├── patternsApplied: ["Zero-Downtime Rolling Update"]
├── capabilitiesUsed: ["metrics-server", "KEDA"]
├── policiesChecked: ["Production Update Policy"]
├── risks: { level: 'low', description: "..." }
├── validationIntent: "Verify deployment rollout complete"
├── status: 'analyzing' | 'analysis_complete' | 'executing' | 'executed_*'
└── executionResults: [{command, success, output, error}]
```

### Session States

| State | Description |
|-------|-------------|
| `analyzing` | AI is gathering data and generating commands |
| `analysis_complete` | Analysis done, awaiting user approval |
| `executing` | Commands are being executed |
| `executed_successfully` | All commands succeeded |
| `executed_with_errors` | Some commands failed |
| `failed` | Analysis or execution failed |

## Organizational Context Integration

The operate tool integrates organizational knowledge via vector database search:

```mermaid
flowchart TB
    subgraph Input["User Intent"]
        Intent["Update my-api to v2.0
with zero downtime"]
    end
    subgraph Search["Vector DB Search"]
        PatternSearch["Search Patterns
(limit: 5)"]
        PolicySearch["Search Policies
(limit: 5)"]
        CapSearch["Search Capabilities
(limit: 50)"]
    end
    subgraph Results["Search Results"]
        Patterns["Patterns Found:
- Zero-Downtime Rolling Update
- Blue-Green Deployment"]
        Policies["Policies Found:
- Production Update Policy
- Change Management"]
        Caps["Capabilities Found:
- metrics-server
- KEDA Operator
- Argo Rollouts"]
    end
    subgraph Prompt["AI Prompt"]
        Context["Embedded Context:
patterns + policies + capabilities"]
    end

    Intent --> PatternSearch & PolicySearch & CapSearch
    PatternSearch --> Patterns
    PolicySearch --> Policies
    CapSearch --> Caps
    Patterns & Policies & Caps --> Context
```

### Context Priority

1. **Capabilities** (Mandatory): What the cluster can do
2. **Patterns** (Optional): Organizational best practices
3. **Policies** (Optional): Compliance and validation rules

## Output Formats

The operate tool returns structured output at different stages:

### Analysis Response

| Field | Description |
|-------|-------------|
| `status` | `awaiting_user_approval` |
| `sessionId` | Session ID for continuation |
| `visualizationUrl` | URL to view analysis in web UI |
| `currentState` | Current cluster resource state |
| `proposedChanges` | Create, update, delete operations |
| `commands` | Pre-validated kubectl commands |
| `dryRunValidation` | Dry-run validation results |
| `patternsApplied` | Applied organizational patterns |
| `capabilitiesUsed` | Used cluster capabilities |
| `policiesChecked` | Checked policies |
| `risks` | Risk assessment (level + description) |
| `validationIntent` | Post-execution validation instructions |

### Execution Response

| Field | Description |
|-------|-------------|
| `status` | `success` or `failed` |
| `sessionId` | Session ID for reference |
| `results` | Per-command execution results |
| `validation` | Post-execution validation summary |
| `message` | Human-readable summary |

## Error Handling

The operation workflow includes robust error handling:

1. **No Capabilities Found**: Clear guidance to run a capability scan first
2. **Session Not Found**: Guidance to start a new operation
3. **Dry-Run Failures**: AI iterates to fix commands before proposing
4. **Command Execution Failures**: Continue-on-error, capture all results
5. **Validation Failures**: Report issues via remediate tool integration
6. **AI Service Errors**: Logged with request IDs for debugging
7. **Investigation Timeouts**: A 30-iteration cap prevents infinite loops

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `AI_PROVIDER` | AI provider selection | `anthropic` |
| `ANTHROPIC_API_KEY` | Anthropic API key | Required if used |
| `OPENAI_API_KEY` | OpenAI API key | Required if used |
| `QDRANT_URL` | Qdrant vector database URL | `http://localhost:6333` |
| `QDRANT_API_KEY` | Qdrant API key | Optional |
| `QDRANT_CAPABILITIES_COLLECTION` | Capabilities collection name | `capabilities` |
| `KUBECONFIG` | Kubernetes config path | Auto-detected |
| `DOT_AI_SESSION_DIR` | Session storage directory | `~/.dot-ai/sessions` |
| `WEB_UI_BASE_URL` | Web UI base URL | Optional |
| `DEBUG_DOT_AI` | Enable debug logging | `false` |

### Supported AI Providers

| Provider | Models | Notes |
|----------|--------|-------|
| Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default, 1M token context |
| OpenAI | GPT-5.1-codex | |
| Google | Gemini 3 Pro, Flash | |
| xAI | Grok-4 | |
| Amazon Bedrock | Various | Uses AWS credential chain |
| OpenRouter | Multi-model | Proxy to multiple providers |
| Custom | Ollama, vLLM, LocalAI | Via `baseURL` config |

## Workflow Example

```
User Intent: "Update my-api deployment in prod to v2.0 with zero downtime"

1. CONTEXT EMBEDDING
   └─ embedContext(intent)
      ├─ Search patterns → "Zero-Downtime Rolling Update"
      ├─ Search policies → "Production Update Requirements"
      └─ Search capabilities → "metrics-server", "KEDA Operator"

2. AI INVESTIGATION LOOP
   └─ AI Tool Loop (30 iterations max)
      ├─ kubectl_get deployment/my-api -n prod
      ├─ kubectl_describe deployment/my-api -n prod
      ├─ kubectl_patch_dryrun (test maxUnavailable: 0)
      └─ kubectl_set_image (test v2.0 image --dry-run=server)

3. ANALYSIS GENERATION
   └─ Session created: opr-1704067200000-a1b2c3d4
      ├─ Status: analysis_complete
      ├─ Current: 3 replicas, my-api:v1.5, maxUnavailable: 1
      ├─ Proposed: image v2.0, maxUnavailable: 0
      ├─ Commands: set image + patch strategy
      ├─ Risk: LOW
      └─ Visualization URL: https://dot-ai-ui/v/opr-1704067200000-a1b2c3d4

4. USER APPROVAL
   └─ User reviews analysis in terminal or web UI
      └─ Calls: operate({ sessionId: 'opr-...', executeChoice: 1 })

5. COMMAND EXECUTION
   └─ executeOperations()
      ├─ Load session (status: analysis_complete)
      ├─ Update status to executing
      ├─ Execute commands sequentially
      │  ├─ kubectl set image deployment/my-api my-api=my-api:v2.0 -n prod
      │  └─ kubectl patch deployment/my-api -n prod -p '{"spec":...}'
      ├─ Call remediate internally for validation
      └─ Update status to executed_successfully

6. RETURN RESULTS
   └─ Results: 2 commands succeeded
      ├─ Validation: "Rollout complete, all pods running v2.0"
      └─ Status: success
```

## See Also

- [MCP Operate Guide](/docs/mcp/operate/)
- [Remediate Architecture](./remediate.md)
- [Recommendation Architecture](./recommendation.md)
- [Controller Documentation](/docs/controller/)
- [Web UI Documentation](/docs/ui/)

# Query Feature Architecture

This document provides a detailed architecture overview of the Query feature in the DevOps AI Toolkit.

## Overview

The Query feature provides a natural language interface for Kubernetes cluster intelligence. Users can ask questions about cluster resources, capabilities, and status in plain English, with the AI autonomously deciding which tools to call to gather the necessary information.

## High-Level Architecture

```mermaid
flowchart TB
    subgraph Users["User / AI Agent"]
        Agent["Claude Code, Cursor,
VS Code, etc."]
    end
    subgraph MCP["MCP Server (dot-ai)"]
        Query["query Tool"]
        AI["AI Provider"]
        Session["Session
Manager"]
        CapTools["Capability
Tools"]
        ResTools["Resource
Tools"]
        KubectlTools["Kubectl
Tools"]
    end
    subgraph External["External Services"]
        LLM["Claude, OpenAI,
Gemini, etc."]
        Qdrant["Qdrant
Vector DB"]
        Embeddings["Embedding
Service"]
    end
    subgraph K8s["Kubernetes Cluster"]
        API["Kubernetes API"]
        Controller["Controller
(dot-ai-controller)"]
        Resources["Cluster Resources
Pods, Deployments,
Services, etc."]
        CRDs["Custom Resource
Definitions"]
    end
    subgraph WebUI["Web UI (dot-ai-ui)"]
        Viz["Visualization Dashboard
- Mermaid Diagrams
- Tables & Cards
- Code Blocks"]
    end

    Agent <-->|MCP Protocol| Query
    Query --> AI
    Query --> Session
    AI --> LLM
    AI <-->|Tool Loop| CapTools
    AI <-->|Tool Loop| ResTools
    AI <-->|Tool Loop| KubectlTools
    CapTools --> Qdrant
    ResTools --> Qdrant
    KubectlTools --> API
    Qdrant --> Embeddings
    Controller -->|Sync Resources| Qdrant
    Controller -->|Watch| Resources
    Controller -->|Watch| CRDs
    Agent -.->|User opens
Visualization URL| WebUI
```

## Query Workflow

The query tool operates as an agentic loop where the AI autonomously decides which tools to call:

```mermaid
flowchart TD
    subgraph Phase1["Phase 1: Input Processing"]
        Intent["User Intent
(natural language)"]
        Validate["Validate Input
(1-1000 chars)"]
        RequestId["Generate RequestId
for tracking"]

        Intent --> Validate --> RequestId
    end
    subgraph Phase2["Phase 2: AI Initialization"]
        LoadPrompt["Load System Prompt
(query-system.md)"]
        InitProvider["Initialize AI Provider
(from environment)"]
        BuildTools["Build Tool Definitions
- search_capabilities
- query_capabilities
- search_resources
- query_resources
- kubectl_*"]

        RequestId --> LoadPrompt --> InitProvider --> BuildTools
    end
    subgraph Phase3["Phase 3: Agentic Tool Loop"]
        ToolLoop["AI Tool Loop
(Vercel AI SDK)"]
        Decision{"AI
Decision"}
        SemanticSearch["Semantic Search
(Qdrant vectors)"]
        FilterQuery["Filter Query
(Qdrant filters)"]
        KubectlExec["Kubectl Execution
(live cluster)"]
        CollectResults["Collect Tool Results"]

        BuildTools --> ToolLoop --> Decision
        Decision -->|search_* tools| SemanticSearch --> CollectResults
        Decision -->|query_* tools| FilterQuery --> CollectResults
        Decision -->|kubectl_* tools| KubectlExec --> CollectResults
        CollectResults -->|Continue| ToolLoop
    end
    subgraph Phase4["Phase 4: Response Generation"]
        MaxIterations{"Max Iterations
Reached?"}
        FinalResponse["AI Final Response
(JSON format)"]
        ParseJSON["Parse Summary
from JSON"]

        Decision -->|Done| FinalResponse
        CollectResults --> MaxIterations
        MaxIterations -->|Yes| FinalResponse
        MaxIterations -->|No| ToolLoop
        FinalResponse --> ParseJSON
    end
    subgraph Phase5["Phase 5: Session & Output"]
        CreateSession["Create Session
(qry-{ts}-{uuid})"]
        StoreData["Store Session Data
- intent
- summary
- toolsUsed
- toolCallsExecuted"]
        GenURL{"WEB_UI_BASE_URL
Configured?"}
        VizURL["Generate Visualization URL
{baseUrl}/v/{sessionId}"]
        BuildOutput["Build QueryOutput
- success: true
- summary
- toolsUsed
- sessionId
- visualizationUrl"]

        ParseJSON --> CreateSession --> StoreData --> GenURL
        GenURL -->|Yes| VizURL --> BuildOutput
        GenURL -->|No| BuildOutput
    end
```

## Component Details

### MCP Server (dot-ai)

The MCP server hosts the query tool and orchestrates AI-driven investigation:

| Component | File | Description |
|-----------|------|-------------|
| `query` tool | `src/tools/query.ts` | Entry point, orchestrates tool loop and session |
| System Prompt | `prompts/query-system.md` | AI instructions for query behavior |
| `GenericSessionManager` | `src/core/generic-session-manager.ts` | File-based session persistence |
| `AIProvider` | `src/core/ai-provider.interface.ts` | AI abstraction with tool loop support |
| `AIProviderFactory` | `src/core/ai-provider-factory.ts` | Multi-provider factory |
| `capability-tools` | `src/core/capability-tools.ts` | Semantic search for capabilities |
| `resource-tools` | `src/core/resource-tools.ts` | Semantic search for resources |
| `kubectl-tools` | `src/core/kubectl-tools.ts` | Kubectl read-only tools |
| `CapabilityVectorService` | `src/core/capability-vector-service.ts` | Capability embeddings storage |
| `ResourceVectorService` | `src/core/resource-vector-service.ts` | Resource embeddings storage |
| `EmbeddingService` | `src/core/embedding-service.ts` | Multi-provider embedding generation |
| `visualization` | `src/core/visualization.ts` | URL generation for web UI |

### Query Tools Available to AI

The AI can autonomously call these tools during investigation:

| Tool | Type | Description |
|------|------|-------------|
| `search_capabilities` | Semantic | Vector similarity search for resource capabilities |
| `query_capabilities` | Filter | Structured filter queries for capabilities |
| `search_resources` | Semantic | Vector similarity search for cluster resources |
| `query_resources` | Filter | Structured filter queries for resources |
| `kubectl_api_resources` | Kubectl | List all available API resources |
| `kubectl_get` | Kubectl | Get resources with current state |
| `kubectl_describe` | Kubectl | Detailed resource information |
| `kubectl_logs` | Kubectl | Container logs from pods |
| `kubectl_events` | Kubectl | Kubernetes events for troubleshooting |
| `kubectl_get_crd_schema` | Kubectl | OpenAPI v3 schema for CRDs |

### Controller (dot-ai-controller)

The Kubernetes controller syncs cluster data to Qdrant for semantic search:

| Component | File | Description |
|-----------|------|-------------|
| `ResourceSyncReconciler` | `internal/controller/resourcesync_controller.go` | Watches all resources, syncs to MCP |
| `CapabilityScanReconciler` | `internal/controller/capabilityscan_controller.go` | Triggers capability scans on CRD changes |
| `ResourceSyncConfig` CRD | `api/v1alpha1/resourcesyncconfig_types.go` | Configuration for resource sync |
| `CapabilityScanConfig` CRD | `api/v1alpha1/capabilityscanconfig_types.go` | Configuration for capability scanning |
| MCP Resource Client | `internal/controller/resourcesync_mcp.go` | HTTP client for `/api/v1/resources/sync` |
| MCP Capability Client | `internal/controller/capabilityscan_mcp.go` | HTTP client for `/api/v1/tools/manageOrgData` |
| Debounce Buffer | `internal/controller/resourcesync_debounce.go` | Batches changes before sync |

### Web UI (dot-ai-ui)

Provides visualization for query results:

| Component | File | Description |
|-----------|------|-------------|
| Visualization Page | `src/pages/Visualization.tsx` | Main page for `/v/{sessionId}` |
| MermaidRenderer | `src/components/renderers/MermaidRenderer.tsx` | Interactive diagrams with zoom/pan |
| CardRenderer | `src/components/renderers/CardRenderer.tsx` | Resource cards in grid layout |
| TableRenderer | `src/components/renderers/TableRenderer.tsx` | Tabular data display |
| CodeRenderer | `src/components/renderers/CodeRenderer.tsx` | Syntax-highlighted code/YAML |
| InsightsPanel | `src/components/InsightsPanel.tsx` | AI observations display |
| TabContainer | `src/components/TabContainer.tsx` | Multi-visualization tabs |
| API Client | `src/api/client.ts` | Data fetching from MCP server |

## Integration Points

```mermaid
flowchart LR
    subgraph MCP["MCP Server"]
        Query["query tool"]
        AIProvider["AI Provider"]
        CapTools["Capability Tools"]
        ResTools["Resource Tools"]
        KubectlTools["Kubectl Tools"]
        SessionMgr["Session Manager"]
    end
    subgraph Vector["Vector Database"]
        Qdrant["Qdrant"]
        CapColl["capabilities
collection"]
        ResColl["resources
collection"]
    end
    subgraph AI["AI Providers"]
        Anthropic["Claude API"]
        OpenAI["OpenAI API"]
        Google["Gemini API"]
        Others["xAI, Bedrock,
OpenRouter, etc."]
    end
    subgraph Embed["Embedding Providers"]
        OpenAIEmbed["OpenAI
text-embedding-3-small"]
        GoogleEmbed["Google
gemini-embedding-001"]
        BedrockEmbed["Bedrock
titan-embed-text-v2"]
    end
    subgraph K8s["Kubernetes"]
        API["API Server"]
        Controller["dot-ai-controller"]
    end
    subgraph Storage["Session Storage"]
        Files["File System
~/.dot-ai/sessions/qry-*"]
    end
    subgraph UI["Web UI"]
        Viz["Visualization
Dashboard"]
    end

    AIProvider <-->|Tool Loop| CapTools
    AIProvider <-->|Tool Loop| ResTools
    AIProvider <-->|Tool Loop| KubectlTools
    AIProvider --> Anthropic
    AIProvider --> OpenAI
    AIProvider --> Google
    AIProvider --> Others
    CapTools --> Qdrant
    ResTools --> Qdrant
    Qdrant --> CapColl
    Qdrant --> ResColl
    Qdrant --> OpenAIEmbed
    Qdrant --> GoogleEmbed
    Qdrant --> BedrockEmbed
    KubectlTools --> API
    Controller -->|POST /api/v1/resources/sync| MCP
    Controller -->|POST /api/v1/tools/manageOrgData| MCP
    Controller -->|Watch| K8s
    SessionMgr --> Files
    Query -.->|Session URL| Viz
```

### MCP Server ↔ AI Provider

- **Tool Loop**: AI iteratively calls tools (max 20 iterations by default)
- **Autonomous Decision**: AI decides which tools to call based on user intent
- **JSON Output**: AI returns structured JSON with summary

### MCP Server ↔ Qdrant

- **Semantic Search**: Vector similarity using embeddings
- **Filter Queries**: Structured queries by namespace, kind, labels, etc.
- **Capability Data**: Resource kinds with capabilities, providers, abstractions
- **Resource Data**: Cluster resources with labels, annotations, timestamps

### Controller ↔ MCP Server

- **Resource Sync**: Controller watches all resources and syncs to MCP
- **Capability Scanning**: Controller triggers scans when CRDs change
- **Debounced Batching**: Changes batched every 10s to reduce API calls
- **Periodic Resync**: Full resync every 60 minutes for consistency

### MCP Server ↔ Web UI

- **Session Storage**: Query results stored with session IDs
- **Visualization API**: `/api/v1/visualize/{sessionId}` endpoint
- **URL Generation**: `WEB_UI_BASE_URL/v/{sessionId}`
- **Cached Visualizations**: AI-generated visualizations cached in session

## Session Management

Sessions persist query data for visualization:

```
Session ID Format: qry-{timestamp}-{uuid8}
Example: qry-1767465086590-a1b2c3d4

Session Data:
├── toolName: 'query'
├── intent: "What databases are running in the cluster?"
├── summary: "Found 3 PostgreSQL clusters and 2 Redis instances..."
├── toolsUsed: ['search_capabilities', 'kubectl_get', ...]
├── iterations: 5
├── toolCallsExecuted:
│   ├── {tool: 'search_capabilities', input: {...}, output: {...}}
│   ├── {tool: 'kubectl_get', input: {...}, output: {...}}
│   └── ...
├── cachedVisualization:
│   ├── title: "Database Resources"
│   ├── visualizations: [...]
│   ├── insights: [...]
│   └── generatedAt: ISO timestamp
└── timestamp: ISO date
```

## Data Flow: Resource Sync

```mermaid
flowchart LR
    subgraph K8s["Kubernetes Cluster"]
        Resources["Resources
(Pods, Deployments, etc.)"] CRDs["Custom Resources"] end subgraph Controller["dot-ai-controller"] Informers["Dynamic
Informers"] CRDWatcher["CRD
Watcher"] DebounceRes["Resource
Debounce Buffer"] DebounceCap["Capability
Debounce Buffer"] end subgraph MCP["MCP Server"] SyncAPI["POST /api/v1/
resources/sync"] ManageAPI["POST /api/v1/
tools/manageOrgData"] VectorSvc["Vector
Services"] end subgraph Qdrant["Qdrant"] ResColl["resources
collection"] CapColl["capabilities
collection"] end Resources -->|Watch| Informers CRDs -->|Watch| CRDWatcher Informers -->|Queue| DebounceRes CRDWatcher -->|Queue| DebounceCap DebounceRes -->|Batch every 10s| SyncAPI DebounceCap -->|Batch every 10s| ManageAPI SyncAPI --> VectorSvc ManageAPI --> VectorSvc VectorSvc -->|Upsert/Delete| ResColl VectorSvc -->|Scan/Delete| CapColl ``` ## Output Formats The query tool returns structured output: | Field | Description | |-------|-------------| | `success` | Boolean indicating query success | | `summary` | Human-readable summary of findings | | `toolsUsed` | List of tools called during investigation | | `iterations` | Number of AI tool loop iterations | | `sessionId` | Session ID for visualization | | `visualizationUrl` | URL to view results (if configured) | | `guidance` | Instructions for agent on presenting results | | `error` | Error object with code and message (if failed) | ## Error Handling The query workflow includes robust error handling: 1. **Input Validation**: Intent must be non-empty string (1-1000 chars) 2. **AI Provider Errors**: Logged with request IDs for debugging 3. **Vector DB Unavailable**: Falls back to keyword search or kubectl-only 4. **JSON Parsing Failures**: Original AI response logged for analysis 5. **Tool Execution Errors**: Captured in toolCallsExecuted, investigation continues 6. 
**Max Iterations**: Default 20 iterations prevents infinite loops ## Configuration ### Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `AI_PROVIDER` | AI provider selection | `anthropic` | | `ANTHROPIC_API_KEY` | Anthropic API key | Required if using | | `OPENAI_API_KEY` | OpenAI API key | Required if using | | `QDRANT_URL` | Qdrant vector DB endpoint | `http://localhost:6333` | | `QDRANT_API_KEY` | Qdrant authentication | Optional | | `QDRANT_CAPABILITIES_COLLECTION` | Capabilities collection name | `capabilities` | | `QDRANT_RESOURCES_COLLECTION` | Resources collection name | `resources` | | `CUSTOM_EMBEDDINGS_API_KEY` | Embedding service API key | Falls back to provider key | | `KUBECONFIG` | Kubernetes config path | Auto-detected | | `WEB_UI_BASE_URL` | Web UI base URL | Optional | | `DEBUG_DOT_AI` | Enable debug logging | `false` | ### Supported AI Providers | Provider | Models | Notes | |----------|--------|-------| | Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default, 1M token context | | OpenAI | GPT-5.1-codex | | | Google | Gemini 3 Pro, Flash | | | xAI | Grok-4 | | | Amazon Bedrock | Various | Uses AWS credential chain | | OpenRouter | Multi-model | Proxy to multiple providers | | Custom | Ollama, vLLM, LocalAI | Via `baseURL` config | ### Supported Embedding Providers | Provider | Model | Dimensions | |----------|-------|------------| | OpenAI | text-embedding-3-small | 1536 | | Google | gemini-embedding-001 | 768 | | Amazon Bedrock | titan-embed-text-v2:0 | 1024 | ## See Also - [MCP Query Guide](//mcp/query/) - [Controller Documentation](//controller/) - [Web UI Documentation](//ui/) - [Vector Database Setup](//setup/qdrant/) # Recommendation Feature Architecture This document provides a detailed architecture overview of the Recommendation feature in the DevOps AI Toolkit. ## Overview The Recommendation feature provides AI-powered Kubernetes deployment recommendations. 
It analyzes user intent, discovers cluster capabilities, and generates deployment solutions with full manifest generation and deployment capabilities. ## High-Level Architecture ```mermaid flowchart TB subgraph Users["User / AI Agent"] Agent["Claude Code, Cursor,
VS Code, etc."] end subgraph MCP["MCP Server (dot-ai)"] Recommend["recommend Tool"] AI["AI Provider"] Vector["Vector DB
Client"] Discovery["Discovery
Engine"] end subgraph External["External Services"] LLM["Claude, OpenAI,
Ollama, etc."] Qdrant["Qdrant
(Semantic Search)"] end subgraph K8s["Kubernetes Cluster"] API["Kubernetes API"] Controller["Controller
(dot-ai-controller)"] Resources["Deployed Resources
Deployments, Services,
Ingress, HPA, PDB, etc."] end subgraph WebUI["Web UI (dot-ai-ui)"] Viz["Visualization Dashboard
- Solution Comparisons
- Resource Diagrams
- Generated Manifests"] end Agent <-->|MCP Protocol| Recommend Recommend --> AI Recommend --> Vector Recommend --> Discovery AI --> LLM Vector -->|Embeddings| LLM Vector --> Qdrant Discovery --> API Recommend --> API Controller --> Resources Controller -->|Sync with Embeddings| Qdrant Agent -.->|User opens
Visualization URL| WebUI ``` ## Recommendation Workflow Stages The recommendation tool operates as a unified multi-stage workflow: ```mermaid flowchart TD subgraph Stage1["Stage 1: recommend"] UserIntent["User Intent"] IntentCheck{"Intent < 100 chars?"} Refine["Return Refinement Guidance"] CapSearch["Capability Search
(Vector DB)"] AIRank["AI Ranking
(Claude)"] CapMatch{"Capability
Match?"} ResourceSol["Generate Resource-Based
Solutions"] HelmSearch["Search ArtifactHub
for Helm Charts"] Solutions["Solutions with solutionIds
+ visualization URL"] UserIntent --> IntentCheck IntentCheck -->|Yes| Refine IntentCheck -->|No| CapSearch CapSearch --> AIRank AIRank --> CapMatch CapMatch -->|Yes| ResourceSol CapMatch -->|No| HelmSearch ResourceSol --> Solutions HelmSearch --> Solutions end subgraph Stage2["Stage 2: chooseSolution"] SelectSol["solutionId"] LoadSession["Load Session"] GenQuestions["Generate Questions
(if Helm)"] ReturnQuestions["Required Questions"] SelectSol --> LoadSession --> GenQuestions --> ReturnQuestions end subgraph Stage3["Stage 3-6: answerQuestion"] Required["answerQuestion:required
(name, namespace, image, port)"] Basic["answerQuestion:basic
(replicas, resources, ingress)"] Advanced["answerQuestion:advanced
(probes, PDB, security)"] Open["answerQuestion:open
(free-form, AI enhancement)"] Ready["ready_for_manifest_generation"] Required --> Basic --> Advanced --> Open --> Ready end subgraph Stage4["Stage 7: generateManifests"] GenType{"Solution
Type?"} subgraph Capability["Capability-Based"] C1["1. Retrieve Schemas"] C2["2. AI Generation"] C3["3. YAML Validation"] C4["4. kubectl Dry-Run"] C5["5. Retry Loop (max 10)"] C6["6. Package Output"] C1 --> C2 --> C3 --> C4 --> C5 --> C6 end subgraph Helm["Helm-Based"] H1["1. Fetch Chart"] H2["2. AI Values Gen"] H3["3. Helm Dry-Run"] H4["4. Retry Loop"] H1 --> H2 --> H3 --> H4 end GenType -->|Capability| C1 GenType -->|Helm| H1 C6 --> Manifests H4 --> Manifests Manifests["Manifests + visualization URL"] end subgraph Stage5["Stage 8: deployManifests"] DeployType{"Solution
Type?"} CapDeploy["kubectl apply -k
--wait"] HelmDeploy["helm upgrade
--install --wait"] Status["Deployment Status
+ Next Steps"] DeployType -->|Capability| CapDeploy DeployType -->|Helm| HelmDeploy CapDeploy --> Status HelmDeploy --> Status end Solutions --> SelectSol ReturnQuestions --> Required Ready --> GenType Manifests --> DeployType ``` ## Component Details ### MCP Server (dot-ai) The MCP server is the core recommendation engine: | Component | File | Description | |-----------|------|-------------| | `recommend` tool | `src/tools/recommend.ts` | Entry point, routes to stages, generates solutions | | `chooseSolution` | `src/tools/choose-solution.ts` | Loads selected solution, returns questions | | `answerQuestion` | `src/tools/answer-question.ts` | Processes answers, manages stage progression | | `generateManifests` | `src/tools/generate-manifests.ts` | AI manifest generation with validation loop | | `deployManifests` | `src/tools/deploy-manifests.ts` | Deploys via kubectl or helm | | `ResourceRecommender` | `src/core/schema.ts` | AI-powered solution ranking and filtering | | `CapabilityVectorService` | `src/core/capability-vector-service.ts` | Semantic search for capabilities | | `PatternVectorService` | `src/core/pattern-vector-service.ts` | Organizational pattern matching | | `PolicyVectorService` | `src/core/policy-vector-service.ts` | Policy enforcement | | `GenericSessionManager` | `src/core/generic-session-manager.ts` | Session state management | | `ArtifactHubService` | `src/core/artifacthub.ts` | Helm chart discovery | ### Controller (dot-ai-controller) The Kubernetes controller manages deployed solutions: | CRD | Description | |-----|-------------| | `Solution` | Groups related resources, manages ownerReferences, aggregates health | | `ResourceSyncConfig` | Syncs resource metadata to MCP for semantic search | | `CapabilityScanConfig` | Scans cluster for resource capabilities | | `RemediationPolicy` | Event-driven remediation (separate feature) | ### Web UI (dot-ai-ui) Provides visualization for recommendation results: - **Visualization Page** (`/v/{sessionId}`) 
- Renders solution comparisons - **Mermaid Diagrams** - Architecture and flow diagrams - **Resource Cards** - Solution component details - **Code Blocks** - Generated manifests with syntax highlighting - **Tables** - Configuration summaries ## Integration Points ```mermaid flowchart LR subgraph MCP["MCP Server"] Recommend["recommend tool"] Schema["ResourceRecommender"] CapVec["CapabilityVectorService"] PatVec["PatternVectorService"] PolVec["PolicyVectorService"] Discovery["Discovery Engine"] ArtHub["ArtifactHubService"] end subgraph VectorDB["Qdrant"] Capabilities["Capabilities
Collection"] Patterns["Patterns
Collection"] Policies["Policies
Collection"] end subgraph AI["AI Provider"] Claude["Claude API"] OpenAI["OpenAI API"] end subgraph K8s["Kubernetes"] API["API Server"] Controller["dot-ai-controller"] end subgraph External["External"] ArtifactHub["ArtifactHub API"] end subgraph UI["Web UI"] Viz["Visualization
Dashboard"] end CapVec <-->|Semantic Search| Capabilities PatVec <-->|Pattern Match| Patterns PolVec <-->|Policy Lookup| Policies Schema -->|Solution Ranking| Claude Schema -->|Solution Ranking| OpenAI Discovery -->|kubectl explain| API Recommend -->|kubectl apply| API ArtHub -->|Chart Search| ArtifactHub Recommend -.->|Session URL| Viz Controller -->|Watch Resources| API ``` ### MCP Server ↔ Vector DB (Qdrant) - **Capability Storage**: Resource capabilities with semantic embeddings - **Pattern Storage**: Organizational patterns for solution enhancement - **Policy Storage**: Policy intents for configuration enforcement - **Semantic Search**: Natural language queries matched to stored data ### MCP Server ↔ Kubernetes API - **Resource Discovery**: `kubectl api-resources`, `kubectl explain` - **Schema Retrieval**: OpenAPI schemas for manifest generation - **Manifest Validation**: `kubectl apply --dry-run=server` - **Deployment**: `kubectl apply`, `helm upgrade --install` ### MCP Server ↔ AI Provider - **Solution Assembly**: Ranking and filtering discovered capabilities - **Question Generation**: Creating configuration questions from schemas - **Manifest Generation**: Generating YAML from solution + answers - **Helm Values**: Generating values.yaml for chart installations ### MCP Server ↔ Web UI - **Session Storage**: Solution data stored with session IDs - **Visualization API**: `/api/v1/visualize/{sessionId}` endpoint - **URL Generation**: `WEB_UI_BASE_URL/v/{sessionId}` ### Controller ↔ MCP Server - **Resource Sync**: Controller syncs resource metadata to MCP - **Capability Scan**: Controller triggers capability discovery - **Solution CR**: MCP generates Solution CR for controller management ## Session Management Sessions persist workflow state across tool calls: ``` Session ID Format: sol-{timestamp}-{uuid8} Example: sol-1765409923079-fa3f055c Session Data: ├── toolName: 'recommend' ├── stage: 'recommend' | 'generateManifests' | ...
├── intent: "Deploy PostgreSQL database" ├── type: 'single' | 'combination' | 'helm' ├── score: 96 ├── description: "Multi-cloud PostgreSQL via DevOps Toolkit" ├── resources: [{kind, apiVersion, group, description}] ├── chart: {repository, chartName, version} (if Helm) ├── questions: {required, basic, advanced, open} ├── answers: {questionId: value} ├── appliedPatterns: ["DevOps Toolkit DB Pattern"] ├── generatedManifests: {type, files, helmCommand} └── timestamp: ISO date ``` ## Output Formats The recommendation tool supports three output formats for capability-based solutions: | Format | Description | Files Generated | |--------|-------------|-----------------| | `raw` | Plain YAML manifests | `manifests.yaml` | | `helm` | Helm chart structure | `Chart.yaml`, `values.yaml`, `templates/*.yaml` | | `kustomize` | Kustomize overlay | `kustomization.yaml`, `base/`, `overlays/` | ## Error Handling The recommendation workflow includes robust error handling: 1. **Intent Refinement**: Vague intents get guidance, not failure 2. **Validation Loops**: Up to 10 retries for manifest generation 3. **Capability Gaps**: Clear error when enhancement isn't possible 4. **Session Expiry**: Graceful handling of expired sessions 5. **AI Service Errors**: Fallback to original solution on enhancement failure ## See Also - [MCP Recommendation Guide](//mcp/recommend/) - [Capability Management Guide](//mcp/capability-management/) - [Pattern Management Guide](//mcp/pattern-management/) - [Controller Documentation](//controller/) - [Web UI Documentation](//ui/) # Remediate Feature Architecture This document provides a detailed architecture overview of the Remediate feature in the DevOps AI Toolkit. ## Overview The Remediate feature provides AI-powered Kubernetes issue analysis and remediation. It investigates problems using kubectl tools, identifies root causes with confidence scoring, and executes verified fixes with optional post-execution validation. 
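The workflow in this section gates automatic execution on `Confidence >= threshold AND Risk <= maxRisk`; when the check fails, the tool returns with a `fallbackReason` instead of executing. A minimal shell sketch of that gate — function and variable names here are illustrative, not taken from the codebase:

```shell
#!/bin/sh
# Illustrative sketch of the automatic-mode gate: execute only when
# confidence >= threshold AND risk <= max risk. Names are hypothetical.

risk_rank() {
  case "$1" in
    low) echo 1 ;;
    medium) echo 2 ;;
    high) echo 3 ;;
    *) echo 99 ;;  # unknown risk level: never auto-execute
  esac
}

should_auto_execute() {
  confidence="$1"; threshold="$2"; risk="$3"; max_risk="$4"
  # awk handles the floating-point comparison portably
  conf_ok=$(awk -v c="$confidence" -v t="$threshold" \
    'BEGIN { if (c >= t) print 1; else print 0 }')
  [ "$conf_ok" -eq 1 ] && [ "$(risk_rank "$risk")" -le "$(risk_rank "$max_risk")" ]
}

if should_auto_execute 0.92 0.8 low low; then
  echo "auto-execute"
else
  echo "fallback: return analysis with fallbackReason"
fi
```

In the actual controller, the threshold values come from `RemediationPolicy` fields (`confidenceThreshold`, `maxRiskLevel`) and the confidence and risk come from the AI's final analysis.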
## High-Level Architecture ```mermaid flowchart TB subgraph Users["User / AI Agent"] Agent["Claude Code, Cursor,
VS Code, etc."] end subgraph MCP["MCP Server (dot-ai)"] Remediate["remediate Tool"] AI["AI Provider"] Session["Session
Manager"] Kubectl["Kubectl Tools"] end subgraph External["External Services"] LLM["Claude, OpenAI,
Gemini, etc."] end subgraph K8s["Kubernetes Cluster"] API["Kubernetes API"] Controller["Controller
(dot-ai-controller)"] Events["Kubernetes Events"] Resources["Cluster Resources
Pods, Deployments,
Services, etc."] end subgraph WebUI["Web UI (dot-ai-ui)"] Viz["Visualization Dashboard
- Investigation Flow
- Root Cause Analysis
- Remediation Commands"] end subgraph Notifications["Notifications"] Slack["Slack"] GChat["Google Chat"] end Agent <-->|MCP Protocol| Remediate Remediate --> AI Remediate --> Session Remediate --> Kubectl AI --> LLM AI <-->|Tool Loop| Kubectl Kubectl --> API Remediate -->|Execute Commands| API Controller -->|Watch| Events Controller -->|RemediationPolicy| Remediate Controller -.->|Webhook| Slack Controller -.->|Webhook| GChat Events --> Resources Agent -.->|User opens
Visualization URL| WebUI ``` ## Remediation Workflow The remediate tool operates as a multi-phase workflow with persistent session management: ```mermaid flowchart TD subgraph Phase1["Phase 1: Investigation"] Issue["Issue Description"] CreateSession["Create Session
(rem-{ts}-{uuid})"] Investigation["AI Investigation Loop
(max 30 iterations)"] KubectlTools["kubectl Tools:
get, describe, logs,
events, api-resources, patch"] Issue --> CreateSession --> Investigation Investigation <-->|Tool Calls| KubectlTools end subgraph Phase2["Phase 2: Analysis"] ParseResponse["Parse AI Response"] StatusCheck{"Issue
Status?"} AlreadyResolved["Return: Issue already
resolved/non-existent"] Analysis["Root Cause Analysis
+ Confidence Score
+ Contributing Factors"] Investigation --> ParseResponse --> StatusCheck StatusCheck -->|resolved| AlreadyResolved StatusCheck -->|non_existent| AlreadyResolved StatusCheck -->|active| Analysis end subgraph Phase3["Phase 3: Execution Decision"] ModeCheck{"Execution
Mode?"} subgraph Manual["Manual Mode (default)"] ReturnChoices["Return 2 Choices:
1. Execute via MCP
2. Execute via Agent"] WaitApproval["await_user_approval"] end subgraph Auto["Automatic Mode"] ThresholdCheck{"Confidence >= threshold
AND Risk <= maxRisk?"} AutoExecute["Execute Automatically"] Fallback["Return with
fallbackReason"] end Analysis --> ModeCheck ModeCheck -->|manual| ReturnChoices --> WaitApproval ModeCheck -->|automatic| ThresholdCheck ThresholdCheck -->|Yes| AutoExecute ThresholdCheck -->|No| Fallback end subgraph Phase4["Phase 4: Command Execution"] UserChoice["User Choice
(executeChoice=1 or 2)"] Choice1{"Choice?"} ExecuteMCP["Execute via MCP
(child_process.exec)"] ExecuteAgent["Return Commands
for Agent Execution"] LogResults["Log Results
(success/failure/output)"] WaitApproval --> UserChoice --> Choice1 Choice1 -->|1| ExecuteMCP --> LogResults Choice1 -->|2| ExecuteAgent AutoExecute --> ExecuteMCP end subgraph Phase5["Phase 5: Validation"] ValidationCheck{"All Commands
Succeeded?"} HasValidation{"Has validationIntent?"} RecursiveCall["Recursive Investigation
with validationIntent"] WaitReconcile["Wait 30s
(automatic mode)"] FinalStatus["Final Status:
resolved / still_active"] LogResults --> ValidationCheck ValidationCheck -->|Yes| HasValidation ValidationCheck -->|No| FinalStatus HasValidation -->|Yes| WaitReconcile --> RecursiveCall --> FinalStatus HasValidation -->|No| FinalStatus end ``` ## Component Details ### MCP Server (dot-ai) The MCP server is the core remediation engine: | Component | File | Description | |-----------|------|-------------| | `remediate` tool | `src/tools/remediate.ts` | Entry point, orchestrates investigation and execution | | System Prompt | `prompts/remediate-system.md` | AI instructions for investigation behavior | | `GenericSessionManager` | `src/core/generic-session-manager.ts` | File-based session persistence | | `AIProvider` | `src/core/ai-provider.interface.ts` | AI abstraction with tool loop support | | `AIProviderFactory` | `src/core/ai-provider-factory.ts` | Multi-provider factory (Anthropic, OpenAI, etc.) | | `kubectl-tools` | `src/core/kubectl-tools.ts` | Kubectl investigation tools | | `visualization` | `src/core/visualization.ts` | URL generation for web UI | ### Kubectl Investigation Tools Tools available during AI investigation: | Tool | Description | |------|-------------| | `kubectl_api_resources` | Discover available resources in cluster | | `kubectl_get` | List resources with table format (compact) | | `kubectl_describe` | Detailed resource information with events | | `kubectl_logs` | Container logs (supports `--previous` for crashes) | | `kubectl_events` | Cluster events for understanding state changes | | `kubectl_patch_dryrun` | Validate patches before actual execution | ### Controller (dot-ai-controller) The Kubernetes controller provides event-driven remediation: | Component | File | Description | |-----------|------|-------------| | `RemediationPolicy` CRD | `config/crd/bases/` | Custom resource for remediation rules | | Policy Controller | `internal/controller/remediationpolicy_controller.go` | Event matching and MCP dispatch | | Rate Limiter | 
`internal/controller/remediationpolicy_ratelimit.go` | Per-object cooldowns and rate limits | | MCP Client | `internal/controller/remediationpolicy_mcp.go` | HTTP client for remediate tool | | Cooldown State | ConfigMaps | Persistent cooldown state across restarts | ### Web UI (dot-ai-ui) Provides visualization for remediation results: | Component | File | Description | |-----------|------|-------------| | Visualization Page | `src/pages/Visualization.tsx` | Main page for `/v/{sessionId}` | | MermaidRenderer | `src/components/renderers/MermaidRenderer.tsx` | Interactive flowcharts (collapsible) | | CardRenderer | `src/components/renderers/CardRenderer.tsx` | Issue/solution cards | | CodeRenderer | `src/components/renderers/CodeRenderer.tsx` | Commands and logs with syntax highlighting | | InsightsPanel | `src/components/InsightsPanel.tsx` | AI observations display | | API Client | `src/api/client.ts` | Data fetching from MCP server | ## Integration Points ```mermaid flowchart LR subgraph MCP["MCP Server"] Remediate["remediate tool"] AIProvider["AI Provider"] KubectlTools["Kubectl Tools"] SessionMgr["Session Manager"] end subgraph AI["AI Providers"] Anthropic["Claude API"] OpenAI["OpenAI API"] Google["Gemini API"] Others["xAI, Bedrock,
OpenRouter, etc."] end subgraph K8s["Kubernetes"] API["API Server"] Controller["RemediationPolicy
Controller"] Events["Kubernetes Events"] end subgraph Storage["Session Storage"] Files["File System
tmp/sessions/*.json"] ConfigMaps["ConfigMaps
(cooldown state)"] end subgraph UI["Web UI"] Viz["Visualization
Dashboard"] end subgraph Notif["Notifications"] Slack["Slack"] GChat["Google Chat"] end AIProvider <-->|Tool Loop| KubectlTools AIProvider --> Anthropic AIProvider --> OpenAI AIProvider --> Google AIProvider --> Others KubectlTools -->|Investigation| API Remediate -->|Execute Commands| API Controller -->|Watch| Events Controller -->|HTTP POST| Remediate Controller -.->|Webhook| Slack Controller -.->|Webhook| GChat SessionMgr --> Files Controller --> ConfigMaps Remediate -.->|Session URL| Viz ``` ### MCP Server ↔ AI Provider - **Tool Loop**: AI iteratively calls kubectl tools (max 30 iterations) - **Investigation**: Gathers cluster data to understand the issue - **Analysis**: Parses JSON response with root cause, confidence, and remediation steps - **Validation**: Optional recursive investigation after command execution ### MCP Server ↔ Kubernetes API - **Read Operations**: `kubectl get`, `describe`, `logs`, `events` - **Validation**: `kubectl patch --dry-run=server` - **Execution**: `child_process.exec()` for remediation commands ### Controller ↔ MCP Server - **Event-Driven**: Controller watches Kubernetes events - **Policy Matching**: Events matched against RemediationPolicy selectors - **HTTP Dispatch**: POST to MCP `/api/v1/tools/remediate` - **Rate Limiting**: Per-object cooldowns prevent remediation storms ### MCP Server ↔ Web UI - **Session Storage**: Remediation data stored with session IDs - **Visualization API**: `/api/v1/visualize/{sessionId}` endpoint - **URL Generation**: `WEB_UI_BASE_URL/v/{sessionId}` ### Controller ↔ Notifications - **Slack Webhooks**: Controller sends remediation events to Slack channels - **Google Chat Webhooks**: Controller sends remediation events to Google Chat spaces - **Secret References**: Webhook URLs stored securely in Kubernetes Secrets - **Event Types**: Notifications sent on remediation start, success, and failure ## Session Management Sessions persist workflow state across tool calls: ``` Session ID Format: 
rem-{timestamp}-{uuid8} Example: rem-1767465086590-11029192 Session Data: ├── toolName: 'remediate' ├── issue: "Pod my-app is crashing with OOMKilled" ├── mode: 'manual' | 'automatic' ├── interaction_id: (for evaluation dataset) ├── status: 'investigating' | 'analysis_complete' | 'executed_*' | ... ├── finalAnalysis: │ ├── rootCause: "Container memory limit too low" │ ├── confidence: 0.92 │ ├── factors: ["High memory usage", "No HPA"] │ ├── remediation: │ │ ├── summary: "Increase memory limit" │ │ ├── actions: [{description, command, risk, rationale}] │ │ └── risk: 'low' | 'medium' | 'high' │ └── validationIntent: "Verify pod is running" ├── executionResults: [{command, success, output, error}] └── timestamp: ISO date ``` ### Session States | State | Description | |-------|-------------| | `investigating` | AI is gathering data via kubectl tools | | `analysis_complete` | Analysis done, awaiting user approval | | `failed` | Investigation failed (error) | | `executed_successfully` | All commands succeeded | | `executed_with_errors` | Some commands failed | | `cancelled` | User cancelled the remediation | ## RemediationPolicy CRD The controller uses a CRD for event-driven remediation: ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: RemediationPolicy metadata: name: oom-killer-policy spec: eventSelectors: - type: Warning reason: OOMKilled involvedObjectKind: Pod namespace: production message: ".*memory.*" # Regex support mode: automatic # Override per selector confidenceThreshold: 0.9 maxRiskLevel: low mcpEndpoint: https://mcp.example.com/api/v1/tools mcpAuthSecretRef: name: mcp-auth key: token mcpTool: remediate mode: manual # Default mode confidenceThreshold: 0.8 maxRiskLevel: low rateLimiting: eventsPerMinute: 10 cooldownMinutes: 5 notifications: slack: webhookSecretRef: name: slack-webhook key: url channel: "#alerts" googleChat: webhookSecretRef: name: gchat-webhook key: url status: totalEventsProcessed: 150 successfulRemediations: 142 
failedRemediations: 8 rateLimitedEvents: 25 lastProcessedEvent: "2025-01-07T10:30:00Z" ``` ## Output Formats The remediate tool returns structured output: | Field | Description | |-------|-------------| | `status` | `success`, `failed`, or `awaiting_user_approval` | | `sessionId` | Session ID for continuation or visualization | | `investigation.iterations` | Number of AI tool loop iterations | | `investigation.dataGathered` | List of kubectl tools called | | `analysis.rootCause` | Identified root cause | | `analysis.confidence` | Confidence score (0-1) | | `analysis.factors` | Contributing factors | | `remediation.summary` | Human-readable summary | | `remediation.actions` | Commands with risk levels | | `remediation.risk` | Overall risk level | | `validationIntent` | Post-execution validation instructions | | `executionChoices` | Available execution options | | `results` | Execution results (if executed) | ## Error Handling The remediation workflow includes robust error handling: 1. **Session Not Found**: Clear guidance to start new investigation 2. **AI Service Errors**: Logged with request IDs for debugging 3. **JSON Parsing Failures**: Original AI response logged for analysis 4. **Command Execution Failures**: Individual command results tracked 5. **Validation Failures**: Recursive investigation continues despite errors 6. 
**Investigation Timeouts**: Max 30 iterations prevents infinite loops ## Configuration ### Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `AI_PROVIDER` | AI provider selection | `anthropic` | | `ANTHROPIC_API_KEY` | Anthropic API key | Required if using | | `OPENAI_API_KEY` | OpenAI API key | Required if using | | `KUBECONFIG` | Kubernetes config path | Auto-detected | | `DOT_AI_SESSION_DIR` | Session storage directory | `./tmp/sessions` | | `WEB_UI_BASE_URL` | Web UI base URL | Optional | | `DEBUG_DOT_AI` | Enable debug logging | `false` | ### Supported AI Providers | Provider | Models | Notes | |----------|--------|-------| | Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default, 1M token context | | OpenAI | GPT-5.1-codex | | | Google | Gemini 3 Pro, Flash | | | xAI | Grok-4 | | | Amazon Bedrock | Various | Uses AWS credential chain | | OpenRouter | Multi-model | Proxy to multiple providers | | Custom | Ollama, vLLM, LocalAI | Via `baseURL` config | ## See Also - [MCP Remediate Guide](//mcp/remediate/) - [Controller Documentation](//controller/) - [Web UI Documentation](//ui/) - [RemediationPolicy Reference](//controller/remediationpolicy/) ## AI Engine # DevOps AI Toolkit **AI-powered platform engineering and DevOps automation through intelligent Kubernetes operations and conversational workflows.** # REST API Gateway **Complete guide for accessing DevOps AI Toolkit functionality via standard HTTP REST endpoints.** ## Overview The REST API Gateway provides HTTP access to all DevOps AI Toolkit capabilities without requiring MCP protocol implementation. 
Perfect for: - **DevOps Engineers**: Integrate tools into automation scripts and CI/CD pipelines - **Platform Teams**: Add AI-powered Kubernetes assistance to existing HTTP-based infrastructure - **QA/Testing Teams**: Create comprehensive integration tests with simple HTTP calls - **Kubernetes Controller Developers**: Access AI tools via familiar REST patterns ### What it provides: - **Universal Tool Access**: All 9 tools available via `POST /api/v1/tools/{toolName}` endpoints - **Auto-Generated Documentation**: OpenAPI 3.0 specification with interactive documentation - **Zero Maintenance**: Automatically stays current when new tools are added - **Standard HTTP**: JSON request/response with proper status codes - **Tool Discovery**: Searchable catalog of available tools and capabilities ## Setup The REST API is automatically available when you deploy the DevOps AI Toolkit. **No special configuration is needed** - the same server provides MCP protocol, CLI, and REST API access simultaneously. **🎯 Follow the [Deployment Guide](../setup/deployment.md) to deploy the server.** The Kubernetes deployment automatically provides: - ✅ **REST API endpoints** on the configured ingress URL - ✅ **MCP protocol access** for AI development tools - ✅ **All dependencies** including Qdrant vector database - ✅ **Both protocols simultaneously** with no conflicts ## Quick Start ### 1. 
Discover Available Tools ```bash curl -s http://localhost:3456/api/v1/tools | jq '.data.tools[] | {name, description, category}' ``` **Response**: All 9 available tools with descriptions (three shown here): ```json [ { "name": "version", "description": "Get comprehensive system status including version information, Vector DB connection status, embedding service capabilities, Anthropic API connectivity, Kubernetes cluster connectivity, Kyverno policy engine status, and pattern management health check", "category": "System" }, { "name": "recommend", "description": "Deploy, create, setup, install, or run applications, infrastructure, and services on Kubernetes with AI recommendations", "category": "AI Tools" }, { "name": "remediate", "description": "AI-powered Kubernetes issue analysis that provides root cause identification and actionable remediation steps", "category": "Troubleshooting" } ] ``` ### 2. Get System Status ```bash curl -s -X POST http://localhost:3456/api/v1/tools/version \ -H "Content-Type: application/json" \ -d '{}' ``` **Response**: Comprehensive system status (truncated): ```json { "success": true, "data": { "result": { "status": "success", "system": { "version": { "version": "0.90.0", "nodeVersion": "v23.11.0", "platform": "darwin" }, "vectorDB": { "connected": true, "collections": { "patterns": {"exists": true, "documentsCount": 3}, "policies": {"exists": true, "documentsCount": 3}, "capabilities": {"exists": true, "documentsCount": 157} } }, "kubernetes": { "connected": true, "context": "kind-kind" } } } } } ``` ### 3.
Access Interactive API Documentation ```bash # Get OpenAPI specification curl -s http://localhost:3456/api/v1/openapi | jq '.info' ``` **Response**: OpenAPI 3.0 specification with all tool schemas: ```json { "title": "DevOps AI Toolkit REST API", "description": "REST API gateway for DevOps AI Toolkit tools", "version": "1.0.0", "contact": { "name": "Viktor Farcic", "url": "https://devopstoolkit.live/" } } ``` ## API Reference ### Base URL and Versioning **Base URL**: `http://localhost:3456/api/v1` All endpoints use `/api/v1/` prefix for versioning. Future API changes will use new version paths (`/api/v2/`) maintaining backward compatibility. ### Core Endpoints #### Tool Discovery ```http GET /api/v1/tools ``` **Query Parameters**: - `category` - Filter by tool category (e.g., "AI Tools", "System", "Troubleshooting") - `tag` - Filter by a single tag (e.g., "kubernetes", "deployment", "analysis") - `search` - Search tool names and descriptions **Examples**: ```bash # Get all AI-powered tools curl "http://localhost:3456/api/v1/tools?category=AI%20Tools" # Find deployment-related tools curl "http://localhost:3456/api/v1/tools?tag=deployment" # Search for troubleshooting tools curl "http://localhost:3456/api/v1/tools?search=troubleshoot" ``` #### Tool Execution ```http POST /api/v1/tools/{toolName} Content-Type: application/json ``` **Request Body**: Tool-specific parameters (see OpenAPI spec for schemas) **Response**: Standardized JSON response with tool results #### OpenAPI Documentation ```http GET /api/v1/openapi ``` Returns complete OpenAPI 3.0 specification with: - All tool endpoints and schemas - Parameter descriptions and validation rules - Response formats and error codes - Interactive documentation support ### Response Format All REST API responses follow this standard format: ```json { "success": boolean, "data": { "result": any, // Tool execution result "tool": string, // Tool name "executionTime": number // Execution time in milliseconds }, "error": { // 
Only present if success: false "code": string, "message": string, "details": any }, "meta": { "timestamp": string, // ISO 8601 timestamp "requestId": string, // Unique request identifier "version": string // API version } } ``` ### HTTP Status Codes - **200 OK**: Successful tool execution - **400 Bad Request**: Invalid request parameters or missing required fields - **404 Not Found**: Tool does not exist - **405 Method Not Allowed**: Invalid HTTP method (tools only accept POST) - **500 Internal Server Error**: Tool execution failure or server error ## Tool Discovery and Filtering Instead of listing all tools statically, use the API to discover available tools dynamically: ### Discover All Tools ```bash curl http://your-ingress-url/api/v1/tools | jq '.data.tools[] | {name, description, category}' ``` ### Filter by Category ```bash # Get all AI-powered tools curl "http://your-ingress-url/api/v1/tools?category=AI%20Tools" # Available categories: AI Tools, Deployment, Management, System, Troubleshooting, Documentation ``` ### Filter by Tags ```bash # Find deployment-related tools curl "http://your-ingress-url/api/v1/tools?tag=deployment" # Find troubleshooting tools curl "http://your-ingress-url/api/v1/tools?tag=troubleshooting" ``` ### Search Tools ```bash # Search for specific functionality curl "http://your-ingress-url/api/v1/tools?search=kubernetes" curl "http://your-ingress-url/api/v1/tools?search=database" ``` ### Get Complete Tool Documentation For detailed parameter schemas and usage instructions: ```bash # Get OpenAPI specification with all tool schemas curl http://your-ingress-url/api/v1/openapi | jq '.paths' ``` ## Workflows and Use Cases The REST API provides the same workflows as MCP and CLI. The only difference is using HTTP POST requests.
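As an illustration, the envelope and status codes above can be handled with a small client-side helper. This is a sketch with hypothetical helper names; only the field names (`success`, `data.result`, `data.executionTime`, `error.code`, `error.message`) come from the documented schema:

```python
def unwrap_response(status_code: int, body: dict):
    """Unwrap the standardized REST API envelope, raising on errors.

    Illustrative helper, not part of the toolkit: field names follow
    the response format documented above.
    """
    if status_code == 404:
        raise LookupError("tool does not exist")
    if status_code == 405:
        raise ValueError("tools only accept POST")
    if not body.get("success"):
        err = body.get("error") or {}
        raise RuntimeError(f"{err.get('code')}: {err.get('message')}")
    data = body["data"]
    return data["result"], data.get("executionTime")

# Example with a successful envelope (sample data, not a live response):
ok = {
    "success": True,
    "data": {"result": {"status": "success"}, "tool": "version", "executionTime": 12},
    "meta": {"timestamp": "2025-01-15T00:00:00Z", "requestId": "abc", "version": "v1"},
}
result, ms = unwrap_response(200, ok)
```

Because every tool shares this envelope, one helper like this can wrap calls to any `/api/v1/tools/{toolName}` endpoint.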
**For complete workflow patterns and use cases, see the tool-specific guides:** - **Deployment Workflows**: [Kubernetes Deployment Recommendations](../tools/recommend.md) - **Troubleshooting Workflows**: [AI-Powered Issue Remediation](../tools/remediate.md) - **Capability Management**: [Capability Management Guide](../tools/capability-management.md) - **Pattern Management**: [Pattern Management Guide](../organizational-data/patterns.md) **Example: REST API call** ```bash # REST API: curl -X POST http://your-ingress-url/api/v1/tools/recommend \ -H "Content-Type: application/json" \ -d '{"intent": "deploy PostgreSQL database"}' ``` The business logic, parameters, responses, and multi-step workflows are identical across all access methods. # Observability Guide **Complete guide for distributed tracing and observability in the DevOps AI Toolkit.** ## Overview **What it does**: Provides OpenTelemetry-based distributed tracing for debugging complex workflows, measuring AI provider performance, and understanding Kubernetes operation latency. **Use when**: You need to understand where time is spent in multi-step workflows, debug performance issues, or monitor AI/Kubernetes operations in production. **📖 Full Guide**: This document covers tracing setup, configuration, backend integration, and trace interpretation specific to the DevOps AI Toolkit. ### What is Distributed Tracing? Learn about distributed tracing concepts and OpenTelemetry: - [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - [OpenTelemetry Tracing Concepts](https://opentelemetry.io/docs/concepts/observability-primer/#distributed-tracing) This guide focuses on DevOps AI Toolkit-specific tracing implementation, configuration, and usage patterns. 
## Prerequisites - DevOps AI Toolkit deployed (see [Deployment Guide](../setup/deployment.md)) - Basic understanding of distributed tracing concepts (optional but helpful) - A backend for viewing traces (Jaeger, Grafana Tempo, or a vendor service), or console output ## Quick Start ### Environment Variables Add tracing environment variables to your deployment configuration (see [Deployment Guide](../setup/deployment.md) for how to configure environment variables). | Variable | Required | Default | Description | |----------|----------|---------|-------------| | `OTEL_TRACING_ENABLED` | No | `false` | Set to `true` to enable tracing | | `OTEL_SERVICE_NAME` | No | `dot-ai-mcp` | Service name in traces | | `OTEL_EXPORTER_TYPE` | No | `console` | Exporter type: `console`, `otlp`, `jaeger`, `zipkin` | | `OTEL_EXPORTER_OTLP_ENDPOINT` | Required for OTLP | - | OTLP endpoint URL (e.g., `http://localhost:4318/v1/traces`) | | `OTEL_SAMPLING_PROBABILITY` | No | `1.0` | Sampling rate: `0.0` to `1.0` (1.0 = 100%, 0.1 = 10%) | | `OTEL_DEBUG` | No | `false` | Enable debug logging for tracing | ### Verify Tracing Status After configuring, verify tracing status: ```md User: Show me the system status Agent: The system is healthy and all components are operational: ... Tracing: Enabled - Exporter: console - Service Name: dot-ai-mcp - Status: initialized ``` The agent will report tracing configuration as part of the system status. ## What Gets Traced The DevOps AI Toolkit automatically traces all operations without requiring code changes: ### Tool Execution - All tools (recommendations, remediation, capability management, etc.)
- Tool parameters and execution duration - Success/failure status - Session IDs for workflow correlation ### AI Provider Operations - **Chat completions**: Claude, OpenAI, Google, xAI, and custom endpoints - **Tool loop iterations**: Multi-step AI workflows with per-iteration visibility - **Embeddings generation**: Vector embeddings for semantic search - **Token usage**: Input tokens, output tokens, cache metrics - **Model information**: Provider names and specific model versions ### Kubernetes Operations - **API client calls**: All Kubernetes API operations through the client library - **kubectl commands**: CLI command execution with operation details - **Resource information**: Resource types, namespaces, and operation latency ### Vector Database Operations - **Search queries**: Semantic and keyword searches with result counts - **Document operations**: Upserts, deletions, and retrievals - **Collection management**: Collection operations and health checks - **Performance metrics**: Query latency and result quality scores ## Backend Integration ### Jaeger Jaeger is an open-source distributed tracing platform. Run Jaeger locally with Docker: ```bash docker run -d --name jaeger \ -p 16686:16686 \ -p 4318:4318 \ jaegertracing/all-in-one:latest ``` Configure the server to send traces to Jaeger: ```bash OTEL_TRACING_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/traces ``` Access the Jaeger UI at http://localhost:16686 to view traces. ### Other Backends Any tracing backend that supports OpenTelemetry OTLP protocol should work with the same configuration pattern: ``` OTEL_TRACING_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT= ``` Refer to your backend's documentation for the specific OTLP endpoint URL. ## Viewing Traces ### Jaeger UI Open Jaeger UI at http://localhost:16686 (if using local Jaeger setup). **Finding Traces:** 1. Select `dot-ai-mcp` from the Service dropdown 2. Click "Find Traces" button 3. 
View list of recent traces with duration and span count **Trace Details:** - Click on a trace to see the complete request flow - Spans are displayed in a waterfall timeline showing parent-child relationships - Each span shows operation name, duration, and timing relative to the trace start - Click on individual spans to see detailed attributes ### Understanding Trace Information **Tool Execution Span:** - Operation name: `execute_tool ` - Shows total time for tool execution - Contains session ID and tool parameters **AI Provider Spans:** - Operation names: `chat `, `tool_loop `, `embeddings ` - Token usage: `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens` - Cache metrics: `gen_ai.usage.cache_read_tokens`, `gen_ai.usage.cache_creation_tokens` - Model details: `gen_ai.request.model`, `gen_ai.provider.name` **Kubernetes Operation Spans:** - Operation names: API method names or `kubectl ` - Attributes: `k8s.api`, `k8s.method`, `k8s.operation`, `k8s.resource` - Shows latency for Kubernetes API calls **Vector Database Spans:** - Operation names: `search`, `upsert`, `delete`, `list`, etc. - Attributes: `db.operation.name`, `db.collection.name` - Result metrics: `db.query.result_count`, `db.vector.top_score` ### Trace Hierarchy All spans from a single tool invocation share the same trace ID and follow this hierarchy: ``` execute_tool (root span) ├── chat (AI operation) │ └── POST https://api.anthropic.com (HTTP call) ├── search (vector DB query) │ └── POST http://localhost:6333 (HTTP call) └── k8s.listNamespacedDeployment (Kubernetes API) └── GET https://kubernetes/apis/apps (HTTP call) ``` This hierarchy helps identify which operations are taking the most time and where bottlenecks occur. 
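As a recap of the configuration options in this guide, the defaults from the environment-variable table can be expressed as a small stdlib-only sketch. The variable names match the table; the helper itself is hypothetical, not the toolkit's actual startup code:

```python
import os

# Defaults mirror the tracing configuration table in this guide.
_DEFAULTS = {
    "OTEL_TRACING_ENABLED": "false",
    "OTEL_SERVICE_NAME": "dot-ai-mcp",
    "OTEL_EXPORTER_TYPE": "console",
    "OTEL_SAMPLING_PROBABILITY": "1.0",
    "OTEL_DEBUG": "false",
}

def tracing_config(env=os.environ) -> dict:
    """Resolve tracing settings, applying the documented defaults."""
    cfg = {k: env.get(k, v) for k, v in _DEFAULTS.items()}
    cfg["enabled"] = cfg["OTEL_TRACING_ENABLED"].lower() == "true"
    # The OTLP endpoint has no default; it is required only for the otlp exporter.
    cfg["OTEL_EXPORTER_OTLP_ENDPOINT"] = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if cfg["OTEL_EXPORTER_TYPE"] == "otlp" and not cfg["OTEL_EXPORTER_OTLP_ENDPOINT"]:
        raise ValueError("OTEL_EXPORTER_OTLP_ENDPOINT is required for the otlp exporter")
    return cfg

cfg = tracing_config({"OTEL_TRACING_ENABLED": "true"})
```

A sketch like this makes the dependency explicit: everything has a safe default except the OTLP endpoint, which must be supplied whenever the `otlp` exporter is selected.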
# Telemetry Guide **Anonymous usage analytics to help improve DevOps AI Toolkit.** # Organizational Data Management Concepts **Understanding the relationship between Capabilities, Patterns, and Policies in the DevOps AI Toolkit.** ## Overview The DevOps AI Toolkit manages three types of organizational knowledge that work together to provide intelligent, compliant, and standardized Kubernetes deployments: - **Capabilities** - What resources can do (semantic understanding) - **Patterns** - What resources to deploy (organizational preferences) - **Policies** - How resources should be configured (governance requirements) ## The Three Pillars of Organizational Knowledge ### Capabilities: Resource Understanding **Purpose**: Discover and understand what Kubernetes resources actually do - **Function**: Semantic understanding of cluster resources and their capabilities - **Required for**: All intelligent recommendations and resource discovery - **Example**: Teaches AI that `sqls.devopstoolkit.live` provides PostgreSQL database capabilities - **When to use**: First step - scan your cluster to teach AI about available resources - **Goal**: Make AI smarter about your cluster's existing resources and operators **Key Characteristics**: - **Automatic discovery** through cluster scanning and AI analysis - **Resource-specific** understanding of what each CRD and operator provides - **Foundation layer** that enables all other intelligent features - **Continuously updated** as new resources are deployed to cluster ### Patterns: Deployment Guidance **Purpose**: Define organizational preferences for resource combinations - **Function**: Organizational best practices for what resources work well together - **Required for**: Enhanced recommendations that follow team standards - **Example**: Defines that web applications should include Deployment + Service + Ingress + HPA - **When to use**: After capabilities - create patterns for your common deployment scenarios - **Goal**: Make AI 
follow your team's deployment standards and architecture decisions **Key Characteristics**: - **Platform team authored** based on organizational experience and standards - **Resource combination focused** on what to deploy together - **Suggestion-based** enhancement of AI recommendations (not enforcement) - **Use case specific** patterns for different types of applications and workloads ### Policies: Configuration Governance **Purpose**: Ensure resources are configured according to governance requirements - **Function**: Proactive compliance that guides users toward correct configurations - **Required for**: Governance compliance and security enforcement - **Example**: Ensures all containers have resource limits, images from trusted registries - **When to use**: Throughout deployment - policies guide configuration decisions - **Goal**: Make AI recommend compliant configurations from the start, preventing violations **Key Characteristics**: - **Security/platform team authored** based on compliance and governance needs - **Configuration focused** on how resources should be set up - **Proactive guidance** that prevents violations rather than blocking after creation - **Optionally enforceable** through generated Kyverno policies for cluster-level blocking ## How They Work Together ### The AI Recommendation Pipeline ``` User Intent → Capability Discovery → Pattern Enhancement → Policy Compliance → Final Configuration ``` 1. **User Intent**: "Deploy a web application with a database" 2. **Capability Discovery**: - AI searches cluster capabilities - Finds: `apps/v1/Deployment`, `sqls.devopstoolkit.live/SQL`, `networking.k8s.io/Ingress` - Understanding: Deployment for apps, SQL CRD for databases, Ingress for traffic 3. **Pattern Enhancement**: - AI searches organizational patterns - Finds: "Web Application Pattern" (Deployment + Service + Ingress + HPA) - Enhancement: Adds HPA and Service to the recommendation 4. 
**Policy Compliance**: - AI searches policy intents - Finds: "Resource Limits Policy", "Image Registry Policy" - Integration: Questions include required resource limits and trusted image defaults 5. **Final Configuration**: - User gets questions with policy-driven requirements and pattern-enhanced suggestions - Generated manifests are compliant and follow organizational standards from the start ### Practical Example **Scenario**: Developer requests "Deploy a Node.js API" **Without organizational data**: ``` Questions: - Application name? - Container image? - Port? Basic Deployment + Service created ``` **With full organizational data**: ``` Capabilities found: Deployment, Service, Ingress, HPA available Pattern matched: "Web Application Pattern" Policies found: "Resource Limits Policy", "Image Registry Policy" Enhanced questions: - Application name? - Container image? (⚠️ must be from registry.company.com - policy requirement) - Port? - CPU limit? (⚠️ required by Resource Limits Policy) [default: 500m] - Memory limit? (⚠️ required by Resource Limits Policy) [default: 512Mi] - Enable autoscaling? (suggested by Web Application Pattern) [default: yes] Generated resources: Deployment + Service + Ingress + HPA All with policy-compliant configurations and organizational best practices ``` ## When to Use Each Type ### Capabilities (Start Here - Required) **Always required** for intelligent recommendations. **Preferred: Use the controller** for automatic, event-driven scanning. See [Capability Management Guide](../tools/capability-management.md#method-1-controller-based-scanning-recommended). 
**Alternative: Manual scanning** when controller cannot reach the server: ```text "Scan my cluster capabilities" ``` **Manual scanning use cases**: - Server running locally (not accessible from cluster) - One-time scanning without controller setup ### Patterns (Optional but Recommended) **Enhance recommendations** with organizational standards: ``` "I want to create a deployment pattern for web applications" ``` **Use when**: - Your team has established deployment standards - You want consistent resource combinations across projects - Developers frequently ask "what resources do I need for X?" - You have architectural best practices to encode ### Policies (As Needed for Governance) **Enforce compliance** requirements proactively: ``` "I want to create a policy for container resource limits" ``` **Use when**: - You have security or compliance requirements to enforce - Manual policy enforcement is error-prone or slow - You want to guide users toward compliance rather than block them - Governance teams need to ensure consistent configuration standards ## Setup and Workflow Order ### Recommended Implementation Order 1. **Start with Capabilities** (Required foundation): ``` "Scan cluster capabilities" ``` - Enables all intelligent features - Takes 5-10 minutes for initial scan - Should be done before patterns or policies 2. **Add Patterns** (Organizational enhancement): ``` "Create organizational patterns for our common use cases" ``` - Start with 3-5 most common deployment types - Gather feedback from development teams - Iterate based on usage and effectiveness 3. 
**Implement Policies** (Governance requirements): ``` "Create policy intents for our compliance requirements" ``` - Focus on your most critical governance needs first - Test policy integration with real deployment scenarios - Consider Kyverno enforcement for critical policies ### Prerequisites for Each Type **All types require**: - DevOps AI Toolkit deployed - Vector DB service (Qdrant) for semantic storage - API keys for AI models and embedding providers (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration)) **Additionally for Policies**: - Kyverno installed (optional - only needed for cluster enforcement) - kubectl access (optional - only needed for policy deployment) ## Best Practices ### Integration Strategy - **Start simple**: Begin with capabilities, add patterns for your top 3 use cases, implement 1-2 critical policies - **Iterate based on feedback**: Gather input from development teams on what's helpful vs. burdensome - **Maintain consistency**: Ensure patterns and policies complement rather than conflict with each other ### Team Collaboration - **Capabilities**: Platform team manages (automated scanning) - **Patterns**: Platform + development teams collaborate (based on real usage) - **Policies**: Security + platform teams own (based on compliance requirements) ### Quality and Maintenance - **Review quarterly**: Ensure organizational data reflects current standards and needs - **Update incrementally**: Add new patterns/policies as needs emerge rather than trying to cover everything upfront - **Measure effectiveness**: Track whether recommendations become more useful and compliant over time ## FAQ **Q: Do I need all three types?** A: Capabilities are required for intelligent recommendations. Patterns and policies are optional enhancements that add organizational consistency and compliance. **Q: Can they conflict with each other?** A: They're designed to be complementary. 
Patterns suggest what to deploy, policies ensure it's configured correctly. The AI balances both when making recommendations. **Q: What happens if I only have capabilities?** A: You get intelligent resource discovery and semantic matching, but without organizational context or governance guidance. **Q: How do I know if my organizational data is working?** A: Test with real deployment requests. The AI should mention organizational context and policy requirements in its recommendations. **Q: Can I use this without Vector DB?** A: No, all three types require Vector DB for semantic storage and retrieval. This enables intelligent matching based on user intent. ## See Also - **[Capability Management Guide](../tools/capability-management.md)** - Cluster resource discovery and understanding - **[Pattern Management Guide](patterns.md)** - Creating organizational deployment standards - **[Policy Management Guide](policies.md)** - Implementing governance and compliance requirements - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration # Pattern Management Guide **Complete guide for creating and managing organizational deployment patterns with the DevOps AI Toolkit.** ## Overview Pattern Management enables platform engineers and architects to capture organizational deployment knowledge as reusable patterns. These patterns automatically enhance AI deployment recommendations, ensuring consistency with your team's best practices and organizational standards. ### What is Pattern Management? 
Pattern Management allows you to: - **Create deployment patterns** that capture your organization's preferred resource combinations - **Store patterns semantically** using Vector DB technology for intelligent matching - **Enhance AI recommendations** with organizational context and best practices - **Share institutional knowledge** across teams through standardized deployment approaches ### Understanding Organizational Data Types Pattern Management works alongside Capability Management and Policy Management to provide comprehensive organizational intelligence for AI recommendations. **Quick Overview**: - **Capabilities**: What resources can do (required foundation) - **Patterns**: What resources to deploy together (organizational preferences) - **Policies**: How resources should be configured (governance requirements) For a complete understanding of how these three types work together, see the **[Organizational Data Concepts Guide](concepts.md)**. ### How It Works 1. **Pattern Creation** → Platform engineers define deployment patterns with triggers and resource recommendations 2. **Semantic Storage** → Patterns are stored with AI-generated embeddings for intelligent matching 3. **AI Integration** → When users request deployments, relevant patterns automatically enhance recommendations 4. **Continuous Learning** → Patterns improve recommendation quality over time through usage (planned for future versions) ## Prerequisites Before using Pattern Management, ensure you have: ### Required Setup - **DevOps AI Toolkit deployed** (see [Deployment Guide](../setup/deployment.md)) - **Vector DB service** (Qdrant) for pattern storage - **API keys** for AI models and embedding providers (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration)) can be set as environment variables or in your deployment configuration For complete setup instructions, see the [Deployment Guide](../setup/deployment.md). 
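A pattern as described above carries a description, trigger keywords, suggested resources, a rationale, and an owner. A minimal model of that record (illustrative field names, not the toolkit's actual storage schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class Pattern:
    # Illustrative model of a stored pattern; field names are
    # assumptions, not the server's real schema.
    description: str
    triggers: list
    resources: list
    rationale: str
    created_by: str

    def to_document(self) -> dict:
        """Payload shape a Vector DB upsert might store alongside an embedding."""
        return asdict(self)

p = Pattern(
    description="stateless web applications",
    triggers=["web app", "api service", "microservice"],
    resources=["Deployment", "Service", "Ingress", "HorizontalPodAutoscaler"],
    rationale="Complete stateless deployment with scaling and external access",
    created_by="platform-team",
)
```

In the real workflow these fields are collected conversationally (Steps 1-6 below) and the server generates the embedding before storage.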
## Configuration ### Vector Database Setup Pattern Management requires a Vector DB service for semantic pattern storage and retrieval. #### Qdrant Cloud (Recommended) 1. **Sign up** at [Qdrant Cloud](https://cloud.qdrant.io/) 2. **Create a cluster** (free tier works fine for pattern storage) 3. **Get connection details**: - URL from cluster dashboard - API key from cluster settings 4. **Add to `.mcp.json`** environment variables #### Local Qdrant ```bash # Run local Qdrant with Docker (detached/background) docker container run --detach --name qdrant --publish 6333:6333 --volume $(pwd)/tmp/qdrant_storage:/qdrant/storage qdrant/qdrant ``` **Configuration**: ```json "env": { "QDRANT_URL": "http://localhost:6333" } ``` **Test Connectivity**: ```bash # Verify Qdrant is responding curl http://localhost:6333/healthz ``` ### Embedding Service Setup For semantic pattern matching, configure an embedding provider: #### Embedding Provider Configuration The toolkit supports multiple embedding providers (OpenAI, Google, Amazon Bedrock) for enhanced semantic search. 📖 **[Complete Embedding Provider Setup Guide](../setup/deployment.md#embedding-provider-configuration)** - Choose your provider, set API keys, and configure dimensions. **Note**: An embedding provider is **required** for pattern management. The system cannot store or search patterns without embedding generation capabilities. ## Pattern Management Operations All pattern management is done through natural conversation with your AI assistant. The system provides a guided workflow for creating comprehensive patterns. ### Creating Organizational Patterns **Start Pattern Creation**: ``` I want to create a new organizational deployment pattern ``` The AI assistant will guide you through a step-by-step workflow: #### Step 1: Pattern Description The assistant will ask for a clear description of your pattern: **Example**: ``` Assistant: What deployment capability does this pattern provide? I need a capability name (2-4 words). 
You: stateless web applications ``` #### Step 2: Trigger Keywords Define initial keywords that should activate this pattern: **Example**: ``` Assistant: What keywords or phrases should trigger this pattern? Please provide keywords separated by commas. You: stateless application, web app, api service, microservice, rest api, frontend application ``` #### Step 3: Trigger Expansion The system suggests additional related terms: **Example**: ``` Assistant: I found these additional terms that might also trigger your pattern: - application - service - workload Which should also trigger this pattern? You can select specific ones, include all, or skip. You: include: application, service, workload ``` #### Step 4: Suggested Resources Specify Kubernetes resources this pattern should recommend: **Example**: ``` Assistant: Which Kubernetes resources should be suggested for stateless web applications? Please list the resource types separated by commas. You: Deployment, Service, Ingress, HorizontalPodAutoscaler ``` #### Step 5: Rationale Explain why these resources work well together: **Example**: ``` Assistant: Why does this combination of resources work well together for stateless web applications? You: This combination provides a complete stateless application deployment: Deployment manages replicated pods without persistent state, Service provides stable network access, Ingress handles external routing and SSL termination, and HorizontalPodAutoscaler automatically scales based on demand. Together they create resilient, scalable applications that can handle traffic fluctuations and failures gracefully. ``` #### Step 6: Pattern Ownership Specify who created the pattern: **Example**: ``` Assistant: What is your name or team identifier? This helps track pattern ownership. 
You: platform-team ``` #### Step 7: Review and Confirmation Review the complete pattern before creation: **Example**: ``` Assistant: Please review your pattern: **Description**: stateless web applications **Triggers**: stateless application, web app, api service, microservice, rest api, frontend application, application, service, workload **Suggested Resources**: Deployment, Service, Ingress, HorizontalPodAutoscaler **Rationale**: This combination provides a complete stateless application deployment... **Created By**: platform-team Does this look correct? Type 'confirm' to create the pattern, or 'modify' to make changes. You: confirm ``` The system will then create and store your pattern with semantic embeddings for intelligent matching. **Note**: The conversation examples above are simplified for readability. The actual workflow uses structured prompts, specific response formats, and includes additional validation steps, but the core information flow remains the same. ### Listing Patterns **View All Patterns**: ``` Show me all organizational patterns ``` **Example Response**: ``` Found 3 organizational deployment patterns: 1. **Stateless Applications** (Created: 2025-01-15) - Triggers: 8 keywords including "web app", "api service", "microservice" - Resources: Deployment, Service, Ingress, HorizontalPodAutoscaler - Created by: platform-team 2. **Database Workloads** (Created: 2025-01-14) - Triggers: 6 keywords including "database", "persistent storage", "stateful" - Resources: StatefulSet, PersistentVolumeClaim, Secret - Created by: data-team 3. 
**Network Security** (Created: 2025-01-13) - Triggers: 5 keywords including "network policy", "security", "isolation" - Resources: NetworkPolicy - Created by: security-team Search mode: semantic+keyword hybrid search enabled ``` ### Viewing Pattern Details **Get Specific Pattern**: ``` Show me details for the stateless applications pattern ``` **Example Response**: ``` Pattern: Stateless Applications **Description**: Stateless web applications **Created**: 2025-01-15 by platform-team **Triggers** (8): - stateless application - web app - api service - microservice - rest api - frontend application - http service - web service **Suggested Resources** (4): - Deployment - Service - Ingress - HorizontalPodAutoscaler **Rationale**: This combination provides a complete stateless application deployment: Deployment manages replicated pods without persistent state, Service provides stable network access, Ingress handles external routing and SSL termination, and HorizontalPodAutoscaler automatically scales based on demand. Together they create resilient, scalable applications that can handle traffic fluctuations and failures gracefully. ``` ### Deleting Patterns **Remove Pattern**: ``` Delete the network security pattern ``` The assistant will confirm the deletion and remove the pattern from storage. ## AI Integration Patterns automatically enhance deployment recommendations when users request deployments. For complete examples of how patterns influence recommendations, see the [Recommendation Guide](../tools/recommend.md). The recommendation system automatically: - Searches for relevant patterns using semantic matching - Includes pattern context in AI prompts - Balances organizational consistency with specific user needs **Important**: Patterns serve as suggestions to enhance AI decision-making, not rigid requirements. 
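To illustrate how trigger keywords gate pattern matching, here is a simplified keyword-only stand-in for the hybrid semantic+keyword search the toolkit actually performs (embedding similarity is omitted; all names are illustrative):

```python
def keyword_matches(request: str, patterns: list) -> list:
    """Return names of patterns whose trigger keywords appear in the request.

    Simplified sketch: the real system also uses semantic embedding
    search, so matches are not limited to literal keyword hits.
    """
    text = request.lower()
    return [p["name"] for p in patterns
            if any(trigger in text for trigger in p["triggers"])]

patterns = [
    {"name": "Stateless Applications", "triggers": ["web app", "api service", "microservice"]},
    {"name": "Database Workloads", "triggers": ["database", "persistent storage"]},
]
hits = keyword_matches("Deploy a web app with a database", patterns)
```

A single request can activate multiple patterns, which is why the guide recommends keeping patterns focused and composable rather than broad.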
## Pattern Examples ### Example 1: Stateless Application Pattern **Use Case**: Standard web applications, APIs, and microservices ```yaml Description: Stateless web applications Triggers: - stateless application - web app - api service - microservice - rest api - frontend application Resources: - Deployment - Service - Ingress - HorizontalPodAutoscaler Rationale: Provides complete stateless deployment with scaling, networking, and external access ``` **When It Activates**: User requests for "web app", "API deployment", "microservice setup", etc. ### Example 2: Database Workload Pattern **Use Case**: Persistent databases and stateful services ```yaml Description: Database and persistent storage workloads Triggers: - database - persistent storage - stateful service - data store - mysql - postgresql Resources: - StatefulSet - PersistentVolumeClaim - Secret - Service Rationale: Ensures data persistence, ordered deployment, and secure credential management ``` **When It Activates**: User requests for "database deployment", "persistent storage", "MySQL setup", etc. ### Example 3: Network Security Pattern **Use Case**: Security-focused deployments with network isolation ```yaml Description: Network security and isolation policies Triggers: - network security - network isolation - security policy - network policy - microsegmentation Resources: - NetworkPolicy - ServiceAccount - Role - RoleBinding Rationale: Implements defense-in-depth with network segmentation and RBAC controls ``` **Note**: This pattern uses Pod Security Standards (successor to deprecated PodSecurityPolicy) for pod-level security controls, which are configured at the namespace level rather than as individual resources. **When It Activates**: User mentions "security", "network isolation", "compliance requirements", etc. 
### Example 4: Monitoring Pattern **Use Case**: Applications requiring observability and monitoring ```yaml Description: Monitoring and observability setup Triggers: - monitoring - observability - metrics - logging - tracing - prometheus Resources: - ServiceMonitor - PodMonitor - PrometheusRule - ConfigMap Rationale: Provides comprehensive observability with metrics collection, alerting, and dashboard configuration ``` **Note**: Resources like ServiceMonitor, PodMonitor, and PrometheusRule are Custom Resource Definitions (CRDs) provided by the Prometheus Operator and require it to be installed in your cluster. ConfigMap stores dashboard configurations for Grafana. **When It Activates**: User requests including "monitoring", "observability", "metrics collection", etc. ## Best Practices ### Pattern Creation Guidelines #### 1. Focus and Composability - **Create focused patterns** for specific use cases rather than trying to cover everything - **Make patterns composable** so multiple patterns can enhance a single recommendation - **Avoid overlapping triggers** that might cause confusion between patterns **Good Example**: ```yaml # Focused pattern Description: "Load balancer setup" Triggers: ["load balancer", "external access", "ingress"] Resources: ["Service", "Ingress"] ``` **Avoid**: ```yaml # Too broad Description: "Complete application setup" Triggers: ["application", "app", "deploy", "service", "database", "monitoring"] Resources: ["Deployment", "Service", "Ingress", "StatefulSet", "PVC", "Secret", "ServiceMonitor"] ``` #### 2. 
Clear Trigger Keywords - **Use specific triggers** that clearly indicate when the pattern applies - **Include common variations** and synonyms users might employ - **Add technical terms** your team commonly uses **Effective Triggers**: ```yaml Triggers: - "stateless application" # Specific architecture term - "web app" # Common colloquial term - "api service" # Technical specification - "microservice" # Architecture pattern - "rest api" # Implementation detail ``` #### 3. Meaningful Rationales - **Explain the why** behind resource combinations - **Describe interactions** between suggested resources - **Include failure scenarios** the pattern addresses **Strong Rationale Example**: ``` "This combination provides complete stateless deployment: Deployment manages replicated pods without persistent state, Service provides stable network access, Ingress handles external routing and SSL termination, and HorizontalPodAutoscaler automatically scales based on demand. Together they create resilient applications that can handle traffic fluctuations and pod failures gracefully." ``` ### Organizational Adoption #### 1. Start Simple - **Begin with 3-5 core patterns** covering your most common deployment types - **Validate with actual deployments** before expanding the pattern library - **Gather feedback** from development teams on pattern usefulness #### 2. Team Collaboration - **Involve multiple teams** in pattern creation (platform, security, development) - **Document pattern ownership** for future updates and maintenance - **Create patterns for team-specific needs** (data team patterns, frontend patterns, etc.) #### 3. Iterative Improvement - **Gather feedback** from teams on pattern effectiveness and usage - **Update patterns** based on changing organizational needs - **Archive outdated patterns** that no longer reflect best practices ### Pattern Quality Guidelines These are manual best practices for creating effective patterns. 
The system performs basic validation (required fields) but does not automatically warn about quality issues. #### 1. Resource Selection - **Include complementary resources** that work well together - **Focus on the core resources** needed for the pattern's use case - **Consider resource relationships** when selecting combinations #### 2. Trigger Optimization - **Test triggers** with real user language from past deployment requests - **Include both formal and informal terms** teams actually use - **Avoid overly generic triggers** that match unrelated requests #### 3. Maintenance - **Review patterns quarterly** to ensure they remain current - **Update resources** when new Kubernetes features become available - **Validate rationales** against current architectural decisions ## Troubleshooting ### Common Issues #### Pattern Creation Fails **Symptom**: Error when creating patterns through AI assistant **Possible Causes**: - Vector DB connection issues - Missing required environment variables - Qdrant collection initialization problems **Solutions**: 1. **Check system status**: ``` What's the status of the pattern management system? ``` 2. **Verify Vector DB connection**: - Confirm `QDRANT_URL` and `QDRANT_API_KEY` in `.mcp.json` - Test Qdrant accessibility from your network - Check Qdrant cluster status in dashboard 3. **Validate environment variables**: - Check that your AI model API key is configured (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration)) - Verify `QDRANT_URL` and `QDRANT_API_KEY` are properly set - Confirm all required environment variables are loaded in your configuration #### Patterns Not Found During Search **Symptom**: Relevant patterns don't appear in recommendations **Possible Causes**: - Trigger keywords don't match user language - Embedding service not configured properly - Pattern storage issues **Solutions**: 1. 
**Review pattern triggers**: - Check if triggers match the language users actually employ - Add more trigger variations and synonyms - Test trigger effectiveness with common user requests 2. **Check search capabilities**: ``` Show me search capabilities for patterns ``` 3. **Verify embedding service**: - Confirm embedding provider API key is set (see [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration)) - Test that embedding service is responding - Check if patterns have embeddings stored #### Semantic Search Not Working **Symptom**: Only exact keyword matches work, semantic similarities missed **Possible Causes**: - OpenAI API key missing or invalid - Patterns created without embeddings - Embedding service connectivity issues **Solutions**: 1. **Verify OpenAI configuration**: - Confirm valid embedding provider API key is set (see [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration)) - Test OpenAI API accessibility - Check API key permissions and usage limits 2. **Check embedding status**: ``` What's the status of the embedding service? ``` 3. **Recreate patterns if needed**: - Patterns created without embedding service may need recreation - New patterns will automatically include embeddings if service is available ### System Diagnostics #### Check Overall System Health **Command**: ``` What's the current system status? 
``` **Expected Response**: ```json { "status": "success", "system": { "version": { "version": "0.42.0", "nodeVersion": "v23.11.0", "platform": "darwin", "arch": "arm64" }, "vectorDB": { "connected": true, "url": "http://localhost:6333", "collectionName": "patterns", "patternsCount": 5 }, "embedding": { "available": true, "provider": "openai", "model": "text-embedding-3-small", "dimensions": 1536 }, "anthropic": { "connected": true, "keyConfigured": true } }, "summary": { "overall": "healthy", "patternSearch": "semantic+keyword", "capabilities": [ "semantic-search", "ai-recommendations" ] }, "timestamp": "2025-08-01T23:10:26.691Z" } ``` #### Verify Pattern Storage **Command**: ``` List all organizational patterns ``` **Check For**: - Patterns are being returned successfully - Pattern count matches expectations - Search capabilities indicate semantic or keyword mode #### Test Pattern Matching **Test Method**: 1. Create a test pattern with specific triggers 2. Make a deployment request using those triggers 3. Verify the pattern influences the recommendation 4. Check that AI mentions organizational context ## FAQ ### General Questions **Q: Do I need Qdrant's paid embedding service?** A: No! The system uses OpenAI to generate embeddings and stores them in Qdrant. The free Qdrant tier works perfectly for pattern storage and search. **Q: Can I use pattern management without an embedding service?** A: No. Pattern management requires an embedding provider. See [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration) for setup options. **Q: How many patterns should I create?** A: Start with 3-5 core patterns covering your most common deployment types. Expand based on team feedback and usage patterns. **Q: Can multiple patterns match a single request?** A: Yes! The AI can use multiple relevant patterns to create comprehensive recommendations that combine organizational best practices. 
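The FAQ above notes that embeddings are generated by OpenAI and stored in Qdrant, where matching is done by vector similarity. As a sketch of how that scoring works, here is a toy example in plain Python — 3-dimensional vectors stand in for real 1536-dimensional `text-embedding-3-small` embeddings, and the pattern names and numbers are illustrative, not actual system data:

```python
import math

def cosine_similarity(a, b):
    # Relevance score in the style Qdrant reports: 1.0 = same direction, 0.0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings (real ones come from the embedding provider, e.g. OpenAI).
patterns = {
    "Stateless Application Architecture": [0.9, 0.1, 0.2],
    "Database Deployment Pattern":        [0.1, 0.9, 0.3],
}
request_embedding = [0.85, 0.15, 0.25]  # e.g. the embedding of "deploy a web app"

# Rank stored patterns by similarity to the request, highest first.
ranked = sorted(
    patterns.items(),
    key=lambda kv: cosine_similarity(request_embedding, kv[1]),
    reverse=True,
)
print(ranked[0][0])  # → Stateless Application Architecture
```

Qdrant performs this ranking natively over its stored vectors; the sketch only illustrates why a request like "deploy a web app" can match a "Stateless Application" pattern even when they share no exact keywords.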
### Technical Questions **Q: What happens if Vector DB is unavailable?** A: Pattern operations will fail gracefully. Deployment recommendations continue working but without organizational pattern enhancement. **Q: Can I backup my patterns?** A: Currently, patterns are stored in your Qdrant instance. Back up your Qdrant data to preserve patterns. Export functionality is planned for future versions. **Q: How do I update an existing pattern?** A: Currently, delete the old pattern and create a new one. In-place editing is planned for future versions. **Q: Can I see which patterns influenced a recommendation?** A: The AI will mention when recommendations are enhanced by organizational patterns, though detailed pattern attribution is not yet available. ### Performance Questions **Q: How fast is pattern search?** A: Pattern retrieval typically takes under 100ms for semantic search with reasonable pattern volumes (under 100 patterns). **Q: Does pattern management slow down recommendations?** A: No significant impact. Pattern search runs in parallel with other recommendation analysis and adds minimal latency. **Q: How many patterns can the system handle?** A: The system is tested with 100+ patterns. Qdrant can scale to much larger volumes if needed. 
## Support ### Getting Help **For setup issues**: - Review the [Deployment Guide](../setup/deployment.md) for foundational configuration - Check environment variable configuration in `.mcp.json` - Verify Vector DB connectivity and credentials **For pattern creation problems**: - Use system diagnostics to check service health - Review best practices for trigger keyword selection - Test with simple patterns first before creating complex ones **For AI integration questions**: - Verify that patterns are being stored successfully - Test pattern matching with known trigger keywords - Check that AI mentions organizational context in recommendations ### Community **Documentation**: Complete guides available in `docs/` directory **Issues**: Report bugs and feature requests at [GitHub Issues](https://github.com/vfarcic/dot-ai/issues) ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](../tools/overview.md)** - Browse all available tools and features # Policy Management Guide **Complete guide for creating and managing organizational policy intents with the DevOps AI Toolkit.** ## Overview Policy Management enables platform engineers and security teams to create governance policies that proactively guide users toward compliant configurations. Unlike traditional policy enforcement that blocks manifests after they're created, this system integrates policies into AI recommendations, ensuring compliance from the start. ### What is Policy Management? 
Policy Management allows you to: - **Create policy intents** that capture your organization's security and governance requirements - **Store policies semantically** using Vector DB technology for intelligent matching - **Guide AI recommendations** with policy requirements as part of the configuration process - **Generate Kyverno policies** optionally from policy intents for cluster-level enforcement - **Share governance knowledge** across teams through standardized policy approaches ### Understanding Organizational Data Types Policy Management works alongside Capability Management and Pattern Management to provide comprehensive organizational intelligence for AI recommendations. **Quick Overview**: - **Capabilities**: What resources can do (required foundation) - **Patterns**: What resources to deploy together (organizational preferences) - **Policies**: How resources should be configured (governance requirements) For a complete understanding of how these three types work together, see the **[Organizational Data Concepts Guide](concepts.md)**. ### How It Works 1. **Policy Creation** → Security teams define policy intents describing governance requirements 2. **Semantic Storage** → Policies are stored with AI-generated embeddings for intelligent matching 3. **AI Integration** → When users configure resources, relevant policies enhance questions with compliance requirements 4. **Proactive Guidance** → Users get compliant configurations with policy-driven defaults and validation 5. **Optional Enforcement** → Policy intents can generate Kyverno policies for cluster-level blocking ## See Policy Management in Action [![Policy Management with DevOps AI Toolkit](https://img.youtube.com/vi/hLK9j2cn6c0/maxresdefault.jpg)](https://youtu.be/hLK9j2cn6c0) This video demonstrates how Policy Management transforms organizational governance by integrating security requirements directly into AI-powered deployment recommendations. 
Watch how platform engineers create policy intents that guide users toward compliant configurations from the start, rather than blocking non-compliant deployments after the fact. ## Prerequisites Before using Policy Management, ensure you have: ### Required Setup - **DevOps AI Toolkit deployed** (see [Deployment Guide](../setup/deployment.md)) - **Vector DB service** (Qdrant) for policy storage - **API keys** for AI models and embedding providers (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration)) can be set as environment variables or in your deployment configuration ### Optional (for Kyverno enforcement) - **[Kyverno](https://kyverno.io/)** installed in your cluster for optional policy enforcement - **kubectl access** to your cluster for policy deployment For complete setup instructions, see the [Deployment Guide](../setup/deployment.md). ## Section 1: Policy Management Basics ### Creating Policy Intents **Start Policy Creation**: ``` I want to create a new policy intent ``` The AI assistant will guide you through a step-by-step workflow: #### Step 1: Policy Description The assistant asks for the policy requirement: **Example**: ``` Assistant: Please describe the policy intent you want to create. What should this policy enforce or ensure in your Kubernetes deployments? Examples: - "All containers must have resource limits defined" - "Images must be from trusted registries only" - "Pods must not run as root user" - "All services must have network policies" What policy requirement do you want to enforce? You: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation ``` #### Step 2: Infrastructure Types Define what types of infrastructure this applies to: **Example**: ``` Assistant: What types of infrastructure should this apply to? Think about the categories of resources and systems this covers, not technical implementation details. 
Examples by domain: - **Applications**: web applications, microservices, APIs, frontends, backends - **Databases**: databases, SQL databases, NoSQL, data stores, persistent storage - **Networking**: load balancers, ingresses, services, network policies, DNS - **Storage**: persistent volumes, storage classes, backup systems - **Security**: secrets, certificates, authentication, authorization, RBAC - **Infrastructure**: clusters, nodes, compute resources, cloud resources Please provide infrastructure type keywords separated by commas. You: applications, microservices, web applications, containerized applications, workloads, pods ``` #### Step 3: Trigger Expansion The system suggests additional related terms: **Example**: ``` Assistant: I found these additional terms that might also trigger your policy: - containers - deployments - services - jobs - cronjobs - daemonsets - statefulsets - replicasets Which should also trigger this policy? You can select specific ones, include all, or skip. You: applications, microservices, web applications, containerized applications, workloads, pods, containers, deployments ``` #### Step 4: Policy Rationale Explain why this policy is important: **Example**: ``` Assistant: Why is this policy important for your organization? Please explain the rationale behind this policy requirement - what risks does it mitigate or what benefits does it provide? You: Resource limits prevent any single container from consuming excessive CPU or memory, which could starve other applications and degrade cluster performance. This ensures fair resource allocation and prevents noisy neighbor problems in multi-tenant environments. ``` #### Step 5: Policy Ownership Specify who created the policy: **Example**: ``` Assistant: Who should be credited as the creator of this policy intent? (This can be your name, team name, or organization.) 
You: platform-team ``` #### Step 6: Namespace Scope Selection (Kyverno Only) If Kyverno is installed in your cluster, you can choose which namespaces the policy should apply to: **Example**: ``` Assistant: # Policy Namespace Scope Your policy can be applied cluster-wide or limited to specific namespaces. ## Available Namespaces in Your Cluster: production, staging, development, default, kube-system, kube-public ## Choose the scope for your policy: 1. **Apply to all namespaces** (cluster-wide enforcement) - Type: `all` or `1` 2. **Apply only to specific namespaces** (inclusive list) - Type: `include: namespace1, namespace2, namespace3` - Example: `include: production, staging` 3. **Apply to all namespaces EXCEPT specific ones** (exclusion list) - Type: `exclude: namespace1, namespace2` - Example: `exclude: kube-system, kube-public` You: include: production, staging ``` **Namespace Options**: - **All namespaces**: Policy applies cluster-wide with no restrictions - **Include specific**: Policy applies only to the namespaces you specify - **Exclude specific**: Policy applies everywhere except the namespaces you specify **Note**: This step only appears when Kyverno is installed. For policy intents without Kyverno, skip to Step 7. #### Step 7: Kyverno Policy Generation The system generates a Kyverno policy and presents options: **Example**: ``` Assistant: Please review your policy intent: **Description**: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation **Triggers**: applications, microservices, web applications, containerized applications, workloads, pods, containers, deployments **Rationale**: Resource limits prevent any single container from consuming excessive CPU or memory, which could starve other applications and degrade cluster performance. This ensures fair resource allocation and prevents noisy neighbor problems in multi-tenant environments. 
**Created By**: platform-team I've also generated a Kyverno ClusterPolicy that enforces this requirement: **Generated Kyverno Policy**: apiVersion: kyverno.io/v1 kind: ClusterPolicy metadata: name: require-container-resource-limits-d53b3447 labels: policy-intent/id: d53b3447-0eed-42a0-8a8b-651d18374325 spec: validationFailureAction: Enforce background: false rules: - name: require-pod-container-resource-limits match: any: - resources: kinds: - Pod operations: - CREATE - UPDATE validate: cel: expressions: - expression: >- object.spec.containers.all(container, has(container.resources) && has(container.resources.limits) && has(container.resources.limits.cpu) && has(container.resources.limits.memory) && container.resources.limits.cpu != '' && container.resources.limits.memory != '' ) message: "All containers must have CPU and memory resource limits defined..." # ... (additional rules for Deployment, StatefulSet, ReplicaSet, DaemonSet, ReplicationController) **Choose what to do:** 1. **Apply Kyverno policy to cluster** - Store policy intent AND deploy enforcement to cluster 2. **Store policy intent only** - Save for AI guidance without cluster enforcement 3. **Cancel** - Do nothing ⚠️ **Warning**: Option 1 will deploy active policy enforcement to your cluster. You: 1 ``` #### Step 8: Policy Review and Deployment Choice The system shows the generated policy and asks what you want to do: **Example**: ``` Assistant: Please review your policy intent: **Description**: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation **Triggers**: applications, microservices, web applications, containerized applications, workloads, pods, containers, deployments **Rationale**: Resource limits prevent any single container from consuming excessive CPU or memory, which could starve other applications and degrade cluster performance. 
This ensures fair resource allocation and prevents noisy neighbor problems in multi-tenant environments. **Created By**: platform-team I've also generated a Kyverno ClusterPolicy that enforces this requirement: **Generated Kyverno Policy**: ```yaml # YAML policy content shown here with namespace restrictions based on Step 6 choice ``` **Choose what to do:** 1. **Apply Kyverno policy to cluster** - Store policy intent AND deploy enforcement to cluster 2. **Store policy intent only** - Save for AI guidance without cluster enforcement 3. **Cancel** - Do nothing ⚠️ **Warning**: Option 1 will deploy active policy enforcement to your cluster. You: 1 ``` #### Step 9: Policy Deployment (if chosen) If you chose option 1, the system validates and deploys the policy to your cluster: **Example**: ``` Assistant: **Deploying policy to cluster...** ✅ Kyverno policy validation successful ✅ Policy applied to cluster: require-container-resource-limits-d53b3447 ✅ Policy intent stored in database **Policy Intent and Enforcement Created Successfully!** **Policy ID**: d53b3447-0eed-42a0-8a8b-651d18374325 **Status**: Active - both AI guidance and cluster enforcement enabled **Kyverno Policy**: require-container-resource-limits-d53b3447 The policy intent will now guide AI recommendations during deployments, and the Kyverno policy enforces compliance at the cluster level. ``` The policy intent is now available to guide AI recommendations during resource configuration. ### Listing Policy Intents **View All Policies**: ``` Show me all policy intents ``` **Example Response**: ``` Found 1 policy intents (showing 1) 1. 
**Resource Limits Policy** (Created: 2025-08-25) - Description: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation - Triggers: 8 keywords including "applications", "microservices", "containers", "deployments" - Created by: platform-team - Status: Intent only (no Kyverno policies deployed) Search mode: semantic+keyword hybrid search enabled ``` ### Viewing Policy Details **Get Specific Policy** (using policy ID): ``` Show me policy intent d53b3447-0eed-42a0-8a8b-651d18374325 ``` **Example Response**: ``` Policy intent retrieved successfully: **Policy ID**: d53b3447-0eed-42a0-8a8b-651d18374325 **Description**: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation **Created**: 2025-08-25 by platform-team **Status**: Intent only (no Kyverno policies deployed) **Triggers** (8): - applications - microservices - web applications - containerized applications - workloads - pods - containers - deployments **Rationale**: Resource limits prevent any single container from consuming excessive CPU or memory, which could starve other applications and degrade cluster performance. This ensures fair resource allocation and prevents noisy neighbor problems in multi-tenant environments. ``` ### Searching Policy Intents **Search Policies** (using semantic search): ``` Find policies about "resource limits" ``` **Example Response**: ``` Found 1 policy intents matching "resource limits" 1. **Resource Limits Policy** (Score: 0.40) - Description: All containers must have CPU and memory resource limits defined... - Triggers: 8 keywords including "applications", "containers", "deployments" - Created: 2025-08-25 by platform-team - Status: Intent only Semantic search found relevant policies based on description and trigger matching. 
``` **Search Features**: - **Semantic matching**: Finds policies based on meaning, not just exact keywords - **Score ranking**: Results ordered by relevance score - **Comprehensive search**: Searches policy descriptions, triggers, and rationales ### Deleting Policy Intents #### Delete Single Policy **Remove Specific Policy** (requires policy ID): ``` Delete policy intent d53b3447-0eed-42a0-8a8b-651d18374325 ``` **Note**: You need the specific policy ID for deletion. Use `list` or `search` operations first to find the policy ID you want to delete. For policies with no deployed Kyverno policies: **Example Response**: ``` Policy intent deleted successfully (no Kyverno policies to cleanup) **Deleted Policy**: - ID: d53b3447-0eed-42a0-8a8b-651d18374325 - Description: All containers must have CPU and memory resource limits defined... - Created: 2025-08-25 by platform-team ``` For policies with deployed Kyverno policies, the system will ask for confirmation: **Example Confirmation**: ``` Policy intent has deployed Kyverno policies that need cleanup decision: Policy intent "All containers must have resource limits..." has 1 deployed Kyverno policies in your cluster: require-container-resource-limits-d53b3447 **Choose what to do:** 1. **Delete everything** - Remove policy intent AND delete Kyverno policies from cluster 2. **Keep Kyverno policies** - Remove policy intent only, preserve cluster policies ⚠️ **Warning**: Option 1 will remove active policy enforcement from your cluster. **What would you like to do?** ``` #### Delete All Policies **Remove All Policies**: ``` Delete all policy intents ``` If no policies exist: **Example Response**: ``` No policy intents found to delete ``` If policies exist with deployed Kyverno policies, you'll get a batch confirmation similar to single delete, allowing you to choose whether to preserve or remove all cluster policies. **Note**: The conversation examples above are simplified for readability. 
The actual workflow uses structured prompts and includes additional validation steps, but the core information flow remains the same. ## AI Integration Policy intents automatically enhance deployment recommendations when users request deployments. For complete examples of how policies influence recommendations, see the [Recommendation Guide](../tools/recommend.md). The recommendation system automatically: - Searches for relevant policy intents using semantic matching - Includes policy requirements as REQUIRED questions with compliance indicators - Provides policy-compliant defaults and validation guidance - Balances governance requirements with specific user needs **Important**: Policy intents guide users toward compliant configurations proactively, rather than blocking manifests reactively. ## Troubleshooting ### Common Issues #### Policy Creation Fails **Symptom**: Error when creating policy intents through AI assistant **Possible Causes**: - Vector DB connection issues - Missing required environment variables - Qdrant collection initialization problems - Embedding service unavailable **Solutions**: 1. **Check system status**: ``` Show dot-ai status ``` 2. **Verify Vector DB connection**: - Confirm `QDRANT_URL` and `QDRANT_API_KEY` in `.mcp.json` - Test Qdrant accessibility from your network - Check Qdrant cluster status in dashboard 3. **Validate environment variables**: See [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for AI model and embedding provider setup with all available options. #### Kyverno Policy Generation Fails **Symptom**: Policy intent created successfully but Kyverno generation fails **Possible Causes**: - Kyverno not installed in cluster - Invalid policy description that can't be converted to Kyverno rules - Cluster connectivity issues - Missing kubectl access **Solutions**: 1. **Check Kyverno installation**: ```bash kubectl get pods -n kyverno ``` 2. **Verify cluster access**: ```bash kubectl cluster-info ``` 3. 
**Simplify policy description**: - Use clear, specific requirements - Avoid complex conditional logic - Focus on single validation rules 4. **Manual policy application** (if generation fails): The system saves generated policies to files even on deployment failures. Check the error message for the file path and apply manually: ```bash kubectl apply -f /path/to/kyverno-policy.yaml ``` #### Policy Search Not Working **Symptom**: Relevant policies don't appear in AI recommendations **Possible Causes**: - Trigger keywords don't match user language - Embedding service not configured properly - Policy storage issues - Vector DB connectivity problems **Solutions**: 1. **Review policy triggers**: - Check if triggers match the language users actually employ - Add more trigger variations and synonyms - Test with different request phrasings 2. **Check embedding service**: - Verify embedding provider is configured (see [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration)) - Confirm embedding service API key is properly set 3. **Verify policy storage**: ``` List all policy intents ``` 4. **Test policy matching**: - Create a simple test policy with obvious triggers - Make a deployment request using those exact triggers - Check if AI mentions the policy in recommendations #### Kyverno Policy Deployment Fails **Symptom**: Policy intent and YAML generated successfully but deployment fails **Possible Causes**: - Insufficient cluster permissions - Network connectivity issues - Cluster API unavailable - Kyverno admission controller not ready **Solutions**: 1. **Check cluster permissions**: ```bash kubectl auth can-i create clusterpolicies kubectl auth can-i get clusterpolicies ``` 2. **Verify Kyverno status**: ```bash kubectl get pods -n kyverno kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno ``` 3. 
**Manual deployment**: Use the generated policy file (path provided in error message): ```bash kubectl apply -f /path/to/generated-policy.yaml ``` #### Semantic Search Not Working **Symptom**: Only exact keyword matches work, semantic similarities missed **Possible Causes**: - OpenAI API key missing or invalid - Policies created without embeddings - Embedding service connectivity issues **Solutions**: 1. **Verify OpenAI configuration**: - Confirm valid embedding provider API key is set (see [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration)) - Test API key with simple request - Check API usage limits and billing status 2. **Check embedding status in system status**: ``` Show dot-ai status ``` Look for embedding service availability 3. **Recreate policies if needed**: - Policies created without embedding service may need recreation - New policies automatically include embeddings if service is available ### System Diagnostics #### Check Overall System Health **Command**: ``` What's the current system status? ``` **Expected Response** (healthy system): ```json { "status": "success", "system": { "vectorDB": { "connected": true, "url": "http://localhost:6333", "collectionName": "policies", "policiesCount": 3 }, "embedding": { "available": true, "provider": "openai", "model": "text-embedding-3-small" }, "kyverno": { "installed": true, "policyGenerationReady": true, "version": "1.10.0" }, "anthropic": { "connected": true, "keyConfigured": true } } } ``` #### Verify Policy Storage **Command**: ``` List all policy intents ``` **Check For**: - Policies are being returned successfully - Policy count matches expectations - Search capabilities indicate semantic mode #### Test Policy Integration **Test Method**: 1. Create a test policy with specific, unique triggers 2. Make a deployment request using those exact triggers 3. Verify the policy appears in AI questions with compliance indicators 4. 
Check that generated manifests follow policy requirements ## FAQ ### General Questions **Q: Do policies affect all deployment recommendations?** A: Yes, relevant policies automatically enhance AI recommendations when users deploy resources that match the policy triggers and rationale. **Q: Can I use policy management without Kyverno?** A: Yes! Policy intents provide AI guidance without requiring Kyverno. The cluster enforcement is optional. **Q: How many policies should I create?** A: Start with 3-5 core governance policies covering your most critical requirements (security, resources, compliance). Expand based on organizational needs. **Q: Can multiple policies apply to a single request?** A: Yes! The AI can apply multiple relevant policies to create comprehensive compliance guidance. ### Technical Questions **Q: What happens if Vector DB is unavailable?** A: Policy operations will fail. Deployment recommendations continue working but without policy guidance until Vector DB is restored. **Q: Can I backup my policies?** A: Currently, policies are stored in your Qdrant instance. Back up your Qdrant data to preserve policies. Export functionality is planned for future versions. **Q: How do I update an existing policy?** A: Currently, delete the old policy and create a new one. In-place editing is planned for future versions. **Q: What happens to Kyverno policies when I delete a policy intent?** A: The system asks whether you want to delete the cluster policies too, or preserve them while removing only the AI guidance. ### Security Questions **Q: Are policy intents stored securely?** A: Policy intents are stored in your Vector DB with standard Qdrant security. Use appropriate access controls and network security for your Vector DB instance. **Q: Can policies access sensitive cluster data?** A: No, policies only define validation rules. They don't have access to secret data or cluster state beyond what Kyverno normally validates. 
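To make "policies only define validation rules" concrete, the resource-limits requirement generated earlier in this guide can be expressed in plain Python over a pod spec. This is only a sketch mirroring the logic of the generated CEL expression, not the actual enforcement path (Kyverno evaluates CEL inside the admission controller):

```python
def has_cpu_and_memory_limits(pod_spec: dict) -> bool:
    """Mirror of the CEL check: every container must set non-empty cpu and memory limits."""
    return all(
        bool(c.get("resources", {}).get("limits", {}).get("cpu"))
        and bool(c.get("resources", {}).get("limits", {}).get("memory"))
        for c in pod_spec.get("containers", [])
    )

# Illustrative specs, not real cluster data.
compliant = {"containers": [
    {"name": "app", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}}]}
violating = {"containers": [{"name": "app", "resources": {}}]}

print(has_cpu_and_memory_limits(compliant))   # → True
print(has_cpu_and_memory_limits(violating))   # → False
```

Like CEL's `.all()`, Python's `all()` returns `True` for an empty container list, so a pod with no containers trivially passes.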
**Q: Do generated Kyverno policies follow security best practices?** A: Yes, generated policies use modern Kyverno CEL expressions and follow least-privilege principles. However, always review generated policies before deployment. ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](../tools/overview.md)** - Browse all available tools and features # Quick Start **Get started with DevOps AI Toolkit in minutes — deploy applications, manage policies, and remediate issues using AI-powered Kubernetes workflows.** > **For the easiest setup**, we recommend installing the complete dot-ai stack which includes all components pre-configured. See the [Stack Installation Guide](/docs/stack). > > Continue below if you want to install components individually (for non-Kubernetes setups or granular control over configuration). ## Overview **What it does**: DevOps AI Toolkit provides AI-powered Kubernetes deployment, remediation, policy management, and capability discovery through conversational workflows via [MCP](/docs/mcp) or the [CLI](/docs/cli). **Use when**: You want intelligent Kubernetes operations without memorizing commands, need AI-powered troubleshooting, or want to establish governance policies across your cluster. **📖 Full Guide**: See the [Deployment Guide](setup/deployment.md) for detailed configuration options and the [Tools Overview](tools/overview.md) for complete feature reference. 
## Prerequisites **Works without AI keys:** - ✅ **Shared prompts library** - No API key needed, works with any MCP-enabled coding agent **For AI-powered features (deployment, remediation, patterns, policies, capabilities):** - **AI Model API key** - Required for AI analysis and intelligent recommendations - **Multiple AI models supported** - see [AI Model Configuration](setup/deployment.md#ai-model-configuration) for all options and setup - **Quick setup**: Claude (default) - `export ANTHROPIC_API_KEY=your_key_here` **For Kubernetes deployment recommendations:** - **kubectl** configured with cluster access - Verify cluster access with: `kubectl get nodes` - Should show your cluster nodes without authentication errors **For organizational pattern management:** - **Vector DB service** (Qdrant) for pattern storage and semantic search - **Embedding provider API key** - Required for semantic pattern matching: - OpenAI: `OPENAI_API_KEY` - Google: `GOOGLE_API_KEY` - Amazon Bedrock: AWS credentials via environment variables or `~/.aws/credentials` **For policy management and governance:** - **Vector DB service** (Qdrant) for policy storage and semantic search - **Embedding provider API key** - Required for semantic policy matching (same options as above) - **Optional**: Kyverno installed in cluster for active policy enforcement ## Installation DevOps AI Toolkit runs as a server on Kubernetes. AI agents connect to it via [MCP](/docs/mcp) or the [CLI](/docs/cli). This guide walks through server deployment and MCP client setup. ## Usage **🎯 Recommended: Kubernetes Setup (Full Features)** Production-ready deployment with autonomous capability scanning via controller: ### Step 0: Create a Kubernetes Cluster (Optional) Skip this step if you already have a Kubernetes cluster with an ingress controller. **Prerequisites:** [Install Kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) if you don't have it. 
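Before creating the cluster, you can check the prerequisites above in one pass. A minimal preflight sketch (the tool and variable names follow this guide; extend it for the providers you actually use):

```shell
# Preflight sketch: report missing tools and keys without aborting the shell.
missing=0
for cmd in kubectl helm kind; do
  if command -v "$cmd" > /dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
    missing=$((missing + 1))
  fi
done
# ANTHROPIC_API_KEY is checked as an example; swap in your provider's key.
if [ -z "$ANTHROPIC_API_KEY" ]; then
  echo "ANTHROPIC_API_KEY: not set"
  missing=$((missing + 1))
fi
echo "preflight: $missing item(s) missing"
```

A non-zero count tells you what to install or export before the Helm steps below.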
**Create a Kind cluster with ingress support:** ```bash # Create Kind cluster configuration cat > kind-config.yaml << 'EOF' kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 nodes: - role: control-plane extraPortMappings: - containerPort: 80 hostPort: 80 protocol: TCP - containerPort: 443 hostPort: 443 protocol: TCP EOF # Create the cluster kind create cluster --name dot-ai --config kind-config.yaml # Install nginx ingress controller for Kind kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml # Wait for ingress controller to be ready kubectl wait --namespace ingress-nginx \ --for=condition=ready pod \ --selector=app.kubernetes.io/component=controller \ --timeout=90s ``` ### Step 1: Set Environment Variables ```bash export ANTHROPIC_API_KEY="sk-ant-api03-your-key-here" export OPENAI_API_KEY="sk-proj-your-openai-key-here" export DOT_AI_AUTH_TOKEN=$(openssl rand -base64 32) # Ingress class - change to match your ingress controller (traefik, haproxy, etc.) export INGRESS_CLASS_NAME="nginx" ``` ### Step 2: Install via Helm ```bash # Set versions from GitHub packages export DOT_AI_VERSION="..." # https://github.com/vfarcic/dot-ai/pkgs/container/dot-ai%2Fcharts%2Fdot-ai export DOT_AI_CONTROLLER_VERSION="..." 
# https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller # Install controller (enables autonomous capability scanning) helm install dot-ai-controller \ oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller:$DOT_AI_CONTROLLER_VERSION \ --namespace dot-ai --create-namespace --wait # Install MCP server helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \ --set secrets.openai.apiKey="$OPENAI_API_KEY" \ --set secrets.auth.token="$DOT_AI_AUTH_TOKEN" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai.127.0.0.1.nip.io" \ --set controller.enabled=true \ --namespace dot-ai --wait ``` ### Step 3: Create MCP Configuration Create the MCP client configuration file with your auth token: ```bash cat > .mcp.json << EOF { "mcpServers": { "dot-ai": { "type": "http", "url": "http://dot-ai.127.0.0.1.nip.io", "headers": { "Authorization": "Bearer $DOT_AI_AUTH_TOKEN" } } } } EOF ``` **Note:** The `$DOT_AI_AUTH_TOKEN` variable is expanded when creating the file. Make sure you're in the same terminal session where you set the environment variables in Step 1. ### Step 4: Start Your MCP Client ```bash claude # or your preferred MCP-enabled AI tool ``` Verify everything works by asking: ``` Show dot-ai status ``` You should see a status report showing all components are healthy. **What you get:** - ✅ **Full Features**: All capabilities including autonomous scanning via controller - ✅ **Production-Ready**: Scalable deployment with proper resource management - ✅ **Automatic Capability Discovery**: Controller watches for CRD changes and scans automatically - ✅ **Team Collaboration**: Shared MCP server accessible by multiple developers **Full Configuration**: See the [Deployment Guide](setup/deployment.md) for advanced configuration options. 
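If the client cannot connect, first rule out a malformed config. This sketch regenerates `.mcp.json` exactly as in Step 3 (substituting a throwaway token when `DOT_AI_AUTH_TOKEN` is unset, for illustration) and validates it with `python3`, which is assumed to be available. An `Authorization` header of just `Bearer ` usually means the token variable was set in a different terminal session:

```shell
# Recreate the MCP config; your real DOT_AI_AUTH_TOKEN is reused if set.
export DOT_AI_AUTH_TOKEN="${DOT_AI_AUTH_TOKEN:-$(openssl rand -base64 32)}"
cat > .mcp.json << EOF
{
  "mcpServers": {
    "dot-ai": {
      "type": "http",
      "url": "http://dot-ai.127.0.0.1.nip.io",
      "headers": {
        "Authorization": "Bearer $DOT_AI_AUTH_TOKEN"
      }
    }
  }
}
EOF

# Validate: the file parses as JSON and the token actually expanded.
python3 - << 'PYEOF'
import json

with open(".mcp.json") as f:
    cfg = json.load(f)

auth = cfg["mcpServers"]["dot-ai"]["headers"]["Authorization"]
assert auth.startswith("Bearer ") and len(auth) > len("Bearer "), \
    "token did not expand into .mcp.json"
print("mcp config: OK")
PYEOF
```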
### Step 5: Start Using Conversational Workflows

Try these example prompts to explore the toolkit:

| What You Want | Example Prompt | Guide |
|---------------|----------------|-------|
| Scan capabilities | Use controller (recommended) or "Scan my cluster for capabilities" | [Capability Management](tools/capability-management.md) |
| Query cluster | "What databases are running?" | [Cluster Query](tools/query.md) |
| Deploy an app | "I want to deploy a web application" | [Recommendation Guide](tools/recommend.md) |
| Operate resources | "Scale my database to 3 replicas" | [Operations Guide](tools/operate.md) |
| Fix issues | "Something is wrong with my database" | [Remediation Guide](tools/remediate.md) |
| Create patterns | "Create a pattern for database deployments" | [Pattern Management](organizational-data/patterns.md) |
| Create policies | "Create a policy requiring resource limits" | [Policy Management](organizational-data/policies.md) |
| Set up project | "Help me set up governance files" | [Project Setup Guide](tools/project-setup.md) |
| Use prompts | `/dot-ai:prd-create` | [Prompts Guide](tools/prompts.md) |

## Next Steps

📖 **[Deployment Guide →](setup/deployment.md)** - Detailed configuration, troubleshooting, and examples

📖 **[Complete Tools & Features Reference →](tools/overview.md)** - Comprehensive guide to all available tools, workflows, and advanced features

# AI Engine Deployment

**Deploy the DevOps AI Toolkit Engine to Kubernetes using the Helm chart — production-ready deployment.**

> **For the easiest setup**, we recommend installing the complete dot-ai stack which includes all components pre-configured. See the [Stack Installation Guide](/docs/stack).
>
> Continue below if you want to install components individually (for granular control over configuration).

## Overview

The DevOps AI Toolkit Engine provides:

1. **Kubernetes Deployment Recommendations** — AI-powered application deployment assistance with enhanced semantic understanding
2.
**Cluster Query** — Natural language interface for querying cluster resources, status, and health 3. **Capability Management** — Discover and store semantic resource capabilities for intelligent recommendation matching 4. **Pattern Management** — Organizational deployment patterns that enhance AI recommendations 5. **Policy Management** — Governance policies that guide users toward compliant configurations with optional Kyverno enforcement 6. **Kubernetes Issue Remediation** — AI-powered root cause analysis and automated remediation 7. **Shared Prompts Library** — Centralized prompt sharing via native slash commands 8. **REST API Gateway** — HTTP endpoints for all toolkit capabilities Access these tools through [MCP clients](/docs/mcp) or the [CLI](/docs/cli). ## What You Get - **Production Kubernetes Deployment** — Scalable deployment with proper resource management - **Integrated Qdrant Database** — Vector database for capability and pattern management - **External Access** — Ingress configuration for team collaboration - **Resource Management** — Proper CPU/memory limits and requests - **Security** — RBAC and ServiceAccount configuration ## Prerequisites - Kubernetes cluster (1.19+) with kubectl access - Helm 3.x installed - AI model API key (default: Anthropic). See [AI Model Configuration](#ai-model-configuration) for available model options. - OpenAI API key (required for vector embeddings) - Ingress controller (any standard controller) ## Quick Start (5 Minutes) ### Step 1: Set Environment Variables Export your API keys and auth token: ```bash # Required export ANTHROPIC_API_KEY="sk-ant-api03-..." export OPENAI_API_KEY="sk-proj-..." export DOT_AI_AUTH_TOKEN=$(openssl rand -base64 32) # Ingress class - change to match your ingress controller (traefik, haproxy, etc.) 
export INGRESS_CLASS_NAME="nginx" ``` ### Step 2: Install the Controller Install the dot-ai-controller to enable autonomous cluster operations: ```bash # Set the controller version from https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller export DOT_AI_CONTROLLER_VERSION="..." # Install controller (includes CRDs for Solution and RemediationPolicy) helm install dot-ai-controller \ oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller:$DOT_AI_CONTROLLER_VERSION \ --namespace dot-ai \ --create-namespace \ --wait ``` The controller provides CRDs for autonomous cluster operations. Create Custom Resources like CapabilityScanConfig, Solution, RemediationPolicy, or ResourceSyncConfig to enable features such as capability scanning, solution tracking, and more. See the [Controller Setup Guide](/docs/controller/setup-guide) for complete details. ### Step 3: Install the Server Install the server using the published Helm chart: ```bash # Set the version from https://github.com/vfarcic/dot-ai/pkgs/container/dot-ai%2Fcharts%2Fdot-ai export DOT_AI_VERSION="..." helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \ --set secrets.openai.apiKey="$OPENAI_API_KEY" \ --set secrets.auth.token="$DOT_AI_AUTH_TOKEN" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai.127.0.0.1.nip.io" \ --set controller.enabled=true \ --namespace dot-ai \ --wait ``` **Notes**: - Replace `dot-ai.127.0.0.1.nip.io` with your desired hostname for external access. - For enhanced security, create a secret named `dot-ai-secrets` with keys `anthropic-api-key`, `openai-api-key`, and `auth-token` instead of using `--set` arguments. - For all available configuration options, see the [Helm values file](https://github.com/vfarcic/dot-ai/blob/main/charts/values.yaml). 
- **Global annotations**: Add annotations to all Kubernetes resources using `annotations` in your values file (e.g., for [Reloader](https://github.com/stakater/Reloader) integration: `reloader.stakater.com/auto: "true"`). - **Custom endpoints** (OpenRouter, self-hosted): See [Custom Endpoint Configuration](#custom-endpoint-configuration) for environment variables, then use `--set` or values file with `ai.customEndpoint.enabled=true` and `ai.customEndpoint.baseURL`. - **Observability/Tracing**: Add tracing environment variables via `extraEnv` in your values file. See [Observability Guide](../operations/observability.md) for complete configuration. - **User-Defined Prompts**: Load custom prompts from your git repository via `extraEnv`. See [User-Defined Prompts](../tools/prompts.md#user-defined-prompts) for configuration. ### Step 4: Connect a Client With the server running, connect using your preferred access method: - **[MCP Client Setup](/docs/mcp)** — Connect via MCP protocol from Claude Code, Cursor, or other MCP clients - **[CLI](/docs/cli)** — Use the command-line interface for terminal and CI/CD pipelines ## Capability Scanning for AI Recommendations Many MCP tools depend on **capability data** to function: - **recommend**: Uses capabilities to find resources matching your deployment intent - **manageOrgData** (patterns): References capabilities when applying organizational patterns - **manageOrgData** (policies): Validates resources against stored capability metadata Without capability data, these tools may not work or will produce poor results. ### Enabling Capability Scanning Create a `CapabilityScanConfig` CR to enable autonomous capability discovery. The controller watches for CRD changes and automatically scans new resources. See the [Capability Scan Guide](/docs/controller/capability-scan-guide) for setup instructions. ## AI Model Configuration The DevOps AI Toolkit supports multiple AI models. 
Choose your model by setting the `AI_PROVIDER` environment variable. ### Model Requirements All AI models must meet these minimum requirements: - **Context window**: 200K+ tokens (some tools like capability scanning use large context) - **Output tokens**: 8K+ tokens (for YAML generation and policy creation) - **Function calling**: Required for MCP tool interactions ### Available Models | Provider | Model | AI_PROVIDER | API Key Required | Recommended | |----------|-------|-------------|------------------|-------------| | **Anthropic** | Claude Haiku 4.5 | `anthropic_haiku` | `ANTHROPIC_API_KEY` | Yes | | **Anthropic** | Claude Opus 4.6 | `anthropic_opus` | `ANTHROPIC_API_KEY` | Yes | | **Anthropic** | Claude Sonnet 4.6 | `anthropic` | `ANTHROPIC_API_KEY` | Yes | | **AWS** | Amazon Bedrock | `amazon_bedrock` | AWS credentials ([see setup](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)) | Yes | | **Google** | Gemini 3.1 Pro | `google` | `GOOGLE_GENERATIVE_AI_API_KEY` | Yes (might be slow) | | **Google** | Gemini 3 Flash | `google_flash` | `GOOGLE_GENERATIVE_AI_API_KEY` | Yes (preview) | | **Host** | Host Environment LLM | `host` | None (uses host's AI) | Yes (if supported) | | **Moonshot AI** | Kimi K2 | `kimi` | `MOONSHOT_API_KEY` | Yes | | **Moonshot AI** | Kimi K2 Thinking | `kimi_thinking` | `MOONSHOT_API_KEY` | Yes (might be slow) | | **OpenAI** | GPT-5.1 Codex | `openai` | `OPENAI_API_KEY` | No * | | **xAI** | Grok-4 | `xai` | `XAI_API_KEY` | No * | \* **Note**: These models may not perform as well as other providers for complex DevOps reasoning tasks. 
### Models Not Supported | Provider | Model | Reason | |----------|-------|--------| | **DeepSeek** | DeepSeek V3.2 (`deepseek-chat`) | 128K context limit insufficient for heavy workflows | | **DeepSeek** | DeepSeek R1 (`deepseek-reasoner`) | 64K context limit insufficient for most workflows | **Why DeepSeek is not supported**: Integration testing revealed that DeepSeek's context window limitations (128K for V3.2, 64K for R1) cause failures in context-heavy operations like Kyverno policy generation, which can exceed 130K tokens. The toolkit requires 200K+ context for reliable operation across all features. ### Helm Configuration Set AI provider in your Helm values: ```yaml ai: provider: anthropic_haiku # or anthropic, anthropic_opus, google, etc. secrets: anthropic: apiKey: "your-api-key" ``` Or via `--set`: ```bash helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set ai.provider=anthropic_haiku \ --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \ # ... other settings ``` **AI Keys Are Optional**: The MCP server starts successfully without AI API keys. Tools like **Shared Prompts Library** and **REST API Gateway** work without AI. AI-powered tools (deployment recommendations, remediation, pattern/policy management, capability scanning) require AI keys (unless using the `host` provider) and will show helpful error messages when accessed without configuration. ## Embedding Provider Configuration The DevOps AI Toolkit supports multiple embedding providers for semantic search capabilities in pattern management, capability discovery, and policy matching. 
### Available Embedding Providers | Provider | EMBEDDINGS_PROVIDER | Model | Dimensions | API Key Required | |----------|-------------------|-------|------------|------------------| | **Amazon Bedrock** | `amazon_bedrock` | `amazon.titan-embed-text-v2:0` | 1024 | AWS credentials | | **Google** | `google` | `text-embedding-004` (deprecated) | 768 | `GOOGLE_API_KEY` | | **Google** | `google` | `gemini-embedding-001` | 768 | `GOOGLE_API_KEY` | | **OpenAI** | `openai` (default) | `text-embedding-3-small` | 1536 | `OPENAI_API_KEY` | ### Helm Configuration Set embedding provider via `extraEnv` in your values file: ```yaml extraEnv: - name: EMBEDDINGS_PROVIDER value: "google" - name: GOOGLE_API_KEY valueFrom: secretKeyRef: name: dot-ai-secrets key: google-api-key ``` **Notes:** - **Same Provider**: If using the same provider for both AI models and embeddings (e.g., `AI_PROVIDER=google` and `EMBEDDINGS_PROVIDER=google`), you only need to set one API key - **Mixed Providers**: You can use different providers for AI models and embeddings (e.g., `AI_PROVIDER=anthropic` with `EMBEDDINGS_PROVIDER=google`) - **Embedding Support**: Not all AI model providers support embeddings. Anthropic does not provide embeddings; use OpenAI, Google, or Amazon Bedrock for embeddings - **Google Deprecation**: `text-embedding-004` will be discontinued on January 14, 2026. Use `gemini-embedding-001` for new deployments. When switching models, you must delete and recreate all embeddings (patterns, capabilities, policies) as vectors from different models are not compatible ## Custom Endpoint Configuration You can configure custom OpenAI-compatible endpoints for AI models. This enables using alternative providers like OpenRouter, self-hosted models, or air-gapped deployments. 
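The `secretKeyRef` above assumes a Secret holding the embedding key. If you prefer managing it declaratively, a minimal sketch (the name `dot-ai-secrets` and key `google-api-key` match the reference above, and this guide's install notes describe the same Secret holding `anthropic-api-key`, `openai-api-key`, and `auth-token`; the value below is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: dot-ai-secrets
  namespace: dot-ai
type: Opaque
stringData:
  google-api-key: "REPLACE_WITH_YOUR_GOOGLE_API_KEY"  # placeholder value
```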
### In-Cluster Ollama Example Deploy with a self-hosted Ollama service running in the same Kubernetes cluster: **Create a `values.yaml` file:** ```yaml ai: provider: openai model: "llama3.3:70b" # Your self-hosted model customEndpoint: enabled: true baseURL: "http://ollama-service.default.svc.cluster.local:11434/v1" secrets: customLlm: apiKey: "ollama" # Ollama doesn't require authentication openai: apiKey: "your-openai-key" # Still needed for vector embeddings ``` **Install with custom values:** ```bash helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --values values.yaml \ --create-namespace \ --namespace dot-ai \ --wait ``` ### Other Self-Hosted Options **vLLM (Self-Hosted):** ```yaml ai: provider: openai model: "meta-llama/Llama-3.1-70B-Instruct" customEndpoint: enabled: true baseURL: "http://vllm-service:8000/v1" secrets: customLlm: apiKey: "dummy" # vLLM may not require authentication openai: apiKey: "your-openai-key" ``` **LocalAI (Self-Hosted):** ```yaml ai: provider: openai model: "your-model-name" customEndpoint: enabled: true baseURL: "http://localai-service:8080/v1" secrets: customLlm: apiKey: "dummy" openai: apiKey: "your-openai-key" ``` ### OpenRouter Example OpenRouter provides access to 100+ LLM models from multiple providers: ```yaml ai: provider: openai model: "anthropic/claude-3.5-sonnet" customEndpoint: enabled: true baseURL: "https://openrouter.ai/api/v1" secrets: customLlm: apiKey: "sk-or-v1-your-key-here" openai: apiKey: "your-openai-key" # Still needed for embeddings ``` **Note**: OpenRouter does not support embedding models. Use OpenAI, Google, or Amazon Bedrock for embeddings. 
Get your OpenRouter API key at [https://openrouter.ai/](https://openrouter.ai/) ### Important Notes - **Context window**: 200K+ tokens recommended - **Output tokens**: 8K+ tokens minimum - **Function calling**: Must support OpenAI-compatible function calling **Testing Status:** - Validated with OpenRouter (alternative SaaS provider) - Not yet tested with self-hosted Ollama, vLLM, or LocalAI - We need your help testing! Report results in [issue #193](https://github.com/vfarcic/dot-ai/issues/193) **Notes:** - OpenAI API key is still required for vector embeddings (Qdrant operations) - If model requirements are too high for your setup, please open an issue - Configuration examples are based on common patterns but not yet validated ## TLS Configuration To enable HTTPS, add these values (requires [cert-manager](https://cert-manager.io/) with a ClusterIssuer): ```yaml ingress: tls: enabled: true clusterIssuer: letsencrypt # Your ClusterIssuer name ``` Then update your `.mcp.json` URL to use `https://`. ## Web UI Visualization Enable rich visualizations of query results by connecting to a [DevOps AI Web UI](https://github.com/vfarcic/dot-ai-ui) instance. When configured, the query tool includes a `visualizationUrl` field in responses that opens interactive visualizations (resource topology, relationships, health status) in your browser. ### Configuration Add the Web UI base URL to your Helm values: ```yaml webUI: baseUrl: "https://dot-ai-ui.example.com" # Your Web UI instance URL ``` Or via `--set`: ```bash helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set webUI.baseUrl="https://dot-ai-ui.example.com" \ # ... other settings ``` ### Feature Toggle Behavior - **Not configured** (default): Query responses contain only text summaries. No `visualizationUrl` field is included. - **Configured**: Query responses include a `visualizationUrl` field (format: `{baseUrl}/v/{sessionId}`) that opens the visualization in the Web UI. 
### Example Query Response When `webUI.baseUrl` is configured, query responses include: ```text **View visualization**: https://dot-ai-ui.example.com/v/abc123-session-id ``` This URL opens an interactive visualization of the query results in the Web UI. ## Gateway API (Alternative to Ingress) For Kubernetes 1.26+, you can use **Gateway API v1** for advanced traffic management with role-oriented design (platform teams manage Gateways, app teams create routes). ### When to Use **Use Gateway API when:** - Running Kubernetes 1.26+ with Gateway API support - Need advanced routing (weighted traffic, header-based routing) - Prefer separation of infrastructure and application concerns **Use Ingress when:** - Running Kubernetes < 1.26 - Simpler requirements met by Ingress features ### Prerequisites - Kubernetes 1.26+ cluster - Gateway API CRDs installed: `kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml` - Gateway controller running (Istio, Envoy Gateway, Kong, etc.) 
- Existing Gateway resource created by platform team (reference pattern) ### Quick Start (Reference Pattern - RECOMMENDED) Reference an existing platform-managed Gateway: ```bash helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \ --set secrets.openai.apiKey="$OPENAI_API_KEY" \ --set secrets.auth.token="$DOT_AI_AUTH_TOKEN" \ --set ingress.enabled=false \ --set gateway.name="cluster-gateway" \ --set gateway.namespace="gateway-system" \ --namespace dot-ai \ --wait ``` ### Configuration Reference ```yaml # Reference pattern (RECOMMENDED) gateway: name: "cluster-gateway" # Existing Gateway name namespace: "gateway-system" # Gateway namespace (optional) timeouts: request: "3600s" # SSE streaming timeout backendRequest: "3600s" # Creation pattern (development/testing only) gateway: create: true # Create Gateway (NOT for production) className: "istio" # GatewayClass name ``` ### Complete Guide See **[Gateway API Deployment Guide](gateway-api.md)** for: - Platform team Gateway setup (HTTP and HTTPS) - Application team deployment steps - Cross-namespace access (ReferenceGrant) - Development/testing creation pattern - Troubleshooting and verification - Migration from Ingress ## Next Steps Once the server is running: ### 1. Explore Tools - **[Tools Overview](../tools/overview.md)** — Complete guide to all available tools, how they work together, and recommended usage flow ### 2. Enable Observability (Optional) - **[Observability Guide](../operations/observability.md)** — Distributed tracing with OpenTelemetry for debugging workflows, measuring AI performance, and monitoring Kubernetes operations ### 3. 
Production Considerations - Consider backup strategies for vector database content (organizational patterns and capabilities) - Review [TLS Configuration](#tls-configuration) for HTTPS ## Support - **Bug Reports**: [GitHub Issues](https://github.com/vfarcic/dot-ai/issues) # Gateway API Deployment Guide **Deploy DevOps AI Toolkit using Kubernetes Gateway API v1 with production-ready reference pattern or development/testing creation pattern.** ## When to Use Gateway API **Use Gateway API when:** - Running Kubernetes 1.26+ with Gateway API support (GKE Autopilot, EKS, AKS, etc.) - Need advanced routing (weighted traffic, header-based routing, mirroring) - Prefer role-oriented design (platform team manages Gateways, app teams create routes) - Want standardized configuration across different Gateway implementations **Use Ingress instead when:** - Running Kubernetes < 1.26 - Gateway API CRDs not available - Simpler requirements met by Ingress features **Learn more:** [Gateway API official documentation](https://gateway-api.sigs.k8s.io/) ## Prerequisites **All deployments:** - Kubernetes 1.26+ cluster - Gateway API CRDs v1.2+ installed - Gateway controller running (Istio, Envoy Gateway, Kong, etc.) 
- Helm 3.x **Reference pattern (RECOMMENDED):** - Existing Gateway resource created by platform team - GatewayClass resource available - Optional: ReferenceGrant for cross-namespace access **Creation pattern (development/testing only):** - GatewayClass resource available ### Install Gateway API CRDs ```bash kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml ``` **Output:** ``` customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/tcproutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/udproutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/backendtlspolicies.gateway.networking.k8s.io created ``` Verify installation: ```bash kubectl get crd gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io ``` **Output:** ``` NAME CREATED AT gateways.gateway.networking.k8s.io 2025-12-22T09:37:24Z httproutes.gateway.networking.k8s.io 2025-12-22T09:37:24Z ``` Check available GatewayClasses: ```bash kubectl get gatewayclass ``` **Output:** ``` NAME CONTROLLER ACCEPTED AGE kgateway kgateway.dev/kgateway True 22s ``` ## Reference Pattern (HTTP) - Development/Testing Only > ⚠️ **SECURITY WARNING**: This HTTP-only configuration transmits all traffic unencrypted, including authentication tokens and API keys. **DO NOT use in production.** Use the [HTTPS pattern](#reference-pattern-https---recommended) for production deployments. 
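Two resources cooperate in this pattern: a platform-managed Gateway (created in Step 1 below) and an HTTPRoute that the Helm chart renders on your behalf. For orientation, here is a sketch of a route roughly equivalent to what the chart produces (field names follow Gateway API v1; the chart's actual template and defaults may differ):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: dot-ai
  namespace: dot-ai
spec:
  parentRefs:
    - name: cluster-gateway      # gateway.name from the Helm values
      namespace: gateway-system  # gateway.namespace from the Helm values
  hostnames:
    - dot-ai.example.com
  rules:
    - backendRefs:
        - name: dot-ai
          port: 8080
      timeouts:
        request: 3600s           # long timeouts for SSE streaming
        backendRequest: 3600s
```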
The reference pattern follows Gateway API best practices where platform teams manage shared Gateway infrastructure and application teams create HTTPRoutes.

### Step 1: Platform Team Creates Gateway (ONCE)

Create a shared Gateway in a dedicated namespace:

> **Note:** This HTTP-only example is for development/testing. For production, skip to the [HTTPS pattern](#reference-pattern-https---recommended).

```bash
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cluster-gateway
  namespace: gateway-system
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All # Allow routes from all namespaces
EOF
```

**Output:**

```
gateway.gateway.networking.k8s.io/cluster-gateway created
```

Wait for Gateway to be Programmed:

```bash
kubectl wait --for=condition=Programmed gateway/cluster-gateway -n gateway-system --timeout=300s
```

**Output:**

```
gateway.gateway.networking.k8s.io/cluster-gateway condition met
```

Verify Gateway status:

```bash
kubectl describe gateway cluster-gateway -n gateway-system
```

**Output:**

```
Name:         cluster-gateway
Namespace:    gateway-system
Labels:       <none>
Annotations:  <none>
API Version:  gateway.networking.k8s.io/v1
Kind:         Gateway
Metadata:
  Creation Timestamp:  2025-12-22T09:38:15Z
Spec:
  Gateway Class Name:  kgateway
  Listeners:
    Allowed Routes:
      Namespaces:
        From:  All
    Name:      http
    Port:      80
    Protocol:  HTTP
Status:
  Addresses:
    Type:   IPAddress
    Value:  XXX.XXX.XXX.XXX
  Conditions:
    Last Transition Time:  2025-12-22T09:38:17Z
    Message:               Resource is valid
    Observed Generation:   1
    Reason:                Accepted
    Status:                True
    Type:                  Accepted
    Last Transition Time:  2025-12-22T09:38:17Z
    Message:               Listeners are ready
    Observed Generation:   1
    Reason:                Programmed
    Status:                True
    Type:                  Programmed
  Listeners:
    Attached Routes:  0
    Conditions:
      Last Transition Time:  2025-12-22T09:38:17Z
      Message:               Listener is ready
      Reason:                Ready
Status: True Type: Ready Name: http Supported Kinds: Group: gateway.networking.k8s.io Kind: HTTPRoute Group: gateway.networking.k8s.io Kind: GRPCRoute ``` ### Step 2: Application Team Deploys dot-ai Create Helm values file (`gateway-http-values.yaml`): ```yaml # Deployment method deployment: method: standard # Disable traditional Ingress ingress: enabled: false # Reference existing Gateway (RECOMMENDED) gateway: name: "cluster-gateway" namespace: "gateway-system" timeouts: request: "3600s" backendRequest: "3600s" # Secrets configuration secrets: name: dot-ai-secrets auth: token: "" # SET THIS: openssl rand -base64 32 anthropic: apiKey: "" # SET THIS: sk-ant-api03-... openai: apiKey: "" # SET THIS: sk-proj-... # AI provider configuration ai: provider: anthropic # Qdrant vector database qdrant: enabled: true ``` Deploy with Helm: ```bash helm install dot-ai \ oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:0.168.0 \ --namespace dot-ai \ --create-namespace \ -f gateway-http-values.yaml \ --wait ``` **Output:** ``` NAME: dot-ai LAST DEPLOYED: Sun Dec 22 09:39:45 2025 NAMESPACE: dot-ai STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: DevOps AI Toolkit (dot-ai) has been deployed successfully. 
Access the service:
- HTTP: kubectl port-forward -n dot-ai svc/dot-ai 8080:8080
- Gateway API: Configure your DNS to point to the Gateway IP (XXX.XXX.XXX.XXX)

For more information, visit: https://github.com/vfarcic/dot-ai
```

Or using `--set` flags:

```bash
export DOT_AI_AUTH_TOKEN=$(openssl rand -base64 32)

helm install dot-ai \
  oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:0.168.0 \
  --namespace dot-ai \
  --create-namespace \
  --set gateway.name=cluster-gateway \
  --set gateway.namespace=gateway-system \
  --set secrets.auth.token="$DOT_AI_AUTH_TOKEN" \
  --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \
  --set secrets.openai.apiKey="$OPENAI_API_KEY" \
  --wait
```

**Output:**

```
NAME: dot-ai
LAST DEPLOYED: Sun Dec 22 09:39:45 2025
NAMESPACE: dot-ai
STATUS: deployed
REVISION: 1
```

Verify deployment:

```bash
kubectl get pods -n dot-ai
```

**Output:**

```
NAME                     READY   STATUS    RESTARTS   AGE
dot-ai-7c5f8d9b4-xyz9w   1/1     Running   0          30s
qdrant-0                 1/1     Running   0          25s
```

## Reference Pattern (HTTPS) - RECOMMENDED

For production HTTPS deployments with cert-manager and wildcard certificates.

### Step 1: Platform Team Creates Gateway with HTTPS (ONCE)

Create a Gateway with HTTP and HTTPS listeners. The `wildcard-tls` Secret referenced below must already exist, typically as a wildcard certificate issued by cert-manager:

```yaml
# Gateway with HTTP and HTTPS listeners
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cluster-gateway
  namespace: gateway-system
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: wildcard-tls
      allowedRoutes:
        namespaces:
          from: All
```

Apply and wait:

```bash
# Save the manifest above to a file (e.g. gateway-https.yaml), then:
kubectl apply -f gateway-https.yaml
kubectl wait --for=condition=Programmed gateway/cluster-gateway -n gateway-system --timeout=300s
```

> **Note:** GKE Gateway Controller and potentially other implementations do not support HTTPRoute `timeouts`.
> If you encounter the error `GWCER104 "Timeouts are not supported"`, set timeouts to empty strings and configure them via [GCPBackendPolicy](https://cloud.google.com/kubernetes-engine/docs/how-to/configure-gateway-resources#configure_timeout) instead:
> ```yaml
> gateway:
>   timeouts:
>     request: ""
>     backendRequest: ""
> ```

### Creation Pattern Values (Development/Testing Only)

```yaml
gateway:
  create: true          # Create Gateway (NOT for production)
  className: "istio"    # GatewayClass name (required)
  annotations: {}       # Gateway annotations
  listeners:
    http:
      enabled: true     # HTTP listener port 80
      hostname: ""      # Optional hostname
    https:
      enabled: false    # HTTPS listener port 443
      hostname: ""      # Optional hostname
      secretName: ""    # TLS secret name
  timeouts:
    request: "3600s"
    backendRequest: "3600s"
```

## Cross-Namespace Gateway Access (ReferenceGrant)

If the Gateway uses `allowedRoutes.namespaces.from: Same`, create a ReferenceGrant to allow cross-namespace access.

**When is ReferenceGrant needed?**

```yaml
# NO ReferenceGrant needed
listeners:
- name: http
  allowedRoutes:
    namespaces:
      from: All   # Allows all namespaces

# ReferenceGrant REQUIRED
listeners:
- name: http
  allowedRoutes:
    namespaces:
      from: Same  # Only same namespace (requires ReferenceGrant for cross-namespace)
```

### ReferenceGrant Example

Platform team creates in Gateway namespace:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-dot-ai-routes
  namespace: gateway-system  # Gateway namespace
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: dot-ai        # Application namespace
  to:
  - group: gateway.networking.k8s.io
    kind: Gateway
    # Optional: name: cluster-gateway  # Restrict to specific Gateway
```

Apply:

```bash
kubectl apply -f - <<EOF
# ... (ReferenceGrant manifest shown above) ...
EOF
```

## MCP Client Configuration

> ⚠️ **WARNING**: HTTP transmits credentials in plaintext. Use only for local development.

```json
{
  "mcpServers": {
    "dot-ai": {
      "url": "http://dot-ai.example.com",
      "transport": {
        "type": "http"
      }
    }
  }
}
```

## Verification Steps

### 1.
Check Gateway Status

```bash
# Reference mode
kubectl get gateway cluster-gateway -n gateway-system
```

**Output:**

```text
NAME              CLASS      ADDRESS           PROGRAMMED   AGE
cluster-gateway   kgateway   XXX.XXX.XXX.XXX   True         5m22s
```

Verify detailed status:

```bash
kubectl describe gateway cluster-gateway -n gateway-system
```

Look for the `Programmed` condition = `True`.

### 2. Check HTTPRoute

```bash
kubectl get httproute -n dot-ai
```

**Output:**

```text
NAME          HOSTNAMES              PARENTREFS        AGE
dot-ai-http   [dot-ai.example.com]   cluster-gateway   2m
```

Verify detailed status:

```bash
kubectl describe httproute -n dot-ai
```

**Output (excerpt):**

```text
Status:
  Parents:
  - Conditions:
    - Type: Accepted
      Status: "True"
      Reason: Accepted
    - Type: ResolvedRefs
      Status: "True"
      Reason: ResolvedRefs
    Controller Name: kgateway.dev/kgateway
    Parent Ref:
      Group: gateway.networking.k8s.io
      Kind: Gateway
      Name: cluster-gateway
      Namespace: gateway-system
```

### 3. Check Backend Services

```bash
kubectl get svc,pod -n dot-ai
```

**Output:**

```text
NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/dot-ai   ClusterIP   XXX.XXX.XXX.XX   <none>        8080/TCP   2m

NAME                         READY   STATUS    RESTARTS   AGE
pod/dot-ai-7c5f8d9b4-xyz9w   1/1     Running   0          2m
pod/qdrant-0                 1/1     Running   0          2m
```

### 4.
Verify Gateway IP Address ```bash # Get Gateway IP kubectl get gateway cluster-gateway -n gateway-system -o jsonpath='{.status.addresses[0].value}' ``` **Output:** ```text XXX.XXX.XXX.XXX ``` ## Troubleshooting ### Gateway Not Getting IP Address ```bash kubectl describe gateway cluster-gateway -n gateway-system ``` **Common issues:** - GatewayClass not found → Check `kubectl get gatewayclass` - Gateway controller not running → Check controller pods - Invalid listener configuration → Review Gateway spec **Solution:** ```bash # Verify GatewayClass exists and is accepted kubectl get gatewayclass -o yaml # Check Gateway controller logs kubectl logs -n istio-system -l app=istio-ingressgateway # Example for Istio ``` ### HTTPRoute Not Routing Traffic ```bash kubectl describe httproute -n dot-ai ``` **Common issues:** - Gateway name mismatch - Cross-namespace without ReferenceGrant - Backend Service doesn't exist **Solution:** ```bash # Verify parentRef matches Gateway kubectl get httproute -n dot-ai -o yaml | grep -A 5 parentRefs # Check Service exists kubectl get svc -n dot-ai # For cross-namespace, verify ReferenceGrant kubectl get referencegrant -n gateway-system ``` ### Cross-Namespace Access Denied ```bash kubectl get referencegrant -n gateway-system kubectl describe referencegrant -n gateway-system ``` **Common issues:** - ReferenceGrant missing - ReferenceGrant doesn't allow your namespace - Gateway uses `from: Same` without ReferenceGrant **Solution:** Create ReferenceGrant as shown in [Cross-Namespace Gateway Access](#cross-namespace-gateway-access-referencegrant) section. ### Gateway Name Conflict (Creation Mode) ```bash kubectl get gateway,deploy -n dot-ai ``` **Issue:** Gateway name conflicts with application Deployment name. **Solution:** The chart automatically adds `-http` suffix to prevent this. 
Verify:

```bash
kubectl get gateway -n dot-ai
# Expected: dot-ai-http (or `<release-name>-http`)
```

### DNS Record Not Created (external-dns)

```bash
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns
```

**Common issues:**
- external-dns not running
- Missing annotations on Gateway
- DNS provider credentials missing

**Solution:**

```bash
# Verify Gateway has external-dns annotation
kubectl get gateway -n dot-ai -o yaml | grep external-dns

# Check DNS resolution
dig +short dot-ai.example.com
```

### Connection Timeout

**Issue:** HTTP requests time out or SSE connections fail.

**Solution:** Verify timeout configuration:

```bash
kubectl get httproute -n dot-ai -o yaml | grep -A 5 timeouts
```

Expected:

```yaml
timeouts:
  request: 3600s
  backendRequest: 3600s
```

## Cost Comparison

| Mode | Gateway per App | Load Balancer Cost | 10 Apps Total Cost |
|------|-----------------|--------------------|--------------------|
| **Reference (RECOMMENDED)** | No (shared) | $18-30/month | $18-30/month |
| **Creation (dev/test)** | Yes | $18-30/month each | $180-300/month |

**Recommendation:** Use the reference pattern in production to save costs and follow Gateway API best practices.

## Migration from Ingress

Migrate from traditional Ingress to Gateway API:

### Step 1: Note Current Configuration

```bash
# Get current Ingress hostname
kubectl get ingress -n dot-ai -o yaml | grep host

# Get TLS configuration
kubectl get ingress -n dot-ai -o yaml | grep -A 5 tls
```

### Step 2: Platform Team Creates Gateway

Create a Gateway with the same hostname as the current Ingress:

```bash
kubectl apply -f - <<EOF
# ... (Gateway manifest using the hostname noted in Step 1) ...
EOF
```

> **Note**: For Git repositories, the [dot-ai-controller](/docs/controller/knowledge-source-guide) handles document lifecycle automatically - it detects changes, deletes outdated chunks, and re-ingests updated content.
**Example** (manual deletion): ```text Delete the recommendation guide from the knowledge base (URI: https://github.com/vfarcic/dot-ai/blob/main/docs/guides/mcp-recommendation-guide.md) ``` **Response**: ```text Done. Deleted 33 chunks for the recommendation guide from the knowledge base. ``` **What happened behind the scenes:** - All chunks matching the URI were found in the knowledge base - Each chunk was deleted from the vector database - The document is no longer searchable ## Troubleshooting ### Check Knowledge Base Status To verify the knowledge base is working and see how many documents are stored: ```text Show dot-ai status ``` Look for the `knowledgeBase` collection in the response: ```text Collections | Collection | Status | Documents | |---------------|----------|-----------| | knowledgeBase | ✅ Ready | 156 | ``` If the collection shows `⚠️ Not created`, no documents have been ingested yet. If it shows an error, check your Vector DB connection and embedding provider configuration. ## FAQ **Q: How do I update a document?** A: Simply re-ingest it. The system automatically deletes existing chunks for the URI before storing new ones. **Q: How are documents chunked?** A: Documents are split into chunks of approximately 1000 characters with 200 character overlap to preserve context across chunk boundaries. **Q: Can I search across multiple repositories?** A: Yes. All ingested documents are stored in the same knowledge base regardless of source. Search returns results from all sources, with URIs indicating the origin. 
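The chunking behavior described in the FAQ can be sketched in a few lines of bash. This is an illustrative sketch only - the `chunk` helper is hypothetical and not part of dot-ai - but it shows how a fixed chunk size combined with an overlap walks through a document so that adjacent chunks share context:

```shell
#!/usr/bin/env bash
# Illustrative sketch of fixed-size chunking with overlap (~1000-character
# chunks, 200-character overlap, per the FAQ above). Not part of dot-ai.
chunk() {
  local text="$1" size="${2:-1000}" overlap="${3:-200}"
  local step=$((size - overlap)) start=0 index=0
  while (( start < ${#text} )); do
    local piece="${text:start:size}"
    # A real implementation would embed and store each piece in the vector
    # database; here we just report where each chunk falls.
    printf 'chunk %d: offset=%d length=%d\n' "$index" "$start" "${#piece}"
    start=$((start + step))
    index=$((index + 1))
  done
}

# A 2,500-character document yields four overlapping chunks.
chunk "$(printf 'x%.0s' {1..2500})" 1000 200
```

With a 2,500-character input, the sketch reports four chunks at offsets 0, 800, 1600, and 2400; the last two are shorter because the document ends before a full 1,000 characters.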
## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Controller Knowledge Source Guide](/docs/controller/knowledge-source-guide)** - Automated Git repository ingestion - **[Tools and Features Overview](overview.md)** - Browse all available tools # Operate Guide **Complete guide for AI-powered Kubernetes operations with the DevOps AI Toolkit.** ## Using via Web UI These tools are also available through the [Web Dashboard](/docs/ui). ## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md) to set up: - DevOps AI Toolkit deployed - AI model API key configured (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for supported models and setup) - `KUBECONFIG` pointing to your Kubernetes cluster **Required - Capability Management:** - Vector DB service (Qdrant) for capability storage - Cluster capabilities discovered via [Capability Management Guide](capability-management.md) - **Note**: Operations will fail without capabilities - the system requires semantic understanding of your cluster resources **Optional - Enhanced with Organizational Context:** - Organizational patterns created via [Pattern Management Guide](../organizational-data/patterns.md) - Policy intents created via [Policy Management Guide](../organizational-data/policies.md) - When configured, operations automatically follow organizational best practices and governance rules ## Overview **What it does**: Provides AI-powered Day 2 operations for any Kubernetes resources through natural language intents. Updates, scales, enhances, and manages workloads, databases, infrastructure, and cloud resources with cluster-aware recommendations and organizational governance. **Use when**: You need to perform operational changes on deployed resources - applications, databases, storage, AWS/Azure/GCP resources via operators, networking, or any Kubernetes-managed infrastructure. 
📖 **Full Guide**: This document covers the complete operations workflow with detailed examples and behind-the-scenes explanations. ### Key Features The DevOps AI Toolkit operate feature provides: - **Natural language operations** - Describe what you want, AI figures out how to do it - **Cluster-aware decisions** - Leverages installed operators and custom resources automatically - **Helm release support** - Works with Helm-managed applications (upgrades, rollbacks, uninstalls) in addition to raw Kubernetes resources - **Pattern-driven operations** - Applies organizational best practices to every change - **Policy enforcement** - Validates operations against governance rules before execution - **Dry-run validation** - All changes tested before proposing to ensure they'll work - **Safe execution** - Exact approved commands executed with comprehensive validation - **Iterative validation** - Verifies operations completed successfully with AI analysis ### How AI-Driven Operations Work **Traditional Problem**: Manual operations require deep Kubernetes expertise, error-prone YAML editing, and knowing which operators/resources are available. **AI Operations Solution**: Natural language intents analyzed by AI with full cluster awareness and organizational context. **Operations Flow**: 1. **Intent Analysis**: AI understands your operational goal 2. **Cluster Investigation**: AI inspects current state and discovers relevant resources 3. **Context Integration**: Applies relevant organizational patterns, policies, and cluster capabilities 4. **Solution Design**: Generates operational plan that satisfies your intent 5. **Dry-Run Validation**: Tests all changes to ensure they'll succeed 6. **User Approval**: Review proposed changes with full transparency 7. **Execution**: Approved commands executed exactly as shown 8. 
**Validation**: AI verifies operation completed successfully ## Complete Workflow Example ### Pattern-Driven Scaling with HPA Creation This example demonstrates how organizational patterns influence operational decisions. We'll scale a deployment twice: 1. **Without pattern** - AI recommends direct replica scaling 2. **With pattern** - Same intent, but AI creates HPA instead **Scenario Setup:** ```bash # Deployment exists: demo-api in operate-demo namespace # Current state: 2 replicas, nginx:1.21, with resource requests/limits ``` #### Step 1: Baseline Operation (No Pattern) First, let's see what AI recommends without any organizational patterns: ```md User: scale demo-api deployment in operate-demo namespace to 4 replicas ``` ~~~md Agent: Let me analyze the current state and propose a scaling operation. [Uses mcp__dot-ai__operate with intent "scale demo-api deployment in operate-demo namespace to 4 replicas"] **Analysis Complete:** Scaling demo-api deployment from 2 to 4 replicas in the operate-demo namespace. This is a simple, low-risk scaling operation that will increase application availability and capacity. 
**Current State:** - Deployment: demo-api (operate-demo namespace) - Current replicas: 2/2 running - Image: nginx:1.21 - Strategy: RollingUpdate (25% max unavailable/surge) - Resources: 100m CPU/128Mi memory requests, 200m CPU/256Mi memory limits **Proposed Changes:** - **Update Deployment** demo-api: replicas: 4 (increased from 2) - Rationale: Directly satisfies the operational intent to scale deployment to 4 replicas **Commands to Execute:** ```bash kubectl scale deployment demo-api --replicas=4 -n operate-demo ``` **✅ Dry-Run Validation:** Success - Deployment accepts the replica count modification **Patterns Applied:** None **Capabilities Used:** Deployment **Policies Checked:** - Applications must NEVER use 'latest' tag - PASSED (uses nginx:1.21) **Risk Assessment:** LOW RISK - Scaling operation is non-disruptive - New pods created in parallel with existing configuration - No downtime expected, easily reversible **Session ID:** opr-1763169767759-0c2a4bb7 📊 **Visualization:** http://localhost:3000/v/opr-1763169767759-0c2a4bb7 Would you like to execute this operation? ~~~ #### Visualization (Optional) The visualization URL opens an interactive analysis view in the Web UI: > **Note**: Visualization URLs require additional setup: > - Configure `webUI.baseUrl` in your Helm values (see [Deployment Guide - Web UI Visualization](../setup/deployment.md#web-ui-visualization)) > - Install the Web UI (see [Web UI Documentation](/docs/ui)) **What happened behind the scenes:** - AI analyzed deployment and discovered current state: 2 replicas, properly configured resources - No organizational patterns found related to "scaling" - Generated simple `kubectl scale` command - direct approach - Validated with dry-run successfully - Proposed straightforward replica increase **Note:** Without patterns, AI takes the most direct approach - modifying replicas directly. This works, but doesn't capture organizational best practices like using HPA for managed scaling. 
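The dry-run validation the agent performs can also be reproduced by hand when you want to sanity-check a change yourself. `kubectl` supports server-side dry runs, which ask the API server to validate a request without persisting it. The resource names below come from the example scenario above, so this assumes a cluster where they exist:

```shell
# Ask the API server to validate the scale request without applying it.
kubectl scale deployment demo-api --replicas=4 -n operate-demo --dry-run=server

# The live object is untouched afterwards; replicas still report the old value.
kubectl get deployment demo-api -n operate-demo -o jsonpath='{.spec.replicas}'
```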
#### Step 3: Pattern-Influenced Operation Now let's run the **exact same scaling intent** and see how the pattern changes the recommendation: ```md User: scale demo-api deployment in operate-demo namespace to 4 replicas. Start the process again. ``` ~~~md Agent: Let me analyze the operation with organizational patterns applied. [Uses mcp__dot-ai__operate with intent "scale demo-api deployment in operate-demo namespace to 4 replicas"] **Analysis Complete:** Scaling demo-api deployment from 2 to 4 replicas using HorizontalPodAutoscaler as per organizational Pattern 1 (Horizontal scaling with HPA). The pattern specifies that all scaling operations should use HPA for managing multiple replicas, even when both min and max are set to the same value. **Current State:** - Deployment: demo-api (operate-demo namespace) - Current replicas: 2 running - Image: nginx:1.21 - Resource requests: 100m CPU / 128Mi memory (HPA-ready) - No HPA configured **Proposed Changes:** - **Create HorizontalPodAutoscaler** demo-api - minReplicas: 4, maxReplicas: 4 - CPU utilization target: 80% - Rationale: Follows organizational Pattern 1 which requires all scaling operations to use HPA for consistency and operational best practices. Creates foundation for future autoscaling enhancements. 
**HPA Manifest:** ```yaml apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: demo-api namespace: operate-demo spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: demo-api minReplicas: 4 maxReplicas: 4 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 80 ``` **✅ Dry-Run Validation:** Success - Cluster accepted the HPA resource definition **Patterns Applied:** - Pattern 1: Horizontal scaling with HPA - All scaling operations use HorizontalPodAutoscaler for consistency and operational governance **Capabilities Used:** HorizontalPodAutoscaler (autoscaling/v2) **Risk Assessment:** LOW RISK - Scaling up from 2 to 4 replicas on healthy deployment - Existing pods continue running while new replicas created - Easily reversible by deleting HPA or adjusting counts **Session ID:** opr-1763170292295-0dbae152 Would you like to execute this operation? ~~~ **What happened behind the scenes:** - Vector search found relevant pattern: "Horizontal scaling with HPA" (semantic match on "scaling", "replicas") - AI read pattern rationale: "All scaling operations should use HorizontalPodAutoscaler..." - AI changed strategy from direct replica scaling to HPA creation - Generated HPA manifest with minReplicas=maxReplicas=4 to lock at desired count - Validated HPA manifest with dry-run successfully - Listed pattern in "patternsApplied" field for audit trail **Key Difference:** Same intent, completely different approach! Pattern transformed a simple `kubectl scale` into an HPA resource, enforcing organizational best practice. ## Summary This example demonstrated: 1. **Baseline behavior** - Without patterns, AI recommends direct approach (kubectl scale) 2. **Pattern creation** - Define organizational best practice (use HPA for scaling) 3. **Pattern influence** - Same intent produces different recommendation (HPA creation) 4. 
**Safe execution** - Exact approved commands executed with AI validation **Key Takeaway:** Organizational patterns transform operations from "what works" to "what's best for your organization" - automatically enforcing governance without manual intervention. **Learn More:** - See [Pattern Management Guide](../organizational-data/patterns.md) for detailed information on creating and managing organizational patterns - Explore [Policy Management Guide](../organizational-data/policies.md) for governance and compliance enforcement ## Best Practices ### Writing Effective Intents **Be specific about target resources:** ```md ✅ Good: "scale demo-api deployment in production namespace to 5 replicas" ❌ Vague: "scale the app" ``` **Include namespace when working with multiple environments:** ```md ✅ Good: "update my-api in staging namespace to v2.0" ❌ Ambiguous: "update my-api to v2.0" (which namespace?) ``` **Specify operational requirements when relevant:** ```md ✅ Good: "update my-api to v2.0 with zero downtime" ✅ Good: "make my-database highly available with backups" ``` ### Session Management - **Review proposals carefully** - Always review proposed changes before execution - **Sessions are temporary** - Session data expires after operation completion - **Refine if needed** - Use `refinedIntent` parameter to clarify ambiguous requests ### Pattern and Policy Integration - **Create patterns proactively** - Define operational best practices before they're needed - **Use specific triggers** - Patterns with clear triggers match more accurately - **Document rationale** - Clear rationale helps AI apply patterns correctly - **Test patterns** - Verify patterns influence recommendations as expected ## Related Guides - **[Recommendation Guide](recommend.md)** - Initial application deployment with capability-enhanced recommendations - **[Remediate Guide](remediate.md)** - AI-powered troubleshooting and issue resolution - **[Pattern Management 
Guide](../organizational-data/patterns.md)** - Creating and managing organizational patterns - **[Policy Management Guide](../organizational-data/policies.md)** - Defining and enforcing governance policies - **[Capability Management Guide](capability-management.md)** - Cluster resource discovery and semantic capabilities # Tools and Features Overview **Quick reference to all available tools in the DevOps AI Toolkit.** ## Why Your Infrastructure AI Sucks (And How to Fix It) [![Why Your Infrastructure AI Sucks (And How to Fix It)](https://img.youtube.com/vi/Ma3gKmuXahc/maxresdefault.jpg)](https://youtu.be/Ma3gKmuXahc) This video demonstrates the core AI-powered deployment workflow: capabilities discovery, organizational patterns, policy enforcement, context management, and intelligent workflows. Watch how these components work together to transform generic AI responses into infrastructure solutions that actually work in your organization. ## Available Tools ### 🚀 Kubernetes Deployment Recommendations AI-powered application deployment assistance with smart intent clarification, semantic resource matching, and automated manifest generation. - **What it does**: Clarifies deployment intents through adaptive questioning, then provides intelligent deployment suggestions based on your cluster capabilities. When no capability matches, automatically discovers and installs third-party applications via Helm charts from ArtifactHub. - **Use when**: Deploying applications to Kubernetes clusters, or installing third-party tools (Prometheus, Argo CD, Crossplane, etc.) via Helm - **📖 Full Guide**: [Kubernetes Deployment Recommendations](recommend.md) ### 🔎 Cluster Query Natural language interface for querying your Kubernetes cluster. 
- **What it does**: Ask questions about your cluster in plain English - find resources by concept, describe specific resources, or check health status - **Use when**: Exploring your cluster, finding resources without memorizing kubectl syntax, or checking resource status - **📖 Full Guide**: [Cluster Query](query.md) ### 🔍 Capability Management Teaches the AI what your Kubernetes resources actually do through semantic discovery and analysis. - **What it does**: Scans cluster resources and discovers their capabilities for intelligent matching - **Use when**: Setting up recommendations (required) or improving deployment intelligence - **📖 Full Guide**: [Capability Management](capability-management.md) ### 🏛️ Pattern Management Captures organizational deployment knowledge as reusable patterns that enhance AI recommendations. - **What it does**: Creates deployment templates with your organization's best practices - **Use when**: Standardizing deployments across teams or enforcing organizational standards - **📖 Full Guide**: [Pattern Management](../organizational-data/patterns.md) ### 🛡️ Policy Management Enables proactive governance through policy intents that guide users toward compliant configurations. - **What it does**: Creates governance policies that integrate into AI recommendations with optional Kyverno enforcement - **Use when**: Implementing security requirements, compliance standards, or configuration governance - **📖 Full Guide**: [Policy Management](../organizational-data/policies.md) ### 🔧 Kubernetes Issue Remediation AI-powered issue analysis and remediation with intelligent root cause identification. 
- **What it does**: Multi-step investigation loop to identify root causes and generate executable remediation commands - **Use when**: Troubleshooting Kubernetes failures, diagnosing pod/networking/storage issues, or understanding "what's wrong" - **📖 Full Guide**: [Kubernetes Issue Remediation](remediate.md) ### ⚙️ Kubernetes Operations AI-powered Day 2 operations for any Kubernetes resources through natural language intents. - **What it does**: Handles updates, scaling, enhancements, rollbacks, and any operational changes to workloads, databases, infrastructure, cloud resources, or Helm-managed applications with pattern-driven recommendations and dry-run validation - **Use when**: Performing operational changes on deployed resources - applications, databases, storage, Helm releases, AWS/Azure/GCP resources via operators, networking, or any Kubernetes-managed infrastructure - **📖 Full Guide**: [Kubernetes Operations](operate.md) ### 📦 Project Setup & Governance Comprehensive repository setup with governance, legal, security, and automation files. - **What it does**: Generates 25+ standardized files including LICENSE, CODE_OF_CONDUCT, CONTRIBUTING, SECURITY policies, GitHub issue/PR templates, workflows (OpenSSF Scorecard), and automation (Renovate, Labeler, Stale Bot) - **Use when**: Setting up new repositories, standardizing team workflows, or implementing governance and security best practices - **📖 Full Guide**: [Project Setup & Governance](project-setup.md) ### 💬 Shared Prompts Library Centralized prompt sharing via native slash commands in AI coding agents. - **What it does**: Provides curated prompts as slash commands (e.g., `/explain-code`, `/security-review`) - **Use when**: Boosting productivity with standardized prompts across projects - **📖 Full Guide**: [Shared Prompts Library](prompts.md) ### 🌐 REST API Gateway HTTP REST endpoints for all DevOps AI Toolkit tools, enabling integration with traditional applications and CI/CD pipelines. 
- **What it does**: Exposes all tools via standard HTTP POST/GET endpoints with auto-generated OpenAPI documentation - **Use when**: Integrating with automation scripts, CI/CD pipelines, Kubernetes controllers, or any non-MCP applications - **📖 Full Guide**: [REST API Gateway](../api/rest-api.md) ### 🩺 System Status & Diagnostics Comprehensive health check and diagnostics for your DevOps AI Toolkit installation. - **What it does**: Shows version info, component health (Kubernetes, Vector DB, AI provider), collection statistics, and active capabilities - **Use when**: Troubleshooting setup issues, verifying installation, or checking what features are available - **📖 Full Guide**: [System Status](version.md) ### 📚 Knowledge Base Semantic search over organizational documentation. - **What it does**: Ingests documents into a searchable vector store, enabling natural language search across your organization's documentation - **Use when**: Making organizational knowledge searchable by AI, finding relevant documentation by meaning rather than keywords - **📖 Full Guide**: [Knowledge Base](knowledge-base.md) ## Quick Start 1. **Complete Setup**: Follow the [Deployment Guide](../setup/deployment.md) 2. **Start with Capability Management** to scan your cluster (required for recommendations and operations) 3. **Try Deployment Recommendations** with a simple application 4. **Use Kubernetes Operations** for Day 2 changes (updates, scaling, enhancements) 5. 
**Optional**: Create organizational patterns, policy intents, or use issue remediation ## Prerequisites **Required for all tools:** - **DevOps AI Toolkit deployed**: See [Deployment Guide](../setup/deployment.md) **Works without AI keys:** - ✅ **Project Setup & Governance** - ✅ **Shared Prompts Library** - ✅ **REST API Gateway** **For AI-powered features (deployment, remediation, patterns, policies, capabilities):** - **AI Model API key**: See [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for model options - **Cluster access**: `KUBECONFIG` for Kubernetes integration (deployment, remediation, capabilities) - **Vector database**: Qdrant for capability, pattern, and policy storage - **Embedding provider API key**: OpenAI, Google, or Amazon Bedrock for pattern/policy semantic search ## Tool Dependencies - **Deployment Recommendations** ← requires **Capability Management** - **Kubernetes Operations** ← requires **Capability Management** - **Pattern Management** → enhances **Deployment Recommendations** and **Kubernetes Operations** - **Policy Management** → enhances **Deployment Recommendations** and **Kubernetes Operations** - **Kubernetes Issue Remediation** ← independent - **Project Setup** ← independent - **Shared Prompts Library** ← independent - **REST API Gateway** ← provides HTTP access to all tools ## Getting Help For troubleshooting, use the system status command: ``` Show dot-ai status ``` This provides comprehensive diagnostics including component health, connection status, and capability summary. See the [System Status Guide](version.md) for detailed troubleshooting steps. 
# GitHub Repository Setup & Governance Guide **Complete guide for AI-powered repository setup and governance with the DevOps AI Toolkit.** ## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md) to set up: - DevOps AI Toolkit deployed - No additional dependencies required (tool does not need Kubernetes or AI model access) **Note**: This tool does NOT require: - Kubernetes cluster access - Qdrant vector database - AI model API keys The tool generates files based on templates and your answers - no AI inference or cluster access needed. ## Overview The DevOps AI Toolkit project setup feature provides: - **Repository audit** - Scans repositories to identify missing governance, legal, and automation files - **Gap analysis** - Shows which standardized files are missing from your repository - **Template-based generation** - Creates 25+ files across 9 scopes using industry best practices - **Interactive workflow** - Asks relevant questions and generates customized files - **Standards compliance** - All templates based on authoritative sources (Contributor Covenant, OpenSSF, GitHub) ### What It Generates The tool can generate files across these scopes: - **readme**: README.md with project information - **legal**: LICENSE files (MIT, Apache-2.0, GPL-3.0, BSD-3-Clause) with NOTICE/COPYRIGHT as needed - **governance**: CODE_OF_CONDUCT.md, CONTRIBUTING.md, SECURITY.md - **community**: SUPPORT.md, GOVERNANCE.md, CODEOWNERS - **github-issues**: Bug report, feature request, and question templates (.github/ISSUE_TEMPLATE/) - **pr-template**: Pull request template (.github/PULL_REQUEST_TEMPLATE.md) - automatically detected and used by the [`prd-done` workflow](prompts.md#available-prompts) for intelligent PR creation - **github-community**: FUNDING.yml, AUTHORS, CONTRIBUTORS, CHANGELOG, release.yml - **github-security**: OpenSSF Scorecard workflow (.github/workflows/scorecard.yml) - **github-automation**: Renovate, GitHub Labeler, Stale Bot 
configurations ### When to Use This Tool Use project setup when you want to: - Bootstrap a new repository with proper governance - Audit existing repository for missing files - Standardize team workflows across multiple repositories - Implement security and contribution guidelines - Add GitHub automation (dependency updates, PR labeling, stale issue management) ### Video: Transform Your GitHub Repository [![Transform Your GitHub Repository: Templates, Automation, and Governance](https://img.youtube.com/vi/gYl3moYa4iI/maxresdefault.jpg)](https://youtu.be/gYl3moYa4iI) ## Complete Workflow Example ### Step 1: Request Repository Audit ``` User: Audit my repository for missing governance files ``` ``` Agent: I'll scan your repository to identify missing files. [Scans repository for 21 files] Repository Analysis: ✓ readme: 1/1 files exist ✓ legal: 1/1 files exist ○ governance: 0/6 files exist Missing: CODE_OF_CONDUCT.md, CONTRIBUTING.md, SECURITY.md, docs/MAINTAINERS.md, docs/GOVERNANCE.md, docs/ROADMAP.md ○ community: 0/2 files exist Missing: SUPPORT.md, ADOPTERS.md ○ github-issues: 0/3 files exist Missing: .github/ISSUE_TEMPLATE/bug_report.yml, .github/ISSUE_TEMPLATE/feature_request.yml, .github/ISSUE_TEMPLATE/config.yml ○ pr-template: 0/1 files exist Missing: .github/PULL_REQUEST_TEMPLATE.md ○ github-community: 0/2 files exist Missing: .github/CODEOWNERS, .github/release.yml ○ github-security: 0/1 files exist Missing: .github/workflows/scorecard.yml ○ github-automation: 0/4 files exist Missing: renovate.json, .github/labeler.yml, .github/workflows/labeler.yml, .github/workflows/stale.yml Which scopes would you like to set up? (You can choose multiple) ``` ### Step 2: Select Scopes and Answer Questions ``` User: Set up governance and github-issues ``` ``` Agent: I'll set up governance first (6 files to generate). Let me analyze your repository to suggest answers: 1. Project name: my-awesome-app 2. Repository URL: https://github.com/username/my-awesome-app 3. 
Enforcement email: conduct@example.com 4. Security email: security@example.com 5. Maintainer email: maintainers@example.com 6. Lead maintainer name: John Doe 7. Lead maintainer GitHub: johndoe 8. Setup command: npm install 9. Test command: npm test 10. Lint command: npm run lint ... (27 questions total) Do these look correct, or would you like to modify any? ``` ``` User: Looks good ``` ### Step 3: Files Generated and Written ``` Agent: Generated 6 files for governance scope: - CODE_OF_CONDUCT.md - CONTRIBUTING.md - SECURITY.md - docs/MAINTAINERS.md - docs/GOVERNANCE.md - docs/ROADMAP.md [Writes these 6 files to repository] ✓ Governance files created successfully! Now let's set up github-issues (3 files to generate). Questions for github-issues: 1. Project name: my-awesome-app 2. Repository URL: https://github.com/username/my-awesome-app 3. Bug report form title: Bug Report 4. Feature request form title: Feature Request ... (8 questions total) Do these look correct? ``` ``` User: Yes ``` ``` Agent: Generated 3 files for github-issues scope: - .github/ISSUE_TEMPLATE/bug_report.yml - .github/ISSUE_TEMPLATE/feature_request.yml - .github/ISSUE_TEMPLATE/config.yml [Writes these 3 files to repository] ✓ GitHub issue templates created successfully! All requested scopes are now complete. 
Your repository now has: - 6 governance files - 3 GitHub issue templates ``` ## Scope Reference | Scope | Files | Description | |-------|-------|-------------| | **readme** | README.md | Basic project README with name, description, license | | **legal** | LICENSE, NOTICE*, COPYRIGHT* | License files (MIT, Apache-2.0, GPL-3.0, BSD-3-Clause) | | **governance** | CODE_OF_CONDUCT.md, CONTRIBUTING.md, SECURITY.md, MAINTAINERS.md, GOVERNANCE.md, ROADMAP.md | Complete governance documentation | | **community** | SUPPORT.md, ADOPTERS.md | Support channels and project adopters | | **github-issues** | bug_report.yml, feature_request.yml, config.yml | GitHub issue templates | | **pr-template** | PULL_REQUEST_TEMPLATE.md | Pull request template with checklist | | **github-community** | FUNDING.yml*, AUTHORS*, CONTRIBUTORS*, CHANGELOG*, release.yml* | Community and funding files | | **github-security** | workflows/scorecard.yml | OpenSSF Scorecard security workflow | | **github-automation** | renovate.json, labeler.yml, workflows/labeler.yml, workflows/stale.yml | Automated dependency updates, PR labeling, stale issue management | _* Conditional files - only generated based on your answers_ ## Common Use Cases ### New Open Source Project ``` User: Set up a new open source project with all governance files ``` **Recommended scopes**: legal, governance, community, github-issues, pr-template, github-security, github-automation ### Existing Project Audit ``` User: Audit my existing project for missing files ``` The agent will show what's missing and let you choose which scopes to add. 
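The same kind of check can be approximated locally before running the tool; a minimal shell sketch that tests for a subset of the files listed in the scope reference above (illustrative only — the real audit scans the full template set and never overwrites existing files):

```bash
# Report which of a subset of the standard files exist in the current
# directory. Illustrative only -- the real tool checks the full template set.
total=0
missing=0
for f in README.md LICENSE CODE_OF_CONDUCT.md CONTRIBUTING.md SECURITY.md \
         SUPPORT.md ADOPTERS.md .github/PULL_REQUEST_TEMPLATE.md \
         .github/workflows/scorecard.yml renovate.json; do
  total=$((total + 1))
  if [ -f "$f" ]; then
    echo "found:   $f"
  else
    echo "missing: $f"
    missing=$((missing + 1))
  fi
done
echo "$missing of $total files missing"
```

Running this in a fresh repository shows the gaps at a glance; the toolkit's audit adds per-scope grouping and interactive generation on top.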
### Add GitHub Automation Only ``` User: Add Renovate and issue automation to my project ``` **Recommended scopes**: github-automation ### Security and Compliance ``` User: Add security scanning and policies ``` **Recommended scopes**: governance (includes SECURITY.md), github-security ## Tips ### Workflow Tips **Review before committing**: The agent generates files but YOU control what gets committed. Review the generated content before committing. **Multiple sessions**: You can run the tool multiple times. It only generates missing files, never overwrites existing ones. **Customize templates**: After generation, you can edit files to match your project's specific needs. ### Best Practices **Start with governance**: If setting up a new project, start with `legal` and `governance` scopes first. **Add automation last**: Set up `github-automation` after your basic repository structure is in place. **Review email addresses**: Double-check all email addresses in governance files before committing. **Test workflows**: After generating GitHub Actions workflows, verify they run successfully. ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](overview.md)** - Browse all available tools and features # Shared Prompt Library ## What are Shared Prompts Shared Prompts provide a centralized way to share and discover prompts across projects and team members. Instead of manually copying prompt files between projects, prompts are served directly through the DevOps AI Toolkit and automatically appear as native slash commands in AI coding agents. 
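Under the hood this uses the standard MCP prompts interface; a `prompts/list` exchange looks roughly like the following (JSON-RPC framing per the MCP specification; the prompt entry shown is abridged and illustrative):

```json
{"jsonrpc": "2.0", "id": 1, "method": "prompts/list"}

{"jsonrpc": "2.0", "id": 1,
 "result": {"prompts": [
   {"name": "prd-create",
    "description": "Create a comprehensive Product Requirements Document"}]}}
```

Your coding agent performs this exchange automatically on connection, which is why no file management is needed on your side.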
**Key Benefits:** - **Zero setup**: Connect to the server and prompts are immediately available - **Native integration**: Prompts appear as slash commands in your coding agent's command menu - **Cross-project sharing**: Same prompts available across all projects without file management - **Instant updates**: New prompts added to server are immediately available to all users - **Team consistency**: Everyone uses the same proven prompts ## Prerequisites Before using Shared Prompts, you need: - **DevOps AI Toolkit deployed** — see [Deployment Guide](../setup/deployment.md) - **Access via** [MCP client](/docs/mcp) or [CLI](/docs/cli) **Note**: The Anthropic API key is only required for AI-powered tools (like deployment recommendations), not for using the shared prompts library. ## How It Works The DevOps AI Toolkit implements the standard MCP Prompts specification, exposing two key endpoints: - **`prompts/list`**: Returns all available prompts with metadata - **`prompts/get`**: Returns specific prompt content by ID When you connect to the server, your coding agent automatically discovers available prompts and makes them accessible through agent-specific interfaces. The access method varies by agent - some use slash commands, others integrate prompts as available tools. ## See Shared Prompts in Action [![Using PRD Prompts with DevOps AI Toolkit](https://img.youtube.com/vi/LUFJuj1yIik/maxresdefault.jpg)](https://youtu.be/LUFJuj1yIik) This video demonstrates how to use the PRD (Product Requirements Document) prompts served by the DevOps AI Toolkit. Watch how to create, manage, and work through complete PRD workflows using the shared prompts library with conversational AI agents. ## Using Shared Prompts ### Discovering Available Prompts 1. Ensure you're connected to the DevOps AI Toolkit (see [MCP client setup](/docs/mcp)) 2. 
Access prompts using your agent's interface: - **Claude Code**: Type `/` and look for `/dot-ai:prompt-name` commands - **VS Code + GitHub Copilot**: Type `/` and look for `/mcp.dot-ai.prompt-name` commands - **Other agents**: Check your agent's command menu or ask the agent to use specific prompts by name 3. Browse available prompts through your agent's discovery interface ### Executing Shared Prompts **Claude Code:** ```bash # Create a new PRD /dot-ai:prd-create # Get current PRD status /dot-ai:prds-get ``` **VS Code + GitHub Copilot:** ```bash # Create a new PRD /mcp.dot-ai.prd-create # Get current PRD status /mcp.dot-ai.prds-get ``` **Other Agents:** - May use slash commands with different formats - Or ask the agent to use the prompt by name (e.g., "Use the prd-create prompt") The prompt content executes exactly as if it were a local command file, but without any file management on your part. ## Available Prompts ### Project Management Prompts **`prd-create`** - **Purpose**: Create a comprehensive Product Requirements Document following documentation-first approach - **Use when**: Starting a new feature that requires detailed planning, tracking, and documentation - **Example**: Beginning work on a new feature or major enhancement **`prds-get`** - **Purpose**: Fetch all open GitHub issues with 'PRD' label from the current project repository - **Use when**: Want to see all active Product Requirements Documents and their status - **Example**: Getting overview of current project priorities and PRD status **`prd-next`** - **Purpose**: Analyze existing PRD to identify and recommend the single highest-priority task to work on next - **Use when**: Need guidance on what to work on next within a PRD, uses smart auto-detection of target PRD - **Example**: Continuing work on a PRD and need to prioritize remaining tasks **`prd-start`** - **Purpose**: Start working on a PRD implementation - **Use when**: Beginning work on a specific PRD - **Example**: Moving from planning 
phase to implementation phase **`prd-update-progress`** - **Purpose**: Update PRD progress based on git commits and code changes, enhanced by conversation context - **Use when**: Implementation work has been completed and need to mark PRD items as done based on actual code changes - **Example**: After completing development tasks, update PRD to reflect current status **`prd-update-decisions`** - **Purpose**: Update PRD based on design decisions and strategic changes made during conversations - **Use when**: Architecture, workflow, or requirement decisions were made in conversation that need to be captured in the PRD - **Example**: After making architectural decisions that affect the original PRD scope **`prd-done`** - **Purpose**: Complete PRD implementation workflow - create branch, push changes, create PR, merge, and close issue - **Enhanced PR Creation**: Automatically detects and uses project PR templates (e.g., `.github/PULL_REQUEST_TEMPLATE.md`), analyzes git changes to auto-fill PR content, and prompts for information requiring human judgment - **Template Integration**: Works seamlessly with PR templates generated by the [Project Setup](project-setup.md) tool - **Use when**: Finished implementing a PRD and ready to deploy and close out the work - **Example**: All PRD tasks completed and ready for final deployment and closure **`prd-close`** - **Purpose**: Close a PRD that is already implemented or no longer needed without creating a PR - **Use when**: PRD functionality already exists (implemented elsewhere), PRD is superseded, or requirements changed - **Example**: Closing a PRD whose features were already implemented in a separate project ### Development Prompts **`generate-dockerfile`** - **Purpose**: Generate production-ready, secure, multi-stage Dockerfile and .dockerignore for any project - **Use when**: Containerizing an application for production deployment, or improving an existing Dockerfile - **Example**: Setting up Docker for a new project, or 
fixing security issues in an existing Dockerfile **`generate-cicd`** - **Purpose**: Generate intelligent CI/CD workflows through interactive conversation by analyzing repository structure and user preferences - **Use when**: Setting up CI/CD pipelines that understand your project's existing automation (Makefile, npm scripts, etc.) - **Example**: Adding GitHub Actions workflows that use your project's build system rather than raw commands ## Example Workflows ### Workflow 1: Complete PRD Lifecycle - **Start new feature**: Use `prd-create` prompt to create comprehensive requirements document 1. GitHub issue created with PRD label 2. PRD file generated with proper naming 3. Complete documentation content written across multiple files with traceability - **Check priorities**: Use `prds-get` prompt to see all active PRDs and priorities 1. Open PRD issues fetched from GitHub 2. Issues formatted with status analysis 3. Next steps recommendations provided - **Begin implementation**: Use `prd-start` prompt to begin working on specific PRD 1. Target PRD auto-detected from context 2. PRD readiness validated 3. Feature branch created 4. First implementation task identified with detailed plan - **Get next task**: Use `prd-next` prompt to identify highest-priority remaining work 1. Current PRD state analyzed 2. Single highest-value next task identified 3. Implementation design guidance provided - **Update decisions**: Use `prd-update-decisions` prompt when design decisions are made during implementation 1. Conversation context analyzed for design decisions 2. Decision impact assessed across requirements and scope 3. PRD sections updated with new decisions and rationale - **Update progress**: Use `prd-update-progress` prompt after completing implementation tasks 1. Git commits and code changes analyzed 2. Changes mapped to PRD requirements 3. PRD checkboxes updated with work log entry - **Finalize**: Use `prd-done` prompt to deploy, merge, and close out completed work 1. 
Pre-completion validation performed 2. Pull request created and merged 3. GitHub issue closed with final validation ### Workflow 2: Containerization #### Video: Production Dockerfile Generation [![My Production Dockerfile Rules: How I Build Docker Images](https://img.youtube.com/vi/ueTe-VQaD7c/maxresdefault.jpg)](https://youtu.be/ueTe-VQaD7c) This video demonstrates how to use the `generate-dockerfile` prompt to create production-ready, secure, multi-stage Dockerfiles. Watch how the prompt analyzes your project structure and generates optimized Docker configurations following best practices. - **Containerize project**: Use `generate-dockerfile` prompt to create production-ready Docker configuration 1. Project structure analyzed (language, framework, dependencies) 2. Multi-stage Dockerfile generated with security best practices 3. Minimal .dockerignore created based on Dockerfile's COPY commands 4. Image built and validated automatically 5. Container tested to ensure application runs correctly - **Improve existing Dockerfile**: Use same prompt when Dockerfile already exists 1. Existing Dockerfile analyzed against best practices 2. Security issues identified (running as root, :latest tags, etc.) 3. Optimizations applied while preserving intentional customizations 4. Changes explained with rationale ### Workflow 3: CI/CD Generation - **Generate CI/CD workflows**: Use `generate-cicd` prompt to create intelligent workflows 1. CI platform confirmed (GitHub Actions supported, feature request offered for others) 2. Repository analyzed (language, automation, existing CI, deployment mechanism) 3. Findings presented for user confirmation 4. Workflow choices presented (PR workflow, release triggers, deployment strategy) 5. Workflows generated using project automation (npm test, make build, etc.) 6. Required secrets and permissions documented 7. 
Workflows committed and validated

## Cross-Agent Compatibility

### Supported Coding Agents

**Agents with Full Slash Command Support**: ✅
- **Claude Code**: Prompts appear as `/dot-ai:prompt-name`
- **VS Code + GitHub Copilot**: Prompts appear as `/mcp.dot-ai.prompt-name`
- Both support native slash command integration and prompt discovery
- Tools appear in agent configuration menus
- Full metadata support with descriptions

**Other MCP-Compatible Agents**: 🤔 Expected to work (not validated)
- Other MCP-compatible clients like Cursor should work since they follow MCP specifications
- Different agents may use different slash command formats (e.g., `/mcp.dot-ai.prompt-name` vs `/dot-ai:prompt-name`)
- **Help us validate**: Try these prompts in your agent and [report your experience via GitHub issues](https://github.com/vfarcic/dot-ai/issues)

## Contributing Prompts

Have a useful prompt to share? Contribute it to the shared library:

1. **Fork the repository** and create a feature branch
2. **Add your prompt** to the `shared-prompts/` directory following existing naming conventions
3. **Update the documentation** by adding your prompt to the "Available Prompts" section above
4. **Submit a pull request** with a clear description of what the prompt does and when to use it

### Prompt Metadata Format

Each prompt file must include YAML frontmatter that defines how it appears in coding agents:

```yaml
---
name: your-prompt-name
description: One-line summary shown in the agent's command menu
category: development
---

# Your Prompt Content

Your prompt instructions go here...
```

**Metadata Fields:**
- **`name`**: Becomes the slash command name (e.g., `name: prd-create` → `/dot-ai:prd-create`)
- **`description`**: Shows up in coding agent command menus and help text
- **`category`**: Used for organizing prompts in documentation (must be one of: `project-management`, `development`)

**How It Works:**
1. **Prompt Discovery**: Your coding agent automatically discovers all available prompts and their metadata
2.
**Slash Commands**: Each prompt appears as a slash command — the `name` becomes the command, `description` appears in menus 3. **Organization**: The `category` field groups prompts in documentation for easy browsing **Contribution Guidelines:** - Use descriptive, kebab-case names (e.g., `database-optimization`, `api-security-review`) - Include clear purpose and usage examples in your PR description - Test your prompt across different scenarios before contributing - Follow the established prompt format and documentation patterns ## User-Defined Prompts Serve custom prompts from your own git repository. Your prompts appear alongside built-in prompts. ### Why User-Defined Prompts? - **Agent-agnostic**: Prompts work with any compatible coding agent (Claude Code, Cursor, VS Code, etc.) without maintaining separate prompt directories for each tool - **Team consistency**: Share standard prompts across all projects without contributing to the core project - **Organization-specific workflows**: Create prompts tailored to your team's processes - **Version control**: Manage prompts through standard git workflows (commit, push, PR) - **Works everywhere**: Prompts work across all Kubernetes deployments ### Configuration Configure user prompts via environment variables: | Variable | Purpose | Default | |----------|---------|---------| | `DOT_AI_USER_PROMPTS_REPO` | Git repository URL (HTTPS) | None (feature disabled) | | `DOT_AI_USER_PROMPTS_BRANCH` | Branch to use | `main` | | `DOT_AI_USER_PROMPTS_PATH` | Subdirectory within repo | Root directory | | `DOT_AI_GIT_TOKEN` | Authentication token for private repos | None | | `DOT_AI_USER_PROMPTS_CACHE_TTL` | Cache duration in seconds | `86400` (24 hours) | **Supported Git Providers:** - GitHub (github.com) - GitLab (gitlab.com or self-hosted) - Gitea / Forgejo (self-hosted) - Bitbucket (bitbucket.org) - Any git server supporting HTTPS clone ### Repository Setup Create a git repository with prompt files as markdown (`.md`) files: ``` 
my-team-prompts/
├── deploy-app.md
├── review-pr.md
└── team-standup.md
```

Or use a subdirectory within an existing repository:

```
my-project/
├── src/
├── docs/
└── prompts/          # Set DOT_AI_USER_PROMPTS_PATH=prompts
    ├── deploy-app.md
    └── review-pr.md
```

### Prompt File Format

User prompts follow the same format as built-in prompts, with optional MCP arguments support:

```yaml
---
name: deploy-app
description: Deploy the application to a target environment
category: development
arguments:
  - name: environment
    description: Target environment for the deployment
    required: true
  - name: version
    description: Version to deploy
    required: false
---

# Deploy Application

Deploy the application to {{environment}}.

{{#if version}}
Deploying version: {{version}}
{{/if}}

## Steps

1. Verify the deployment configuration
2. Run pre-deployment checks
3. Execute deployment to {{environment}}
4. Validate deployment success
```

**Metadata Fields:**
- **`name`**: Becomes the slash command (e.g., `name: deploy-app` → `/dot-ai:deploy-app`)
- **`description`**: Shows in coding agent command menus
- **`category`**: Organizes prompts in documentation
- **`arguments`**: Optional parameters (MCP prompt-argument shape: `name`, `description`, `required`) substituted via `{{argumentName}}` placeholders

### Deployment Configuration

#### Kubernetes (Helm)

Add environment variables via `extraEnv` in your Helm values:

```bash
helm upgrade --install dot-ai oci://ghcr.io/vfarcic/helm/dot-ai \
  --namespace dot-ai --create-namespace \
  --set ai.anthropic.apiKey="${ANTHROPIC_API_KEY}" \
  --set ai.embeddings.openai.apiKey="${OPENAI_API_KEY}" \
  --set-json 'extraEnv=[
    {"name":"DOT_AI_USER_PROMPTS_REPO","value":"https://github.com/your-org/team-prompts.git"},
    {"name":"DOT_AI_USER_PROMPTS_PATH","value":"prompts"},
    {"name":"DOT_AI_GIT_TOKEN","value":"'"${DOT_AI_GIT_TOKEN}"'"}
  ]'
```

### How It Works

1. **First access**: Repository is cloned to a local cache directory
2. **Subsequent access**: Repository is pulled if cache TTL has expired
3. **Merging**: User prompts are merged with built-in prompts
4.
**Precedence**: Built-in prompts take precedence over user prompts with the same name ### Error Handling The feature is designed for graceful degradation: | Scenario | Behavior | |----------|----------| | Repository not configured | Built-in prompts only (no error) | | Clone fails (auth, network) | Built-in prompts only, error logged | | Pull fails | Cached version used, warning logged | | Invalid prompt format | Prompt skipped, warning logged | | Name collision with built-in | User prompt skipped, warning logged | ### Troubleshooting User Prompts **User prompts don't appear** - **Cause**: Repository not configured or clone failed - **Solution**: Verify `DOT_AI_USER_PROMPTS_REPO` is set and accessible - **Check**: Run "Show dot-ai status" to verify prompt loading and connectivity **Private repository auth fails** - **Cause**: Missing or invalid `DOT_AI_GIT_TOKEN` - **Solution**: Set a valid personal access token (PAT) with repo read access - **Note**: Tokens are never logged; URLs are sanitized in log output **Changes not appearing** - **Cause**: Cache hasn't expired yet - **Solution**: Wait for TTL to expire, or set `DOT_AI_USER_PROMPTS_CACHE_TTL=0` for testing **Prompt has same name as built-in** - **Cause**: Name collision with built-in prompt - **Solution**: Rename your prompt to a unique name - **Note**: Built-in prompts always take precedence ## Troubleshooting ### Common Issues **Prompts don't appear in command menu** - **Cause**: Server not connected or prompts capability not enabled - **Solution**: Check connection status and server configuration - **See**: [Deployment Guide](../setup/deployment.md) for server troubleshooting, [MCP client setup](/docs/mcp) for connection issues **Prompt execution fails with "not found" error** - **Cause**: Prompt ID mismatch or server synchronization issue - **Solution**: Refresh the connection or restart your coding agent - **Workaround**: Disconnect and reconnect to the server **Prompts work in one agent but not 
another** - **Cause**: Agent-specific MCP implementation differences - **Solution**: Check agent-specific compatibility notes above - **Alternative**: Use a fully compatible agent for prompt-heavy workflows ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](overview.md)** - Browse all available tools and features # Cluster Query Guide **Complete guide for using natural language to query your Kubernetes cluster with the DevOps AI Toolkit.** ## See It In Action [![Why Kubernetes Querying Is Broken and How I Fixed It](https://img.youtube.com/vi/uUdbQkq5c4k/maxresdefault.jpg)](https://youtu.be/uUdbQkq5c4k) ## Using via Web UI These tools are also available through the [Web Dashboard](/docs/ui). ## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md). **Optional but strongly recommended:** - Scanned capabilities - see [Capability Management](capability-management.md) - Resources synced to Vector DB - automatically handled by the [dot-ai-controller](/docs/controller) > **Note**: While the query tool technically works without capabilities and resources in the Vector DB (falling back to raw kubectl commands), AI performance and accuracy will be severely reduced. Capabilities provide semantic understanding (knowing that "database" relates to StatefulSet, CNPG, etc.), and resources provide fast inventory lookups. Without these, the AI loses its ability to answer concept-based queries and must rely solely on kubectl, resulting in slower and less intelligent responses. ## Related Projects **[dot-ai-controller](/docs/controller)** - A Kubernetes controller that automatically scans cluster capabilities and syncs resources to the Vector DB. This enables the query tool's semantic understanding and fast inventory lookups without manual setup. 
## Overview The DevOps AI Toolkit query feature provides: - **Natural language cluster queries** - Ask questions about your cluster in plain English - **Semantic understanding** - Query by concept ("databases", "web servers") not just resource names - **Multi-source intelligence** - Combines Vector DB knowledge with live kubectl data - **AI-powered orchestration** - LLM decides which tools to use based on your intent - **Read-only operations** - Safe cluster exploration without risk of modifications ### How AI-Driven Querying Works **Traditional Problem**: Finding cluster resources requires knowing kubectl syntax, label conventions, and resource types. **AI Query Solution**: Describe what you want to find in natural language, and AI orchestrates the right tools. **Query Flow**: 1. **Intent Analysis**: AI interprets your natural language query 2. **Tool Selection**: AI chooses appropriate tools (Vector DB search, kubectl commands) 3. **Data Gathering**: Tools execute and return results 4. **Response Synthesis**: AI summarizes findings in human-readable format ### Three Data Sources The query tool combines three data sources to answer your questions intelligently: | Source | What it provides | Example use | |--------|------------------|-------------| | **Capabilities** | Semantic understanding - knows that "database" relates to StatefulSet, CNPG Cluster, Crossplane RDS, etc. | "What databases are running?" | | **Resources** | Searchable inventory - find resources by name patterns, labels, or annotations without knowing kubectl syntax or exact resource locations | "Find all resources with label team=platform" | | **Kubernetes API** | Live cluster state - real-time status, events, logs, and detailed resource information via kubectl | "Describe the my-postgres cluster" | This combination enables natural language queries while providing accurate, live cluster data. 
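The capabilities source in the table above can be pictured as a mapping from concepts to concrete resource kinds; a toy shell sketch (the real system does this with semantic search over the Vector DB, and the kinds listed here are illustrative):

```bash
# Toy model of the "capabilities" source: map a concept from a natural
# language query to concrete resource kinds. The real implementation uses
# semantic search over a Vector DB; this hard-coded mapping is illustrative.
concept_to_kinds() {
  case "$1" in
    database)  echo "statefulsets.apps clusters.postgresql.cnpg.io" ;;
    webserver) echo "deployments.apps services ingresses.networking.k8s.io" ;;
    *)         echo "" ;;
  esac
}

kinds=$(concept_to_kinds database)
echo "kinds for 'database': $kinds"
# The live-data step would then query the Kubernetes API, e.g.:
#   kubectl get $kinds --all-namespaces
```

This is why capabilities matter: without the concept-to-kinds step, the AI has to guess which resource types to inspect with kubectl.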
## Complete Workflow Examples > **Tip**: If your AI coding agent doesn't automatically use the query tool, add "use dot-ai" to your query (e.g., "What databases are running? Use dot-ai"). This section demonstrates common query patterns: - **[Semantic Queries](#semantic-queries)** - Find resources by concept - **[Direct Queries](#direct-queries)** - Ask about specific resources or namespaces - **[Status Queries](#status-queries)** - Check health and troubleshoot issues ### Direct Queries Direct queries ask about specific resources. The AI finds the resource and retrieves detailed information without needing you to specify the exact namespace or resource type. #### Example: Describe a specific resource ```text Describe the my-postgres cluster ``` **Response:** ```text my-postgres Cluster Type: CloudNativePG (CNPG) PostgreSQL Cluster Namespace: default Status: Healthy (1/1 instances ready) Cluster Overview | Property | Value | |--------------------|-----------------| | PostgreSQL Version | 17.4 | | Age | 22 minutes | | Instances | 1 (single node) | | Primary Pod | my-postgres-1 | | Database | app | Configuration | Setting | Value | |-----------------------|-----------------------| | Storage | 1Gi persistent volume | | Encoding | UTF8 (C locale) | | WAL Level | logical | | TLS | TLSv1.3 (min/max) | | Max Parallel Workers | 32 | | Max Replication Slots | 32 | | Superuser Access | Disabled (secure) | Services | Service | Purpose | |----------------|----------------------| | my-postgres-rw | Read-write (primary) | | my-postgres-r | Read | | my-postgres-ro | Read-only | The cluster is production-ready with enterprise features including automated failover support, backup/recovery capabilities, and secure TLS communication. ``` **What happened**: The AI searched the resources inventory to locate `my-postgres`, identified it as a CNPG Cluster in the default namespace, then used the Kubernetes API to get detailed status and configuration. 
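The service names in the response follow CNPG's `<cluster>-rw` / `-r` / `-ro` convention, so an application in the same cluster would typically target the read-write service. An illustrative Deployment environment fragment (the host name assumes the `my-postgres` cluster in the `default` namespace, as in the example above):

```yaml
# Illustrative only: point an application at the CNPG read-write service.
env:
  - name: DATABASE_HOST
    value: my-postgres-rw.default.svc.cluster.local  # <cluster>-rw service
  - name: DATABASE_PORT
    value: "5432"
```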
# Recommendation Guide **Complete guide for AI-powered Kubernetes deployment recommendations with the DevOps AI Toolkit.** ## See It In Action [![DevOps AI Toolkit: AI-Powered Application Deployment](https://img.youtube.com/vi/8Yzn-9qQpQI/maxresdefault.jpg)](https://youtu.be/8Yzn-9qQpQI) This video explains the platform engineering problem and demonstrates the Kubernetes deployment recommendation workflow from intent to running applications. ## Using via Web UI These tools are also available through the [Web Dashboard](/docs/ui). ## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md) to set up: - DevOps AI Toolkit deployed - AI model API key configured (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for supported models and setup) - `KUBECONFIG` pointing to your Kubernetes cluster (optional but recommended) **Required - Capability Management:** - Vector DB service (Qdrant) for capability storage - Cluster capabilities discovered via [Capability Management Guide](capability-management.md) - **Note**: Recommendations will fail without capabilities - the system requires semantic understanding of your cluster resources **Optional - Enhanced with Organizational Patterns:** - Organizational patterns created via [Pattern Management Guide](../organizational-data/patterns.md) - When configured, recommendations automatically include organizational best practices ## Overview **What it does**: Provides AI-powered Kubernetes deployment recommendations using cluster capabilities, and installs third-party applications via Helm charts when no matching capability exists. Features semantic capability matching, organizational pattern integration, and guided workflow from intent clarification through deployment. **Use when**: You need intelligent deployment recommendations that leverage your cluster's actual capabilities, or want to install third-party tools (Prometheus, Argo CD, Crossplane, etc.) 
via Helm charts with AI-guided configuration. 📖 **Full Guide**: This document covers the complete recommendation workflow with detailed examples and behind-the-scenes explanations. ### Key Features The DevOps AI Toolkit recommendation feature provides: - **Intent refinement guidance** - When intent is vague, provides guidance for the client agent (Claude Code, Cursor, etc.) to analyze the project and gather details before proceeding - **Semantic capability understanding** - Uses discovered resource capabilities for intelligent semantic matching - **AI-powered recommendations** - Suggests optimal deployment approaches based on capability-filtered resources with rich context - **Enhanced resource discovery** - Finds resources like `sqls.devopstoolkit.live` for "PostgreSQL database" requests through semantic search - **Organizational pattern integration** - Automatically enhances recommendations with your team's best practices and standards - **Guided configuration** - Step-by-step configuration collection for your deployments - **Manifest generation** - Creates ready-to-deploy Kubernetes YAML files - **Helm chart installation** - Discovers and installs third-party applications via ArtifactHub when no cluster capability matches - **Automated deployment** - Deploys and monitors your applications ### How Capability-Enhanced Recommendations Work **Traditional Problem**: AI sees `sqls.devopstoolkit.live` as a meaningless name among many cluster resources. **Capability Solution**: AI understands that `sqls.devopstoolkit.live` provides PostgreSQL database capabilities with multi-cloud support. **Recommendation Flow**: 1. **User Intent**: "I need a PostgreSQL database" 2. **Capability Search**: Semantic search finds relevant resources with capability context 3. **Pre-filtered Results**: AI gets semantically relevant resources instead of all generic resource names 4. 
**Enhanced Recommendations**: Optimal matches like `sqls.devopstoolkit.live` appear as top suggestions **Helm Fallback**: When no cluster capability matches (e.g., "Install Prometheus" but no Prometheus operator installed), the system automatically searches ArtifactHub for Helm charts and presents installation options. ## Complete Workflow Examples ### Example 1: Stateless Golang Web Application ```md User: I want to deploy my app ``` ```md Agent: [Uses mcp__dot-ai__recommend with intent "deploy my app"] The intent needs more detail. Let me analyze your project to understand what you're deploying. [Agent analyzes the project directory, reads package.json, Dockerfile, source code, etc.] Based on my analysis of your project: - **Language**: Go (found go.mod, main.go) - **Type**: Web application (HTTP server on port 8080) - **Dependencies**: Connects to PostgreSQL database (found database connection code) - **Container**: Ready to containerize (Dockerfile present) Is this for production or development? And should the app be publicly accessible? ``` ```md User: It's for production, needs external access, expects moderate traffic with occasional spikes. ``` ```md Agent: Perfect! Based on my analysis and your requirements, let me get production-focused deployment recommendations. [Uses mcp__dot-ai__recommend with refined intent "deploy stateless Golang web application for production with external access via ingress, moderate traffic with occasional spikes requiring autoscaling, connects to PostgreSQL database, container image available" and final: true] I found 5 deployment solutions for your Golang web application. 
All include: - Deployment - manages your stateless app with rolling updates - Service - internal load balancing and service discovery - Ingress - external HTTP/HTTPS access - HorizontalPodAutoscaler - handles traffic spikes The difference is how PostgreSQL is provisioned: | # | Solution | Database Approach | Score | |-----|---------------------------|----------------------------------------------------|-------| | 1 | Multi-cloud (Recommended) | DevOps Toolkit SQL - portable across AWS/GCP/Azure | 96 | | 2 | Azure-native | Azure Flexible Server + ResourceGroup | 92 | | 3 | AWS-native | RDS DBInstance | 90 | | 4 | GCP-native | Cloud SQL DatabaseInstance + Database | 88 | | 5 | UpCloud | ManagedDatabasePostgreSQL | 85 | Organizational Context: - Solutions 1 & 2 follow your organization's patterns (DevOps Toolkit DB pattern, Azure ResourceGroup pattern) - Policy enforced: Applications must never use the latest image tag - Database regions are standardized (us-east-1 for AWS, us-east1 for GCP, eastus for Azure) 📊 **View visualization**: http://localhost:3000/v/sol-... Which solution would you like to proceed with? 
(1-5) ``` **What happened behind the scenes:** - First call with vague intent ("deploy my app" - under 100 characters) triggered intent refinement guidance - Agent received guidance to analyze the project and gather details before proceeding - Agent used its local tools (file reading, code analysis) to discover: Go language, web app type, PostgreSQL dependency, Dockerfile - Agent discussed findings with user and gathered missing context (production, external access, traffic pattern) - Second call with detailed intent + `final: true` proceeded to recommendations - Capability search found relevant resources: Deployment, Service, Ingress, HorizontalPodAutoscaler, plus database operators - Vector DB search found matching organizational patterns: DevOps Toolkit DB pattern, Azure ResourceGroup pattern - Generated 5 solution alternatives with different cloud provider options, scoring based on capability relevance and pattern alignment - Policy intents automatically enforced (e.g., no `latest` image tags, standardized regions) - Visualization URL generated for exploring solutions in Web UI (shows resources, patterns, and configuration details) #### Visualization (Optional) The visualization URL opens an interactive view in the Web UI showing all solution options with their resources, patterns, and scoring: > **Note**: Visualization URLs require additional setup: > - Configure `webUI.baseUrl` in your Helm values (see [Deployment Guide - Web UI Visualization](../setup/deployment.md#web-ui-visualization)) > - Install the Web UI (see [Web UI Documentation](/docs/ui)) **Note:** Requires capability scanning via [Capability Management Guide](capability-management.md) and optionally organizational patterns via [Pattern Management](../organizational-data/patterns.md). Policy requirements shown require active policy intents created via [Policy Management Guide](../organizational-data/policies.md). ```md User: I'll go with option 1 ``` ```md Agent: Excellent choice! 
Let me get the configuration requirements. [Uses mcp__dot-ai__recommend with stage: 'chooseSolution', solutionId: 'sol-1765409923079-fa3f055c'] I need some required configuration details: **Application:** - Name - What should the app be called? (e.g., my-golang-app) - Namespace - Where to deploy? Options: a-team, b-team, default, dot-ai - Container image - Full image URI with version tag (e.g., myregistry.io/app:v1.0.0) - latest tag is not allowed per policy - Port - What port does your app listen on? (default: 8080) - Ingress hostname - Public DNS name (e.g., myapp.example.com) **Database:** - Cloud provider - aws, azure, or gcp? - Size - small, medium, or large? - PostgreSQL version - e.g., 15, 14.8 **Output:** - Format - raw YAML, helm chart, or kustomize? - Output path - Where to save files? (e.g., ./manifests) You can answer all at once or just the ones you want to customize (I'll use sensible defaults for the rest). ``` **What happened behind the scenes:** - recommend tool with stage 'chooseSolution' loaded the selected solution and returned only the required stage questions - Questions generated from resource schema analysis: Deployment needs name/image/port, Ingress needs hostname, database needs provider/size/version - Policy intents enforced: "latest" image tag not allowed (flagged in question) - New packaging questions added: `outputFormat` (raw/helm/kustomize) and `outputPath` for GitOps-ready output - System categorizes questions as required (must answer) to proceed ```md User: my-app, a-team namespace, ghcr.io/vfarcic/silly-demo:v1.5.223, port 8080, silly-demo.acme.com for ingress. AWS provider, small database, PostgreSQL 15. Output as Kustomize to ./manifests. ``` ```md Agent: Perfect! Let me configure the required settings. 
[Uses mcp__dot-ai__recommend with stage: 'answerQuestion:required', solutionId, answers: {name: "my-app", namespace: "a-team", container-image: "ghcr.io/vfarcic/silly-demo:v1.5.223", application-port: 8080, ingress-host: "silly-demo.acme.com", database-cloud-provider: "aws", database-size: "small", database-version: "15.0", outputFormat: "kustomize", outputPath: "./manifests"}] Required configuration saved. Now for basic settings (all optional - I can use defaults): **Scaling:** - Initial replicas: 2 - Min replicas (HPA): 2 - Max replicas (HPA): 10 **Resources per pod:** - CPU: 250m request / 500m limit - Memory: 256Mi request / 512Mi limit **Networking:** - Ingress class: nginx - Enable TLS/HTTPS: yes **Database:** - Database name: appdb Would you like to customize any of these, or should I proceed with these defaults? (say "skip" to use defaults) ``` **What happened behind the scenes:** - recommend tool with stage 'answerQuestion:required' processed required stage answers and updated solution file - Validated all required fields were provided (name, namespace, image, port, ingress host, database config) - Captured output format (Kustomize) and path (./manifests) for later manifest generation - Determined next stage is 'basic' and returned basic configuration questions with sensible defaults - Questions are optional at this stage - user can answer, customize specific values, or skip to use defaults ```md User: 2 replicas, DB name: mydb ``` ```md Agent: Basic settings saved. 
Now for advanced configuration (all optional): [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:basic', solutionId, answers: {replicas: 2, hpa-min-replicas: 2, hpa-max-replicas: 10, cpu-request: "250m", cpu-limit: "500m", memory-request: "256Mi", memory-limit: "512Mi", ingress-class: "nginx", database-name: "mydb", enable-tls: true}] **Deployment Strategy:** - RollingUpdate with max surge 1, max unavailable 0 (zero-downtime) **Health Probes:** - Readiness: /health, 10s initial delay, 10s period - Liveness: /health, 30s initial delay, 15s period **Autoscaling:** - Target CPU: 75% - Target memory: 75% **Resilience:** - PodDisruptionBudget: min 1 available - Termination grace period: 30s - Pod anti-affinity: spread across nodes **Security:** - Run as non-root: yes - Read-only filesystem: no - TLS secret name: app-tls-cert **Database:** - Backups enabled: yes Would you like to customize any of these, or say "skip" to use defaults? ``` **What happened behind the scenes:** - recommend tool with stage 'answerQuestion:basic' processed basic stage answers (replicas: 2, database-name: mydb, plus defaults) - Updated solution file and determined next stage is 'advanced' - Returned advanced configuration questions focused on deployment strategy, health probes, autoscaling, resilience, and security ```md User: Skip advanced configuration ``` ```md Agent: No problem! Moving to the final configuration stage. [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:advanced', solutionId, answers: {}] Any additional requirements or constraints for your deployment? ``` **What happened behind the scenes:** - recommend tool with stage 'answerQuestion:advanced' processed empty advanced stage answers (using defaults for all) - Determined next stage is 'open' and returned the open-ended question ```md User: N/A ``` ```md Agent: Perfect! Configuration complete. 
[Uses mcp__dot-ai__recommend with stage: 'answerQuestion:open', solutionId, answers: {open: "N/A"}] Configuration complete. Ready to generate deployment manifests. ``` **What happened behind the scenes:** - recommend tool with stage 'answerQuestion:open' processed open stage with "N/A" (no additional requirements) - All configuration stages now complete (required ✓, basic ✓, advanced defaults, open ✓) - Returned status: 'ready_for_manifest_generation', nextAction: 'generateManifests' ```md User: Generate the manifests ``` ```md Agent: Configuration complete. Generating your Kustomize manifests... [Uses mcp__dot-ai__recommend with stage: 'generateManifests', solutionId] Your Kustomize manifests are ready at ./manifests/: manifests/ ├── kustomization.yaml # Points to overlays/production ├── README.md ├── overlays/ │ └── production/ │ └── kustomization.yaml # ← Edit this file to customize └── base/ ├── kustomization.yaml ├── namespace.yaml ├── deployment.yaml # Image without tag (tag set in overlay) ├── service.yaml ├── ingress.yaml ├── hpa.yaml ├── pdb.yaml ├── sql.yaml └── secret.yaml **What's included:** - **Deployment** - 2 replicas, rolling updates, pod anti-affinity, non-root security - **Service** - ClusterIP on port 80 - **Ingress** - silly-demo.acme.com with TLS via cert-manager - **HPA** - scales 2-10 replicas based on CPU/memory (75% threshold) - **PDB** - ensures at least 1 pod available during disruptions - **SQL** - AWS PostgreSQL 15 (small) via DevOps Toolkit pattern - **Secret** - database credentials (placeholder - will be populated by Crossplane) **To customize:** Edit `overlays/production/kustomization.yaml`: - `image:` section to change image tag (for upgrades) - `patches:` for other customizations Deploy with: kubectl apply -k ./manifests 📊 **View visualization**: http://localhost:3000/v/sol-... Would you like me to deploy this to your cluster now? 
``` **What happened behind the scenes:** - recommend tool with stage 'generateManifests' loaded the fully configured solution file - AI generated Kustomize structure based on `outputFormat: kustomize` selection - Raw manifests generated first, validated via `kubectl kustomize | kubectl apply --dry-run=server` - AI packaged validated manifests into Kustomize structure: - `base/` contains generic manifests (image without tag) - `overlays/production/kustomization.yaml` sets image tag and customizations - Response includes `files` array with `relativePath` and `content` for each file - Agent writes files to user's chosen `outputPath` (./manifests) - Each resource includes dot-ai.io/ labels for tracking - README.md generated with deployment instructions and overlay examples - Visualization URL generated for exploring deployment configuration in Web UI #### Visualization (Optional) The visualization URL opens an interactive view with multiple tabs including resource allocation, autoscaling configuration, health monitoring, generated manifests, traffic flow, and more: ```md User: Yes, deploy it! ``` ```md Agent: Deploying your application... [Uses mcp__dot-ai__recommend with stage: 'deployManifests', solutionId] 🎉 **Deployed successfully!** All resources were created: - namespace/a-team configured - deployment.apps/my-app created - service/my-app created - ingress.networking.k8s.io/my-app created - horizontalpodautoscaler.autoscaling/my-app created - poddisruptionbudget.policy/my-app created - sql.devopstoolkit.live/my-app-db created - secret/my-app-db-credentials created **Next steps:** 1. Wait for the SQL resource to provision the AWS RDS PostgreSQL instance (may take 5-10 minutes) 2. Once RDS is ready, Crossplane will populate the actual database credentials 3. The app will become healthy once it can connect to the database 4. 
Access your app at https://silly-demo.acme.com (ensure DNS points to your ingress controller) **Verify status:** kubectl get all -n a-team kubectl get sql -n a-team kubectl get ingress -n a-team ``` **What happened behind the scenes:** - recommend tool with stage 'deployManifests' applied Kustomize manifests using `kubectl apply -k` - Created all resources: Namespace, Deployment, Service, Ingress, HPA, PDB, SQL (database), Secret - SQL resource triggers Crossplane to provision AWS RDS PostgreSQL instance - Provided next steps and verification commands ### Example 2: Third-Party Application Installation (Helm) This example demonstrates installing third-party tools via Helm charts when no matching cluster capability exists. ```md User: Install Prometheus ``` ```md Agent: [Uses mcp__dot-ai__recommend with intent "install Prometheus monitoring" and final: true] Found a Prometheus Helm chart solution: **prometheus-community/prometheus (v27.49.0)** - App Version: v3.8.0 - Official chart maintained by prometheus-community - Verified Publisher with 523 stars - Repository: https://prometheus-community.github.io/helm-charts This is the standard, recommended chart for installing Prometheus monitoring on Kubernetes. 📊 **View visualization**: http://localhost:3000/v/sol-... Would you like me to proceed with this chart? I'll guide you through the configuration options. 
``` **What happened behind the scenes:** - System detected "Install Prometheus" as a third-party installation intent - No Prometheus-related CRDs found in cluster capabilities, triggering Helm chart discovery - ArtifactHub API searched for "prometheus" charts - AI analyzed and scored charts based on official status, verified publisher, and popularity - Visualization URL generated for exploring Helm chart details in Web UI #### Visualization (Optional) The visualization URL opens an interactive view with multiple tabs including monitoring stack architecture, data collection flow, chart information, stack components, and more: > **Note**: Visualization URLs require additional setup: > - Configure `webUI.baseUrl` in your Helm values (see [Deployment Guide - Web UI Visualization](../setup/deployment.md#web-ui-visualization)) > - Install the Web UI (see [Web UI Documentation](/docs/ui)) ```md User: Yes, continue ``` ```md Agent: [Uses mcp__dot-ai__recommend with stage: 'chooseSolution', solutionId] I need some details to configure your Prometheus installation: **Required Configuration:** - What name should be used for this Prometheus deployment? - Which namespace should Prometheus be deployed to? - Should Prometheus use persistent storage for metrics data? ``` **What happened behind the scenes:** - System fetched chart values.yaml and README via Helm CLI - AI analyzed configuration options and categorized into required/basic/advanced questions - Cluster context gathered: available namespaces, StorageClasses - Questions include suggested answers based on chart defaults and cluster state ```md User: prometheus, monitoring namespace, no persistence ``` ```md Agent: [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:required', solutionId, answers] Here are some basic configuration options. Would you like to customize these or use defaults? 
**Key settings (defaults shown):** - Data retention: 15 days - Scrape interval: 1 minute - AlertManager: enabled - kube-state-metrics: enabled (cluster object metrics) - node-exporter: enabled (node hardware metrics) - PushGateway: disabled - Ingress: disabled ``` ```md User: Enable AlertManager ``` ```md Agent: [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:basic', solutionId, answers] Basic configuration set. Now for advanced options. [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:advanced', solutionId, answers: {}] Skipping advanced options. Ready to generate the installation. [Uses mcp__dot-ai__recommend with stage: 'generateManifests', solutionId] Helm installation ready. Here's the configuration: **Helm Command:** helm upgrade --install prometheus prometheus-community/prometheus \ --namespace monitoring --create-namespace \ --version 27.49.0 -f values.yaml **Values (values.yaml):** server: global: scrape_interval: 1m scrape_timeout: 10s evaluation_interval: 1m retention: 15d persistentVolume: enabled: false alertmanager: enabled: true kube-state-metrics: enabled: true prometheus-node-exporter: enabled: true prometheus-pushgateway: enabled: false 📊 **View visualization**: http://localhost:3000/v/sol-... Ready to install? ``` **What happened behind the scenes:** - AI generated values.yaml containing only non-default values - System ran `helm upgrade --install --dry-run` to validate the installation - Returned helm command and values for user review ```md User: Yes, deploy ``` ```md Agent: [Uses mcp__dot-ai__recommend with stage: 'deployManifests', solutionId] Prometheus deployed successfully to the monitoring namespace. 
**Deployed Components:** - Prometheus Server - AlertManager - kube-state-metrics - node-exporter **Access Prometheus UI:** kubectl --namespace monitoring port-forward svc/prometheus-server 9090:80 Then open http://localhost:9090 **Access AlertManager:** kubectl --namespace monitoring port-forward svc/prometheus-alertmanager 9093:9093 Then open http://localhost:9093 **Internal DNS names:** - Prometheus: prometheus-server.monitoring.svc.cluster.local:80 - AlertManager: prometheus-alertmanager.monitoring.svc.cluster.local:9093 ``` **What happened behind the scenes:** - Helm repository added and `helm upgrade --install --wait` executed - Helm waited for all pods to become ready before reporting success - Access instructions provided based on installed components #### Post-Deployment Visualization After deployment, return to the Web UI visualization and click **Reload** to see updated runtime information including deployed components, pod status, and live cluster state: ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](overview.md)** - Browse all available tools and features # Remediate Guide **Complete guide for AI-powered Kubernetes issue analysis and remediation with the DevOps AI Toolkit.** ## Using via Web UI These tools are also available through the [Web Dashboard](/docs/ui). 
## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md) to set up: - DevOps AI Toolkit deployed - AI model API key configured (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for supported models and setup) - `KUBECONFIG` pointing to your Kubernetes cluster (optional for pure analysis) - kubectl command-line tool installed **Required Dependencies:** - kubectl access to cluster for investigation (get, describe, logs, events, top, explain operations) **Optional Dependencies:** - Write permissions for remediation execution (patch, apply, delete, scale, etc.) - only needed if you want to execute recommended fixes ## Overview The DevOps AI Toolkit remediate feature provides: - **AI-powered root cause analysis** - Multi-iteration investigation loop to understand Kubernetes issues - **Helm-aware diagnostics** - Detects and diagnoses issues with Helm-managed applications, including stuck releases and failed upgrades - **Intelligent data gathering** - Safe, read-only cluster investigation with targeted operations - **Comprehensive remediation plans** - Step-by-step solutions with risk assessment and validation - **Dual execution modes** - Manual approval workflow or automatic execution based on confidence/risk thresholds - **Iterative remediation support** - Handles complex issues requiring multiple sequential fixes - **Complete audit trail** - Full investigation history and remediation tracking ## Related Projects **[dot-ai-controller](/docs/controller/remediation-guide/)** - A Kubernetes controller that transforms reactive infrastructure management into proactive, AI-enhanced operations. The controller monitors Kubernetes cluster events and automatically triggers this project's remediation tool for intelligent analysis and resolution. 
It supports both automatic remediation (where AI directly implements solutions) and manual remediation (where AI provides recommendations for human review), with configurable filtering, rate limiting, Slack notifications, and confidence/risk thresholds. This enables continuous cluster health monitoring with intelligent auto-healing capabilities.

### How AI-Driven Investigation Works

**Traditional Problem**: Basic kubectl commands show symptoms but require manual analysis to identify root causes.

**AI Investigation Solution**: A multi-step analysis loop gathers relevant data and reasons through complex interdependencies.

**Investigation Flow**:

1. **Initial Analysis**: AI reviews the issue description
2. **Data Gathering Loop**: AI requests specific kubectl operations to gather targeted information
3. **Iterative Analysis**: Each data point informs the next investigation step
4. **Root Cause Identification**: AI synthesizes findings into a comprehensive analysis
5. **Remediation Generation**: Specific kubectl commands with risk assessment and validation instructions

## Complete Workflow Examples

This section demonstrates both manual and automatic execution modes:

- **[Manual Mode Example](#manual-mode-example)** - User controls execution timing with approval choices
- **[Automatic Mode Example](#automatic-mode-example)** - Tool executes automatically based on confidence/risk thresholds

### Manual Mode Example {#manual-mode-example}

#### Option 1: Automatic Execution

**User selects Choice 1** - Tool executes commands internally and returns complete results

```text
User: 1
```

```text
Agent: I'll execute the remediation automatically and validate the results.
[Uses remediate tool with executeChoice: 1, sessionId: "rem_2025-09-20T1653_818d80901b10ce62", mode: "manual"]

✅ **Issue Resolved Successfully!**

**Executed Command:**

kubectl apply -f - <
```

### Automatic Mode Example {#automatic-mode-example}

**Automatic Mode**: Tool executes the first remediation automatically, then falls back to manual mode when validation discovers additional issues

```
User: Something is wrong with my database in sql-demo namespace. Remediate it automatically.
```

```
Agent: I'll help you remediate the database issue in the sql-demo namespace. Let me use the remediate tool to automatically analyze and fix the problem.

[Uses remediate tool with issue: "Database issue in sql-demo namespace", mode: "automatic", maxRiskLevel: "medium", confidenceThreshold: 0.7]

The AI analysis has identified the root cause of your database issue with 95% confidence:

Root Cause: The SQL resource 'test-db' is failing because the KCL function is trying to concatenate 'POSTGRES_' with a missing 'version' field in the resource specification.

Recommended Action:
- Add a version field to the SQL resource with PostgreSQL version 13
- Risk Level: Low
- Command: kubectl patch sql/test-db -n sql-demo --type=merge -p '{"spec":{"version":"13"}}'

Choose execution method:
1. Execute automatically (Recommended) - Run the command automatically - Risk: Low
2. Execute via agent - I'll run the kubectl command using the Bash tool - Then validate the results - Risk: Low

Which option would you prefer?
(1 or 2)
```

**What happened behind the scenes:**

- Agent interpreted "automatically" and set confidence/risk thresholds (70% confidence, medium risk allowed)
- Tool executed the first remediation automatically (compositionRef issue): `kubectl patch sql/test-db -n sql-demo --type=merge -p '{"spec":{"crossplane":{"compositionRef":{"name":"google-postgresql"}}}}'`
- Post-execution validation discovered a new issue (missing version field)
- **Safety mechanism activated**: Tool switches to manual mode when validation finds additional issues
- User sees the current issue analysis, but the tool includes `"executed": true` and `"results": [...]` showing the previous automatic execution
- This prevents potentially dangerous infinite execution loops while maintaining transparency

**Key Automatic Mode Features Demonstrated:**

- **Single-issue auto-execution**: First issue (compositionRef) was fixed without user intervention
- **Intelligent thresholds**: Agent interpreted "automatically" as moderate risk tolerance (70% confidence, medium risk)
- **Safety fallback**: After the first fix, manual approval is required for additional issues discovered during validation
- **Complete transparency**: JSON shows both previous execution results and the current issue analysis
- **User control**: Users can specify execution preferences like "high confidence only" or "allow risky fixes"

# System Status Guide

**Quick diagnostics and health check for your DevOps AI Toolkit installation.**

## Prerequisites

Before using this guide, complete the [Deployment Guide](../setup/deployment.md).

## Overview

**What it does**: Provides comprehensive system diagnostics including version info, component health, connection status, and capability summary.

**Use when**: Troubleshooting setup issues, verifying installation, or checking what features are available.
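The same components can be sanity-checked from a terminal before asking the agent; a minimal sketch, assuming the `QDRANT_URL`, `ANTHROPIC_API_KEY`, and `OPENAI_API_KEY` variable names used in the troubleshooting section of this guide:

```shell
#!/usr/bin/env bash
# Pre-flight checks mirroring the components the status tool reports.
# The env var names are assumptions taken from this guide; adjust them
# to match your deployment.

check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK $name"
  else
    echo "FAIL $name"
  fi
}

check kubernetes kubectl cluster-info
check vector-db curl -fsS "${QDRANT_URL:-http://localhost:6333}/collections"
check ai-api-key test -n "${ANTHROPIC_API_KEY:-}${OPENAI_API_KEY:-}"
```

A `FAIL` line points at the same component the status table would flag, so you can jump straight to the matching troubleshooting entry below.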
## Usage Example ```md User: Show dot-ai status ``` ```md Agent: [Uses mcp__dot-ai__version] **System Health: ✅ Healthy** | Component | Status | |-------------|---------------------------------| | Version | 0.179.0 | | Kubernetes | Connected (v1.34.0, in-cluster) | | AI Provider | Anthropic (claude-haiku-4-5) | | Vector DB | Connected | | Embedding | OpenAI text-embedding-3-small | | Kyverno | v1.16.1 (ready) | | Tracing | OTLP → Jaeger | **Collections** | Collection | Status | Documents | |--------------|----------|-----------| | capabilities | ✅ Ready | 523 | | patterns | ✅ Ready | 12 | | policies | ✅ Ready | 8 | **Active Capabilities** - Policy intent management - Capability scanning - Semantic search - AI recommendations - Kubernetes integration - Kyverno policy generation 📊 **View visualization**: http://localhost:3000/v/ver-... ``` **What happened behind the scenes:** - System checked connectivity to all components (Kubernetes, Vector DB, AI provider) - Retrieved collection statistics from Qdrant - Verified Kyverno installation and readiness - Checked tracing configuration - Visualization URL generated for exploring system status in Web UI ### Visualization (Optional) The visualization URL opens an interactive dashboard showing system health, component status, and configuration details: > **Note**: Visualization URLs require additional setup: > - Configure `webUI.baseUrl` in your Helm values (see [Deployment Guide - Web UI Visualization](../setup/deployment.md#web-ui-visualization)) > - Install the Web UI (see [Web UI Documentation](/docs/ui)) ## Troubleshooting Common Issues ### Kubernetes Not Connected ```text | Kubernetes | ❌ Not connected | ``` **Causes:** - `KUBECONFIG` environment variable not set - Kubeconfig file doesn't exist or is invalid - Cluster is unreachable **Solutions:** 1. Verify kubeconfig exists: `ls $KUBECONFIG` 2. Test connectivity: `kubectl cluster-info` 3. 
Check that the server has access to the kubeconfig file

### Vector DB Not Connected

```text
| Vector DB | ❌ Not connected |
```

**Causes:**

- Qdrant service not running
- Wrong `QDRANT_URL` configuration
- Network connectivity issues

**Solutions:**

1. Verify Qdrant is running: `kubectl get pods -n dot-ai | grep qdrant`
2. Check the `QDRANT_URL` environment variable
3. Test connectivity: `curl $QDRANT_URL/collections`

### Collections Not Created

```text
| capabilities | ⚠️ Not created |
```

**Causes:**

- Capability scan hasn't been run yet
- Vector DB connectivity issues

**Solutions:**

1. Run a capability scan: See [Capability Management Guide](capability-management.md)
2. For patterns: See [Pattern Management Guide](../organizational-data/patterns.md)
3. For policies: See [Policy Management Guide](../organizational-data/policies.md)

### AI Provider Issues

```text
| AI Provider | ❌ Not configured |
```

**Causes:**

- Missing API key (e.g., `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`)
- Invalid API key
- Provider service unavailable

**Solutions:**

1. Verify the API key is set in the environment or Helm values
2. Test API key validity with the provider's API
3. Check [AI Model Configuration](../setup/deployment.md#ai-model-configuration)

## See Also

- **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration
- **[Capability Management Guide](capability-management.md)** - Scan cluster capabilities
- **[Tools and Features Overview](overview.md)** - Browse all available tools

## MCP

# MCP

**Connect to the DevOps AI Toolkit Engine via MCP protocol from your preferred coding assistant.**

MCP (Model Context Protocol) is an open protocol that lets AI coding assistants connect to external tools and data sources. When you connect your coding assistant to the DevOps AI Toolkit via MCP, all toolkit tools become available directly in your editor — no context switching required.
> **Alternative**: The [CLI](/docs/cli) provides command-line access to all toolkit capabilities — for AI agents (with lower token overhead), scripting, CI/CD pipelines, and direct resource access beyond what MCP exposes.

## Prerequisites

- DevOps AI Toolkit Engine deployed to Kubernetes — see [AI Engine Deployment](/docs/ai-engine/setup/deployment)
- An MCP-compatible client (Claude Code, Cursor, VS Code, Cline, etc.)
- Your server URL and auth token from the deployment step

## Step 1: Configure Your Client

Create an `.mcp.json` file in your project root:

```json
{
  "mcpServers": {
    "dot-ai": {
      "type": "http",
      "url": "http://dot-ai.127.0.0.1.nip.io",
      "headers": {
        "Authorization": "Bearer YOUR_TOKEN"
      }
    }
  }
}
```

Replace `YOUR_TOKEN` with the token from your deployment (run `echo $DOT_AI_AUTH_TOKEN` to view it).

**Notes**:

- Replace the URL with your actual hostname if you changed `ingress.host` during deployment.
- For production deployments with TLS, use `https://` in the URL.

## Step 2: Start Your Client

Start your MCP client. It will automatically connect to the server.

- **Claude Code**: Run `claude` in the directory containing `.mcp.json`
- **Cursor**: The MCP connection activates when you open the project
- **VS Code / Cline**: The MCP connection activates based on your extension settings

## Step 3: Verify the Connection

In your MCP client, ask:

```text
Show dot-ai status
```

You should see comprehensive system status including Kubernetes connectivity, vector database, and all available tools.

## Client Compatibility

The DevOps AI Toolkit works with any MCP-compatible coding agent or development tool.
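Rather than pasting the token into the configuration by hand, the Step 1 file can be generated from the environment; a minimal sketch, assuming `DOT_AI_AUTH_TOKEN` is still exported from the deployment step and the default ingress host is unchanged:

```shell
# Generate .mcp.json from the environment so the auth token never has to be
# pasted in manually. DOT_AI_AUTH_TOKEN and the host below are assumptions
# carried over from the deployment steps; adjust them to your setup.
DOT_AI_AUTH_TOKEN="${DOT_AI_AUTH_TOKEN:-REPLACE_ME}"

cat > .mcp.json <<EOF
{
  "mcpServers": {
    "dot-ai": {
      "type": "http",
      "url": "http://dot-ai.127.0.0.1.nip.io",
      "headers": {
        "Authorization": "Bearer ${DOT_AI_AUTH_TOKEN}"
      }
    }
  }
}
EOF
```

This also keeps the token out of version control as long as `.mcp.json` is listed in `.gitignore`.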
### Popular MCP Clients **Claude Code** - Save `.mcp.json` in your project root - Start with `claude` — tools are automatically available **Cursor** - Settings -> "MCP Servers" -> Add configuration -> Restart **Cline (VS Code Extension)** - Configure in VS Code settings or extension preferences **VS Code (with MCP Extension)** - Add configuration to `settings.json` under `mcp.servers` **Other MCP Clients** - Any client supporting the Model Context Protocol standard - Use the HTTP configuration pattern shown in [Step 1](#step-1-configure-your-client) ## Next Steps - **[Tools Overview](/docs/ai-engine/tools/overview)** — Explore all available tools and how they work together ## CLI # CLI Documentation **Command-line interface for AI-powered Kubernetes operations** ## What is the CLI? The CLI provides command-line access to all [DevOps AI Toolkit](/docs/ai-engine/) capabilities. It's a lightweight HTTP client designed for both AI agents and human operators who prefer terminal-based workflows. Unlike MCP (limited to 8 high-level tools to minimize context window usage), the CLI exposes **all REST API endpoints** since there's no token cost per command. This means you get access to direct resource queries, logs, events, and more—all from a single binary with zero runtime dependencies. 
**Key benefits:**

- **Single binary** — No installation dependencies, just download and run
- **Cross-platform** — Linux, macOS, Windows (amd64 + arm64)
- **Token efficient** — Lower token overhead than MCP for AI agents
- **Complete API access** — All 26 REST API endpoints (MCP exposes 8 tools)
- **Composable** — Shell piping, scripting, and CI/CD integration

## When to Use the CLI

The CLI is ideal for:

- **Scripting and automation** — Shell scripts, CI/CD pipelines, scheduled jobs
- **AI agent integration** — Lower token overhead than the MCP protocol
- **Direct API access** — Commands for resources, logs, events, and namespaces not available via MCP
- **Composability** — Pipe output between commands, combine with other CLI tools

For details on DevOps AI Toolkit features (query, recommend, remediate, etc.), see the [main documentation](/docs/ai-engine/).

## Getting Started

**[Quick Start](quick-start.md)** — Set up your AI agent to use the CLI

## Documentation

### Setup

- **[Installation](setup/installation.md)** — Homebrew, Scoop, binary download
- **[Configuration](setup/configuration.md)** — Server URL, authentication, output format
- **[Shell Completion](setup/shell-completion.md)** — Bash, Zsh, Fish autocompletion

### Guides

- **[Commands Overview](guides/cli-commands-overview.md)** — All available commands
- **[Skills Generation](guides/skills-generation.md)** — Enable AI agents to discover and use the CLI
- **[Output Formats](guides/output-formats.md)** — YAML vs JSON
- **[Automation](guides/automation.md)** — Scripting and CI/CD integration

## Architecture

```
┌─────────────┐
│     CLI     │
└──────┬──────┘
       │ HTTP (GET/POST/DELETE)
       │ Bearer auth, JSON body
       ▼
┌─────────────────────┐
│  DevOps AI Toolkit  │
│   REST API Server   │
└─────────────────────┘
```

The CLI is a stateless HTTP client that reads the embedded OpenAPI spec and generates commands dynamically. All commands map directly to REST API endpoints.
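Because every command is a thin wrapper over a REST endpoint, a call can be reproduced with plain `curl`; a sketch in which the `/version` path is a hypothetical example (real paths come from the server's OpenAPI spec) and the environment variables follow the configuration guide:

```shell
# What a CLI command does under the hood: an authenticated HTTP request.
# The /version path is an illustrative assumption, not a documented endpoint;
# the CLI derives the real paths from the embedded OpenAPI spec.
DOT_AI_URL="${DOT_AI_URL:-http://dot-ai.127.0.0.1.nip.io}"

curl -fsS \
  -H "Authorization: Bearer ${DOT_AI_AUTH_TOKEN:-}" \
  "${DOT_AI_URL}/version" \
  || echo "request failed with curl exit code $?"
```

This is occasionally useful for debugging authentication or ingress issues independently of the CLI itself.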
## Related Projects - **[DevOps AI Toolkit](/docs/ai-engine/)** — Main server (MCP + REST API) - **[Web UI](/docs/ui/)** — Visualizations and dashboards - **[Stack](/docs/stack/)** — Kubernetes deployment # Automation Use the CLI in scripts, CI/CD pipelines, and automated workflows. ## Exit Codes The CLI uses standard exit codes for automation: | Code | Meaning | |------|---------| | `0` | Success | | `1` | Tool execution error (server returned error) | | `2` | Connection error (server unreachable) | | `3` | Usage error (invalid arguments, missing required params) | ## Error Handling in Scripts **Check exit code:** ```bash #!/bin/bash if dot-ai <command>; then echo "Success" else echo "Failed with exit code $?" exit 1 fi ``` **Handle specific errors:** ```bash #!/bin/bash dot-ai <command> EXIT_CODE=$? case $EXIT_CODE in 0) echo "Success" ;; 1) echo "Server error" ;; 2) echo "Connection failed" ;; 3) echo "Invalid usage" ;; esac ``` ## CI/CD Integration ### GitHub Actions ```yaml name: Deploy on: [push] jobs: deploy: runs-on: ubuntu-latest steps: - name: Install CLI run: | curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-linux-amd64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai - name: Deploy application env: DOT_AI_URL: ${{ secrets.DOT_AI_URL }} DOT_AI_AUTH_TOKEN: ${{ secrets.DOT_AI_AUTH_TOKEN }} run: | dot-ai <command> --output json ``` ### GitLab CI ```yaml deploy: image: ubuntu:latest before_script: - apt-get update && apt-get install -y curl - curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-linux-amd64 -o /usr/local/bin/dot-ai - chmod +x /usr/local/bin/dot-ai script: - dot-ai <command> --output json variables: DOT_AI_URL: $DOT_AI_URL DOT_AI_AUTH_TOKEN: $DOT_AI_AUTH_TOKEN ``` ## Processing Output **Extract values with jq:** ```bash #!/bin/bash RESULT=$(dot-ai <command> --output json | jq -r '.result') echo "Result: $RESULT" ``` **Loop over array results:** ```bash #!/bin/bash dot-ai resources --kind Deployment --output json | \ jq -r 
'.items[].metadata.name' | \ while read name; do echo "Processing: $name" done ``` ## Configuration Best Practices **Use environment variables in CI/CD:** ```bash export DOT_AI_URL="https://dot-ai.example.com" export DOT_AI_AUTH_TOKEN="${SECRET_TOKEN}" export DOT_AI_OUTPUT_FORMAT="json" ``` **Don't hardcode credentials:** ```bash # Bad dot-ai <command> --token hardcoded-token # Good dot-ai <command> --token "${DOT_AI_AUTH_TOKEN}" ``` ## Scripting Examples **Conditional execution:** ```bash #!/bin/bash if dot-ai <command> --output json | jq -e '.healthy' > /dev/null; then echo "System healthy, proceeding..." # Continue with workflow else echo "System unhealthy, aborting" exit 1 fi ``` **Retry logic:** ```bash #!/bin/bash MAX_RETRIES=3 RETRY_COUNT=0 while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do if dot-ai <command>; then echo "Success" exit 0 fi RETRY_COUNT=$((RETRY_COUNT + 1)) echo "Retry $RETRY_COUNT/$MAX_RETRIES" sleep 5 done echo "Failed after $MAX_RETRIES attempts" exit 1 ``` ## Next Steps - **[Output Formats](output-formats.md)** — Control output for parsing - **[Configuration](../setup/configuration.md)** — Environment variables and flags - **[Commands Overview](cli-commands-overview.md)** — Available commands # Commands Overview The CLI exposes all DevOps AI Toolkit server capabilities as commands. Commands are automatically generated from the server's OpenAPI specification. ## Discovering Commands To see all available commands: ```bash dot-ai --help ``` To see help for a specific command: ```bash dot-ai <command> --help ``` For details on what each feature does, see the [server documentation](/docs/ai-engine/). 
## Global Flags These flags work with all commands: | Flag | Environment Variable | Description | |------|---------------------|-------------| | `--server-url` | `DOT_AI_URL` | Server URL (default: `http://localhost:3456`) | | `--token` | `DOT_AI_AUTH_TOKEN` | Authentication token | | `--output` | `DOT_AI_OUTPUT_FORMAT` | Output format: `yaml` or `json` (default: `yaml`) | | `--help` | - | Show command help | ## Usage Patterns **Basic command execution:** ```bash dot-ai <command> [arguments] [flags] ``` **With output format:** ```bash dot-ai <command> --output json ``` **Remote server:** ```bash dot-ai <command> --server-url https://remote:3456 --token mytoken ``` **Piping output:** ```bash dot-ai <command> --output json | jq '.result' ``` ## Next Steps - **[Skills Generation](skills-generation.md)** — Enable AI agents to use the CLI - **[Output Formats](output-formats.md)** — YAML vs JSON - **[Automation](automation.md)** — Use in scripts and CI/CD - **[Server Features](/docs/ai-engine/)** — What each command does # Output Formats Control how the CLI formats command output. ## Available Formats ### YAML (Default) Human-readable structured output. **When to use:** - Interactive terminal use - Reading output directly - Debugging and development **Example:** ```bash dot-ai version ``` **Output:** ```yaml version: 1.2.1 server: version: 1.2.1 healthy: true ``` ### JSON Raw API response, machine-parseable. **When to use:** - Scripting and automation - Piping to other tools (jq, etc.) 
- AI agents processing output - CI/CD pipelines **Example:** ```bash dot-ai version --output json ``` **Output:** ```json { "version": "1.2.1", "server": { "version": "1.2.1", "healthy": true } } ``` ## Setting Output Format **Command-line flag:** ```bash dot-ai <command> --output json dot-ai <command> --output yaml ``` **Environment variable:** ```bash export DOT_AI_OUTPUT_FORMAT="json" dot-ai <command> ``` **Default:** `yaml` ## Processing Output **Extract fields with jq:** ```bash dot-ai version --output json | jq '.server.version' ``` **Filter arrays:** ```bash dot-ai resources --kind Deployment --output json | jq '.items[] | .metadata.name' ``` **Combine with other tools:** ```bash dot-ai resources --kind Pod --output json | jq -r '.items[].metadata.name' | xargs -I {} echo "Pod: {}" ``` ## For AI Agents AI agents should use JSON output for structured parsing: ```bash dot-ai <command> --output json ``` This ensures consistent, parseable responses without YAML formatting ambiguities. ## Next Steps - **[Automation](automation.md)** — Use output in scripts and CI/CD - **[Commands Overview](cli-commands-overview.md)** — See all available commands - **[Configuration](../setup/configuration.md)** — Set default output format # Skills Generation Enable AI agents to use the DevOps AI Toolkit CLI and access server prompts as native skills. ## What Are Skills? Skills are agent capabilities that make AI coding assistants (Claude Code, Cursor, Windsurf) aware of available tools and workflows. The CLI can generate skills from server capabilities. ## What Gets Generated Skills generation serves two purposes: ### 1. CLI Awareness (Routing Skill) Creates a `dot-ai` routing skill that makes agents aware of the CLI: - Triggers on Kubernetes and DevOps operations - Directs agents to use CLI instead of MCP - Teaches agents to use `dot-ai --help` for command discovery - Lower token overhead than MCP protocol ### 2. 
Server Prompts (Prompt Skills) Exposes server prompts as native agent skills: - Each server prompt becomes an agent skill (e.g., `dot-ai-projectSetup`, `dot-ai-query`) - Users can invoke them as native skills in their agent - Prefixed with `dot-ai-` to avoid naming conflicts ## Supported Agents - **Claude Code** — `.claude/skills/` - **Cursor** — `.cursor/skills/` - **Windsurf** — `.windsurf/skills/` Note: Cursor also auto-discovers skills from `.claude/skills/`, so Claude Code skills work in Cursor without duplication. ## Generate Skills **For Claude Code:** ```bash dot-ai skills generate --agent claude-code ``` **For Cursor:** ```bash dot-ai skills generate --agent cursor ``` **For Windsurf:** ```bash dot-ai skills generate --agent windsurf ``` **Custom path (unsupported agents):** ```bash dot-ai skills generate --path ./custom/skills/ ``` ## Updating Skills Re-running the command updates all `dot-ai-*` skills: ```bash dot-ai skills generate --agent claude-code ``` Existing `dot-ai-*` skills are deleted and regenerated with the latest server capabilities. ## How It Works 1. CLI fetches prompts and tool metadata from the server 2. Generates a routing skill for CLI awareness 3. Creates individual skills for each server prompt 4. All skills use `dot-ai-` prefix for namespacing ## Agent Behavior Once skills are generated: **Routing:** - Agents become aware of CLI for Kubernetes operations - Agents prefer CLI over MCP when both are available - Agents use `dot-ai --help` to discover commands **Prompts:** - Server prompts appear as native agent skills - Users can invoke them directly in their coding assistant - Skills stay in sync with server capabilities ## Next Steps - **[Automation](automation.md)** — Use CLI in scripts and CI/CD - **[Output Formats](output-formats.md)** — Control output format - **[Configuration](../setup/configuration.md)** — Configure server URL # Quick Start Get your AI agent using the DevOps AI Toolkit CLI. 
## Prerequisites - Running [DevOps AI Toolkit server](/docs/ai-engine/setup/deployment) - AI coding assistant: Claude Code, Cursor, or Windsurf ## Install the CLI **macOS/Linux:** ```bash brew install vfarcic/tap/dot-ai ``` For other platforms, see [Installation Guide](setup/installation.md). ## Configure Server Connection Point the CLI to your server: ```bash export DOT_AI_URL="http://dot-ai.127.0.0.1.nip.io" # your server URL export DOT_AI_AUTH_TOKEN="your-token" # if authentication is enabled ``` See [Configuration Guide](setup/configuration.md) for details. ## Generate Agent Skills Enable your AI agent to discover and use the CLI: ```bash # For Claude Code dot-ai skills generate --agent claude-code # For Cursor dot-ai skills generate --agent cursor # For Windsurf dot-ai skills generate --agent windsurf ``` See [Skills Generation](guides/skills-generation.md) for what this does. ## Verify It Works Ask your agent to use the dot-ai CLI: ``` "Use the dot-ai CLI to check the server version" ``` Your agent should execute `dot-ai version` and show you the results. If this works, your agent is successfully using the CLI! ## What's Next Your agent can now use all DevOps AI Toolkit capabilities via CLI. For details on what you can do, see the [server documentation](/docs/ai-engine/). **CLI-specific topics:** - **[Commands Overview](guides/cli-commands-overview.md)** — How to discover and use commands - **[Output Formats](guides/output-formats.md)** — Control CLI output format - **[Automation](guides/automation.md)** — Use CLI in scripts and CI/CD # Configuration Configure the CLI to connect to your DevOps AI Toolkit server. 
## Server URL Specify the server address: **Environment variable:** ```bash export DOT_AI_URL="https://your-server-url" ``` **Command-line flag:** ```bash dot-ai query "test" --server-url https://your-server-url ``` **Default:** `http://localhost:3456` ## Authentication Set the authentication token: **Environment variable:** ```bash export DOT_AI_AUTH_TOKEN="your-token-here" ``` **Command-line flag:** ```bash dot-ai query "test" --token your-token-here ``` **Default:** No authentication (for local development) ## Output Format Choose the output format: **Environment variable:** ```bash export DOT_AI_OUTPUT_FORMAT="json" # or "yaml" ``` **Command-line flag:** ```bash dot-ai query "test" --output json ``` **Default:** `yaml` **Options:** - `yaml` — Human-readable, structured output (default) - `json` — Machine-parseable, raw API response ## Configuration Precedence Settings are applied in this order (highest to lowest priority): 1. **Command-line flags** (`--server-url`, `--token`, `--output`) 2. **Environment variables** (`DOT_AI_URL`, `DOT_AI_AUTH_TOKEN`, `DOT_AI_OUTPUT_FORMAT`) 3. **Defaults** (`http://localhost:3456`, no token, `yaml`) ## Example Configuration **For local development:** ```bash # No configuration needed - defaults work dot-ai version ``` **For remote server:** ```bash # Set once in your shell profile export DOT_AI_URL="https://dot-ai.example.com" export DOT_AI_AUTH_TOKEN="your-token" # Then use normally dot-ai query "what pods are running?" 
``` **For multiple environments:** ```bash # Development DOT_AI_URL="https://dev.example.com" dot-ai query "test" # Production DOT_AI_URL="https://prod.example.com" DOT_AI_AUTH_TOKEN="prod-token" dot-ai query "test" ``` ## Next Steps - **[Shell Completion](shell-completion.md)** — Enable command autocompletion - **[Commands Overview](../guides/cli-commands-overview.md)** — See all available commands - **[Automation](../guides/automation.md)** — Use in scripts and CI/CD # Installation Install the DevOps AI Toolkit CLI on your preferred platform. ## Prerequisites - Access to a running [DevOps AI Toolkit server](/docs/ai-engine/setup/deployment) ## Homebrew (macOS/Linux) ```bash brew install vfarcic/tap/dot-ai ``` ## Scoop (Windows) ```bash # Add the bucket scoop bucket add dot-ai https://github.com/vfarcic/scoop-dot-ai # Install scoop install dot-ai ``` ## Binary Download Download the latest release for your platform: **macOS (Apple Silicon):** ```bash curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-darwin-arm64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai ``` **macOS (Intel):** ```bash curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-darwin-amd64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai ``` **Linux (x86_64):** ```bash curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-linux-amd64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai ``` **Linux (ARM64):** ```bash curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-linux-arm64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai ``` **Windows:** Download from [GitHub Releases](https://github.com/vfarcic/dot-ai-cli/releases/latest) and add to PATH. 
## Configuration Configure the server URL and authentication: ```bash export DOT_AI_URL="https://your-server-url" export DOT_AI_AUTH_TOKEN="your-token" ``` See [Configuration](configuration.md) for more options. ## Verification Verify the CLI can connect to your server: ```bash dot-ai version ``` You should see version and diagnostic information from the server. ## Next Steps - **[Configuration](configuration.md)** — Detailed configuration options - **[Shell Completion](shell-completion.md)** — Enable command autocompletion - **[Commands Overview](../guides/cli-commands-overview.md)** — See all available commands # Shell Completion Enable command and flag autocompletion for your shell. ## Bash **Linux:** ```bash dot-ai completion bash | sudo tee /etc/bash_completion.d/dot-ai > /dev/null ``` **macOS:** ```bash dot-ai completion bash > $(brew --prefix)/etc/bash_completion.d/dot-ai ``` Then restart your shell or source the completion file: ```bash source $(brew --prefix)/etc/bash_completion.d/dot-ai ``` ## Zsh ```bash dot-ai completion zsh > "${fpath[1]}/_dot-ai" ``` Then restart your shell or run: ```bash compinit ``` ## Fish ```bash dot-ai completion fish > ~/.config/fish/completions/dot-ai.fish ``` Then restart your shell or run: ```bash source ~/.config/fish/completions/dot-ai.fish ``` ## What Gets Completed Shell completion provides: - **Commands** — All available CLI commands - **Flags** — Global and command-specific flags - **Enum values** — Valid values for flags like `--output` (`yaml`, `json`) - **Help** — Press tab to see available options ## Next Steps - **[Commands Overview](../guides/cli-commands-overview.md)** — Learn all available commands - **[Configuration](configuration.md)** — Configure server URL and authentication ## Controller # DevOps AI Toolkit Controller A Kubernetes controller that provides resource tracking, event-driven remediation, and resource visibility capabilities for your cluster. 
## Mission The DevOps AI Toolkit Controller bridges the gap between Kubernetes resources and intelligent operations. It enables: - **Resource awareness** through logical grouping and health aggregation - **Proactive remediation** through AI-powered event analysis - **Resource discoverability** through semantic search integration ## Who Should Use This - **Platform Engineers** building self-service Kubernetes platforms - **DevOps Teams** looking to automate incident response - **SREs** who want intelligent monitoring and remediation - **Developers** deploying applications and needing resource visibility ## Scope ### In Scope - Kubernetes resource tracking and lifecycle management - Event-driven remediation with AI analysis - Resource synchronization for semantic search - Integration with DevOps AI Toolkit MCP ### Out of Scope - Direct AI/LLM processing (delegated to MCP) - Application-level monitoring - Multi-cluster management - GitOps workflows ## Features The DevOps AI Toolkit Controller provides five main capabilities: ### 1. Solution CRD - Resource Tracking Track and manage deployed Kubernetes resources as logical solutions: - **Resource Grouping**: Links all resources (Deployments, Services, etc.) that compose a logical solution - **Intent Preservation**: Stores the original user intent and deployment context - **Metadata Storage**: Captures deployment rationale, patterns, policies, and documentation links - **Health Monitoring**: Aggregates health status across all tracked resources - **Automatic Cleanup**: Deleting a Solution CR automatically deletes all child resources via ownerReferences **Works standalone** - No external dependencies required. ### 2. 
RemediationPolicy CRD - Event-Driven Remediation Monitor Kubernetes events and automatically remediate issues using the [DevOps AI Toolkit](/docs/mcp): - **Event Watching**: Configurable filtering of Kubernetes events by type, reason, and involved objects - **Automatic Mode**: System detects, analyzes, and fixes issues without human intervention - **Manual Mode**: System provides remediation recommendations via Slack for human execution - **Slack Notifications**: Rich notifications with remediation results and next steps - **Rate Limiting**: Prevents event storms with configurable cooldowns - **Status Reporting**: Comprehensive observability through status updates **Requires** [DevOps AI Toolkit MCP](/docs/mcp) for AI-powered analysis. ### 3. ResourceSyncConfig CRD - Resource Visibility Enable semantic search and resource discovery across your cluster: - **Resource Discovery**: Automatically discovers all resource types in your cluster - **Change Tracking**: Watches for resource changes (create, update, delete) - **Semantic Search**: Syncs resource metadata to MCP for natural language queries - **Debounced Sync**: Batches changes to reduce API calls - **Periodic Resync**: Full state sync catches any missed events **Requires** [DevOps AI Toolkit MCP](/docs/mcp) for semantic search capabilities. ### 4. CapabilityScanConfig CRD - Autonomous Capability Discovery Keep your cluster's capability data up-to-date for AI-powered recommendations: - **Autonomous Discovery**: Automatically detects CRD changes (create, update, delete) - **Event-Driven Scanning**: Triggers capability scans when new CRDs are installed - **Startup Reconciliation**: Syncs cluster state with MCP on controller restart - **Resource Filtering**: Include/exclude patterns for targeted scanning - **Debounced Batching**: Groups rapid CRD changes into efficient batch requests **Requires** [DevOps AI Toolkit MCP](/docs/mcp) for capability storage and analysis. ### 5. 
GitKnowledgeSource CRD - Documentation Ingestion Automatically sync documentation from Git repositories into the knowledge base: - **Document Ingestion**: Syncs markdown and other files to MCP knowledge base - **Change Detection**: Only processes files changed since the last sync - **Scheduled Sync**: Periodically re-syncs to capture updates - **Pattern Matching**: Flexible glob patterns for file selection - **Automatic Cleanup**: Removes documents from knowledge base when CR is deleted **Requires** [DevOps AI Toolkit MCP](/docs/mcp) for knowledge base storage. ## Quick Start > **Recommended**: For the easiest setup, we recommend installing the complete dot-ai stack which includes all components. See the [Stack Installation Guide](/docs/stack). > > The stack automatically applies CapabilityScanConfig and ResourceSyncConfig CRs. For RemediationPolicy and Solution CRs, see the [Remediation Guide](remediation-guide.md) and [Solution Guide](solution-guide.md) to configure them based on your needs. > > Continue below if you want to install this component individually (for non-Kubernetes setups or granular control). ### 1. Install Controller ```bash # Set the version from https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller export DOT_AI_CONTROLLER_VERSION="..." helm install dot-ai-controller oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller \ --version $DOT_AI_CONTROLLER_VERSION \ --namespace dot-ai \ --create-namespace \ --wait ``` This installs all five CRDs (Solution, RemediationPolicy, ResourceSyncConfig, CapabilityScanConfig, and GitKnowledgeSource) and the controller. ### 2. 
Choose Your Feature **For Event Remediation:** First, install the [DevOps AI Toolkit MCP](/docs/mcp), then: ```bash # Create a RemediationPolicy to handle events kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: RemediationPolicy metadata: name: auto-remediate namespace: dot-ai spec: eventSelectors: - type: Warning reason: FailedScheduling mode: automatic mcpEndpoint: http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/remediate EOF ``` See the [Remediation Guide](remediation-guide.md) for complete examples, configuration options, and best practices. **For Resource Visibility:** First, install the [DevOps AI Toolkit MCP](/docs/mcp), then: ```bash # Create a secret with your MCP auth token (if not already created) kubectl create secret generic dot-ai-secrets \ --namespace dot-ai \ --from-literal=auth-token=your-auth-token-here # Create a ResourceSyncConfig to enable semantic search kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: ResourceSyncConfig metadata: name: default-sync namespace: dot-ai spec: mcpEndpoint: http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync mcpAuthSecretRef: name: dot-ai-secrets key: auth-token debounceWindowSeconds: 10 resyncIntervalMinutes: 60 EOF ``` See the [Resource Sync Guide](resource-sync-guide.md) for complete examples and semantic search usage. **For Capability Discovery:** First, install the [DevOps AI Toolkit MCP](/docs/mcp), then: ```bash # Create a CapabilityScanConfig to enable autonomous scanning kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: CapabilityScanConfig metadata: name: default-scan namespace: dot-ai spec: mcp: endpoint: http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/manageOrgData authSecretRef: name: dot-ai-secrets key: auth-token EOF ``` See the [Capability Scan Guide](capability-scan-guide.md) for complete examples and configuration options. 
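After applying the CapabilityScanConfig above, a quick way to confirm the controller accepted it is to read the CR back and inspect its conditions (name and namespace match the example; the `Ready` condition is documented in the Capability Scan Guide):

```shell
# List the CR, then check its status conditions
kubectl get capabilityscanconfig default-scan --namespace dot-ai
kubectl get capabilityscanconfig default-scan --namespace dot-ai \
    -o jsonpath='{.status.conditions}'
```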
**For Resource Tracking:** ```bash # Create a Solution CR to track your deployed resources kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: Solution metadata: name: my-app namespace: default spec: intent: "Production web application with database" resources: - apiVersion: apps/v1 kind: Deployment name: web-app - apiVersion: v1 kind: Service name: web-app-service - apiVersion: apps/v1 kind: StatefulSet name: postgresql EOF ``` See the [Solution Guide](solution-guide.md) for complete examples and usage patterns. ## Documentation - **[Setup Guide](setup-guide.md)** - Installation and prerequisites - **[Remediation Guide](remediation-guide.md)** - Event-driven remediation - **[Resource Sync Guide](resource-sync-guide.md)** - Resource visibility and semantic search - **[Capability Scan Guide](capability-scan-guide.md)** - Autonomous capability discovery - **[Knowledge Source Guide](knowledge-source-guide.md)** - Documentation ingestion from Git - **[Solution Guide](solution-guide.md)** - Resource tracking and lifecycle management - **[Troubleshooting Guide](troubleshooting.md)** - Common issues and solutions ## Architecture ``` ┌─────────────────────────────────────────────────────┐ │ Kubernetes Cluster │ │ │ │ ┌──────────────────────┐ │ │ │ Solution CR │ (Parent Resource) │ │ │ ───────────── │ │ │ │ metadata: │ │ │ │ intent: "..." │ │ │ │ resources: [...] 
│ │ │ └──────────────────────┘ │ │ ▲ │ │ │ ownerReferences │ │ │ │ │ ┌────────┴──────────┬──────────────┬─────────┐ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ │ │ Deployment Service PVC ConfigMap│ │ (child) (child) (child) (child) │ │ │ │ ┌─────────────────────────────────────────────┐ │ │ │ Controller │ │ │ │ ─────────── │ │ │ │ • Watches Solution CRs │ │ │ │ • Manages ownerReferences │ │ │ │ • Tracks resource health │ │ │ │ • Processes events (RemediationPolicy) │ │ │ │ • Syncs resources to MCP (ResourceSync) │ │ │ │ • Scans capabilities (CapabilityScan) │ │ │ └─────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────┘ ``` # Capability Scan Guide This guide covers the CapabilityScanConfig CRD for autonomous capability discovery and scanning in your Kubernetes cluster. ## Overview The CapabilityScanConfig enables: - **Autonomous Discovery**: Automatically detects CRD changes (create, update, delete) - **Event-Driven Scanning**: Triggers capability scans when new CRDs are installed - **Startup Reconciliation**: Syncs cluster state with MCP on controller restart - **Debounced Batching**: Groups rapid CRD changes into efficient batch requests This feature works with the [DevOps AI Toolkit MCP](/docs/mcp) to keep your cluster's capability data up-to-date for AI-powered recommendations. ## Stack Installation If you installed via the [DevOps AI Toolkit Stack](/docs/stack), CapabilityScanConfig is already configured. You can verify with: ```bash kubectl get capabilityscanconfig -n dot-ai ``` Continue below only if you need to customize the configuration or installed the controller individually. ## Prerequisites - Controller installed (see [Setup Guide](setup-guide.md)) - [DevOps AI Toolkit MCP](/docs/mcp) installed and running ## Quick Start 1. 
Create a secret with your MCP API key (if authentication is required): ```bash kubectl create secret generic dot-ai-secrets \ --namespace dot-ai \ --from-literal=auth-token=your-auth-token-here ``` 2. Create a CapabilityScanConfig to start scanning: ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: CapabilityScanConfig metadata: name: default-scan namespace: dot-ai spec: mcp: endpoint: http://dot-ai.dot-ai.svc.cluster.local:3456/api/v1/tools/manageOrgData authSecretRef: name: dot-ai-secrets key: auth-token ``` 3. Apply it: ```bash kubectl apply -f capabilityscanconfig.yaml ``` The controller will perform an initial scan of all cluster resources and then watch for CRD changes. ## How It Works ### Startup Reconciliation When the controller starts (or restarts), it performs a diff-and-sync: 1. **List Cluster Resources**: Uses Discovery API to get all resources (core + CRDs) matching include/exclude filters 2. **List MCP Capabilities**: Queries MCP for existing capability IDs 3. **Compute Diff**: - Resources in cluster but not in MCP → trigger targeted scan - Capabilities in MCP but not in cluster → delete orphaned This ensures the controller recovers gracefully from restarts without missing any changes. ### Event-Driven Scanning After startup, the controller watches for CRD events: 1. **CRD Created/Updated**: Queue for capability scan 2. **CRD Deleted**: Queue for capability deletion 3. **Debounce**: Wait for `debounceWindowSeconds` to collect more events 4. **Batch Request**: Send all queued scans in a single request ### Debouncing When operators are installed, many CRDs may be created in rapid succession. Debouncing prevents overwhelming MCP with individual requests: ```text Time 0s: CRD-A created → add to buffer Time 1s: CRD-B created → add to buffer Time 2s: CRD-C created → add to buffer ... 
Time 10s: Flush buffer → single request: "CRD-A,CRD-B,CRD-C" ``` Configure the window based on your needs: - **Lower values (1-5s)**: Faster scanning, more HTTP requests - **Higher values (30-60s)**: Fewer requests, delayed scanning ### Fire-and-Forget Model The controller uses a fire-and-forget pattern: - Scans are triggered asynchronously (controller doesn't wait for completion) - MCP performs the actual capability analysis in the background - Failed scans are automatically retried on next controller restart ## Configuration ### Spec Fields | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `mcp.endpoint` | string | Yes | - | Full URL of the MCP manageOrgData endpoint | | `mcp.collection` | string | No | capabilities | Qdrant collection name for storing capabilities | | `mcp.authSecretRef` | SecretReference | Yes | - | Secret containing API key for MCP authentication | | `includeResources` | []string | No | all | Patterns for resources to include in scanning | | `excludeResources` | []string | No | - | Patterns for resources to exclude from scanning | | `retry.maxAttempts` | int | No | 3 | Maximum retry attempts for MCP API calls | | `retry.backoffSeconds` | int | No | 5 | Initial backoff duration in seconds | | `retry.maxBackoffSeconds` | int | No | 300 | Maximum backoff duration in seconds | | `debounceWindowSeconds` | int | No | 10 | Time window to batch CRD events before syncing | ### Resource Filtering Use `includeResources` and `excludeResources` to control which resources are scanned. 
Filters apply to: - **Initial scan**: All resources discovered via Discovery API (core + CRDs) - **Event-driven scanning**: CRD create/update/delete events **Pattern Format**: - `Kind.group` for grouped resources (e.g., `Deployment.apps`, `RDSInstance.database.aws.crossplane.io`) - `Kind` for core resources (e.g., `Service`, `ConfigMap`) - Wildcards supported: `*.crossplane.io`, `*.apps`, `*` **Example: Whitelist - Scan Only Crossplane Resources**: ```yaml spec: includeResources: - "*.crossplane.io" ``` **Example: Blocklist - Scan Everything Except High-Volume Resources**: ```yaml spec: excludeResources: - "Event" - "Lease.coordination.k8s.io" - "EndpointSlice.discovery.k8s.io" ``` **Example: Combined - Crossplane Resources Except Provider Configs**: ```yaml spec: includeResources: - "*.crossplane.io" excludeResources: - "ProviderConfig.*" ``` **Processing Order**: 1. If `includeResources` is specified, only those patterns are scanned 2. `excludeResources` is applied as a blocklist after includes 3. 
If neither is specified, all resources are scanned ## Status Check the status to verify scanning is working: ```bash kubectl get capabilityscanconfig default-scan --namespace dot-ai -o yaml ``` ### Status Fields | Field | Description | |-------|-------------| | `initialScanComplete` | Whether startup reconciliation has completed | | `lastScanTime` | Timestamp of last successful scan trigger | | `lastError` | Last error message if any | | `conditions` | Standard Kubernetes conditions | ### Conditions | Type | Description | |------|-------------| | `Ready` | True when controller is watching CRDs and connected to MCP | ## Example: Full Configuration ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: CapabilityScanConfig metadata: name: production-scan namespace: dot-ai spec: # MCP configuration mcp: endpoint: http://dot-ai.dot-ai.svc.cluster.local:3456/api/v1/tools/manageOrgData collection: capabilities authSecretRef: name: dot-ai-secrets key: auth-token # Only scan Crossplane and ArgoCD resources includeResources: - "*.crossplane.io" - "*.aws.crossplane.io" - "*.gcp.crossplane.io" - "*.azure.crossplane.io" - "applications.argoproj.io" - "applicationsets.argoproj.io" # Exclude internal resources excludeResources: - "*.internal.company.com" # Retry configuration for MCP API calls retry: maxAttempts: 5 backoffSeconds: 10 maxBackoffSeconds: 300 # Batch CRD events for 15 seconds before sending debounceWindowSeconds: 15 ``` ## Use Cases ### Crossplane Provider Installation When you install a Crossplane provider: ```bash kubectl apply -f provider-aws.yaml ``` The controller: 1. Detects new CRDs (`RDSInstance.database.aws.crossplane.io`, `Bucket.s3.aws.crossplane.io`, etc.) 2. Waits for debounce window (batches all CRDs) 3. Sends single scan request to MCP 4. MCP analyzes and stores capabilities MCP can now provide AI recommendations that include the newly available AWS resources. 
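To confirm that a provider installation actually triggered a scan, the status fields above can be queried directly (CR name and namespace taken from the earlier examples):

```shell
# Should show a recent timestamp after the debounce window elapses
kubectl get capabilityscanconfig default-scan --namespace dot-ai \
    -o jsonpath='{.status.lastScanTime}'
```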
### Operator Removal When you remove an operator: ```bash kubectl delete -f provider-aws.yaml ``` The controller: 1. Detects CRD deletions 2. Sends delete requests to MCP for each capability 3. MCP removes stale capability data MCP recommendations no longer suggest the removed resources. ### Controller Restart Recovery If the controller pod restarts: 1. Controller performs startup reconciliation 2. Compares cluster CRDs with MCP capabilities 3. Syncs any differences (missed events during downtime) 4. Resumes event watching No manual intervention required. ## Troubleshooting ### Controller Not Starting Check the Ready condition: ```bash kubectl get capabilityscanconfig default-scan -o jsonpath='{.status.conditions}' ``` Common issues: - Invalid `mcp.endpoint` URL - MCP service not reachable - Missing RBAC permissions ### Scans Not Triggering 1. Check if CRD matches include/exclude filters: ```bash # View configured filters kubectl get capabilityscanconfig default-scan -o jsonpath='{.spec.includeResources}' kubectl get capabilityscanconfig default-scan -o jsonpath='{.spec.excludeResources}' ``` 2. Check controller logs: ```bash kubectl logs -l app.kubernetes.io/name=dot-ai-controller -n dot-ai --tail=50 ``` Look for messages about CRD events and filtering decisions. ### MCP Connection Errors Check `lastError` in status: ```bash kubectl get capabilityscanconfig default-scan -o jsonpath='{.status.lastError}' ``` Common issues: - MCP endpoint unreachable (check service/DNS) - Authentication failure (check secret exists and has correct key) - MCP server overloaded (check MCP logs) ### Initial Scan Not Completing Check if initial scan is marked complete: ```bash kubectl get capabilityscanconfig default-scan -o jsonpath='{.status.initialScanComplete}' ``` If false, check controller logs for errors during startup reconciliation. 
### Debounce Window Too Long/Short Adjust `debounceWindowSeconds` based on your operator installation patterns: ```yaml spec: # For frequent small changes debounceWindowSeconds: 5 # For large operator installations debounceWindowSeconds: 30 ``` ## Cleanup Delete the CapabilityScanConfig to stop scanning: ```bash kubectl delete capabilityscanconfig default-scan ``` This stops the CRD watcher but does not delete capability data from MCP. To remove capability data, use the MCP `manageOrgData` tool with `operation: deleteAll`. See the [Capability Management Guide](/docs/ai-engine/tools/capability-management) for details. ## Next Steps - Learn about [Resource Sync](resource-sync-guide.md) for semantic search of cluster resources - Explore [Remediation Policies](remediation-guide.md) for event-driven remediation - Check [Troubleshooting Guide](troubleshooting.md) for common issues # Knowledge Source Guide This guide covers the GitKnowledgeSource CRD for automatically syncing documentation from Git repositories into the DevOps AI Toolkit knowledge base. ## Overview The GitKnowledgeSource enables: - **Document Ingestion**: Automatically syncs markdown and other files to the knowledge base - **Change Detection**: Only processes files changed since the last sync - **Scheduled Sync**: Periodically re-syncs to capture updates - **Automatic Cleanup**: Removes documents from knowledge base when the resource is deleted Once documents are synced, they become searchable through the DevOps AI Toolkit's semantic search capabilities. ## Stack Installation If you installed via the [DevOps AI Toolkit Stack](/docs/stack), you can create GitKnowledgeSource resources immediately. Verify the CRD is available: ```bash kubectl get crds gitknowledgesources.dot-ai.devopstoolkit.live ``` Continue below to configure a GitKnowledgeSource for your documentation. 
## Prerequisites - Controller installed (see [Setup Guide](setup-guide.md)) - [DevOps AI Toolkit MCP](/docs/mcp) installed and running ## Quick Start 1. Ensure the MCP authentication secret exists: ```bash kubectl get secret dot-ai-secrets -n dot-ai ``` If not, create it: ```bash kubectl create secret generic dot-ai-secrets \ --namespace dot-ai \ --from-literal=auth-token=your-auth-token-here ``` 2. Create a GitKnowledgeSource to sync documentation from a Git repository: ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: GitKnowledgeSource metadata: name: my-docs namespace: dot-ai spec: repository: url: https://github.com/your-org/your-repo.git branch: main paths: - "docs/**/*.md" - "README.md" mcpServer: url: http://dot-ai.dot-ai.svc:3456 authSecretRef: name: dot-ai-secrets key: auth-token ``` 3. Apply it: ```bash kubectl apply -f gitknowledgesource.yaml ``` 4. Check the sync status: ```bash kubectl get gitknowledgesource my-docs -n dot-ai ``` Expected output: ```text NAME ACTIVE DOCUMENTS LAST SYNC NEXT SYNC my-docs true 9 2026-02-05T16:40:14Z 2026-02-06T16:40:14Z ``` ## How It Works ### Sync Process 1. **Clone**: Controller performs a shallow clone of the repository 2. **Pattern Match**: Finds files matching `paths` patterns, excluding `exclude` patterns 3. **Change Detection**: Compares current commit with `lastSyncedCommit` to find changed files 4. **Ingest**: Sends changed documents to MCP knowledge base with `sourceIdentifier` 5. **Cleanup**: Deletes the local clone (no persistent storage required) 6. 
**Schedule**: Queues next sync based on `schedule` field ### First Sync vs Incremental Sync - **First sync**: Processes all matching files (full sync) - **Subsequent syncs**: Only processes files changed since `lastSyncedCommit` - **Spec changes**: Modifying `paths` or other spec fields triggers a full sync ### What Gets Synced Each document is ingested to MCP with: - **Content**: The file contents - **URI**: `https://github.com/{org}/{repo}/blob/{branch}/{path}` - **Source Identifier**: `{namespace}/{name}` for bulk operations - **Custom Metadata**: Values from `spec.metadata` field ### Cleanup on Deletion When a GitKnowledgeSource is deleted: 1. Controller detects deletion via finalizer 2. Checks `deletionPolicy` (`Delete` or `Retain`) 3. If `Delete`: Calls MCP to remove all documents with matching `sourceIdentifier` 4. Removes finalizer, allowing CR deletion to complete ## Configuration ### Spec Fields | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `repository.url` | string | Yes | - | Git repository URL (HTTPS only) | | `repository.branch` | string | No | `main` | Branch to sync | | `repository.depth` | int | No | `1` | Shallow clone depth | | `repository.secretRef` | SecretReference | No | - | Secret with token for private repos | | `paths` | []string | Yes | - | Glob patterns for files to sync (e.g., `docs/**/*.md`) | | `exclude` | []string | No | - | Glob patterns to exclude | | `schedule` | string | No | `@every 24h` | Sync schedule (cron or interval) | | `mcpServer.url` | string | Yes | - | MCP server endpoint URL | | `mcpServer.authSecretRef` | SecretReference | Yes | - | Secret with MCP auth token | | `mcpServer.httpTimeoutSeconds` | int | No | `120` | HTTP timeout in seconds for MCP API calls (min: 5, max: 600) | | `metadata` | map[string]string | No | - | Custom metadata attached to all documents | | `maxFileSizeBytes` | int | No | - | Skip files larger than this size | | `deletionPolicy` | 
string | No | `Delete` | `Delete` or `Retain` documents on CR deletion | ### Repository Authentication For private repositories, create a secret with a personal access token: ```bash kubectl create secret generic github-token \ --namespace dot-ai \ --from-literal=token=ghp_xxxxxxxxxxxx ``` Reference it in the GitKnowledgeSource: ```yaml spec: repository: url: https://github.com/your-org/private-repo.git secretRef: name: github-token key: token ``` ### Path Patterns The `paths` field uses glob patterns to match files: | Pattern | Matches | |---------|---------| | `docs/**/*.md` | All markdown files under `docs/` recursively | | `README.md` | Only the root README | | `**/*.md` | All markdown files in the repository | | `docs/*.md` | Markdown files directly in `docs/` (not subdirectories) | Use `exclude` to skip specific paths: ```yaml spec: paths: - "docs/**/*.md" exclude: - "docs/internal/**" - "docs/drafts/**" ``` ### Schedule Configuration The `schedule` field accepts cron expressions or interval syntax: | Format | Example | Description | |--------|---------|-------------| | Interval | `@every 24h` | Sync every 24 hours (default) | | Interval | `@every 6h` | Sync every 6 hours | | Interval | `@every 30m` | Sync every 30 minutes | | Cron | `0 3 * * *` | Daily at 3:00 AM | | Cron | `0 */6 * * *` | Every 6 hours | The default `@every 24h` means each GitKnowledgeSource syncs 24 hours after its last sync, naturally staggering syncs based on creation time. **Invalid schedules**: If you specify an invalid schedule expression, the controller will sync once, then set a `ScheduleError` condition and stop scheduling. Fix the schedule to resume. 
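You can preview locally which files the `paths` globs described above would select, using bash's `globstar` option as a stand-in for the controller's matcher (the directory layout below is illustrative):

```shell
# Demonstrate the `paths` glob semantics with bash globstar.
shopt -s globstar
dir=$(mktemp -d)
mkdir -p "$dir/docs/internal"
touch "$dir/README.md" "$dir/docs/guide.md" "$dir/docs/internal/draft.md"
cd "$dir"

echo "docs/**/*.md matches:"
printf '  %s\n' docs/**/*.md   # recursive: docs/guide.md and docs/internal/draft.md

echo "docs/*.md matches:"
printf '  %s\n' docs/*.md      # non-recursive: docs/guide.md only
```

This also shows why an `exclude` entry such as `docs/internal/**` is needed if you want `docs/**/*.md` but not the files under `docs/internal/`.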
### File Size Limits Use `maxFileSizeBytes` to skip large files: ```yaml spec: maxFileSizeBytes: 1048576 # 1MB limit ``` Skipped files appear in the status: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.skippedFiles}' | jq ``` ### Deletion Policy The `deletionPolicy` controls what happens when the GitKnowledgeSource is deleted: | Value | Behavior | |-------|----------| | `Delete` (default) | Remove all synced documents from MCP knowledge base | | `Retain` | Keep documents in MCP (useful for migrations) | ```yaml spec: deletionPolicy: Retain # Keep docs when CR is deleted ``` ## Status Check the status to monitor sync progress: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o yaml ``` ### Status Fields | Field | Description | |-------|-------------| | `active` | Whether the source is actively syncing | | `documentCount` | Total documents synced to MCP | | `lastSyncTime` | Timestamp of last successful sync | | `lastSyncedCommit` | Git commit SHA of last sync | | `nextScheduledSync` | When the next sync will occur | | `skippedDocuments` | Count of files skipped (e.g., size limit) | | `skippedFiles` | Details of skipped files with reasons | | `syncErrors` | Count of sync errors | | `lastError` | Most recent error message | | `observedGeneration` | Last processed spec generation | | `conditions` | Standard Kubernetes conditions | ### Conditions | Type | Description | |------|-------------| | `Ready` | True when source is active and configured correctly | | `Synced` | True when last sync completed successfully | | `Scheduled` | True when next sync is scheduled | ### Example Status ```yaml status: active: true documentCount: 9 lastSyncTime: "2026-02-05T16:40:14Z" lastSyncedCommit: "c32655af7f70361835a533e57533caaf4e8b750a" nextScheduledSync: "2026-02-06T16:40:14Z" conditions: - type: Ready status: "True" reason: Active message: "GitKnowledgeSource is active and syncing" - type: Synced status: "True" reason: SyncComplete message: 
"Successfully synced 9 documents" - type: Scheduled status: "True" reason: Scheduled message: "Next sync scheduled for 2026-02-06T16:40:14Z" ``` ## Troubleshooting ### Sync Not Starting Check the Ready condition: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.conditions}' | jq ``` Common issues: - **CloneError**: Invalid repository URL or authentication failure - **MCP unreachable**: Check MCP server URL and network connectivity - **Missing secret**: Verify auth secret exists and has correct keys ### Clone Errors If you see "read-only file system" errors: - Ensure the controller deployment has a writable `/tmp` volume mount If you see authentication errors for private repos: - Verify the secret exists: `kubectl get secret -n dot-ai` - Check the token has read access to the repository - Ensure `secretRef.key` matches the key in the secret ### Documents Not Appearing in Search 1. Check sync completed successfully: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.documentCount}' ``` 2. Verify MCP is running: ```bash kubectl get pods -n dot-ai -l app=dot-ai ``` 3. Check for sync errors: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.lastError}' ``` ### Schedule Not Working Check the Scheduled condition: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Scheduled")' ``` If `ScheduleError`, the schedule expression is invalid. Fix the `spec.schedule` field. ## Git Provider Compatibility GitKnowledgeSource uses standard Git HTTPS protocol and should work with any Git provider: - GitHub - GitLab - Bitbucket - Gitea - Self-hosted Git servers Testing has been performed primarily with GitHub. If you encounter issues with other providers, please [report them on GitHub](https://github.com/vfarcic/dot-ai-controller/issues). 
## Next Steps - Learn about [Resource Sync](resource-sync-guide.md) for cluster resource visibility - Explore [Remediation Policies](remediation-guide.md) for event-driven remediation - Check [Troubleshooting Guide](troubleshooting.md) for common issues # Remediation Guide This guide covers the RemediationPolicy CRD for event-driven remediation using the DevOps AI Toolkit. ## Overview The RemediationPolicy CRD monitors Kubernetes events and forwards them to the DevOps AI Toolkit MCP for analysis and remediation. It supports: - **Event Watching**: Configurable filtering of Kubernetes events by type, reason, and involved objects - **Automatic Mode**: System detects, analyzes, and fixes issues without human intervention - **Manual Mode**: System detects and analyzes issues, provides recommendations via Slack for humans to execute - **Slack Notifications**: Rich notifications with remediation results and next steps - **Rate Limiting**: Prevents event storms with configurable cooldowns - **Status Reporting**: Comprehensive observability through status updates ## Prerequisites - Controller installed (see [Setup Guide](setup-guide.md)) - **DevOps AI Toolkit MCP installed** - See [DevOps AI Toolkit documentation](/docs/mcp) - Slack webhook URL (optional, for Slack notifications) - Google Chat webhook URL (optional, for Google Chat notifications - requires Google Workspace paid account) ## Create a RemediationPolicy Create a RemediationPolicy to start processing events: ```bash # Create a Secret containing your MCP auth token (required for Kubernetes deployments) # Use the same token configured in the MCP server's DOT_AI_AUTH_TOKEN environment variable kubectl create secret generic dot-ai-secrets \ --from-literal=auth-token="your-mcp-auth-token" \ --namespace dot-ai # Create a Secret containing your Slack webhook URL (optional, for notifications) kubectl create secret generic slack-webhook \ --from-literal=url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \ 
  --namespace dot-ai

# Apply the policy
kubectl apply --filename - <
```

# Setup Guide

> **Recommended**: For the easiest setup, we recommend installing the complete dot-ai stack, which includes all components. See the [Stack Installation Guide](/docs/stack).
>
> The stack automatically applies CapabilityScanConfig and ResourceSyncConfig CRs. For RemediationPolicy and Solution CRs, see the [Remediation Guide](remediation-guide.md) and [Solution Guide](solution-guide.md) to configure them based on your needs.
>
> Continue below if you want to install this component individually (for non-Kubernetes setups or granular control).

## Prerequisites

- **kubectl** v1.11.3+
- **Helm** v3.0+
- **Kubernetes cluster** v1.11.3+

## Optional: Test Cluster Setup

If you don't have a Kubernetes cluster, create one locally using Kind:

```bash
# Use an isolated kubeconfig
export KUBECONFIG=$PWD/kubeconfig.yaml

# Create the test cluster
kind create cluster --name dot-ai-controller-test

# Verify cluster access
kubectl cluster-info
```

## Install Controller

The controller provides four features:

- **Solution CRD**: Resource tracking and lifecycle management (standalone)
- **RemediationPolicy CRD**: Event-driven remediation (requires [DevOps AI Toolkit MCP](/docs/mcp))
- **ResourceSyncConfig CRD**: Resource visibility and semantic search (requires [DevOps AI Toolkit MCP](/docs/mcp))
- **CapabilityScanConfig CRD**: Autonomous capability discovery (requires [DevOps AI Toolkit MCP](/docs/mcp))

### Install via Helm

```bash
# Set the version from https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller
export DOT_AI_CONTROLLER_VERSION="..."
helm install dot-ai-controller oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller \ --version $DOT_AI_CONTROLLER_VERSION \ --namespace dot-ai \ --create-namespace \ --wait ``` This installs: - Controller deployment - Solution CRD - RemediationPolicy CRD - ResourceSyncConfig CRD - CapabilityScanConfig CRD - RBAC permissions ### Configuration Reference | Parameter | Description | Default | |-----------|-------------|---------| | `annotations` | Global annotations applied to all resources (e.g., `reloader.stakater.com/auto: "true"`) | `{}` | | `image.repository` | Container image repository | `ghcr.io/vfarcic/dot-ai-controller` | | `image.tag` | Container image tag | Chart appVersion | | `image.pullPolicy` | Image pull policy | `IfNotPresent` | | `resources.requests.memory` | Memory request | `128Mi` | | `resources.requests.cpu` | CPU request | `10m` | | `resources.limits.memory` | Memory limit | `512Mi` | | `resources.limits.cpu` | CPU limit | `500m` | ### Verify Installation ```bash # Check controller is running kubectl get pods --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai # Check controller logs kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 10 # Verify CRDs are installed kubectl get crds | grep dot-ai.devopstoolkit.live ``` You should see all four CRDs: ```text capabilityscanconfigs.dot-ai.devopstoolkit.live remediationpolicies.dot-ai.devopstoolkit.live resourcesyncconfigs.dot-ai.devopstoolkit.live solutions.dot-ai.devopstoolkit.live ``` ## Optional: Install DevOps AI Toolkit MCP **Required for RemediationPolicy, ResourceSyncConfig, and CapabilityScanConfig features.** If you're only using the Solution CRD, skip this step. For MCP installation instructions, see the [DevOps AI Toolkit documentation](/docs/mcp). 
The controller expects the MCP service at: - RemediationPolicy: `http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/remediate` - ResourceSyncConfig: `http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync` - CapabilityScanConfig: `http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/manageOrgData` ## What's Next Choose which features you want to use: - **Solution CRD**: [Solution Guide](solution-guide.md) - Resource tracking and lifecycle management (works standalone, no MCP needed) - **RemediationPolicy CRD**: [Remediation Guide](remediation-guide.md) - Event-driven remediation (requires MCP) - **ResourceSyncConfig CRD**: [Resource Sync Guide](resource-sync-guide.md) - Resource visibility and semantic search (requires MCP) - **CapabilityScanConfig CRD**: [Capability Scan Guide](capability-scan-guide.md) - Autonomous capability discovery (requires MCP) ## Cleanup ### Remove Controller ```bash # Uninstall controller helm uninstall dot-ai-controller --namespace dot-ai # Delete namespace kubectl delete namespace dot-ai ``` ### Remove Test Cluster If you used Kind: ```bash # Delete cluster kind delete cluster --name dot-ai-controller-test # Remove kubeconfig rm kubeconfig.yaml ``` # Solution CRD Guide ## Video: Kubernetes ownerReferences and Resource Grouping [![Kubernetes ownerReferences and Resource Grouping with the Solution CRD](https://img.youtube.com/vi/UEkhIMx6B6E/maxresdefault.jpg)](https://youtu.be/UEkhIMx6B6E) This video explains the problem of understanding what resources compose an application in Kubernetes, explores how Kubernetes ownership and ownerReferences work for garbage collection, and demonstrates how the Solution CRD provides a better approach by wrapping related resources into logical groups with status, context, and relationships. ## Overview The **Solution CRD** is a Kubernetes Custom Resource Definition that tracks deployed solutions and their constituent resources. 
It acts as a parent resource that groups all Kubernetes resources (Deployments, Services, ConfigMaps, etc.) that compose a logical solution, preserving metadata and context not available in individual resources. ### Purpose When deploying applications to Kubernetes, the Solution CRD provides: 1. **Resource Grouping**: Links related Kubernetes resources into a logical solution 2. **Intent Preservation**: Stores the original user intent that led to the deployment 3. **Metadata Storage**: Captures information not available in individual resources: - Deployment rationale and decision-making context - Configuration trade-offs and choices - Documentation links - Patterns and policies applied 4. **Health Monitoring**: Aggregates health status across all tracked resources 5. **Automatic Cleanup**: Deleting a Solution CR automatically deletes all child resources ### Key Design Principles - **Kubernetes-Native**: Uses standard Kubernetes patterns and conventions - **Metadata Store**: Captures information NOT in individual resources - **Automatic Ownership**: Controller dynamically manages ownerReferences - **Lifecycle Management**: Tracks solution state from deployment through operation ## Prerequisites - Controller installed (see [Setup Guide](setup-guide.md)) The Solution CRD is automatically installed with the controller. 
Verify it's available: ```bash # Verify Solution CRD is installed kubectl get crds solutions.dot-ai.devopstoolkit.live # Check controller is running kubectl get pods --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai ``` ## Solution CRD Schema ### Spec Fields ```yaml spec: # Original user intent that led to this deployment (required) intent: string # Solution metadata (information not in individual resources) context: createdBy: string # Tool or user that created this solution rationale: string # Why this solution was deployed this way patterns: []string # Organizational patterns applied policies: []string # Policies applied to this solution # List of Kubernetes resources that compose this solution (required) resources: - apiVersion: string # e.g., "apps/v1" kind: string # e.g., "Deployment" name: string # Resource name namespace: string # Optional for cluster-scoped resources # Documentation URL (optional) documentationURL: string # Link to deployment documentation ``` ### Status Fields ```yaml status: # Overall state of the solution state: string # pending, deployed, degraded, failed # Generation tracking observedGeneration: int64 # Resource health summary resources: total: int # Total resources tracked ready: int # Resources that are ready failed: int # Resources that have failed # Standard Kubernetes conditions conditions: - type: Ready status: "True" | "False" | "Unknown" reason: string message: string ``` ## Quick Start: Your First Solution Let's create a simple web application with a PostgreSQL database and track it with a Solution CR. 
### Step 1: Create a Namespace

```bash
kubectl create namespace my-app
```

### Step 2: Deploy Application Resources

Deploy your application components (Deployment, Service, etc.). The container images below are illustrative placeholders; substitute your own:

```bash
kubectl apply --filename - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx  # placeholder image
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
  namespace: my-app
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: my-app
spec:
  serviceName: postgresql
  replicas: 1
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:16  # placeholder image
        ports:
        - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: my-app
spec:
  clusterIP: None
  selector:
    app: postgresql
  ports:
  - port: 5432
    targetPort: 5432
EOF
```

### Step 3: Create a Solution CR

Now create a Solution CR that tracks all these resources:

```bash
kubectl apply --filename - <<'EOF'
apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: Solution
metadata:
  name: my-web-app
  namespace: my-app
spec:
  # Describe what this solution is for
  intent: "Deploy a web application with PostgreSQL database for production workloads"

  # Add context about this deployment
  context:
    createdBy: "platform-team"
    rationale: "Deployed to meet Q1 customer requirements. PostgreSQL chosen for ACID compliance."
patterns: - "stateful-workload" - "multi-tier-application" policies: - "production-sla" - "data-retention-90d" # List all resources that compose this solution resources: - apiVersion: apps/v1 kind: Deployment name: web-app namespace: my-app - apiVersion: v1 kind: Service name: web-app-service namespace: my-app - apiVersion: apps/v1 kind: StatefulSet name: postgresql namespace: my-app - apiVersion: v1 kind: Service name: postgresql namespace: my-app EOF ``` ### Step 4: Verify the Solution Check that the Solution was created and the controller has processed it: ```bash # View the Solution kubectl get solutions --namespace my-app # Get detailed status kubectl get solution my-web-app --namespace my-app --output yaml # View controller logs kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 30 ``` Expected output: ``` NAME INTENT STATE RESOURCES AGE my-web-app Deploy a web application with PostgreSQL... deployed 4/4 2m ``` ### Step 5: Verify ownerReferences Were Added The controller automatically adds ownerReferences to all tracked resources: ```bash # Check ownerReference on Deployment kubectl get deployment web-app --namespace my-app --output jsonpath='{.metadata.ownerReferences}' | jq # Check ownerReference on Service kubectl get service web-app-service --namespace my-app --output jsonpath='{.metadata.ownerReferences}' | jq ``` You should see ownerReferences pointing to the Solution CR: ```json [ { "apiVersion": "dot-ai.devopstoolkit.live/v1alpha1", "kind": "Solution", "name": "my-web-app", "uid": "...", "controller": true, "blockOwnerDeletion": true } ] ``` ## Understanding Solution Status The Solution controller monitors all tracked resources and updates the status: ### Status States - **pending**: Initial state or resources not yet ready - **deployed**: All resources are healthy and ready - **degraded**: Some resources are unhealthy or missing - **failed**: Critical failure in resource deployment ### Health Checking The 
controller uses multiple strategies to determine resource health: 1. **Conditions** (highest priority): Checks for `Ready`, `Available`, `Healthy`, or `Synced` conditions 2. **Replica Counts**: For Deployments, StatefulSets, DaemonSets - compares readyReplicas vs desired 3. **Existence** (fallback): Resource exists in the cluster ### Status Updates The controller reconciles every 30 seconds to keep status current: ```bash # Watch status updates kubectl get solutions --namespace my-app --watch # Get detailed status with conditions kubectl get solution my-web-app --namespace my-app --output jsonpath='{.status}' | jq ``` ## Testing Health Monitoring Let's test how the controller detects unhealthy resources: ### Simulate a Failed Deployment Scale a deployment to an impossible replica count: ```bash # Scale to more replicas than cluster can handle kubectl scale deployment web-app --namespace my-app --replicas=100 # Watch Solution status change to degraded kubectl get solution my-web-app --namespace my-app --watch ``` The Solution status will show: ```yaml status: state: degraded resources: total: 4 ready: 3 failed: 1 conditions: - type: Ready status: "False" reason: ResourcesNotReady message: "Ready: 3/4, Failed: 1" ``` ### Restore Health ```bash # Scale back to normal kubectl scale deployment web-app --namespace my-app --replicas=2 # Watch status return to deployed kubectl get solution my-web-app --namespace my-app --watch ``` ## Garbage Collection One of the most powerful features is automatic cleanup via ownerReferences: ```bash # Delete the Solution CR kubectl delete solution my-web-app --namespace my-app # All tracked resources are automatically deleted kubectl get all --namespace my-app # (should show: No resources found) ``` **Important**: Deleting a Solution CR deletes ALL child resources. This is by design for clean solution removal. ## Advanced Usage ### Tracking Existing Resources You can create a Solution CR for resources that already exist. 
The controller will add ownerReferences dynamically: ```bash # Create resources first kubectl create deployment nginx --image=nginx --namespace my-app # Then create Solution referencing existing resources kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: Solution metadata: name: nginx-solution namespace: my-app spec: intent: "Simple nginx web server" resources: - apiVersion: apps/v1 kind: Deployment name: nginx namespace: my-app EOF # ownerReference is added after Solution creation kubectl get deployment nginx --namespace my-app --output jsonpath='{.metadata.ownerReferences}' ``` ### Cross-Namespace Resources Currently, Solutions track resources in their own namespace. All resource references without an explicit namespace default to the Solution's namespace. ### Documentation Links You can link to external documentation: ```yaml spec: intent: "Production web application" documentationURL: "https://docs.example.com/apps/web-app" resources: - ... ``` This field helps teams track where deployment documentation is stored. 
## Kubectl Tips ### List All Solutions ```bash # All namespaces kubectl get solutions --all-namespaces # Specific namespace kubectl get solutions --namespace my-app # Watch for changes kubectl get solutions --all-namespaces --watch ``` ### Filter by State ```bash # Find degraded solutions kubectl get solutions --all-namespaces --output json | \ jq -r '.items[] | select(.status.state=="degraded") | "\(.metadata.namespace)/\(.metadata.name)"' # Find solutions with failed resources kubectl get solutions --all-namespaces --output json | \ jq -r '.items[] | select(.status.resources.failed > 0) | "\(.metadata.namespace)/\(.metadata.name): \(.status.resources.failed) failed"' ``` ### Inspect Resource Health ```bash # Get detailed status kubectl get solution my-web-app --namespace my-app --output yaml # Just the state kubectl get solution my-web-app --namespace my-app --output jsonpath='{.status.state}' # Resource counts kubectl get solution my-web-app --namespace my-app --output jsonpath='{.status.resources}' | jq ``` ## Common Patterns ### Pattern 1: Multi-Tier Application ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: Solution metadata: name: ecommerce-app namespace: production spec: intent: "E-commerce platform with web frontend, API backend, and PostgreSQL database" context: createdBy: "ecommerce-team" patterns: - "three-tier-architecture" - "stateful-backend" resources: - apiVersion: apps/v1 kind: Deployment name: frontend - apiVersion: v1 kind: Service name: frontend - apiVersion: apps/v1 kind: Deployment name: api-backend - apiVersion: v1 kind: Service name: api-backend - apiVersion: apps/v1 kind: StatefulSet name: postgresql - apiVersion: v1 kind: Service name: postgresql - apiVersion: v1 kind: ConfigMap name: app-config ``` ### Pattern 2: Microservice with Dependencies ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: Solution metadata: name: order-service namespace: microservices spec: intent: "Order processing microservice with Redis 
cache and message queue"
  context:
    createdBy: "platform-team"
    rationale: "Separated from monolith for scalability"
    policies:
    - "auto-scaling-enabled"
    - "circuit-breaker-required"
  resources:
  - apiVersion: apps/v1
    kind: Deployment
    name: order-service
  - apiVersion: v1
    kind: Service
    name: order-service
  - apiVersion: apps/v1
    kind: Deployment
    name: redis
  - apiVersion: v1
    kind: Service
    name: redis
  - apiVersion: v1
    kind: ConfigMap
    name: order-config
  - apiVersion: v1
    kind: Secret
    name: order-secrets
```

### Pattern 3: Data Pipeline

```yaml
apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: Solution
metadata:
  name: analytics-pipeline
  namespace: data
spec:
  intent: "Analytics pipeline for processing customer data"
  context:
    createdBy: "data-team"
    patterns:
    - "batch-processing"
    - "data-pipeline"
    policies:
    - "pii-encryption-required"
    - "gdpr-compliant"
  resources:
  - apiVersion: batch/v1
    kind: CronJob
    name: data-ingestion
  - apiVersion: apps/v1
    kind: StatefulSet
    name: kafka
  - apiVersion: v1
    kind: Service
    name: kafka
  - apiVersion: apps/v1
    kind: Deployment
    name: data-processor
  - apiVersion: v1
    kind: PersistentVolumeClaim
    name: processed-data
```

## Troubleshooting

### Solution Shows "degraded" State

1. Check which resources are failing:

```bash
kubectl get solution <solution-name> -n <namespace> -o jsonpath='{.status.resources}' | jq
```

2. Inspect individual resources:

```bash
kubectl get deployment <deployment-name> -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>
```

3. Check controller logs:

```bash
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 50
```

### ownerReferences Not Added

1. Verify the resource exists and is in the correct namespace:

```bash
kubectl get <kind> <resource-name> -n <namespace>
```

2. Check that Solution references match exactly:

```bash
kubectl get solution <solution-name> -n <namespace> -o yaml | grep -A 10 resources:
```

3. Wait for the controller to reconcile (30 seconds) or check logs:

```bash
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep ownerReference
```

### Solution Status Not Updating

1.
Verify the controller is running: ```bash kubectl get pods --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai ``` 2. Check for controller errors: ```bash kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 100 ``` 3. Verify the controller has RBAC permissions: ```bash kubectl get clusterrole dot-ai-controller-manager-role -o yaml ``` ### Resources Not Deleted with Solution 1. Check if ownerReferences were added: ```bash kubectl get <resource-type> <resource-name> -n <namespace> -o jsonpath='{.metadata.ownerReferences}' ``` 2. If missing, the controller may not have permission. Check RBAC: ```bash kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "forbidden\|permission" ``` ## Current Limitations - **Namespace Scoped**: Solutions only track resources in the same namespace - **Namespaced Resources Only**: Cannot currently track cluster-scoped resources (ClusterRoles, PVs, etc.) - **No Configuration Drift Detection**: The controller only tracks resource health, not configuration changes ## Future Enhancements Planned features for future releases: - **Solution Updates**: Support updating deployed solutions via Solution CR changes - **Rollback Support**: Track solution versions and enable rollback - **Advanced Health Checks**: Custom health checks beyond basic resource status - **Cost Tracking**: Integration with cloud cost APIs - **Cross-Namespace Solutions**: Support for solutions spanning multiple namespaces - **Template System**: Solution templates for common patterns ## Next Steps - Explore the [Remediation Guide](remediation-guide.md) for event-driven remediation - Learn about [Capability Scanning](capability-scan-guide.md) for autonomous capability discovery - Check the [Troubleshooting Guide](troubleshooting.md) for common issues # Troubleshooting Guide This guide covers common issues encountered when running the DevOps AI Toolkit Controller and their solutions. ## Common Issues and Solutions ### 1.
Controller Pod Not Starting **Symptoms:** ```bash kubectl get pods --namespace dot-ai # Shows controller pod in CrashLoopBackOff or ImagePullBackOff ``` **Diagnosis:** ```bash kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai kubectl describe pod --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai ``` **Common Causes:** - **RBAC Issues**: Missing leader election permissions (we encountered this during testing) - **Image Issues**: Wrong architecture or missing image - **Resource Constraints**: Insufficient memory/CPU limits **Solution:** ```bash # Check if leader election RBAC is missing (error we fixed during testing): # "leases.coordination.k8s.io is forbidden" kubectl get clusterrole dot-ai-controller-manager-role --output yaml # Add missing leader election permissions if needed: kubectl patch clusterrole dot-ai-controller-manager-role --type='json' \ --patch='[{"op": "add", "path": "/rules/-", "value": {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ["create", "get", "list", "update"]}}]' ``` ### 2. Events Not Being Processed **Symptoms:** ```bash kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 50 # Shows: "No RemediationPolicies found - event will not be processed" ``` **Diagnosis:** ```bash # Check if RemediationPolicies exist kubectl get remediationpolicies --all-namespaces # Check policy selectors kubectl get remediationpolicies --namespace dot-ai --output yaml ``` **Common Causes:** - No RemediationPolicy created - Event doesn't match policy selectors - Policy in wrong namespace ### 3. 
MCP Connection Failures **Symptoms:** ```bash # Controller logs show: # "❌ HTTP request failed" or "Failed to send MCP request" ``` **Diagnosis:** ```bash # Check MCP pod status kubectl get pods --namespace dot-ai --selector app.kubernetes.io/name=dot-ai # Test MCP connectivity from controller kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \ curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/health ``` **Common Causes:** - MCP pod not running - Wrong MCP endpoint URL in RemediationPolicy - Network policies blocking communication ### 4. Slack Notifications Not Working **Symptoms:** ```bash # Controller logs show: # "failed to send Slack start notification" ``` **Diagnosis:** ```bash # Check Slack webhook configuration kubectl get remediationpolicies --namespace dot-ai --output yaml | grep --after-context 5 slack # Test webhook manually curl -X POST -H 'Content-type: application/json' \ --data '{"text":"Test message"}' \ YOUR_SLACK_WEBHOOK_URL ``` **Common Causes:** - Invalid Slack webhook URL - Slack webhook disabled (`enabled: false`) - Network connectivity issues ### 5. Rate Limiting Active **Symptoms:** ```bash # Controller logs show: # "Event processing rate limited" and "cooldown active for Xm Ys more" ``` **This is Expected Behavior:** Rate limiting prevents spam processing of duplicate events. The default settings are: - `eventsPerMinute: 5` - `cooldownMinutes: 15` **To Adjust:** Modify your RemediationPolicy: ```yaml rateLimiting: eventsPerMinute: 10 # Increase if needed cooldownMinutes: 5 # Decrease if needed ``` ### 6. 
MCP Analysis Failures **Symptoms:** ```bash # Controller logs show: # "MCP remediation failed" or "McpRemediationFailed" events ``` **Diagnosis:** ```bash # Check MCP logs for detailed error messages kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50 # Check RemediationPolicy status kubectl describe remediationpolicies --namespace dot-ai ``` **Common Causes:** - Invalid Anthropic API key - API rate limits exceeded - Network connectivity to Anthropic services - Malformed event data ### 7. ResourceSyncConfig Not Syncing **Symptoms:** ```bash # ResourceSyncConfig status shows syncErrors or not active kubectl get resourcesyncconfigs --output yaml ``` **Diagnosis:** ```bash # Check ResourceSyncConfig status kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status}' # Check controller logs for sync errors kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "resourcesync\|sync" # Verify MCP endpoint is reachable kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \ curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync ``` **Common Causes:** - MCP resource sync endpoint not available - Wrong `mcpEndpoint` URL in ResourceSyncConfig - Network policies blocking communication - RBAC permissions missing for resource discovery **Solution:** ```bash # Verify the MCP endpoint URL is correct kubectl get resourcesyncconfigs --output jsonpath='{.items[*].spec.mcpEndpoint}' # Check if watcher is active kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.active}' # Check watched resource types count kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.watchedResourceTypes}' ``` ### 8. 
CapabilityScanConfig Not Scanning **Symptoms:** ```bash # CapabilityScanConfig status shows errors or not ready kubectl get capabilityscanconfigs --output yaml ``` **Diagnosis:** ```bash # Check CapabilityScanConfig status kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status}' # Check controller logs for scan errors kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "capabilityscan" # Verify auth secret exists kubectl get secret dot-ai-secrets --namespace dot-ai ``` **Common Causes:** - MCP endpoint not available - Wrong `mcp.endpoint` URL in CapabilityScanConfig - Missing or invalid `mcp.authSecretRef` secret - Resource filters excluding all resources **Solution:** ```bash # Verify the MCP endpoint URL is correct kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.mcp.endpoint}' # Check if initial scan completed kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.initialScanComplete}' # Check last error kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.lastError}' # Verify include/exclude filters aren't too restrictive kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.includeResources}' ``` ### 9. 
GitKnowledgeSource Not Syncing **Symptoms:** ```bash # GitKnowledgeSource status shows errors or Synced condition is False kubectl get gitknowledgesources --output yaml ``` **Diagnosis:** ```bash # Check GitKnowledgeSource status kubectl get gitknowledgesources -n dot-ai -o jsonpath='{.items[*].status}' # Check controller logs for sync errors kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "gitknowledge\|clone" # Verify MCP endpoint is reachable kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \ curl -v http://dot-ai.dot-ai.svc:3456/health ``` **Common Causes:** - **CloneError with "read-only file system"**: Controller deployment missing `/tmp` volume mount - **Authentication failure**: Invalid or missing token for private repositories - **MCP unreachable**: Wrong MCP server URL or network issues - **Invalid path patterns**: Glob patterns not matching any files **Solution:** ```bash # Check for read-only filesystem error (needs /tmp volume) kubectl get gitknowledgesources -n dot-ai -o jsonpath='{.items[*].status.lastError}' # Verify the controller has a /tmp volume mounted kubectl get deployment dot-ai-controller-manager -n dot-ai -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}' # If missing, patch to add /tmp volume: kubectl patch deployment dot-ai-controller-manager -n dot-ai --type='json' -p='[ {"op": "add", "path": "/spec/template/spec/volumes", "value": [{"name": "tmp-dir", "emptyDir": {}}]}, {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts", "value": [{"name": "tmp-dir", "mountPath": "/tmp"}]} ]' # For private repo auth issues, verify the secret exists and decode its token kubectl get secret <secret-name> -n dot-ai -o jsonpath='{.data.<token-key>}' | base64 -d ``` ### 10.
ResourceSync High Traffic or Performance Issues **Symptoms:** - High CPU/memory usage on controller - Frequent sync requests to MCP - Slow cluster performance **Diagnosis:** ```bash # Check sync frequency and resource counts kubectl get resourcesyncconfigs --output yaml | grep -A5 status # Check debounce and resync settings kubectl get resourcesyncconfigs --output yaml | grep -E "debounceWindowSeconds|resyncIntervalMinutes" ``` **Solution:** Adjust debounce and resync intervals in your ResourceSyncConfig: ```yaml spec: debounceWindowSeconds: 30 # Increase to batch more changes resyncIntervalMinutes: 120 # Increase to reduce full resyncs ``` ## Getting Help ### Collect Diagnostic Information When reporting issues, include this diagnostic information: ```bash # Controller status and logs kubectl get pods --namespace dot-ai kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 100 # MCP status and logs kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50 # RemediationPolicy configuration kubectl get remediationpolicies --namespace dot-ai --output yaml # ResourceSyncConfig configuration and status kubectl get resourcesyncconfigs --all-namespaces --output yaml # CapabilityScanConfig configuration and status kubectl get capabilityscanconfigs --all-namespaces --output yaml # Recent events kubectl get events --namespace dot-ai --sort-by='.lastTimestamp' --field-selector type=Warning ``` ### Enable Debug Logging For more detailed troubleshooting, you can increase log verbosity: ```bash # Edit the controller deployment to add debug flags kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --patch=' { "spec": { "template": { "spec": { "containers": [ { "name": "manager", "args": ["--leader-elect", "--health-probe-bind-address=:8081", "-v=2"] } ] } } } }' ``` ## Resource Requirements The default resource limits are: **Controller:** - Limits: 500m CPU, 128Mi memory - Requests: 10m CPU, 64Mi 
memory **MCP:** - Limits: 1 CPU, 2Gi memory - Requests: 200m CPU, 512Mi memory These should be sufficient for most use cases, but may need adjustment for high-volume environments. ## Web UI # DevOps AI Toolkit Web UI **Kubernetes dashboard with AI-powered operations and rich visualizations for the DevOps AI Toolkit MCP server.** **DevOps AI Toolkit Web UI** - Rich visualizations for AI-powered DevOps workflows. # Kubernetes Setup Guide **Deploy the DevOps AI Toolkit Web UI to Kubernetes using the Helm chart.** > **Recommended**: For the easiest setup, install the complete dot-ai stack, which includes all components (MCP server, Web UI, and Controller). See the [Stack Installation Guide](/docs/stack). > > Continue below if you want to install this component individually (for non-Kubernetes setups or granular control). ## Prerequisites - Kubernetes cluster (1.19+) with kubectl access - Helm 3.x installed - [dot-ai MCP server](/docs/ai-engine/setup/deployment) deployed and accessible - Auth token for the dot-ai MCP server ## Quick Start ### Step 1: Set Environment Variables ```bash # Set the version from https://github.com/vfarcic/dot-ai-ui/pkgs/container/dot-ai-ui%2Fcharts%2Fdot-ai-ui export DOT_AI_UI_VERSION="..." # Use the same auth token as your dot-ai MCP server export DOT_AI_AUTH_TOKEN="your-dot-ai-auth-token" # Token for UI login export DOT_AI_UI_AUTH_TOKEN="your-ui-access-token" # Ingress class - change to match your ingress controller (traefik, haproxy, etc.)
export INGRESS_CLASS_NAME="nginx" ``` ### Step 2: Install the Web UI ```bash helm install dot-ai-ui \ oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION \ --set dotAi.url="http://dot-ai:3456" \ --set dotAi.auth.token="$DOT_AI_AUTH_TOKEN" \ --set uiAuth.token="$DOT_AI_UI_AUTH_TOKEN" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai-ui.127.0.0.1.nip.io" \ --namespace dot-ai \ --wait ``` **Notes**: - Replace `dot-ai-ui.127.0.0.1.nip.io` with your desired hostname. - The `dotAi.url` should point to your dot-ai MCP server service. If deployed in the same namespace with default settings, `http://dot-ai:3456` works. - The chart includes nginx timeout annotations by default (10 min). **If using a different ingress controller**, override the annotations (see [Ingress Timeout Configuration](#ingress-timeout-configuration)). - For all available configuration options, see the [Helm values file](https://github.com/vfarcic/dot-ai-ui/blob/main/charts/values.yaml). ### Step 3: Configure dot-ai MCP Server Update your dot-ai MCP server to include visualization URLs in responses: ```bash helm upgrade dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set webUi.baseUrl="http://dot-ai-ui.127.0.0.1.nip.io" \ --namespace dot-ai \ --reuse-values ``` ### Step 4: Verify Installation Open your browser and navigate to the Web UI hostname. You should see the DevOps AI Toolkit Web UI home page. 
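As an alternative to the long `--set` list in Step 2, the same settings can be kept in a local values file (a sketch; the filename is arbitrary, and the tokens and hostname are placeholders you must replace — the parameter paths match the Configuration Reference below):

```yaml
# dot-ai-ui-values.yaml (illustrative filename)
dotAi:
  url: http://dot-ai:3456          # dot-ai MCP server service URL
  auth:
    token: your-dot-ai-auth-token  # same token as the MCP server
uiAuth:
  token: your-ui-access-token      # token users enter to log in to the UI
ingress:
  enabled: true
  className: nginx
  host: dot-ai-ui.127.0.0.1.nip.io
```

Then install with `helm install dot-ai-ui oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION --values dot-ai-ui-values.yaml --namespace dot-ai --wait`.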
## Configuration Reference | Parameter | Description | Default | |-----------|-------------|---------| | `annotations` | Global annotations applied to all resources (e.g., `reloader.stakater.com/auto: "true"`) | `{}` | | `image.repository` | Container image repository | `ghcr.io/vfarcic/dot-ai-ui` | | `image.tag` | Container image tag | Chart appVersion | | `image.pullPolicy` | Image pull policy | `IfNotPresent` | | `dotAi.url` | URL of the dot-ai MCP server | `http://dot-ai:3456` | | `dotAi.auth.secretRef.name` | Name of existing secret with auth token | `dot-ai-secrets` | | `dotAi.auth.secretRef.key` | Key in existing secret | `auth-token` | | `dotAi.auth.token` | Auth token (if not using secretRef) | `""` | | `uiAuth.secretRef.name` | Existing secret with UI auth token | `""` | | `uiAuth.secretRef.key` | Key in existing secret | `ui-auth-token` | | `uiAuth.token` | UI auth token (if not using secretRef) | `""` | | `ingress.enabled` | Enable Ingress resource | `false` | | `ingress.className` | Ingress class name | `nginx` | | `ingress.host` | Ingress hostname | `dot-ai-ui.127.0.0.1.nip.io` | | `ingress.annotations` | Ingress annotations (includes nginx timeouts by default) | nginx timeout annotations | | `ingress.tls.enabled` | Enable TLS | `false` | | `ingress.tls.secretName` | TLS secret name | `""` | | `ingress.tls.clusterIssuer` | cert-manager ClusterIssuer | `""` | | `resources.requests.memory` | Memory request | `128Mi` | | `resources.requests.cpu` | CPU request | `50m` | | `resources.limits.memory` | Memory limit | `256Mi` | | `resources.limits.cpu` | CPU limit | `200m` | ## Using Secret Reference For production, reference existing secrets instead of passing tokens directly: ```bash helm install dot-ai-ui \ oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION \ --set dotAi.url="http://dot-ai:3456" \ --set dotAi.auth.secretRef.name="dot-ai-secrets" \ --set dotAi.auth.secretRef.key="auth-token" \ --set uiAuth.secretRef.name="dot-ai-secrets" \ 
--set uiAuth.secretRef.key="ui-auth-token" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai-ui.example.com" \ --namespace dot-ai \ --wait ``` ## TLS Configuration Enable HTTPS with cert-manager: ```bash helm install dot-ai-ui \ oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION \ --set dotAi.url="http://dot-ai:3456" \ --set dotAi.auth.secretRef.name="dot-ai-secrets" \ --set uiAuth.secretRef.name="dot-ai-secrets" \ --set uiAuth.secretRef.key="ui-auth-token" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai-ui.example.com" \ --set ingress.tls.enabled=true \ --set ingress.tls.clusterIssuer="letsencrypt" \ --namespace dot-ai \ --wait ``` Then update your `.mcp.json` URL to use `https://`. ## Ingress Timeout Configuration The Web UI proxies requests to the MCP server for AI-powered operations (query, remediate, operate, recommend) that can take several minutes. The chart includes **nginx timeout annotations by default** (10 minutes). If you use a different ingress controller, override `ingress.annotations` with the appropriate settings: ### Nginx (default) ```yaml ingress: annotations: nginx.ingress.kubernetes.io/proxy-read-timeout: "600" nginx.ingress.kubernetes.io/proxy-send-timeout: "600" ``` ### Traefik ```yaml ingress: className: traefik annotations: traefik.ingress.kubernetes.io/router.middlewares: dot-ai-timeout@kubernetescrd ``` > **Note**: Traefik requires a separate `Middleware` resource for timeout configuration. See the [Traefik docs](https://doc.traefik.io/traefik/middlewares/http/headers/). 
### HAProxy ```yaml ingress: className: haproxy annotations: haproxy.org/timeout-http-request: "600s" ``` ### AWS ALB ```yaml ingress: className: alb annotations: alb.ingress.kubernetes.io/target-group-attributes: idle_timeout.timeout_seconds=600 ``` ## Gateway API (Alternative to Ingress) For Kubernetes 1.26+ with Gateway API support, you can use HTTPRoute instead of Ingress. ### Prerequisites - Kubernetes 1.26+ cluster - Gateway API CRDs installed - Gateway controller running (Istio, Envoy Gateway, Kong, etc.) - Existing Gateway resource ### Reference Existing Gateway ```bash helm install dot-ai-ui \ oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION \ --set dotAi.url="http://dot-ai:3456" \ --set dotAi.auth.secretRef.name="dot-ai-secrets" \ --set uiAuth.secretRef.name="dot-ai-secrets" \ --set uiAuth.secretRef.key="ui-auth-token" \ --set ingress.enabled=false \ --set gateway.name="cluster-gateway" \ --set gateway.namespace="gateway-system" \ --namespace dot-ai \ --wait ``` ### Gateway Configuration Reference | Parameter | Description | Default | |-----------|-------------|---------| | `gateway.name` | Existing Gateway name to reference | `""` | | `gateway.namespace` | Gateway namespace (for cross-namespace) | `""` | | `gateway.create` | Create new Gateway (dev/testing only) | `false` | | `gateway.className` | GatewayClass name (when create=true) | `""` | | `gateway.annotations` | Annotations for Gateway (when create=true) | `{}` | | `gateway.timeouts.request` | Max time for entire request (HTTPRoute) | `"600s"` | | `gateway.timeouts.backendRequest` | Max time waiting for backend response (HTTPRoute) | `"600s"` | | `gateway.listeners.http.enabled` | Enable HTTP listener on port 80 | `true` | | `gateway.listeners.http.hostname` | Hostname for HTTP listener | `""` | | `gateway.listeners.https.enabled` | Enable HTTPS listener on port 443 | `false` | | `gateway.listeners.https.hostname` | Hostname for HTTPS listener | `""` | | 
`gateway.listeners.https.secretName` | TLS secret name for HTTPS | `""` | ### Timeout Configuration The chart sets HTTPRoute timeouts to **10 minutes** by default (`gateway.timeouts.request` and `gateway.timeouts.backendRequest`), which accommodates AI-powered operations. You can override these values: ```bash helm install dot-ai-ui ... \ --set gateway.timeouts.request="1800s" \ --set gateway.timeouts.backendRequest="1800s" ``` **Not all Gateway controllers support HTTPRoute timeouts.** For example, GKE's Gateway controller rejects them with error GWCER104. To disable HTTPRoute timeouts, set both values to empty strings: ```bash helm install dot-ai-ui ... \ --set gateway.timeouts.request="" \ --set gateway.timeouts.backendRequest="" ``` ### Cloud Provider Considerations Some cloud providers replace HTTPRoute timeouts with provider-specific policy resources, which may also govern related gateway behavior: - **Timeouts on GKE**: GKE does **not** support HTTPRoute timeouts (GWCER104). You must disable them (set to `""`) and use `GCPBackendPolicy` instead (see below). - **Health check configuration**: Custom health check intervals or thresholds. - **Security policies**: WAF rules, rate limiting at the gateway level. **GKE Example**: Disable HTTPRoute timeouts and create a `GCPBackendPolicy` for timeout configuration: ```yaml apiVersion: networking.gke.io/v1 kind: GCPBackendPolicy metadata: name: dot-ai-ui namespace: dot-ai spec: default: timeoutSec: 3600 targetRef: group: "" kind: Service name: dot-ai-ui ``` Consult your cloud provider's Gateway API documentation for equivalent configurations on other platforms. ## Additional Configuration | Parameter | Description | Default | |-----------|-------------|---------| | `extraEnv` | Additional environment variables | `[]` |
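A sketch of an `extraEnv` entry, assuming it follows the standard Kubernetes container `env` list format (the variable shown is illustrative, not one the UI is guaranteed to read):

```yaml
extraEnv:
  - name: HTTPS_PROXY                     # hypothetical example variable
    value: "http://proxy.internal:3128"
```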