# DevOps AI Toolkit Stack

> AI-powered Kubernetes operations for DevOps teams. Deploy, query, remediate, and operate Kubernetes resources through natural language using MCP, CLI, a Kubernetes controller, and a web UI.

**Deploy the complete DevOps AI Toolkit stack with a single Helm command.**

## Overview

The dot-ai-stack umbrella chart installs all DevOps AI Toolkit components with a single command:

- **DevOps AI Toolkit** - MCP server for AI-powered Kubernetes operations
- **DevOps AI Toolkit Controller** - Kubernetes controller for intelligent resource management and autonomous operations
- **DevOps AI Toolkit Web UI** - Web interface for visual cluster management

> **Note:** This guide covers Kubernetes deployment using the umbrella Helm chart. For other installation options (Docker, NPX, individual charts, etc.), see [devopstoolkit.ai](/).

## Prerequisites

- **Kubernetes cluster** with an ingress controller
- **Helm 3.x** installed
- **kubectl** configured with cluster access
- **AI API keys** for AI-powered features (Anthropic and/or OpenAI)

## Step 1: Create a Local Cluster (Optional)

> Skip this step if you already have a Kubernetes cluster with an ingress controller.

Create a Kind cluster with ingress port mappings. The configuration below is the standard ingress-ready Kind setup, mapping host ports 80 and 443 to the node:

```bash
kind create cluster --name dot-ai-stack --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
EOF
```

If the cluster does not yet have an ingress controller, install one (for example, ingress-nginx) before continuing; the stack's ingress resources need it to serve traffic.

## Step 2: Configure Credentials

Export your AI provider API keys so they can be passed to the chart:

```bash
export ANTHROPIC_API_KEY=<your Anthropic API key>
export OPENAI_API_KEY=<your OpenAI API key>
```

> **Note:** Multiple AI providers are supported. See [AI Model Configuration](/docs/ai-engine/setup/deployment#ai-model-configuration) for all options including Google Gemini, AWS Bedrock, Azure OpenAI, and others.
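Because the Helm install in Step 3 interpolates these variables directly, an unset variable fails silently (Helm just receives an empty string). The sketch below is a hypothetical pre-flight guard, not part of the toolkit; the variable names are the ones this guide uses:

```shell
# Hypothetical pre-flight helper (not part of the toolkit): warn about any
# variable the install command expects that is unset or empty.
require_env() {
  missing=0
  for v in "$@"; do
    if [ -z "$(printenv "$v")" ]; then
      echo "Missing required environment variable: $v" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Check the provider keys before continuing with the install.
require_env ANTHROPIC_API_KEY OPENAI_API_KEY ||
  echo "Set the variables above before installing the stack." >&2
```

Running this before `helm upgrade --install` turns a confusing mid-deploy failure into an immediate, readable message.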
Generate random authentication tokens for the MCP server and Web UI:

```bash
export DOT_AI_AUTH_TOKEN=$(openssl rand -base64 32)
export DOT_AI_UI_AUTH_TOKEN=$(openssl rand -base64 32)
```

## Step 3: Install the Stack

Install the complete dot-ai stack with a single Helm command:

```bash
helm upgrade --install dot-ai-stack \
  oci://ghcr.io/vfarcic/dot-ai-stack/charts/dot-ai-stack \
  --namespace dot-ai --create-namespace \
  --set dot-ai.secrets.anthropic.apiKey=$ANTHROPIC_API_KEY \
  --set dot-ai.secrets.openai.apiKey=$OPENAI_API_KEY \
  --set dot-ai.secrets.auth.token=$DOT_AI_AUTH_TOKEN \
  --set dot-ai.ingress.enabled=true \
  --set dot-ai.ingress.className=nginx \
  --set dot-ai.ingress.host=dot-ai.127.0.0.1.nip.io \
  --set dot-ai.webUI.baseUrl=http://dot-ai-ui.127.0.0.1.nip.io \
  --set dot-ai-ui.uiAuth.token=$DOT_AI_UI_AUTH_TOKEN \
  --set dot-ai-ui.ingress.enabled=true \
  --set dot-ai-ui.ingress.host=dot-ai-ui.127.0.0.1.nip.io \
  --wait
```

> **Note:** Replace the ingress hosts with your actual domain names for production deployments.
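The long list of `--set` flags can also be kept in a values file, which is easier to review and version (mind that it contains secrets, so keep it out of version control). This is a sketch; the keys mirror the `--set` flags in this guide, but confirm them against the chart's own `values.yaml` before relying on the layout:

```shell
# Write the overrides from this guide to a values file. Keys are assumed from
# the --set flags shown above; verify against the chart's values.yaml.
cat > dot-ai-values.yaml << EOF
dot-ai:
  secrets:
    anthropic:
      apiKey: ${ANTHROPIC_API_KEY}
    openai:
      apiKey: ${OPENAI_API_KEY}
    auth:
      token: ${DOT_AI_AUTH_TOKEN}
  ingress:
    enabled: true
    className: nginx
    host: dot-ai.127.0.0.1.nip.io
  webUI:
    baseUrl: http://dot-ai-ui.127.0.0.1.nip.io
dot-ai-ui:
  uiAuth:
    token: ${DOT_AI_UI_AUTH_TOKEN}
  ingress:
    enabled: true
    host: dot-ai-ui.127.0.0.1.nip.io
EOF

# Then install with:
#   helm upgrade --install dot-ai-stack \
#     oci://ghcr.io/vfarcic/dot-ai-stack/charts/dot-ai-stack \
#     --namespace dot-ai --create-namespace \
#     -f dot-ai-values.yaml --wait
```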
This installs:

- **dot-ai** - MCP server with ingress at `dot-ai.127.0.0.1.nip.io`
- **dot-ai-controller** - Kubernetes controller for autonomous operations
- **dot-ai-ui** - Web interface at `dot-ai-ui.127.0.0.1.nip.io`
- **Qdrant** - Vector database for pattern and policy storage
- **ResourceSyncConfig** - Enables resource discovery
- **CapabilityScanConfig** - Enables cluster capability scanning

## Step 4: Verify Installation

Check that all pods are running:

```bash
kubectl get pods --namespace dot-ai
```

You should see all pods in `Running` status:

```
NAME                                        READY   STATUS    RESTARTS   AGE
dot-ai-577db5b4fc-j8kgf                     1/1     Running   0          50s
dot-ai-controller-manager-c898b5697-dqk2m   1/1     Running   0          50s
dot-ai-stack-qdrant-0                       1/1     Running   0          50s
dot-ai-ui-69d586db8b-ccqrm                  1/1     Running   0          50s
```

Test the MCP server health:

```bash
curl -H "Authorization: Bearer $DOT_AI_AUTH_TOKEN" \
  http://dot-ai.127.0.0.1.nip.io/healthz
```

Expected output:

```json
{"status":"ok"}
```

## Step 5: Choose Your Client

The DevOps AI Toolkit can be accessed through two client options - **MCP** or **CLI**. Both provide AI agent integration with full feature parity.

### Option A: MCP Client

**Best for:** Curated high-level operations designed to minimize context window usage.

Create the MCP client configuration file:

```bash
cat > .mcp.json << EOF
{
  "mcpServers": {
    "dot-ai": {
      "type": "http",
      "url": "http://dot-ai.127.0.0.1.nip.io",
      "headers": {
        "Authorization": "Bearer $DOT_AI_AUTH_TOKEN"
      }
    }
  }
}
EOF
```

> **Note:** This example creates `.mcp.json` in the current directory for Claude Code. Other MCP-enabled agents may expect the configuration in a different location (e.g., `~/.config/` or within the agent's settings). Consult your agent's documentation for the correct path.

**Learn more:** [MCP Setup Documentation](/docs/ai-engine/setup/deployment)

### Option B: CLI Client

**Best for:** Comprehensive API access with lower token overhead for AI agents, plus scripting and automation support.
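Scripts built on either client typically need to wait for the server to become healthy before issuing requests. The `retry` function below is illustrative, not part of the toolkit; the commented usage assumes the health endpoint and token from Step 4:

```shell
# Illustrative retry helper for automation that calls the MCP server or CLI:
# run a command until it succeeds or the attempt budget is exhausted.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example: block until the health endpoint from Step 4 responds.
# retry 30 curl -fsS -H "Authorization: Bearer $DOT_AI_AUTH_TOKEN" \
#   http://dot-ai.127.0.0.1.nip.io/healthz
```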
Install the CLI:

**macOS (Homebrew):**

```bash
brew install vfarcic/tap/dot-ai
```

**Windows (Scoop):**

```powershell
scoop bucket add dot-ai https://github.com/vfarcic/scoop-dot-ai
scoop install dot-ai
```

**Other platforms:** Download from [releases](https://github.com/vfarcic/dot-ai-cli/releases) or see the [installation guide](/docs/cli/setup/installation/).

Configure the CLI:

```bash
export DOT_AI_URL="http://dot-ai.127.0.0.1.nip.io"
```

Generate skills for your AI agent:

```bash
# For Claude Code
dot-ai skills generate --agent claude-code

# For Cursor
dot-ai skills generate --agent cursor

# For Windsurf
dot-ai skills generate --agent windsurf
```

**Learn more:** [CLI Quick Start](/docs/cli/quick-start/) | [Installation](/docs/cli/setup/installation/) | [Agent Integration](/docs/cli/guides/skills-generation/)

### Choosing Between MCP and CLI

- **Use MCP** for simpler, high-level operations with minimal tool descriptions
- **Use CLI** for comprehensive API access with lower token costs and better economy for agents executing multiple commands

## Step 6: Start Using

Launch your AI agent:

```bash
claude
```

> **Note:** If your agent doesn't automatically detect the client, explicitly invoke it with "Use dot-ai MCP" or "Use dot-ai CLI" depending on which client you configured.

Try these example prompts:

| What You Want | Example Prompt |
|---------------|----------------|
| Check system status | "Show dot-ai status" |
| Query cluster | "What pods are running in the dot-ai namespace?" |
| List capabilities | "List all capabilities" |
| Deploy an app | "I want to deploy a web application" |
| Fix issues | "Something is wrong with my database" |

## Configuration

Override any component value by prefixing it with the chart name:

```bash
--set dot-ai.resources.limits.memory=4Gi
--set dot-ai-controller.resources.limits.memory=1Gi
--set dot-ai-ui.ingress.host=ui.example.com
```

For available options, see each component's documentation:

- [DevOps AI Toolkit values](/docs/ai-engine/setup/deployment)
- [Controller values](/docs/controller/)
- [Web UI values](/docs/ui/)

## Next Steps

- [MCP Tools Overview](/docs/ai-engine/tools/overview) - Complete feature reference
- [Pattern Management](/docs/ai-engine/organizational-data/patterns) - Create organizational patterns
- [Policy Management](/docs/ai-engine/organizational-data/policies) - Define governance policies

## Cleanup

To remove the stack:

```bash
helm uninstall dot-ai-stack --namespace dot-ai
kubectl delete namespace dot-ai
```

To delete the Kind cluster:

```bash
kind delete cluster --name dot-ai-stack
```

# Operate Feature Architecture

This document provides a detailed architecture overview of the Operate feature in the DevOps AI Toolkit.

## Overview

The Operate feature provides AI-powered Day 2 operations for Kubernetes applications. It handles updates, scaling, enhancements, rollbacks, and deletions through natural language intents while applying organizational patterns and policies, validating changes via dry-run, and executing approved operations safely.

## High-Level Architecture

```mermaid
flowchart TB
    subgraph Users["User / AI Agent"]
        Agent["Claude Code, Cursor,
VS Code, etc."]
    end
    subgraph MCP["MCP Server (dot-ai)"]
        Operate["operate Tool"]
        AI["AI Provider"]
        Session["Session
Manager"]
        Kubectl["Kubectl Tools"]
        VectorClient["Vector DB
Client"]
    end
    subgraph External["External Services"]
        LLM["Claude, OpenAI,
Gemini, etc."]
        Qdrant["Qdrant
(Patterns, Policies,
Capabilities)"]
    end
    subgraph K8s["Kubernetes Cluster"]
        API["Kubernetes API"]
        Controller["Controller
(dot-ai-controller)"]
        Resources["Cluster Resources
Deployments, StatefulSets,
CRDs, etc."]
    end
    subgraph WebUI["Web UI (dot-ai-ui)"]
        Viz["Visualization Dashboard
- Current State
- Proposed Changes
- Risk Assessment"]
    end

    Agent <-->|MCP Protocol| Operate
    Operate --> AI
    Operate --> Session
    Operate --> Kubectl
    Operate --> VectorClient
    AI --> LLM
    AI <-->|Tool Loop| Kubectl
    VectorClient --> Qdrant
    Kubectl --> API
    Operate -->|Execute Commands| API
    Controller -->|Scan & Embed| Qdrant
    Controller -->|Watch| Resources
    Agent -.->|User opens
Visualization URL| WebUI
```

## Operation Workflow

The operate tool implements a three-phase workflow with persistent session management:

```mermaid
flowchart TD
    subgraph Phase1["Phase 1: Intent Analysis"]
        Intent["User Intent
(natural language)"]
        CreateSession["Create Session
(opr-{ts}-{uuid})"]
        EmbedContext["Embed Context:
- Search Patterns
- Search Policies
- Search Capabilities"]
        ContextCheck{"Capabilities
Found?"}
        NoCapError["Error: Run
capability scan first"]
        Investigation["AI Investigation Loop
(max 30 iterations)"]
        KubectlTools["kubectl Tools:
get, describe, logs,
patch_dryrun, apply_dryrun,
delete_dryrun, get_crd_schema"]

        Intent --> CreateSession --> EmbedContext
        EmbedContext --> ContextCheck
        ContextCheck -->|No| NoCapError
        ContextCheck -->|Yes| Investigation
        Investigation <-->|Tool Calls| KubectlTools
    end
    subgraph Phase2["Phase 2: Analysis & Validation"]
        ParseResponse["Parse AI Response"]
        DryRunCheck{"Dry-Run
Validated?"}
        RetryAnalysis["AI Iterates with
Corrected Commands"]
        Analysis["Generate Analysis:
- Current State
- Proposed Changes
- Commands
- Risk Assessment"]
        SaveSession["Save Session
(status: analysis_complete)"]
        ReturnAnalysis["Return with
Visualization URL"]

        Investigation --> ParseResponse --> DryRunCheck
        DryRunCheck -->|No| RetryAnalysis --> Investigation
        DryRunCheck -->|Yes| Analysis
        Analysis --> SaveSession --> ReturnAnalysis
    end
    subgraph Phase3["Phase 3: User Approval & Execution"]
        UserChoice["User Approval
(executeChoice=1)"]
        LoadSession["Load Session"]
        ExecuteCommands["Execute Commands
(continue-on-error)"]
        PostValidation{"Has
validationIntent?"}
        CallRemediate["Call remediate
Internally"]
        SaveResults["Save Results
(status: executed_*)"]
        ReturnResults["Return Execution
Results"]

        ReturnAnalysis -.->|"User reviews
and approves"| UserChoice
        UserChoice --> LoadSession --> ExecuteCommands
        ExecuteCommands --> PostValidation
        PostValidation -->|Yes| CallRemediate --> SaveResults
        PostValidation -->|No| SaveResults
        SaveResults --> ReturnResults
    end
```

## Component Details

### MCP Server (dot-ai)

The MCP server provides the core operations engine:

| Component | File | Description |
|-----------|------|-------------|
| `operate` tool | `src/tools/operate.ts` | Entry point, routing, context embedding, formatting |
| Analysis workflow | `src/tools/operate-analysis.ts` | Intent analysis, AI tool loop, response parsing |
| Execution workflow | `src/tools/operate-execution.ts` | Command execution, post-validation, results |
| System Prompt | `prompts/operate-system.md` | AI instructions for operation behavior |
| User Prompt | `prompts/operate-user.md` | Handlebars template with context injection |
| `GenericSessionManager` | `src/core/generic-session-manager.ts` | File-based session persistence |
| `AIProvider` | `src/core/ai-provider.interface.ts` | AI abstraction with tool loop support |
| `kubectl-tools` | `src/core/kubectl-tools.ts` | Kubectl investigation and validation tools |
| Vector Services | `src/services/*-vector-service.ts` | Pattern, policy, capability search |
| `visualization` | `src/core/visualization.ts` | URL generation for web UI |

### Kubectl Investigation & Validation Tools

Tools available during AI analysis:

| Tool | Description |
|------|-------------|
| `kubectl_api_resources` | Discover available resources in cluster |
| `kubectl_get` | List resources with table format |
| `kubectl_describe` | Detailed resource information with events |
| `kubectl_logs` | Container logs for debugging |
| `kubectl_patch_dryrun` | Validate patch operations before execution |
| `kubectl_apply_dryrun` | Validate apply operations before execution |
| `kubectl_delete_dryrun` | Validate delete operations before execution |
| `kubectl_get_crd_schema` | Get CRD schema for custom resources |

### Controller (dot-ai-controller)

The Kubernetes controller provides capability scanning:

| Component | File | Description |
|-----------|------|-------------|
| Capability Scanner | `internal/controller/capability_scanner.go` | Discovers cluster resources and capabilities |
| Embedding Service | `internal/controller/embedding_service.go` | Generates embeddings for semantic search |
| Qdrant Client | `internal/controller/qdrant_client.go` | Stores capabilities in vector database |

### Web UI (dot-ai-ui)

Provides visualization for operation analysis and execution:

| Component | File | Description |
|-----------|------|-------------|
| Visualization Page | `src/pages/Visualization.tsx` | Main page for `/v/{sessionId}` |
| MermaidRenderer | `src/components/renderers/MermaidRenderer.tsx` | Interactive flowcharts |
| CardRenderer | `src/components/renderers/CardRenderer.tsx` | Current state and proposed changes |
| CodeRenderer | `src/components/renderers/CodeRenderer.tsx` | Commands with syntax highlighting |
| InsightsPanel | `src/components/InsightsPanel.tsx` | AI observations and risk assessment |

## Integration Points

```mermaid
flowchart LR
    subgraph MCP["MCP Server"]
        Operate["operate tool"]
        AIProvider["AI Provider"]
        KubectlTools["Kubectl Tools"]
        SessionMgr["Session Manager"]
        VectorSvc["Vector Services"]
    end
    subgraph AI["AI Providers"]
        Anthropic["Claude API"]
        OpenAI["OpenAI API"]
        Google["Gemini API"]
        Others["xAI, Bedrock,
OpenRouter, etc."]
    end
    subgraph VectorDB["Vector Database"]
        Patterns["Patterns
Collection"]
        Policies["Policies
Collection"]
        Capabilities["Capabilities
Collection"]
    end
    subgraph K8s["Kubernetes"]
        API["API Server"]
        Controller["Capability
Scanner"]
    end
    subgraph Storage["Session Storage"]
        Files["File System
~/.dot-ai/sessions/opr-sessions/"]
    end
    subgraph UI["Web UI"]
        Viz["Visualization
Dashboard"]
    end

    AIProvider <-->|Tool Loop| KubectlTools
    AIProvider --> Anthropic
    AIProvider --> OpenAI
    AIProvider --> Google
    AIProvider --> Others
    VectorSvc --> Patterns
    VectorSvc --> Policies
    VectorSvc --> Capabilities
    KubectlTools -->|Investigation| API
    Operate -->|Execute Commands| API
    Controller -->|Scan & Embed| Capabilities
    SessionMgr --> Files
    Operate -.->|Session URL| Viz
```

### MCP Server ↔ AI Provider

- **Tool Loop**: AI iteratively calls kubectl tools (max 30 iterations)
- **Investigation**: Gathers current cluster state to understand resources
- **Dry-Run Validation**: Validates all commands before proposing
- **Analysis**: Generates JSON response with changes, commands, and risk assessment

### MCP Server ↔ Vector Database

- **Patterns**: Organizational patterns for operational best practices
- **Policies**: Policy intents for validation and compliance
- **Capabilities**: Cluster resource capabilities for intelligent recommendations
- Capabilities are **mandatory**; patterns/policies are optional

### MCP Server ↔ Kubernetes API

- **Read Operations**: `kubectl get`, `describe`, `logs`
- **Validation**: `kubectl patch/apply/delete --dry-run=server`
- **Execution**: Sequential command execution via `child_process.exec()`

### MCP Server ↔ Remediate Tool

- **Post-Execution Validation**: Internally calls remediate with `validationIntent`
- **Verification**: Confirms operations completed successfully
- **Error Detection**: Identifies issues introduced by operations

### MCP Server ↔ Web UI

- **Session Storage**: Operation data stored with session IDs
- **Visualization API**: `/api/v1/visualize/{sessionId}` endpoint
- **URL Generation**: `WEB_UI_BASE_URL/v/{sessionId}`

## Session Management

Sessions persist workflow state across tool calls:

```
Session ID Format: opr-{timestamp}-{uuid8}
Example: opr-1704067200000-a1b2c3d4

Session Data:
├── toolName: 'operate'
├── intent: "Update my-api to v2.0 with zero downtime"
├── context:
│   ├── patterns: OrganizationalPattern[]
│   ├── policies: PolicyIntent[]
│   └── capabilities: ResourceCapability[]
├── proposedChanges:
│   ├── create: ResourceChange[]
│   ├── update: ResourceChange[]
│   └── delete: ResourceChange[]
├── commands: ["kubectl set image...", "kubectl patch..."]
├── dryRunValidation:
│   ├── status: 'success' | 'failed'
│   └── details: string
├── patternsApplied: ["Zero-Downtime Rolling Update"]
├── capabilitiesUsed: ["metrics-server", "KEDA"]
├── policiesChecked: ["Production Update Policy"]
├── risks: { level: 'low', description: "..." }
├── validationIntent: "Verify deployment rollout complete"
├── status: 'analyzing' | 'analysis_complete' | 'executing' | 'executed_*'
└── executionResults: [{command, success, output, error}]
```

### Session States

| State | Description |
|-------|-------------|
| `analyzing` | AI is gathering data and generating commands |
| `analysis_complete` | Analysis done, awaiting user approval |
| `executing` | Commands are being executed |
| `executed_successfully` | All commands succeeded |
| `executed_with_errors` | Some commands failed |
| `failed` | Analysis or execution failed |

## Organizational Context Integration

The operate tool integrates organizational knowledge via vector database search:

```mermaid
flowchart TB
    subgraph Input["User Intent"]
        Intent["Update my-api to v2.0
with zero downtime"]
    end
    subgraph Search["Vector DB Search"]
        PatternSearch["Search Patterns
(limit: 5)"]
        PolicySearch["Search Policies
(limit: 5)"]
        CapSearch["Search Capabilities
(limit: 50)"]
    end
    subgraph Results["Search Results"]
        Patterns["Patterns Found:
- Zero-Downtime Rolling Update
- Blue-Green Deployment"]
        Policies["Policies Found:
- Production Update Policy
- Change Management"]
        Caps["Capabilities Found:
- metrics-server
- KEDA Operator
- Argo Rollouts"]
    end
    subgraph Prompt["AI Prompt"]
        Context["Embedded Context:
patterns + policies + capabilities"]
    end

    Intent --> PatternSearch & PolicySearch & CapSearch
    PatternSearch --> Patterns
    PolicySearch --> Policies
    CapSearch --> Caps
    Patterns & Policies & Caps --> Context
```

### Context Priority

1. **Capabilities** (Mandatory): What the cluster can do
2. **Patterns** (Optional): Organizational best practices
3. **Policies** (Optional): Compliance and validation rules

## Output Formats

The operate tool returns structured output at different stages:

### Analysis Response

| Field | Description |
|-------|-------------|
| `status` | `awaiting_user_approval` |
| `sessionId` | Session ID for continuation |
| `visualizationUrl` | URL to view analysis in web UI |
| `currentState` | Current cluster resource state |
| `proposedChanges` | Create, update, delete operations |
| `commands` | Pre-validated kubectl commands |
| `dryRunValidation` | Dry-run validation results |
| `patternsApplied` | Applied organizational patterns |
| `capabilitiesUsed` | Used cluster capabilities |
| `policiesChecked` | Checked policies |
| `risks` | Risk assessment (level + description) |
| `validationIntent` | Post-execution validation instructions |

### Execution Response

| Field | Description |
|-------|-------------|
| `status` | `success` or `failed` |
| `sessionId` | Session ID for reference |
| `results` | Per-command execution results |
| `validation` | Post-execution validation summary |
| `message` | Human-readable summary |

## Error Handling

The operation workflow includes robust error handling:

1. **No Capabilities Found**: Clear guidance to run a capability scan first
2. **Session Not Found**: Guidance to start a new operation
3. **Dry-Run Failures**: AI iterates to fix commands before proposing
4. **Command Execution Failures**: Continue-on-error, capture all results
5. **Validation Failures**: Report issues via remediate tool integration
6. **AI Service Errors**: Logged with request IDs for debugging
7. **Investigation Timeouts**: A 30-iteration cap prevents infinite loops

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `AI_PROVIDER` | AI provider selection | `anthropic` |
| `ANTHROPIC_API_KEY` | Anthropic API key | Required if used |
| `OPENAI_API_KEY` | OpenAI API key | Required if used |
| `QDRANT_URL` | Qdrant vector database URL | `http://localhost:6333` |
| `QDRANT_API_KEY` | Qdrant API key | Optional |
| `QDRANT_CAPABILITIES_COLLECTION` | Capabilities collection name | `capabilities` |
| `KUBECONFIG` | Kubernetes config path | Auto-detected |
| `DOT_AI_SESSION_DIR` | Session storage directory | `~/.dot-ai/sessions` |
| `WEB_UI_BASE_URL` | Web UI base URL | Optional |
| `DEBUG_DOT_AI` | Enable debug logging | `false` |

### Supported AI Providers

| Provider | Models | Notes |
|----------|--------|-------|
| Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default, 1M token context |
| OpenAI | GPT-5.1-codex | |
| Google | Gemini 3 Pro, Flash | |
| xAI | Grok-4 | |
| Amazon Bedrock | Various | Uses AWS credential chain |
| OpenRouter | Multi-model | Proxy to multiple providers |
| Custom | Ollama, vLLM, LocalAI | Via `baseURL` config |

## Workflow Example

```
User Intent: "Update my-api deployment in prod to v2.0 with zero downtime"

1. CONTEXT EMBEDDING
   └─ embedContext(intent)
      ├─ Search patterns → "Zero-Downtime Rolling Update"
      ├─ Search policies → "Production Update Requirements"
      └─ Search capabilities → "metrics-server", "KEDA Operator"

2. AI INVESTIGATION LOOP
   └─ AI Tool Loop (30 iterations max)
      ├─ kubectl_get deployment/my-api -n prod
      ├─ kubectl_describe deployment/my-api -n prod
      ├─ kubectl_patch_dryrun (test maxUnavailable: 0)
      └─ kubectl_set_image (test v2.0 image --dry-run=server)

3. ANALYSIS GENERATION
   └─ Session created: opr-1704067200000-a1b2c3d4
      ├─ Status: analysis_complete
      ├─ Current: 3 replicas, my-api:v1.5, maxUnavailable: 1
      ├─ Proposed: image v2.0, maxUnavailable: 0
      ├─ Commands: set image + patch strategy
      ├─ Risk: LOW
      └─ Visualization URL: https://dot-ai-ui/v/opr-1704067200000-a1b2c3d4

4. USER APPROVAL
   └─ User reviews analysis in terminal or web UI
      └─ Calls: operate({ sessionId: 'opr-...', executeChoice: 1 })

5. COMMAND EXECUTION
   └─ executeOperations()
      ├─ Load session (status: analysis_complete)
      ├─ Update status to executing
      ├─ Execute commands sequentially
      │  ├─ kubectl set image deployment/my-api my-api=my-api:v2.0 -n prod
      │  └─ kubectl patch deployment/my-api -n prod -p '{"spec":...}'
      ├─ Call remediate internally for validation
      └─ Update status to executed_successfully

6. RETURN RESULTS
   └─ Results: 2 commands succeeded
      ├─ Validation: "Rollout complete, all pods running v2.0"
      └─ Status: success
```

## See Also

- [MCP Operate Guide](/docs/mcp/operate/)
- [Remediate Architecture](./remediate.md)
- [Recommendation Architecture](./recommendation.md)
- [Controller Documentation](/docs/controller/)
- [Web UI Documentation](/docs/ui/)

# Query Feature Architecture

This document provides a detailed architecture overview of the Query feature in the DevOps AI Toolkit.

## Overview

The Query feature provides a natural language interface for Kubernetes cluster intelligence. Users can ask questions about cluster resources, capabilities, and status in plain English, with the AI autonomously deciding which tools to call to gather the necessary information.

## High-Level Architecture

```mermaid
flowchart TB
    subgraph Users["User / AI Agent"]
        Agent["Claude Code, Cursor,
VS Code, etc."]
    end
    subgraph MCP["MCP Server (dot-ai)"]
        Query["query Tool"]
        AI["AI Provider"]
        Session["Session
Manager"]
        CapTools["Capability
Tools"]
        ResTools["Resource
Tools"]
        KubectlTools["Kubectl
Tools"]
    end
    subgraph External["External Services"]
        LLM["Claude, OpenAI,
Gemini, etc."]
        Qdrant["Qdrant
Vector DB"]
        Embeddings["Embedding
Service"]
    end
    subgraph K8s["Kubernetes Cluster"]
        API["Kubernetes API"]
        Controller["Controller
(dot-ai-controller)"]
        Resources["Cluster Resources
Pods, Deployments,
Services, etc."]
        CRDs["Custom Resource
Definitions"]
    end
    subgraph WebUI["Web UI (dot-ai-ui)"]
        Viz["Visualization Dashboard
- Mermaid Diagrams
- Tables & Cards
- Code Blocks"]
    end

    Agent <-->|MCP Protocol| Query
    Query --> AI
    Query --> Session
    AI --> LLM
    AI <-->|Tool Loop| CapTools
    AI <-->|Tool Loop| ResTools
    AI <-->|Tool Loop| KubectlTools
    CapTools --> Qdrant
    ResTools --> Qdrant
    KubectlTools --> API
    Qdrant --> Embeddings
    Controller -->|Sync Resources| Qdrant
    Controller -->|Watch| Resources
    Controller -->|Watch| CRDs
    Agent -.->|User opens
Visualization URL| WebUI
```

## Query Workflow

The query tool operates as an agentic loop where the AI autonomously decides which tools to call:

```mermaid
flowchart TD
    subgraph Phase1["Phase 1: Input Processing"]
        Intent["User Intent
(natural language)"]
        Validate["Validate Input
(1-1000 chars)"]
        RequestId["Generate RequestId
for tracking"]

        Intent --> Validate --> RequestId
    end
    subgraph Phase2["Phase 2: AI Initialization"]
        LoadPrompt["Load System Prompt
(query-system.md)"]
        InitProvider["Initialize AI Provider
(from environment)"]
        BuildTools["Build Tool Definitions
- search_capabilities
- query_capabilities
- search_resources
- query_resources
- kubectl_*"]

        RequestId --> LoadPrompt --> InitProvider --> BuildTools
    end
    subgraph Phase3["Phase 3: Agentic Tool Loop"]
        ToolLoop["AI Tool Loop
(Vercel AI SDK)"]
        Decision{"AI
Decision"}
        SemanticSearch["Semantic Search
(Qdrant vectors)"]
        FilterQuery["Filter Query
(Qdrant filters)"]
        KubectlExec["Kubectl Execution
(live cluster)"]
        CollectResults["Collect Tool Results"]

        BuildTools --> ToolLoop --> Decision
        Decision -->|search_* tools| SemanticSearch --> CollectResults
        Decision -->|query_* tools| FilterQuery --> CollectResults
        Decision -->|kubectl_* tools| KubectlExec --> CollectResults
        CollectResults -->|Continue| ToolLoop
    end
    subgraph Phase4["Phase 4: Response Generation"]
        MaxIterations{"Max Iterations
Reached?"}
        FinalResponse["AI Final Response
(JSON format)"]
        ParseJSON["Parse Summary
from JSON"]

        Decision -->|Done| FinalResponse
        CollectResults --> MaxIterations
        MaxIterations -->|Yes| FinalResponse
        MaxIterations -->|No| ToolLoop
        FinalResponse --> ParseJSON
    end
    subgraph Phase5["Phase 5: Session & Output"]
        CreateSession["Create Session
(qry-{ts}-{uuid})"]
        StoreData["Store Session Data
- intent
- summary
- toolsUsed
- toolCallsExecuted"]
        GenURL{"WEB_UI_BASE_URL
Configured?"}
        VizURL["Generate Visualization URL
{baseUrl}/v/{sessionId}"]
        BuildOutput["Build QueryOutput
- success: true
- summary
- toolsUsed
- sessionId
- visualizationUrl"]

        ParseJSON --> CreateSession --> StoreData --> GenURL
        GenURL -->|Yes| VizURL --> BuildOutput
        GenURL -->|No| BuildOutput
    end
```

## Component Details

### MCP Server (dot-ai)

The MCP server hosts the query tool and orchestrates AI-driven investigation:

| Component | File | Description |
|-----------|------|-------------|
| `query` tool | `src/tools/query.ts` | Entry point, orchestrates tool loop and session |
| System Prompt | `prompts/query-system.md` | AI instructions for query behavior |
| `GenericSessionManager` | `src/core/generic-session-manager.ts` | File-based session persistence |
| `AIProvider` | `src/core/ai-provider.interface.ts` | AI abstraction with tool loop support |
| `AIProviderFactory` | `src/core/ai-provider-factory.ts` | Multi-provider factory |
| `capability-tools` | `src/core/capability-tools.ts` | Semantic search for capabilities |
| `resource-tools` | `src/core/resource-tools.ts` | Semantic search for resources |
| `kubectl-tools` | `src/core/kubectl-tools.ts` | Kubectl read-only tools |
| `CapabilityVectorService` | `src/core/capability-vector-service.ts` | Capability embeddings storage |
| `ResourceVectorService` | `src/core/resource-vector-service.ts` | Resource embeddings storage |
| `EmbeddingService` | `src/core/embedding-service.ts` | Multi-provider embedding generation |
| `visualization` | `src/core/visualization.ts` | URL generation for web UI |

### Query Tools Available to AI

The AI can autonomously call these tools during investigation:

| Tool | Type | Description |
|------|------|-------------|
| `search_capabilities` | Semantic | Vector similarity search for resource capabilities |
| `query_capabilities` | Filter | Structured filter queries for capabilities |
| `search_resources` | Semantic | Vector similarity search for cluster resources |
| `query_resources` | Filter | Structured filter queries for resources |
| `kubectl_api_resources` | Kubectl | List all available API resources |
| `kubectl_get` | Kubectl | Get resources with current state |
| `kubectl_describe` | Kubectl | Detailed resource information |
| `kubectl_logs` | Kubectl | Container logs from pods |
| `kubectl_events` | Kubectl | Kubernetes events for troubleshooting |
| `kubectl_get_crd_schema` | Kubectl | OpenAPI v3 schema for CRDs |

### Controller (dot-ai-controller)

The Kubernetes controller syncs cluster data to Qdrant for semantic search:

| Component | File | Description |
|-----------|------|-------------|
| `ResourceSyncReconciler` | `internal/controller/resourcesync_controller.go` | Watches all resources, syncs to MCP |
| `CapabilityScanReconciler` | `internal/controller/capabilityscan_controller.go` | Triggers capability scans on CRD changes |
| `ResourceSyncConfig` CRD | `api/v1alpha1/resourcesyncconfig_types.go` | Configuration for resource sync |
| `CapabilityScanConfig` CRD | `api/v1alpha1/capabilityscanconfig_types.go` | Configuration for capability scanning |
| MCP Resource Client | `internal/controller/resourcesync_mcp.go` | HTTP client for `/api/v1/resources/sync` |
| MCP Capability Client | `internal/controller/capabilityscan_mcp.go` | HTTP client for `/api/v1/tools/manageOrgData` |
| Debounce Buffer | `internal/controller/resourcesync_debounce.go` | Batches changes before sync |

### Web UI (dot-ai-ui)

Provides visualization for query results:

| Component | File | Description |
|-----------|------|-------------|
| Visualization Page | `src/pages/Visualization.tsx` | Main page for `/v/{sessionId}` |
| MermaidRenderer | `src/components/renderers/MermaidRenderer.tsx` | Interactive diagrams with zoom/pan |
| CardRenderer | `src/components/renderers/CardRenderer.tsx` | Resource cards in grid layout |
| TableRenderer | `src/components/renderers/TableRenderer.tsx` | Tabular data display |
| CodeRenderer | `src/components/renderers/CodeRenderer.tsx` | Syntax-highlighted code/YAML |
| InsightsPanel | `src/components/InsightsPanel.tsx` | AI observations display |
| TabContainer | `src/components/TabContainer.tsx` | Multi-visualization tabs |
| API Client | `src/api/client.ts` | Data fetching from MCP server |

## Integration Points

```mermaid
flowchart LR
    subgraph MCP["MCP Server"]
        Query["query tool"]
        AIProvider["AI Provider"]
        CapTools["Capability Tools"]
        ResTools["Resource Tools"]
        KubectlTools["Kubectl Tools"]
        SessionMgr["Session Manager"]
    end
    subgraph Vector["Vector Database"]
        Qdrant["Qdrant"]
        CapColl["capabilities
collection"]
        ResColl["resources
collection"]
    end
    subgraph AI["AI Providers"]
        Anthropic["Claude API"]
        OpenAI["OpenAI API"]
        Google["Gemini API"]
        Others["xAI, Bedrock,
OpenRouter, etc."]
    end
    subgraph Embed["Embedding Providers"]
        OpenAIEmbed["OpenAI
text-embedding-3-small"]
        GoogleEmbed["Google
gemini-embedding-001"]
        BedrockEmbed["Bedrock
titan-embed-text-v2"]
    end
    subgraph K8s["Kubernetes"]
        API["API Server"]
        Controller["dot-ai-controller"]
    end
    subgraph Storage["Session Storage"]
        Files["File System
~/.dot-ai/sessions/qry-*"]
    end
    subgraph UI["Web UI"]
        Viz["Visualization
Dashboard"]
    end

    AIProvider <-->|Tool Loop| CapTools
    AIProvider <-->|Tool Loop| ResTools
    AIProvider <-->|Tool Loop| KubectlTools
    AIProvider --> Anthropic
    AIProvider --> OpenAI
    AIProvider --> Google
    AIProvider --> Others
    CapTools --> Qdrant
    ResTools --> Qdrant
    Qdrant --> CapColl
    Qdrant --> ResColl
    Qdrant --> OpenAIEmbed
    Qdrant --> GoogleEmbed
    Qdrant --> BedrockEmbed
    KubectlTools --> API
    Controller -->|POST /api/v1/resources/sync| MCP
    Controller -->|POST /api/v1/tools/manageOrgData| MCP
    Controller -->|Watch| K8s
    SessionMgr --> Files
    Query -.->|Session URL| Viz
```

### MCP Server ↔ AI Provider

- **Tool Loop**: AI iteratively calls tools (max 20 iterations by default)
- **Autonomous Decision**: AI decides which tools to call based on user intent
- **JSON Output**: AI returns structured JSON with summary

### MCP Server ↔ Qdrant

- **Semantic Search**: Vector similarity using embeddings
- **Filter Queries**: Structured queries by namespace, kind, labels, etc.
- **Capability Data**: Resource kinds with capabilities, providers, abstractions
- **Resource Data**: Cluster resources with labels, annotations, timestamps

### Controller ↔ MCP Server

- **Resource Sync**: Controller watches all resources and syncs to MCP
- **Capability Scanning**: Controller triggers scans when CRDs change
- **Debounced Batching**: Changes batched every 10s to reduce API calls
- **Periodic Resync**: Full resync every 60 minutes for consistency

### MCP Server ↔ Web UI

- **Session Storage**: Query results stored with session IDs
- **Visualization API**: `/api/v1/visualize/{sessionId}` endpoint
- **URL Generation**: `WEB_UI_BASE_URL/v/{sessionId}`
- **Cached Visualizations**: AI-generated visualizations cached in session

## Session Management

Sessions persist query data for visualization:

```
Session ID Format: qry-{timestamp}-{uuid8}
Example: qry-1767465086590-a1b2c3d4

Session Data:
├── toolName: 'query'
├── intent: "What databases are running in the cluster?"
├── summary: "Found 3 PostgreSQL clusters and 2 Redis instances..."
├── toolsUsed: ['search_capabilities', 'kubectl_get', ...]
├── iterations: 5
├── toolCallsExecuted:
│   ├── {tool: 'search_capabilities', input: {...}, output: {...}}
│   ├── {tool: 'kubectl_get', input: {...}, output: {...}}
│   └── ...
├── cachedVisualization:
│   ├── title: "Database Resources"
│   ├── visualizations: [...]
│   ├── insights: [...]
│   └── generatedAt: ISO timestamp
└── timestamp: ISO date
```

## Data Flow: Resource Sync

```mermaid
flowchart LR
    subgraph K8s["Kubernetes Cluster"]
        Resources["Resources
(Pods, Deployments, etc.)"] CRDs["Custom Resources"] end subgraph Controller["dot-ai-controller"] Informers["Dynamic
Informers"] CRDWatcher["CRD
Watcher"] DebounceRes["Resource
Debounce Buffer"] DebounceCap["Capability
Debounce Buffer"] end subgraph MCP["MCP Server"] SyncAPI["POST /api/v1/
resources/sync"] ManageAPI["POST /api/v1/
tools/manageOrgData"] VectorSvc["Vector
Services"] end subgraph Qdrant["Qdrant"] ResColl["resources
collection"] CapColl["capabilities
collection"] end Resources -->|Watch| Informers CRDs -->|Watch| CRDWatcher Informers -->|Queue| DebounceRes CRDWatcher -->|Queue| DebounceCap DebounceRes -->|Batch every 10s| SyncAPI DebounceCap -->|Batch every 10s| ManageAPI SyncAPI --> VectorSvc ManageAPI --> VectorSvc VectorSvc -->|Upsert/Delete| ResColl VectorSvc -->|Scan/Delete| CapColl ``` ## Output Formats The query tool returns structured output: | Field | Description | |-------|-------------| | `success` | Boolean indicating query success | | `summary` | Human-readable summary of findings | | `toolsUsed` | List of tools called during investigation | | `iterations` | Number of AI tool loop iterations | | `sessionId` | Session ID for visualization | | `visualizationUrl` | URL to view results (if configured) | | `guidance` | Instructions for agent on presenting results | | `error` | Error object with code and message (if failed) | ## Error Handling The query workflow includes robust error handling: 1. **Input Validation**: Intent must be non-empty string (1-1000 chars) 2. **AI Provider Errors**: Logged with request IDs for debugging 3. **Vector DB Unavailable**: Falls back to keyword search or kubectl-only 4. **JSON Parsing Failures**: Original AI response logged for analysis 5. **Tool Execution Errors**: Captured in toolCallsExecuted, investigation continues 6. 
**Max Iterations**: Default 20 iterations prevents infinite loops ## Configuration ### Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `AI_PROVIDER` | AI provider selection | `anthropic` | | `ANTHROPIC_API_KEY` | Anthropic API key | Required if using | | `OPENAI_API_KEY` | OpenAI API key | Required if using | | `QDRANT_URL` | Qdrant vector DB endpoint | `http://localhost:6333` | | `QDRANT_API_KEY` | Qdrant authentication | Optional | | `QDRANT_CAPABILITIES_COLLECTION` | Capabilities collection name | `capabilities` | | `QDRANT_RESOURCES_COLLECTION` | Resources collection name | `resources` | | `CUSTOM_EMBEDDINGS_API_KEY` | Embedding service API key | Falls back to provider key | | `KUBECONFIG` | Kubernetes config path | Auto-detected | | `WEB_UI_BASE_URL` | Web UI base URL | Optional | | `DEBUG_DOT_AI` | Enable debug logging | `false` | ### Supported AI Providers | Provider | Models | Notes | |----------|--------|-------| | Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default, 1M token context | | OpenAI | GPT-5.1-codex | | | Google | Gemini 3 Pro, Flash | | | xAI | Grok-4 | | | Amazon Bedrock | Various | Uses AWS credential chain | | OpenRouter | Multi-model | Proxy to multiple providers | | Custom | Ollama, vLLM, LocalAI | Via `baseURL` config | ### Supported Embedding Providers | Provider | Model | Dimensions | |----------|-------|------------| | OpenAI | text-embedding-3-small | 1536 | | Google | gemini-embedding-001 | 768 | | Amazon Bedrock | titan-embed-text-v2:0 | 1024 | ## See Also - [MCP Query Guide](//mcp/query/) - [Controller Documentation](//controller/) - [Web UI Documentation](//ui/) - [Vector Database Setup](//setup/qdrant/) # Recommendation Feature Architecture This document provides a detailed architecture overview of the Recommendation feature in the DevOps AI Toolkit. ## Overview The Recommendation feature provides AI-powered Kubernetes deployment recommendations. 
It analyzes user intent, discovers cluster capabilities, and generates deployment solutions with full manifest generation and deployment capabilities. ## High-Level Architecture ```mermaid flowchart TB subgraph Users["User / AI Agent"] Agent["Claude Code, Cursor,
VS Code, etc."] end subgraph MCP["MCP Server (dot-ai)"] Recommend["recommend Tool"] AI["AI Provider"] Vector["Vector DB
Client"] Discovery["Discovery
Engine"] end subgraph External["External Services"] LLM["Claude, OpenAI,
Ollama, etc."] Qdrant["Qdrant
(Semantic Search)"] end subgraph K8s["Kubernetes Cluster"] API["Kubernetes API"] Controller["Controller
(dot-ai-controller)"] Resources["Deployed Resources
Deployments, Services,
Ingress, HPA, PDB, etc."] end subgraph WebUI["Web UI (dot-ai-ui)"] Viz["Visualization Dashboard
- Solution Comparisons
- Resource Diagrams
- Generated Manifests"] end Agent <-->|MCP Protocol| Recommend Recommend --> AI Recommend --> Vector Recommend --> Discovery AI --> LLM Vector -->|Embeddings| LLM Vector --> Qdrant Discovery --> API Recommend --> API Controller --> Resources Controller -->|Sync with Embeddings| Qdrant Agent -.->|User opens
Visualization URL| WebUI ``` ## Recommendation Workflow Stages The recommendation tool operates as a unified multi-stage workflow: ```mermaid flowchart TD subgraph Stage1["Stage 1: recommend"] UserIntent["User Intent"] IntentCheck{"Intent < 100 chars?"} Refine["Return Refinement Guidance"] CapSearch["Capability Search
(Vector DB)"] AIRank["AI Ranking
(Claude)"] CapMatch{"Capability
Match?"} ResourceSol["Generate Resource-Based
Solutions"] HelmSearch["Search ArtifactHub
for Helm Charts"] Solutions["Solutions with solutionIds
+ visualization URL"] UserIntent --> IntentCheck IntentCheck -->|Yes| Refine IntentCheck -->|No| CapSearch CapSearch --> AIRank AIRank --> CapMatch CapMatch -->|Yes| ResourceSol CapMatch -->|No| HelmSearch ResourceSol --> Solutions HelmSearch --> Solutions end subgraph Stage2["Stage 2: chooseSolution"] SelectSol["solutionId"] LoadSession["Load Session"] GenQuestions["Generate Questions
(if Helm)"] ReturnQuestions["Required Questions"] SelectSol --> LoadSession --> GenQuestions --> ReturnQuestions end subgraph Stage3["Stage 3-6: answerQuestion"] Required["answerQuestion:required
(name, namespace, image, port)"] Basic["answerQuestion:basic
(replicas, resources, ingress)"] Advanced["answerQuestion:advanced
(probes, PDB, security)"] Open["answerQuestion:open
(free-form, AI enhancement)"] Ready["ready_for_manifest_generation"] Required --> Basic --> Advanced --> Open --> Ready end subgraph Stage4["Stage 7: generateManifests"] GenType{"Solution
Type?"} subgraph Capability["Capability-Based"] C1["1. Retrieve Schemas"] C2["2. AI Generation"] C3["3. YAML Validation"] C4["4. kubectl Dry-Run"] C5["5. Retry Loop (max 10)"] C6["6. Package Output"] C1 --> C2 --> C3 --> C4 --> C5 --> C6 end subgraph Helm["Helm-Based"] H1["1. Fetch Chart"] H2["2. AI Values Gen"] H3["3. Helm Dry-Run"] H4["4. Retry Loop"] H1 --> H2 --> H3 --> H4 end GenType -->|Capability| C1 GenType -->|Helm| H1 C6 --> Manifests H4 --> Manifests Manifests["Manifests + visualization URL"] end subgraph Stage5["Stage 8: deployManifests"] DeployType{"Solution
Type?"} CapDeploy["kubectl apply -k
--wait"] HelmDeploy["helm upgrade
--install --wait"] Status["Deployment Status
+ Next Steps"] DeployType -->|Capability| CapDeploy DeployType -->|Helm| HelmDeploy CapDeploy --> Status HelmDeploy --> Status end Solutions --> SelectSol ReturnQuestions --> Required Ready --> GenType Manifests --> DeployType ``` ## Component Details ### MCP Server (dot-ai) The MCP server is the core recommendation engine: | Component | File | Description | |-----------|------|-------------| | `recommend` tool | `src/tools/recommend.ts` | Entry point, routes to stages, generates solutions | | `chooseSolution` | `src/tools/choose-solution.ts` | Loads selected solution, returns questions | | `answerQuestion` | `src/tools/answer-question.ts` | Processes answers, manages stage progression | | `generateManifests` | `src/tools/generate-manifests.ts` | AI manifest generation with validation loop | | `deployManifests` | `src/tools/deploy-manifests.ts` | Deploys via kubectl or helm | | `ResourceRecommender` | `src/core/schema.ts` | AI-powered solution ranking and filtering | | `CapabilityVectorService` | `src/core/capability-vector-service.ts` | Semantic search for capabilities | | `PatternVectorService` | `src/core/pattern-vector-service.ts` | Organizational pattern matching | | `PolicyVectorService` | `src/core/policy-vector-service.ts` | Policy enforcement | | `GenericSessionManager` | `src/core/generic-session-manager.ts` | Session state management | | `ArtifactHubService` | `src/core/artifacthub.ts` | Helm chart discovery | ### Controller (dot-ai-controller) The Kubernetes controller manages deployed solutions: | CRD | Description | |-----|-------------| | `Solution` | Groups related resources, manages ownerReferences, aggregates health | | `ResourceSyncConfig` | Syncs resource metadata to MCP for semantic search | | `CapabilityScanConfig` | Scans cluster for resource capabilities | | `RemediationPolicy` | Event-driven remediation (separate feature) | ### Web UI (dot-ai-ui) Provides visualization for recommendation results: - **Visualization Page** (`/v/{sessionId}`) 
- Renders solution comparisons - **Mermaid Diagrams** - Architecture and flow diagrams - **Resource Cards** - Solution component details - **Code Blocks** - Generated manifests with syntax highlighting - **Tables** - Configuration summaries ## Integration Points ```mermaid flowchart LR subgraph MCP["MCP Server"] Recommend["recommend tool"] Schema["ResourceRecommender"] CapVec["CapabilityVectorService"] PatVec["PatternVectorService"] PolVec["PolicyVectorService"] Discovery["Discovery Engine"] ArtHub["ArtifactHubService"] end subgraph VectorDB["Qdrant"] Capabilities["Capabilities
Collection"] Patterns["Patterns
Collection"] Policies["Policies
Collection"] end subgraph AI["AI Provider"] Claude["Claude API"] OpenAI["OpenAI API"] end subgraph K8s["Kubernetes"] API["API Server"] Controller["dot-ai-controller"] end subgraph External["External"] ArtifactHub["ArtifactHub API"] end subgraph UI["Web UI"] Viz["Visualization
Dashboard"] end CapVec <-->|Semantic Search| Capabilities PatVec <-->|Pattern Match| Patterns PolVec <-->|Policy Lookup| Policies Schema -->|Solution Ranking| Claude Schema -->|Solution Ranking| OpenAI Discovery -->|kubectl explain| API Recommend -->|kubectl apply| API ArtHub -->|Chart Search| ArtifactHub Recommend -.->|Session URL| Viz Controller -->|Watch Resources| API ``` ### MCP Server ↔ Vector DB (Qdrant) - **Capability Storage**: Resource capabilities with semantic embeddings - **Pattern Storage**: Organizational patterns for solution enhancement - **Policy Storage**: Policy intents for configuration enforcement - **Semantic Search**: Natural language queries matched to stored data ### MCP Server ↔ Kubernetes API - **Resource Discovery**: `kubectl api-resources`, `kubectl explain` - **Schema Retrieval**: OpenAPI schemas for manifest generation - **Manifest Validation**: `kubectl apply --dry-run=server` - **Deployment**: `kubectl apply`, `helm upgrade --install` ### MCP Server ↔ AI Provider - **Solution Assembly**: Ranking and filtering discovered capabilities - **Question Generation**: Creating configuration questions from schemas - **Manifest Generation**: Generating YAML from solution + answers - **Helm Values**: Generating values.yaml for chart installations ### MCP Server ↔ Web UI - **Session Storage**: Solution data stored with session IDs - **Visualization API**: `/api/v1/visualize/{sessionId}` endpoint - **URL Generation**: `WEB_UI_BASE_URL/v/{sessionId}` ### Controller ↔ MCP Server - **Resource Sync**: Controller syncs resource metadata to MCP - **Capability Scan**: Controller triggers capability discovery - **Solution CR**: MCP generates Solution CR for controller management ## Session Management Sessions persist workflow state across tool calls: ``` Session ID Format: sol-{timestamp}-{uuid8} Example: sol-1765409923079-fa3f055c Session Data: ├── toolName: 'recommend' ├── stage: 'recommend' | 'generateManifests' | ...
├── intent: "Deploy PostgreSQL database" ├── type: 'single' | 'combination' | 'helm' ├── score: 96 ├── description: "Multi-cloud PostgreSQL via DevOps Toolkit" ├── resources: [{kind, apiVersion, group, description}] ├── chart: {repository, chartName, version} (if Helm) ├── questions: {required, basic, advanced, open} ├── answers: {questionId: value} ├── appliedPatterns: ["DevOps Toolkit DB Pattern"] ├── generatedManifests: {type, files, helmCommand} └── timestamp: ISO date ``` ## Output Formats The recommendation tool supports three output formats for capability-based solutions: | Format | Description | Files Generated | |--------|-------------|-----------------| | `raw` | Plain YAML manifests | `manifests.yaml` | | `helm` | Helm chart structure | `Chart.yaml`, `values.yaml`, `templates/*.yaml` | | `kustomize` | Kustomize overlay | `kustomization.yaml`, `base/`, `overlays/` | ## Error Handling The recommendation workflow includes robust error handling: 1. **Intent Refinement**: Vague intents get guidance, not failure 2. **Validation Loops**: Up to 10 retries for manifest generation 3. **Capability Gaps**: Clear error when enhancement isn't possible 4. **Session Expiry**: Graceful handling of expired sessions 5. **AI Service Errors**: Fallback to original solution on enhancement failure ## See Also - [MCP Recommendation Guide](//mcp/recommend/) - [Capability Management Guide](//mcp/capability-management/) - [Pattern Management Guide](//mcp/pattern-management/) - [Controller Documentation](//controller/) - [Web UI Documentation](//ui/) # Remediate Feature Architecture This document provides a detailed architecture overview of the Remediate feature in the DevOps AI Toolkit. ## Overview The Remediate feature provides AI-powered Kubernetes issue analysis and remediation. It investigates problems using kubectl tools, identifies root causes with confidence scoring, and executes verified fixes with optional post-execution validation. 
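The workflow in this section gates automatic execution on `Confidence >= threshold AND Risk <= maxRisk`; when the check fails, the tool returns with a `fallbackReason` instead of executing. A minimal shell sketch of that gate — function and variable names here are illustrative, not taken from the codebase:

```shell
#!/bin/sh
# Illustrative sketch of the automatic-mode gate: execute only when
# confidence >= threshold AND risk <= max risk. Names are hypothetical.

risk_rank() {
  case "$1" in
    low) echo 1 ;;
    medium) echo 2 ;;
    high) echo 3 ;;
    *) echo 99 ;;  # unknown risk level: never auto-execute
  esac
}

should_auto_execute() {
  confidence="$1"; threshold="$2"; risk="$3"; max_risk="$4"
  # awk handles the floating-point comparison portably
  conf_ok=$(awk -v c="$confidence" -v t="$threshold" \
    'BEGIN { if (c >= t) print 1; else print 0 }')
  [ "$conf_ok" -eq 1 ] && [ "$(risk_rank "$risk")" -le "$(risk_rank "$max_risk")" ]
}

if should_auto_execute 0.92 0.8 low low; then
  echo "auto-execute"
else
  echo "fallback: return analysis with fallbackReason"
fi
```

In the actual controller, the threshold values come from `RemediationPolicy` fields (`confidenceThreshold`, `maxRiskLevel`) and the confidence and risk come from the AI's final analysis.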
## High-Level Architecture ```mermaid flowchart TB subgraph Users["User / AI Agent"] Agent["Claude Code, Cursor,
VS Code, etc."] end subgraph MCP["MCP Server (dot-ai)"] Remediate["remediate Tool"] AI["AI Provider"] Session["Session
Manager"] Kubectl["Kubectl Tools"] end subgraph External["External Services"] LLM["Claude, OpenAI,
Gemini, etc."] end subgraph K8s["Kubernetes Cluster"] API["Kubernetes API"] Controller["Controller
(dot-ai-controller)"] Events["Kubernetes Events"] Resources["Cluster Resources
Pods, Deployments,
Services, etc."] end subgraph WebUI["Web UI (dot-ai-ui)"] Viz["Visualization Dashboard
- Investigation Flow
- Root Cause Analysis
- Remediation Commands"] end subgraph Notifications["Notifications"] Slack["Slack"] GChat["Google Chat"] end Agent <-->|MCP Protocol| Remediate Remediate --> AI Remediate --> Session Remediate --> Kubectl AI --> LLM AI <-->|Tool Loop| Kubectl Kubectl --> API Remediate -->|Execute Commands| API Controller -->|Watch| Events Controller -->|RemediationPolicy| Remediate Controller -.->|Webhook| Slack Controller -.->|Webhook| GChat Events --> Resources Agent -.->|User opens
Visualization URL| WebUI ``` ## Remediation Workflow The remediate tool operates as a multi-phase workflow with persistent session management: ```mermaid flowchart TD subgraph Phase1["Phase 1: Investigation"] Issue["Issue Description"] CreateSession["Create Session
(rem-{ts}-{uuid})"] Investigation["AI Investigation Loop
(max 30 iterations)"] KubectlTools["kubectl Tools:
get, describe, logs,
events, api-resources, patch"] Issue --> CreateSession --> Investigation Investigation <-->|Tool Calls| KubectlTools end subgraph Phase2["Phase 2: Analysis"] ParseResponse["Parse AI Response"] StatusCheck{"Issue
Status?"} AlreadyResolved["Return: Issue already
resolved/non-existent"] Analysis["Root Cause Analysis
+ Confidence Score
+ Contributing Factors"] Investigation --> ParseResponse --> StatusCheck StatusCheck -->|resolved| AlreadyResolved StatusCheck -->|non_existent| AlreadyResolved StatusCheck -->|active| Analysis end subgraph Phase3["Phase 3: Execution Decision"] ModeCheck{"Execution
Mode?"} subgraph Manual["Manual Mode (default)"] ReturnChoices["Return 2 Choices:
1. Execute via MCP
2. Execute via Agent"] WaitApproval["await_user_approval"] end subgraph Auto["Automatic Mode"] ThresholdCheck{"Confidence >= threshold
AND Risk <= maxRisk?"} AutoExecute["Execute Automatically"] Fallback["Return with
fallbackReason"] end Analysis --> ModeCheck ModeCheck -->|manual| ReturnChoices --> WaitApproval ModeCheck -->|automatic| ThresholdCheck ThresholdCheck -->|Yes| AutoExecute ThresholdCheck -->|No| Fallback end subgraph Phase4["Phase 4: Command Execution"] UserChoice["User Choice
(executeChoice=1 or 2)"] Choice1{"Choice?"} ExecuteMCP["Execute via MCP
(child_process.exec)"] ExecuteAgent["Return Commands
for Agent Execution"] LogResults["Log Results
(success/failure/output)"] WaitApproval --> UserChoice --> Choice1 Choice1 -->|1| ExecuteMCP --> LogResults Choice1 -->|2| ExecuteAgent AutoExecute --> ExecuteMCP end subgraph Phase5["Phase 5: Validation"] ValidationCheck{"All Commands
Succeeded?"} HasValidation{"Has validationIntent?"} RecursiveCall["Recursive Investigation
with validationIntent"] WaitReconcile["Wait 30s
(automatic mode)"] FinalStatus["Final Status:
resolved / still_active"] LogResults --> ValidationCheck ValidationCheck -->|Yes| HasValidation ValidationCheck -->|No| FinalStatus HasValidation -->|Yes| WaitReconcile --> RecursiveCall --> FinalStatus HasValidation -->|No| FinalStatus end ``` ## Component Details ### MCP Server (dot-ai) The MCP server is the core remediation engine: | Component | File | Description | |-----------|------|-------------| | `remediate` tool | `src/tools/remediate.ts` | Entry point, orchestrates investigation and execution | | System Prompt | `prompts/remediate-system.md` | AI instructions for investigation behavior | | `GenericSessionManager` | `src/core/generic-session-manager.ts` | File-based session persistence | | `AIProvider` | `src/core/ai-provider.interface.ts` | AI abstraction with tool loop support | | `AIProviderFactory` | `src/core/ai-provider-factory.ts` | Multi-provider factory (Anthropic, OpenAI, etc.) | | `kubectl-tools` | `src/core/kubectl-tools.ts` | Kubectl investigation tools | | `visualization` | `src/core/visualization.ts` | URL generation for web UI | ### Kubectl Investigation Tools Tools available during AI investigation: | Tool | Description | |------|-------------| | `kubectl_api_resources` | Discover available resources in cluster | | `kubectl_get` | List resources with table format (compact) | | `kubectl_describe` | Detailed resource information with events | | `kubectl_logs` | Container logs (supports `--previous` for crashes) | | `kubectl_events` | Cluster events for understanding state changes | | `kubectl_patch_dryrun` | Validate patches before actual execution | ### Controller (dot-ai-controller) The Kubernetes controller provides event-driven remediation: | Component | File | Description | |-----------|------|-------------| | `RemediationPolicy` CRD | `config/crd/bases/` | Custom resource for remediation rules | | Policy Controller | `internal/controller/remediationpolicy_controller.go` | Event matching and MCP dispatch | | Rate Limiter | 
`internal/controller/remediationpolicy_ratelimit.go` | Per-object cooldowns and rate limits | | MCP Client | `internal/controller/remediationpolicy_mcp.go` | HTTP client for remediate tool | | Cooldown State | ConfigMaps | Persistent cooldown state across restarts | ### Web UI (dot-ai-ui) Provides visualization for remediation results: | Component | File | Description | |-----------|------|-------------| | Visualization Page | `src/pages/Visualization.tsx` | Main page for `/v/{sessionId}` | | MermaidRenderer | `src/components/renderers/MermaidRenderer.tsx` | Interactive flowcharts (collapsible) | | CardRenderer | `src/components/renderers/CardRenderer.tsx` | Issue/solution cards | | CodeRenderer | `src/components/renderers/CodeRenderer.tsx` | Commands and logs with syntax highlighting | | InsightsPanel | `src/components/InsightsPanel.tsx` | AI observations display | | API Client | `src/api/client.ts` | Data fetching from MCP server | ## Integration Points ```mermaid flowchart LR subgraph MCP["MCP Server"] Remediate["remediate tool"] AIProvider["AI Provider"] KubectlTools["Kubectl Tools"] SessionMgr["Session Manager"] end subgraph AI["AI Providers"] Anthropic["Claude API"] OpenAI["OpenAI API"] Google["Gemini API"] Others["xAI, Bedrock,
OpenRouter, etc."] end subgraph K8s["Kubernetes"] API["API Server"] Controller["RemediationPolicy
Controller"] Events["Kubernetes Events"] end subgraph Storage["Session Storage"] Files["File System
tmp/sessions/*.json"] ConfigMaps["ConfigMaps
(cooldown state)"] end subgraph UI["Web UI"] Viz["Visualization
Dashboard"] end subgraph Notif["Notifications"] Slack["Slack"] GChat["Google Chat"] end AIProvider <-->|Tool Loop| KubectlTools AIProvider --> Anthropic AIProvider --> OpenAI AIProvider --> Google AIProvider --> Others KubectlTools -->|Investigation| API Remediate -->|Execute Commands| API Controller -->|Watch| Events Controller -->|HTTP POST| Remediate Controller -.->|Webhook| Slack Controller -.->|Webhook| GChat SessionMgr --> Files Controller --> ConfigMaps Remediate -.->|Session URL| Viz ``` ### MCP Server ↔ AI Provider - **Tool Loop**: AI iteratively calls kubectl tools (max 30 iterations) - **Investigation**: Gathers cluster data to understand the issue - **Analysis**: Parses JSON response with root cause, confidence, and remediation steps - **Validation**: Optional recursive investigation after command execution ### MCP Server ↔ Kubernetes API - **Read Operations**: `kubectl get`, `describe`, `logs`, `events` - **Validation**: `kubectl patch --dry-run=server` - **Execution**: `child_process.exec()` for remediation commands ### Controller ↔ MCP Server - **Event-Driven**: Controller watches Kubernetes events - **Policy Matching**: Events matched against RemediationPolicy selectors - **HTTP Dispatch**: POST to MCP `/api/v1/tools/remediate` - **Rate Limiting**: Per-object cooldowns prevent remediation storms ### MCP Server ↔ Web UI - **Session Storage**: Remediation data stored with session IDs - **Visualization API**: `/api/v1/visualize/{sessionId}` endpoint - **URL Generation**: `WEB_UI_BASE_URL/v/{sessionId}` ### Controller ↔ Notifications - **Slack Webhooks**: Controller sends remediation events to Slack channels - **Google Chat Webhooks**: Controller sends remediation events to Google Chat spaces - **Secret References**: Webhook URLs stored securely in Kubernetes Secrets - **Event Types**: Notifications sent on remediation start, success, and failure ## Session Management Sessions persist workflow state across tool calls: ``` Session ID Format: 
rem-{timestamp}-{uuid8} Example: rem-1767465086590-11029192 Session Data: ├── toolName: 'remediate' ├── issue: "Pod my-app is crashing with OOMKilled" ├── mode: 'manual' | 'automatic' ├── interaction_id: (for evaluation dataset) ├── status: 'investigating' | 'analysis_complete' | 'executed_*' | ... ├── finalAnalysis: │ ├── rootCause: "Container memory limit too low" │ ├── confidence: 0.92 │ ├── factors: ["High memory usage", "No HPA"] │ ├── remediation: │ │ ├── summary: "Increase memory limit" │ │ ├── actions: [{description, command, risk, rationale}] │ │ └── risk: 'low' | 'medium' | 'high' │ └── validationIntent: "Verify pod is running" ├── executionResults: [{command, success, output, error}] └── timestamp: ISO date ``` ### Session States | State | Description | |-------|-------------| | `investigating` | AI is gathering data via kubectl tools | | `analysis_complete` | Analysis done, awaiting user approval | | `failed` | Investigation failed (error) | | `executed_successfully` | All commands succeeded | | `executed_with_errors` | Some commands failed | | `cancelled` | User cancelled the remediation | ## RemediationPolicy CRD The controller uses a CRD for event-driven remediation: ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: RemediationPolicy metadata: name: oom-killer-policy spec: eventSelectors: - type: Warning reason: OOMKilled involvedObjectKind: Pod namespace: production message: ".*memory.*" # Regex support mode: automatic # Override per selector confidenceThreshold: 0.9 maxRiskLevel: low mcpEndpoint: https://mcp.example.com/api/v1/tools mcpAuthSecretRef: name: mcp-auth key: token mcpTool: remediate mode: manual # Default mode confidenceThreshold: 0.8 maxRiskLevel: low rateLimiting: eventsPerMinute: 10 cooldownMinutes: 5 notifications: slack: webhookSecretRef: name: slack-webhook key: url channel: "#alerts" googleChat: webhookSecretRef: name: gchat-webhook key: url status: totalEventsProcessed: 150 successfulRemediations: 142 
failedRemediations: 8 rateLimitedEvents: 25 lastProcessedEvent: "2025-01-07T10:30:00Z" ``` ## Output Formats The remediate tool returns structured output: | Field | Description | |-------|-------------| | `status` | `success`, `failed`, or `awaiting_user_approval` | | `sessionId` | Session ID for continuation or visualization | | `investigation.iterations` | Number of AI tool loop iterations | | `investigation.dataGathered` | List of kubectl tools called | | `analysis.rootCause` | Identified root cause | | `analysis.confidence` | Confidence score (0-1) | | `analysis.factors` | Contributing factors | | `remediation.summary` | Human-readable summary | | `remediation.actions` | Commands with risk levels | | `remediation.risk` | Overall risk level | | `validationIntent` | Post-execution validation instructions | | `executionChoices` | Available execution options | | `results` | Execution results (if executed) | ## Error Handling The remediation workflow includes robust error handling: 1. **Session Not Found**: Clear guidance to start new investigation 2. **AI Service Errors**: Logged with request IDs for debugging 3. **JSON Parsing Failures**: Original AI response logged for analysis 4. **Command Execution Failures**: Individual command results tracked 5. **Validation Failures**: Recursive investigation continues despite errors 6. 
**Investigation Timeouts**: Max 30 iterations prevents infinite loops ## Configuration ### Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `AI_PROVIDER` | AI provider selection | `anthropic` | | `ANTHROPIC_API_KEY` | Anthropic API key | Required if using | | `OPENAI_API_KEY` | OpenAI API key | Required if using | | `KUBECONFIG` | Kubernetes config path | Auto-detected | | `DOT_AI_SESSION_DIR` | Session storage directory | `./tmp/sessions` | | `WEB_UI_BASE_URL` | Web UI base URL | Optional | | `DEBUG_DOT_AI` | Enable debug logging | `false` | ### Supported AI Providers | Provider | Models | Notes | |----------|--------|-------| | Anthropic | Claude Sonnet 4.5, Opus, Haiku | Default, 1M token context | | OpenAI | GPT-5.1-codex | | | Google | Gemini 3 Pro, Flash | | | xAI | Grok-4 | | | Amazon Bedrock | Various | Uses AWS credential chain | | OpenRouter | Multi-model | Proxy to multiple providers | | Custom | Ollama, vLLM, LocalAI | Via `baseURL` config | ## See Also - [MCP Remediate Guide](//mcp/remediate/) - [Controller Documentation](//controller/) - [Web UI Documentation](//ui/) - [RemediationPolicy Reference](//controller/remediationpolicy/) ## AI Engine # DevOps AI Toolkit **AI-powered platform engineering and DevOps automation through intelligent Kubernetes operations and conversational workflows.** # REST API Gateway **Complete guide for accessing DevOps AI Toolkit functionality via standard HTTP REST endpoints.** ## Overview The REST API Gateway provides HTTP access to all DevOps AI Toolkit capabilities without requiring MCP protocol implementation. 
Perfect for: - **DevOps Engineers**: Integrate tools into automation scripts and CI/CD pipelines - **Platform Teams**: Add AI-powered Kubernetes assistance to existing HTTP-based infrastructure - **QA/Testing Teams**: Create comprehensive integration tests with simple HTTP calls - **Kubernetes Controller Developers**: Access AI tools via familiar REST patterns ### What it provides: - **Universal Tool Access**: All 9 tools available via `POST /api/v1/tools/{toolName}` endpoints - **Auto-Generated Documentation**: OpenAPI 3.0 specification with interactive documentation - **Zero Maintenance**: Automatically stays current when new tools are added - **Standard HTTP**: JSON request/response with proper status codes - **Tool Discovery**: Searchable catalog of available tools and capabilities ## Setup The REST API is automatically available when you deploy the DevOps AI Toolkit. **No special configuration is needed** - the same server provides MCP protocol, CLI, and REST API access simultaneously. **🎯 Follow the [Deployment Guide](../setup/deployment.md) to deploy the server.** The Kubernetes deployment automatically provides: - ✅ **REST API endpoints** on the configured ingress URL - ✅ **MCP protocol access** for AI development tools - ✅ **All dependencies** including Qdrant vector database - ✅ **Both protocols simultaneously** with no conflicts ## Quick Start ### 1. 
Discover Available Tools ```bash curl -s http://localhost:3456/api/v1/tools | jq '.data.tools[] | {name, description, category}' ``` **Response**: All 9 available tools with descriptions (three shown here): ```json [ { "name": "version", "description": "Get comprehensive system status including version information, Vector DB connection status, embedding service capabilities, Anthropic API connectivity, Kubernetes cluster connectivity, Kyverno policy engine status, and pattern management health check", "category": "System" }, { "name": "recommend", "description": "Deploy, create, setup, install, or run applications, infrastructure, and services on Kubernetes with AI recommendations", "category": "AI Tools" }, { "name": "remediate", "description": "AI-powered Kubernetes issue analysis that provides root cause identification and actionable remediation steps", "category": "Troubleshooting" } ] ``` ### 2. Get System Status ```bash curl -s -X POST http://localhost:3456/api/v1/tools/version \ -H "Content-Type: application/json" \ -d '{}' ``` **Response**: Comprehensive system status (truncated): ```json { "success": true, "data": { "result": { "status": "success", "system": { "version": { "version": "0.90.0", "nodeVersion": "v23.11.0", "platform": "darwin" }, "vectorDB": { "connected": true, "collections": { "patterns": {"exists": true, "documentsCount": 3}, "policies": {"exists": true, "documentsCount": 3}, "capabilities": {"exists": true, "documentsCount": 157} } }, "kubernetes": { "connected": true, "context": "kind-kind" } } } } } ``` ### 3.
Access Interactive API Documentation ```bash # Get OpenAPI specification curl -s http://localhost:3456/api/v1/openapi | jq '.info' ``` **Response**: OpenAPI 3.0 specification with all tool schemas: ```json { "title": "DevOps AI Toolkit REST API", "description": "REST API gateway for DevOps AI Toolkit tools", "version": "1.0.0", "contact": { "name": "Viktor Farcic", "url": "https://devopstoolkit.live/" } } ``` ## API Reference ### Base URL and Versioning **Base URL**: `http://localhost:3456/api/v1` All endpoints use `/api/v1/` prefix for versioning. Future API changes will use new version paths (`/api/v2/`) maintaining backward compatibility. ### Core Endpoints #### Tool Discovery ```http GET /api/v1/tools ``` **Query Parameters**: - `category` - Filter by tool category (e.g., "AI Tools", "System", "Troubleshooting") - `tag` - Filter by a single tag (e.g., "kubernetes", "deployment", "analysis") - `search` - Search tool names and descriptions **Examples**: ```bash # Get all AI-powered tools curl "http://localhost:3456/api/v1/tools?category=AI%20Tools" # Find deployment-related tools curl "http://localhost:3456/api/v1/tools?tag=deployment" # Search for troubleshooting tools curl "http://localhost:3456/api/v1/tools?search=troubleshoot" ``` #### Tool Execution ```http POST /api/v1/tools/{toolName} Content-Type: application/json ``` **Request Body**: Tool-specific parameters (see OpenAPI spec for schemas) **Response**: Standardized JSON response with tool results #### OpenAPI Documentation ```http GET /api/v1/openapi ``` Returns complete OpenAPI 3.0 specification with: - All tool endpoints and schemas - Parameter descriptions and validation rules - Response formats and error codes - Interactive documentation support ### Response Format All REST API responses follow this standard format: ```json { "success": boolean, "data": { "result": any, // Tool execution result "tool": string, // Tool name "executionTime": number // Execution time in milliseconds }, "error": { // 
Only present if success: false "code": string, "message": string, "details": any }, "meta": { "timestamp": string, // ISO 8601 timestamp "requestId": string, // Unique request identifier "version": string // API version } } ``` ### HTTP Status Codes - **200 OK**: Successful tool execution - **400 Bad Request**: Invalid request parameters or missing required fields - **404 Not Found**: Tool does not exist - **405 Method Not Allowed**: Invalid HTTP method (tools only accept POST) - **500 Internal Server Error**: Tool execution failure or server error ## Tool Discovery and Filtering Instead of listing all tools statically, use the API to discover available tools dynamically: ### Discover All Tools ```bash curl http://your-ingress-url/api/v1/tools | jq '.data.tools[] | {name, description, category}' ``` ### Filter by Category ```bash # Get all AI-powered tools curl "http://your-ingress-url/api/v1/tools?category=AI%20Tools" # Available categories: AI Tools, Deployment, Management, System, Troubleshooting, Documentation ``` ### Filter by Tags ```bash # Find deployment-related tools curl "http://your-ingress-url/api/v1/tools?tag=deployment" # Find troubleshooting tools curl "http://your-ingress-url/api/v1/tools?tag=troubleshooting" ``` ### Search Tools ```bash # Search for specific functionality curl "http://your-ingress-url/api/v1/tools?search=kubernetes" curl "http://your-ingress-url/api/v1/tools?search=database" ``` ### Get Complete Tool Documentation For detailed parameter schemas and usage instructions: ```bash # Get OpenAPI specification with all tool schemas curl http://your-ingress-url/api/v1/openapi | jq '.paths' ``` ## Workflows and Use Cases The REST API provides the same workflows as MCP and CLI. The only difference is using HTTP POST requests.
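As an illustration, the envelope and status codes above can be handled with a small client-side helper. This is a sketch with hypothetical helper names; only the field names (`success`, `data.result`, `data.executionTime`, `error.code`, `error.message`) come from the documented schema:

```python
def unwrap_response(status_code: int, body: dict):
    """Unwrap the standardized REST API envelope, raising on errors.

    Illustrative helper, not part of the toolkit: field names follow
    the response format documented above.
    """
    if status_code == 404:
        raise LookupError("tool does not exist")
    if status_code == 405:
        raise ValueError("tools only accept POST")
    if not body.get("success"):
        err = body.get("error") or {}
        raise RuntimeError(f"{err.get('code')}: {err.get('message')}")
    data = body["data"]
    return data["result"], data.get("executionTime")

# Example with a successful envelope (sample data, not a live response):
ok = {
    "success": True,
    "data": {"result": {"status": "success"}, "tool": "version", "executionTime": 12},
    "meta": {"timestamp": "2025-01-15T00:00:00Z", "requestId": "abc", "version": "v1"},
}
result, ms = unwrap_response(200, ok)
```

Because every tool shares this envelope, one helper like this can wrap calls to any `/api/v1/tools/{toolName}` endpoint.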
**For complete workflow patterns and use cases, see the tool-specific guides:** - **Deployment Workflows**: [Kubernetes Deployment Recommendations](../tools/recommend.md) - **Troubleshooting Workflows**: [AI-Powered Issue Remediation](../tools/remediate.md) - **Capability Management**: [Capability Management Guide](../tools/capability-management.md) - **Pattern Management**: [Pattern Management Guide](../organizational-data/patterns.md) **Example: REST API call** ```bash # REST API: curl -X POST http://your-ingress-url/api/v1/tools/recommend \ -H "Content-Type: application/json" \ -d '{"intent": "deploy PostgreSQL database"}' ``` The business logic, parameters, responses, and multi-step workflows are identical across all access methods. # Observability Guide **Complete guide for distributed tracing and observability in the DevOps AI Toolkit.** ## Overview **What it does**: Provides OpenTelemetry-based distributed tracing for debugging complex workflows, measuring AI provider performance, and understanding Kubernetes operation latency. **Use when**: You need to understand where time is spent in multi-step workflows, debug performance issues, or monitor AI/Kubernetes operations in production. **📖 Full Guide**: This document covers tracing setup, configuration, backend integration, and trace interpretation specific to the DevOps AI Toolkit. ### What is Distributed Tracing? Learn about distributed tracing concepts and OpenTelemetry: - [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - [OpenTelemetry Tracing Concepts](https://opentelemetry.io/docs/concepts/observability-primer/#distributed-tracing) This guide focuses on DevOps AI Toolkit-specific tracing implementation, configuration, and usage patterns. 
## Prerequisites - DevOps AI Toolkit deployed (see [Deployment Guide](../setup/deployment.md)) - Basic understanding of distributed tracing concepts (optional but helpful) - A backend for viewing traces (Jaeger, Grafana Tempo, or a vendor service), or console output ## Quick Start ### Environment Variables Add tracing environment variables to your deployment configuration (see [Deployment Guide](../setup/deployment.md) for how to configure environment variables). | Variable | Required | Default | Description | |----------|----------|---------|-------------| | `OTEL_TRACING_ENABLED` | No | `false` | Set to `true` to enable tracing | | `OTEL_SERVICE_NAME` | No | `dot-ai-mcp` | Service name in traces | | `OTEL_EXPORTER_TYPE` | No | `console` | Exporter type: `console`, `otlp`, `jaeger`, `zipkin` | | `OTEL_EXPORTER_OTLP_ENDPOINT` | Required for OTLP | - | OTLP endpoint URL (e.g., `http://localhost:4318/v1/traces`) | | `OTEL_SAMPLING_PROBABILITY` | No | `1.0` | Sampling rate: `0.0` to `1.0` (1.0 = 100%, 0.1 = 10%) | | `OTEL_DEBUG` | No | `false` | Enable debug logging for tracing | ### Verify Tracing Status After configuring, verify tracing status: ```md User: Show me the system status Agent: The system is healthy and all components are operational: ... Tracing: Enabled - Exporter: console - Service Name: dot-ai-mcp - Status: initialized ``` The agent will report tracing configuration as part of the system status. ## What Gets Traced The DevOps AI Toolkit automatically traces all operations without requiring code changes: ### Tool Execution - All tools (recommendations, remediation, capability management, etc.)
- Tool parameters and execution duration - Success/failure status - Session IDs for workflow correlation ### AI Provider Operations - **Chat completions**: Claude, OpenAI, Google, xAI, and custom endpoints - **Tool loop iterations**: Multi-step AI workflows with per-iteration visibility - **Embeddings generation**: Vector embeddings for semantic search - **Token usage**: Input tokens, output tokens, cache metrics - **Model information**: Provider names and specific model versions ### Kubernetes Operations - **API client calls**: All Kubernetes API operations through the client library - **kubectl commands**: CLI command execution with operation details - **Resource information**: Resource types, namespaces, and operation latency ### Vector Database Operations - **Search queries**: Semantic and keyword searches with result counts - **Document operations**: Upserts, deletions, and retrievals - **Collection management**: Collection operations and health checks - **Performance metrics**: Query latency and result quality scores ## Backend Integration ### Jaeger Jaeger is an open-source distributed tracing platform. Run Jaeger locally with Docker: ```bash docker run -d --name jaeger \ -p 16686:16686 \ -p 4318:4318 \ jaegertracing/all-in-one:latest ``` Configure the server to send traces to Jaeger: ```bash OTEL_TRACING_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/traces ``` Access the Jaeger UI at http://localhost:16686 to view traces. ### Other Backends Any tracing backend that supports OpenTelemetry OTLP protocol should work with the same configuration pattern: ``` OTEL_TRACING_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT= ``` Refer to your backend's documentation for the specific OTLP endpoint URL. ## Viewing Traces ### Jaeger UI Open Jaeger UI at http://localhost:16686 (if using local Jaeger setup). **Finding Traces:** 1. Select `dot-ai-mcp` from the Service dropdown 2. Click "Find Traces" button 3. 
View list of recent traces with duration and span count **Trace Details:** - Click on a trace to see the complete request flow - Spans are displayed in a waterfall timeline showing parent-child relationships - Each span shows operation name, duration, and timing relative to the trace start - Click on individual spans to see detailed attributes ### Understanding Trace Information **Tool Execution Span:** - Operation name: `execute_tool ` - Shows total time for tool execution - Contains session ID and tool parameters **AI Provider Spans:** - Operation names: `chat `, `tool_loop `, `embeddings ` - Token usage: `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens` - Cache metrics: `gen_ai.usage.cache_read_tokens`, `gen_ai.usage.cache_creation_tokens` - Model details: `gen_ai.request.model`, `gen_ai.provider.name` **Kubernetes Operation Spans:** - Operation names: API method names or `kubectl ` - Attributes: `k8s.api`, `k8s.method`, `k8s.operation`, `k8s.resource` - Shows latency for Kubernetes API calls **Vector Database Spans:** - Operation names: `search`, `upsert`, `delete`, `list`, etc. - Attributes: `db.operation.name`, `db.collection.name` - Result metrics: `db.query.result_count`, `db.vector.top_score` ### Trace Hierarchy All spans from a single tool invocation share the same trace ID and follow this hierarchy: ``` execute_tool (root span) ├── chat (AI operation) │ └── POST https://api.anthropic.com (HTTP call) ├── search (vector DB query) │ └── POST http://localhost:6333 (HTTP call) └── k8s.listNamespacedDeployment (Kubernetes API) └── GET https://kubernetes/apis/apps (HTTP call) ``` This hierarchy helps identify which operations are taking the most time and where bottlenecks occur. 
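As a recap of the configuration options in this guide, the defaults from the environment-variable table can be expressed as a small stdlib-only sketch. The variable names match the table; the helper itself is hypothetical, not the toolkit's actual startup code:

```python
import os

# Defaults mirror the tracing configuration table in this guide.
_DEFAULTS = {
    "OTEL_TRACING_ENABLED": "false",
    "OTEL_SERVICE_NAME": "dot-ai-mcp",
    "OTEL_EXPORTER_TYPE": "console",
    "OTEL_SAMPLING_PROBABILITY": "1.0",
    "OTEL_DEBUG": "false",
}

def tracing_config(env=os.environ) -> dict:
    """Resolve tracing settings, applying the documented defaults."""
    cfg = {k: env.get(k, v) for k, v in _DEFAULTS.items()}
    cfg["enabled"] = cfg["OTEL_TRACING_ENABLED"].lower() == "true"
    # The OTLP endpoint has no default; it is required only for the otlp exporter.
    cfg["OTEL_EXPORTER_OTLP_ENDPOINT"] = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if cfg["OTEL_EXPORTER_TYPE"] == "otlp" and not cfg["OTEL_EXPORTER_OTLP_ENDPOINT"]:
        raise ValueError("OTEL_EXPORTER_OTLP_ENDPOINT is required for the otlp exporter")
    return cfg

cfg = tracing_config({"OTEL_TRACING_ENABLED": "true"})
```

A sketch like this makes the dependency explicit: everything has a safe default except the OTLP endpoint, which must be supplied whenever the `otlp` exporter is selected.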
# Telemetry Guide **Anonymous usage analytics to help improve DevOps AI Toolkit.** # Organizational Data Management Concepts **Understanding the relationship between Capabilities, Patterns, and Policies in the DevOps AI Toolkit.** ## Overview The DevOps AI Toolkit manages three types of organizational knowledge that work together to provide intelligent, compliant, and standardized Kubernetes deployments: - **Capabilities** - What resources can do (semantic understanding) - **Patterns** - What resources to deploy (organizational preferences) - **Policies** - How resources should be configured (governance requirements) ## The Three Pillars of Organizational Knowledge ### Capabilities: Resource Understanding **Purpose**: Discover and understand what Kubernetes resources actually do - **Function**: Semantic understanding of cluster resources and their capabilities - **Required for**: All intelligent recommendations and resource discovery - **Example**: Teaches AI that `sqls.devopstoolkit.live` provides PostgreSQL database capabilities - **When to use**: First step - scan your cluster to teach AI about available resources - **Goal**: Make AI smarter about your cluster's existing resources and operators **Key Characteristics**: - **Automatic discovery** through cluster scanning and AI analysis - **Resource-specific** understanding of what each CRD and operator provides - **Foundation layer** that enables all other intelligent features - **Continuously updated** as new resources are deployed to cluster ### Patterns: Deployment Guidance **Purpose**: Define organizational preferences for resource combinations - **Function**: Organizational best practices for what resources work well together - **Required for**: Enhanced recommendations that follow team standards - **Example**: Defines that web applications should include Deployment + Service + Ingress + HPA - **When to use**: After capabilities - create patterns for your common deployment scenarios - **Goal**: Make AI 
follow your team's deployment standards and architecture decisions **Key Characteristics**: - **Platform team authored** based on organizational experience and standards - **Resource combination focused** on what to deploy together - **Suggestion-based** enhancement of AI recommendations (not enforcement) - **Use case specific** patterns for different types of applications and workloads ### Policies: Configuration Governance **Purpose**: Ensure resources are configured according to governance requirements - **Function**: Proactive compliance that guides users toward correct configurations - **Required for**: Governance compliance and security enforcement - **Example**: Ensures all containers have resource limits, images from trusted registries - **When to use**: Throughout deployment - policies guide configuration decisions - **Goal**: Make AI recommend compliant configurations from the start, preventing violations **Key Characteristics**: - **Security/platform team authored** based on compliance and governance needs - **Configuration focused** on how resources should be set up - **Proactive guidance** that prevents violations rather than blocking after creation - **Optionally enforceable** through generated Kyverno policies for cluster-level blocking ## How They Work Together ### The AI Recommendation Pipeline ``` User Intent → Capability Discovery → Pattern Enhancement → Policy Compliance → Final Configuration ``` 1. **User Intent**: "Deploy a web application with a database" 2. **Capability Discovery**: - AI searches cluster capabilities - Finds: `apps/v1/Deployment`, `sqls.devopstoolkit.live/SQL`, `networking.k8s.io/Ingress` - Understanding: Deployment for apps, SQL CRD for databases, Ingress for traffic 3. **Pattern Enhancement**: - AI searches organizational patterns - Finds: "Web Application Pattern" (Deployment + Service + Ingress + HPA) - Enhancement: Adds HPA and Service to the recommendation 4. 
**Policy Compliance**: - AI searches policy intents - Finds: "Resource Limits Policy", "Image Registry Policy" - Integration: Questions include required resource limits and trusted image defaults 5. **Final Configuration**: - User gets questions with policy-driven requirements and pattern-enhanced suggestions - Generated manifests are compliant and follow organizational standards from the start ### Practical Example **Scenario**: Developer requests "Deploy a Node.js API" **Without organizational data**: ``` Questions: - Application name? - Container image? - Port? Basic Deployment + Service created ``` **With full organizational data**: ``` Capabilities found: Deployment, Service, Ingress, HPA available Pattern matched: "Web Application Pattern" Policies found: "Resource Limits Policy", "Image Registry Policy" Enhanced questions: - Application name? - Container image? (⚠️ must be from registry.company.com - policy requirement) - Port? - CPU limit? (⚠️ required by Resource Limits Policy) [default: 500m] - Memory limit? (⚠️ required by Resource Limits Policy) [default: 512Mi] - Enable autoscaling? (suggested by Web Application Pattern) [default: yes] Generated resources: Deployment + Service + Ingress + HPA All with policy-compliant configurations and organizational best practices ``` ## When to Use Each Type ### Capabilities (Start Here - Required) **Always required** for intelligent recommendations. **Preferred: Use the controller** for automatic, event-driven scanning. See [Capability Management Guide](../tools/capability-management.md#method-1-controller-based-scanning-recommended). 
**Alternative: Manual scanning** when controller cannot reach the server: ```text "Scan my cluster capabilities" ``` **Manual scanning use cases**: - Server running locally (not accessible from cluster) - One-time scanning without controller setup ### Patterns (Optional but Recommended) **Enhance recommendations** with organizational standards: ``` "I want to create a deployment pattern for web applications" ``` **Use when**: - Your team has established deployment standards - You want consistent resource combinations across projects - Developers frequently ask "what resources do I need for X?" - You have architectural best practices to encode ### Policies (As Needed for Governance) **Enforce compliance** requirements proactively: ``` "I want to create a policy for container resource limits" ``` **Use when**: - You have security or compliance requirements to enforce - Manual policy enforcement is error-prone or slow - You want to guide users toward compliance rather than block them - Governance teams need to ensure consistent configuration standards ## Setup and Workflow Order ### Recommended Implementation Order 1. **Start with Capabilities** (Required foundation): ``` "Scan cluster capabilities" ``` - Enables all intelligent features - Takes 5-10 minutes for initial scan - Should be done before patterns or policies 2. **Add Patterns** (Organizational enhancement): ``` "Create organizational patterns for our common use cases" ``` - Start with 3-5 most common deployment types - Gather feedback from development teams - Iterate based on usage and effectiveness 3. 
**Implement Policies** (Governance requirements): ``` "Create policy intents for our compliance requirements" ``` - Focus on your most critical governance needs first - Test policy integration with real deployment scenarios - Consider Kyverno enforcement for critical policies ### Prerequisites for Each Type **All types require**: - DevOps AI Toolkit deployed - Vector DB service (Qdrant) for semantic storage - API keys for AI models and embedding providers (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration)) **Additionally for Policies**: - Kyverno installed (optional - only needed for cluster enforcement) - kubectl access (optional - only needed for policy deployment) ## Best Practices ### Integration Strategy - **Start simple**: Begin with capabilities, add patterns for your top 3 use cases, implement 1-2 critical policies - **Iterate based on feedback**: Gather input from development teams on what's helpful vs. burdensome - **Maintain consistency**: Ensure patterns and policies complement rather than conflict with each other ### Team Collaboration - **Capabilities**: Platform team manages (automated scanning) - **Patterns**: Platform + development teams collaborate (based on real usage) - **Policies**: Security + platform teams own (based on compliance requirements) ### Quality and Maintenance - **Review quarterly**: Ensure organizational data reflects current standards and needs - **Update incrementally**: Add new patterns/policies as needs emerge rather than trying to cover everything upfront - **Measure effectiveness**: Track whether recommendations become more useful and compliant over time ## FAQ **Q: Do I need all three types?** A: Capabilities are required for intelligent recommendations. Patterns and policies are optional enhancements that add organizational consistency and compliance. **Q: Can they conflict with each other?** A: They're designed to be complementary. 
Patterns suggest what to deploy, policies ensure it's configured correctly. The AI balances both when making recommendations. **Q: What happens if I only have capabilities?** A: You get intelligent resource discovery and semantic matching, but without organizational context or governance guidance. **Q: How do I know if my organizational data is working?** A: Test with real deployment requests. The AI should mention organizational context and policy requirements in its recommendations. **Q: Can I use this without Vector DB?** A: No, all three types require Vector DB for semantic storage and retrieval. This enables intelligent matching based on user intent. ## See Also - **[Capability Management Guide](../tools/capability-management.md)** - Cluster resource discovery and understanding - **[Pattern Management Guide](patterns.md)** - Creating organizational deployment standards - **[Policy Management Guide](policies.md)** - Implementing governance and compliance requirements - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration # Pattern Management Guide **Complete guide for creating and managing organizational deployment patterns with the DevOps AI Toolkit.** ## Overview Pattern Management enables platform engineers and architects to capture organizational deployment knowledge as reusable patterns. These patterns automatically enhance AI deployment recommendations, ensuring consistency with your team's best practices and organizational standards. ### What is Pattern Management? 
Pattern Management allows you to: - **Create deployment patterns** that capture your organization's preferred resource combinations - **Store patterns semantically** using Vector DB technology for intelligent matching - **Enhance AI recommendations** with organizational context and best practices - **Share institutional knowledge** across teams through standardized deployment approaches ### Understanding Organizational Data Types Pattern Management works alongside Capability Management and Policy Management to provide comprehensive organizational intelligence for AI recommendations. **Quick Overview**: - **Capabilities**: What resources can do (required foundation) - **Patterns**: What resources to deploy together (organizational preferences) - **Policies**: How resources should be configured (governance requirements) For a complete understanding of how these three types work together, see the **[Organizational Data Concepts Guide](concepts.md)**. ### How It Works 1. **Pattern Creation** → Platform engineers define deployment patterns with triggers and resource recommendations 2. **Semantic Storage** → Patterns are stored with AI-generated embeddings for intelligent matching 3. **AI Integration** → When users request deployments, relevant patterns automatically enhance recommendations 4. **Continuous Learning** → Patterns improve recommendation quality over time through usage (planned for future versions) ## Prerequisites Before using Pattern Management, ensure you have: ### Required Setup - **DevOps AI Toolkit deployed** (see [Deployment Guide](../setup/deployment.md)) - **Vector DB service** (Qdrant) for pattern storage - **API keys** for AI models and embedding providers (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration)) can be set as environment variables or in your deployment configuration For complete setup instructions, see the [Deployment Guide](../setup/deployment.md). 
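A pattern as described above carries a description, trigger keywords, suggested resources, a rationale, and an owner. A minimal model of that record (illustrative field names, not the toolkit's actual storage schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class Pattern:
    # Illustrative model of a stored pattern; field names are
    # assumptions, not the server's real schema.
    description: str
    triggers: list
    resources: list
    rationale: str
    created_by: str

    def to_document(self) -> dict:
        """Payload shape a Vector DB upsert might store alongside an embedding."""
        return asdict(self)

p = Pattern(
    description="stateless web applications",
    triggers=["web app", "api service", "microservice"],
    resources=["Deployment", "Service", "Ingress", "HorizontalPodAutoscaler"],
    rationale="Complete stateless deployment with scaling and external access",
    created_by="platform-team",
)
```

In the real workflow these fields are collected conversationally (Steps 1-6 below) and the server generates the embedding before storage.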
## Configuration ### Vector Database Setup Pattern Management requires a Vector DB service for semantic pattern storage and retrieval. #### Qdrant Cloud (Recommended) 1. **Sign up** at [Qdrant Cloud](https://cloud.qdrant.io/) 2. **Create a cluster** (free tier works fine for pattern storage) 3. **Get connection details**: - URL from cluster dashboard - API key from cluster settings 4. **Add to `.mcp.json`** environment variables #### Local Qdrant ```bash # Run local Qdrant with Docker (detached/background) docker container run --detach --name qdrant --publish 6333:6333 --volume $(pwd)/tmp/qdrant_storage:/qdrant/storage qdrant/qdrant ``` **Configuration**: ```json "env": { "QDRANT_URL": "http://localhost:6333" } ``` **Test Connectivity**: ```bash # Verify Qdrant is responding curl http://localhost:6333/healthz ``` ### Embedding Service Setup For semantic pattern matching, configure an embedding provider: #### Embedding Provider Configuration The toolkit supports multiple embedding providers (OpenAI, Google, Amazon Bedrock) for enhanced semantic search. 📖 **[Complete Embedding Provider Setup Guide](../setup/deployment.md#embedding-provider-configuration)** - Choose your provider, set API keys, and configure dimensions. **Note**: An embedding provider is **required** for pattern management. The system cannot store or search patterns without embedding generation capabilities. ## Pattern Management Operations All pattern management is done through natural conversation with your AI assistant. The system provides a guided workflow for creating comprehensive patterns. ### Creating Organizational Patterns **Start Pattern Creation**: ``` I want to create a new organizational deployment pattern ``` The AI assistant will guide you through a step-by-step workflow: #### Step 1: Pattern Description The assistant will ask for a clear description of your pattern: **Example**: ``` Assistant: What deployment capability does this pattern provide? I need a capability name (2-4 words). 
You: stateless web applications ``` #### Step 2: Trigger Keywords Define initial keywords that should activate this pattern: **Example**: ``` Assistant: What keywords or phrases should trigger this pattern? Please provide keywords separated by commas. You: stateless application, web app, api service, microservice, rest api, frontend application ``` #### Step 3: Trigger Expansion The system suggests additional related terms: **Example**: ``` Assistant: I found these additional terms that might also trigger your pattern: - application - service - workload Which should also trigger this pattern? You can select specific ones, include all, or skip. You: include: application, service, workload ``` #### Step 4: Suggested Resources Specify Kubernetes resources this pattern should recommend: **Example**: ``` Assistant: Which Kubernetes resources should be suggested for stateless web applications? Please list the resource types separated by commas. You: Deployment, Service, Ingress, HorizontalPodAutoscaler ``` #### Step 5: Rationale Explain why these resources work well together: **Example**: ``` Assistant: Why does this combination of resources work well together for stateless web applications? You: This combination provides a complete stateless application deployment: Deployment manages replicated pods without persistent state, Service provides stable network access, Ingress handles external routing and SSL termination, and HorizontalPodAutoscaler automatically scales based on demand. Together they create resilient, scalable applications that can handle traffic fluctuations and failures gracefully. ``` #### Step 6: Pattern Ownership Specify who created the pattern: **Example**: ``` Assistant: What is your name or team identifier? This helps track pattern ownership. 
You: platform-team ``` #### Step 7: Review and Confirmation Review the complete pattern before creation: **Example**: ``` Assistant: Please review your pattern: **Description**: stateless web applications **Triggers**: stateless application, web app, api service, microservice, rest api, frontend application, application, service, workload **Suggested Resources**: Deployment, Service, Ingress, HorizontalPodAutoscaler **Rationale**: This combination provides a complete stateless application deployment... **Created By**: platform-team Does this look correct? Type 'confirm' to create the pattern, or 'modify' to make changes. You: confirm ``` The system will then create and store your pattern with semantic embeddings for intelligent matching. **Note**: The conversation examples above are simplified for readability. The actual workflow uses structured prompts, specific response formats, and includes additional validation steps, but the core information flow remains the same. ### Listing Patterns **View All Patterns**: ``` Show me all organizational patterns ``` **Example Response**: ``` Found 3 organizational deployment patterns: 1. **Stateless Applications** (Created: 2025-01-15) - Triggers: 8 keywords including "web app", "api service", "microservice" - Resources: Deployment, Service, Ingress, HorizontalPodAutoscaler - Created by: platform-team 2. **Database Workloads** (Created: 2025-01-14) - Triggers: 6 keywords including "database", "persistent storage", "stateful" - Resources: StatefulSet, PersistentVolumeClaim, Secret - Created by: data-team 3. 
**Network Security** (Created: 2025-01-13) - Triggers: 5 keywords including "network policy", "security", "isolation" - Resources: NetworkPolicy - Created by: security-team Search mode: semantic+keyword hybrid search enabled ``` ### Viewing Pattern Details **Get Specific Pattern**: ``` Show me details for the stateless applications pattern ``` **Example Response**: ``` Pattern: Stateless Applications **Description**: Stateless web applications **Created**: 2025-01-15 by platform-team **Triggers** (8): - stateless application - web app - api service - microservice - rest api - frontend application - http service - web service **Suggested Resources** (4): - Deployment - Service - Ingress - HorizontalPodAutoscaler **Rationale**: This combination provides a complete stateless application deployment: Deployment manages replicated pods without persistent state, Service provides stable network access, Ingress handles external routing and SSL termination, and HorizontalPodAutoscaler automatically scales based on demand. Together they create resilient, scalable applications that can handle traffic fluctuations and failures gracefully. ``` ### Deleting Patterns **Remove Pattern**: ``` Delete the network security pattern ``` The assistant will confirm the deletion and remove the pattern from storage. ## AI Integration Patterns automatically enhance deployment recommendations when users request deployments. For complete examples of how patterns influence recommendations, see the [Recommendation Guide](../tools/recommend.md). The recommendation system automatically: - Searches for relevant patterns using semantic matching - Includes pattern context in AI prompts - Balances organizational consistency with specific user needs **Important**: Patterns serve as suggestions to enhance AI decision-making, not rigid requirements. 
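To illustrate how trigger keywords gate pattern matching, here is a simplified keyword-only stand-in for the hybrid semantic+keyword search the toolkit actually performs (embedding similarity is omitted; all names are illustrative):

```python
def keyword_matches(request: str, patterns: list) -> list:
    """Return names of patterns whose trigger keywords appear in the request.

    Simplified sketch: the real system also uses semantic embedding
    search, so matches are not limited to literal keyword hits.
    """
    text = request.lower()
    return [p["name"] for p in patterns
            if any(trigger in text for trigger in p["triggers"])]

patterns = [
    {"name": "Stateless Applications", "triggers": ["web app", "api service", "microservice"]},
    {"name": "Database Workloads", "triggers": ["database", "persistent storage"]},
]
hits = keyword_matches("Deploy a web app with a database", patterns)
```

A single request can activate multiple patterns, which is why the guide recommends keeping patterns focused and composable rather than broad.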
## Pattern Examples ### Example 1: Stateless Application Pattern **Use Case**: Standard web applications, APIs, and microservices ```yaml Description: Stateless web applications Triggers: - stateless application - web app - api service - microservice - rest api - frontend application Resources: - Deployment - Service - Ingress - HorizontalPodAutoscaler Rationale: Provides complete stateless deployment with scaling, networking, and external access ``` **When It Activates**: User requests for "web app", "API deployment", "microservice setup", etc. ### Example 2: Database Workload Pattern **Use Case**: Persistent databases and stateful services ```yaml Description: Database and persistent storage workloads Triggers: - database - persistent storage - stateful service - data store - mysql - postgresql Resources: - StatefulSet - PersistentVolumeClaim - Secret - Service Rationale: Ensures data persistence, ordered deployment, and secure credential management ``` **When It Activates**: User requests for "database deployment", "persistent storage", "MySQL setup", etc. ### Example 3: Network Security Pattern **Use Case**: Security-focused deployments with network isolation ```yaml Description: Network security and isolation policies Triggers: - network security - network isolation - security policy - network policy - microsegmentation Resources: - NetworkPolicy - ServiceAccount - Role - RoleBinding Rationale: Implements defense-in-depth with network segmentation and RBAC controls ``` **Note**: This pattern uses Pod Security Standards (successor to deprecated PodSecurityPolicy) for pod-level security controls, which are configured at the namespace level rather than as individual resources. **When It Activates**: User mentions "security", "network isolation", "compliance requirements", etc. 
### Example 4: Monitoring Pattern **Use Case**: Applications requiring observability and monitoring ```yaml Description: Monitoring and observability setup Triggers: - monitoring - observability - metrics - logging - tracing - prometheus Resources: - ServiceMonitor - PodMonitor - PrometheusRule - ConfigMap Rationale: Provides comprehensive observability with metrics collection, alerting, and dashboard configuration ``` **Note**: Resources like ServiceMonitor, PodMonitor, and PrometheusRule are Custom Resource Definitions (CRDs) provided by the Prometheus Operator and require it to be installed in your cluster. ConfigMap stores dashboard configurations for Grafana. **When It Activates**: User requests including "monitoring", "observability", "metrics collection", etc. ## Best Practices ### Pattern Creation Guidelines #### 1. Focus and Composability - **Create focused patterns** for specific use cases rather than trying to cover everything - **Make patterns composable** so multiple patterns can enhance a single recommendation - **Avoid overlapping triggers** that might cause confusion between patterns **Good Example**: ```yaml # Focused pattern Description: "Load balancer setup" Triggers: ["load balancer", "external access", "ingress"] Resources: ["Service", "Ingress"] ``` **Avoid**: ```yaml # Too broad Description: "Complete application setup" Triggers: ["application", "app", "deploy", "service", "database", "monitoring"] Resources: ["Deployment", "Service", "Ingress", "StatefulSet", "PVC", "Secret", "ServiceMonitor"] ``` #### 2. 
Clear Trigger Keywords - **Use specific triggers** that clearly indicate when the pattern applies - **Include common variations** and synonyms users might employ - **Add technical terms** your team commonly uses **Effective Triggers**: ```yaml Triggers: - "stateless application" # Specific architecture term - "web app" # Common colloquial term - "api service" # Technical specification - "microservice" # Architecture pattern - "rest api" # Implementation detail ``` #### 3. Meaningful Rationales - **Explain the why** behind resource combinations - **Describe interactions** between suggested resources - **Include failure scenarios** the pattern addresses **Strong Rationale Example**: ``` "This combination provides complete stateless deployment: Deployment manages replicated pods without persistent state, Service provides stable network access, Ingress handles external routing and SSL termination, and HorizontalPodAutoscaler automatically scales based on demand. Together they create resilient applications that can handle traffic fluctuations and pod failures gracefully." ``` ### Organizational Adoption #### 1. Start Simple - **Begin with 3-5 core patterns** covering your most common deployment types - **Validate with actual deployments** before expanding the pattern library - **Gather feedback** from development teams on pattern usefulness #### 2. Team Collaboration - **Involve multiple teams** in pattern creation (platform, security, development) - **Document pattern ownership** for future updates and maintenance - **Create patterns for team-specific needs** (data team patterns, frontend patterns, etc.) #### 3. Iterative Improvement - **Gather feedback** from teams on pattern effectiveness and usage - **Update patterns** based on changing organizational needs - **Archive outdated patterns** that no longer reflect best practices ### Pattern Quality Guidelines These are manual best practices for creating effective patterns. 
The system performs basic validation (required fields) but does not automatically warn about quality issues. #### 1. Resource Selection - **Include complementary resources** that work well together - **Focus on the core resources** needed for the pattern's use case - **Consider resource relationships** when selecting combinations #### 2. Trigger Optimization - **Test triggers** with real user language from past deployment requests - **Include both formal and informal terms** teams actually use - **Avoid overly generic triggers** that match unrelated requests #### 3. Maintenance - **Review patterns quarterly** to ensure they remain current - **Update resources** when new Kubernetes features become available - **Validate rationales** against current architectural decisions ## Troubleshooting ### Common Issues #### Pattern Creation Fails **Symptom**: Error when creating patterns through AI assistant **Possible Causes**: - Vector DB connection issues - Missing required environment variables - Qdrant collection initialization problems **Solutions**: 1. **Check system status**: ``` What's the status of the pattern management system? ``` 2. **Verify Vector DB connection**: - Confirm `QDRANT_URL` and `QDRANT_API_KEY` in `.mcp.json` - Test Qdrant accessibility from your network - Check Qdrant cluster status in dashboard 3. **Validate environment variables**: - Check that your AI model API key is configured (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration)) - Verify `QDRANT_URL` and `QDRANT_API_KEY` are properly set - Confirm all required environment variables are loaded in your configuration #### Patterns Not Found During Search **Symptom**: Relevant patterns don't appear in recommendations **Possible Causes**: - Trigger keywords don't match user language - Embedding service not configured properly - Pattern storage issues **Solutions**: 1. 
**Review pattern triggers**: - Check if triggers match the language users actually employ - Add more trigger variations and synonyms - Test trigger effectiveness with common user requests 2. **Check search capabilities**: ``` Show me search capabilities for patterns ``` 3. **Verify embedding service**: - Confirm embedding provider API key is set (see [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration)) - Test that embedding service is responding - Check if patterns have embeddings stored #### Semantic Search Not Working **Symptom**: Only exact keyword matches work, semantic similarities missed **Possible Causes**: - OpenAI API key missing or invalid - Patterns created without embeddings - Embedding service connectivity issues **Solutions**: 1. **Verify OpenAI configuration**: - Confirm valid embedding provider API key is set (see [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration)) - Test OpenAI API accessibility - Check API key permissions and usage limits 2. **Check embedding status**: ``` What's the status of the embedding service? ``` 3. **Recreate patterns if needed**: - Patterns created without embedding service may need recreation - New patterns will automatically include embeddings if service is available ### System Diagnostics #### Check Overall System Health **Command**: ``` What's the current system status? 
``` **Expected Response**: ```json { "status": "success", "system": { "version": { "version": "0.42.0", "nodeVersion": "v23.11.0", "platform": "darwin", "arch": "arm64" }, "vectorDB": { "connected": true, "url": "http://localhost:6333", "collectionName": "patterns", "patternsCount": 5 }, "embedding": { "available": true, "provider": "openai", "model": "text-embedding-3-small", "dimensions": 1536 }, "anthropic": { "connected": true, "keyConfigured": true } }, "summary": { "overall": "healthy", "patternSearch": "semantic+keyword", "capabilities": [ "semantic-search", "ai-recommendations" ] }, "timestamp": "2025-08-01T23:10:26.691Z" } ``` #### Verify Pattern Storage **Command**: ``` List all organizational patterns ``` **Check For**: - Patterns are being returned successfully - Pattern count matches expectations - Search capabilities indicate semantic or keyword mode #### Test Pattern Matching **Test Method**: 1. Create a test pattern with specific triggers 2. Make a deployment request using those triggers 3. Verify the pattern influences the recommendation 4. Check that AI mentions organizational context ## FAQ ### General Questions **Q: Do I need Qdrant's paid embedding service?** A: No! The system uses OpenAI to generate embeddings and stores them in Qdrant. The free Qdrant tier works perfectly for pattern storage and search. **Q: Can I use pattern management without an embedding service?** A: No. Pattern management requires an embedding provider. See [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration) for setup options. **Q: How many patterns should I create?** A: Start with 3-5 core patterns covering your most common deployment types. Expand based on team feedback and usage patterns. **Q: Can multiple patterns match a single request?** A: Yes! The AI can use multiple relevant patterns to create comprehensive recommendations that combine organizational best practices. 
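The FAQ above notes that embeddings are generated by OpenAI and stored in Qdrant, where matching is done by vector similarity. As a sketch of how that scoring works, here is a toy example in plain Python — 3-dimensional vectors stand in for real 1536-dimensional `text-embedding-3-small` embeddings, and the pattern names and numbers are illustrative, not actual system data:

```python
import math

def cosine_similarity(a, b):
    # Relevance score in the style Qdrant reports: 1.0 = same direction, 0.0 = orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings (real ones come from the embedding provider, e.g. OpenAI).
patterns = {
    "Stateless Application Architecture": [0.9, 0.1, 0.2],
    "Database Deployment Pattern":        [0.1, 0.9, 0.3],
}
request_embedding = [0.85, 0.15, 0.25]  # e.g. the embedding of "deploy a web app"

# Rank stored patterns by similarity to the request, highest first.
ranked = sorted(
    patterns.items(),
    key=lambda kv: cosine_similarity(request_embedding, kv[1]),
    reverse=True,
)
print(ranked[0][0])  # → Stateless Application Architecture
```

Qdrant performs this ranking natively over its stored vectors; the sketch only illustrates why a request like "deploy a web app" can match a "Stateless Application" pattern even when they share no exact keywords.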
### Technical Questions **Q: What happens if Vector DB is unavailable?** A: Pattern operations will fail gracefully. Deployment recommendations continue working but without organizational pattern enhancement. **Q: Can I backup my patterns?** A: Currently, patterns are stored in your Qdrant instance. Back up your Qdrant data to preserve patterns. Export functionality is planned for future versions. **Q: How do I update an existing pattern?** A: Currently, delete the old pattern and create a new one. In-place editing is planned for future versions. **Q: Can I see which patterns influenced a recommendation?** A: The AI will mention when recommendations are enhanced by organizational patterns, though detailed pattern attribution is not yet available. ### Performance Questions **Q: How fast is pattern search?** A: Pattern retrieval typically takes under 100ms for semantic search with reasonable pattern volumes (under 100 patterns). **Q: Does pattern management slow down recommendations?** A: No significant impact. Pattern search runs in parallel with other recommendation analysis and adds minimal latency. **Q: How many patterns can the system handle?** A: The system is tested with 100+ patterns. Qdrant can scale to much larger volumes if needed. 
## Support ### Getting Help **For setup issues**: - Review the [Deployment Guide](../setup/deployment.md) for foundational configuration - Check environment variable configuration in `.mcp.json` - Verify Vector DB connectivity and credentials **For pattern creation problems**: - Use system diagnostics to check service health - Review best practices for trigger keyword selection - Test with simple patterns first before creating complex ones **For AI integration questions**: - Verify that patterns are being stored successfully - Test pattern matching with known trigger keywords - Check that AI mentions organizational context in recommendations ### Community **Documentation**: Complete guides available in `docs/` directory **Issues**: Report bugs and feature requests at [GitHub Issues](https://github.com/vfarcic/dot-ai/issues) ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](../tools/overview.md)** - Browse all available tools and features # Policy Management Guide **Complete guide for creating and managing organizational policy intents with the DevOps AI Toolkit.** ## Overview Policy Management enables platform engineers and security teams to create governance policies that proactively guide users toward compliant configurations. Unlike traditional policy enforcement that blocks manifests after they're created, this system integrates policies into AI recommendations, ensuring compliance from the start. ### What is Policy Management? 
Policy Management allows you to: - **Create policy intents** that capture your organization's security and governance requirements - **Store policies semantically** using Vector DB technology for intelligent matching - **Guide AI recommendations** with policy requirements as part of the configuration process - **Generate Kyverno policies** optionally from policy intents for cluster-level enforcement - **Share governance knowledge** across teams through standardized policy approaches ### Understanding Organizational Data Types Policy Management works alongside Capability Management and Pattern Management to provide comprehensive organizational intelligence for AI recommendations. **Quick Overview**: - **Capabilities**: What resources can do (required foundation) - **Patterns**: What resources to deploy together (organizational preferences) - **Policies**: How resources should be configured (governance requirements) For a complete understanding of how these three types work together, see the **[Organizational Data Concepts Guide](concepts.md)**. ### How It Works 1. **Policy Creation** → Security teams define policy intents describing governance requirements 2. **Semantic Storage** → Policies are stored with AI-generated embeddings for intelligent matching 3. **AI Integration** → When users configure resources, relevant policies enhance questions with compliance requirements 4. **Proactive Guidance** → Users get compliant configurations with policy-driven defaults and validation 5. **Optional Enforcement** → Policy intents can generate Kyverno policies for cluster-level blocking ## See Policy Management in Action [![Policy Management with DevOps AI Toolkit](https://img.youtube.com/vi/hLK9j2cn6c0/maxresdefault.jpg)](https://youtu.be/hLK9j2cn6c0) This video demonstrates how Policy Management transforms organizational governance by integrating security requirements directly into AI-powered deployment recommendations. 
Watch how platform engineers create policy intents that guide users toward compliant configurations from the start, rather than blocking non-compliant deployments after the fact. ## Prerequisites Before using Policy Management, ensure you have: ### Required Setup - **DevOps AI Toolkit deployed** (see [Deployment Guide](../setup/deployment.md)) - **Vector DB service** (Qdrant) for policy storage - **API keys** for AI models and embedding providers (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration)) can be set as environment variables or in your deployment configuration ### Optional (for Kyverno enforcement) - **[Kyverno](https://kyverno.io/)** installed in your cluster for optional policy enforcement - **kubectl access** to your cluster for policy deployment For complete setup instructions, see the [Deployment Guide](../setup/deployment.md). ## Section 1: Policy Management Basics ### Creating Policy Intents **Start Policy Creation**: ``` I want to create a new policy intent ``` The AI assistant will guide you through a step-by-step workflow: #### Step 1: Policy Description The assistant asks for the policy requirement: **Example**: ``` Assistant: Please describe the policy intent you want to create. What should this policy enforce or ensure in your Kubernetes deployments? Examples: - "All containers must have resource limits defined" - "Images must be from trusted registries only" - "Pods must not run as root user" - "All services must have network policies" What policy requirement do you want to enforce? You: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation ``` #### Step 2: Infrastructure Types Define what types of infrastructure this applies to: **Example**: ``` Assistant: What types of infrastructure should this apply to? Think about the categories of resources and systems this covers, not technical implementation details. 
Examples by domain: - **Applications**: web applications, microservices, APIs, frontends, backends - **Databases**: databases, SQL databases, NoSQL, data stores, persistent storage - **Networking**: load balancers, ingresses, services, network policies, DNS - **Storage**: persistent volumes, storage classes, backup systems - **Security**: secrets, certificates, authentication, authorization, RBAC - **Infrastructure**: clusters, nodes, compute resources, cloud resources Please provide infrastructure type keywords separated by commas. You: applications, microservices, web applications, containerized applications, workloads, pods ``` #### Step 3: Trigger Expansion The system suggests additional related terms: **Example**: ``` Assistant: I found these additional terms that might also trigger your policy: - containers - deployments - services - jobs - cronjobs - daemonsets - statefulsets - replicasets Which should also trigger this policy? You can select specific ones, include all, or skip. You: applications, microservices, web applications, containerized applications, workloads, pods, containers, deployments ``` #### Step 4: Policy Rationale Explain why this policy is important: **Example**: ``` Assistant: Why is this policy important for your organization? Please explain the rationale behind this policy requirement - what risks does it mitigate or what benefits does it provide? You: Resource limits prevent any single container from consuming excessive CPU or memory, which could starve other applications and degrade cluster performance. This ensures fair resource allocation and prevents noisy neighbor problems in multi-tenant environments. ``` #### Step 5: Policy Ownership Specify who created the policy: **Example**: ``` Assistant: Who should be credited as the creator of this policy intent? (This can be your name, team name, or organization.) 
You: platform-team ``` #### Step 6: Namespace Scope Selection (Kyverno Only) If Kyverno is installed in your cluster, you can choose which namespaces the policy should apply to: **Example**: ``` Assistant: # Policy Namespace Scope Your policy can be applied cluster-wide or limited to specific namespaces. ## Available Namespaces in Your Cluster: production, staging, development, default, kube-system, kube-public ## Choose the scope for your policy: 1. **Apply to all namespaces** (cluster-wide enforcement) - Type: `all` or `1` 2. **Apply only to specific namespaces** (inclusive list) - Type: `include: namespace1, namespace2, namespace3` - Example: `include: production, staging` 3. **Apply to all namespaces EXCEPT specific ones** (exclusion list) - Type: `exclude: namespace1, namespace2` - Example: `exclude: kube-system, kube-public` You: include: production, staging ``` **Namespace Options**: - **All namespaces**: Policy applies cluster-wide with no restrictions - **Include specific**: Policy applies only to the namespaces you specify - **Exclude specific**: Policy applies everywhere except the namespaces you specify **Note**: This step only appears when Kyverno is installed. For policy intents without Kyverno, skip to Step 7. #### Step 7: Kyverno Policy Generation The system generates a Kyverno policy and presents options: **Example**: ``` Assistant: Please review your policy intent: **Description**: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation **Triggers**: applications, microservices, web applications, containerized applications, workloads, pods, containers, deployments **Rationale**: Resource limits prevent any single container from consuming excessive CPU or memory, which could starve other applications and degrade cluster performance. This ensures fair resource allocation and prevents noisy neighbor problems in multi-tenant environments. 
**Created By**: platform-team I've also generated a Kyverno ClusterPolicy that enforces this requirement: **Generated Kyverno Policy**: apiVersion: kyverno.io/v1 kind: ClusterPolicy metadata: name: require-container-resource-limits-d53b3447 labels: policy-intent/id: d53b3447-0eed-42a0-8a8b-651d18374325 spec: validationFailureAction: Enforce background: false rules: - name: require-pod-container-resource-limits match: any: - resources: kinds: - Pod operations: - CREATE - UPDATE validate: cel: expressions: - expression: >- object.spec.containers.all(container, has(container.resources) && has(container.resources.limits) && has(container.resources.limits.cpu) && has(container.resources.limits.memory) && container.resources.limits.cpu != '' && container.resources.limits.memory != '' ) message: "All containers must have CPU and memory resource limits defined..." # ... (additional rules for Deployment, StatefulSet, ReplicaSet, DaemonSet, ReplicationController) **Choose what to do:** 1. **Apply Kyverno policy to cluster** - Store policy intent AND deploy enforcement to cluster 2. **Store policy intent only** - Save for AI guidance without cluster enforcement 3. **Cancel** - Do nothing ⚠️ **Warning**: Option 1 will deploy active policy enforcement to your cluster. You: 1 ``` #### Step 8: Policy Review and Deployment Choice The system shows the generated policy and asks what you want to do: **Example**: ``` Assistant: Please review your policy intent: **Description**: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation **Triggers**: applications, microservices, web applications, containerized applications, workloads, pods, containers, deployments **Rationale**: Resource limits prevent any single container from consuming excessive CPU or memory, which could starve other applications and degrade cluster performance. 
This ensures fair resource allocation and prevents noisy neighbor problems in multi-tenant environments. **Created By**: platform-team I've also generated a Kyverno ClusterPolicy that enforces this requirement: **Generated Kyverno Policy**: ```yaml # YAML policy content shown here with namespace restrictions based on Step 6 choice ``` **Choose what to do:** 1. **Apply Kyverno policy to cluster** - Store policy intent AND deploy enforcement to cluster 2. **Store policy intent only** - Save for AI guidance without cluster enforcement 3. **Cancel** - Do nothing ⚠️ **Warning**: Option 1 will deploy active policy enforcement to your cluster. You: 1 ``` #### Step 9: Policy Deployment (if chosen) If you chose option 1, the system validates and deploys the policy to your cluster: **Example**: ``` Assistant: **Deploying policy to cluster...** ✅ Kyverno policy validation successful ✅ Policy applied to cluster: require-container-resource-limits-d53b3447 ✅ Policy intent stored in database **Policy Intent and Enforcement Created Successfully!** **Policy ID**: d53b3447-0eed-42a0-8a8b-651d18374325 **Status**: Active - both AI guidance and cluster enforcement enabled **Kyverno Policy**: require-container-resource-limits-d53b3447 The policy intent will now guide AI recommendations during deployments, and the Kyverno policy enforces compliance at the cluster level. ``` The policy intent is now available to guide AI recommendations during resource configuration. ### Listing Policy Intents **View All Policies**: ``` Show me all policy intents ``` **Example Response**: ``` Found 1 policy intents (showing 1) 1. 
**Resource Limits Policy** (Created: 2025-08-25) - Description: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation - Triggers: 8 keywords including "applications", "microservices", "containers", "deployments" - Created by: platform-team - Status: Intent only (no Kyverno policies deployed) Search mode: semantic+keyword hybrid search enabled ``` ### Viewing Policy Details **Get Specific Policy** (using policy ID): ``` Show me policy intent d53b3447-0eed-42a0-8a8b-651d18374325 ``` **Example Response**: ``` Policy intent retrieved successfully: **Policy ID**: d53b3447-0eed-42a0-8a8b-651d18374325 **Description**: All containers must have CPU and memory resource limits defined to prevent resource starvation and ensure fair resource allocation **Created**: 2025-08-25 by platform-team **Status**: Intent only (no Kyverno policies deployed) **Triggers** (8): - applications - microservices - web applications - containerized applications - workloads - pods - containers - deployments **Rationale**: Resource limits prevent any single container from consuming excessive CPU or memory, which could starve other applications and degrade cluster performance. This ensures fair resource allocation and prevents noisy neighbor problems in multi-tenant environments. ``` ### Searching Policy Intents **Search Policies** (using semantic search): ``` Find policies about "resource limits" ``` **Example Response**: ``` Found 1 policy intents matching "resource limits" 1. **Resource Limits Policy** (Score: 0.40) - Description: All containers must have CPU and memory resource limits defined... - Triggers: 8 keywords including "applications", "containers", "deployments" - Created: 2025-08-25 by platform-team - Status: Intent only Semantic search found relevant policies based on description and trigger matching. 
``` **Search Features**: - **Semantic matching**: Finds policies based on meaning, not just exact keywords - **Score ranking**: Results ordered by relevance score - **Comprehensive search**: Searches policy descriptions, triggers, and rationales ### Deleting Policy Intents #### Delete Single Policy **Remove Specific Policy** (requires policy ID): ``` Delete policy intent d53b3447-0eed-42a0-8a8b-651d18374325 ``` **Note**: You need the specific policy ID for deletion. Use `list` or `search` operations first to find the policy ID you want to delete. For policies with no deployed Kyverno policies: **Example Response**: ``` Policy intent deleted successfully (no Kyverno policies to cleanup) **Deleted Policy**: - ID: d53b3447-0eed-42a0-8a8b-651d18374325 - Description: All containers must have CPU and memory resource limits defined... - Created: 2025-08-25 by platform-team ``` For policies with deployed Kyverno policies, the system will ask for confirmation: **Example Confirmation**: ``` Policy intent has deployed Kyverno policies that need cleanup decision: Policy intent "All containers must have resource limits..." has 1 deployed Kyverno policies in your cluster: require-container-resource-limits-d53b3447 **Choose what to do:** 1. **Delete everything** - Remove policy intent AND delete Kyverno policies from cluster 2. **Keep Kyverno policies** - Remove policy intent only, preserve cluster policies ⚠️ **Warning**: Option 1 will remove active policy enforcement from your cluster. **What would you like to do?** ``` #### Delete All Policies **Remove All Policies**: ``` Delete all policy intents ``` If no policies exist: **Example Response**: ``` No policy intents found to delete ``` If policies exist with deployed Kyverno policies, you'll get a batch confirmation similar to single delete, allowing you to choose whether to preserve or remove all cluster policies. **Note**: The conversation examples above are simplified for readability. 
The actual workflow uses structured prompts and includes additional validation steps, but the core information flow remains the same. ## AI Integration Policy intents automatically enhance deployment recommendations when users request deployments. For complete examples of how policies influence recommendations, see the [Recommendation Guide](../tools/recommend.md). The recommendation system automatically: - Searches for relevant policy intents using semantic matching - Includes policy requirements as REQUIRED questions with compliance indicators - Provides policy-compliant defaults and validation guidance - Balances governance requirements with specific user needs **Important**: Policy intents guide users toward compliant configurations proactively, rather than blocking manifests reactively. ## Troubleshooting ### Common Issues #### Policy Creation Fails **Symptom**: Error when creating policy intents through AI assistant **Possible Causes**: - Vector DB connection issues - Missing required environment variables - Qdrant collection initialization problems - Embedding service unavailable **Solutions**: 1. **Check system status**: ``` Show dot-ai status ``` 2. **Verify Vector DB connection**: - Confirm `QDRANT_URL` and `QDRANT_API_KEY` in `.mcp.json` - Test Qdrant accessibility from your network - Check Qdrant cluster status in dashboard 3. **Validate environment variables**: See [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for AI model and embedding provider setup with all available options. #### Kyverno Policy Generation Fails **Symptom**: Policy intent created successfully but Kyverno generation fails **Possible Causes**: - Kyverno not installed in cluster - Invalid policy description that can't be converted to Kyverno rules - Cluster connectivity issues - Missing kubectl access **Solutions**: 1. **Check Kyverno installation**: ```bash kubectl get pods -n kyverno ``` 2. **Verify cluster access**: ```bash kubectl cluster-info ``` 3. 
**Simplify policy description**: - Use clear, specific requirements - Avoid complex conditional logic - Focus on single validation rules 4. **Manual policy application** (if generation fails): The system saves generated policies to files even on deployment failures. Check the error message for the file path and apply manually: ```bash kubectl apply -f /path/to/kyverno-policy.yaml ``` #### Policy Search Not Working **Symptom**: Relevant policies don't appear in AI recommendations **Possible Causes**: - Trigger keywords don't match user language - Embedding service not configured properly - Policy storage issues - Vector DB connectivity problems **Solutions**: 1. **Review policy triggers**: - Check if triggers match the language users actually employ - Add more trigger variations and synonyms - Test with different request phrasings 2. **Check embedding service**: - Verify embedding provider is configured (see [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration)) - Confirm embedding service API key is properly set 3. **Verify policy storage**: ``` List all policy intents ``` 4. **Test policy matching**: - Create a simple test policy with obvious triggers - Make a deployment request using those exact triggers - Check if AI mentions the policy in recommendations #### Kyverno Policy Deployment Fails **Symptom**: Policy intent and YAML generated successfully but deployment fails **Possible Causes**: - Insufficient cluster permissions - Network connectivity issues - Cluster API unavailable - Kyverno admission controller not ready **Solutions**: 1. **Check cluster permissions**: ```bash kubectl auth can-i create clusterpolicies kubectl auth can-i get clusterpolicies ``` 2. **Verify Kyverno status**: ```bash kubectl get pods -n kyverno kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno ``` 3. 
**Manual deployment**: Use the generated policy file (path provided in error message): ```bash kubectl apply -f /path/to/generated-policy.yaml ``` #### Semantic Search Not Working **Symptom**: Only exact keyword matches work, semantic similarities missed **Possible Causes**: - OpenAI API key missing or invalid - Policies created without embeddings - Embedding service connectivity issues **Solutions**: 1. **Verify OpenAI configuration**: - Confirm valid embedding provider API key is set (see [Embedding Provider Configuration](../setup/deployment.md#embedding-provider-configuration)) - Test API key with simple request - Check API usage limits and billing status 2. **Check embedding status in system status**: ``` Show dot-ai status ``` Look for embedding service availability 3. **Recreate policies if needed**: - Policies created without embedding service may need recreation - New policies automatically include embeddings if service is available ### System Diagnostics #### Check Overall System Health **Command**: ``` What's the current system status? ``` **Expected Response** (healthy system): ```json { "status": "success", "system": { "vectorDB": { "connected": true, "url": "http://localhost:6333", "collectionName": "policies", "policiesCount": 3 }, "embedding": { "available": true, "provider": "openai", "model": "text-embedding-3-small" }, "kyverno": { "installed": true, "policyGenerationReady": true, "version": "1.10.0" }, "anthropic": { "connected": true, "keyConfigured": true } } } ``` #### Verify Policy Storage **Command**: ``` List all policy intents ``` **Check For**: - Policies are being returned successfully - Policy count matches expectations - Search capabilities indicate semantic mode #### Test Policy Integration **Test Method**: 1. Create a test policy with specific, unique triggers 2. Make a deployment request using those exact triggers 3. Verify the policy appears in AI questions with compliance indicators 4. 
Check that generated manifests follow policy requirements ## FAQ ### General Questions **Q: Do policies affect all deployment recommendations?** A: Yes, relevant policies automatically enhance AI recommendations when users deploy resources that match the policy triggers and rationale. **Q: Can I use policy management without Kyverno?** A: Yes! Policy intents provide AI guidance without requiring Kyverno. The cluster enforcement is optional. **Q: How many policies should I create?** A: Start with 3-5 core governance policies covering your most critical requirements (security, resources, compliance). Expand based on organizational needs. **Q: Can multiple policies apply to a single request?** A: Yes! The AI can apply multiple relevant policies to create comprehensive compliance guidance. ### Technical Questions **Q: What happens if Vector DB is unavailable?** A: Policy operations will fail. Deployment recommendations continue working but without policy guidance until Vector DB is restored. **Q: Can I backup my policies?** A: Currently, policies are stored in your Qdrant instance. Back up your Qdrant data to preserve policies. Export functionality is planned for future versions. **Q: How do I update an existing policy?** A: Currently, delete the old policy and create a new one. In-place editing is planned for future versions. **Q: What happens to Kyverno policies when I delete a policy intent?** A: The system asks whether you want to delete the cluster policies too, or preserve them while removing only the AI guidance. ### Security Questions **Q: Are policy intents stored securely?** A: Policy intents are stored in your Vector DB with standard Qdrant security. Use appropriate access controls and network security for your Vector DB instance. **Q: Can policies access sensitive cluster data?** A: No, policies only define validation rules. They don't have access to secret data or cluster state beyond what Kyverno normally validates. 
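To make "policies only define validation rules" concrete, the resource-limits requirement generated earlier in this guide can be expressed in plain Python over a pod spec. This is only a sketch mirroring the logic of the generated CEL expression, not the actual enforcement path (Kyverno evaluates CEL inside the admission controller):

```python
def has_cpu_and_memory_limits(pod_spec: dict) -> bool:
    """Mirror of the CEL check: every container must set non-empty cpu and memory limits."""
    return all(
        bool(c.get("resources", {}).get("limits", {}).get("cpu"))
        and bool(c.get("resources", {}).get("limits", {}).get("memory"))
        for c in pod_spec.get("containers", [])
    )

# Illustrative specs, not real cluster data.
compliant = {"containers": [
    {"name": "app", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}}]}
violating = {"containers": [{"name": "app", "resources": {}}]}

print(has_cpu_and_memory_limits(compliant))   # → True
print(has_cpu_and_memory_limits(violating))   # → False
```

Like CEL's `.all()`, Python's `all()` returns `True` for an empty container list, so a pod with no containers trivially passes.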
**Q: Do generated Kyverno policies follow security best practices?** A: Yes, generated policies use modern Kyverno CEL expressions and follow least-privilege principles. However, always review generated policies before deployment. ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](../tools/overview.md)** - Browse all available tools and features # Quick Start **Get started with DevOps AI Toolkit in minutes — deploy applications, manage policies, and remediate issues using AI-powered Kubernetes workflows.** > **For the easiest setup**, we recommend installing the complete dot-ai stack which includes all components pre-configured. See the [Stack Installation Guide](/docs/stack). > > Continue below if you want to install components individually (for non-Kubernetes setups or granular control over configuration). ## Overview **What it does**: DevOps AI Toolkit provides AI-powered Kubernetes deployment, remediation, policy management, and capability discovery through conversational workflows via [MCP](/docs/mcp) or the [CLI](/docs/cli). **Use when**: You want intelligent Kubernetes operations without memorizing commands, need AI-powered troubleshooting, or want to establish governance policies across your cluster. **📖 Full Guide**: See the [Deployment Guide](setup/deployment.md) for detailed configuration options and the [Tools Overview](tools/overview.md) for complete feature reference. 
## Prerequisites **Works without AI keys:** - ✅ **Shared prompts library** - No API key needed, works with any MCP-enabled coding agent **For AI-powered features (deployment, remediation, patterns, policies, capabilities):** - **AI Model API key** - Required for AI analysis and intelligent recommendations - **Multiple AI models supported** - see [AI Model Configuration](setup/deployment.md#ai-model-configuration) for all options and setup - **Quick setup**: Claude (default) - `export ANTHROPIC_API_KEY=your_key_here` **For Kubernetes deployment recommendations:** - **kubectl** configured with cluster access - Verify cluster access with: `kubectl get nodes` - Should show your cluster nodes without authentication errors **For organizational pattern management:** - **Vector DB service** (Qdrant) for pattern storage and semantic search - **Embedding provider API key** - Required for semantic pattern matching: - OpenAI: `OPENAI_API_KEY` - Google: `GOOGLE_API_KEY` - Amazon Bedrock: AWS credentials via environment variables or `~/.aws/credentials` **For policy management and governance:** - **Vector DB service** (Qdrant) for policy storage and semantic search - **Embedding provider API key** - Required for semantic policy matching (same options as above) - **Optional**: Kyverno installed in cluster for active policy enforcement ## Installation DevOps AI Toolkit runs as a server on Kubernetes. AI agents connect to it via [MCP](/docs/mcp) or the [CLI](/docs/cli). This guide walks through server deployment and MCP client setup. ## Usage **🎯 Recommended: Kubernetes Setup (Full Features)** Production-ready deployment with autonomous capability scanning via controller: ### Step 0: Create a Kubernetes Cluster (Optional) Skip this step if you already have a Kubernetes cluster with an ingress controller. **Prerequisites:** [Install Kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) if you don't have it. 
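Before creating the cluster, you can check the prerequisites above in one pass. A minimal preflight sketch (the tool and variable names follow this guide; extend it for the providers you actually use):

```shell
# Preflight sketch: report missing tools and keys without aborting the shell.
missing=0
for cmd in kubectl helm kind; do
  if command -v "$cmd" > /dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
    missing=$((missing + 1))
  fi
done
# ANTHROPIC_API_KEY is checked as an example; swap in your provider's key.
if [ -z "$ANTHROPIC_API_KEY" ]; then
  echo "ANTHROPIC_API_KEY: not set"
  missing=$((missing + 1))
fi
echo "preflight: $missing item(s) missing"
```

A non-zero count tells you what to install or export before the Helm steps below.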
**Create a Kind cluster with ingress support:** ```bash # Create Kind cluster configuration cat > kind-config.yaml << 'EOF' kind: Cluster apiVersion: kind.x-k8s.io/v1alpha4 nodes: - role: control-plane extraPortMappings: - containerPort: 80 hostPort: 80 protocol: TCP - containerPort: 443 hostPort: 443 protocol: TCP EOF # Create the cluster kind create cluster --name dot-ai --config kind-config.yaml # Install nginx ingress controller for Kind kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml # Wait for ingress controller to be ready kubectl wait --namespace ingress-nginx \ --for=condition=ready pod \ --selector=app.kubernetes.io/component=controller \ --timeout=90s ``` ### Step 1: Set Environment Variables ```bash export ANTHROPIC_API_KEY="sk-ant-api03-your-key-here" export OPENAI_API_KEY="sk-proj-your-openai-key-here" export DOT_AI_AUTH_TOKEN=$(openssl rand -base64 32) # Ingress class - change to match your ingress controller (traefik, haproxy, etc.) export INGRESS_CLASS_NAME="nginx" ``` ### Step 2: Install via Helm ```bash # Set versions from GitHub packages export DOT_AI_VERSION="..." # https://github.com/vfarcic/dot-ai/pkgs/container/dot-ai%2Fcharts%2Fdot-ai export DOT_AI_CONTROLLER_VERSION="..." 
# https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller # Install controller (enables autonomous capability scanning) helm install dot-ai-controller \ oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller:$DOT_AI_CONTROLLER_VERSION \ --namespace dot-ai --create-namespace --wait # Install MCP server helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \ --set secrets.openai.apiKey="$OPENAI_API_KEY" \ --set secrets.auth.token="$DOT_AI_AUTH_TOKEN" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai.127.0.0.1.nip.io" \ --set controller.enabled=true \ --namespace dot-ai --wait ``` ### Step 3: Create MCP Configuration Create the MCP client configuration file with your auth token: ```bash cat > .mcp.json << EOF { "mcpServers": { "dot-ai": { "type": "http", "url": "http://dot-ai.127.0.0.1.nip.io", "headers": { "Authorization": "Bearer $DOT_AI_AUTH_TOKEN" } } } } EOF ``` **Note:** The `$DOT_AI_AUTH_TOKEN` variable is expanded when creating the file. Make sure you're in the same terminal session where you set the environment variables in Step 1. ### Step 4: Start Your MCP Client ```bash claude # or your preferred MCP-enabled AI tool ``` Verify everything works by asking: ``` Show dot-ai status ``` You should see a status report showing all components are healthy. **What you get:** - ✅ **Full Features**: All capabilities including autonomous scanning via controller - ✅ **Production-Ready**: Scalable deployment with proper resource management - ✅ **Automatic Capability Discovery**: Controller watches for CRD changes and scans automatically - ✅ **Team Collaboration**: Shared MCP server accessible by multiple developers **Full Configuration**: See the [Deployment Guide](setup/deployment.md) for advanced configuration options. 
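If the client cannot connect, first rule out a malformed config. This sketch regenerates `.mcp.json` exactly as in Step 3 (substituting a throwaway token when `DOT_AI_AUTH_TOKEN` is unset, for illustration) and validates it with `python3`, which is assumed to be available. An `Authorization` header of just `Bearer ` usually means the token variable was set in a different terminal session:

```shell
# Recreate the MCP config; your real DOT_AI_AUTH_TOKEN is reused if set.
export DOT_AI_AUTH_TOKEN="${DOT_AI_AUTH_TOKEN:-$(openssl rand -base64 32)}"
cat > .mcp.json << EOF
{
  "mcpServers": {
    "dot-ai": {
      "type": "http",
      "url": "http://dot-ai.127.0.0.1.nip.io",
      "headers": {
        "Authorization": "Bearer $DOT_AI_AUTH_TOKEN"
      }
    }
  }
}
EOF

# Validate: the file parses as JSON and the token actually expanded.
python3 - << 'PYEOF'
import json

with open(".mcp.json") as f:
    cfg = json.load(f)

auth = cfg["mcpServers"]["dot-ai"]["headers"]["Authorization"]
assert auth.startswith("Bearer ") and len(auth) > len("Bearer "), \
    "token did not expand into .mcp.json"
print("mcp config: OK")
PYEOF
```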
### Step 5: Start Using Conversational Workflows

Try these example prompts to explore the toolkit:

| What You Want | Example Prompt | Guide |
|---------------|----------------|-------|
| Scan capabilities | Use controller (recommended) or "Scan my cluster for capabilities" | [Capability Management](tools/capability-management.md) |
| Query cluster | "What databases are running?" | [Cluster Query](tools/query.md) |
| Deploy an app | "I want to deploy a web application" | [Recommendation Guide](tools/recommend.md) |
| Operate resources | "Scale my database to 3 replicas" | [Operations Guide](tools/operate.md) |
| Fix issues | "Something is wrong with my database" | [Remediation Guide](tools/remediate.md) |
| Create patterns | "Create a pattern for database deployments" | [Pattern Management](organizational-data/patterns.md) |
| Create policies | "Create a policy requiring resource limits" | [Policy Management](organizational-data/policies.md) |
| Set up project | "Help me set up governance files" | [Project Setup Guide](tools/project-setup.md) |
| Use prompts | `/dot-ai:prd-create` | [Prompts Guide](tools/prompts.md) |

## Next Steps

📖 **[Deployment Guide →](setup/deployment.md)** - Detailed configuration, troubleshooting, and examples

📖 **[Complete Tools & Features Reference →](tools/overview.md)** - Comprehensive guide to all available tools, workflows, and advanced features

# AI Engine Deployment

**Deploy the DevOps AI Toolkit Engine to Kubernetes using the Helm chart — production-ready deployment.**

> **For the easiest setup**, we recommend installing the complete dot-ai stack which includes all components pre-configured. See the [Stack Installation Guide](/docs/stack).
>
> Continue below if you want to install components individually (for granular control over configuration).

## Overview

The DevOps AI Toolkit Engine provides:

1. **Kubernetes Deployment Recommendations** — AI-powered application deployment assistance with enhanced semantic understanding
2.
**Cluster Query** — Natural language interface for querying cluster resources, status, and health 3. **Capability Management** — Discover and store semantic resource capabilities for intelligent recommendation matching 4. **Pattern Management** — Organizational deployment patterns that enhance AI recommendations 5. **Policy Management** — Governance policies that guide users toward compliant configurations with optional Kyverno enforcement 6. **Kubernetes Issue Remediation** — AI-powered root cause analysis and automated remediation 7. **Shared Prompts Library** — Centralized prompt sharing via native slash commands 8. **REST API Gateway** — HTTP endpoints for all toolkit capabilities Access these tools through [MCP clients](/docs/mcp) or the [CLI](/docs/cli). ## What You Get - **Production Kubernetes Deployment** — Scalable deployment with proper resource management - **Integrated Qdrant Database** — Vector database for capability and pattern management - **External Access** — Ingress configuration for team collaboration - **Resource Management** — Proper CPU/memory limits and requests - **Security** — RBAC and ServiceAccount configuration ## Prerequisites - Kubernetes cluster (1.19+) with kubectl access - Helm 3.x installed - AI model API key (default: Anthropic). See [AI Model Configuration](#ai-model-configuration) for available model options. - OpenAI API key (required for vector embeddings) - Ingress controller (any standard controller) ## Quick Start (5 Minutes) ### Step 1: Set Environment Variables Export your API keys and auth token: ```bash # Required export ANTHROPIC_API_KEY="sk-ant-api03-..." export OPENAI_API_KEY="sk-proj-..." export DOT_AI_AUTH_TOKEN=$(openssl rand -base64 32) # Ingress class - change to match your ingress controller (traefik, haproxy, etc.) 
export INGRESS_CLASS_NAME="nginx" ``` ### Step 2: Install the Controller Install the dot-ai-controller to enable autonomous cluster operations: ```bash # Set the controller version from https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller export DOT_AI_CONTROLLER_VERSION="..." # Install controller (includes CRDs for Solution and RemediationPolicy) helm install dot-ai-controller \ oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller:$DOT_AI_CONTROLLER_VERSION \ --namespace dot-ai \ --create-namespace \ --wait ``` The controller provides CRDs for autonomous cluster operations. Create Custom Resources like CapabilityScanConfig, Solution, RemediationPolicy, or ResourceSyncConfig to enable features such as capability scanning, solution tracking, and more. See the [Controller Setup Guide](/docs/controller/setup-guide) for complete details. ### Step 3: Install the Server Install the server using the published Helm chart: ```bash # Set the version from https://github.com/vfarcic/dot-ai/pkgs/container/dot-ai%2Fcharts%2Fdot-ai export DOT_AI_VERSION="..." helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \ --set secrets.openai.apiKey="$OPENAI_API_KEY" \ --set secrets.auth.token="$DOT_AI_AUTH_TOKEN" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai.127.0.0.1.nip.io" \ --set controller.enabled=true \ --namespace dot-ai \ --wait ``` **Notes**: - Replace `dot-ai.127.0.0.1.nip.io` with your desired hostname for external access. - For enhanced security, create a secret named `dot-ai-secrets` with keys `anthropic-api-key`, `openai-api-key`, and `auth-token` instead of using `--set` arguments. - For all available configuration options, see the [Helm values file](https://github.com/vfarcic/dot-ai/blob/main/charts/values.yaml). 
- **Global annotations**: Add annotations to all Kubernetes resources using `annotations` in your values file (e.g., for [Reloader](https://github.com/stakater/Reloader) integration: `reloader.stakater.com/auto: "true"`). - **Custom endpoints** (OpenRouter, self-hosted): See [Custom Endpoint Configuration](#custom-endpoint-configuration) for environment variables, then use `--set` or values file with `ai.customEndpoint.enabled=true` and `ai.customEndpoint.baseURL`. - **Observability/Tracing**: Add tracing environment variables via `extraEnv` in your values file. See [Observability Guide](../operations/observability.md) for complete configuration. - **User-Defined Prompts**: Load custom prompts from your git repository via `extraEnv`. See [User-Defined Prompts](../tools/prompts.md#user-defined-prompts) for configuration. ### Step 4: Connect a Client With the server running, connect using your preferred access method: - **[MCP Client Setup](/docs/mcp)** — Connect via MCP protocol from Claude Code, Cursor, or other MCP clients - **[CLI](/docs/cli)** — Use the command-line interface for terminal and CI/CD pipelines ## Capability Scanning for AI Recommendations Many MCP tools depend on **capability data** to function: - **recommend**: Uses capabilities to find resources matching your deployment intent - **manageOrgData** (patterns): References capabilities when applying organizational patterns - **manageOrgData** (policies): Validates resources against stored capability metadata Without capability data, these tools may not work or will produce poor results. ### Enabling Capability Scanning Create a `CapabilityScanConfig` CR to enable autonomous capability discovery. The controller watches for CRD changes and automatically scans new resources. See the [Capability Scan Guide](/docs/controller/capability-scan-guide) for setup instructions. ## AI Model Configuration The DevOps AI Toolkit supports multiple AI models. 
Choose your model by setting the `AI_PROVIDER` environment variable. ### Model Requirements All AI models must meet these minimum requirements: - **Context window**: 200K+ tokens (some tools like capability scanning use large context) - **Output tokens**: 8K+ tokens (for YAML generation and policy creation) - **Function calling**: Required for MCP tool interactions ### Available Models | Provider | Model | AI_PROVIDER | API Key Required | Recommended | |----------|-------|-------------|------------------|-------------| | **Anthropic** | Claude Haiku 4.5 | `anthropic_haiku` | `ANTHROPIC_API_KEY` | Yes | | **Anthropic** | Claude Opus 4.6 | `anthropic_opus` | `ANTHROPIC_API_KEY` | Yes | | **Anthropic** | Claude Sonnet 4.6 | `anthropic` | `ANTHROPIC_API_KEY` | Yes | | **AWS** | Amazon Bedrock | `amazon_bedrock` | AWS credentials ([see setup](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)) | Yes | | **Google** | Gemini 3.1 Pro | `google` | `GOOGLE_GENERATIVE_AI_API_KEY` | Yes (might be slow) | | **Google** | Gemini 3 Flash | `google_flash` | `GOOGLE_GENERATIVE_AI_API_KEY` | Yes (preview) | | **Host** | Host Environment LLM | `host` | None (uses host's AI) | Yes (if supported) | | **Moonshot AI** | Kimi K2 | `kimi` | `MOONSHOT_API_KEY` | Yes | | **Moonshot AI** | Kimi K2 Thinking | `kimi_thinking` | `MOONSHOT_API_KEY` | Yes (might be slow) | | **OpenAI** | GPT-5.1 Codex | `openai` | `OPENAI_API_KEY` | No * | | **xAI** | Grok-4 | `xai` | `XAI_API_KEY` | No * | \* **Note**: These models may not perform as well as other providers for complex DevOps reasoning tasks. 
### Models Not Supported | Provider | Model | Reason | |----------|-------|--------| | **DeepSeek** | DeepSeek V3.2 (`deepseek-chat`) | 128K context limit insufficient for heavy workflows | | **DeepSeek** | DeepSeek R1 (`deepseek-reasoner`) | 64K context limit insufficient for most workflows | **Why DeepSeek is not supported**: Integration testing revealed that DeepSeek's context window limitations (128K for V3.2, 64K for R1) cause failures in context-heavy operations like Kyverno policy generation, which can exceed 130K tokens. The toolkit requires 200K+ context for reliable operation across all features. ### Helm Configuration Set AI provider in your Helm values: ```yaml ai: provider: anthropic_haiku # or anthropic, anthropic_opus, google, etc. secrets: anthropic: apiKey: "your-api-key" ``` Or via `--set`: ```bash helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set ai.provider=anthropic_haiku \ --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \ # ... other settings ``` **AI Keys Are Optional**: The MCP server starts successfully without AI API keys. Tools like **Shared Prompts Library** and **REST API Gateway** work without AI. AI-powered tools (deployment recommendations, remediation, pattern/policy management, capability scanning) require AI keys (unless using the `host` provider) and will show helpful error messages when accessed without configuration. ## Embedding Provider Configuration The DevOps AI Toolkit supports multiple embedding providers for semantic search capabilities in pattern management, capability discovery, and policy matching. 
### Available Embedding Providers | Provider | EMBEDDINGS_PROVIDER | Model | Dimensions | API Key Required | |----------|-------------------|-------|------------|------------------| | **Amazon Bedrock** | `amazon_bedrock` | `amazon.titan-embed-text-v2:0` | 1024 | AWS credentials | | **Google** | `google` | `text-embedding-004` (deprecated) | 768 | `GOOGLE_API_KEY` | | **Google** | `google` | `gemini-embedding-001` | 768 | `GOOGLE_API_KEY` | | **OpenAI** | `openai` (default) | `text-embedding-3-small` | 1536 | `OPENAI_API_KEY` | ### Helm Configuration Set embedding provider via `extraEnv` in your values file: ```yaml extraEnv: - name: EMBEDDINGS_PROVIDER value: "google" - name: GOOGLE_API_KEY valueFrom: secretKeyRef: name: dot-ai-secrets key: google-api-key ``` **Notes:** - **Same Provider**: If using the same provider for both AI models and embeddings (e.g., `AI_PROVIDER=google` and `EMBEDDINGS_PROVIDER=google`), you only need to set one API key - **Mixed Providers**: You can use different providers for AI models and embeddings (e.g., `AI_PROVIDER=anthropic` with `EMBEDDINGS_PROVIDER=google`) - **Embedding Support**: Not all AI model providers support embeddings. Anthropic does not provide embeddings; use OpenAI, Google, or Amazon Bedrock for embeddings - **Google Deprecation**: `text-embedding-004` will be discontinued on January 14, 2026. Use `gemini-embedding-001` for new deployments. When switching models, you must delete and recreate all embeddings (patterns, capabilities, policies) as vectors from different models are not compatible ## Custom Endpoint Configuration You can configure custom OpenAI-compatible endpoints for AI models. This enables using alternative providers like OpenRouter, self-hosted models, or air-gapped deployments. 
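The `secretKeyRef` above assumes a Secret holding the embedding key. If you prefer managing it declaratively, a minimal sketch (the name `dot-ai-secrets` and key `google-api-key` match the reference above, and this guide's install notes describe the same Secret holding `anthropic-api-key`, `openai-api-key`, and `auth-token`; the value below is a placeholder):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: dot-ai-secrets
  namespace: dot-ai
type: Opaque
stringData:
  google-api-key: "REPLACE_WITH_YOUR_GOOGLE_API_KEY"  # placeholder value
```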
### In-Cluster Ollama Example Deploy with a self-hosted Ollama service running in the same Kubernetes cluster: **Create a `values.yaml` file:** ```yaml ai: provider: openai model: "llama3.3:70b" # Your self-hosted model customEndpoint: enabled: true baseURL: "http://ollama-service.default.svc.cluster.local:11434/v1" secrets: customLlm: apiKey: "ollama" # Ollama doesn't require authentication openai: apiKey: "your-openai-key" # Still needed for vector embeddings ``` **Install with custom values:** ```bash helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --values values.yaml \ --create-namespace \ --namespace dot-ai \ --wait ``` ### Other Self-Hosted Options **vLLM (Self-Hosted):** ```yaml ai: provider: openai model: "meta-llama/Llama-3.1-70B-Instruct" customEndpoint: enabled: true baseURL: "http://vllm-service:8000/v1" secrets: customLlm: apiKey: "dummy" # vLLM may not require authentication openai: apiKey: "your-openai-key" ``` **LocalAI (Self-Hosted):** ```yaml ai: provider: openai model: "your-model-name" customEndpoint: enabled: true baseURL: "http://localai-service:8080/v1" secrets: customLlm: apiKey: "dummy" openai: apiKey: "your-openai-key" ``` ### OpenRouter Example OpenRouter provides access to 100+ LLM models from multiple providers: ```yaml ai: provider: openai model: "anthropic/claude-3.5-sonnet" customEndpoint: enabled: true baseURL: "https://openrouter.ai/api/v1" secrets: customLlm: apiKey: "sk-or-v1-your-key-here" openai: apiKey: "your-openai-key" # Still needed for embeddings ``` **Note**: OpenRouter does not support embedding models. Use OpenAI, Google, or Amazon Bedrock for embeddings. 
Get your OpenRouter API key at [https://openrouter.ai/](https://openrouter.ai/) ### Important Notes - **Context window**: 200K+ tokens recommended - **Output tokens**: 8K+ tokens minimum - **Function calling**: Must support OpenAI-compatible function calling **Testing Status:** - Validated with OpenRouter (alternative SaaS provider) - Not yet tested with self-hosted Ollama, vLLM, or LocalAI - We need your help testing! Report results in [issue #193](https://github.com/vfarcic/dot-ai/issues/193) **Notes:** - OpenAI API key is still required for vector embeddings (Qdrant operations) - If model requirements are too high for your setup, please open an issue - Configuration examples are based on common patterns but not yet validated ## TLS Configuration To enable HTTPS, add these values (requires [cert-manager](https://cert-manager.io/) with a ClusterIssuer): ```yaml ingress: tls: enabled: true clusterIssuer: letsencrypt # Your ClusterIssuer name ``` Then update your `.mcp.json` URL to use `https://`. ## Web UI Visualization Enable rich visualizations of query results by connecting to a [DevOps AI Web UI](https://github.com/vfarcic/dot-ai-ui) instance. When configured, the query tool includes a `visualizationUrl` field in responses that opens interactive visualizations (resource topology, relationships, health status) in your browser. ### Configuration Add the Web UI base URL to your Helm values: ```yaml webUI: baseUrl: "https://dot-ai-ui.example.com" # Your Web UI instance URL ``` Or via `--set`: ```bash helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set webUI.baseUrl="https://dot-ai-ui.example.com" \ # ... other settings ``` ### Feature Toggle Behavior - **Not configured** (default): Query responses contain only text summaries. No `visualizationUrl` field is included. - **Configured**: Query responses include a `visualizationUrl` field (format: `{baseUrl}/v/{sessionId}`) that opens the visualization in the Web UI. 
### Example Query Response When `webUI.baseUrl` is configured, query responses include: ```text **View visualization**: https://dot-ai-ui.example.com/v/abc123-session-id ``` This URL opens an interactive visualization of the query results in the Web UI. ## Gateway API (Alternative to Ingress) For Kubernetes 1.26+, you can use **Gateway API v1** for advanced traffic management with role-oriented design (platform teams manage Gateways, app teams create routes). ### When to Use **Use Gateway API when:** - Running Kubernetes 1.26+ with Gateway API support - Need advanced routing (weighted traffic, header-based routing) - Prefer separation of infrastructure and application concerns **Use Ingress when:** - Running Kubernetes < 1.26 - Simpler requirements met by Ingress features ### Prerequisites - Kubernetes 1.26+ cluster - Gateway API CRDs installed: `kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml` - Gateway controller running (Istio, Envoy Gateway, Kong, etc.) 
- Existing Gateway resource created by platform team (reference pattern) ### Quick Start (Reference Pattern - RECOMMENDED) Reference an existing platform-managed Gateway: ```bash helm install dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \ --set secrets.openai.apiKey="$OPENAI_API_KEY" \ --set secrets.auth.token="$DOT_AI_AUTH_TOKEN" \ --set ingress.enabled=false \ --set gateway.name="cluster-gateway" \ --set gateway.namespace="gateway-system" \ --namespace dot-ai \ --wait ``` ### Configuration Reference ```yaml # Reference pattern (RECOMMENDED) gateway: name: "cluster-gateway" # Existing Gateway name namespace: "gateway-system" # Gateway namespace (optional) timeouts: request: "3600s" # SSE streaming timeout backendRequest: "3600s" # Creation pattern (development/testing only) gateway: create: true # Create Gateway (NOT for production) className: "istio" # GatewayClass name ``` ### Complete Guide See **[Gateway API Deployment Guide](gateway-api.md)** for: - Platform team Gateway setup (HTTP and HTTPS) - Application team deployment steps - Cross-namespace access (ReferenceGrant) - Development/testing creation pattern - Troubleshooting and verification - Migration from Ingress ## Next Steps Once the server is running: ### 1. Explore Tools - **[Tools Overview](../tools/overview.md)** — Complete guide to all available tools, how they work together, and recommended usage flow ### 2. Enable Observability (Optional) - **[Observability Guide](../operations/observability.md)** — Distributed tracing with OpenTelemetry for debugging workflows, measuring AI performance, and monitoring Kubernetes operations ### 3. 
Production Considerations - Consider backup strategies for vector database content (organizational patterns and capabilities) - Review [TLS Configuration](#tls-configuration) for HTTPS ## Support - **Bug Reports**: [GitHub Issues](https://github.com/vfarcic/dot-ai/issues) # Gateway API Deployment Guide **Deploy DevOps AI Toolkit using Kubernetes Gateway API v1 with production-ready reference pattern or development/testing creation pattern.** ## When to Use Gateway API **Use Gateway API when:** - Running Kubernetes 1.26+ with Gateway API support (GKE Autopilot, EKS, AKS, etc.) - Need advanced routing (weighted traffic, header-based routing, mirroring) - Prefer role-oriented design (platform team manages Gateways, app teams create routes) - Want standardized configuration across different Gateway implementations **Use Ingress instead when:** - Running Kubernetes < 1.26 - Gateway API CRDs not available - Simpler requirements met by Ingress features **Learn more:** [Gateway API official documentation](https://gateway-api.sigs.k8s.io/) ## Prerequisites **All deployments:** - Kubernetes 1.26+ cluster - Gateway API CRDs v1.2+ installed - Gateway controller running (Istio, Envoy Gateway, Kong, etc.) 
- Helm 3.x **Reference pattern (RECOMMENDED):** - Existing Gateway resource created by platform team - GatewayClass resource available - Optional: ReferenceGrant for cross-namespace access **Creation pattern (development/testing only):** - GatewayClass resource available ### Install Gateway API CRDs ```bash kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.1/standard-install.yaml ``` **Output:** ``` customresourcedefinition.apiextensions.k8s.io/gatewayclasses.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/gateways.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/httproutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/referencegrants.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/grpcroutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/tcproutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/udproutes.gateway.networking.k8s.io created customresourcedefinition.apiextensions.k8s.io/backendtlspolicies.gateway.networking.k8s.io created ``` Verify installation: ```bash kubectl get crd gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io ``` **Output:** ``` NAME CREATED AT gateways.gateway.networking.k8s.io 2025-12-22T09:37:24Z httproutes.gateway.networking.k8s.io 2025-12-22T09:37:24Z ``` Check available GatewayClasses: ```bash kubectl get gatewayclass ``` **Output:** ``` NAME CONTROLLER ACCEPTED AGE kgateway kgateway.dev/kgateway True 22s ``` ## Reference Pattern (HTTP) - Development/Testing Only > ⚠️ **SECURITY WARNING**: This HTTP-only configuration transmits all traffic unencrypted, including authentication tokens and API keys. **DO NOT use in production.** Use the [HTTPS pattern](#reference-pattern-https---recommended) for production deployments. 
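Two resources cooperate in this pattern: a platform-managed Gateway (created in Step 1 below) and an HTTPRoute that the Helm chart renders on your behalf. For orientation, here is a sketch of a route roughly equivalent to what the chart produces (field names follow Gateway API v1; the chart's actual template and defaults may differ):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: dot-ai
  namespace: dot-ai
spec:
  parentRefs:
    - name: cluster-gateway      # gateway.name from the Helm values
      namespace: gateway-system  # gateway.namespace from the Helm values
  hostnames:
    - dot-ai.example.com
  rules:
    - backendRefs:
        - name: dot-ai
          port: 8080
      timeouts:
        request: 3600s           # long timeouts for SSE streaming
        backendRequest: 3600s
```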
The reference pattern follows Gateway API best practices where platform teams manage shared Gateway infrastructure and application teams create HTTPRoutes.

### Step 1: Platform Team Creates Gateway (ONCE)

Create a shared Gateway in a dedicated namespace:

> **Note:** This HTTP-only example is for development/testing. For production, skip to the [HTTPS pattern](#reference-pattern-https---recommended).

```bash
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cluster-gateway
  namespace: gateway-system
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All # Allow routes from all namespaces
EOF
```

**Output:**

```
gateway.gateway.networking.k8s.io/cluster-gateway created
```

Wait for Gateway to be Programmed:

```bash
kubectl wait --for=condition=Programmed gateway/cluster-gateway -n gateway-system --timeout=300s
```

**Output:**

```
gateway.gateway.networking.k8s.io/cluster-gateway condition met
```

Verify Gateway status:

```bash
kubectl describe gateway cluster-gateway -n gateway-system
```

**Output:**

```
Name:         cluster-gateway
Namespace:    gateway-system
Labels:       <none>
Annotations:  <none>
API Version:  gateway.networking.k8s.io/v1
Kind:         Gateway
Metadata:
  Creation Timestamp:  2025-12-22T09:38:15Z
Spec:
  Gateway Class Name:  kgateway
  Listeners:
    Allowed Routes:
      Namespaces:
        From:  All
    Name:      http
    Port:      80
    Protocol:  HTTP
Status:
  Addresses:
    Type:   IPAddress
    Value:  XXX.XXX.XXX.XXX
  Conditions:
    Last Transition Time:  2025-12-22T09:38:17Z
    Message:               Resource is valid
    Observed Generation:   1
    Reason:                Accepted
    Status:                True
    Type:                  Accepted
    Last Transition Time:  2025-12-22T09:38:17Z
    Message:               Listeners are ready
    Observed Generation:   1
    Reason:                Programmed
    Status:                True
    Type:                  Programmed
  Listeners:
    Attached Routes:  0
    Conditions:
      Last Transition Time:  2025-12-22T09:38:17Z
      Message:               Listener is ready
      Reason:                Ready
Status: True Type: Ready Name: http Supported Kinds: Group: gateway.networking.k8s.io Kind: HTTPRoute Group: gateway.networking.k8s.io Kind: GRPCRoute ``` ### Step 2: Application Team Deploys dot-ai Create Helm values file (`gateway-http-values.yaml`): ```yaml # Deployment method deployment: method: standard # Disable traditional Ingress ingress: enabled: false # Reference existing Gateway (RECOMMENDED) gateway: name: "cluster-gateway" namespace: "gateway-system" timeouts: request: "3600s" backendRequest: "3600s" # Secrets configuration secrets: name: dot-ai-secrets auth: token: "" # SET THIS: openssl rand -base64 32 anthropic: apiKey: "" # SET THIS: sk-ant-api03-... openai: apiKey: "" # SET THIS: sk-proj-... # AI provider configuration ai: provider: anthropic # Qdrant vector database qdrant: enabled: true ``` Deploy with Helm: ```bash helm install dot-ai \ oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:0.168.0 \ --namespace dot-ai \ --create-namespace \ -f gateway-http-values.yaml \ --wait ``` **Output:** ``` NAME: dot-ai LAST DEPLOYED: Sun Dec 22 09:39:45 2025 NAMESPACE: dot-ai STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: DevOps AI Toolkit (dot-ai) has been deployed successfully. 
Access the service:
- HTTP: kubectl port-forward -n dot-ai svc/dot-ai 8080:8080
- Gateway API: Configure your DNS to point to the Gateway IP (XXX.XXX.XXX.XXX)

For more information, visit: https://github.com/vfarcic/dot-ai
```

Or using `--set` flags:

```bash
export DOT_AI_AUTH_TOKEN=$(openssl rand -base64 32)

helm install dot-ai \
  oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:0.168.0 \
  --namespace dot-ai \
  --create-namespace \
  --set gateway.name=cluster-gateway \
  --set gateway.namespace=gateway-system \
  --set secrets.auth.token="$DOT_AI_AUTH_TOKEN" \
  --set secrets.anthropic.apiKey="$ANTHROPIC_API_KEY" \
  --set secrets.openai.apiKey="$OPENAI_API_KEY" \
  --wait
```

**Output:**

```
NAME: dot-ai
LAST DEPLOYED: Sun Dec 22 09:39:45 2025
NAMESPACE: dot-ai
STATUS: deployed
REVISION: 1
```

Verify deployment:

```bash
kubectl get pods -n dot-ai
```

**Output:**

```
NAME                     READY   STATUS    RESTARTS   AGE
dot-ai-7c5f8d9b4-xyz9w   1/1     Running   0          30s
qdrant-0                 1/1     Running   0          25s
```

## Reference Pattern (HTTPS) - RECOMMENDED

For production HTTPS deployments with cert-manager and wildcard certificates.

### Step 1: Platform Team Creates Gateway with HTTPS (ONCE)

Create a Gateway with HTTP and HTTPS listeners. The `wildcard-tls` Secret referenced below must already exist, typically as a wildcard certificate issued by cert-manager:

```yaml
# Gateway with HTTP and HTTPS listeners
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cluster-gateway
  namespace: gateway-system
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: wildcard-tls
      allowedRoutes:
        namespaces:
          from: All
```

Apply and wait:

```bash
# Save the manifest above to a file (e.g. gateway-https.yaml), then:
kubectl apply -f gateway-https.yaml
kubectl wait --for=condition=Programmed gateway/cluster-gateway -n gateway-system --timeout=300s
```

> **Note:** GKE Gateway Controller and potentially other implementations do not support HTTPRoute `timeouts`.
> If you encounter the error `GWCER104 "Timeouts are not supported"`, set timeouts to empty strings and configure them via [GCPBackendPolicy](https://cloud.google.com/kubernetes-engine/docs/how-to/configure-gateway-resources#configure_timeout) instead:
> ```yaml
> gateway:
>   timeouts:
>     request: ""
>     backendRequest: ""
> ```

### Creation Pattern Values (Development/Testing Only)

```yaml
gateway:
  create: true          # Create Gateway (NOT for production)
  className: "istio"    # GatewayClass name (required)
  annotations: {}       # Gateway annotations
  listeners:
    http:
      enabled: true     # HTTP listener port 80
      hostname: ""      # Optional hostname
    https:
      enabled: false    # HTTPS listener port 443
      hostname: ""      # Optional hostname
      secretName: ""    # TLS secret name
  timeouts:
    request: "3600s"
    backendRequest: "3600s"
```

## Cross-Namespace Gateway Access (ReferenceGrant)

If the Gateway uses `allowedRoutes.namespaces.from: Same`, create a ReferenceGrant to allow cross-namespace access.

**When is ReferenceGrant needed?**

```yaml
# NO ReferenceGrant needed
listeners:
- name: http
  allowedRoutes:
    namespaces:
      from: All   # Allows all namespaces

# ReferenceGrant REQUIRED
listeners:
- name: http
  allowedRoutes:
    namespaces:
      from: Same  # Only same namespace (requires ReferenceGrant for cross-namespace)
```

### ReferenceGrant Example

Platform team creates in Gateway namespace:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-dot-ai-routes
  namespace: gateway-system  # Gateway namespace
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: dot-ai        # Application namespace
  to:
  - group: gateway.networking.k8s.io
    kind: Gateway
    # Optional: name: cluster-gateway  # Restrict to specific Gateway
```

Apply:

```bash
kubectl apply -f - <<EOF
# ... (ReferenceGrant manifest shown above) ...
EOF
```

## MCP Client Configuration

> ⚠️ **WARNING**: HTTP transmits credentials in plaintext. Use only for local development.

```json
{
  "mcpServers": {
    "dot-ai": {
      "url": "http://dot-ai.example.com",
      "transport": {
        "type": "http"
      }
    }
  }
}
```

## Verification Steps

### 1.
Check Gateway Status

```bash
# Reference mode
kubectl get gateway cluster-gateway -n gateway-system
```

**Output:**

```text
NAME              CLASS      ADDRESS           PROGRAMMED   AGE
cluster-gateway   kgateway   XXX.XXX.XXX.XXX   True         5m22s
```

Verify detailed status:

```bash
kubectl describe gateway cluster-gateway -n gateway-system
```

Look for the `Programmed` condition = `True`.

### 2. Check HTTPRoute

```bash
kubectl get httproute -n dot-ai
```

**Output:**

```text
NAME          HOSTNAMES              PARENTREFS        AGE
dot-ai-http   [dot-ai.example.com]   cluster-gateway   2m
```

Verify detailed status:

```bash
kubectl describe httproute -n dot-ai
```

**Output (excerpt):**

```text
Status:
  Parents:
  - Conditions:
    - Type: Accepted
      Status: "True"
      Reason: Accepted
    - Type: ResolvedRefs
      Status: "True"
      Reason: ResolvedRefs
    Controller Name: kgateway.dev/kgateway
    Parent Ref:
      Group: gateway.networking.k8s.io
      Kind: Gateway
      Name: cluster-gateway
      Namespace: gateway-system
```

### 3. Check Backend Services

```bash
kubectl get svc,pod -n dot-ai
```

**Output:**

```text
NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/dot-ai   ClusterIP   XXX.XXX.XXX.XX   <none>        8080/TCP   2m

NAME                         READY   STATUS    RESTARTS   AGE
pod/dot-ai-7c5f8d9b4-xyz9w   1/1     Running   0          2m
pod/qdrant-0                 1/1     Running   0          2m
```

### 4.
Verify Gateway IP Address ```bash # Get Gateway IP kubectl get gateway cluster-gateway -n gateway-system -o jsonpath='{.status.addresses[0].value}' ``` **Output:** ```text XXX.XXX.XXX.XXX ``` ## Troubleshooting ### Gateway Not Getting IP Address ```bash kubectl describe gateway cluster-gateway -n gateway-system ``` **Common issues:** - GatewayClass not found → Check `kubectl get gatewayclass` - Gateway controller not running → Check controller pods - Invalid listener configuration → Review Gateway spec **Solution:** ```bash # Verify GatewayClass exists and is accepted kubectl get gatewayclass -o yaml # Check Gateway controller logs kubectl logs -n istio-system -l app=istio-ingressgateway # Example for Istio ``` ### HTTPRoute Not Routing Traffic ```bash kubectl describe httproute -n dot-ai ``` **Common issues:** - Gateway name mismatch - Cross-namespace without ReferenceGrant - Backend Service doesn't exist **Solution:** ```bash # Verify parentRef matches Gateway kubectl get httproute -n dot-ai -o yaml | grep -A 5 parentRefs # Check Service exists kubectl get svc -n dot-ai # For cross-namespace, verify ReferenceGrant kubectl get referencegrant -n gateway-system ``` ### Cross-Namespace Access Denied ```bash kubectl get referencegrant -n gateway-system kubectl describe referencegrant -n gateway-system ``` **Common issues:** - ReferenceGrant missing - ReferenceGrant doesn't allow your namespace - Gateway uses `from: Same` without ReferenceGrant **Solution:** Create ReferenceGrant as shown in [Cross-Namespace Gateway Access](#cross-namespace-gateway-access-referencegrant) section. ### Gateway Name Conflict (Creation Mode) ```bash kubectl get gateway,deploy -n dot-ai ``` **Issue:** Gateway name conflicts with application Deployment name. **Solution:** The chart automatically adds `-http` suffix to prevent this. 
Verify:

```bash
kubectl get gateway -n dot-ai
# Expected: dot-ai-http (or `<release-name>-http`)
```

### DNS Record Not Created (external-dns)

```bash
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns
```

**Common issues:**
- external-dns not running
- Missing annotations on Gateway
- DNS provider credentials missing

**Solution:**

```bash
# Verify Gateway has external-dns annotation
kubectl get gateway -n dot-ai -o yaml | grep external-dns

# Check DNS resolution
dig +short dot-ai.example.com
```

### Connection Timeout

**Issue:** HTTP requests time out or SSE connections fail.

**Solution:** Verify timeout configuration:

```bash
kubectl get httproute -n dot-ai -o yaml | grep -A 5 timeouts
```

Expected:

```yaml
timeouts:
  request: 3600s
  backendRequest: 3600s
```

## Cost Comparison

| Mode | Gateway per App | Load Balancer Cost | 10 Apps Total Cost |
|------|-----------------|--------------------|--------------------|
| **Reference (RECOMMENDED)** | No (shared) | $18-30/month | $18-30/month |
| **Creation (dev/test)** | Yes | $18-30/month each | $180-300/month |

**Recommendation:** Use the reference pattern in production to save costs and follow Gateway API best practices.

## Migration from Ingress

Migrate from traditional Ingress to Gateway API:

### Step 1: Note Current Configuration

```bash
# Get current Ingress hostname
kubectl get ingress -n dot-ai -o yaml | grep host

# Get TLS configuration
kubectl get ingress -n dot-ai -o yaml | grep -A 5 tls
```

### Step 2: Platform Team Creates Gateway

Create a Gateway with the same hostname as the current Ingress:

```bash
kubectl apply -f - <<EOF
# ... (Gateway manifest using the hostname noted in Step 1) ...
EOF
```

> **Note**: For Git repositories, the [dot-ai-controller](/docs/controller/knowledge-source-guide) handles document lifecycle automatically - it detects changes, deletes outdated chunks, and re-ingests updated content.
**Example** (manual deletion): ```text Delete the recommendation guide from the knowledge base (URI: https://github.com/vfarcic/dot-ai/blob/main/docs/guides/mcp-recommendation-guide.md) ``` **Response**: ```text Done. Deleted 33 chunks for the recommendation guide from the knowledge base. ``` **What happened behind the scenes:** - All chunks matching the URI were found in the knowledge base - Each chunk was deleted from the vector database - The document is no longer searchable ## Troubleshooting ### Check Knowledge Base Status To verify the knowledge base is working and see how many documents are stored: ```text Show dot-ai status ``` Look for the `knowledgeBase` collection in the response: ```text Collections | Collection | Status | Documents | |---------------|----------|-----------| | knowledgeBase | ✅ Ready | 156 | ``` If the collection shows `⚠️ Not created`, no documents have been ingested yet. If it shows an error, check your Vector DB connection and embedding provider configuration. ## FAQ **Q: How do I update a document?** A: Simply re-ingest it. The system automatically deletes existing chunks for the URI before storing new ones. **Q: How are documents chunked?** A: Documents are split into chunks of approximately 1000 characters with 200 character overlap to preserve context across chunk boundaries. **Q: Can I search across multiple repositories?** A: Yes. All ingested documents are stored in the same knowledge base regardless of source. Search returns results from all sources, with URIs indicating the origin. 
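The chunking behavior described in the FAQ can be sketched in a few lines of bash. This is an illustrative sketch only - the `chunk` helper is hypothetical and not part of dot-ai - but it shows how a fixed chunk size combined with an overlap walks through a document so that adjacent chunks share context:

```shell
#!/usr/bin/env bash
# Illustrative sketch of fixed-size chunking with overlap (~1000-character
# chunks, 200-character overlap, per the FAQ above). Not part of dot-ai.
chunk() {
  local text="$1" size="${2:-1000}" overlap="${3:-200}"
  local step=$((size - overlap)) start=0 index=0
  while (( start < ${#text} )); do
    local piece="${text:start:size}"
    # A real implementation would embed and store each piece in the vector
    # database; here we just report where each chunk falls.
    printf 'chunk %d: offset=%d length=%d\n' "$index" "$start" "${#piece}"
    start=$((start + step))
    index=$((index + 1))
  done
}

# A 2,500-character document yields four overlapping chunks.
chunk "$(printf 'x%.0s' {1..2500})" 1000 200
```

With a 2,500-character input, the sketch reports four chunks at offsets 0, 800, 1600, and 2400; the last two are shorter because the document ends before a full 1,000 characters.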
## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Controller Knowledge Source Guide](/docs/controller/knowledge-source-guide)** - Automated Git repository ingestion - **[Tools and Features Overview](overview.md)** - Browse all available tools # Operate Guide **Complete guide for AI-powered Kubernetes operations with the DevOps AI Toolkit.** ## Using via Web UI These tools are also available through the [Web Dashboard](/docs/ui). ## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md) to set up: - DevOps AI Toolkit deployed - AI model API key configured (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for supported models and setup) - `KUBECONFIG` pointing to your Kubernetes cluster **Required - Capability Management:** - Vector DB service (Qdrant) for capability storage - Cluster capabilities discovered via [Capability Management Guide](capability-management.md) - **Note**: Operations will fail without capabilities - the system requires semantic understanding of your cluster resources **Optional - Enhanced with Organizational Context:** - Organizational patterns created via [Pattern Management Guide](../organizational-data/patterns.md) - Policy intents created via [Policy Management Guide](../organizational-data/policies.md) - When configured, operations automatically follow organizational best practices and governance rules ## Overview **What it does**: Provides AI-powered Day 2 operations for any Kubernetes resources through natural language intents. Updates, scales, enhances, and manages workloads, databases, infrastructure, and cloud resources with cluster-aware recommendations and organizational governance. **Use when**: You need to perform operational changes on deployed resources - applications, databases, storage, AWS/Azure/GCP resources via operators, networking, or any Kubernetes-managed infrastructure. 
📖 **Full Guide**: This document covers the complete operations workflow with detailed examples and behind-the-scenes explanations. ### Key Features The DevOps AI Toolkit operate feature provides: - **Natural language operations** - Describe what you want, AI figures out how to do it - **Cluster-aware decisions** - Leverages installed operators and custom resources automatically - **Helm release support** - Works with Helm-managed applications (upgrades, rollbacks, uninstalls) in addition to raw Kubernetes resources - **Pattern-driven operations** - Applies organizational best practices to every change - **Policy enforcement** - Validates operations against governance rules before execution - **Dry-run validation** - All changes tested before proposing to ensure they'll work - **Safe execution** - Exact approved commands executed with comprehensive validation - **Iterative validation** - Verifies operations completed successfully with AI analysis ### How AI-Driven Operations Work **Traditional Problem**: Manual operations require deep Kubernetes expertise, error-prone YAML editing, and knowing which operators/resources are available. **AI Operations Solution**: Natural language intents analyzed by AI with full cluster awareness and organizational context. **Operations Flow**: 1. **Intent Analysis**: AI understands your operational goal 2. **Cluster Investigation**: AI inspects current state and discovers relevant resources 3. **Context Integration**: Applies relevant organizational patterns, policies, and cluster capabilities 4. **Solution Design**: Generates operational plan that satisfies your intent 5. **Dry-Run Validation**: Tests all changes to ensure they'll succeed 6. **User Approval**: Review proposed changes with full transparency 7. **Execution**: Approved commands executed exactly as shown 8. 
**Validation**: AI verifies operation completed successfully ## Complete Workflow Example ### Pattern-Driven Scaling with HPA Creation This example demonstrates how organizational patterns influence operational decisions. We'll scale a deployment twice: 1. **Without pattern** - AI recommends direct replica scaling 2. **With pattern** - Same intent, but AI creates HPA instead **Scenario Setup:** ```bash # Deployment exists: demo-api in operate-demo namespace # Current state: 2 replicas, nginx:1.21, with resource requests/limits ``` #### Step 1: Baseline Operation (No Pattern) First, let's see what AI recommends without any organizational patterns: ```md User: scale demo-api deployment in operate-demo namespace to 4 replicas ``` ~~~md Agent: Let me analyze the current state and propose a scaling operation. [Uses mcp__dot-ai__operate with intent "scale demo-api deployment in operate-demo namespace to 4 replicas"] **Analysis Complete:** Scaling demo-api deployment from 2 to 4 replicas in the operate-demo namespace. This is a simple, low-risk scaling operation that will increase application availability and capacity. 
**Current State:** - Deployment: demo-api (operate-demo namespace) - Current replicas: 2/2 running - Image: nginx:1.21 - Strategy: RollingUpdate (25% max unavailable/surge) - Resources: 100m CPU/128Mi memory requests, 200m CPU/256Mi memory limits **Proposed Changes:** - **Update Deployment** demo-api: replicas: 4 (increased from 2) - Rationale: Directly satisfies the operational intent to scale deployment to 4 replicas **Commands to Execute:** ```bash kubectl scale deployment demo-api --replicas=4 -n operate-demo ``` **✅ Dry-Run Validation:** Success - Deployment accepts the replica count modification **Patterns Applied:** None **Capabilities Used:** Deployment **Policies Checked:** - Applications must NEVER use 'latest' tag - PASSED (uses nginx:1.21) **Risk Assessment:** LOW RISK - Scaling operation is non-disruptive - New pods created in parallel with existing configuration - No downtime expected, easily reversible **Session ID:** opr-1763169767759-0c2a4bb7 📊 **Visualization:** http://localhost:3000/v/opr-1763169767759-0c2a4bb7 Would you like to execute this operation? ~~~ #### Visualization (Optional) The visualization URL opens an interactive analysis view in the Web UI: > **Note**: Visualization URLs require additional setup: > - Configure `webUI.baseUrl` in your Helm values (see [Deployment Guide - Web UI Visualization](../setup/deployment.md#web-ui-visualization)) > - Install the Web UI (see [Web UI Documentation](/docs/ui)) **What happened behind the scenes:** - AI analyzed deployment and discovered current state: 2 replicas, properly configured resources - No organizational patterns found related to "scaling" - Generated simple `kubectl scale` command - direct approach - Validated with dry-run successfully - Proposed straightforward replica increase **Note:** Without patterns, AI takes the most direct approach - modifying replicas directly. This works, but doesn't capture organizational best practices like using HPA for managed scaling. 
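The dry-run validation the agent performs can also be reproduced by hand when you want to sanity-check a change yourself. `kubectl` supports server-side dry runs, which ask the API server to validate a request without persisting it. The resource names below come from the example scenario above, so this assumes a cluster where they exist:

```shell
# Ask the API server to validate the scale request without applying it.
kubectl scale deployment demo-api --replicas=4 -n operate-demo --dry-run=server

# The live object is untouched afterwards; replicas still report the old value.
kubectl get deployment demo-api -n operate-demo -o jsonpath='{.spec.replicas}'
```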
#### Step 3: Pattern-Influenced Operation Now let's run the **exact same scaling intent** and see how the pattern changes the recommendation: ```md User: scale demo-api deployment in operate-demo namespace to 4 replicas. Start the process again. ``` ~~~md Agent: Let me analyze the operation with organizational patterns applied. [Uses mcp__dot-ai__operate with intent "scale demo-api deployment in operate-demo namespace to 4 replicas"] **Analysis Complete:** Scaling demo-api deployment from 2 to 4 replicas using HorizontalPodAutoscaler as per organizational Pattern 1 (Horizontal scaling with HPA). The pattern specifies that all scaling operations should use HPA for managing multiple replicas, even when both min and max are set to the same value. **Current State:** - Deployment: demo-api (operate-demo namespace) - Current replicas: 2 running - Image: nginx:1.21 - Resource requests: 100m CPU / 128Mi memory (HPA-ready) - No HPA configured **Proposed Changes:** - **Create HorizontalPodAutoscaler** demo-api - minReplicas: 4, maxReplicas: 4 - CPU utilization target: 80% - Rationale: Follows organizational Pattern 1 which requires all scaling operations to use HPA for consistency and operational best practices. Creates foundation for future autoscaling enhancements. 
**HPA Manifest:** ```yaml apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: demo-api namespace: operate-demo spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: demo-api minReplicas: 4 maxReplicas: 4 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 80 ``` **✅ Dry-Run Validation:** Success - Cluster accepted the HPA resource definition **Patterns Applied:** - Pattern 1: Horizontal scaling with HPA - All scaling operations use HorizontalPodAutoscaler for consistency and operational governance **Capabilities Used:** HorizontalPodAutoscaler (autoscaling/v2) **Risk Assessment:** LOW RISK - Scaling up from 2 to 4 replicas on healthy deployment - Existing pods continue running while new replicas created - Easily reversible by deleting HPA or adjusting counts **Session ID:** opr-1763170292295-0dbae152 Would you like to execute this operation? ~~~ **What happened behind the scenes:** - Vector search found relevant pattern: "Horizontal scaling with HPA" (semantic match on "scaling", "replicas") - AI read pattern rationale: "All scaling operations should use HorizontalPodAutoscaler..." - AI changed strategy from direct replica scaling to HPA creation - Generated HPA manifest with minReplicas=maxReplicas=4 to lock at desired count - Validated HPA manifest with dry-run successfully - Listed pattern in "patternsApplied" field for audit trail **Key Difference:** Same intent, completely different approach! Pattern transformed a simple `kubectl scale` into an HPA resource, enforcing organizational best practice. ## Summary This example demonstrated: 1. **Baseline behavior** - Without patterns, AI recommends direct approach (kubectl scale) 2. **Pattern creation** - Define organizational best practice (use HPA for scaling) 3. **Pattern influence** - Same intent produces different recommendation (HPA creation) 4. 
**Safe execution** - Exact approved commands executed with AI validation **Key Takeaway:** Organizational patterns transform operations from "what works" to "what's best for your organization" - automatically enforcing governance without manual intervention. **Learn More:** - See [Pattern Management Guide](../organizational-data/patterns.md) for detailed information on creating and managing organizational patterns - Explore [Policy Management Guide](../organizational-data/policies.md) for governance and compliance enforcement ## Best Practices ### Writing Effective Intents **Be specific about target resources:** ```md ✅ Good: "scale demo-api deployment in production namespace to 5 replicas" ❌ Vague: "scale the app" ``` **Include namespace when working with multiple environments:** ```md ✅ Good: "update my-api in staging namespace to v2.0" ❌ Ambiguous: "update my-api to v2.0" (which namespace?) ``` **Specify operational requirements when relevant:** ```md ✅ Good: "update my-api to v2.0 with zero downtime" ✅ Good: "make my-database highly available with backups" ``` ### Session Management - **Review proposals carefully** - Always review proposed changes before execution - **Sessions are temporary** - Session data expires after operation completion - **Refine if needed** - Use `refinedIntent` parameter to clarify ambiguous requests ### Pattern and Policy Integration - **Create patterns proactively** - Define operational best practices before they're needed - **Use specific triggers** - Patterns with clear triggers match more accurately - **Document rationale** - Clear rationale helps AI apply patterns correctly - **Test patterns** - Verify patterns influence recommendations as expected ## Related Guides - **[Recommendation Guide](recommend.md)** - Initial application deployment with capability-enhanced recommendations - **[Remediate Guide](remediate.md)** - AI-powered troubleshooting and issue resolution - **[Pattern Management 
Guide](../organizational-data/patterns.md)** - Creating and managing organizational patterns - **[Policy Management Guide](../organizational-data/policies.md)** - Defining and enforcing governance policies - **[Capability Management Guide](capability-management.md)** - Cluster resource discovery and semantic capabilities # Tools and Features Overview **Quick reference to all available tools in the DevOps AI Toolkit.** ## Why Your Infrastructure AI Sucks (And How to Fix It) [![Why Your Infrastructure AI Sucks (And How to Fix It)](https://img.youtube.com/vi/Ma3gKmuXahc/maxresdefault.jpg)](https://youtu.be/Ma3gKmuXahc) This video demonstrates the core AI-powered deployment workflow: capabilities discovery, organizational patterns, policy enforcement, context management, and intelligent workflows. Watch how these components work together to transform generic AI responses into infrastructure solutions that actually work in your organization. ## Available Tools ### 🚀 Kubernetes Deployment Recommendations AI-powered application deployment assistance with smart intent clarification, semantic resource matching, and automated manifest generation. - **What it does**: Clarifies deployment intents through adaptive questioning, then provides intelligent deployment suggestions based on your cluster capabilities. When no capability matches, automatically discovers and installs third-party applications via Helm charts from ArtifactHub. - **Use when**: Deploying applications to Kubernetes clusters, or installing third-party tools (Prometheus, Argo CD, Crossplane, etc.) via Helm - **📖 Full Guide**: [Kubernetes Deployment Recommendations](recommend.md) ### 🔎 Cluster Query Natural language interface for querying your Kubernetes cluster. 
- **What it does**: Ask questions about your cluster in plain English - find resources by concept, describe specific resources, or check health status - **Use when**: Exploring your cluster, finding resources without memorizing kubectl syntax, or checking resource status - **📖 Full Guide**: [Cluster Query](query.md) ### 🔍 Capability Management Teaches the AI what your Kubernetes resources actually do through semantic discovery and analysis. - **What it does**: Scans cluster resources and discovers their capabilities for intelligent matching - **Use when**: Setting up recommendations (required) or improving deployment intelligence - **📖 Full Guide**: [Capability Management](capability-management.md) ### 🏛️ Pattern Management Captures organizational deployment knowledge as reusable patterns that enhance AI recommendations. - **What it does**: Creates deployment templates with your organization's best practices - **Use when**: Standardizing deployments across teams or enforcing organizational standards - **📖 Full Guide**: [Pattern Management](../organizational-data/patterns.md) ### 🛡️ Policy Management Enables proactive governance through policy intents that guide users toward compliant configurations. - **What it does**: Creates governance policies that integrate into AI recommendations with optional Kyverno enforcement - **Use when**: Implementing security requirements, compliance standards, or configuration governance - **📖 Full Guide**: [Policy Management](../organizational-data/policies.md) ### 🔧 Kubernetes Issue Remediation AI-powered issue analysis and remediation with intelligent root cause identification. 
- **What it does**: Multi-step investigation loop to identify root causes and generate executable remediation commands - **Use when**: Troubleshooting Kubernetes failures, diagnosing pod/networking/storage issues, or understanding "what's wrong" - **📖 Full Guide**: [Kubernetes Issue Remediation](remediate.md) ### ⚙️ Kubernetes Operations AI-powered Day 2 operations for any Kubernetes resources through natural language intents. - **What it does**: Handles updates, scaling, enhancements, rollbacks, and any operational changes to workloads, databases, infrastructure, cloud resources, or Helm-managed applications with pattern-driven recommendations and dry-run validation - **Use when**: Performing operational changes on deployed resources - applications, databases, storage, Helm releases, AWS/Azure/GCP resources via operators, networking, or any Kubernetes-managed infrastructure - **📖 Full Guide**: [Kubernetes Operations](operate.md) ### 📦 Project Setup & Governance Comprehensive repository setup with governance, legal, security, and automation files. - **What it does**: Generates 25+ standardized files including LICENSE, CODE_OF_CONDUCT, CONTRIBUTING, SECURITY policies, GitHub issue/PR templates, workflows (OpenSSF Scorecard), and automation (Renovate, Labeler, Stale Bot) - **Use when**: Setting up new repositories, standardizing team workflows, or implementing governance and security best practices - **📖 Full Guide**: [Project Setup & Governance](project-setup.md) ### 💬 Shared Prompts Library Centralized prompt sharing via native slash commands in AI coding agents. - **What it does**: Provides curated prompts as slash commands (e.g., `/explain-code`, `/security-review`) - **Use when**: Boosting productivity with standardized prompts across projects - **📖 Full Guide**: [Shared Prompts Library](prompts.md) ### 🌐 REST API Gateway HTTP REST endpoints for all DevOps AI Toolkit tools, enabling integration with traditional applications and CI/CD pipelines. 
- **What it does**: Exposes all tools via standard HTTP POST/GET endpoints with auto-generated OpenAPI documentation - **Use when**: Integrating with automation scripts, CI/CD pipelines, Kubernetes controllers, or any non-MCP applications - **📖 Full Guide**: [REST API Gateway](../api/rest-api.md) ### 🩺 System Status & Diagnostics Comprehensive health check and diagnostics for your DevOps AI Toolkit installation. - **What it does**: Shows version info, component health (Kubernetes, Vector DB, AI provider), collection statistics, and active capabilities - **Use when**: Troubleshooting setup issues, verifying installation, or checking what features are available - **📖 Full Guide**: [System Status](version.md) ### 📚 Knowledge Base Semantic search over organizational documentation. - **What it does**: Ingests documents into a searchable vector store, enabling natural language search across your organization's documentation - **Use when**: Making organizational knowledge searchable by AI, finding relevant documentation by meaning rather than keywords - **📖 Full Guide**: [Knowledge Base](knowledge-base.md) ## Quick Start 1. **Complete Setup**: Follow the [Deployment Guide](../setup/deployment.md) 2. **Start with Capability Management** to scan your cluster (required for recommendations and operations) 3. **Try Deployment Recommendations** with a simple application 4. **Use Kubernetes Operations** for Day 2 changes (updates, scaling, enhancements) 5. 
**Optional**: Create organizational patterns, policy intents, or use issue remediation ## Prerequisites **Required for all tools:** - **DevOps AI Toolkit deployed**: See [Deployment Guide](../setup/deployment.md) **Works without AI keys:** - ✅ **Project Setup & Governance** - ✅ **Shared Prompts Library** - ✅ **REST API Gateway** **For AI-powered features (deployment, remediation, patterns, policies, capabilities):** - **AI Model API key**: See [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for model options - **Cluster access**: `KUBECONFIG` for Kubernetes integration (deployment, remediation, capabilities) - **Vector database**: Qdrant for capability, pattern, and policy storage - **Embedding provider API key**: OpenAI, Google, or Amazon Bedrock for pattern/policy semantic search ## Tool Dependencies - **Deployment Recommendations** ← requires **Capability Management** - **Kubernetes Operations** ← requires **Capability Management** - **Pattern Management** → enhances **Deployment Recommendations** and **Kubernetes Operations** - **Policy Management** → enhances **Deployment Recommendations** and **Kubernetes Operations** - **Kubernetes Issue Remediation** ← independent - **Project Setup** ← independent - **Shared Prompts Library** ← independent - **REST API Gateway** ← provides HTTP access to all tools ## Getting Help For troubleshooting, use the system status command: ``` Show dot-ai status ``` This provides comprehensive diagnostics including component health, connection status, and capability summary. See the [System Status Guide](version.md) for detailed troubleshooting steps. 
# GitHub Repository Setup & Governance Guide **Complete guide for AI-powered repository setup and governance with the DevOps AI Toolkit.** ## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md) to set up: - DevOps AI Toolkit deployed - No additional dependencies required (tool does not need Kubernetes or AI model access) **Note**: This tool does NOT require: - Kubernetes cluster access - Qdrant vector database - AI model API keys The tool generates files based on templates and your answers - no AI inference or cluster access needed. ## Overview The DevOps AI Toolkit project setup feature provides: - **Repository audit** - Scans repositories to identify missing governance, legal, and automation files - **Gap analysis** - Shows which standardized files are missing from your repository - **Template-based generation** - Creates 25+ files across 9 scopes using industry best practices - **Interactive workflow** - Asks relevant questions and generates customized files - **Standards compliance** - All templates based on authoritative sources (Contributor Covenant, OpenSSF, GitHub) ### What It Generates The tool can generate files across these scopes: - **readme**: README.md with project information - **legal**: LICENSE files (MIT, Apache-2.0, GPL-3.0, BSD-3-Clause) with NOTICE/COPYRIGHT as needed - **governance**: CODE_OF_CONDUCT.md, CONTRIBUTING.md, SECURITY.md - **community**: SUPPORT.md, GOVERNANCE.md, CODEOWNERS - **github-issues**: Bug report, feature request, and question templates (.github/ISSUE_TEMPLATE/) - **pr-template**: Pull request template (.github/PULL_REQUEST_TEMPLATE.md) - automatically detected and used by the [`prd-done` workflow](prompts.md#available-prompts) for intelligent PR creation - **github-community**: FUNDING.yml, AUTHORS, CONTRIBUTORS, CHANGELOG, release.yml - **github-security**: OpenSSF Scorecard workflow (.github/workflows/scorecard.yml) - **github-automation**: Renovate, GitHub Labeler, Stale Bot 
configurations ### When to Use This Tool Use project setup when you want to: - Bootstrap a new repository with proper governance - Audit existing repository for missing files - Standardize team workflows across multiple repositories - Implement security and contribution guidelines - Add GitHub automation (dependency updates, PR labeling, stale issue management) ### Video: Transform Your GitHub Repository [![Transform Your GitHub Repository: Templates, Automation, and Governance](https://img.youtube.com/vi/gYl3moYa4iI/maxresdefault.jpg)](https://youtu.be/gYl3moYa4iI) ## Complete Workflow Example ### Step 1: Request Repository Audit ``` User: Audit my repository for missing governance files ``` ``` Agent: I'll scan your repository to identify missing files. [Scans repository for 21 files] Repository Analysis: ✓ readme: 1/1 files exist ✓ legal: 1/1 files exist ○ governance: 0/6 files exist Missing: CODE_OF_CONDUCT.md, CONTRIBUTING.md, SECURITY.md, docs/MAINTAINERS.md, docs/GOVERNANCE.md, docs/ROADMAP.md ○ community: 0/2 files exist Missing: SUPPORT.md, ADOPTERS.md ○ github-issues: 0/3 files exist Missing: .github/ISSUE_TEMPLATE/bug_report.yml, .github/ISSUE_TEMPLATE/feature_request.yml, .github/ISSUE_TEMPLATE/config.yml ○ pr-template: 0/1 files exist Missing: .github/PULL_REQUEST_TEMPLATE.md ○ github-community: 0/2 files exist Missing: .github/CODEOWNERS, .github/release.yml ○ github-security: 0/1 files exist Missing: .github/workflows/scorecard.yml ○ github-automation: 0/4 files exist Missing: renovate.json, .github/labeler.yml, .github/workflows/labeler.yml, .github/workflows/stale.yml Which scopes would you like to set up? (You can choose multiple) ``` ### Step 2: Select Scopes and Answer Questions ``` User: Set up governance and github-issues ``` ``` Agent: I'll set up governance first (6 files to generate). Let me analyze your repository to suggest answers: 1. Project name: my-awesome-app 2. Repository URL: https://github.com/username/my-awesome-app 3. 
Enforcement email: conduct@example.com 4. Security email: security@example.com 5. Maintainer email: maintainers@example.com 6. Lead maintainer name: John Doe 7. Lead maintainer GitHub: johndoe 8. Setup command: npm install 9. Test command: npm test 10. Lint command: npm run lint ... (27 questions total) Do these look correct, or would you like to modify any? ``` ``` User: Looks good ``` ### Step 3: Files Generated and Written ``` Agent: Generated 6 files for governance scope: - CODE_OF_CONDUCT.md - CONTRIBUTING.md - SECURITY.md - docs/MAINTAINERS.md - docs/GOVERNANCE.md - docs/ROADMAP.md [Writes these 6 files to repository] ✓ Governance files created successfully! Now let's set up github-issues (3 files to generate). Questions for github-issues: 1. Project name: my-awesome-app 2. Repository URL: https://github.com/username/my-awesome-app 3. Bug report form title: Bug Report 4. Feature request form title: Feature Request ... (8 questions total) Do these look correct? ``` ``` User: Yes ``` ``` Agent: Generated 3 files for github-issues scope: - .github/ISSUE_TEMPLATE/bug_report.yml - .github/ISSUE_TEMPLATE/feature_request.yml - .github/ISSUE_TEMPLATE/config.yml [Writes these 3 files to repository] ✓ GitHub issue templates created successfully! All requested scopes are now complete. 
Your repository now has: - 6 governance files - 3 GitHub issue templates ``` ## Scope Reference | Scope | Files | Description | |-------|-------|-------------| | **readme** | README.md | Basic project README with name, description, license | | **legal** | LICENSE, NOTICE*, COPYRIGHT* | License files (MIT, Apache-2.0, GPL-3.0, BSD-3-Clause) | | **governance** | CODE_OF_CONDUCT.md, CONTRIBUTING.md, SECURITY.md, MAINTAINERS.md, GOVERNANCE.md, ROADMAP.md | Complete governance documentation | | **community** | SUPPORT.md, ADOPTERS.md | Support channels and project adopters | | **github-issues** | bug_report.yml, feature_request.yml, config.yml | GitHub issue templates | | **pr-template** | PULL_REQUEST_TEMPLATE.md | Pull request template with checklist | | **github-community** | FUNDING.yml*, AUTHORS*, CONTRIBUTORS*, CHANGELOG*, release.yml* | Community and funding files | | **github-security** | workflows/scorecard.yml | OpenSSF Scorecard security workflow | | **github-automation** | renovate.json, labeler.yml, workflows/labeler.yml, workflows/stale.yml | Automated dependency updates, PR labeling, stale issue management | _* Conditional files - only generated based on your answers_ ## Common Use Cases ### New Open Source Project ``` User: Set up a new open source project with all governance files ``` **Recommended scopes**: legal, governance, community, github-issues, pr-template, github-security, github-automation ### Existing Project Audit ``` User: Audit my existing project for missing files ``` The agent will show what's missing and let you choose which scopes to add. 
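The same kind of check can be approximated locally before running the tool; a minimal shell sketch that tests for a subset of the files listed in the scope reference above (illustrative only — the real audit scans the full template set and never overwrites existing files):

```bash
# Report which of a subset of the standard files exist in the current
# directory. Illustrative only -- the real tool checks the full template set.
total=0
missing=0
for f in README.md LICENSE CODE_OF_CONDUCT.md CONTRIBUTING.md SECURITY.md \
         SUPPORT.md ADOPTERS.md .github/PULL_REQUEST_TEMPLATE.md \
         .github/workflows/scorecard.yml renovate.json; do
  total=$((total + 1))
  if [ -f "$f" ]; then
    echo "found:   $f"
  else
    echo "missing: $f"
    missing=$((missing + 1))
  fi
done
echo "$missing of $total files missing"
```

Running this in a fresh repository shows the gaps at a glance; the toolkit's audit adds per-scope grouping and interactive generation on top.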
### Add GitHub Automation Only ``` User: Add Renovate and issue automation to my project ``` **Recommended scopes**: github-automation ### Security and Compliance ``` User: Add security scanning and policies ``` **Recommended scopes**: governance (includes SECURITY.md), github-security ## Tips ### Workflow Tips **Review before committing**: The agent generates files but YOU control what gets committed. Review the generated content before committing. **Multiple sessions**: You can run the tool multiple times. It only generates missing files, never overwrites existing ones. **Customize templates**: After generation, you can edit files to match your project's specific needs. ### Best Practices **Start with governance**: If setting up a new project, start with `legal` and `governance` scopes first. **Add automation last**: Set up `github-automation` after your basic repository structure is in place. **Review email addresses**: Double-check all email addresses in governance files before committing. **Test workflows**: After generating GitHub Actions workflows, verify they run successfully. ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](overview.md)** - Browse all available tools and features # Shared Prompt Library ## What are Shared Prompts Shared Prompts provide a centralized way to share and discover prompts across projects and team members. Instead of manually copying prompt files between projects, prompts are served directly through the DevOps AI Toolkit and automatically appear as native slash commands in AI coding agents. 
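Under the hood this uses the standard MCP prompts interface; a `prompts/list` exchange looks roughly like the following (JSON-RPC framing per the MCP specification; the prompt entry shown is abridged and illustrative):

```json
{"jsonrpc": "2.0", "id": 1, "method": "prompts/list"}

{"jsonrpc": "2.0", "id": 1,
 "result": {"prompts": [
   {"name": "prd-create",
    "description": "Create a comprehensive Product Requirements Document"}]}}
```

Your coding agent performs this exchange automatically on connection, which is why no file management is needed on your side.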
**Key Benefits:** - **Zero setup**: Connect to the server and prompts are immediately available - **Native integration**: Prompts appear as slash commands in your coding agent's command menu - **Cross-project sharing**: Same prompts available across all projects without file management - **Instant updates**: New prompts added to server are immediately available to all users - **Team consistency**: Everyone uses the same proven prompts ## Prerequisites Before using Shared Prompts, you need: - **DevOps AI Toolkit deployed** — see [Deployment Guide](../setup/deployment.md) - **Access via** [MCP client](/docs/mcp) or [CLI](/docs/cli) **Note**: The Anthropic API key is only required for AI-powered tools (like deployment recommendations), not for using the shared prompts library. ## How It Works The DevOps AI Toolkit implements the standard MCP Prompts specification, exposing two key endpoints: - **`prompts/list`**: Returns all available prompts with metadata - **`prompts/get`**: Returns specific prompt content by ID When you connect to the server, your coding agent automatically discovers available prompts and makes them accessible through agent-specific interfaces. The access method varies by agent - some use slash commands, others integrate prompts as available tools. ## See Shared Prompts in Action [![Using PRD Prompts with DevOps AI Toolkit](https://img.youtube.com/vi/LUFJuj1yIik/maxresdefault.jpg)](https://youtu.be/LUFJuj1yIik) This video demonstrates how to use the PRD (Product Requirements Document) prompts served by the DevOps AI Toolkit. Watch how to create, manage, and work through complete PRD workflows using the shared prompts library with conversational AI agents. ## Using Shared Prompts ### Discovering Available Prompts 1. Ensure you're connected to the DevOps AI Toolkit (see [MCP client setup](/docs/mcp)) 2. 
Access prompts using your agent's interface: - **Claude Code**: Type `/` and look for `/dot-ai:prompt-name` commands - **VS Code + GitHub Copilot**: Type `/` and look for `/mcp.dot-ai.prompt-name` commands - **Other agents**: Check your agent's command menu or ask the agent to use specific prompts by name 3. Browse available prompts through your agent's discovery interface ### Executing Shared Prompts **Claude Code:** ```bash # Create a new PRD /dot-ai:prd-create # Get current PRD status /dot-ai:prds-get ``` **VS Code + GitHub Copilot:** ```bash # Create a new PRD /mcp.dot-ai.prd-create # Get current PRD status /mcp.dot-ai.prds-get ``` **Other Agents:** - May use slash commands with different formats - Or ask the agent to use the prompt by name (e.g., "Use the prd-create prompt") The prompt content executes exactly as if it were a local command file, but without any file management on your part. ## Available Prompts ### Project Management Prompts **`prd-create`** - **Purpose**: Create a comprehensive Product Requirements Document following documentation-first approach - **Use when**: Starting a new feature that requires detailed planning, tracking, and documentation - **Example**: Beginning work on a new feature or major enhancement **`prds-get`** - **Purpose**: Fetch all open GitHub issues with 'PRD' label from the current project repository - **Use when**: Want to see all active Product Requirements Documents and their status - **Example**: Getting overview of current project priorities and PRD status **`prd-next`** - **Purpose**: Analyze existing PRD to identify and recommend the single highest-priority task to work on next - **Use when**: Need guidance on what to work on next within a PRD, uses smart auto-detection of target PRD - **Example**: Continuing work on a PRD and need to prioritize remaining tasks **`prd-start`** - **Purpose**: Start working on a PRD implementation - **Use when**: Beginning work on a specific PRD - **Example**: Moving from planning 
phase to implementation phase **`prd-update-progress`** - **Purpose**: Update PRD progress based on git commits and code changes, enhanced by conversation context - **Use when**: Implementation work has been completed and need to mark PRD items as done based on actual code changes - **Example**: After completing development tasks, update PRD to reflect current status **`prd-update-decisions`** - **Purpose**: Update PRD based on design decisions and strategic changes made during conversations - **Use when**: Architecture, workflow, or requirement decisions were made in conversation that need to be captured in the PRD - **Example**: After making architectural decisions that affect the original PRD scope **`prd-done`** - **Purpose**: Complete PRD implementation workflow - create branch, push changes, create PR, merge, and close issue - **Enhanced PR Creation**: Automatically detects and uses project PR templates (e.g., `.github/PULL_REQUEST_TEMPLATE.md`), analyzes git changes to auto-fill PR content, and prompts for information requiring human judgment - **Template Integration**: Works seamlessly with PR templates generated by the [Project Setup](project-setup.md) tool - **Use when**: Finished implementing a PRD and ready to deploy and close out the work - **Example**: All PRD tasks completed and ready for final deployment and closure **`prd-close`** - **Purpose**: Close a PRD that is already implemented or no longer needed without creating a PR - **Use when**: PRD functionality already exists (implemented elsewhere), PRD is superseded, or requirements changed - **Example**: Closing a PRD whose features were already implemented in a separate project ### Development Prompts **`generate-dockerfile`** - **Purpose**: Generate production-ready, secure, multi-stage Dockerfile and .dockerignore for any project - **Use when**: Containerizing an application for production deployment, or improving an existing Dockerfile - **Example**: Setting up Docker for a new project, or 
fixing security issues in an existing Dockerfile **`generate-cicd`** - **Purpose**: Generate intelligent CI/CD workflows through interactive conversation by analyzing repository structure and user preferences - **Use when**: Setting up CI/CD pipelines that understand your project's existing automation (Makefile, npm scripts, etc.) - **Example**: Adding GitHub Actions workflows that use your project's build system rather than raw commands ## Example Workflows ### Workflow 1: Complete PRD Lifecycle - **Start new feature**: Use `prd-create` prompt to create comprehensive requirements document 1. GitHub issue created with PRD label 2. PRD file generated with proper naming 3. Complete documentation content written across multiple files with traceability - **Check priorities**: Use `prds-get` prompt to see all active PRDs and priorities 1. Open PRD issues fetched from GitHub 2. Issues formatted with status analysis 3. Next steps recommendations provided - **Begin implementation**: Use `prd-start` prompt to begin working on specific PRD 1. Target PRD auto-detected from context 2. PRD readiness validated 3. Feature branch created 4. First implementation task identified with detailed plan - **Get next task**: Use `prd-next` prompt to identify highest-priority remaining work 1. Current PRD state analyzed 2. Single highest-value next task identified 3. Implementation design guidance provided - **Update decisions**: Use `prd-update-decisions` prompt when design decisions are made during implementation 1. Conversation context analyzed for design decisions 2. Decision impact assessed across requirements and scope 3. PRD sections updated with new decisions and rationale - **Update progress**: Use `prd-update-progress` prompt after completing implementation tasks 1. Git commits and code changes analyzed 2. Changes mapped to PRD requirements 3. PRD checkboxes updated with work log entry - **Finalize**: Use `prd-done` prompt to deploy, merge, and close out completed work 1. 
Pre-completion validation performed 2. Pull request created and merged 3. GitHub issue closed with final validation ### Workflow 2: Containerization #### Video: Production Dockerfile Generation [![My Production Dockerfile Rules: How I Build Docker Images](https://img.youtube.com/vi/ueTe-VQaD7c/maxresdefault.jpg)](https://youtu.be/ueTe-VQaD7c) This video demonstrates how to use the `generate-dockerfile` prompt to create production-ready, secure, multi-stage Dockerfiles. Watch how the prompt analyzes your project structure and generates optimized Docker configurations following best practices. - **Containerize project**: Use `generate-dockerfile` prompt to create production-ready Docker configuration 1. Project structure analyzed (language, framework, dependencies) 2. Multi-stage Dockerfile generated with security best practices 3. Minimal .dockerignore created based on Dockerfile's COPY commands 4. Image built and validated automatically 5. Container tested to ensure application runs correctly - **Improve existing Dockerfile**: Use same prompt when Dockerfile already exists 1. Existing Dockerfile analyzed against best practices 2. Security issues identified (running as root, :latest tags, etc.) 3. Optimizations applied while preserving intentional customizations 4. Changes explained with rationale ### Workflow 3: CI/CD Generation - **Generate CI/CD workflows**: Use `generate-cicd` prompt to create intelligent workflows 1. CI platform confirmed (GitHub Actions supported, feature request offered for others) 2. Repository analyzed (language, automation, existing CI, deployment mechanism) 3. Findings presented for user confirmation 4. Workflow choices presented (PR workflow, release triggers, deployment strategy) 5. Workflows generated using project automation (npm test, make build, etc.) 6. Required secrets and permissions documented 7. 
Workflows committed and validated

## Cross-Agent Compatibility

### Supported Coding Agents

**Agents with Full Slash Command Support**: ✅
- **Claude Code**: Prompts appear as `/dot-ai:prompt-name`
- **VS Code + GitHub Copilot**: Prompts appear as `/mcp.dot-ai.prompt-name`
- Both support native slash command integration and prompt discovery
- Tools appear in agent configuration menus
- Full metadata support with descriptions

**Other MCP-Compatible Agents**: 🤔 Expected to work (not validated)
- Other MCP-compatible clients like Cursor should work since they follow MCP specifications
- Different agents may use different slash command formats (e.g., `/mcp.dot-ai.prompt-name` vs `/dot-ai:prompt-name`)
- **Help us validate**: Try these prompts in your agent and [report your experience via GitHub issues](https://github.com/vfarcic/dot-ai/issues)

## Contributing Prompts

Have a useful prompt to share? Contribute it to the shared library:

1. **Fork the repository** and create a feature branch
2. **Add your prompt** to the `shared-prompts/` directory following existing naming conventions
3. **Update the documentation** by adding your prompt to the "Available Prompts" section above
4. **Submit a pull request** with a clear description of what the prompt does and when to use it

### Prompt Metadata Format

Each prompt file must include YAML frontmatter that defines how it appears in coding agents:

```yaml
---
name: your-prompt-name
description: One-line summary shown in the agent's command menu
category: development
---

# Your Prompt Content

Your prompt instructions go here...
```

**Metadata Fields:**
- **`name`**: Becomes the slash command name (e.g., `name: prd-create` → `/dot-ai:prd-create`)
- **`description`**: Shows up in coding agent command menus and help text
- **`category`**: Used for organizing prompts in documentation (must be one of: `project-management`, `development`)

**How It Works:**
1. **Prompt Discovery**: Your coding agent automatically discovers all available prompts and their metadata
2.
**Slash Commands**: Each prompt appears as a slash command — the `name` becomes the command, `description` appears in menus 3. **Organization**: The `category` field groups prompts in documentation for easy browsing **Contribution Guidelines:** - Use descriptive, kebab-case names (e.g., `database-optimization`, `api-security-review`) - Include clear purpose and usage examples in your PR description - Test your prompt across different scenarios before contributing - Follow the established prompt format and documentation patterns ## User-Defined Prompts Serve custom prompts from your own git repository. Your prompts appear alongside built-in prompts. ### Why User-Defined Prompts? - **Agent-agnostic**: Prompts work with any compatible coding agent (Claude Code, Cursor, VS Code, etc.) without maintaining separate prompt directories for each tool - **Team consistency**: Share standard prompts across all projects without contributing to the core project - **Organization-specific workflows**: Create prompts tailored to your team's processes - **Version control**: Manage prompts through standard git workflows (commit, push, PR) - **Works everywhere**: Prompts work across all Kubernetes deployments ### Configuration Configure user prompts via environment variables: | Variable | Purpose | Default | |----------|---------|---------| | `DOT_AI_USER_PROMPTS_REPO` | Git repository URL (HTTPS) | None (feature disabled) | | `DOT_AI_USER_PROMPTS_BRANCH` | Branch to use | `main` | | `DOT_AI_USER_PROMPTS_PATH` | Subdirectory within repo | Root directory | | `DOT_AI_GIT_TOKEN` | Authentication token for private repos | None | | `DOT_AI_USER_PROMPTS_CACHE_TTL` | Cache duration in seconds | `86400` (24 hours) | **Supported Git Providers:** - GitHub (github.com) - GitLab (gitlab.com or self-hosted) - Gitea / Forgejo (self-hosted) - Bitbucket (bitbucket.org) - Any git server supporting HTTPS clone ### Repository Setup Create a git repository with prompt files as markdown (`.md`) files: ``` 
my-team-prompts/
├── deploy-app.md
├── review-pr.md
└── team-standup.md
```

Or use a subdirectory within an existing repository:

```
my-project/
├── src/
├── docs/
└── prompts/          # Set DOT_AI_USER_PROMPTS_PATH=prompts
    ├── deploy-app.md
    └── review-pr.md
```

### Prompt File Format

User prompts follow the same format as built-in prompts, with optional MCP arguments support:

```yaml
---
name: deploy-app
description: Deploy the application to a target environment
category: development
arguments:
  - name: environment
    description: Target environment for the deployment
    required: true
  - name: version
    description: Version to deploy
    required: false
---

# Deploy Application

Deploy the application to {{environment}}.

{{#if version}}
Deploying version: {{version}}
{{/if}}

## Steps

1. Verify the deployment configuration
2. Run pre-deployment checks
3. Execute deployment to {{environment}}
4. Validate deployment success
```

**Metadata Fields:**
- **`name`**: Becomes the slash command (e.g., `name: deploy-app` → `/dot-ai:deploy-app`)
- **`description`**: Shows in coding agent command menus
- **`category`**: Organizes prompts in documentation
- **`arguments`**: Optional parameters (MCP prompt-argument shape: `name`, `description`, `required`) substituted via `{{argumentName}}` placeholders

### Deployment Configuration

#### Kubernetes (Helm)

Add environment variables via `extraEnv` in your Helm values:

```bash
helm upgrade --install dot-ai oci://ghcr.io/vfarcic/helm/dot-ai \
  --namespace dot-ai --create-namespace \
  --set ai.anthropic.apiKey="${ANTHROPIC_API_KEY}" \
  --set ai.embeddings.openai.apiKey="${OPENAI_API_KEY}" \
  --set-json 'extraEnv=[
    {"name":"DOT_AI_USER_PROMPTS_REPO","value":"https://github.com/your-org/team-prompts.git"},
    {"name":"DOT_AI_USER_PROMPTS_PATH","value":"prompts"},
    {"name":"DOT_AI_GIT_TOKEN","value":"'"${DOT_AI_GIT_TOKEN}"'"}
  ]'
```

### How It Works

1. **First access**: Repository is cloned to a local cache directory
2. **Subsequent access**: Repository is pulled if cache TTL has expired
3. **Merging**: User prompts are merged with built-in prompts
4.
**Precedence**: Built-in prompts take precedence over user prompts with the same name ### Error Handling The feature is designed for graceful degradation: | Scenario | Behavior | |----------|----------| | Repository not configured | Built-in prompts only (no error) | | Clone fails (auth, network) | Built-in prompts only, error logged | | Pull fails | Cached version used, warning logged | | Invalid prompt format | Prompt skipped, warning logged | | Name collision with built-in | User prompt skipped, warning logged | ### Troubleshooting User Prompts **User prompts don't appear** - **Cause**: Repository not configured or clone failed - **Solution**: Verify `DOT_AI_USER_PROMPTS_REPO` is set and accessible - **Check**: Run "Show dot-ai status" to verify prompt loading and connectivity **Private repository auth fails** - **Cause**: Missing or invalid `DOT_AI_GIT_TOKEN` - **Solution**: Set a valid personal access token (PAT) with repo read access - **Note**: Tokens are never logged; URLs are sanitized in log output **Changes not appearing** - **Cause**: Cache hasn't expired yet - **Solution**: Wait for TTL to expire, or set `DOT_AI_USER_PROMPTS_CACHE_TTL=0` for testing **Prompt has same name as built-in** - **Cause**: Name collision with built-in prompt - **Solution**: Rename your prompt to a unique name - **Note**: Built-in prompts always take precedence ## Troubleshooting ### Common Issues **Prompts don't appear in command menu** - **Cause**: Server not connected or prompts capability not enabled - **Solution**: Check connection status and server configuration - **See**: [Deployment Guide](../setup/deployment.md) for server troubleshooting, [MCP client setup](/docs/mcp) for connection issues **Prompt execution fails with "not found" error** - **Cause**: Prompt ID mismatch or server synchronization issue - **Solution**: Refresh the connection or restart your coding agent - **Workaround**: Disconnect and reconnect to the server **Prompts work in one agent but not 
another** - **Cause**: Agent-specific MCP implementation differences - **Solution**: Check agent-specific compatibility notes above - **Alternative**: Use a fully compatible agent for prompt-heavy workflows ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](overview.md)** - Browse all available tools and features # Cluster Query Guide **Complete guide for using natural language to query your Kubernetes cluster with the DevOps AI Toolkit.** ## See It In Action [![Why Kubernetes Querying Is Broken and How I Fixed It](https://img.youtube.com/vi/uUdbQkq5c4k/maxresdefault.jpg)](https://youtu.be/uUdbQkq5c4k) ## Using via Web UI These tools are also available through the [Web Dashboard](/docs/ui). ## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md). **Optional but strongly recommended:** - Scanned capabilities - see [Capability Management](capability-management.md) - Resources synced to Vector DB - automatically handled by the [dot-ai-controller](/docs/controller) > **Note**: While the query tool technically works without capabilities and resources in the Vector DB (falling back to raw kubectl commands), AI performance and accuracy will be severely reduced. Capabilities provide semantic understanding (knowing that "database" relates to StatefulSet, CNPG, etc.), and resources provide fast inventory lookups. Without these, the AI loses its ability to answer concept-based queries and must rely solely on kubectl, resulting in slower and less intelligent responses. ## Related Projects **[dot-ai-controller](/docs/controller)** - A Kubernetes controller that automatically scans cluster capabilities and syncs resources to the Vector DB. This enables the query tool's semantic understanding and fast inventory lookups without manual setup. 
## Overview The DevOps AI Toolkit query feature provides: - **Natural language cluster queries** - Ask questions about your cluster in plain English - **Semantic understanding** - Query by concept ("databases", "web servers") not just resource names - **Multi-source intelligence** - Combines Vector DB knowledge with live kubectl data - **AI-powered orchestration** - LLM decides which tools to use based on your intent - **Read-only operations** - Safe cluster exploration without risk of modifications ### How AI-Driven Querying Works **Traditional Problem**: Finding cluster resources requires knowing kubectl syntax, label conventions, and resource types. **AI Query Solution**: Describe what you want to find in natural language, and AI orchestrates the right tools. **Query Flow**: 1. **Intent Analysis**: AI interprets your natural language query 2. **Tool Selection**: AI chooses appropriate tools (Vector DB search, kubectl commands) 3. **Data Gathering**: Tools execute and return results 4. **Response Synthesis**: AI summarizes findings in human-readable format ### Three Data Sources The query tool combines three data sources to answer your questions intelligently: | Source | What it provides | Example use | |--------|------------------|-------------| | **Capabilities** | Semantic understanding - knows that "database" relates to StatefulSet, CNPG Cluster, Crossplane RDS, etc. | "What databases are running?" | | **Resources** | Searchable inventory - find resources by name patterns, labels, or annotations without knowing kubectl syntax or exact resource locations | "Find all resources with label team=platform" | | **Kubernetes API** | Live cluster state - real-time status, events, logs, and detailed resource information via kubectl | "Describe the my-postgres cluster" | This combination enables natural language queries while providing accurate, live cluster data. 
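The capabilities source in the table above can be pictured as a mapping from concepts to concrete resource kinds; a toy shell sketch (the real system does this with semantic search over the Vector DB, and the kinds listed here are illustrative):

```bash
# Toy model of the "capabilities" source: map a concept from a natural
# language query to concrete resource kinds. The real implementation uses
# semantic search over a Vector DB; this hard-coded mapping is illustrative.
concept_to_kinds() {
  case "$1" in
    database)  echo "statefulsets.apps clusters.postgresql.cnpg.io" ;;
    webserver) echo "deployments.apps services ingresses.networking.k8s.io" ;;
    *)         echo "" ;;
  esac
}

kinds=$(concept_to_kinds database)
echo "kinds for 'database': $kinds"
# The live-data step would then query the Kubernetes API, e.g.:
#   kubectl get $kinds --all-namespaces
```

This is why capabilities matter: without the concept-to-kinds step, the AI has to guess which resource types to inspect with kubectl.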
## Complete Workflow Examples > **Tip**: If your AI coding agent doesn't automatically use the query tool, add "use dot-ai" to your query (e.g., "What databases are running? Use dot-ai"). This section demonstrates common query patterns: - **[Semantic Queries](#semantic-queries)** - Find resources by concept - **[Direct Queries](#direct-queries)** - Ask about specific resources or namespaces - **[Status Queries](#status-queries)** - Check health and troubleshoot issues ### Direct Queries Direct queries ask about specific resources. The AI finds the resource and retrieves detailed information without needing you to specify the exact namespace or resource type. #### Example: Describe a specific resource ```text Describe the my-postgres cluster ``` **Response:** ```text my-postgres Cluster Type: CloudNativePG (CNPG) PostgreSQL Cluster Namespace: default Status: Healthy (1/1 instances ready) Cluster Overview | Property | Value | |--------------------|-----------------| | PostgreSQL Version | 17.4 | | Age | 22 minutes | | Instances | 1 (single node) | | Primary Pod | my-postgres-1 | | Database | app | Configuration | Setting | Value | |-----------------------|-----------------------| | Storage | 1Gi persistent volume | | Encoding | UTF8 (C locale) | | WAL Level | logical | | TLS | TLSv1.3 (min/max) | | Max Parallel Workers | 32 | | Max Replication Slots | 32 | | Superuser Access | Disabled (secure) | Services | Service | Purpose | |----------------|----------------------| | my-postgres-rw | Read-write (primary) | | my-postgres-r | Read | | my-postgres-ro | Read-only | The cluster is production-ready with enterprise features including automated failover support, backup/recovery capabilities, and secure TLS communication. ``` **What happened**: The AI searched the resources inventory to locate `my-postgres`, identified it as a CNPG Cluster in the default namespace, then used the Kubernetes API to get detailed status and configuration. 
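The service names in the response follow CNPG's `<cluster>-rw` / `-r` / `-ro` convention, so an application in the same cluster would typically target the read-write service. An illustrative Deployment environment fragment (the host name assumes the `my-postgres` cluster in the `default` namespace, as in the example above):

```yaml
# Illustrative only: point an application at the CNPG read-write service.
env:
  - name: DATABASE_HOST
    value: my-postgres-rw.default.svc.cluster.local  # <cluster>-rw service
  - name: DATABASE_PORT
    value: "5432"
```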
# Recommendation Guide **Complete guide for AI-powered Kubernetes deployment recommendations with the DevOps AI Toolkit.** ## See It In Action [![DevOps AI Toolkit: AI-Powered Application Deployment](https://img.youtube.com/vi/8Yzn-9qQpQI/maxresdefault.jpg)](https://youtu.be/8Yzn-9qQpQI) This video explains the platform engineering problem and demonstrates the Kubernetes deployment recommendation workflow from intent to running applications. ## Using via Web UI These tools are also available through the [Web Dashboard](/docs/ui). ## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md) to set up: - DevOps AI Toolkit deployed - AI model API key configured (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for supported models and setup) - `KUBECONFIG` pointing to your Kubernetes cluster (optional but recommended) **Required - Capability Management:** - Vector DB service (Qdrant) for capability storage - Cluster capabilities discovered via [Capability Management Guide](capability-management.md) - **Note**: Recommendations will fail without capabilities - the system requires semantic understanding of your cluster resources **Optional - Enhanced with Organizational Patterns:** - Organizational patterns created via [Pattern Management Guide](../organizational-data/patterns.md) - When configured, recommendations automatically include organizational best practices ## Overview **What it does**: Provides AI-powered Kubernetes deployment recommendations using cluster capabilities, and installs third-party applications via Helm charts when no matching capability exists. Features semantic capability matching, organizational pattern integration, and guided workflow from intent clarification through deployment. **Use when**: You need intelligent deployment recommendations that leverage your cluster's actual capabilities, or want to install third-party tools (Prometheus, Argo CD, Crossplane, etc.) 
via Helm charts with AI-guided configuration. 📖 **Full Guide**: This document covers the complete recommendation workflow with detailed examples and behind-the-scenes explanations. ### Key Features The DevOps AI Toolkit recommendation feature provides: - **Intent refinement guidance** - When intent is vague, provides guidance for the client agent (Claude Code, Cursor, etc.) to analyze the project and gather details before proceeding - **Semantic capability understanding** - Uses discovered resource capabilities for intelligent semantic matching - **AI-powered recommendations** - Suggests optimal deployment approaches based on capability-filtered resources with rich context - **Enhanced resource discovery** - Finds resources like `sqls.devopstoolkit.live` for "PostgreSQL database" requests through semantic search - **Organizational pattern integration** - Automatically enhances recommendations with your team's best practices and standards - **Guided configuration** - Step-by-step configuration collection for your deployments - **Manifest generation** - Creates ready-to-deploy Kubernetes YAML files - **Helm chart installation** - Discovers and installs third-party applications via ArtifactHub when no cluster capability matches - **Automated deployment** - Deploys and monitors your applications ### How Capability-Enhanced Recommendations Work **Traditional Problem**: AI sees `sqls.devopstoolkit.live` as a meaningless name among many cluster resources. **Capability Solution**: AI understands that `sqls.devopstoolkit.live` provides PostgreSQL database capabilities with multi-cloud support. **Recommendation Flow**: 1. **User Intent**: "I need a PostgreSQL database" 2. **Capability Search**: Semantic search finds relevant resources with capability context 3. **Pre-filtered Results**: AI gets semantically relevant resources instead of all generic resource names 4. 
**Enhanced Recommendations**: Optimal matches like `sqls.devopstoolkit.live` appear as top suggestions **Helm Fallback**: When no cluster capability matches (e.g., "Install Prometheus" but no Prometheus operator installed), the system automatically searches ArtifactHub for Helm charts and presents installation options. ## Complete Workflow Examples ### Example 1: Stateless Golang Web Application ```md User: I want to deploy my app ``` ```md Agent: [Uses mcp__dot-ai__recommend with intent "deploy my app"] The intent needs more detail. Let me analyze your project to understand what you're deploying. [Agent analyzes the project directory, reads package.json, Dockerfile, source code, etc.] Based on my analysis of your project: - **Language**: Go (found go.mod, main.go) - **Type**: Web application (HTTP server on port 8080) - **Dependencies**: Connects to PostgreSQL database (found database connection code) - **Container**: Ready to containerize (Dockerfile present) Is this for production or development? And should the app be publicly accessible? ``` ```md User: It's for production, needs external access, expects moderate traffic with occasional spikes. ``` ```md Agent: Perfect! Based on my analysis and your requirements, let me get production-focused deployment recommendations. [Uses mcp__dot-ai__recommend with refined intent "deploy stateless Golang web application for production with external access via ingress, moderate traffic with occasional spikes requiring autoscaling, connects to PostgreSQL database, container image available" and final: true] I found 5 deployment solutions for your Golang web application. 
All include: - Deployment - manages your stateless app with rolling updates - Service - internal load balancing and service discovery - Ingress - external HTTP/HTTPS access - HorizontalPodAutoscaler - handles traffic spikes The difference is how PostgreSQL is provisioned: | # | Solution | Database Approach | Score | |-----|---------------------------|----------------------------------------------------|-------| | 1 | Multi-cloud (Recommended) | DevOps Toolkit SQL - portable across AWS/GCP/Azure | 96 | | 2 | Azure-native | Azure Flexible Server + ResourceGroup | 92 | | 3 | AWS-native | RDS DBInstance | 90 | | 4 | GCP-native | Cloud SQL DatabaseInstance + Database | 88 | | 5 | UpCloud | ManagedDatabasePostgreSQL | 85 | Organizational Context: - Solutions 1 & 2 follow your organization's patterns (DevOps Toolkit DB pattern, Azure ResourceGroup pattern) - Policy enforced: Applications must never use the latest image tag - Database regions are standardized (us-east-1 for AWS, us-east1 for GCP, eastus for Azure) 📊 **View visualization**: http://localhost:3000/v/sol-... Which solution would you like to proceed with? 
(1-5) ``` **What happened behind the scenes:** - First call with vague intent ("deploy my app" - under 100 characters) triggered intent refinement guidance - Agent received guidance to analyze the project and gather details before proceeding - Agent used its local tools (file reading, code analysis) to discover: Go language, web app type, PostgreSQL dependency, Dockerfile - Agent discussed findings with user and gathered missing context (production, external access, traffic pattern) - Second call with detailed intent + `final: true` proceeded to recommendations - Capability search found relevant resources: Deployment, Service, Ingress, HorizontalPodAutoscaler, plus database operators - Vector DB search found matching organizational patterns: DevOps Toolkit DB pattern, Azure ResourceGroup pattern - Generated 5 solution alternatives with different cloud provider options, scoring based on capability relevance and pattern alignment - Policy intents automatically enforced (e.g., no `latest` image tags, standardized regions) - Visualization URL generated for exploring solutions in Web UI (shows resources, patterns, and configuration details) #### Visualization (Optional) The visualization URL opens an interactive view in the Web UI showing all solution options with their resources, patterns, and scoring: > **Note**: Visualization URLs require additional setup: > - Configure `webUI.baseUrl` in your Helm values (see [Deployment Guide - Web UI Visualization](../setup/deployment.md#web-ui-visualization)) > - Install the Web UI (see [Web UI Documentation](/docs/ui)) **Note:** Requires capability scanning via [Capability Management Guide](capability-management.md) and optionally organizational patterns via [Pattern Management](../organizational-data/patterns.md). Policy requirements shown require active policy intents created via [Policy Management Guide](../organizational-data/policies.md). ```md User: I'll go with option 1 ``` ```md Agent: Excellent choice! 
Let me get the configuration requirements. [Uses mcp__dot-ai__recommend with stage: 'chooseSolution', solutionId: 'sol-1765409923079-fa3f055c'] I need some required configuration details: **Application:** - Name - What should the app be called? (e.g., my-golang-app) - Namespace - Where to deploy? Options: a-team, b-team, default, dot-ai - Container image - Full image URI with version tag (e.g., myregistry.io/app:v1.0.0) - latest tag is not allowed per policy - Port - What port does your app listen on? (default: 8080) - Ingress hostname - Public DNS name (e.g., myapp.example.com) **Database:** - Cloud provider - aws, azure, or gcp? - Size - small, medium, or large? - PostgreSQL version - e.g., 15, 14.8 **Output:** - Format - raw YAML, helm chart, or kustomize? - Output path - Where to save files? (e.g., ./manifests) You can answer all at once or just the ones you want to customize (I'll use sensible defaults for the rest). ``` **What happened behind the scenes:** - recommend tool with stage 'chooseSolution' loaded the selected solution and returned only the required stage questions - Questions generated from resource schema analysis: Deployment needs name/image/port, Ingress needs hostname, database needs provider/size/version - Policy intents enforced: "latest" image tag not allowed (flagged in question) - New packaging questions added: `outputFormat` (raw/helm/kustomize) and `outputPath` for GitOps-ready output - System categorizes questions as required (must answer) to proceed ```md User: my-app, a-team namespace, ghcr.io/vfarcic/silly-demo:v1.5.223, port 8080, silly-demo.acme.com for ingress. AWS provider, small database, PostgreSQL 15. Output as Kustomize to ./manifests. ``` ```md Agent: Perfect! Let me configure the required settings. 
[Uses mcp__dot-ai__recommend with stage: 'answerQuestion:required', solutionId, answers: {name: "my-app", namespace: "a-team", container-image: "ghcr.io/vfarcic/silly-demo:v1.5.223", application-port: 8080, ingress-host: "silly-demo.acme.com", database-cloud-provider: "aws", database-size: "small", database-version: "15.0", outputFormat: "kustomize", outputPath: "./manifests"}] Required configuration saved. Now for basic settings (all optional - I can use defaults): **Scaling:** - Initial replicas: 2 - Min replicas (HPA): 2 - Max replicas (HPA): 10 **Resources per pod:** - CPU: 250m request / 500m limit - Memory: 256Mi request / 512Mi limit **Networking:** - Ingress class: nginx - Enable TLS/HTTPS: yes **Database:** - Database name: appdb Would you like to customize any of these, or should I proceed with these defaults? (say "skip" to use defaults) ``` **What happened behind the scenes:** - recommend tool with stage 'answerQuestion:required' processed required stage answers and updated solution file - Validated all required fields were provided (name, namespace, image, port, ingress host, database config) - Captured output format (Kustomize) and path (./manifests) for later manifest generation - Determined next stage is 'basic' and returned basic configuration questions with sensible defaults - Questions are optional at this stage - user can answer, customize specific values, or skip to use defaults ```md User: 2 replicas, DB name: mydb ``` ```md Agent: Basic settings saved. 
Now for advanced configuration (all optional): [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:basic', solutionId, answers: {replicas: 2, hpa-min-replicas: 2, hpa-max-replicas: 10, cpu-request: "250m", cpu-limit: "500m", memory-request: "256Mi", memory-limit: "512Mi", ingress-class: "nginx", database-name: "mydb", enable-tls: true}] **Deployment Strategy:** - RollingUpdate with max surge 1, max unavailable 0 (zero-downtime) **Health Probes:** - Readiness: /health, 10s initial delay, 10s period - Liveness: /health, 30s initial delay, 15s period **Autoscaling:** - Target CPU: 75% - Target memory: 75% **Resilience:** - PodDisruptionBudget: min 1 available - Termination grace period: 30s - Pod anti-affinity: spread across nodes **Security:** - Run as non-root: yes - Read-only filesystem: no - TLS secret name: app-tls-cert **Database:** - Backups enabled: yes Would you like to customize any of these, or say "skip" to use defaults? ``` **What happened behind the scenes:** - recommend tool with stage 'answerQuestion:basic' processed basic stage answers (replicas: 2, database-name: mydb, plus defaults) - Updated solution file and determined next stage is 'advanced' - Returned advanced configuration questions focused on deployment strategy, health probes, autoscaling, resilience, and security ```md User: Skip advanced configuration ``` ```md Agent: No problem! Moving to the final configuration stage. [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:advanced', solutionId, answers: {}] Any additional requirements or constraints for your deployment? ``` **What happened behind the scenes:** - recommend tool with stage 'answerQuestion:advanced' processed empty advanced stage answers (using defaults for all) - Determined next stage is 'open' and returned the open-ended question ```md User: N/A ``` ```md Agent: Perfect! Configuration complete. 
[Uses mcp__dot-ai__recommend with stage: 'answerQuestion:open', solutionId, answers: {open: "N/A"}] Configuration complete. Ready to generate deployment manifests. ``` **What happened behind the scenes:** - recommend tool with stage 'answerQuestion:open' processed open stage with "N/A" (no additional requirements) - All configuration stages now complete (required ✓, basic ✓, advanced defaults, open ✓) - Returned status: 'ready_for_manifest_generation', nextAction: 'generateManifests' ```md User: Generate the manifests ``` ```md Agent: Configuration complete. Generating your Kustomize manifests... [Uses mcp__dot-ai__recommend with stage: 'generateManifests', solutionId] Your Kustomize manifests are ready at ./manifests/: manifests/ ├── kustomization.yaml # Points to overlays/production ├── README.md ├── overlays/ │ └── production/ │ └── kustomization.yaml # ← Edit this file to customize └── base/ ├── kustomization.yaml ├── namespace.yaml ├── deployment.yaml # Image without tag (tag set in overlay) ├── service.yaml ├── ingress.yaml ├── hpa.yaml ├── pdb.yaml ├── sql.yaml └── secret.yaml **What's included:** - **Deployment** - 2 replicas, rolling updates, pod anti-affinity, non-root security - **Service** - ClusterIP on port 80 - **Ingress** - silly-demo.acme.com with TLS via cert-manager - **HPA** - scales 2-10 replicas based on CPU/memory (75% threshold) - **PDB** - ensures at least 1 pod available during disruptions - **SQL** - AWS PostgreSQL 15 (small) via DevOps Toolkit pattern - **Secret** - database credentials (placeholder - will be populated by Crossplane) **To customize:** Edit `overlays/production/kustomization.yaml`: - `image:` section to change image tag (for upgrades) - `patches:` for other customizations Deploy with: kubectl apply -k ./manifests 📊 **View visualization**: http://localhost:3000/v/sol-... Would you like me to deploy this to your cluster now? 
``` **What happened behind the scenes:** - recommend tool with stage 'generateManifests' loaded the fully configured solution file - AI generated Kustomize structure based on `outputFormat: kustomize` selection - Raw manifests generated first, validated via `kubectl kustomize | kubectl apply --dry-run=server` - AI packaged validated manifests into Kustomize structure: - `base/` contains generic manifests (image without tag) - `overlays/production/kustomization.yaml` sets image tag and customizations - Response includes `files` array with `relativePath` and `content` for each file - Agent writes files to user's chosen `outputPath` (./manifests) - Each resource includes dot-ai.io/ labels for tracking - README.md generated with deployment instructions and overlay examples - Visualization URL generated for exploring deployment configuration in Web UI #### Visualization (Optional) The visualization URL opens an interactive view with multiple tabs including resource allocation, autoscaling configuration, health monitoring, generated manifests, traffic flow, and more: ```md User: Yes, deploy it! ``` ```md Agent: Deploying your application... [Uses mcp__dot-ai__recommend with stage: 'deployManifests', solutionId] 🎉 **Deployed successfully!** All resources were created: - namespace/a-team configured - deployment.apps/my-app created - service/my-app created - ingress.networking.k8s.io/my-app created - horizontalpodautoscaler.autoscaling/my-app created - poddisruptionbudget.policy/my-app created - sql.devopstoolkit.live/my-app-db created - secret/my-app-db-credentials created **Next steps:** 1. Wait for the SQL resource to provision the AWS RDS PostgreSQL instance (may take 5-10 minutes) 2. Once RDS is ready, Crossplane will populate the actual database credentials 3. The app will become healthy once it can connect to the database 4. 
Access your app at https://silly-demo.acme.com (ensure DNS points to your ingress controller) **Verify status:** kubectl get all -n a-team kubectl get sql -n a-team kubectl get ingress -n a-team ``` **What happened behind the scenes:** - recommend tool with stage 'deployManifests' applied Kustomize manifests using `kubectl apply -k` - Created all resources: Namespace, Deployment, Service, Ingress, HPA, PDB, SQL (database), Secret - SQL resource triggers Crossplane to provision AWS RDS PostgreSQL instance - Provided next steps and verification commands ### Example 2: Third-Party Application Installation (Helm) This example demonstrates installing third-party tools via Helm charts when no matching cluster capability exists. ```md User: Install Prometheus ``` ```md Agent: [Uses mcp__dot-ai__recommend with intent "install Prometheus monitoring" and final: true] Found a Prometheus Helm chart solution: **prometheus-community/prometheus (v27.49.0)** - App Version: v3.8.0 - Official chart maintained by prometheus-community - Verified Publisher with 523 stars - Repository: https://prometheus-community.github.io/helm-charts This is the standard, recommended chart for installing Prometheus monitoring on Kubernetes. 📊 **View visualization**: http://localhost:3000/v/sol-... Would you like me to proceed with this chart? I'll guide you through the configuration options. 
``` **What happened behind the scenes:** - System detected "Install Prometheus" as a third-party installation intent - No Prometheus-related CRDs found in cluster capabilities, triggering Helm chart discovery - ArtifactHub API searched for "prometheus" charts - AI analyzed and scored charts based on official status, verified publisher, and popularity - Visualization URL generated for exploring Helm chart details in Web UI #### Visualization (Optional) The visualization URL opens an interactive view with multiple tabs including monitoring stack architecture, data collection flow, chart information, stack components, and more: > **Note**: Visualization URLs require additional setup: > - Configure `webUI.baseUrl` in your Helm values (see [Deployment Guide - Web UI Visualization](../setup/deployment.md#web-ui-visualization)) > - Install the Web UI (see [Web UI Documentation](/docs/ui)) ```md User: Yes, continue ``` ```md Agent: [Uses mcp__dot-ai__recommend with stage: 'chooseSolution', solutionId] I need some details to configure your Prometheus installation: **Required Configuration:** - What name should be used for this Prometheus deployment? - Which namespace should Prometheus be deployed to? - Should Prometheus use persistent storage for metrics data? ``` **What happened behind the scenes:** - System fetched chart values.yaml and README via Helm CLI - AI analyzed configuration options and categorized into required/basic/advanced questions - Cluster context gathered: available namespaces, StorageClasses - Questions include suggested answers based on chart defaults and cluster state ```md User: prometheus, monitoring namespace, no persistence ``` ```md Agent: [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:required', solutionId, answers] Here are some basic configuration options. Would you like to customize these or use defaults? 
**Key settings (defaults shown):** - Data retention: 15 days - Scrape interval: 1 minute - AlertManager: enabled - kube-state-metrics: enabled (cluster object metrics) - node-exporter: enabled (node hardware metrics) - PushGateway: disabled - Ingress: disabled ``` ```md User: Enable AlertManager ``` ```md Agent: [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:basic', solutionId, answers] Basic configuration set. Now for advanced options. [Uses mcp__dot-ai__recommend with stage: 'answerQuestion:advanced', solutionId, answers: {}] Skipping advanced options. Ready to generate the installation. [Uses mcp__dot-ai__recommend with stage: 'generateManifests', solutionId] Helm installation ready. Here's the configuration: **Helm Command:** helm upgrade --install prometheus prometheus-community/prometheus \ --namespace monitoring --create-namespace \ --version 27.49.0 -f values.yaml **Values (values.yaml):** server: global: scrape_interval: 1m scrape_timeout: 10s evaluation_interval: 1m retention: 15d persistentVolume: enabled: false alertmanager: enabled: true kube-state-metrics: enabled: true prometheus-node-exporter: enabled: true prometheus-pushgateway: enabled: false 📊 **View visualization**: http://localhost:3000/v/sol-... Ready to install? ``` **What happened behind the scenes:** - AI generated values.yaml containing only non-default values - System ran `helm upgrade --install --dry-run` to validate the installation - Returned helm command and values for user review ```md User: Yes, deploy ``` ```md Agent: [Uses mcp__dot-ai__recommend with stage: 'deployManifests', solutionId] Prometheus deployed successfully to the monitoring namespace. 
**Deployed Components:** - Prometheus Server - AlertManager - kube-state-metrics - node-exporter **Access Prometheus UI:** kubectl --namespace monitoring port-forward svc/prometheus-server 9090:80 Then open http://localhost:9090 **Access AlertManager:** kubectl --namespace monitoring port-forward svc/prometheus-alertmanager 9093:9093 Then open http://localhost:9093 **Internal DNS names:** - Prometheus: prometheus-server.monitoring.svc.cluster.local:80 - AlertManager: prometheus-alertmanager.monitoring.svc.cluster.local:9093 ``` **What happened behind the scenes:** - Helm repository added and `helm upgrade --install --wait` executed - Helm waited for all pods to become ready before reporting success - Access instructions provided based on installed components #### Post-Deployment Visualization After deployment, return to the Web UI visualization and click **Reload** to see updated runtime information including deployed components, pod status, and live cluster state: ## See Also - **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration - **[Tools and Features Overview](overview.md)** - Browse all available tools and features # Remediate Guide **Complete guide for AI-powered Kubernetes issue analysis and remediation with the DevOps AI Toolkit.** ## Using via Web UI These tools are also available through the [Web Dashboard](/docs/ui). 
## Prerequisites Before using this guide, complete the [Deployment Guide](../setup/deployment.md) to set up: - DevOps AI Toolkit deployed - AI model API key configured (see [AI Model Configuration](../setup/deployment.md#ai-model-configuration) for supported models and setup) - `KUBECONFIG` pointing to your Kubernetes cluster (optional for pure analysis) - kubectl command-line tool installed **Required Dependencies:** - kubectl access to cluster for investigation (get, describe, logs, events, top, explain operations) **Optional Dependencies:** - Write permissions for remediation execution (patch, apply, delete, scale, etc.) - only needed if you want to execute recommended fixes ## Overview The DevOps AI Toolkit remediate feature provides: - **AI-powered root cause analysis** - Multi-iteration investigation loop to understand Kubernetes issues - **Helm-aware diagnostics** - Detects and diagnoses issues with Helm-managed applications, including stuck releases and failed upgrades - **Intelligent data gathering** - Safe, read-only cluster investigation with targeted operations - **Comprehensive remediation plans** - Step-by-step solutions with risk assessment and validation - **Dual execution modes** - Manual approval workflow or automatic execution based on confidence/risk thresholds - **Iterative remediation support** - Handles complex issues requiring multiple sequential fixes - **Complete audit trail** - Full investigation history and remediation tracking ## Related Projects **[dot-ai-controller](/docs/controller/remediation-guide/)** - A Kubernetes controller that transforms reactive infrastructure management into proactive, AI-enhanced operations. The controller monitors Kubernetes cluster events and automatically triggers this project's remediation tool for intelligent analysis and resolution. 
It supports both automatic remediation (where AI directly implements solutions) and manual remediation (where AI provides recommendations for human review), with configurable filtering, rate limiting, Slack notifications, and confidence/risk thresholds. This enables continuous cluster health monitoring with intelligent auto-healing capabilities.

### How AI-Driven Investigation Works

**Traditional Problem**: Basic kubectl commands show symptoms but require manual analysis to identify root causes.

**AI Investigation Solution**: A multi-step analysis loop gathers relevant data and reasons through complex interdependencies.

**Investigation Flow**:

1. **Initial Analysis**: AI reviews the issue description
2. **Data Gathering Loop**: AI requests specific kubectl operations to gather targeted information
3. **Iterative Analysis**: Each data point informs the next investigation step
4. **Root Cause Identification**: AI synthesizes findings into a comprehensive analysis
5. **Remediation Generation**: Specific kubectl commands with risk assessment and validation instructions

## Complete Workflow Examples

This section demonstrates both manual and automatic execution modes:

- **[Manual Mode Example](#manual-mode-example)** - User controls execution timing with approval choices
- **[Automatic Mode Example](#automatic-mode-example)** - Tool executes automatically based on confidence/risk thresholds

### Manual Mode Example {#manual-mode-example}

#### Option 1: Automatic Execution

**User selects Choice 1** - Tool executes commands internally and returns complete results

```text
User: 1
```

```text
Agent: I'll execute the remediation automatically and validate the results.
[Uses remediate tool with executeChoice: 1, sessionId: "rem_2025-09-20T1653_818d80901b10ce62", mode: "manual"]

✅ **Issue Resolved Successfully!**

**Executed Command:**

kubectl apply -f - <
```

### Automatic Mode Example {#automatic-mode-example}

**Automatic Mode**: Tool executes the first remediation automatically, then falls back to manual mode when validation discovers additional issues

```
User: Something is wrong with my database in sql-demo namespace. Remediate it automatically.
```

```
Agent: I'll help you remediate the database issue in the sql-demo namespace. Let me use the remediate tool to automatically analyze and fix the problem.

[Uses remediate tool with issue: "Database issue in sql-demo namespace", mode: "automatic", maxRiskLevel: "medium", confidenceThreshold: 0.7]

The AI analysis has identified the root cause of your database issue with 95% confidence:

Root Cause: The SQL resource 'test-db' is failing because the KCL function is trying to concatenate 'POSTGRES_' with a missing 'version' field in the resource specification.

Recommended Action:
- Add a version field to the SQL resource with PostgreSQL version 13
- Risk Level: Low
- Command: kubectl patch sql/test-db -n sql-demo --type=merge -p '{"spec":{"version":"13"}}'

Choose execution method:
1. Execute automatically (Recommended) - Run the command automatically - Risk: Low
2. Execute via agent - I'll run the kubectl command using the Bash tool - Then validate the results - Risk: Low

Which option would you prefer?
(1 or 2)
```

**What happened behind the scenes:**

- Agent interpreted "automatically" and set confidence/risk thresholds (70% confidence, medium risk allowed)
- Tool executed the first remediation automatically (compositionRef issue): `kubectl patch sql/test-db -n sql-demo --type=merge -p '{"spec":{"crossplane":{"compositionRef":{"name":"google-postgresql"}}}}'`
- Post-execution validation discovered a new issue (missing version field)
- **Safety mechanism activated**: Tool switches to manual mode when validation finds additional issues
- User sees the current issue analysis, but the tool includes `"executed": true` and `"results": [...]` showing the previous automatic execution
- This prevents potentially dangerous infinite execution loops while maintaining transparency

**Key Automatic Mode Features Demonstrated:**

- **Single-issue auto-execution**: First issue (compositionRef) was fixed without user intervention
- **Intelligent thresholds**: Agent interpreted "automatically" as moderate risk tolerance (70% confidence, medium risk)
- **Safety fallback**: After the first fix, manual approval is required for additional issues discovered during validation
- **Complete transparency**: JSON shows both previous execution results and the current issue analysis
- **User control**: Users can specify execution preferences like "high confidence only" or "allow risky fixes"

# System Status Guide

**Quick diagnostics and health check for your DevOps AI Toolkit installation.**

## Prerequisites

Before using this guide, complete the [Deployment Guide](../setup/deployment.md).

## Overview

**What it does**: Provides comprehensive system diagnostics including version info, component health, connection status, and capability summary.

**Use when**: Troubleshooting setup issues, verifying installation, or checking what features are available.
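The same components can be sanity-checked from a terminal before asking the agent; a minimal sketch, assuming the `QDRANT_URL`, `ANTHROPIC_API_KEY`, and `OPENAI_API_KEY` variable names used in the troubleshooting section of this guide:

```shell
#!/usr/bin/env bash
# Pre-flight checks mirroring the components the status tool reports.
# The env var names are assumptions taken from this guide; adjust them
# to match your deployment.

check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK $name"
  else
    echo "FAIL $name"
  fi
}

check kubernetes kubectl cluster-info
check vector-db curl -fsS "${QDRANT_URL:-http://localhost:6333}/collections"
check ai-api-key test -n "${ANTHROPIC_API_KEY:-}${OPENAI_API_KEY:-}"
```

A `FAIL` line points at the same component the status table would flag, so you can jump straight to the matching troubleshooting entry below.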
## Usage Example ```md User: Show dot-ai status ``` ```md Agent: [Uses mcp__dot-ai__version] **System Health: ✅ Healthy** | Component | Status | |-------------|---------------------------------| | Version | 0.179.0 | | Kubernetes | Connected (v1.34.0, in-cluster) | | AI Provider | Anthropic (claude-haiku-4-5) | | Vector DB | Connected | | Embedding | OpenAI text-embedding-3-small | | Kyverno | v1.16.1 (ready) | | Tracing | OTLP → Jaeger | **Collections** | Collection | Status | Documents | |--------------|----------|-----------| | capabilities | ✅ Ready | 523 | | patterns | ✅ Ready | 12 | | policies | ✅ Ready | 8 | **Active Capabilities** - Policy intent management - Capability scanning - Semantic search - AI recommendations - Kubernetes integration - Kyverno policy generation 📊 **View visualization**: http://localhost:3000/v/ver-... ``` **What happened behind the scenes:** - System checked connectivity to all components (Kubernetes, Vector DB, AI provider) - Retrieved collection statistics from Qdrant - Verified Kyverno installation and readiness - Checked tracing configuration - Visualization URL generated for exploring system status in Web UI ### Visualization (Optional) The visualization URL opens an interactive dashboard showing system health, component status, and configuration details: > **Note**: Visualization URLs require additional setup: > - Configure `webUI.baseUrl` in your Helm values (see [Deployment Guide - Web UI Visualization](../setup/deployment.md#web-ui-visualization)) > - Install the Web UI (see [Web UI Documentation](/docs/ui)) ## Troubleshooting Common Issues ### Kubernetes Not Connected ```text | Kubernetes | ❌ Not connected | ``` **Causes:** - `KUBECONFIG` environment variable not set - Kubeconfig file doesn't exist or is invalid - Cluster is unreachable **Solutions:** 1. Verify kubeconfig exists: `ls $KUBECONFIG` 2. Test connectivity: `kubectl cluster-info` 3. 
Check that the server has access to the kubeconfig file

### Vector DB Not Connected

```text
| Vector DB | ❌ Not connected |
```

**Causes:**

- Qdrant service not running
- Wrong `QDRANT_URL` configuration
- Network connectivity issues

**Solutions:**

1. Verify Qdrant is running: `kubectl get pods -n dot-ai | grep qdrant`
2. Check the `QDRANT_URL` environment variable
3. Test connectivity: `curl $QDRANT_URL/collections`

### Collections Not Created

```text
| capabilities | ⚠️ Not created |
```

**Causes:**

- Capability scan hasn't been run yet
- Vector DB connectivity issues

**Solutions:**

1. Run a capability scan: See [Capability Management Guide](capability-management.md)
2. For patterns: See [Pattern Management Guide](../organizational-data/patterns.md)
3. For policies: See [Policy Management Guide](../organizational-data/policies.md)

### AI Provider Issues

```text
| AI Provider | ❌ Not configured |
```

**Causes:**

- Missing API key (e.g., `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`)
- Invalid API key
- Provider service unavailable

**Solutions:**

1. Verify the API key is set in the environment or Helm values
2. Test API key validity with the provider's API
3. Check [AI Model Configuration](../setup/deployment.md#ai-model-configuration)

## See Also

- **[Deployment Guide](../setup/deployment.md)** - Server deployment and configuration
- **[Capability Management Guide](capability-management.md)** - Scan cluster capabilities
- **[Tools and Features Overview](overview.md)** - Browse all available tools

## MCP

# MCP

**Connect to the DevOps AI Toolkit Engine via MCP protocol from your preferred coding assistant.**

MCP (Model Context Protocol) is an open protocol that lets AI coding assistants connect to external tools and data sources. When you connect your coding assistant to the DevOps AI Toolkit via MCP, all toolkit tools become available directly in your editor — no context switching required.
> **Alternative**: The [CLI](/docs/cli) provides command-line access to all toolkit capabilities — for AI agents (with lower token overhead), scripting, CI/CD pipelines, and direct resource access beyond what MCP exposes.

## Prerequisites

- DevOps AI Toolkit Engine deployed to Kubernetes — see [AI Engine Deployment](/docs/ai-engine/setup/deployment)
- An MCP-compatible client (Claude Code, Cursor, VS Code, Cline, etc.)
- Your server URL and auth token from the deployment step

## Step 1: Configure Your Client

Create an `.mcp.json` file in your project root:

```json
{
  "mcpServers": {
    "dot-ai": {
      "type": "http",
      "url": "http://dot-ai.127.0.0.1.nip.io",
      "headers": {
        "Authorization": "Bearer YOUR_TOKEN"
      }
    }
  }
}
```

Replace `YOUR_TOKEN` with the token from your deployment (run `echo $DOT_AI_AUTH_TOKEN` to view it).

**Notes**:

- Replace the URL with your actual hostname if you changed `ingress.host` during deployment.
- For production deployments with TLS, use `https://` in the URL.

## Step 2: Start Your Client

Start your MCP client. It will automatically connect to the server.

- **Claude Code**: Run `claude` in the directory containing `.mcp.json`
- **Cursor**: The MCP connection activates when you open the project
- **VS Code / Cline**: The MCP connection activates based on your extension settings

## Step 3: Verify the Connection

In your MCP client, ask:

```text
Show dot-ai status
```

You should see comprehensive system status including Kubernetes connectivity, vector database, and all available tools.

## Client Compatibility

The DevOps AI Toolkit works with any MCP-compatible coding agent or development tool.
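Rather than pasting the token into the configuration by hand, the Step 1 file can be generated from the environment; a minimal sketch, assuming `DOT_AI_AUTH_TOKEN` is still exported from the deployment step and the default ingress host is unchanged:

```shell
# Generate .mcp.json from the environment so the auth token never has to be
# pasted in manually. DOT_AI_AUTH_TOKEN and the host below are assumptions
# carried over from the deployment steps; adjust them to your setup.
DOT_AI_AUTH_TOKEN="${DOT_AI_AUTH_TOKEN:-REPLACE_ME}"

cat > .mcp.json <<EOF
{
  "mcpServers": {
    "dot-ai": {
      "type": "http",
      "url": "http://dot-ai.127.0.0.1.nip.io",
      "headers": {
        "Authorization": "Bearer ${DOT_AI_AUTH_TOKEN}"
      }
    }
  }
}
EOF
```

This also keeps the token out of version control as long as `.mcp.json` is listed in `.gitignore`.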
### Popular MCP Clients **Claude Code** - Save `.mcp.json` in your project root - Start with `claude` — tools are automatically available **Cursor** - Settings -> "MCP Servers" -> Add configuration -> Restart **Cline (VS Code Extension)** - Configure in VS Code settings or extension preferences **VS Code (with MCP Extension)** - Add configuration to `settings.json` under `mcp.servers` **Other MCP Clients** - Any client supporting the Model Context Protocol standard - Use the HTTP configuration pattern shown in [Step 1](#step-1-configure-your-client) ## Next Steps - **[Tools Overview](/docs/ai-engine/tools/overview)** — Explore all available tools and how they work together ## CLI # CLI Documentation **Command-line interface for AI-powered Kubernetes operations** ## What is the CLI? The CLI provides command-line access to all [DevOps AI Toolkit](/docs/ai-engine/) capabilities. It's a lightweight HTTP client designed for both AI agents and human operators who prefer terminal-based workflows. Unlike MCP (limited to 8 high-level tools to minimize context window usage), the CLI exposes **all REST API endpoints** since there's no token cost per command. This means you get access to direct resource queries, logs, events, and more—all from a single binary with zero runtime dependencies. 
**Key benefits:**

- **Single binary** — No installation dependencies, just download and run
- **Cross-platform** — Linux, macOS, Windows (amd64 + arm64)
- **Token efficient** — Lower token overhead than MCP for AI agents
- **Complete API access** — All 26 REST API endpoints (MCP exposes 8 tools)
- **Composable** — Shell piping, scripting, and CI/CD integration

## When to Use the CLI

The CLI is ideal for:

- **Scripting and automation** — Shell scripts, CI/CD pipelines, scheduled jobs
- **AI agent integration** — Lower token overhead than the MCP protocol
- **Direct API access** — Commands for resources, logs, events, and namespaces not available via MCP
- **Composability** — Pipe output between commands, combine with other CLI tools

For details on DevOps AI Toolkit features (query, recommend, remediate, etc.), see the [main documentation](/docs/ai-engine/).

## Getting Started

**[Quick Start](quick-start.md)** — Set up your AI agent to use the CLI

## Documentation

### Setup

- **[Installation](setup/installation.md)** — Homebrew, Scoop, binary download
- **[Configuration](setup/configuration.md)** — Server URL, authentication, output format
- **[Shell Completion](setup/shell-completion.md)** — Bash, Zsh, Fish autocompletion

### Guides

- **[Commands Overview](guides/cli-commands-overview.md)** — All available commands
- **[Skills Generation](guides/skills-generation.md)** — Enable AI agents to discover and use the CLI
- **[Output Formats](guides/output-formats.md)** — YAML vs JSON
- **[Automation](guides/automation.md)** — Scripting and CI/CD integration

## Architecture

```
┌─────────────┐
│     CLI     │
└──────┬──────┘
       │ HTTP (GET/POST/DELETE)
       │ Bearer auth, JSON body
       ▼
┌─────────────────────┐
│  DevOps AI Toolkit  │
│   REST API Server   │
└─────────────────────┘
```

The CLI is a stateless HTTP client that reads the embedded OpenAPI spec and generates commands dynamically. All commands map directly to REST API endpoints.
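Because every command is a thin wrapper over a REST endpoint, a call can be reproduced with plain `curl`; a sketch in which the `/version` path is a hypothetical example (real paths come from the server's OpenAPI spec) and the environment variables follow the configuration guide:

```shell
# What a CLI command does under the hood: an authenticated HTTP request.
# The /version path is an illustrative assumption, not a documented endpoint;
# the CLI derives the real paths from the embedded OpenAPI spec.
DOT_AI_URL="${DOT_AI_URL:-http://dot-ai.127.0.0.1.nip.io}"

curl -fsS \
  -H "Authorization: Bearer ${DOT_AI_AUTH_TOKEN:-}" \
  "${DOT_AI_URL}/version" \
  || echo "request failed with curl exit code $?"
```

This is occasionally useful for debugging authentication or ingress issues independently of the CLI itself.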
## Related Projects - **[DevOps AI Toolkit](/docs/ai-engine/)** — Main server (MCP + REST API) - **[Web UI](/docs/ui/)** — Visualizations and dashboards - **[Stack](/docs/stack/)** — Kubernetes deployment # Automation Use the CLI in scripts, CI/CD pipelines, and automated workflows. ## Exit Codes The CLI uses standard exit codes for automation: | Code | Meaning | |------|---------| | `0` | Success | | `1` | Tool execution error (server returned error) | | `2` | Connection error (server unreachable) | | `3` | Usage error (invalid arguments, missing required params) | ## Error Handling in Scripts **Check exit code:** ```bash #!/bin/bash if dot-ai <command>; then echo "Success" else echo "Failed with exit code $?" exit 1 fi ``` **Handle specific errors:** ```bash #!/bin/bash dot-ai <command> EXIT_CODE=$? case $EXIT_CODE in 0) echo "Success" ;; 1) echo "Server error" ;; 2) echo "Connection failed" ;; 3) echo "Invalid usage" ;; esac ``` ## CI/CD Integration ### GitHub Actions ```yaml name: Deploy on: [push] jobs: deploy: runs-on: ubuntu-latest steps: - name: Install CLI run: | curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-linux-amd64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai - name: Deploy application env: DOT_AI_URL: ${{ secrets.DOT_AI_URL }} DOT_AI_AUTH_TOKEN: ${{ secrets.DOT_AI_AUTH_TOKEN }} run: | dot-ai <command> --output json ``` ### GitLab CI ```yaml deploy: image: ubuntu:latest before_script: - apt-get update && apt-get install -y curl - curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-linux-amd64 -o /usr/local/bin/dot-ai - chmod +x /usr/local/bin/dot-ai script: - dot-ai <command> --output json variables: DOT_AI_URL: $DOT_AI_URL DOT_AI_AUTH_TOKEN: $DOT_AI_AUTH_TOKEN ``` ## Processing Output **Extract values with jq:** ```bash #!/bin/bash RESULT=$(dot-ai <command> --output json | jq -r '.result') echo "Result: $RESULT" ``` **Loop over array results:** ```bash #!/bin/bash dot-ai resources --kind Deployment --output json | \ jq -r 
'.items[].metadata.name' | \ while read name; do echo "Processing: $name" done ``` ## Configuration Best Practices **Use environment variables in CI/CD:** ```bash export DOT_AI_URL="https://dot-ai.example.com" export DOT_AI_AUTH_TOKEN="${SECRET_TOKEN}" export DOT_AI_OUTPUT_FORMAT="json" ``` **Don't hardcode credentials:** ```bash # Bad dot-ai <command> --token hardcoded-token # Good dot-ai <command> --token "${DOT_AI_AUTH_TOKEN}" ``` ## Scripting Examples **Conditional execution:** ```bash #!/bin/bash if dot-ai <command> --output json | jq -e '.healthy' > /dev/null; then echo "System healthy, proceeding..." # Continue with workflow else echo "System unhealthy, aborting" exit 1 fi ``` **Retry logic:** ```bash #!/bin/bash MAX_RETRIES=3 RETRY_COUNT=0 while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do if dot-ai <command>; then echo "Success" exit 0 fi RETRY_COUNT=$((RETRY_COUNT + 1)) echo "Retry $RETRY_COUNT/$MAX_RETRIES" sleep 5 done echo "Failed after $MAX_RETRIES attempts" exit 1 ``` ## Next Steps - **[Output Formats](output-formats.md)** — Control output for parsing - **[Configuration](../setup/configuration.md)** — Environment variables and flags - **[Commands Overview](cli-commands-overview.md)** — Available commands # Commands Overview The CLI exposes all DevOps AI Toolkit server capabilities as commands. Commands are automatically generated from the server's OpenAPI specification. ## Discovering Commands To see all available commands: ```bash dot-ai --help ``` To see help for a specific command: ```bash dot-ai <command> --help ``` For details on what each feature does, see the [server documentation](/docs/ai-engine/). 
## Global Flags These flags work with all commands: | Flag | Environment Variable | Description | |------|---------------------|-------------| | `--server-url` | `DOT_AI_URL` | Server URL (default: `http://localhost:3456`) | | `--token` | `DOT_AI_AUTH_TOKEN` | Authentication token | | `--output` | `DOT_AI_OUTPUT_FORMAT` | Output format: `yaml` or `json` (default: `yaml`) | | `--help` | - | Show command help | ## Usage Patterns **Basic command execution:** ```bash dot-ai <command> [arguments] [flags] ``` **With output format:** ```bash dot-ai <command> --output json ``` **Remote server:** ```bash dot-ai <command> --server-url https://remote:3456 --token mytoken ``` **Piping output:** ```bash dot-ai <command> --output json | jq '.result' ``` ## Next Steps - **[Skills Generation](skills-generation.md)** — Enable AI agents to use the CLI - **[Output Formats](output-formats.md)** — YAML vs JSON - **[Automation](automation.md)** — Use in scripts and CI/CD - **[Server Features](/docs/ai-engine/)** — What each command does # Output Formats Control how the CLI formats command output. ## Available Formats ### YAML (Default) Human-readable structured output. **When to use:** - Interactive terminal use - Reading output directly - Debugging and development **Example:** ```bash dot-ai version ``` **Output:** ```yaml version: 1.2.1 server: version: 1.2.1 healthy: true ``` ### JSON Raw API response, machine-parseable. **When to use:** - Scripting and automation - Piping to other tools (jq, etc.) 
- AI agents processing output - CI/CD pipelines **Example:** ```bash dot-ai version --output json ``` **Output:** ```json { "version": "1.2.1", "server": { "version": "1.2.1", "healthy": true } } ``` ## Setting Output Format **Command-line flag:** ```bash dot-ai <command> --output json dot-ai <command> --output yaml ``` **Environment variable:** ```bash export DOT_AI_OUTPUT_FORMAT="json" dot-ai <command> ``` **Default:** `yaml` ## Processing Output **Extract fields with jq:** ```bash dot-ai version --output json | jq '.server.version' ``` **Filter arrays:** ```bash dot-ai resources --kind Deployment --output json | jq '.items[] | .metadata.name' ``` **Combine with other tools:** ```bash dot-ai resources --kind Pod --output json | jq -r '.items[].metadata.name' | xargs -I {} echo "Pod: {}" ``` ## For AI Agents AI agents should use JSON output for structured parsing: ```bash dot-ai <command> --output json ``` This ensures consistent, parseable responses without YAML formatting ambiguities. ## Next Steps - **[Automation](automation.md)** — Use output in scripts and CI/CD - **[Commands Overview](cli-commands-overview.md)** — See all available commands - **[Configuration](../setup/configuration.md)** — Set default output format # Skills Generation Enable AI agents to use the DevOps AI Toolkit CLI and access server prompts as native skills. ## What Are Skills? Skills are agent capabilities that make AI coding assistants (Claude Code, Cursor, Windsurf) aware of available tools and workflows. The CLI can generate skills from server capabilities. ## What Gets Generated Skills generation serves two purposes: ### 1. CLI Awareness (Routing Skill) Creates a `dot-ai` routing skill that makes agents aware of the CLI: - Triggers on Kubernetes and DevOps operations - Directs agents to use CLI instead of MCP - Teaches agents to use `dot-ai --help` for command discovery - Lower token overhead than MCP protocol ### 2. 
Server Prompts (Prompt Skills) Exposes server prompts as native agent skills: - Each server prompt becomes an agent skill (e.g., `dot-ai-projectSetup`, `dot-ai-query`) - Users can invoke them as native skills in their agent - Prefixed with `dot-ai-` to avoid naming conflicts ## Supported Agents - **Claude Code** — `.claude/skills/` - **Cursor** — `.cursor/skills/` - **Windsurf** — `.windsurf/skills/` Note: Cursor also auto-discovers skills from `.claude/skills/`, so Claude Code skills work in Cursor without duplication. ## Generate Skills **For Claude Code:** ```bash dot-ai skills generate --agent claude-code ``` **For Cursor:** ```bash dot-ai skills generate --agent cursor ``` **For Windsurf:** ```bash dot-ai skills generate --agent windsurf ``` **Custom path (unsupported agents):** ```bash dot-ai skills generate --path ./custom/skills/ ``` ## Updating Skills Re-running the command updates all `dot-ai-*` skills: ```bash dot-ai skills generate --agent claude-code ``` Existing `dot-ai-*` skills are deleted and regenerated with the latest server capabilities. ## How It Works 1. CLI fetches prompts and tool metadata from the server 2. Generates a routing skill for CLI awareness 3. Creates individual skills for each server prompt 4. All skills use `dot-ai-` prefix for namespacing ## Agent Behavior Once skills are generated: **Routing:** - Agents become aware of CLI for Kubernetes operations - Agents prefer CLI over MCP when both are available - Agents use `dot-ai --help` to discover commands **Prompts:** - Server prompts appear as native agent skills - Users can invoke them directly in their coding assistant - Skills stay in sync with server capabilities ## Next Steps - **[Automation](automation.md)** — Use CLI in scripts and CI/CD - **[Output Formats](output-formats.md)** — Control output format - **[Configuration](../setup/configuration.md)** — Configure server URL # Quick Start Get your AI agent using the DevOps AI Toolkit CLI. 
## Prerequisites - Running [DevOps AI Toolkit server](/docs/ai-engine/setup/deployment) - AI coding assistant: Claude Code, Cursor, or Windsurf ## Install the CLI **macOS/Linux:** ```bash brew install vfarcic/tap/dot-ai ``` For other platforms, see [Installation Guide](setup/installation.md). ## Configure Server Connection Point the CLI to your server: ```bash export DOT_AI_URL="http://dot-ai.127.0.0.1.nip.io" # your server URL export DOT_AI_AUTH_TOKEN="your-token" # if authentication is enabled ``` See [Configuration Guide](setup/configuration.md) for details. ## Generate Agent Skills Enable your AI agent to discover and use the CLI: ```bash # For Claude Code dot-ai skills generate --agent claude-code # For Cursor dot-ai skills generate --agent cursor # For Windsurf dot-ai skills generate --agent windsurf ``` See [Skills Generation](guides/skills-generation.md) for what this does. ## Verify It Works Ask your agent to use the dot-ai CLI: ``` "Use the dot-ai CLI to check the server version" ``` Your agent should execute `dot-ai version` and show you the results. If this works, your agent is successfully using the CLI! ## What's Next Your agent can now use all DevOps AI Toolkit capabilities via CLI. For details on what you can do, see the [server documentation](/docs/ai-engine/). **CLI-specific topics:** - **[Commands Overview](guides/cli-commands-overview.md)** — How to discover and use commands - **[Output Formats](guides/output-formats.md)** — Control CLI output format - **[Automation](guides/automation.md)** — Use CLI in scripts and CI/CD # Configuration Configure the CLI to connect to your DevOps AI Toolkit server. 
## Server URL Specify the server address: **Environment variable:** ```bash export DOT_AI_URL="https://your-server-url" ``` **Command-line flag:** ```bash dot-ai query "test" --server-url https://your-server-url ``` **Default:** `http://localhost:3456` ## Authentication Set the authentication token: **Environment variable:** ```bash export DOT_AI_AUTH_TOKEN="your-token-here" ``` **Command-line flag:** ```bash dot-ai query "test" --token your-token-here ``` **Default:** No authentication (for local development) ## Output Format Choose the output format: **Environment variable:** ```bash export DOT_AI_OUTPUT_FORMAT="json" # or "yaml" ``` **Command-line flag:** ```bash dot-ai query "test" --output json ``` **Default:** `yaml` **Options:** - `yaml` — Human-readable, structured output (default) - `json` — Machine-parseable, raw API response ## Configuration Precedence Settings are applied in this order (highest to lowest priority): 1. **Command-line flags** (`--server-url`, `--token`, `--output`) 2. **Environment variables** (`DOT_AI_URL`, `DOT_AI_AUTH_TOKEN`, `DOT_AI_OUTPUT_FORMAT`) 3. **Defaults** (`http://localhost:3456`, no token, `yaml`) ## Example Configuration **For local development:** ```bash # No configuration needed - defaults work dot-ai version ``` **For remote server:** ```bash # Set once in your shell profile export DOT_AI_URL="https://dot-ai.example.com" export DOT_AI_AUTH_TOKEN="your-token" # Then use normally dot-ai query "what pods are running?" 
``` **For multiple environments:** ```bash # Development DOT_AI_URL="https://dev.example.com" dot-ai query "test" # Production DOT_AI_URL="https://prod.example.com" DOT_AI_AUTH_TOKEN="prod-token" dot-ai query "test" ``` ## Next Steps - **[Shell Completion](shell-completion.md)** — Enable command autocompletion - **[Commands Overview](../guides/cli-commands-overview.md)** — See all available commands - **[Automation](../guides/automation.md)** — Use in scripts and CI/CD # Installation Install the DevOps AI Toolkit CLI on your preferred platform. ## Prerequisites - Access to a running [DevOps AI Toolkit server](/docs/ai-engine/setup/deployment) ## Homebrew (macOS/Linux) ```bash brew install vfarcic/tap/dot-ai ``` ## Scoop (Windows) ```bash # Add the bucket scoop bucket add dot-ai https://github.com/vfarcic/scoop-dot-ai # Install scoop install dot-ai ``` ## Binary Download Download the latest release for your platform: **macOS (Apple Silicon):** ```bash curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-darwin-arm64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai ``` **macOS (Intel):** ```bash curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-darwin-amd64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai ``` **Linux (x86_64):** ```bash curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-linux-amd64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai ``` **Linux (ARM64):** ```bash curl -sL https://github.com/vfarcic/dot-ai-cli/releases/latest/download/dot-ai-linux-arm64 \ -o /usr/local/bin/dot-ai && chmod +x /usr/local/bin/dot-ai ``` **Windows:** Download from [GitHub Releases](https://github.com/vfarcic/dot-ai-cli/releases/latest) and add to PATH. 
## Configuration Configure the server URL and authentication: ```bash export DOT_AI_URL="https://your-server-url" export DOT_AI_AUTH_TOKEN="your-token" ``` See [Configuration](configuration.md) for more options. ## Verification Verify the CLI can connect to your server: ```bash dot-ai version ``` You should see version and diagnostic information from the server. ## Next Steps - **[Configuration](configuration.md)** — Detailed configuration options - **[Shell Completion](shell-completion.md)** — Enable command autocompletion - **[Commands Overview](../guides/cli-commands-overview.md)** — See all available commands # Shell Completion Enable command and flag autocompletion for your shell. ## Bash **Linux:** ```bash dot-ai completion bash | sudo tee /etc/bash_completion.d/dot-ai > /dev/null ``` **macOS:** ```bash dot-ai completion bash > $(brew --prefix)/etc/bash_completion.d/dot-ai ``` Then restart your shell or source the completion file: ```bash source $(brew --prefix)/etc/bash_completion.d/dot-ai ``` ## Zsh ```bash dot-ai completion zsh > "${fpath[1]}/_dot-ai" ``` Then restart your shell or run: ```bash compinit ``` ## Fish ```bash dot-ai completion fish > ~/.config/fish/completions/dot-ai.fish ``` Then restart your shell or run: ```bash source ~/.config/fish/completions/dot-ai.fish ``` ## What Gets Completed Shell completion provides: - **Commands** — All available CLI commands - **Flags** — Global and command-specific flags - **Enum values** — Valid values for flags like `--output` (`yaml`, `json`) - **Help** — Press tab to see available options ## Next Steps - **[Commands Overview](../guides/cli-commands-overview.md)** — Learn all available commands - **[Configuration](configuration.md)** — Configure server URL and authentication ## Controller # DevOps AI Toolkit Controller A Kubernetes controller that provides resource tracking, event-driven remediation, and resource visibility capabilities for your cluster. 
## Mission The DevOps AI Toolkit Controller bridges the gap between Kubernetes resources and intelligent operations. It enables: - **Resource awareness** through logical grouping and health aggregation - **Proactive remediation** through AI-powered event analysis - **Resource discoverability** through semantic search integration ## Who Should Use This - **Platform Engineers** building self-service Kubernetes platforms - **DevOps Teams** looking to automate incident response - **SREs** who want intelligent monitoring and remediation - **Developers** deploying applications and needing resource visibility ## Scope ### In Scope - Kubernetes resource tracking and lifecycle management - Event-driven remediation with AI analysis - Resource synchronization for semantic search - Integration with DevOps AI Toolkit MCP ### Out of Scope - Direct AI/LLM processing (delegated to MCP) - Application-level monitoring - Multi-cluster management - GitOps workflows ## Features The DevOps AI Toolkit Controller provides five main capabilities: ### 1. Solution CRD - Resource Tracking Track and manage deployed Kubernetes resources as logical solutions: - **Resource Grouping**: Links all resources (Deployments, Services, etc.) that compose a logical solution - **Intent Preservation**: Stores the original user intent and deployment context - **Metadata Storage**: Captures deployment rationale, patterns, policies, and documentation links - **Health Monitoring**: Aggregates health status across all tracked resources - **Automatic Cleanup**: Deleting a Solution CR automatically deletes all child resources via ownerReferences **Works standalone** - No external dependencies required. ### 2. 
RemediationPolicy CRD - Event-Driven Remediation Monitor Kubernetes events and automatically remediate issues using the [DevOps AI Toolkit](/docs/mcp): - **Event Watching**: Configurable filtering of Kubernetes events by type, reason, and involved objects - **Automatic Mode**: System detects, analyzes, and fixes issues without human intervention - **Manual Mode**: System provides remediation recommendations via Slack for human execution - **Slack Notifications**: Rich notifications with remediation results and next steps - **Rate Limiting**: Prevents event storms with configurable cooldowns - **Status Reporting**: Comprehensive observability through status updates **Requires** [DevOps AI Toolkit MCP](/docs/mcp) for AI-powered analysis. ### 3. ResourceSyncConfig CRD - Resource Visibility Enable semantic search and resource discovery across your cluster: - **Resource Discovery**: Automatically discovers all resource types in your cluster - **Change Tracking**: Watches for resource changes (create, update, delete) - **Semantic Search**: Syncs resource metadata to MCP for natural language queries - **Debounced Sync**: Batches changes to reduce API calls - **Periodic Resync**: Full state sync catches any missed events **Requires** [DevOps AI Toolkit MCP](/docs/mcp) for semantic search capabilities. ### 4. CapabilityScanConfig CRD - Autonomous Capability Discovery Keep your cluster's capability data up-to-date for AI-powered recommendations: - **Autonomous Discovery**: Automatically detects CRD changes (create, update, delete) - **Event-Driven Scanning**: Triggers capability scans when new CRDs are installed - **Startup Reconciliation**: Syncs cluster state with MCP on controller restart - **Resource Filtering**: Include/exclude patterns for targeted scanning - **Debounced Batching**: Groups rapid CRD changes into efficient batch requests **Requires** [DevOps AI Toolkit MCP](/docs/mcp) for capability storage and analysis. ### 5. 
GitKnowledgeSource CRD - Documentation Ingestion Automatically sync documentation from Git repositories into the knowledge base: - **Document Ingestion**: Syncs markdown and other files to MCP knowledge base - **Change Detection**: Only processes files changed since the last sync - **Scheduled Sync**: Periodically re-syncs to capture updates - **Pattern Matching**: Flexible glob patterns for file selection - **Automatic Cleanup**: Removes documents from knowledge base when CR is deleted **Requires** [DevOps AI Toolkit MCP](/docs/mcp) for knowledge base storage. ## Quick Start > **Recommended**: For the easiest setup, we recommend installing the complete dot-ai stack which includes all components. See the [Stack Installation Guide](/docs/stack). > > The stack automatically applies CapabilityScanConfig and ResourceSyncConfig CRs. For RemediationPolicy and Solution CRs, see the [Remediation Guide](remediation-guide.md) and [Solution Guide](solution-guide.md) to configure them based on your needs. > > Continue below if you want to install this component individually (for non-Kubernetes setups or granular control). ### 1. Install Controller ```bash # Set the version from https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller export DOT_AI_CONTROLLER_VERSION="..." helm install dot-ai-controller oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller \ --version $DOT_AI_CONTROLLER_VERSION \ --namespace dot-ai \ --create-namespace \ --wait ``` This installs all five CRDs (Solution, RemediationPolicy, ResourceSyncConfig, CapabilityScanConfig, and GitKnowledgeSource) and the controller. ### 2. 
Choose Your Feature **For Event Remediation:** First, install the [DevOps AI Toolkit MCP](/docs/mcp), then: ```bash # Create a RemediationPolicy to handle events kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: RemediationPolicy metadata: name: auto-remediate namespace: dot-ai spec: eventSelectors: - type: Warning reason: FailedScheduling mode: automatic mcpEndpoint: http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/remediate EOF ``` See the [Remediation Guide](remediation-guide.md) for complete examples, configuration options, and best practices. **For Resource Visibility:** First, install the [DevOps AI Toolkit MCP](/docs/mcp), then: ```bash # Create a secret with your MCP auth token (if not already created) kubectl create secret generic dot-ai-secrets \ --namespace dot-ai \ --from-literal=auth-token=your-auth-token-here # Create a ResourceSyncConfig to enable semantic search kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: ResourceSyncConfig metadata: name: default-sync namespace: dot-ai spec: mcpEndpoint: http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync mcpAuthSecretRef: name: dot-ai-secrets key: auth-token debounceWindowSeconds: 10 resyncIntervalMinutes: 60 EOF ``` See the [Resource Sync Guide](resource-sync-guide.md) for complete examples and semantic search usage. **For Capability Discovery:** First, install the [DevOps AI Toolkit MCP](/docs/mcp), then: ```bash # Create a CapabilityScanConfig to enable autonomous scanning kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: CapabilityScanConfig metadata: name: default-scan namespace: dot-ai spec: mcp: endpoint: http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/manageOrgData authSecretRef: name: dot-ai-secrets key: auth-token EOF ``` See the [Capability Scan Guide](capability-scan-guide.md) for complete examples and configuration options. 
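After applying the CapabilityScanConfig above, a quick way to confirm the controller accepted it is to read the CR back and inspect its conditions (name and namespace match the example; the `Ready` condition is documented in the Capability Scan Guide):

```shell
# List the CR, then check its status conditions
kubectl get capabilityscanconfig default-scan --namespace dot-ai
kubectl get capabilityscanconfig default-scan --namespace dot-ai \
    -o jsonpath='{.status.conditions}'
```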
**For Resource Tracking:** ```bash # Create a Solution CR to track your deployed resources kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: Solution metadata: name: my-app namespace: default spec: intent: "Production web application with database" resources: - apiVersion: apps/v1 kind: Deployment name: web-app - apiVersion: v1 kind: Service name: web-app-service - apiVersion: apps/v1 kind: StatefulSet name: postgresql EOF ``` See the [Solution Guide](solution-guide.md) for complete examples and usage patterns. ## Documentation - **[Setup Guide](setup-guide.md)** - Installation and prerequisites - **[Remediation Guide](remediation-guide.md)** - Event-driven remediation - **[Resource Sync Guide](resource-sync-guide.md)** - Resource visibility and semantic search - **[Capability Scan Guide](capability-scan-guide.md)** - Autonomous capability discovery - **[Knowledge Source Guide](knowledge-source-guide.md)** - Documentation ingestion from Git - **[Solution Guide](solution-guide.md)** - Resource tracking and lifecycle management - **[Troubleshooting Guide](troubleshooting.md)** - Common issues and solutions ## Architecture ``` ┌─────────────────────────────────────────────────────┐ │ Kubernetes Cluster │ │ │ │ ┌──────────────────────┐ │ │ │ Solution CR │ (Parent Resource) │ │ │ ───────────── │ │ │ │ metadata: │ │ │ │ intent: "..." │ │ │ │ resources: [...] 
│ │ │ └──────────────────────┘ │ │ ▲ │ │ │ ownerReferences │ │ │ │ │ ┌────────┴──────────┬──────────────┬─────────┐ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ │ │ Deployment Service PVC ConfigMap│ │ (child) (child) (child) (child) │ │ │ │ ┌─────────────────────────────────────────────┐ │ │ │ Controller │ │ │ │ ─────────── │ │ │ │ • Watches Solution CRs │ │ │ │ • Manages ownerReferences │ │ │ │ • Tracks resource health │ │ │ │ • Processes events (RemediationPolicy) │ │ │ │ • Syncs resources to MCP (ResourceSync) │ │ │ │ • Scans capabilities (CapabilityScan) │ │ │ └─────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────┘ ``` # Capability Scan Guide This guide covers the CapabilityScanConfig CRD for autonomous capability discovery and scanning in your Kubernetes cluster. ## Overview The CapabilityScanConfig enables: - **Autonomous Discovery**: Automatically detects CRD changes (create, update, delete) - **Event-Driven Scanning**: Triggers capability scans when new CRDs are installed - **Startup Reconciliation**: Syncs cluster state with MCP on controller restart - **Debounced Batching**: Groups rapid CRD changes into efficient batch requests This feature works with the [DevOps AI Toolkit MCP](/docs/mcp) to keep your cluster's capability data up-to-date for AI-powered recommendations. ## Stack Installation If you installed via the [DevOps AI Toolkit Stack](/docs/stack), CapabilityScanConfig is already configured. You can verify with: ```bash kubectl get capabilityscanconfig -n dot-ai ``` Continue below only if you need to customize the configuration or installed the controller individually. ## Prerequisites - Controller installed (see [Setup Guide](setup-guide.md)) - [DevOps AI Toolkit MCP](/docs/mcp) installed and running ## Quick Start 1. 
Create a secret with your MCP API key (if authentication is required): ```bash kubectl create secret generic dot-ai-secrets \ --namespace dot-ai \ --from-literal=auth-token=your-auth-token-here ``` 2. Create a CapabilityScanConfig to start scanning: ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: CapabilityScanConfig metadata: name: default-scan namespace: dot-ai spec: mcp: endpoint: http://dot-ai.dot-ai.svc.cluster.local:3456/api/v1/tools/manageOrgData authSecretRef: name: dot-ai-secrets key: auth-token ``` 3. Apply it: ```bash kubectl apply -f capabilityscanconfig.yaml ``` The controller will perform an initial scan of all cluster resources and then watch for CRD changes. ## How It Works ### Startup Reconciliation When the controller starts (or restarts), it performs a diff-and-sync: 1. **List Cluster Resources**: Uses Discovery API to get all resources (core + CRDs) matching include/exclude filters 2. **List MCP Capabilities**: Queries MCP for existing capability IDs 3. **Compute Diff**: - Resources in cluster but not in MCP → trigger targeted scan - Capabilities in MCP but not in cluster → delete orphaned This ensures the controller recovers gracefully from restarts without missing any changes. ### Event-Driven Scanning After startup, the controller watches for CRD events: 1. **CRD Created/Updated**: Queue for capability scan 2. **CRD Deleted**: Queue for capability deletion 3. **Debounce**: Wait for `debounceWindowSeconds` to collect more events 4. **Batch Request**: Send all queued scans in a single request ### Debouncing When operators are installed, many CRDs may be created in rapid succession. Debouncing prevents overwhelming MCP with individual requests: ```text Time 0s: CRD-A created → add to buffer Time 1s: CRD-B created → add to buffer Time 2s: CRD-C created → add to buffer ... 
Time 10s: Flush buffer → single request: "CRD-A,CRD-B,CRD-C" ``` Configure the window based on your needs: - **Lower values (1-5s)**: Faster scanning, more HTTP requests - **Higher values (30-60s)**: Fewer requests, delayed scanning ### Fire-and-Forget Model The controller uses a fire-and-forget pattern: - Scans are triggered asynchronously (controller doesn't wait for completion) - MCP performs the actual capability analysis in the background - Failed scans are automatically retried on next controller restart ## Configuration ### Spec Fields | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `mcp.endpoint` | string | Yes | - | Full URL of the MCP manageOrgData endpoint | | `mcp.collection` | string | No | capabilities | Qdrant collection name for storing capabilities | | `mcp.authSecretRef` | SecretReference | Yes | - | Secret containing API key for MCP authentication | | `includeResources` | []string | No | all | Patterns for resources to include in scanning | | `excludeResources` | []string | No | - | Patterns for resources to exclude from scanning | | `retry.maxAttempts` | int | No | 3 | Maximum retry attempts for MCP API calls | | `retry.backoffSeconds` | int | No | 5 | Initial backoff duration in seconds | | `retry.maxBackoffSeconds` | int | No | 300 | Maximum backoff duration in seconds | | `debounceWindowSeconds` | int | No | 10 | Time window to batch CRD events before syncing | ### Resource Filtering Use `includeResources` and `excludeResources` to control which resources are scanned. 
Filters apply to: - **Initial scan**: All resources discovered via Discovery API (core + CRDs) - **Event-driven scanning**: CRD create/update/delete events **Pattern Format**: - `Kind.group` for grouped resources (e.g., `Deployment.apps`, `RDSInstance.database.aws.crossplane.io`) - `Kind` for core resources (e.g., `Service`, `ConfigMap`) - Wildcards supported: `*.crossplane.io`, `*.apps`, `*` **Example: Whitelist - Scan Only Crossplane Resources**: ```yaml spec: includeResources: - "*.crossplane.io" ``` **Example: Blocklist - Scan Everything Except High-Volume Resources**: ```yaml spec: excludeResources: - "Event" - "Lease.coordination.k8s.io" - "EndpointSlice.discovery.k8s.io" ``` **Example: Combined - Crossplane Resources Except Provider Configs**: ```yaml spec: includeResources: - "*.crossplane.io" excludeResources: - "ProviderConfig.*" ``` **Processing Order**: 1. If `includeResources` is specified, only those patterns are scanned 2. `excludeResources` is applied as a blocklist after includes 3. 
If neither is specified, all resources are scanned ## Status Check the status to verify scanning is working: ```bash kubectl get capabilityscanconfig default-scan --namespace dot-ai -o yaml ``` ### Status Fields | Field | Description | |-------|-------------| | `initialScanComplete` | Whether startup reconciliation has completed | | `lastScanTime` | Timestamp of last successful scan trigger | | `lastError` | Last error message if any | | `conditions` | Standard Kubernetes conditions | ### Conditions | Type | Description | |------|-------------| | `Ready` | True when controller is watching CRDs and connected to MCP | ## Example: Full Configuration ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: CapabilityScanConfig metadata: name: production-scan namespace: dot-ai spec: # MCP configuration mcp: endpoint: http://dot-ai.dot-ai.svc.cluster.local:3456/api/v1/tools/manageOrgData collection: capabilities authSecretRef: name: dot-ai-secrets key: auth-token # Only scan Crossplane and ArgoCD resources includeResources: - "*.crossplane.io" - "*.aws.crossplane.io" - "*.gcp.crossplane.io" - "*.azure.crossplane.io" - "applications.argoproj.io" - "applicationsets.argoproj.io" # Exclude internal resources excludeResources: - "*.internal.company.com" # Retry configuration for MCP API calls retry: maxAttempts: 5 backoffSeconds: 10 maxBackoffSeconds: 300 # Batch CRD events for 15 seconds before sending debounceWindowSeconds: 15 ``` ## Use Cases ### Crossplane Provider Installation When you install a Crossplane provider: ```bash kubectl apply -f provider-aws.yaml ``` The controller: 1. Detects new CRDs (`RDSInstance.database.aws.crossplane.io`, `Bucket.s3.aws.crossplane.io`, etc.) 2. Waits for debounce window (batches all CRDs) 3. Sends single scan request to MCP 4. MCP analyzes and stores capabilities MCP can now provide AI recommendations that include the newly available AWS resources. 
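To confirm that a provider installation actually triggered a scan, the status fields above can be queried directly (CR name and namespace taken from the earlier examples):

```shell
# Should show a recent timestamp after the debounce window elapses
kubectl get capabilityscanconfig default-scan --namespace dot-ai \
    -o jsonpath='{.status.lastScanTime}'
```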
### Operator Removal When you remove an operator: ```bash kubectl delete -f provider-aws.yaml ``` The controller: 1. Detects CRD deletions 2. Sends delete requests to MCP for each capability 3. MCP removes stale capability data MCP recommendations no longer suggest the removed resources. ### Controller Restart Recovery If the controller pod restarts: 1. Controller performs startup reconciliation 2. Compares cluster CRDs with MCP capabilities 3. Syncs any differences (missed events during downtime) 4. Resumes event watching No manual intervention required. ## Troubleshooting ### Controller Not Starting Check the Ready condition: ```bash kubectl get capabilityscanconfig default-scan -o jsonpath='{.status.conditions}' ``` Common issues: - Invalid `mcp.endpoint` URL - MCP service not reachable - Missing RBAC permissions ### Scans Not Triggering 1. Check if CRD matches include/exclude filters: ```bash # View configured filters kubectl get capabilityscanconfig default-scan -o jsonpath='{.spec.includeResources}' kubectl get capabilityscanconfig default-scan -o jsonpath='{.spec.excludeResources}' ``` 2. Check controller logs: ```bash kubectl logs -l app.kubernetes.io/name=dot-ai-controller -n dot-ai --tail=50 ``` Look for messages about CRD events and filtering decisions. ### MCP Connection Errors Check `lastError` in status: ```bash kubectl get capabilityscanconfig default-scan -o jsonpath='{.status.lastError}' ``` Common issues: - MCP endpoint unreachable (check service/DNS) - Authentication failure (check secret exists and has correct key) - MCP server overloaded (check MCP logs) ### Initial Scan Not Completing Check if initial scan is marked complete: ```bash kubectl get capabilityscanconfig default-scan -o jsonpath='{.status.initialScanComplete}' ``` If false, check controller logs for errors during startup reconciliation. 
### Debounce Window Too Long/Short Adjust `debounceWindowSeconds` based on your operator installation patterns: ```yaml spec: # For frequent small changes debounceWindowSeconds: 5 # For large operator installations debounceWindowSeconds: 30 ``` ## Cleanup Delete the CapabilityScanConfig to stop scanning: ```bash kubectl delete capabilityscanconfig default-scan ``` This stops the CRD watcher but does not delete capability data from MCP. To remove capability data, use the MCP `manageOrgData` tool with `operation: deleteAll`. See the [Capability Management Guide](/docs/ai-engine/tools/capability-management) for details. ## Next Steps - Learn about [Resource Sync](resource-sync-guide.md) for semantic search of cluster resources - Explore [Remediation Policies](remediation-guide.md) for event-driven remediation - Check [Troubleshooting Guide](troubleshooting.md) for common issues # Knowledge Source Guide This guide covers the GitKnowledgeSource CRD for automatically syncing documentation from Git repositories into the DevOps AI Toolkit knowledge base. ## Overview The GitKnowledgeSource enables: - **Document Ingestion**: Automatically syncs markdown and other files to the knowledge base - **Change Detection**: Only processes files changed since the last sync - **Scheduled Sync**: Periodically re-syncs to capture updates - **Automatic Cleanup**: Removes documents from knowledge base when the resource is deleted Once documents are synced, they become searchable through the DevOps AI Toolkit's semantic search capabilities. ## Stack Installation If you installed via the [DevOps AI Toolkit Stack](/docs/stack), you can create GitKnowledgeSource resources immediately. Verify the CRD is available: ```bash kubectl get crds gitknowledgesources.dot-ai.devopstoolkit.live ``` Continue below to configure a GitKnowledgeSource for your documentation. 
## Prerequisites - Controller installed (see [Setup Guide](setup-guide.md)) - [DevOps AI Toolkit MCP](/docs/mcp) installed and running ## Quick Start 1. Ensure the MCP authentication secret exists: ```bash kubectl get secret dot-ai-secrets -n dot-ai ``` If not, create it: ```bash kubectl create secret generic dot-ai-secrets \ --namespace dot-ai \ --from-literal=auth-token=your-auth-token-here ``` 2. Create a GitKnowledgeSource to sync documentation from a Git repository: ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: GitKnowledgeSource metadata: name: my-docs namespace: dot-ai spec: repository: url: https://github.com/your-org/your-repo.git branch: main paths: - "docs/**/*.md" - "README.md" mcpServer: url: http://dot-ai.dot-ai.svc:3456 authSecretRef: name: dot-ai-secrets key: auth-token ``` 3. Apply it: ```bash kubectl apply -f gitknowledgesource.yaml ``` 4. Check the sync status: ```bash kubectl get gitknowledgesource my-docs -n dot-ai ``` Expected output: ```text NAME ACTIVE DOCUMENTS LAST SYNC NEXT SYNC my-docs true 9 2026-02-05T16:40:14Z 2026-02-06T16:40:14Z ``` ## How It Works ### Sync Process 1. **Clone**: Controller performs a shallow clone of the repository 2. **Pattern Match**: Finds files matching `paths` patterns, excluding `exclude` patterns 3. **Change Detection**: Compares current commit with `lastSyncedCommit` to find changed files 4. **Ingest**: Sends changed documents to MCP knowledge base with `sourceIdentifier` 5. **Cleanup**: Deletes the local clone (no persistent storage required) 6. 
**Schedule**: Queues next sync based on `schedule` field ### First Sync vs Incremental Sync - **First sync**: Processes all matching files (full sync) - **Subsequent syncs**: Only processes files changed since `lastSyncedCommit` - **Spec changes**: Modifying `paths` or other spec fields triggers a full sync ### What Gets Synced Each document is ingested to MCP with: - **Content**: The file contents - **URI**: `https://github.com/{org}/{repo}/blob/{branch}/{path}` - **Source Identifier**: `{namespace}/{name}` for bulk operations - **Custom Metadata**: Values from `spec.metadata` field ### Cleanup on Deletion When a GitKnowledgeSource is deleted: 1. Controller detects deletion via finalizer 2. Checks `deletionPolicy` (`Delete` or `Retain`) 3. If `Delete`: Calls MCP to remove all documents with matching `sourceIdentifier` 4. Removes finalizer, allowing CR deletion to complete ## Configuration ### Spec Fields | Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | `repository.url` | string | Yes | - | Git repository URL (HTTPS only) | | `repository.branch` | string | No | `main` | Branch to sync | | `repository.depth` | int | No | `1` | Shallow clone depth | | `repository.secretRef` | SecretReference | No | - | Secret with token for private repos | | `paths` | []string | Yes | - | Glob patterns for files to sync (e.g., `docs/**/*.md`) | | `exclude` | []string | No | - | Glob patterns to exclude | | `schedule` | string | No | `@every 24h` | Sync schedule (cron or interval) | | `mcpServer.url` | string | Yes | - | MCP server endpoint URL | | `mcpServer.authSecretRef` | SecretReference | Yes | - | Secret with MCP auth token | | `mcpServer.httpTimeoutSeconds` | int | No | `120` | HTTP timeout in seconds for MCP API calls (min: 5, max: 600) | | `metadata` | map[string]string | No | - | Custom metadata attached to all documents | | `maxFileSizeBytes` | int | No | - | Skip files larger than this size | | `deletionPolicy` | 
string | No | `Delete` | `Delete` or `Retain` documents on CR deletion | ### Repository Authentication For private repositories, create a secret with a personal access token: ```bash kubectl create secret generic github-token \ --namespace dot-ai \ --from-literal=token=ghp_xxxxxxxxxxxx ``` Reference it in the GitKnowledgeSource: ```yaml spec: repository: url: https://github.com/your-org/private-repo.git secretRef: name: github-token key: token ``` ### Path Patterns The `paths` field uses glob patterns to match files: | Pattern | Matches | |---------|---------| | `docs/**/*.md` | All markdown files under `docs/` recursively | | `README.md` | Only the root README | | `**/*.md` | All markdown files in the repository | | `docs/*.md` | Markdown files directly in `docs/` (not subdirectories) | Use `exclude` to skip specific paths: ```yaml spec: paths: - "docs/**/*.md" exclude: - "docs/internal/**" - "docs/drafts/**" ``` ### Schedule Configuration The `schedule` field accepts cron expressions or interval syntax: | Format | Example | Description | |--------|---------|-------------| | Interval | `@every 24h` | Sync every 24 hours (default) | | Interval | `@every 6h` | Sync every 6 hours | | Interval | `@every 30m` | Sync every 30 minutes | | Cron | `0 3 * * *` | Daily at 3:00 AM | | Cron | `0 */6 * * *` | Every 6 hours | The default `@every 24h` means each GitKnowledgeSource syncs 24 hours after its last sync, naturally staggering syncs based on creation time. **Invalid schedules**: If you specify an invalid schedule expression, the controller will sync once, then set a `ScheduleError` condition and stop scheduling. Fix the schedule to resume. 
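You can preview locally which files the `paths` globs described above would select, using bash's `globstar` option as a stand-in for the controller's matcher (the directory layout below is illustrative):

```shell
# Demonstrate the `paths` glob semantics with bash globstar.
shopt -s globstar
dir=$(mktemp -d)
mkdir -p "$dir/docs/internal"
touch "$dir/README.md" "$dir/docs/guide.md" "$dir/docs/internal/draft.md"
cd "$dir"

echo "docs/**/*.md matches:"
printf '  %s\n' docs/**/*.md   # recursive: docs/guide.md and docs/internal/draft.md

echo "docs/*.md matches:"
printf '  %s\n' docs/*.md      # non-recursive: docs/guide.md only
```

This also shows why an `exclude` entry such as `docs/internal/**` is needed if you want `docs/**/*.md` but not the files under `docs/internal/`.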
### File Size Limits Use `maxFileSizeBytes` to skip large files: ```yaml spec: maxFileSizeBytes: 1048576 # 1MB limit ``` Skipped files appear in the status: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.skippedFiles}' | jq ``` ### Deletion Policy The `deletionPolicy` controls what happens when the GitKnowledgeSource is deleted: | Value | Behavior | |-------|----------| | `Delete` (default) | Remove all synced documents from MCP knowledge base | | `Retain` | Keep documents in MCP (useful for migrations) | ```yaml spec: deletionPolicy: Retain # Keep docs when CR is deleted ``` ## Status Check the status to monitor sync progress: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o yaml ``` ### Status Fields | Field | Description | |-------|-------------| | `active` | Whether the source is actively syncing | | `documentCount` | Total documents synced to MCP | | `lastSyncTime` | Timestamp of last successful sync | | `lastSyncedCommit` | Git commit SHA of last sync | | `nextScheduledSync` | When the next sync will occur | | `skippedDocuments` | Count of files skipped (e.g., size limit) | | `skippedFiles` | Details of skipped files with reasons | | `syncErrors` | Count of sync errors | | `lastError` | Most recent error message | | `observedGeneration` | Last processed spec generation | | `conditions` | Standard Kubernetes conditions | ### Conditions | Type | Description | |------|-------------| | `Ready` | True when source is active and configured correctly | | `Synced` | True when last sync completed successfully | | `Scheduled` | True when next sync is scheduled | ### Example Status ```yaml status: active: true documentCount: 9 lastSyncTime: "2026-02-05T16:40:14Z" lastSyncedCommit: "c32655af7f70361835a533e57533caaf4e8b750a" nextScheduledSync: "2026-02-06T16:40:14Z" conditions: - type: Ready status: "True" reason: Active message: "GitKnowledgeSource is active and syncing" - type: Synced status: "True" reason: SyncComplete message: 
"Successfully synced 9 documents" - type: Scheduled status: "True" reason: Scheduled message: "Next sync scheduled for 2026-02-06T16:40:14Z" ``` ## Troubleshooting ### Sync Not Starting Check the Ready condition: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.conditions}' | jq ``` Common issues: - **CloneError**: Invalid repository URL or authentication failure - **MCP unreachable**: Check MCP server URL and network connectivity - **Missing secret**: Verify auth secret exists and has correct keys ### Clone Errors If you see "read-only file system" errors: - Ensure the controller deployment has a writable `/tmp` volume mount If you see authentication errors for private repos: - Verify the secret exists: `kubectl get secret -n dot-ai` - Check the token has read access to the repository - Ensure `secretRef.key` matches the key in the secret ### Documents Not Appearing in Search 1. Check sync completed successfully: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.documentCount}' ``` 2. Verify MCP is running: ```bash kubectl get pods -n dot-ai -l app=dot-ai ``` 3. Check for sync errors: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.lastError}' ``` ### Schedule Not Working Check the Scheduled condition: ```bash kubectl get gitknowledgesource my-docs -n dot-ai -o jsonpath='{.status.conditions}' | jq '.[] | select(.type=="Scheduled")' ``` If `ScheduleError`, the schedule expression is invalid. Fix the `spec.schedule` field. ## Git Provider Compatibility GitKnowledgeSource uses standard Git HTTPS protocol and should work with any Git provider: - GitHub - GitLab - Bitbucket - Gitea - Self-hosted Git servers Testing has been performed primarily with GitHub. If you encounter issues with other providers, please [report them on GitHub](https://github.com/vfarcic/dot-ai-controller/issues). 
## Next Steps - Learn about [Resource Sync](resource-sync-guide.md) for cluster resource visibility - Explore [Remediation Policies](remediation-guide.md) for event-driven remediation - Check [Troubleshooting Guide](troubleshooting.md) for common issues # Remediation Guide This guide covers the RemediationPolicy CRD for event-driven remediation using the DevOps AI Toolkit. ## Overview The RemediationPolicy CRD monitors Kubernetes events and forwards them to the DevOps AI Toolkit MCP for analysis and remediation. It supports: - **Event Watching**: Configurable filtering of Kubernetes events by type, reason, and involved objects - **Automatic Mode**: System detects, analyzes, and fixes issues without human intervention - **Manual Mode**: System detects and analyzes issues, provides recommendations via Slack for humans to execute - **Slack Notifications**: Rich notifications with remediation results and next steps - **Rate Limiting**: Prevents event storms with configurable cooldowns - **Status Reporting**: Comprehensive observability through status updates ## Prerequisites - Controller installed (see [Setup Guide](setup-guide.md)) - **DevOps AI Toolkit MCP installed** - See [DevOps AI Toolkit documentation](/docs/mcp) - Slack webhook URL (optional, for Slack notifications) - Google Chat webhook URL (optional, for Google Chat notifications - requires Google Workspace paid account) ## Create a RemediationPolicy Create a RemediationPolicy to start processing events: ```bash # Create a Secret containing your MCP auth token (required for Kubernetes deployments) # Use the same token configured in the MCP server's DOT_AI_AUTH_TOKEN environment variable kubectl create secret generic dot-ai-secrets \ --from-literal=auth-token="your-mcp-auth-token" \ --namespace dot-ai # Create a Secret containing your Slack webhook URL (optional, for notifications) kubectl create secret generic slack-webhook \ --from-literal=url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" \ 
  --namespace dot-ai

# Apply the policy
kubectl apply --filename - <
```

# Setup Guide

> **Recommended**: For the easiest setup, we recommend installing the complete dot-ai stack, which includes all components. See the [Stack Installation Guide](/docs/stack).
>
> The stack automatically applies CapabilityScanConfig and ResourceSyncConfig CRs. For RemediationPolicy and Solution CRs, see the [Remediation Guide](remediation-guide.md) and [Solution Guide](solution-guide.md) to configure them based on your needs.
>
> Continue below if you want to install this component individually (for non-Kubernetes setups or granular control).

## Prerequisites

- **kubectl** v1.11.3+
- **Helm** v3.0+
- **Kubernetes cluster** v1.11.3+

## Optional: Test Cluster Setup

If you don't have a Kubernetes cluster, create one locally using Kind:

```bash
# Use an isolated kubeconfig
export KUBECONFIG=$PWD/kubeconfig.yaml

# Create the test cluster
kind create cluster --name dot-ai-controller-test

# Verify cluster access
kubectl cluster-info
```

## Install Controller

The controller provides four features:

- **Solution CRD**: Resource tracking and lifecycle management (standalone)
- **RemediationPolicy CRD**: Event-driven remediation (requires [DevOps AI Toolkit MCP](/docs/mcp))
- **ResourceSyncConfig CRD**: Resource visibility and semantic search (requires [DevOps AI Toolkit MCP](/docs/mcp))
- **CapabilityScanConfig CRD**: Autonomous capability discovery (requires [DevOps AI Toolkit MCP](/docs/mcp))

### Install via Helm

```bash
# Set the version from https://github.com/vfarcic/dot-ai-controller/pkgs/container/dot-ai-controller%2Fcharts%2Fdot-ai-controller
export DOT_AI_CONTROLLER_VERSION="..."
helm install dot-ai-controller oci://ghcr.io/vfarcic/dot-ai-controller/charts/dot-ai-controller \ --version $DOT_AI_CONTROLLER_VERSION \ --namespace dot-ai \ --create-namespace \ --wait ``` This installs: - Controller deployment - Solution CRD - RemediationPolicy CRD - ResourceSyncConfig CRD - CapabilityScanConfig CRD - RBAC permissions ### Configuration Reference | Parameter | Description | Default | |-----------|-------------|---------| | `annotations` | Global annotations applied to all resources (e.g., `reloader.stakater.com/auto: "true"`) | `{}` | | `image.repository` | Container image repository | `ghcr.io/vfarcic/dot-ai-controller` | | `image.tag` | Container image tag | Chart appVersion | | `image.pullPolicy` | Image pull policy | `IfNotPresent` | | `resources.requests.memory` | Memory request | `128Mi` | | `resources.requests.cpu` | CPU request | `10m` | | `resources.limits.memory` | Memory limit | `512Mi` | | `resources.limits.cpu` | CPU limit | `500m` | ### Verify Installation ```bash # Check controller is running kubectl get pods --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai # Check controller logs kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 10 # Verify CRDs are installed kubectl get crds | grep dot-ai.devopstoolkit.live ``` You should see all four CRDs: ```text capabilityscanconfigs.dot-ai.devopstoolkit.live remediationpolicies.dot-ai.devopstoolkit.live resourcesyncconfigs.dot-ai.devopstoolkit.live solutions.dot-ai.devopstoolkit.live ``` ## Optional: Install DevOps AI Toolkit MCP **Required for RemediationPolicy, ResourceSyncConfig, and CapabilityScanConfig features.** If you're only using the Solution CRD, skip this step. For MCP installation instructions, see the [DevOps AI Toolkit documentation](/docs/mcp). 
The controller expects the MCP service at: - RemediationPolicy: `http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/remediate` - ResourceSyncConfig: `http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync` - CapabilityScanConfig: `http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/tools/manageOrgData` ## What's Next Choose which features you want to use: - **Solution CRD**: [Solution Guide](solution-guide.md) - Resource tracking and lifecycle management (works standalone, no MCP needed) - **RemediationPolicy CRD**: [Remediation Guide](remediation-guide.md) - Event-driven remediation (requires MCP) - **ResourceSyncConfig CRD**: [Resource Sync Guide](resource-sync-guide.md) - Resource visibility and semantic search (requires MCP) - **CapabilityScanConfig CRD**: [Capability Scan Guide](capability-scan-guide.md) - Autonomous capability discovery (requires MCP) ## Cleanup ### Remove Controller ```bash # Uninstall controller helm uninstall dot-ai-controller --namespace dot-ai # Delete namespace kubectl delete namespace dot-ai ``` ### Remove Test Cluster If you used Kind: ```bash # Delete cluster kind delete cluster --name dot-ai-controller-test # Remove kubeconfig rm kubeconfig.yaml ``` # Solution CRD Guide ## Video: Kubernetes ownerReferences and Resource Grouping [![Kubernetes ownerReferences and Resource Grouping with the Solution CRD](https://img.youtube.com/vi/UEkhIMx6B6E/maxresdefault.jpg)](https://youtu.be/UEkhIMx6B6E) This video explains the problem of understanding what resources compose an application in Kubernetes, explores how Kubernetes ownership and ownerReferences work for garbage collection, and demonstrates how the Solution CRD provides a better approach by wrapping related resources into logical groups with status, context, and relationships. ## Overview The **Solution CRD** is a Kubernetes Custom Resource Definition that tracks deployed solutions and their constituent resources. 
It acts as a parent resource that groups all Kubernetes resources (Deployments, Services, ConfigMaps, etc.) that compose a logical solution, preserving metadata and context not available in individual resources. ### Purpose When deploying applications to Kubernetes, the Solution CRD provides: 1. **Resource Grouping**: Links related Kubernetes resources into a logical solution 2. **Intent Preservation**: Stores the original user intent that led to the deployment 3. **Metadata Storage**: Captures information not available in individual resources: - Deployment rationale and decision-making context - Configuration trade-offs and choices - Documentation links - Patterns and policies applied 4. **Health Monitoring**: Aggregates health status across all tracked resources 5. **Automatic Cleanup**: Deleting a Solution CR automatically deletes all child resources ### Key Design Principles - **Kubernetes-Native**: Uses standard Kubernetes patterns and conventions - **Metadata Store**: Captures information NOT in individual resources - **Automatic Ownership**: Controller dynamically manages ownerReferences - **Lifecycle Management**: Tracks solution state from deployment through operation ## Prerequisites - Controller installed (see [Setup Guide](setup-guide.md)) The Solution CRD is automatically installed with the controller. 
Verify it's available: ```bash # Verify Solution CRD is installed kubectl get crds solutions.dot-ai.devopstoolkit.live # Check controller is running kubectl get pods --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai ``` ## Solution CRD Schema ### Spec Fields ```yaml spec: # Original user intent that led to this deployment (required) intent: string # Solution metadata (information not in individual resources) context: createdBy: string # Tool or user that created this solution rationale: string # Why this solution was deployed this way patterns: []string # Organizational patterns applied policies: []string # Policies applied to this solution # List of Kubernetes resources that compose this solution (required) resources: - apiVersion: string # e.g., "apps/v1" kind: string # e.g., "Deployment" name: string # Resource name namespace: string # Optional for cluster-scoped resources # Documentation URL (optional) documentationURL: string # Link to deployment documentation ``` ### Status Fields ```yaml status: # Overall state of the solution state: string # pending, deployed, degraded, failed # Generation tracking observedGeneration: int64 # Resource health summary resources: total: int # Total resources tracked ready: int # Resources that are ready failed: int # Resources that have failed # Standard Kubernetes conditions conditions: - type: Ready status: "True" | "False" | "Unknown" reason: string message: string ``` ## Quick Start: Your First Solution Let's create a simple web application with a PostgreSQL database and track it with a Solution CR. 
### Step 1: Create a Namespace

```bash
kubectl create namespace my-app
```

### Step 2: Deploy Application Resources

Deploy your application components (Deployment, Service, etc.). The container images below are illustrative placeholders; substitute your own:

```bash
kubectl apply --filename - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx  # placeholder image
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
  namespace: my-app
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgresql
  namespace: my-app
spec:
  serviceName: postgresql
  replicas: 1
  selector:
    matchLabels:
      app: postgresql
  template:
    metadata:
      labels:
        app: postgresql
    spec:
      containers:
      - name: postgresql
        image: postgres:16  # placeholder image
        ports:
        - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql
  namespace: my-app
spec:
  clusterIP: None
  selector:
    app: postgresql
  ports:
  - port: 5432
    targetPort: 5432
EOF
```

### Step 3: Create a Solution CR

Now create a Solution CR that tracks all these resources:

```bash
kubectl apply --filename - <<'EOF'
apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: Solution
metadata:
  name: my-web-app
  namespace: my-app
spec:
  # Describe what this solution is for
  intent: "Deploy a web application with PostgreSQL database for production workloads"

  # Add context about this deployment
  context:
    createdBy: "platform-team"
    rationale: "Deployed to meet Q1 customer requirements. PostgreSQL chosen for ACID compliance."
patterns: - "stateful-workload" - "multi-tier-application" policies: - "production-sla" - "data-retention-90d" # List all resources that compose this solution resources: - apiVersion: apps/v1 kind: Deployment name: web-app namespace: my-app - apiVersion: v1 kind: Service name: web-app-service namespace: my-app - apiVersion: apps/v1 kind: StatefulSet name: postgresql namespace: my-app - apiVersion: v1 kind: Service name: postgresql namespace: my-app EOF ``` ### Step 4: Verify the Solution Check that the Solution was created and the controller has processed it: ```bash # View the Solution kubectl get solutions --namespace my-app # Get detailed status kubectl get solution my-web-app --namespace my-app --output yaml # View controller logs kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 30 ``` Expected output: ``` NAME INTENT STATE RESOURCES AGE my-web-app Deploy a web application with PostgreSQL... deployed 4/4 2m ``` ### Step 5: Verify ownerReferences Were Added The controller automatically adds ownerReferences to all tracked resources: ```bash # Check ownerReference on Deployment kubectl get deployment web-app --namespace my-app --output jsonpath='{.metadata.ownerReferences}' | jq # Check ownerReference on Service kubectl get service web-app-service --namespace my-app --output jsonpath='{.metadata.ownerReferences}' | jq ``` You should see ownerReferences pointing to the Solution CR: ```json [ { "apiVersion": "dot-ai.devopstoolkit.live/v1alpha1", "kind": "Solution", "name": "my-web-app", "uid": "...", "controller": true, "blockOwnerDeletion": true } ] ``` ## Understanding Solution Status The Solution controller monitors all tracked resources and updates the status: ### Status States - **pending**: Initial state or resources not yet ready - **deployed**: All resources are healthy and ready - **degraded**: Some resources are unhealthy or missing - **failed**: Critical failure in resource deployment ### Health Checking The 
controller uses multiple strategies to determine resource health: 1. **Conditions** (highest priority): Checks for `Ready`, `Available`, `Healthy`, or `Synced` conditions 2. **Replica Counts**: For Deployments, StatefulSets, DaemonSets - compares readyReplicas vs desired 3. **Existence** (fallback): Resource exists in the cluster ### Status Updates The controller reconciles every 30 seconds to keep status current: ```bash # Watch status updates kubectl get solutions --namespace my-app --watch # Get detailed status with conditions kubectl get solution my-web-app --namespace my-app --output jsonpath='{.status}' | jq ``` ## Testing Health Monitoring Let's test how the controller detects unhealthy resources: ### Simulate a Failed Deployment Scale a deployment to an impossible replica count: ```bash # Scale to more replicas than cluster can handle kubectl scale deployment web-app --namespace my-app --replicas=100 # Watch Solution status change to degraded kubectl get solution my-web-app --namespace my-app --watch ``` The Solution status will show: ```yaml status: state: degraded resources: total: 4 ready: 3 failed: 1 conditions: - type: Ready status: "False" reason: ResourcesNotReady message: "Ready: 3/4, Failed: 1" ``` ### Restore Health ```bash # Scale back to normal kubectl scale deployment web-app --namespace my-app --replicas=2 # Watch status return to deployed kubectl get solution my-web-app --namespace my-app --watch ``` ## Garbage Collection One of the most powerful features is automatic cleanup via ownerReferences: ```bash # Delete the Solution CR kubectl delete solution my-web-app --namespace my-app # All tracked resources are automatically deleted kubectl get all --namespace my-app # (should show: No resources found) ``` **Important**: Deleting a Solution CR deletes ALL child resources. This is by design for clean solution removal. ## Advanced Usage ### Tracking Existing Resources You can create a Solution CR for resources that already exist. 
The controller will add ownerReferences dynamically: ```bash # Create resources first kubectl create deployment nginx --image=nginx --namespace my-app # Then create Solution referencing existing resources kubectl apply --filename - <<'EOF' apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: Solution metadata: name: nginx-solution namespace: my-app spec: intent: "Simple nginx web server" resources: - apiVersion: apps/v1 kind: Deployment name: nginx namespace: my-app EOF # ownerReference is added after Solution creation kubectl get deployment nginx --namespace my-app --output jsonpath='{.metadata.ownerReferences}' ``` ### Cross-Namespace Resources Currently, Solutions track resources in their own namespace. All resource references without an explicit namespace default to the Solution's namespace. ### Documentation Links You can link to external documentation: ```yaml spec: intent: "Production web application" documentationURL: "https://docs.example.com/apps/web-app" resources: - ... ``` This field helps teams track where deployment documentation is stored. 
## Kubectl Tips ### List All Solutions ```bash # All namespaces kubectl get solutions --all-namespaces # Specific namespace kubectl get solutions --namespace my-app # Watch for changes kubectl get solutions --all-namespaces --watch ``` ### Filter by State ```bash # Find degraded solutions kubectl get solutions --all-namespaces --output json | \ jq -r '.items[] | select(.status.state=="degraded") | "\(.metadata.namespace)/\(.metadata.name)"' # Find solutions with failed resources kubectl get solutions --all-namespaces --output json | \ jq -r '.items[] | select(.status.resources.failed > 0) | "\(.metadata.namespace)/\(.metadata.name): \(.status.resources.failed) failed"' ``` ### Inspect Resource Health ```bash # Get detailed status kubectl get solution my-web-app --namespace my-app --output yaml # Just the state kubectl get solution my-web-app --namespace my-app --output jsonpath='{.status.state}' # Resource counts kubectl get solution my-web-app --namespace my-app --output jsonpath='{.status.resources}' | jq ``` ## Common Patterns ### Pattern 1: Multi-Tier Application ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: Solution metadata: name: ecommerce-app namespace: production spec: intent: "E-commerce platform with web frontend, API backend, and PostgreSQL database" context: createdBy: "ecommerce-team" patterns: - "three-tier-architecture" - "stateful-backend" resources: - apiVersion: apps/v1 kind: Deployment name: frontend - apiVersion: v1 kind: Service name: frontend - apiVersion: apps/v1 kind: Deployment name: api-backend - apiVersion: v1 kind: Service name: api-backend - apiVersion: apps/v1 kind: StatefulSet name: postgresql - apiVersion: v1 kind: Service name: postgresql - apiVersion: v1 kind: ConfigMap name: app-config ``` ### Pattern 2: Microservice with Dependencies ```yaml apiVersion: dot-ai.devopstoolkit.live/v1alpha1 kind: Solution metadata: name: order-service namespace: microservices spec: intent: "Order processing microservice with Redis 
cache and message queue"
  context:
    createdBy: "platform-team"
    rationale: "Separated from monolith for scalability"
    policies:
    - "auto-scaling-enabled"
    - "circuit-breaker-required"
  resources:
  - apiVersion: apps/v1
    kind: Deployment
    name: order-service
  - apiVersion: v1
    kind: Service
    name: order-service
  - apiVersion: apps/v1
    kind: Deployment
    name: redis
  - apiVersion: v1
    kind: Service
    name: redis
  - apiVersion: v1
    kind: ConfigMap
    name: order-config
  - apiVersion: v1
    kind: Secret
    name: order-secrets
```

### Pattern 3: Data Pipeline

```yaml
apiVersion: dot-ai.devopstoolkit.live/v1alpha1
kind: Solution
metadata:
  name: analytics-pipeline
  namespace: data
spec:
  intent: "Analytics pipeline for processing customer data"
  context:
    createdBy: "data-team"
    patterns:
    - "batch-processing"
    - "data-pipeline"
    policies:
    - "pii-encryption-required"
    - "gdpr-compliant"
  resources:
  - apiVersion: batch/v1
    kind: CronJob
    name: data-ingestion
  - apiVersion: apps/v1
    kind: StatefulSet
    name: kafka
  - apiVersion: v1
    kind: Service
    name: kafka
  - apiVersion: apps/v1
    kind: Deployment
    name: data-processor
  - apiVersion: v1
    kind: PersistentVolumeClaim
    name: processed-data
```

## Troubleshooting

### Solution Shows "degraded" State

1. Check which resources are failing:

```bash
kubectl get solution <solution-name> -n <namespace> -o jsonpath='{.status.resources}' | jq
```

2. Inspect individual resources:

```bash
kubectl get deployment <deployment-name> -n <namespace>
kubectl describe deployment <deployment-name> -n <namespace>
```

3. Check controller logs:

```bash
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 50
```

### ownerReferences Not Added

1. Verify the resource exists and is in the correct namespace:

```bash
kubectl get <kind> <resource-name> -n <namespace>
```

2. Check that Solution references match exactly:

```bash
kubectl get solution <solution-name> -n <namespace> -o yaml | grep -A 10 resources:
```

3. Wait for the controller to reconcile (30 seconds) or check logs:

```bash
kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep ownerReference
```

### Solution Status Not Updating

1.
Verify the controller is running: ```bash kubectl get pods --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai ``` 2. Check for controller errors: ```bash kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 100 ``` 3. Verify the controller has RBAC permissions: ```bash kubectl get clusterrole dot-ai-controller-manager-role -o yaml ``` ### Resources Not Deleted with Solution 1. Check if ownerReferences were added: ```bash kubectl get <resource-type> <resource-name> -n <namespace> -o jsonpath='{.metadata.ownerReferences}' ``` 2. If missing, the controller may not have permission. Check RBAC: ```bash kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "forbidden\|permission" ``` ## Current Limitations - **Namespace Scoped**: Solutions only track resources in the same namespace - **Namespaced Resources Only**: Cannot currently track cluster-scoped resources (ClusterRoles, PVs, etc.) - **No Configuration Drift Detection**: The controller only tracks resource health, not configuration changes ## Future Enhancements Planned features for future releases: - **Solution Updates**: Support updating deployed solutions via Solution CR changes - **Rollback Support**: Track solution versions and enable rollback - **Advanced Health Checks**: Custom health checks beyond basic resource status - **Cost Tracking**: Integration with cloud cost APIs - **Cross-Namespace Solutions**: Support for solutions spanning multiple namespaces - **Template System**: Solution templates for common patterns ## Next Steps - Explore the [Remediation Guide](remediation-guide.md) for event-driven remediation - Learn about [Capability Scanning](capability-scan-guide.md) for autonomous capability discovery - Check the [Troubleshooting Guide](troubleshooting.md) for common issues # Troubleshooting Guide This guide covers common issues encountered when running the DevOps AI Toolkit Controller and their solutions. ## Common Issues and Solutions ### 1.
Controller Pod Not Starting **Symptoms:** ```bash kubectl get pods --namespace dot-ai # Shows controller pod in CrashLoopBackOff or ImagePullBackOff ``` **Diagnosis:** ```bash kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai kubectl describe pod --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai ``` **Common Causes:** - **RBAC Issues**: Missing leader election permissions (we encountered this during testing) - **Image Issues**: Wrong architecture or missing image - **Resource Constraints**: Insufficient memory/CPU limits **Solution:** ```bash # Check if leader election RBAC is missing (error we fixed during testing): # "leases.coordination.k8s.io is forbidden" kubectl get clusterrole dot-ai-controller-manager-role --output yaml # Add missing leader election permissions if needed: kubectl patch clusterrole dot-ai-controller-manager-role --type='json' \ --patch='[{"op": "add", "path": "/rules/-", "value": {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ["create", "get", "list", "update"]}}]' ``` ### 2. Events Not Being Processed **Symptoms:** ```bash kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 50 # Shows: "No RemediationPolicies found - event will not be processed" ``` **Diagnosis:** ```bash # Check if RemediationPolicies exist kubectl get remediationpolicies --all-namespaces # Check policy selectors kubectl get remediationpolicies --namespace dot-ai --output yaml ``` **Common Causes:** - No RemediationPolicy created - Event doesn't match policy selectors - Policy in wrong namespace ### 3. 
MCP Connection Failures **Symptoms:** ```bash # Controller logs show: # "❌ HTTP request failed" or "Failed to send MCP request" ``` **Diagnosis:** ```bash # Check MCP pod status kubectl get pods --namespace dot-ai --selector app.kubernetes.io/name=dot-ai # Test MCP connectivity from controller kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \ curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/health ``` **Common Causes:** - MCP pod not running - Wrong MCP endpoint URL in RemediationPolicy - Network policies blocking communication ### 4. Slack Notifications Not Working **Symptoms:** ```bash # Controller logs show: # "failed to send Slack start notification" ``` **Diagnosis:** ```bash # Check Slack webhook configuration kubectl get remediationpolicies --namespace dot-ai --output yaml | grep --after-context 5 slack # Test webhook manually curl -X POST -H 'Content-type: application/json' \ --data '{"text":"Test message"}' \ YOUR_SLACK_WEBHOOK_URL ``` **Common Causes:** - Invalid Slack webhook URL - Slack webhook disabled (`enabled: false`) - Network connectivity issues ### 5. Rate Limiting Active **Symptoms:** ```bash # Controller logs show: # "Event processing rate limited" and "cooldown active for Xm Ys more" ``` **This is Expected Behavior:** Rate limiting prevents spam processing of duplicate events. The default settings are: - `eventsPerMinute: 5` - `cooldownMinutes: 15` **To Adjust:** Modify your RemediationPolicy: ```yaml rateLimiting: eventsPerMinute: 10 # Increase if needed cooldownMinutes: 5 # Decrease if needed ``` ### 6. 
MCP Analysis Failures **Symptoms:** ```bash # Controller logs show: # "MCP remediation failed" or "McpRemediationFailed" events ``` **Diagnosis:** ```bash # Check MCP logs for detailed error messages kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50 # Check RemediationPolicy status kubectl describe remediationpolicies --namespace dot-ai ``` **Common Causes:** - Invalid Anthropic API key - API rate limits exceeded - Network connectivity to Anthropic services - Malformed event data ### 7. ResourceSyncConfig Not Syncing **Symptoms:** ```bash # ResourceSyncConfig status shows syncErrors or not active kubectl get resourcesyncconfigs --output yaml ``` **Diagnosis:** ```bash # Check ResourceSyncConfig status kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status}' # Check controller logs for sync errors kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "resourcesync\|sync" # Verify MCP endpoint is reachable kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \ curl -v http://dot-ai-mcp.dot-ai.svc.cluster.local:3456/api/v1/resources/sync ``` **Common Causes:** - MCP resource sync endpoint not available - Wrong `mcpEndpoint` URL in ResourceSyncConfig - Network policies blocking communication - RBAC permissions missing for resource discovery **Solution:** ```bash # Verify the MCP endpoint URL is correct kubectl get resourcesyncconfigs --output jsonpath='{.items[*].spec.mcpEndpoint}' # Check if watcher is active kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.active}' # Check watched resource types count kubectl get resourcesyncconfigs --output jsonpath='{.items[*].status.watchedResourceTypes}' ``` ### 8. 
CapabilityScanConfig Not Scanning **Symptoms:** ```bash # CapabilityScanConfig status shows errors or not ready kubectl get capabilityscanconfigs --output yaml ``` **Diagnosis:** ```bash # Check CapabilityScanConfig status kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status}' # Check controller logs for scan errors kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "capabilityscan" # Verify auth secret exists kubectl get secret dot-ai-secrets --namespace dot-ai ``` **Common Causes:** - MCP endpoint not available - Wrong `mcp.endpoint` URL in CapabilityScanConfig - Missing or invalid `mcp.authSecretRef` secret - Resource filters excluding all resources **Solution:** ```bash # Verify the MCP endpoint URL is correct kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.mcp.endpoint}' # Check if initial scan completed kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.initialScanComplete}' # Check last error kubectl get capabilityscanconfigs --output jsonpath='{.items[*].status.lastError}' # Verify include/exclude filters aren't too restrictive kubectl get capabilityscanconfigs --output jsonpath='{.items[*].spec.includeResources}' ``` ### 9. 
GitKnowledgeSource Not Syncing **Symptoms:** ```bash # GitKnowledgeSource status shows errors or Synced condition is False kubectl get gitknowledgesources --output yaml ``` **Diagnosis:** ```bash # Check GitKnowledgeSource status kubectl get gitknowledgesources -n dot-ai -o jsonpath='{.items[*].status}' # Check controller logs for sync errors kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai | grep -i "gitknowledge\|clone" # Verify MCP endpoint is reachable kubectl exec --namespace dot-ai deployment/dot-ai-controller-manager -- \ curl -v http://dot-ai.dot-ai.svc:3456/health ``` **Common Causes:** - **CloneError with "read-only file system"**: Controller deployment missing `/tmp` volume mount - **Authentication failure**: Invalid or missing token for private repositories - **MCP unreachable**: Wrong MCP server URL or network issues - **Invalid path patterns**: Glob patterns not matching any files **Solution:** ```bash # Check for read-only filesystem error (needs /tmp volume) kubectl get gitknowledgesources -n dot-ai -o jsonpath='{.items[*].status.lastError}' # Verify the controller has a /tmp volume mounted kubectl get deployment dot-ai-controller-manager -n dot-ai -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}' # If missing, patch to add /tmp volume: kubectl patch deployment dot-ai-controller-manager -n dot-ai --type='json' -p='[ {"op": "add", "path": "/spec/template/spec/volumes", "value": [{"name": "tmp-dir", "emptyDir": {}}]}, {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts", "value": [{"name": "tmp-dir", "mountPath": "/tmp"}]} ]' # For private repo auth issues, verify the secret exists and decode its token kubectl get secret <secret-name> -n dot-ai -o jsonpath='{.data.<token-key>}' | base64 -d ``` ### 10.
ResourceSync High Traffic or Performance Issues **Symptoms:** - High CPU/memory usage on controller - Frequent sync requests to MCP - Slow cluster performance **Diagnosis:** ```bash # Check sync frequency and resource counts kubectl get resourcesyncconfigs --output yaml | grep -A5 status # Check debounce and resync settings kubectl get resourcesyncconfigs --output yaml | grep -E "debounceWindowSeconds|resyncIntervalMinutes" ``` **Solution:** Adjust debounce and resync intervals in your ResourceSyncConfig: ```yaml spec: debounceWindowSeconds: 30 # Increase to batch more changes resyncIntervalMinutes: 120 # Increase to reduce full resyncs ``` ## Getting Help ### Collect Diagnostic Information When reporting issues, include this diagnostic information: ```bash # Controller status and logs kubectl get pods --namespace dot-ai kubectl logs --selector app.kubernetes.io/name=dot-ai-controller --namespace dot-ai --tail 100 # MCP status and logs kubectl logs --namespace dot-ai --selector app.kubernetes.io/name=dot-ai --tail 50 # RemediationPolicy configuration kubectl get remediationpolicies --namespace dot-ai --output yaml # ResourceSyncConfig configuration and status kubectl get resourcesyncconfigs --all-namespaces --output yaml # CapabilityScanConfig configuration and status kubectl get capabilityscanconfigs --all-namespaces --output yaml # Recent events kubectl get events --namespace dot-ai --sort-by='.lastTimestamp' --field-selector type=Warning ``` ### Enable Debug Logging For more detailed troubleshooting, you can increase log verbosity: ```bash # Edit the controller deployment to add debug flags kubectl patch deployment dot-ai-controller-manager --namespace dot-ai --patch=' { "spec": { "template": { "spec": { "containers": [ { "name": "manager", "args": ["--leader-elect", "--health-probe-bind-address=:8081", "-v=2"] } ] } } } }' ``` ## Resource Requirements The default resource limits are: **Controller:** - Limits: 500m CPU, 128Mi memory - Requests: 10m CPU, 64Mi 
memory **MCP:** - Limits: 1 CPU, 2Gi memory - Requests: 200m CPU, 512Mi memory These should be sufficient for most use cases, but may need adjustment for high-volume environments. ## Web UI # DevOps AI Toolkit Web UI **Kubernetes dashboard with AI-powered operations and rich visualizations for the DevOps AI Toolkit MCP server.** **DevOps AI Toolkit Web UI** - Rich visualizations for AI-powered DevOps workflows. # Kubernetes Setup Guide **Deploy the DevOps AI Toolkit Web UI to Kubernetes using the Helm chart.** > **Recommended**: For the easiest setup, install the complete dot-ai stack, which includes all components (MCP server, Web UI, and Controller). See the [Stack Installation Guide](/docs/stack). > > Continue below if you want to install this component individually (for non-Kubernetes setups or granular control). ## Prerequisites - Kubernetes cluster (1.19+) with kubectl access - Helm 3.x installed - [dot-ai MCP server](/docs/ai-engine/setup/deployment) deployed and accessible - Auth token for the dot-ai MCP server ## Quick Start ### Step 1: Set Environment Variables ```bash # Set the version from https://github.com/vfarcic/dot-ai-ui/pkgs/container/dot-ai-ui%2Fcharts%2Fdot-ai-ui export DOT_AI_UI_VERSION="..." # Use the same auth token as your dot-ai MCP server export DOT_AI_AUTH_TOKEN="your-dot-ai-auth-token" # Token for UI login export DOT_AI_UI_AUTH_TOKEN="your-ui-access-token" # Ingress class - change to match your ingress controller (traefik, haproxy, etc.)
export INGRESS_CLASS_NAME="nginx" ``` ### Step 2: Install the Web UI ```bash helm install dot-ai-ui \ oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION \ --set dotAi.url="http://dot-ai:3456" \ --set dotAi.auth.token="$DOT_AI_AUTH_TOKEN" \ --set uiAuth.token="$DOT_AI_UI_AUTH_TOKEN" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai-ui.127.0.0.1.nip.io" \ --namespace dot-ai \ --wait ``` **Notes**: - Replace `dot-ai-ui.127.0.0.1.nip.io` with your desired hostname. - The `dotAi.url` should point to your dot-ai MCP server service. If deployed in the same namespace with default settings, `http://dot-ai:3456` works. - The chart includes nginx timeout annotations by default (10 min). **If using a different ingress controller**, override the annotations (see [Ingress Timeout Configuration](#ingress-timeout-configuration)). - For all available configuration options, see the [Helm values file](https://github.com/vfarcic/dot-ai-ui/blob/main/charts/values.yaml). ### Step 3: Configure dot-ai MCP Server Update your dot-ai MCP server to include visualization URLs in responses: ```bash helm upgrade dot-ai-mcp oci://ghcr.io/vfarcic/dot-ai/charts/dot-ai:$DOT_AI_VERSION \ --set webUi.baseUrl="http://dot-ai-ui.127.0.0.1.nip.io" \ --namespace dot-ai \ --reuse-values ``` ### Step 4: Verify Installation Open your browser and navigate to the Web UI hostname. You should see the DevOps AI Toolkit Web UI home page. 
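As an alternative to the long `--set` list in Step 2, the same settings can be kept in a local values file (a sketch; the filename is arbitrary, and the tokens and hostname are placeholders you must replace — the parameter paths match the Configuration Reference below):

```yaml
# dot-ai-ui-values.yaml (illustrative filename)
dotAi:
  url: http://dot-ai:3456          # dot-ai MCP server service URL
  auth:
    token: your-dot-ai-auth-token  # same token as the MCP server
uiAuth:
  token: your-ui-access-token      # token users enter to log in to the UI
ingress:
  enabled: true
  className: nginx
  host: dot-ai-ui.127.0.0.1.nip.io
```

Then install with `helm install dot-ai-ui oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION --values dot-ai-ui-values.yaml --namespace dot-ai --wait`.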
## Configuration Reference | Parameter | Description | Default | |-----------|-------------|---------| | `annotations` | Global annotations applied to all resources (e.g., `reloader.stakater.com/auto: "true"`) | `{}` | | `image.repository` | Container image repository | `ghcr.io/vfarcic/dot-ai-ui` | | `image.tag` | Container image tag | Chart appVersion | | `image.pullPolicy` | Image pull policy | `IfNotPresent` | | `dotAi.url` | URL of the dot-ai MCP server | `http://dot-ai:3456` | | `dotAi.auth.secretRef.name` | Name of existing secret with auth token | `dot-ai-secrets` | | `dotAi.auth.secretRef.key` | Key in existing secret | `auth-token` | | `dotAi.auth.token` | Auth token (if not using secretRef) | `""` | | `uiAuth.secretRef.name` | Existing secret with UI auth token | `""` | | `uiAuth.secretRef.key` | Key in existing secret | `ui-auth-token` | | `uiAuth.token` | UI auth token (if not using secretRef) | `""` | | `ingress.enabled` | Enable Ingress resource | `false` | | `ingress.className` | Ingress class name | `nginx` | | `ingress.host` | Ingress hostname | `dot-ai-ui.127.0.0.1.nip.io` | | `ingress.annotations` | Ingress annotations (includes nginx timeouts by default) | nginx timeout annotations | | `ingress.tls.enabled` | Enable TLS | `false` | | `ingress.tls.secretName` | TLS secret name | `""` | | `ingress.tls.clusterIssuer` | cert-manager ClusterIssuer | `""` | | `resources.requests.memory` | Memory request | `128Mi` | | `resources.requests.cpu` | CPU request | `50m` | | `resources.limits.memory` | Memory limit | `256Mi` | | `resources.limits.cpu` | CPU limit | `200m` | ## Using Secret Reference For production, reference existing secrets instead of passing tokens directly: ```bash helm install dot-ai-ui \ oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION \ --set dotAi.url="http://dot-ai:3456" \ --set dotAi.auth.secretRef.name="dot-ai-secrets" \ --set dotAi.auth.secretRef.key="auth-token" \ --set uiAuth.secretRef.name="dot-ai-secrets" \ 
--set uiAuth.secretRef.key="ui-auth-token" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai-ui.example.com" \ --namespace dot-ai \ --wait ``` ## TLS Configuration Enable HTTPS with cert-manager: ```bash helm install dot-ai-ui \ oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION \ --set dotAi.url="http://dot-ai:3456" \ --set dotAi.auth.secretRef.name="dot-ai-secrets" \ --set uiAuth.secretRef.name="dot-ai-secrets" \ --set uiAuth.secretRef.key="ui-auth-token" \ --set ingress.enabled=true \ --set ingress.className="$INGRESS_CLASS_NAME" \ --set ingress.host="dot-ai-ui.example.com" \ --set ingress.tls.enabled=true \ --set ingress.tls.clusterIssuer="letsencrypt" \ --namespace dot-ai \ --wait ``` Then update your `.mcp.json` URL to use `https://`. ## Ingress Timeout Configuration The Web UI proxies requests to the MCP server for AI-powered operations (query, remediate, operate, recommend) that can take several minutes. The chart includes **nginx timeout annotations by default** (10 minutes). If you use a different ingress controller, override `ingress.annotations` with the appropriate settings: ### Nginx (default) ```yaml ingress: annotations: nginx.ingress.kubernetes.io/proxy-read-timeout: "600" nginx.ingress.kubernetes.io/proxy-send-timeout: "600" ``` ### Traefik ```yaml ingress: className: traefik annotations: traefik.ingress.kubernetes.io/router.middlewares: dot-ai-timeout@kubernetescrd ``` > **Note**: Traefik requires a separate `Middleware` resource for timeout configuration. See the [Traefik docs](https://doc.traefik.io/traefik/middlewares/http/headers/). 
### HAProxy ```yaml ingress: className: haproxy annotations: haproxy.org/timeout-http-request: "600s" ``` ### AWS ALB ```yaml ingress: className: alb annotations: alb.ingress.kubernetes.io/target-group-attributes: idle_timeout.timeout_seconds=600 ``` ## Gateway API (Alternative to Ingress) For Kubernetes 1.26+ with Gateway API support, you can use HTTPRoute instead of Ingress. ### Prerequisites - Kubernetes 1.26+ cluster - Gateway API CRDs installed - Gateway controller running (Istio, Envoy Gateway, Kong, etc.) - Existing Gateway resource ### Reference Existing Gateway ```bash helm install dot-ai-ui \ oci://ghcr.io/vfarcic/dot-ai-ui/charts/dot-ai-ui:$DOT_AI_UI_VERSION \ --set dotAi.url="http://dot-ai:3456" \ --set dotAi.auth.secretRef.name="dot-ai-secrets" \ --set uiAuth.secretRef.name="dot-ai-secrets" \ --set uiAuth.secretRef.key="ui-auth-token" \ --set ingress.enabled=false \ --set gateway.name="cluster-gateway" \ --set gateway.namespace="gateway-system" \ --namespace dot-ai \ --wait ``` ### Gateway Configuration Reference | Parameter | Description | Default | |-----------|-------------|---------| | `gateway.name` | Existing Gateway name to reference | `""` | | `gateway.namespace` | Gateway namespace (for cross-namespace) | `""` | | `gateway.create` | Create new Gateway (dev/testing only) | `false` | | `gateway.className` | GatewayClass name (when create=true) | `""` | | `gateway.annotations` | Annotations for Gateway (when create=true) | `{}` | | `gateway.timeouts.request` | Max time for entire request (HTTPRoute) | `"600s"` | | `gateway.timeouts.backendRequest` | Max time waiting for backend response (HTTPRoute) | `"600s"` | | `gateway.listeners.http.enabled` | Enable HTTP listener on port 80 | `true` | | `gateway.listeners.http.hostname` | Hostname for HTTP listener | `""` | | `gateway.listeners.https.enabled` | Enable HTTPS listener on port 443 | `false` | | `gateway.listeners.https.hostname` | Hostname for HTTPS listener | `""` | | 
`gateway.listeners.https.secretName` | TLS secret name for HTTPS | `""` | ### Timeout Configuration The chart sets HTTPRoute timeouts to **10 minutes** by default (`gateway.timeouts.request` and `gateway.timeouts.backendRequest`), which accommodates AI-powered operations. You can override these values: ```bash helm install dot-ai-ui ... \ --set gateway.timeouts.request="1800s" \ --set gateway.timeouts.backendRequest="1800s" ``` **Not all Gateway controllers support HTTPRoute timeouts.** For example, GKE's Gateway controller rejects them with error GWCER104. To disable HTTPRoute timeouts, set both values to empty strings: ```bash helm install dot-ai-ui ... \ --set gateway.timeouts.request="" \ --set gateway.timeouts.backendRequest="" ``` ### Cloud Provider Considerations Some cloud providers replace HTTPRoute timeouts with provider-specific policy resources, which may also govern related gateway behavior: - **Timeouts on GKE**: GKE does **not** support HTTPRoute timeouts (GWCER104). You must disable them (set to `""`) and use `GCPBackendPolicy` instead (see below). - **Health check configuration**: Custom health check intervals or thresholds. - **Security policies**: WAF rules, rate limiting at the gateway level. **GKE Example**: Disable HTTPRoute timeouts and create a `GCPBackendPolicy` for timeout configuration: ```yaml apiVersion: networking.gke.io/v1 kind: GCPBackendPolicy metadata: name: dot-ai-ui namespace: dot-ai spec: default: timeoutSec: 3600 targetRef: group: "" kind: Service name: dot-ai-ui ``` Consult your cloud provider's Gateway API documentation for equivalent configurations on other platforms. ## Additional Configuration | Parameter | Description | Default | |-----------|-------------|---------| | `extraEnv` | Additional environment variables | `[]` |
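A sketch of an `extraEnv` entry, assuming it follows the standard Kubernetes container `env` list format (the variable shown is illustrative, not one the UI is guaranteed to read):

```yaml
extraEnv:
  - name: HTTPS_PROXY                     # hypothetical example variable
    value: "http://proxy.internal:3128"
```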