Observability Guide

Complete guide for distributed tracing and observability in the DevOps AI Toolkit MCP server.

Overview

What it does: Provides OpenTelemetry-based distributed tracing for debugging complex workflows, measuring AI provider performance, and understanding Kubernetes operation latency.

Use when: You need to understand where time is spent in multi-step workflows, debug performance issues, or monitor AI/Kubernetes operations in production.

📖 Full Guide: This document covers tracing setup, configuration, backend integration, and trace interpretation specific to the DevOps AI Toolkit.

What is Distributed Tracing?

For general background on distributed tracing concepts and OpenTelemetry, see the official OpenTelemetry documentation.

This guide focuses on DevOps AI Toolkit-specific tracing implementation, configuration, and usage patterns.

Prerequisites

  • DevOps AI Toolkit MCP server configured (see MCP Setup)
  • Basic understanding of distributed tracing concepts (optional but helpful)
  • Backend for viewing traces (Jaeger, Grafana Tempo, vendor service) or use console output

Quick Start

Environment Variables

Add tracing environment variables to your MCP client configuration (see MCP Setup for how to configure environment variables).

Variable                      Required             Default       Description
OTEL_TRACING_ENABLED          Yes                  false         Enable/disable tracing
OTEL_SERVICE_NAME             No                   dot-ai-mcp    Service name in traces
OTEL_EXPORTER_TYPE            No                   console       Exporter type: console, otlp, jaeger, zipkin
OTEL_EXPORTER_OTLP_ENDPOINT   Required for OTLP    -             OTLP endpoint URL (e.g., http://localhost:4318/v1/traces)
OTEL_SAMPLING_PROBABILITY     No                   1.0           Sampling rate: 0.0 to 1.0 (1.0 = 100%, 0.1 = 10%)
OTEL_DEBUG                    No                   false         Enable debug logging for tracing
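
For example, a minimal configuration that prints spans to the console (variable names come from the table above; the values shown are illustrative) could look like this:

OTEL_TRACING_ENABLED=true
OTEL_EXPORTER_TYPE=console
OTEL_SERVICE_NAME=dot-ai-mcp
OTEL_SAMPLING_PROBABILITY=1.0

The console exporter writes spans to the server's output, which makes it a convenient first check before wiring up a tracing backend.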

Verify Tracing Status

After configuring and restarting your MCP client, verify tracing status:

User: Show me the system status

Agent: The system is healthy and all components are operational:

...

Tracing: Enabled
- Exporter: console
- Service Name: dot-ai-mcp
- Status: initialized

The agent will report tracing configuration as part of the system status.

What Gets Traced

The DevOps AI Toolkit automatically traces all operations without requiring code changes:

MCP Tool Execution

  • All MCP tools (recommendations, remediation, capability management, etc.)
  • Tool parameters and execution duration
  • Success/failure status
  • Session IDs for workflow correlation

AI Provider Operations

  • Chat completions: Claude, OpenAI, Google, xAI, and custom endpoints
  • Tool loop iterations: Multi-step AI workflows with per-iteration visibility
  • Embeddings generation: Vector embeddings for semantic search
  • Token usage: Input tokens, output tokens, cache metrics
  • Model information: Provider names and specific model versions

Kubernetes Operations

  • API client calls: All Kubernetes API operations through the client library
  • kubectl commands: CLI command execution with operation details
  • Resource information: Resource types, namespaces, and operation latency

Vector Database Operations

  • Search queries: Semantic and keyword searches with result counts
  • Document operations: Upserts, deletions, and retrievals
  • Collection management: Collection operations and health checks
  • Performance metrics: Query latency and result quality scores

Backend Integration

Jaeger

Jaeger is an open-source distributed tracing platform. Run Jaeger locally with Docker:

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
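
If you prefer Docker Compose, a minimal sketch equivalent to the command above (the file and service names are placeholders) might look like this:

# docker-compose.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4318:4318"     # OTLP over HTTP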

Configure the MCP server to send traces to Jaeger:

OTEL_TRACING_ENABLED=true
OTEL_EXPORTER_TYPE=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318/v1/traces

Access the Jaeger UI at http://localhost:16686 to view traces.

Other Backends

Any tracing backend that supports the OpenTelemetry Protocol (OTLP) should work with the same configuration pattern:

OTEL_TRACING_ENABLED=true
OTEL_EXPORTER_TYPE=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=<your-backend-otlp-endpoint>

Refer to your backend's documentation for the specific OTLP endpoint URL.
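
For example, a Grafana Tempo deployment with its OTLP HTTP receiver enabled typically listens on port 4318; the hostname and port below are assumptions, so check your Tempo configuration for the actual endpoint:

OTEL_TRACING_ENABLED=true
OTEL_EXPORTER_TYPE=otlp
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4318/v1/traces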

Viewing Traces

Jaeger UI

Open Jaeger UI at http://localhost:16686 (if using local Jaeger setup).

Finding Traces:

  1. Select dot-ai-mcp from the Service dropdown
  2. Click "Find Traces" button
  3. View list of recent traces with duration and span count

Trace Details:

  • Click on a trace to see the complete request flow
  • Spans are displayed in a waterfall timeline showing parent-child relationships
  • Each span shows operation name, duration, and timing relative to the trace start
  • Click on individual spans to see detailed attributes
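
Traces can also be retrieved without the UI. Jaeger's query service exposes the HTTP API that backs the UI, which is handy for scripting; a quick sketch against the local setup above (the query parameters are assumptions based on that API):

# List recent traces for the dot-ai-mcp service (returns JSON)
curl "http://localhost:16686/api/traces?service=dot-ai-mcp&limit=20"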

Understanding Trace Information

Tool Execution Span:

  • Operation name: execute_tool <tool-name>
  • Shows total time for tool execution
  • Contains session ID and tool parameters

AI Provider Spans:

  • Operation names: chat <model>, tool_loop <model>, embeddings <model>
  • Token usage: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
  • Cache metrics: gen_ai.usage.cache_read_tokens, gen_ai.usage.cache_creation_tokens
  • Model details: gen_ai.request.model, gen_ai.provider.name
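
As an illustration, the attributes recorded on a single chat span might look like the following (the attribute names are the ones listed above; all values are hypothetical):

chat <model>
  gen_ai.provider.name               = anthropic
  gen_ai.request.model               = <model-version>
  gen_ai.usage.input_tokens          = 1250
  gen_ai.usage.output_tokens         = 340
  gen_ai.usage.cache_read_tokens     = 0
  gen_ai.usage.cache_creation_tokens = 0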

Kubernetes Operation Spans:

  • Operation names: API method names or kubectl <command>
  • Attributes: k8s.api, k8s.method, k8s.operation, k8s.resource
  • Shows latency for Kubernetes API calls

Vector Database Spans:

  • Operation names: search, upsert, delete, list, etc.
  • Attributes: db.operation.name, db.collection.name
  • Result metrics: db.query.result_count, db.vector.top_score

Trace Hierarchy

All spans from a single tool invocation share the same trace ID and follow this hierarchy:

execute_tool <tool-name>                      (root span)
├── chat <model>                              (AI operation)
│   └── POST https://api.anthropic.com        (HTTP call)
├── search                                    (vector DB query)
│   └── POST http://localhost:6333            (HTTP call)
└── k8s.listNamespacedDeployment              (Kubernetes API)
    └── GET https://kubernetes/apis/apps      (HTTP call)

This hierarchy helps identify which operations are taking the most time and where bottlenecks occur.
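
Once you understand where time is spent, you may not need to capture every trace in production. The sampling probability from the Quick Start table can be lowered, for example:

OTEL_SAMPLING_PROBABILITY=0.1

Per the table above, this keeps roughly 10% of traces while reducing tracing overhead.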