# SRE Agent - Multi-Agent Site Reliability Engineering Assistant
## Table of Contents
- [Overview](#overview)
- [Key Features](#key-features)
- [Architecture](#architecture)
- [Specialized Agents](#specialized-agents)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Configuration](#configuration)
- [Demo Environment](#demo-environment)
- [Example Use Cases](#example-use-cases)
- [Production Deployment](#production-deployment)
- [Verification of Results](#verification-of-results)
- [Development](#development)
- [License](#license)
## Overview
The SRE Agent is a multi-agent system for Site Reliability Engineers that helps investigate infrastructure issues. Built on the Model Context Protocol (MCP) and powered by Amazon Nova and Anthropic Claude models (Claude can be accessed through Amazon Bedrock or directly through Anthropic), this system uses specialized AI agents that collaborate to investigate issues, analyze logs, monitor performance metrics, and execute operational procedures. The AgentCore Gateway provides access to data sources and systems available as MCP tools.
The system provides an intelligent assistant that understands context, investigates issues, and provides actionable insights. It integrates with existing infrastructure APIs and provides a conversational interface for infrastructure operations.
## Key Features
**Multi-Agent Orchestration**
- Specialized agents collaborate on infrastructure investigations
- Each agent focuses on specific domains: Kubernetes, logs, metrics, or operational procedures
- Real-time streaming shows investigation progress and agent reasoning
**Conversational Interface**
- Single-query investigations for quick answers
- Interactive multi-turn conversations with context preservation
- Natural language queries for infrastructure operations
**Integration and Security**
- AgentCore Gateway provides secure API access with authentication
- Health monitoring and retry logic for reliable operations
- Unified interface abstracts multiple backend systems
**Documentation and Reporting**
- Markdown reports generated for each investigation
Built on LangGraph, the core orchestrates multiple specialized agents. The Supervisor Agent coordinates investigations by analyzing queries and routing them to appropriate specialist agents. Each agent accesses domain-specific tools through the MCP protocol to interact with infrastructure APIs.
The collaboration model enables complex investigations across multiple domains. For example, investigating a pod failure might involve the Kubernetes Agent identifying resource constraints, the Metrics Agent providing historical analysis, and the Logs Agent correlating errors.
#### AgentCore Gateway
The gateway provides secure communication between AI agents and infrastructure APIs. Built on the Model Context Protocol (MCP), it offers:
- Standardized interface for tool discovery and execution
- Authentication through bearer tokens
- Circuit breakers for resilience
- Health monitoring for reliable operations
- Protocol translation and retry logic
#### Infrastructure Integration
The system works with both demo and production environments:
**Demo Environment**: Four specialized API servers simulate Kubernetes clusters, log aggregation systems, metrics databases, and operational procedures. This allows evaluation without impacting production systems.
**Production Environment**: The same agents connect to real infrastructure APIs through the AgentCore Gateway. Gateway configuration can be updated to point to production endpoints while maintaining the same agent logic.
## Specialized Agents
### Kubernetes Infrastructure Agent
Handles container orchestration and cluster operations. This agent investigates issues across pods, deployments, services, and nodes by examining cluster state, analyzing pod health, resource utilization, and recent events.
**Capabilities:**
- Check pod status across namespaces
- Examine deployment configurations and rollout history
- Investigate cluster events for anomalies
- Analyze resource usage patterns
- Monitor node health and capacity
### Application Logs Agent
Processes log data to find relevant information. This agent understands log patterns, identifies anomalies, and correlates events across multiple services.
**Capabilities:**
- Full-text search with regex support
- Error log aggregation and categorization
- Pattern detection for recurring issues
- Time-based correlation of events
- Statistical analysis of log volumes
### Performance Metrics Agent
Monitors system metrics and identifies performance issues. This agent understands relationships between different metrics and provides both real-time analysis and historical trending.
Provides access to documented procedures, troubleshooting guides, and best practices. This agent helps standardize incident response by retrieving relevant procedures based on the current situation.
**Capabilities:**
- Incident-specific playbooks for common scenarios
- Detailed troubleshooting guides with step-by-step instructions
> **⚠️ IMPORTANT:** Amazon Bedrock AgentCore Gateway **only works with HTTPS endpoints**. You must have valid SSL certificates for your backend servers.
Before installing the SRE Agent, ensure you have the following prerequisites in place:
#### System Requirements
- **Python 3.12 or higher** for optimal performance and compatibility
- **`uv` package manager** for fast, reliable Python package management
- **EC2 Instance**: We recommend a `t3.xlarge` instance or larger for running the SRE Agent backend servers
#### SSL Certificate Requirements
You **must** have valid SSL certificates for the machine running this solution. The AgentCore Gateway requires HTTPS endpoints for all backend services.
**SSL Certificate Setup:**
You need an HTTPS certificate and private key to proceed. If you use `your-backend-domain.com` as the domain for your backend servers, then you will need an SSL certificate for `your-backend-domain.com` and your services will be accessible to AgentCore Gateway as `https://your-backend-domain.com:8011`, `https://your-backend-domain.com:8012`, etc.
**Recommended Setup:**
1.**Use an EC2 instance** (recommended: `t3.xlarge`) with a public IP and domain name
2.**Generate SSL certificates using Let's Encrypt, no-ip, or other similar services**
3.**Place the SSL certificate and private key files** in `/etc/ssl/certs/` and `/etc/ssl/private/` folders respectively on your EC2 machine
Before starting the backend servers, you **must** update the OpenAPI specification files in `backend/openapi_specs/` to replace `https://your-backend-domain.com` with your actual domain name for which you have the SSL certificate. This is required because AgentCore needs a publicly reachable domain to connect to the backend services.
Update these files:
-`backend/openapi_specs/k8s_api.yaml`
-`backend/openapi_specs/logs_api.yaml`
-`backend/openapi_specs/metrics_api.yaml`
-`backend/openapi_specs/runbooks_api.yaml`
Replace `https://your-backend-domain.com:XXXX` with your actual domain (e.g., `https://mydomain.com:8011`).
#### AI Model Access
You'll need either:
- **Anthropic API key** for direct Claude access, OR
- **AWS credentials** configured for Amazon Bedrock access
The system is designed to work with Claude 3.5 Sonnet, which provides the best balance of performance and cost for infrastructure operations.
The fastest way to experience the SRE Agent is through our demo environment, which provides realistic mock data without requiring access to real infrastructure.
> **📋 Note:** Ensure you have completed the SSL certificate setup from the Prerequisites section before proceeding.
sre-agent --prompt "What's the status of the database pods?"
```
The system will orchestrate multiple agents to investigate your query, showing real-time progress as each agent contributes its findings. You'll see the Kubernetes Agent checking pod status, the Logs Agent searching for errors, and other agents collaborating to provide a comprehensive analysis.
## Usage
### Interactive Mode
The interactive mode provides a conversational interface for ongoing investigations. This mode maintains context across multiple queries, allowing you to drill deeper into issues and ask follow-up questions based on previous findings:
```bash
sre-agent --interactive
# Available commands in interactive mode:
# /help - Show available commands
# /agents - List available specialist agents
# /history - Show conversation history
# /save - Save the current conversation
# /clear - Clear conversation history
# /exit - Exit the interactive session
```
Interactive mode is particularly useful for complex investigations where you need to explore multiple hypotheses or when the initial findings suggest deeper issues that require further analysis.
### Single Query Mode
For quick investigations or automation scenarios, you can use single query mode to get immediate answers:
```bash
# Investigate specific pod issues
sre-agent --prompt "Why are the payment-service pods crash looping?"
# Analyze performance degradation
sre-agent --prompt "Investigate high latency in the API gateway over the last hour"
# Search for error patterns
sre-agent --prompt "Find all database connection errors in the last 24 hours"
# Check healthy service status
sre-agent --prompt "How is the product catalog service performing?"
```
Each query triggers a complete investigation cycle, with the Supervisor Agent coordinating specialist agents to gather relevant information and provide actionable insights.
The SRE Agent uses environment variables for sensitive configuration values. Create a `.env` file in the `sre_agent/` directory with the following required variables:
```bash
# Required: API key for Claude model access
# For Anthropic direct access:
ANTHROPIC_API_KEY=sk-ant-api-key-here
# For Amazon Bedrock access:
AWS_DEFAULT_REGION=us-east-1
AWS_PROFILE=your-profile-name # Or use AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
# Required: AgentCore Gateway authentication
GATEWAY_ACCESS_TOKEN=your-gateway-token-here # Generated by gateway setup
DEBUG=false # Enable debug mode for verbose output
```
**Note**: The SRE Agent looks for the `.env` file in the `sre_agent/` directory, not the project root. This allows for modular configuration management.
### Agent Configuration
The agent behavior is configured through `sre_agent/config/agent_config.yaml`. This file defines the mapping between agents and their available tools, as well as LLM parameters:
```yaml
# Agent to tool mapping
agents:
kubernetes_agent:
name: "Kubernetes Infrastructure Agent"
description: "Specializes in Kubernetes operations and troubleshooting"
tools:
- get_pod_status
- get_deployment_status
- get_cluster_events
- get_resource_usage
- get_node_status
logs_agent:
name: "Application Logs Agent"
description: "Expert in log analysis and pattern detection"
tools:
- search_logs
- get_error_logs
- analyze_log_patterns
- get_recent_logs
- count_log_events
metrics_agent:
name: "Performance Metrics Agent"
description: "Analyzes performance metrics and trends"
tools:
- get_performance_metrics
- get_error_rates
- get_resource_metrics
- get_availability_metrics
- analyze_trends
runbooks_agent:
name: "Operational Runbooks Agent"
description: "Provides operational procedures and guides"
tools:
- search_runbooks
- get_incident_playbook
- get_troubleshooting_guide
- get_escalation_procedures
- get_common_resolutions
# Global tools available to all agents
global_tools:
- get_current_time # Local tool for timestamp operations
# LLM configuration
llm:
provider: anthropic # Options: anthropic, bedrock
model_id: claude-3-5-sonnet-20241022
temperature: 0.1 # Low temperature for consistent technical responses
max_tokens: 4096 # Maximum response length
# Gateway configuration
gateway:
uri: "https://your-gateway-url.com" # Updated during setup
```
### Gateway Configuration
The AgentCore Gateway is configured through `gateway/config.yaml`. This configuration is managed by the setup scripts but can be customized:
```yaml
# AgentCore Gateway Configuration Template
# Copy this file to config.yaml and update with your environment-specific settings
gateway_description: "AgentCore Gateway for API Integration"
# Target Configuration
target_description: "S3 target for OpenAPI schema"
```
## Demo Environment
The SRE Agent includes a demo environment that simulates infrastructure operations. This allows you to explore the system's capabilities without connecting to production systems.
**Important Note**: The data in [`backend/data`](backend/data) is synthetically generated, and the backend directory contains stub servers that showcase how a real SRE agent backend could work. In a production environment, these implementations would need to be replaced with real implementations that connect to actual systems, use vector databases, and integrate with other data sources. This demo serves as an illustration of the architecture, where the backend components are designed to be plug-and-play replaceable.
# - Kubernetes API (port 8011): Simulates a K8s cluster with multiple namespaces
# - Logs API (port 8012): Provides searchable application logs with error injection
# - Metrics API (port 8013): Generates realistic performance metrics with anomalies
# - Runbooks API (port 8014): Serves operational procedures and troubleshooting guides
```
### Demo Scenarios
The demo environment includes several pre-configured scenarios that showcase the SRE Agent's capabilities:
**Database Pod Failure Scenario**: The demo includes failing database pods in the production namespace with associated error logs and resource exhaustion metrics. This scenario demonstrates how the agents collaborate to identify memory leaks as the root cause.
**API Gateway Latency Scenario**: Simulated high latency in the API gateway with corresponding slow query logs and CPU spikes. This showcases the system's ability to correlate issues across different data sources.
**Cascading Failure Scenario**: A complex scenario where a failing authentication service causes cascading failures across multiple services. This demonstrates the agent's ability to trace issues through distributed systems.
### Customizing Demo Data
The demo data is stored in JSON files under `backend/data/` and can be customized to match your specific use cases:
│ └── application_logs.json # Log entries with various severity levels
├── metrics_data/
│ └── performance_metrics.json # Time-series metrics data
└── runbooks_data/
└── runbooks.json # Operational procedures
```
### Stopping the Demo
```bash
# Stop all demo servers
cd backend
./scripts/stop_demo_backend.sh
```
## Example Use Cases
### Investigating Pod Failures
When database pods start failing, the SRE Agent orchestrates a comprehensive investigation:
```bash
sre-agent --prompt "Our database pods are crash looping in production"
```
The investigation unfolds systematically: The Supervisor Agent first engages the Kubernetes Agent to check pod status and recent events. Upon discovering OOMKilled events, it automatically involves the Metrics Agent to analyze memory usage trends over time. The Logs Agent searches for memory-related errors and identifies queries causing excessive memory allocation. Finally, the Runbooks Agent provides specific remediation steps, including query optimization guidelines and pod resource adjustments.
This investigation typically completes in under 30 seconds, providing actionable insights that would traditionally require manual correlation across multiple tools and dashboards.
### Diagnosing Performance Issues
For performance-related investigations, the agent system excels at correlating metrics across multiple dimensions:
```bash
sre-agent --prompt "API response times have degraded 3x in the last hour"
```
The multi-agent collaboration reveals the complete picture: The Metrics Agent identifies the exact time when latency increased and correlates it with CPU usage patterns. The Logs Agent discovers database connection pool exhaustion errors coinciding with the performance degradation. The Kubernetes Agent finds that a recent deployment increased replica count but not database connections. The Runbooks Agent provides the specific configuration changes needed to resolve the issue.
### Interactive Troubleshooting Session
For complex issues requiring iterative investigation, the interactive mode provides a powerful interface:
```bash
sre-agent --interactive
👤 You: We're seeing intermittent 502 errors from the payment service
[Deeper investigation into message processing rates and database performance]
💬 Final Response: Database analysis shows lock contention on the payments table
during batch processing. The account_balance update triggers are causing
row-level locks that block payment processing...
```
### Proactive Monitoring
The SRE Agent can also be used for proactive health checks and trend analysis:
```bash
# Morning health check
sre-agent --prompt "Perform a comprehensive health check of all production services"
# Capacity planning
sre-agent --prompt "Analyze resource utilization trends and predict when we'll need to scale"
# Security audit
sre-agent --prompt "Check for any suspicious patterns in authentication logs"
```
## Production Deployment
### Infrastructure Requirements
For production deployment, the SRE Agent can be deployed in multiple ways depending on your operational requirements:
**Self-Managed Deployment**: The system is designed to be stateless and can be deployed as a containerized service or directly on virtual machines. Minimum requirements include Python 3.12+ runtime environment, 4GB RAM for agent orchestration (8GB recommended for high-volume operations), network access to infrastructure APIs and Amazon Bedrock endpoints, and secure storage for API credentials and tokens.
**Amazon Bedrock AgentCore Runtime (Recommended)**: For a fully managed experience, the SRE Agent can be deployed through Amazon Bedrock AgentCore Runtime, which provides a serverless, auto-scaling environment. This eliminates infrastructure management overhead, automatically scales based on investigation workload, and provides built-in security, monitoring, and cost optimization. The AgentCore Runtime handles agent orchestration, memory management, and provides enterprise-grade availability without requiring manual capacity planning or server maintenance.
### Security Considerations
The SRE Agent handles sensitive infrastructure data and requires appropriate security measures. Implement API authentication using OAuth2 or API keys for all infrastructure endpoints. Use AWS IAM roles for Bedrock access instead of long-lived credentials. Enable TLS encryption for all API communications. Implement audit logging for all agent actions and investigations. Use secret management systems (AWS Secrets Manager) for credential storage.
## Verification of Results
The SRE Agent includes tools for verifying that investigation results are accurate and based on actual data rather than hallucinated information.
### Ground Truth Verification
For result verification, we provide a data dump utility that creates a comprehensive ground truth dataset:
```bash
# Generate complete data dump for verification
cd backend/scripts
./dump_data_contents.sh
```
This script processes all files in the [`backend/data`](backend/data) directory (including `.json`, `.txt`, and `.log` files) and creates a comprehensive dump at [`backend/data/all_data_dump.txt`](backend/data/all_data_dump.txt). This file serves as ground truth for verifying that agent responses are factual and not fabricated.
### Report Verification
The [`reports`](reports) folder contains investigation reports for several example queries. You can verify these reports against the ground truth data using the LLM-as-a-judge verification system:
>**⚠️ Important Note**: The system prompts and agent logic in [`sre_agent/agent_nodes.py`](sre_agent/agent_nodes.py) require further refinement before production use. This implementation demonstrates the architectural approach and provides a foundation for building production-ready SRE agents, but the prompts, error handling, and agent coordination logic need additional tuning for real-world reliability.
## Development
### Running Tests
The SRE Agent includes comprehensive test coverage to ensure reliability:
```bash
# Run all tests
pytest
# Run tests with coverage report
pytest --cov=sre_agent --cov-report=html
open htmlcov/index.html # View coverage report
# Run specific test categories
pytest tests/unit/ # Fast unit tests
pytest tests/integration/ # Integration tests with mocked APIs
pytest tests/e2e/ # End-to-end tests with demo backend
# Run tests in parallel for speed
pytest -n auto
# Run with verbose output for debugging
pytest -vv -s
```
### Code Quality
Maintain code quality using automated tools:
```bash
# Format code with black
black sre_agent/ tests/
# Check type hints with mypy
mypy sre_agent/
# Lint code with ruff
ruff check sre_agent/
# Run all quality checks
make quality
```
### Contributing
We welcome contributions to improve the SRE Agent. Before submitting a pull request, please:
1. Review existing issues and discussions to avoid duplicate work
2. Create an issue describing your proposed changes for discussion
3. Follow the existing code style and patterns
4. Add tests for new functionality
5. Update documentation as needed
6. Ensure all tests pass and quality checks succeed
For major changes, please open an RFC (Request for Comments) issue first to discuss the proposed architecture and implementation approach.
## License
The SRE Agent is open source software licensed under the MIT License. See the [LICENSE](../../../LICENSE) file for full details.