iSharkFly-Docs/amazon-bedrock-agentcore-samples

mirror of https://github.com/awslabs/amazon-bedrock-agentcore-samples.git synced 2025-09-08 20:50:46 +00:00


SRE Agent - Multi-Agent Site Reliability Engineering Assistant

Overview

The SRE Agent is a multi-agent system for Site Reliability Engineers that helps investigate infrastructure issues. Built on the Model Context Protocol (MCP) and powered by Amazon Nova and Anthropic Claude models (Claude can be accessed through Amazon Bedrock or directly through Anthropic), this system uses specialized AI agents that collaborate to investigate issues, analyze logs, monitor performance metrics, and execute operational procedures. The AgentCore Gateway provides access to data sources and systems available as MCP tools. This example also demonstrates how to deploy the agent using the Amazon Bedrock AgentCore Runtime for production environments.

Use case details

Information	Details
Use case type	Conversational
Agent type	Multi-agent
Use case components	Tools (MCP-based), observability (logs, metrics), operational runbooks
Use case vertical	DevOps/SRE
Example complexity	Advanced
SDK used	Amazon Bedrock AgentCore SDK, LangGraph, MCP

Assets

Asset	Description
Demo video 1 (SRE-Agent CLI, VSCode integration)	Walkthrough of the SRE Agent investigating and resolving infrastructure issues using CLI and VSCode
Demo video 2 (Cursor integration)	Demonstration of AgentCore Gateway with SRE tools integration with Cursor IDE
AI generated podcast	Audio discussion explaining the SRE Agent's capabilities and architecture

Use case Architecture

Use case key Features

Multi-Agent Orchestration: Specialized agents collaborate on infrastructure investigations with real-time streaming
Conversational Interface: Single-query investigations and interactive multi-turn conversations with context preservation
Long-term Memory Integration: Amazon Bedrock AgentCore Memory provides persistent user preferences and infrastructure knowledge across sessions
User Personalization: Tailored reports and escalation procedures based on individual user preferences and roles
MCP-based Integration: AgentCore Gateway provides secure API access with authentication and health monitoring
Specialized Agents: Four domain-specific agents for Kubernetes, logs, metrics, and operational procedures
Documentation and Reporting: Markdown reports generated for each investigation with audit trail

Detailed Documentation

For comprehensive information about the SRE Agent system, please refer to the following detailed documentation:

System Components - In-depth architecture and component explanations
Memory System - Long-term memory integration, user personalization, and cross-session learning
Configuration - Complete configuration guides for environment variables, agents, and gateway
Deployment Guide - Complete deployment guide for Amazon Bedrock AgentCore Runtime
Security - Security best practices and considerations for production deployment
Demo Environment - Demo scenarios, data customization, and testing setup
Example Use Cases - Detailed walkthroughs and interactive troubleshooting examples
Verification - Ground truth verification and report validation
Development - Testing, code quality, and contribution guidelines

Prerequisites

Requirement	Description
Python 3.12+ and `uv`	Python runtime and package manager. See use-case setup
Amazon EC2 Instance	Recommended: `t3.xlarge` or larger
Valid SSL certificates	⚠️ IMPORTANT: Amazon Bedrock AgentCore Gateway only works with HTTPS endpoints. For example, you can register your Amazon EC2 with no-ip.com and obtain a certificate from letsencrypt.org, or use any other domain registration and SSL certificate provider. You'll need the domain name as `BACKEND_DOMAIN` and certificate paths in the use-case setup section
EC2 instance port configuration	Required inbound ports (443, 8011-8014). See EC2 instance port configuration
IAM role with BedrockAgentCoreFullAccess policy	Required permissions and trust policy for AgentCore service. See IAM role with BedrockAgentCoreFullAccess policy
Identity Provider (IDP)	Amazon Cognito, Auth0, or Okta for JWT authentication. For automated Cognito setup, use `deployment/setup_cognito.sh`. See Authentication setup

Note: All prerequisites must be completed before proceeding to the use case setup. The setup will fail without proper SSL certificates, IAM permissions, and identity provider configuration.
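
Some of the prerequisites above can be checked locally before starting the setup. The following is a minimal sketch and not part of the repository: the function name `check_prerequisites` and its parameters are hypothetical. It covers only the Python version, the `BACKEND_DOMAIN` value, and the presence of the certificate files. It does not validate IAM permissions or the identity provider.

```python
from __future__ import annotations

import sys
from pathlib import Path


def check_prerequisites(
    backend_domain: str | None,
    keyfile: str,
    certfile: str,
    min_python: tuple[int, int] = (3, 12),
    python_version: tuple[int, int] | None = None,
) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems: list[str] = []

    # Python 3.12+ is listed as a requirement in the table above.
    version = python_version or tuple(sys.version_info[:2])
    if version < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {version[0]}.{version[1]}"
        )

    # The domain name is needed when generating the OpenAPI specs.
    if not backend_domain:
        problems.append("BACKEND_DOMAIN is not set")

    # The backend servers are started with a key file and a certificate file.
    for label, path in (("SSL key", keyfile), ("SSL certificate", certfile)):
        if not Path(path).is_file():
            problems.append(f"{label} not found: {path}")

    return problems
```

For example, `check_prerequisites(os.environ.get("BACKEND_DOMAIN"), "/opt/ssl/privkey.pem", "/opt/ssl/fullchain.pem")` returns a list that can be printed before continuing with the setup.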

Use case setup

Configuration Guide: For detailed information about all configuration files used in this project, see the Configuration Documentation.

# Clone the repository
git clone https://github.com/awslabs/amazon-bedrock-agentcore-samples
cd amazon-bedrock-agentcore-samples/02-use-cases/SRE-agent

# Create and activate a virtual environment
uv venv --python 3.12
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the SRE Agent and dependencies
uv pip install -e .

# Configure environment variables
cp .env.example sre_agent/.env
# Edit sre_agent/.env and add your Anthropic API key:
# ANTHROPIC_API_KEY=sk-ant-your-key-here

# Generate the OpenAPI specs: templates are filled in with your backend domain and saved as .yaml
BACKEND_DOMAIN=api.mycompany.com ./backend/openapi_specs/generate_specs.sh

# Get your EC2 instance private IP for server binding
TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600" -s)
PRIVATE_IP=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" \
  -s http://169.254.169.254/latest/meta-data/local-ipv4)

# Start the demo backend servers with SSL
cd backend
./scripts/start_demo_backend.sh \
  --host "$PRIVATE_IP" \
  --ssl-keyfile /opt/ssl/privkey.pem \
  --ssl-certfile /opt/ssl/fullchain.pem
cd ..

# Create and configure the AgentCore Gateway
cd gateway
./create_gateway.sh
./mcp_cmds.sh
cd ..

# Update the gateway URI in agent configuration
GATEWAY_URI=$(cat gateway/.gateway_uri)
sed -i "s|uri: \".*\"|uri: \"$GATEWAY_URI\"|" sre_agent/config/agent_config.yaml

# Copy the gateway access token to your .env file
sed -i '/^GATEWAY_ACCESS_TOKEN=/d' sre_agent/.env
echo "GATEWAY_ACCESS_TOKEN=$(cat gateway/.access_token)" >> sre_agent/.env

# Initialize memory system and add user preferences
uv run python scripts/manage_memories.py update

# Note: Memory system takes 10-12 minutes to be ready
# Check memory status after 10 minutes:
uv run python scripts/manage_memories.py list

# Once memory shows as ready, run update again to ensure preferences are loaded:
uv run python scripts/manage_memories.py update
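
Because the memory system takes several minutes to become ready, the wait can be automated instead of checked by hand. This is a minimal sketch under stated assumptions: `wait_until_ready` is a hypothetical helper, and the caller must supply the `is_ready` function, since how readiness appears in the output of `manage_memories.py list` is not specified here.

```python
from __future__ import annotations

import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 15 * 60,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the resource became ready, False on timeout.
    The sleep and clock parameters exist so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The default timeout of 15 minutes is chosen to sit slightly above the 10-12 minute startup time noted above. Adjust it if your environment behaves differently.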

Local Setup Complete: Your SRE Agent is now running locally on your EC2 instance and is exercising the AgentCore Gateway and Memory services. If you want to deploy this agent on AgentCore Runtime so you can integrate it into your applications (like a chatbot, Slack bot, etc.), follow the instructions in the Development to Production Deployment Flow section below.
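
The two `sed` steps in the setup (updating the gateway URI and replacing the access token) can also be expressed in Python, which may be useful on systems where `sed -i` behaves differently. This is a sketch only: the helper names are hypothetical, and the YAML layout in the usage example is an assumption, not the actual content of `agent_config.yaml`.

```python
from __future__ import annotations

import re


def set_gateway_uri(config_text: str, gateway_uri: str) -> str:
    """Replace the value of each `uri: "..."` entry, mirroring the sed substitution."""
    # A function is used as the replacement so that backslashes in the URI
    # are never interpreted as regex escape sequences.
    return re.sub(r'uri: ".*"', lambda _match: f'uri: "{gateway_uri}"', config_text)


def upsert_env_var(env_text: str, key: str, value: str) -> str:
    """Drop existing `KEY=` lines and append a fresh one, mirroring the sed + echo pair."""
    kept = [line for line in env_text.splitlines() if not line.startswith(f"{key}=")]
    kept.append(f"{key}={value}")
    return "\n".join(kept) + "\n"
```

Both functions operate on text rather than files, so the caller decides how to read and write `sre_agent/config/agent_config.yaml` and `sre_agent/.env`.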

Execution instructions

Memory-Enhanced Personalized Investigations

The SRE Agent includes a sophisticated memory system that personalizes investigations based on user preferences. The system comes preconfigured with two user personas in scripts/user_config.yaml:

Alice: Technical detailed investigations with comprehensive analysis and team alerts
Carol: Executive-focused investigations with business impact analysis and strategic alerts

When running investigations with different user IDs, the agent produces similar technical findings but presents them according to each user's preferences:

# Alice's detailed technical investigation
USER_ID=Alice sre-agent --prompt "API response times have degraded 3x in the last hour" --provider bedrock

# Carol's executive-focused investigation  
USER_ID=Carol sre-agent --prompt "API response times have degraded 3x in the last hour" --provider bedrock

Both commands identify the same technical issues but present them differently:

Alice receives detailed technical analysis with step-by-step troubleshooting and team notifications
Carol receives executive summaries focused on business impact with rapid escalation timelines

For a detailed comparison showing how the memory system personalizes identical incidents, see: Memory System Report Comparison
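
To compare the personalized output for several users, the same prompt can be run once per user ID. The sketch below assumes the `sre-agent` CLI is on the `PATH` and uses only the `USER_ID` variable and the `--prompt` and `--provider` flags shown above. The helper names are hypothetical.

```python
from __future__ import annotations

import os
import subprocess


def build_invocation(
    user_id: str, prompt: str, provider: str = "bedrock"
) -> tuple[list[str], dict[str, str]]:
    """Return the argv list and environment for one personalized investigation."""
    argv = ["sre-agent", "--prompt", prompt, "--provider", provider]
    # Copy the current environment and set USER_ID for this run only.
    env = {**os.environ, "USER_ID": user_id}
    return argv, env


def run_for_users(user_ids: list[str], prompt: str) -> None:
    """Run the same prompt once per user, stopping at the first failure."""
    for user_id in user_ids:
        argv, env = build_invocation(user_id, prompt)
        subprocess.run(argv, env=env, check=True)
```

Calling `run_for_users(["Alice", "Carol"], "API response times have degraded 3x in the last hour")` reproduces the two commands above in sequence.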

Single Query Mode

# Investigate specific pod issues
sre-agent --prompt "Why are the payment-service pods crash looping?"

# Analyze performance degradation
sre-agent --prompt "Investigate high latency in the API gateway over the last hour"

# Search for error patterns
sre-agent --prompt "Find all database connection errors in the last 24 hours"

Interactive Mode

# Start interactive conversation
sre-agent --interactive

# Available commands in interactive mode:
# /help     - Show available commands
# /agents   - List available specialist agents
# /history  - Show conversation history
# /save     - Save the current conversation
# /clear    - Clear conversation history
# /exit     - Exit the interactive session

Advanced Options

# Use Amazon Bedrock
sre-agent --provider bedrock --prompt "Check cluster health"

# Save investigation reports to a custom directory
sre-agent --output-dir ./investigations --prompt "Analyze memory usage trends"

# Use Amazon Bedrock with specific profile
AWS_PROFILE=production sre-agent --provider bedrock --interactive

Development to Production Deployment Flow

The SRE Agent follows a structured deployment process from local development to production on Amazon Bedrock AgentCore Runtime. For detailed instructions, see the Deployment Guide.

STEP 1: LOCAL DEVELOPMENT
┌─────────────────────────────────────────────────────────────────────┐
│  Develop Python Package (sre_agent/)                                │
│  └─> Test locally with CLI: uv run sre-agent --prompt "..."         │
│      └─> Agent connects to AgentCore Gateway via MCP protocol       │
└─────────────────────────────────────────────────────────────────────┘
                                    ↓
STEP 2: CONTAINERIZATION  
┌─────────────────────────────────────────────────────────────────────┐
│  Add agent_runtime.py (FastAPI server wrapper)                      │
│  └─> Create Dockerfile (ARM64 for AgentCore)                        │
│      └─> Uses deployment/build_and_deploy.sh script                 │
└─────────────────────────────────────────────────────────────────────┘
                                    ↓
STEP 3: LOCAL CONTAINER TESTING
┌─────────────────────────────────────────────────────────────────────┐
│  Build: LOCAL_BUILD=true ./deployment/build_and_deploy.sh           │
│  └─> Run: docker run -p 8080:8080 sre_agent:latest                  │
│      └─> Test: curl -X POST http://localhost:8080/invocations       │
│          └─> Container connects to same AgentCore Gateway           │
└─────────────────────────────────────────────────────────────────────┘
                                    ↓
STEP 4: PRODUCTION DEPLOYMENT
┌─────────────────────────────────────────────────────────────────────┐
│  Build & Push: ./deployment/build_and_deploy.sh                     │
│  └─> Pushes container to Amazon ECR                                 │
│      └─> deployment/deploy_agent_runtime.py deploys to AgentCore    │
│          └─> Test: uv run python deployment/invoke_agent_runtime.py │
│              └─> Production agent uses production Gateway           │
└─────────────────────────────────────────────────────────────────────┘

Key Points:
• Core agent code (sre_agent/) remains unchanged
• The deployment/ folder contains all deployment-specific utilities
• Same agent works locally and in production via environment config
• AgentCore Gateway provides MCP tools access at all stages
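
The local container test in Step 3 sends a POST request to `/invocations`. A minimal Python equivalent of that `curl` call is sketched below. The JSON payload shape (`{"input": {"prompt": ...}}`) is an assumption and should be checked against the Deployment Guide. The function names are hypothetical.

```python
from __future__ import annotations

import json
import urllib.request


def build_invocation_request(
    prompt: str, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the container's /invocations endpoint."""
    # Assumed payload shape; adjust to match what agent_runtime.py expects.
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(prompt: str, timeout_s: float = 300.0) -> str:
    """Send the request to the locally running container and return the raw response body."""
    request = build_invocation_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8")
```

Separating request construction from sending keeps the payload easy to inspect and lets the same builder target another host by changing `base_url`.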

Deploying Your Agent on Amazon Bedrock AgentCore Runtime

For production deployments, you can deploy the SRE Agent directly to Amazon Bedrock AgentCore Runtime. This provides a scalable, managed environment for running your agent with enterprise-grade security and monitoring.

The AgentCore Runtime deployment supports:

Container-based deployment with automatic scaling
Multiple LLM providers (Amazon Bedrock or Anthropic Claude)
Debug mode for troubleshooting and development
Environment-based configuration for different deployment stages
Secure credential management through AWS IAM and environment variables

For complete step-by-step instructions including local testing, container building, and production deployment, see the Deployment Guide.

AgentCore Observability

Adding observability to an Agent deployed on the AgentCore Runtime is straightforward using the observability primitive. This enables comprehensive monitoring through Amazon CloudWatch with metrics, traces, and logs.

Setting Up Observability

1. Add OpenTelemetry Packages

The required OpenTelemetry packages are already included in pyproject.toml:

dependencies = [
    # ... other dependencies ...
    "opentelemetry-instrumentation-langchain",
    "aws-opentelemetry-distro~=0.10.1",
]

2. Configure Observability for Agents

Follow the Amazon Bedrock AgentCore observability configuration guide to enable metrics in Amazon CloudWatch.

3. Enable OpenTelemetry Instrumentation

When starting the container, use the opentelemetry-instrument utility to automatically instrument your application. This is configured in the Dockerfile:

# Run application with OpenTelemetry instrumentation
CMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

Viewing Metrics and Traces

Once deployed with observability enabled, you can monitor your agent's performance through:

Amazon CloudWatch Metrics: View request rates, latencies, and error rates
AWS X-Ray Traces: Analyze distributed traces to understand request flow
CloudWatch Logs: Access structured logs for debugging and analysis

The observability primitive automatically captures:

LLM invocation metrics (tokens, latency, model usage)
Tool execution traces (duration, success/failure)
Memory operations (retrieval, storage)
End-to-end request tracing across all agent components

Maintenance and Operations

Restarting Backend Servers and Refreshing Access Token

To maintain connectivity with the Amazon Bedrock AgentCore Gateway, you need to periodically restart backend servers and refresh the access token. Run the gateway configuration script:

# Important: Run this from within the virtual environment
source .venv/bin/activate  # If not already activated
./scripts/configure_gateway.sh

What this script does:

Stops running backend servers to ensure clean restart
Generates a new access token for AgentCore Gateway authentication
Gets the EC2 instance private IP for proper SSL binding
Starts backend servers with SSL certificates (HTTPS) or HTTP fallback
Updates gateway URI in the agent configuration from gateway/.gateway_uri
Updates access token in the .env file for agent authentication

Important: Run this script at least once every 24 hours, because the access token expires 24 hours after it is issued. If you don't refresh the token:

The SRE agent will lose connection to the AgentCore gateway
No MCP tools will be available (Kubernetes, logs, metrics, runbooks APIs)
Investigations will fail as agents cannot access backend services

For more details, see the configure_gateway.sh script.

Troubleshooting Gateway Connection Issues

If you encounter "gateway connection failed" or "MCP tools unavailable" errors:

Check if the access token has expired (24-hour limit)
Run ./scripts/configure_gateway.sh to refresh authentication (from within the virtual environment)
Verify backend servers are running with ps aux | grep python
Check SSL certificate validity if using HTTPS
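
The first check above, whether the access token has expired, can be done without contacting the gateway if the token is a JWT. The sketch below assumes that `gateway/.access_token` holds a standard three-part JWT with an `exp` claim. It decodes the payload without verifying the signature, so it is suitable only as a local diagnostic. The function names are hypothetical.

```python
from __future__ import annotations

import base64
import json
import time


def jwt_expiry(token: str) -> int | None:
    """Return the `exp` claim (Unix seconds) from a JWT payload, or None if unreadable."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        return None
    payload_b64 = parts[1]
    # JWT segments omit base64 padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    except ValueError:
        return None
    exp = claims.get("exp") if isinstance(claims, dict) else None
    return int(exp) if isinstance(exp, (int, float)) else None


def is_expired(token: str, now: float | None = None, leeway_s: int = 60) -> bool:
    """Return True if the token is expired, about to expire, or cannot be parsed."""
    exp = jwt_expiry(token)
    if exp is None:
        return True  # Treat unreadable tokens as expired so the caller refreshes.
    current = time.time() if now is None else now
    return current >= exp - leeway_s
```

If `is_expired(Path("gateway/.access_token").read_text())` returns `True`, running `./scripts/configure_gateway.sh` is the next step.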

Clean up instructions

Complete AWS Resource Cleanup

For complete cleanup of all AWS resources (Gateway, Runtime, and local files):

# Complete cleanup - deletes AWS resources and local files
./scripts/cleanup.sh

# Or with custom names
./scripts/cleanup.sh --gateway-name my-gateway --runtime-name my-runtime

# Force cleanup without confirmation prompts
./scripts/cleanup.sh --force

This script will:

Stop backend servers
Delete the AgentCore Gateway and all its targets
Delete memory resources
Delete the AgentCore Runtime
Remove generated files (gateway URIs, tokens, agent ARNs, memory IDs)

Manual Local Cleanup Only

If you only want to clean up local files without touching AWS resources:

# Stop all demo servers
cd backend
./scripts/stop_demo_backend.sh
cd ..

# Clean up generated files only
rm -f gateway/.gateway_uri gateway/.access_token
rm -f deployment/.agent_arn .memory_id

# Note: .env, .venv, and reports/ are preserved for development continuity
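
The same local cleanup can be scripted in Python, for example as part of a test fixture. This is a sketch, not a replacement for `scripts/cleanup.sh`: it removes only the four generated files named in the commands above and leaves everything else alone. The function name is hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Generated files listed in the manual cleanup commands above.
GENERATED_FILES = (
    "gateway/.gateway_uri",
    "gateway/.access_token",
    "deployment/.agent_arn",
    ".memory_id",
)


def clean_generated_files(root: str | Path = ".") -> list[str]:
    """Delete generated files under root and return the relative paths that were removed."""
    removed: list[str] = []
    for relative in GENERATED_FILES:
        path = Path(root) / relative
        # Skip files that are already absent, like `rm -f` does.
        if path.is_file():
            path.unlink()
            removed.append(relative)
    return removed
```

Because the function touches only the listed paths, `.env`, `.venv`, and `reports/` are preserved, as with the shell commands.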

Disclaimer

The examples provided in this repository are for experimental and educational purposes only. They demonstrate concepts and techniques but are not intended for direct use in production environments. Make sure to have Amazon Bedrock Guardrails in place to protect against prompt injection.

Important Note: The data in backend/data is synthetically generated, and the backend directory contains stub servers that showcase how a real SRE agent backend could work. In a production environment, these implementations would need to be replaced with real implementations that connect to actual systems, use vector databases, and integrate with other data sources.