Dheeraj Oruganty e346e83bf1
Co-authored-by: dheerajoruganty <dheo@amazon.com>
Co-authored-by: Amit Arora <aroraai@amazon.com>
2025-08-01 13:24:58 -04:00

SRE Agent Deployment Guide for Amazon Bedrock AgentCore Runtime

This guide walks you through the complete deployment process for the SRE Agent, from local testing to production deployment on Amazon Bedrock AgentCore Runtime.

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Docker installed and running
  • UV package manager installed
  • Python 3.12+
  • Access to Amazon Bedrock AgentCore Runtime
  • IAM role with BedrockAgentCoreFullAccess policy and appropriate trust policy (see Authentication Setup)

Environment Configuration

The SRE Agent uses environment variables for configuration. These are read from .env files in the appropriate directories:

  • CLI Testing: Environment variables are read from sre_agent/.env
  • Container Building: Environment variables are read from deployment/.env
  • Docker Platform: Local builds use Dockerfile.x86_64 (linux/amd64); AgentCore deployments use Dockerfile (linux/arm64)

Required Environment Variables

Create the appropriate .env files with these variables:

For sre_agent/.env (CLI testing and local container runs):

GATEWAY_ACCESS_TOKEN=your_gateway_access_token
LLM_PROVIDER=bedrock
DEBUG=false
# If using Anthropic provider, also add:
# ANTHROPIC_API_KEY=sk-ant-your-key-here

For deployment/.env (container building and deployment):

GATEWAY_ACCESS_TOKEN=your_gateway_access_token
# Required only when deploying with LLM_PROVIDER=anthropic:
ANTHROPIC_API_KEY=sk-ant-your-key-here
# Values in this file can be overridden by environment variables during build/deploy

Note: When using --env-file, all required variables should be in the .env file. Use -e only to override specific variables from the .env file.
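
The precedence described in the note (file values first, -e overrides on top) can be modelled in a few lines of Python. This is a minimal sketch, not Docker's own parser: parse_env_text and effective_env are hypothetical helper names, and the parser only understands KEY=value lines, blank lines, and # comments.

```python
def parse_env_text(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def effective_env(env_file_text, overrides=None):
    """Start from the file's values, then apply -e style overrides on top."""
    env = parse_env_text(env_file_text)
    env.update(overrides or {})
    return env


# Sample content mirroring the sre_agent/.env example above.
SAMPLE = """\
GATEWAY_ACCESS_TOKEN=your_gateway_access_token
LLM_PROVIDER=bedrock
DEBUG=false
# ANTHROPIC_API_KEY=sk-ant-your-key-here
"""
```

Calling effective_env(SAMPLE, {"DEBUG": "true"}) keeps every value from the file except DEBUG, which mirrors running the container with --env-file sre_agent/.env -e DEBUG=true.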

Deployment Sequence

Phase 1: Local Testing with CLI

First, test the SRE agent locally using the command-line interface to ensure it works correctly.

1.1 Setup Environment

Create and configure your environment files:

# Setup CLI environment file
cp sre_agent/.env.example sre_agent/.env
# Edit sre_agent/.env with your configuration

Note: Environment variables can be overridden at runtime, but having .env files ensures consistent configuration.

1.2 Test CLI with Bedrock (Default)

# Test with default Bedrock provider
uv run sre-agent --prompt "list the pods in my infrastructure"

# Test with debug output enabled
uv run sre-agent --prompt "list the pods in my infrastructure" --debug

# Test with specific provider
uv run sre-agent --prompt "list the pods in my infrastructure" --provider bedrock --debug

1.3 Test CLI with Anthropic Provider

# Ensure ANTHROPIC_API_KEY is set in your .env file, then:
uv run sre-agent --prompt "list the pods in my infrastructure" --provider anthropic --debug

Expected Output: You should see the agent process your request, route it to the appropriate specialized agents, and return infrastructure information.
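
If you want to script these CLI checks, for example in a smoke test, the command line can be assembled programmatically. The sketch below uses only the flags shown above (--prompt, --provider, --debug); build_cli_command is a hypothetical helper, not part of the SRE Agent.

```python
import shlex


def build_cli_command(prompt, provider=None, debug=False):
    """Assemble the argv list for the sre-agent CLI invocations shown above."""
    cmd = ["uv", "run", "sre-agent", "--prompt", prompt]
    if provider is not None:
        cmd += ["--provider", provider]
    if debug:
        cmd.append("--debug")
    return cmd


cmd = build_cli_command(
    "list the pods in my infrastructure", provider="anthropic", debug=True
)
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the project root, which avoids shell quoting problems in the prompt text.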

Phase 2: Local Container Testing

Once CLI testing is successful, build and test the agent as a container locally.

2.1 Build Local Container

The build script accepts an optional ECR repository name and uses different Dockerfiles based on the target platform:

  • Local builds (LOCAL_BUILD=true): Uses Dockerfile.x86_64 for linux/amd64 platform
  • AgentCore builds (default): Uses Dockerfile for linux/arm64 platform (required by AgentCore)

# Build container for local testing with custom name
LOCAL_BUILD=true ./deployment/build_and_deploy.sh my_custom_sre_agent

# View help for all options
./deployment/build_and_deploy.sh --help
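
The Dockerfile and platform selection described above amounts to a single branch on LOCAL_BUILD. The following sketch models that decision only; the internals of build_and_deploy.sh are not shown in this guide, so the function name and the case-insensitive "true" check are assumptions.

```python
def select_build_target(env):
    """Return (dockerfile, platform) for the build mode described above.

    Assumption: LOCAL_BUILD counts as enabled only when it equals "true"
    (case-insensitive); the real script may parse it differently.
    """
    local = env.get("LOCAL_BUILD", "false").strip().lower() == "true"
    if local:
        return "Dockerfile.x86_64", "linux/amd64"
    return "Dockerfile", "linux/arm64"
```

With LOCAL_BUILD unset, the sketch falls through to the linux/arm64 target, matching the default that AgentCore requires.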

2.2 Test Local Container with Bedrock

Run the container locally with default Bedrock provider:

# Using .env file from sre_agent directory (recommended)
# Ensure LLM_PROVIDER=bedrock is set in sre_agent/.env
docker run -p 8080:8080 --env-file sre_agent/.env my_custom_sre_agent:latest

# Alternative: with explicit environment variables (if not using .env file)
docker run -p 8080:8080 \
  -v ~/.aws:/root/.aws:ro \
  -e AWS_PROFILE=default \
  -e GATEWAY_ACCESS_TOKEN=your_token \
  -e LLM_PROVIDER=bedrock \
  my_custom_sre_agent:latest

# With debug enabled (overrides DEBUG setting from .env file)
docker run -p 8080:8080 --env-file sre_agent/.env -e DEBUG=true my_custom_sre_agent:latest

Note: The image name matches the ECR repository name you specified during the build.

2.3 Test Local Container with Anthropic

# Using .env file (ensure LLM_PROVIDER=anthropic is set in sre_agent/.env)
docker run -p 8080:8080 --env-file sre_agent/.env my_custom_sre_agent:latest

# With debug enabled (override DEBUG setting from .env file)
docker run -p 8080:8080 \
  --env-file sre_agent/.env \
  -e DEBUG=true \
  my_custom_sre_agent:latest

Note: Ensure both LLM_PROVIDER=anthropic and ANTHROPIC_API_KEY are set in your sre_agent/.env file when using the anthropic provider.

2.4 Test Container with curl

Test the running container:

# Basic test
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "list the pods in my infrastructure"
    }
  }'

# Health check
curl http://localhost:8080/ping

Expected Output: The container should respond with JSON containing the agent's response.
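
The same request can be issued from Python using only the standard library, which is convenient for automated checks. This is a sketch under assumptions: build_invocation_request and invoke are hypothetical helper names, the 300-second timeout is an arbitrary choice, and the response is only assumed to be JSON, as stated above.

```python
import json
import urllib.request


def build_invocation_request(prompt, base_url="http://localhost:8080"):
    """Build the same POST /invocations request as the curl example."""
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(prompt, base_url="http://localhost:8080", timeout=300):
    """Send the request and decode the JSON reply (needs a running container)."""
    request = build_invocation_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container from step 2.2 or 2.3 running, invoke("list the pods in my infrastructure") returns the decoded response body. Building the request separately from sending it keeps the payload shape testable without a live container.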

Phase 3: Amazon Bedrock AgentCore Runtime Deployment

Once local container testing is successful, deploy to AgentCore.

3.1 Deploy to AgentCore with Bedrock

# Deploy with custom repository name and default settings (reads from deployment/.env)
./deployment/build_and_deploy.sh my_custom_sre_agent

# Deploy with debug enabled (environment variable override)
DEBUG=true ./deployment/build_and_deploy.sh my_custom_sre_agent

# Deploy with specific provider
LLM_PROVIDER=bedrock DEBUG=true ./deployment/build_and_deploy.sh my_custom_sre_agent

3.2 Deploy to AgentCore with Anthropic

# Deploy with Anthropic provider (ensure ANTHROPIC_API_KEY is in deployment/.env)
LLM_PROVIDER=anthropic ./deployment/build_and_deploy.sh my_custom_sre_agent

# Deploy with Anthropic and debug enabled
DEBUG=true LLM_PROVIDER=anthropic ./deployment/build_and_deploy.sh my_custom_sre_agent

# Override API key via environment variable
LLM_PROVIDER=anthropic ANTHROPIC_API_KEY=sk-ant-your-key ./deployment/build_and_deploy.sh my_custom_sre_agent

Build Script Usage:

# View all available options
./deployment/build_and_deploy.sh --help

# The script accepts one optional argument: ECR repository name
# Default repository name is 'sre_agent'
# Note: Use underscores (_) instead of hyphens (-) in repository names

Expected Output: The script builds the image, pushes it to ECR, and deploys it to AgentCore Runtime.
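
The repository-name rules in the comments above (one optional argument, a default of sre_agent, underscores rather than hyphens) can be checked before the script is called. The sketch below is illustrative: resolve_repo_name is a hypothetical helper, and rejecting hyphenated names outright is an assumption, since this guide does not say how the script itself reacts to them.

```python
DEFAULT_REPO_NAME = "sre_agent"


def resolve_repo_name(args):
    """Pick the ECR repository name from positional args, else the default."""
    name = args[0] if args else DEFAULT_REPO_NAME
    if "-" in name:
        # Assumption: fail early and suggest the underscore form.
        suggestion = name.replace("-", "_")
        raise ValueError(f"use underscores instead of hyphens: {suggestion!r}")
    return name
```

For example, resolve_repo_name([]) yields the default name, while resolve_repo_name(["my-agent"]) raises a ValueError that suggests my_agent.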

3.3 Test AgentCore Deployment

Test the deployed agent using the invoke script:

# Test deployed agent
uv run python deployment/invoke_agent_runtime.py \
  --prompt "list the pods in my infrastructure"

# Test with custom runtime ARN
uv run python deployment/invoke_agent_runtime.py \
  --prompt "list the pods in my infrastructure" \
  --runtime-arn "arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/your-runtime-id"
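
Before passing a custom --runtime-arn, it can help to sanity-check its shape. The sketch below relies on the general ARN layout (arn:partition:service:region:account-id:resource) and on the example ARN above; parse_runtime_arn is a hypothetical helper and does not confirm that the runtime actually exists.

```python
def parse_runtime_arn(arn):
    """Split a runtime ARN into its colon-separated fields."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    _, partition, service, region, account_id, resource = parts
    resource_type, _, resource_id = resource.partition("/")
    # Shape taken from the example above: service bedrock-agentcore, resource runtime/<id>.
    if service != "bedrock-agentcore" or resource_type != "runtime" or not resource_id:
        raise ValueError(f"not an AgentCore runtime ARN: {arn!r}")
    return {
        "partition": partition,
        "region": region,
        "account_id": account_id,
        "runtime_id": resource_id,
    }
```

A mismatch between the region in the ARN and your AWS_REGION setting is one of the things this makes easy to spot.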

Environment Variables Reference

Core Configuration

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| GATEWAY_ACCESS_TOKEN | Gateway authentication token | - | Yes |
| BACKEND_API_KEY | Backend API key for credential provider | - | Yes (gateway setup) |
| LLM_PROVIDER | Language model provider | bedrock | No |
| ANTHROPIC_API_KEY | Anthropic API key | - | Only for anthropic provider |
| DEBUG | Enable debug logging and traces | false | No |

AWS Configuration

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| AWS_REGION | AWS region for deployment | us-east-1 | No |
| AWS_PROFILE | AWS profile to use | - | No |
| RUNTIME_NAME | AgentCore runtime name | ECR repo name | No |

Build Script Configuration

| Variable | Description | Default | Notes |
| --- | --- | --- | --- |
| LOCAL_BUILD | Build for local testing only | false | Uses Dockerfile.x86_64 when true |
| PLATFORM | Target platform | arm64 | AgentCore requires arm64, use x86_64 for local |
| ECR_REPO_NAME | ECR repository name | sre_agent | Can be passed as command line argument |
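
The defaults listed in the tables above can be expressed as a small resolution step, which is useful when writing tooling around the deployment. This is a sketch: resolve_settings is a hypothetical helper, and only variables with a documented default are filled in.

```python
# Defaults copied from the reference tables above.
DEFAULTS = {
    "LLM_PROVIDER": "bedrock",
    "DEBUG": "false",
    "AWS_REGION": "us-east-1",
    "LOCAL_BUILD": "false",
    "PLATFORM": "arm64",
    "ECR_REPO_NAME": "sre_agent",
}


def resolve_settings(env):
    """Overlay the provided variables on the documented defaults."""
    settings = {**DEFAULTS, **env}
    # RUNTIME_NAME falls back to the ECR repository name, per the table.
    settings.setdefault("RUNTIME_NAME", settings["ECR_REPO_NAME"])
    return settings
```

Variables without a default (GATEWAY_ACCESS_TOKEN, BACKEND_API_KEY, ANTHROPIC_API_KEY, AWS_PROFILE) stay absent unless the caller supplies them.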

Debug Mode Usage

CLI Debug Mode

# Enable debug with --debug flag
uv run sre-agent --prompt "your query" --debug

# Or with environment variable
DEBUG=true uv run sre-agent --prompt "your query"

Container Debug Mode

# Local container with debug (overrides DEBUG setting in .env file)
docker run -p 8080:8080 --env-file sre_agent/.env -e DEBUG=true my_custom_sre_agent:latest

# AgentCore deployment with debug
DEBUG=true ./deployment/build_and_deploy.sh my_custom_sre_agent
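
Debug mode can come from either the --debug flag or the DEBUG variable, so the effective setting is the logical OR of the two. The sketch below illustrates that rule; how the agent itself parses DEBUG is not shown in this guide, so treating only a case-insensitive "true" as enabled is an assumption.

```python
def debug_enabled(cli_flag=False, env=None):
    """True when --debug was passed or DEBUG is set to "true".

    Assumption: any value other than "true" (case-insensitive) counts as off.
    """
    value = (env or {}).get("DEBUG", "false")
    return bool(cli_flag) or value.strip().lower() == "true"
```

Under this rule the flag can only turn debug output on; a DEBUG=true entry in the environment is not cancelled by omitting --debug.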

Debug Output Examples

Without Debug Mode:

🤖 Multi-Agent System: Processing...
🧭 Supervisor: Routing to kubernetes_agent
🔧 Kubernetes Agent:
   💡 Full Response: Here are the pods in your infrastructure...
💬 Final Response: I found 5 pods running in your infrastructure...

With Debug Mode:

🤖 Multi-Agent System: Processing...

MCP tools loaded: 12
  - kubernetes-list-pods: List all pods in the cluster...
  - kubernetes-get-pod: Get details of a specific pod...

🧭 Supervisor: Routing to kubernetes_agent
🔧 Kubernetes Agent:
   🔍 DEBUG: agent_messages = 3
   📋 Found 3 trace messages:
      1. AIMessage: I'll help you list the pods...
   📞 Calling tools:
      kubernetes-list-pods(
        namespace=None
      ) [id: call_123]
   🛠️  kubernetes-list-pods [id: call_123]:
      {"pods": [...]}
   💡 Full Response: Here are the pods in your infrastructure...
💬 Final Response: I found 5 pods running in your infrastructure...

Provider Configuration

Using Amazon Bedrock (Default)

# CLI (reads from sre_agent/.env)
uv run sre-agent --provider bedrock --prompt "your query"

# Container (reads LLM_PROVIDER=bedrock from sre_agent/.env)
docker run -p 8080:8080 --env-file sre_agent/.env my_custom_sre_agent:latest

# Deployment (reads from deployment/.env, can override via environment variable)
LLM_PROVIDER=bedrock ./deployment/build_and_deploy.sh my_custom_sre_agent

Using Anthropic Claude

# CLI (reads LLM_PROVIDER and ANTHROPIC_API_KEY from sre_agent/.env)
uv run sre-agent --provider anthropic --prompt "your query"

# Container (reads LLM_PROVIDER=anthropic and ANTHROPIC_API_KEY from sre_agent/.env)
docker run -p 8080:8080 --env-file sre_agent/.env my_custom_sre_agent:latest

# Deployment (reads from deployment/.env, can override via environment variable)
LLM_PROVIDER=anthropic ./deployment/build_and_deploy.sh my_custom_sre_agent

# Override API key via environment variable (if not in deployment/.env)
LLM_PROVIDER=anthropic ANTHROPIC_API_KEY=sk-ant-xxx ./deployment/build_and_deploy.sh my_custom_sre_agent
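
The provider requirements above reduce to one rule: GATEWAY_ACCESS_TOKEN is always needed, and ANTHROPIC_API_KEY is needed only when LLM_PROVIDER is anthropic. A minimal pre-flight check might look like the sketch below; missing_variables is a hypothetical helper, not part of the SRE Agent.

```python
def missing_variables(env):
    """List required variables that are absent or empty for the chosen provider."""
    provider = env.get("LLM_PROVIDER", "bedrock")  # bedrock is the documented default
    required = ["GATEWAY_ACCESS_TOKEN"]
    if provider == "anthropic":
        required.append("ANTHROPIC_API_KEY")
    return [name for name in required if not env.get(name)]
```

Running it against dict(os.environ), or against the parsed contents of the relevant .env file, before a build or deploy surfaces a missing key early instead of at the first model call.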

Troubleshooting

Common Issues

  1. Gateway Token Issues

    # Verify token is set
    echo $GATEWAY_ACCESS_TOKEN
    # Or check .env file
    cat sre_agent/.env
    
  2. Provider Configuration

    # For Anthropic, ensure API key is valid
    echo $ANTHROPIC_API_KEY
    # Test API key with a simple call
    
  3. Debug Information

    # Enable debug mode to see detailed logs
    DEBUG=true uv run sre-agent --prompt "test"
    
  4. Container Issues

    # Check container logs
    docker logs <container_id>
    # Run with debug
    docker run -e DEBUG=true ... my_custom_sre_agent:latest
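
The token checks in items 1 and 2 print the secret values to the terminal. Where that is undesirable (shared screens, captured logs), a masked check is an alternative. The sketch below is one possible approach; describe_secret is a hypothetical helper, and showing the last four characters only for values of 12 or more characters is an arbitrary choice.

```python
import os


def describe_secret(name, env=None):
    """Report whether a variable is set without printing its full value."""
    source = os.environ if env is None else env
    value = source.get(name, "")
    if not value:
        return f"{name}: NOT SET"
    # Short values are fully masked so the tail does not reveal most of the secret.
    tail = value[-4:] if len(value) >= 12 else "****"
    return f"{name}: set, {len(value)} characters, ending in {tail}"


for variable in ("GATEWAY_ACCESS_TOKEN", "ANTHROPIC_API_KEY"):
    print(describe_secret(variable))
```

This reads the current process environment, so it reflects exported variables rather than the contents of a .env file.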
    

Verification Steps

  1. CLI Working: Agent responds to queries locally
  2. Container Working: Container responds to curl requests
  3. AgentCore Working: Deployed agent responds via invoke script

Quick Start: Copy-Paste Command Sequence

For a complete deployment using my_custom_sre_agent, copy and paste these commands in sequence:

1. Build Local Container

LOCAL_BUILD=true ./deployment/build_and_deploy.sh my_custom_sre_agent

2. Test Local Container (Bedrock)

docker run -p 8080:8080 --env-file sre_agent/.env my_custom_sre_agent:latest

3. Test with curl

curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "list the pods in my infrastructure"
    }
  }'

4. Deploy to AgentCore

./deployment/build_and_deploy.sh my_custom_sre_agent

5. Test AgentCore Deployment

uv run python deployment/invoke_agent_runtime.py \
  --prompt "list the pods in my infrastructure"

Best Practices

  1. Development: Always test locally first
  2. Environment Files: Use .env files for consistent configuration
  3. Debug Mode: Enable debug mode when troubleshooting
  4. Provider Testing: Test both Bedrock and Anthropic providers if using both
  5. Incremental Deployment: Deploy to a staging environment before production