12 Commits

Author SHA1 Message Date
Amit Arora
ca7a2689b3 feat: improve runtime deployment and invocation robustness
- Increase deletion wait time to 150s for agent runtime cleanup
- Add retry logic with exponential backoff for MCP rate limiting (429 errors)
- Add session_id and user_id to agent state for memory retrieval
- Filter out /ping endpoint logs to reduce noise
- Increase boto3 read timeout to 5 minutes for long-running operations
- Add clear error messages for agent name conflicts
- Update README to clarify virtual environment requirement for scripts
- Fix session ID generation to meet 33+ character requirement

These changes improve reliability when deploying and invoking agents,
especially under heavy load or with complex queries that take time.
2025-08-05 22:07:05 +00:00
Amit Arora
8725335b19 docs: comprehensive documentation update and cleanup
- Remove unused root .env and .env.example files (not referenced by any code)
- Update configuration.md with comprehensive config file documentation
- Add configuration overview table with setup instructions and auto-generation info
- Consolidate specialized-agents.md content into system-components.md
- Update system-components.md with complete AgentCore architecture
- Add detailed sections for AgentCore Runtime, Gateway, and Memory primitives
- Remove cli-reference.md (excessive documentation for limited use)
- Update README.md to reference configuration guide in setup section
- Clean up documentation links and organization

The documentation now provides a clear, consolidated view of the system
architecture and configuration with proper cross-references and setup guidance.
2025-08-05 14:42:33 +00:00
Amit Arora
a3fae11db7 docs: add memory system initialization timing guidance
- Added note that memory system takes 10-12 minutes to be ready
- Added steps to check memory status with list command after 10 minutes
- Added instruction to run update command again once memory is ready
- Provides clear workflow for memory system setup and prevents user confusion
2025-08-05 03:18:25 +00:00
Amit Arora
e5f56bc82f fix: preserve .env, .venv, and reports in cleanup script
- Modified cleanup script to only remove AWS-generated configuration files
- Preserved .env files for development continuity
- Preserved .venv directories to avoid reinstalling dependencies
- Preserved reports/ directory containing investigation history
- Files removed: gateway URIs, tokens, agent ARNs, memory IDs only
- Updated documentation to clarify preserved vs removed files
2025-08-05 02:21:04 +00:00
Amit Arora
9cffb8f672 feat: add comprehensive cleanup script with memory deletion
- Added cleanup.sh script to delete all AWS resources (gateway, runtime, memory)
- Integrated memory deletion using bedrock_agentcore MemoryClient
- Added proper error handling and graceful fallbacks
- Updated execution order: servers → gateway → memory → runtime → local files
- Added memory deletion to README.md cleanup instructions
- Includes confirmation prompts and --force option for automation
2025-08-05 02:17:45 +00:00
Amit Arora
9bcebcb4bd Merge branch 'main' into feature/sre-agent-memory-integration
Resolved conflicts in:
- deployment/invoke_agent_runtime.py
- sre_agent/multi_agent_langgraph.py
- sre_agent/supervisor.py

Integrated memory system changes with latest main branch updates
2025-08-05 01:30:08 +00:00
Amit Arora
2677f7b4b0 feat: enhance memory system with automatic pattern extraction and improved logging
## Memory System Enhancements
- **Individual agent memory integration**: Every agent response now triggers automatic memory pattern extraction through on_agent_response() hooks
- **Enhanced conversation logging**: Added detailed message breakdown showing USER/ASSISTANT/TOOL message counts and tool names called
- **Fixed infrastructure extraction**: Resolved hardcoded agent name issues by using SREConstants for agent identification
- **Comprehensive memory persistence**: All agent responses and tool executions stored as conversation memory with proper session tracking

## Tool Architecture Clarification
- **Centralized memory access**: Confirmed only supervisor agent has direct access to memory tools (retrieve_memory, save_*)
- **Individual agent focus**: Individual agents have NO memory tools, only domain-specific tools (5 tools each for metrics, logs, k8s, runbooks)
- **Automatic pattern recognition**: Memory capture happens automatically through hooks, not manual tool calls by individual agents

## Documentation Updates
- **Updated memory-system.md**: Comprehensive design documentation reflecting current implementation
- **Added example analyses**: Created flight-booking-analysis.md and api-response-time-analysis.md in docs/examples/
- **Enhanced README.md**: Added memory system overview and personalized investigation examples
- **Updated .gitignore**: Now ignores entire reports/ folder instead of just .md files

## Implementation Improvements
- **Event ID tracking**: All memory operations generate and log event IDs for verification
- **Pattern extraction confirmation**: Logs confirm pattern extraction working for all agent types
- **Memory save verification**: Comprehensive logging shows successful saves across all memory types
- **Script enhancements**: manage_memories.py now handles duplicate removal and improved user management
2025-08-03 03:18:23 +00:00
Amit Arora
ab4aeb6bcd docs: update SSL certificate paths to use /opt/ssl standard location
- Update README.md to reference /opt/ssl for SSL certificate paths
- Update docs/demo-environment.md to use /opt/ssl paths
- Clean up scripts/configure_gateway.sh SSL fallback paths
- Remove duplicate and outdated SSL path references
- Establish /opt/ssl as the standard SSL certificate location

This ensures consistent SSL certificate management across all
documentation and scripts, supporting the established /opt/ssl
directory with proper ubuntu:ubuntu ownership.
2025-08-02 16:23:54 +00:00
Dheeraj Oruganty
e346e83bf1
fix(02-use-cases): SRE-Agent Deployment (#179)
* Add missing credential_provider_name parameter to config.yaml.example

* Fix get_config function to properly parse YAML values with inline comments

* Enhanced get_config to prevent copy-paste whitespace errors in AWS identifiers

* Improve LLM provider configuration and error handling with bedrock as default

* Add OpenAPI templating system and fix hardcoded regions

* Add backend template build to Readme

* delete old yaml files

* Fix Cognito setup with automation script and missing domain creation steps

* docs: Add EC2 instance port configuration documentation

- Document required inbound ports (443, 8011-8014)
- Include SSL/TLS security requirements
- Add AWS security group best practices
- Provide port usage summary table

* docs: Add hyperlinks to prerequisites in README

- Link EC2 port configuration documentation
- Link IAM role authentication setup
- Improve navigation to detailed setup instructions

* docs: Add BACKEND_API_KEY to configuration documentation

- Document gateway environment variables section
- Add BACKEND_API_KEY requirement for credential provider
- Include example .env file format for gateway directory
- Explain usage in create_gateway.sh script

* docs: Add BACKEND_API_KEY to deployment guide environment variables

- Include BACKEND_API_KEY in environment variables reference table
- Mark as required for gateway setup
- Provide quick reference alongside other required variables

* docs: Add BedrockAgentCoreFullAccess policy and trust policy documentation

- Document AWS managed policy BedrockAgentCoreFullAccess
- Add trust policy requirements for bedrock-agentcore.amazonaws.com
- Reorganize IAM permissions for better clarity
- Remove duplicate trust policy section
- Add IAM role requirement to deployment prerequisites

* docs: Document role_name field in gateway config example

- Explain that role_name is used to create and manage the gateway
- Specify BedrockAgentCoreFullAccess policy requirement
- Note trust policy requirement for bedrock-agentcore.amazonaws.com
- Improve clarity for gateway configuration setup

* docs: Add AWS IP address ranges for production security enhancement

- Document AWS IP ranges JSON download for restricting access
- Reference official AWS documentation for IP address ranges
- Provide security alternatives to 0.0.0.0/0 for production
- Include examples of restricted security group configurations
- Enable egress filtering and region-specific access control

* style: Format Python code with black

- Reformat 14 Python files for consistent code style
- Apply PEP 8 formatting standards
- Improve code readability and maintainability

* docs: Update SRE agent prerequisites and setup documentation

- Convert prerequisites section to markdown table format
- Add SSL certificate provider examples (no-ip.com, letsencrypt.org)
- Add Identity Provider (IDP) requirement with setup_cognito.sh reference
- Clarify that all prerequisites must be completed before setup
- Add reference to domain name and cert paths needed for BACKEND_DOMAIN
- Remove Managing OpenAPI Specifications section (covered in use-case setup)
- Add Deployment Guide link to Development to Production section

Addresses issues #171 and #174

* fix: Replace 'AWS Bedrock' with 'Amazon Bedrock' in SRE agent files

- Updated error messages in llm_utils.py
- Updated comments in both .env.example files
- Ensures consistent naming convention across SRE agent codebase

---------

Co-authored-by: dheerajoruganty <dheo@amazon.com>
Co-authored-by: Amit Arora <aroraai@amazon.com>
2025-08-01 13:24:58 -04:00
Amit Arora
00893cd175
feat(SRE Agent): Update SRE Agent architecture diagram to use Amazon Bedrock AgentCore icons (#163)
- Replace mermaid diagram with new architecture image using official AWS icons
- Add AgentCore Runtime, Gateway, Identity, Memory, and Observability components
- Show clear integration with Amazon Bedrock LLMs and Amazon Cognito
- Display all 4 MCP tools (k8s-api, logs-api, metrics-api, runbooks-api)
- Create docs/images directory for architecture diagrams

Fixes #162
2025-07-28 09:52:35 -04:00
Amit Arora
dff915fabb
fix(SRE Agent)- Deploy SRE Agent on Amazon Bedrock AgentCore Runtime with Enhanced Architecture (#158)
* feat: Deploy SRE agent on Amazon Bedrock AgentCore Runtime

- Add agent_runtime.py with FastAPI endpoints for AgentCore compatibility
- Create Dockerfile for ARM64-based containerization
- Add deployment scripts for automated ECR push and AgentCore deployment
- Update backend API URLs from placeholders to actual endpoints
- Update gateway configuration for production use
- Add dependencies for AgentCore runtime support

Implements #143

* chore: Add deployment artifacts to .gitignore

- Add deployment/.sre_agent_uri, deployment/.env, and deployment/.agent_arn to .gitignore
- Remove already tracked deployment artifacts from git

* feat: Make ANTHROPIC_API_KEY optional in deployment

- Update deploy_agent_runtime.py to conditionally include ANTHROPIC_API_KEY
- Show info message when using Amazon Bedrock as provider
- Update .env.example to clarify ANTHROPIC_API_KEY is optional
- Only include ANTHROPIC_API_KEY in environment variables if it exists

* fix: Use uv run python instead of python in build script

- Update build_and_deploy.sh to use 'uv run python' for deployment
- Change to parent directory to ensure uv environment is available
- Fixes 'python: command not found' error during deployment

* refactor: Improve deployment script structure and create .env symlink

- Flatten nested if-else blocks in deploy_agent_runtime.py for better readability
- Add 10-second sleep after deletion to ensure cleanup completes
- Create symlink from deployment/.env to sre_agent/.env to avoid duplication
- Move time import to top of file with other imports

* feat: Add debug mode support and comprehensive deployment guide

Add --debug command line flag and DEBUG environment variable support:
- Created shared logging configuration module
- Updated CLI and runtime to support --debug flag
- Made debug traces conditional on DEBUG environment variable
- Added debug mode for container and AgentCore deployments

Enhanced build and deployment script:
- Added command line argument for ECR repository name
- Added help documentation and usage examples
- Added support for local builds (x86_64) vs AgentCore builds (arm64)
- Added environment variable pass-through for DEBUG, LLM_PROVIDER, ANTHROPIC_API_KEY

Created comprehensive deployment guide:
- Step-by-step instructions from local testing to production
- Docker platform documentation (x86_64 vs arm64)
- Environment variable configuration with .env file usage
- Debug mode examples and troubleshooting guide
- Provider configuration for Bedrock and Anthropic

Updated README with AgentCore Runtime deployment section and documentation links.

* docs: Update SRE Agent README with deployment flow diagram and fix directory reference

- Fix reference from 04-SRE-agent to SRE-agent in README
- Add comprehensive flowchart showing development to production deployment flow
- Update overview to mention Amazon Bedrock AgentCore Runtime deployment
- Remove emojis from documentation for professional appearance

* docs: Replace mermaid diagram with ASCII step-by-step flow diagram

- Change from block-style mermaid diagram to ASCII flow diagram
- Show clear step-by-step progression from development to production
- Improve readability with structured boxes and arrows
- Minor text improvements for clarity

* feat: Implement comprehensive prompt management system and enhance deployment guide

- Create centralized prompt template system with external files in config/prompts/
- Add PromptLoader utility class with LRU caching and template variable substitution
- Integrate PromptConfig into SREConstants for centralized configuration management
- Update all agents (nodes, supervisor, output_formatter) to use prompt loader
- Replace 150+ lines of hardcoded prompts with modular, maintainable template system
- Enhance deployment guide with consistent naming (my_custom_sre_agent) throughout
- Add quick-start copy-paste command sequence for streamlined deployment
- Improve constants system with comprehensive model, AWS, timeout, and prompt configs
- Add architectural assessment document to .gitignore for local analysis
- Run black formatting across all updated Python files

* docs: Consolidate deployment and security documentation

- Rename deployment-and-security.md to security.md and remove redundant deployment content
- Enhance security.md with comprehensive production security guidelines including:
  - Authentication and authorization best practices
  - Encryption and data protection requirements
  - Operational security monitoring and logging
  - Input validation and prompt security measures
  - Infrastructure security recommendations
  - Compliance and governance frameworks
- Update README.md to reference new security.md file
- Eliminate redundancy between deployment-guide.md and deployment-and-security.md
- Improve documentation organization with clear separation of concerns

* config: Replace hardcoded endpoints with placeholder domains

- Update OpenAPI specifications to use placeholder domain 'your-backend-domain.com'
  - k8s_api.yaml: mcpgateway.ddns.net:8011 -> your-backend-domain.com:8011
  - logs_api.yaml: mcpgateway.ddns.net:8012 -> your-backend-domain.com:8012
  - metrics_api.yaml: mcpgateway.ddns.net:8013 -> your-backend-domain.com:8013
  - runbooks_api.yaml: mcpgateway.ddns.net:8014 -> your-backend-domain.com:8014
- Update agent configuration to use placeholder AgentCore gateway endpoint
  - agent_config.yaml: Replace specific gateway ID with 'your-agentcore-gateway-endpoint'
- Improve security by removing hardcoded production endpoints from repository
- Enable template-based configuration that users can customize during setup
- Align with existing documentation patterns for placeholder domain replacement
2025-07-27 15:05:03 -04:00
Shreyas Subramanian
176ef7bd91
renaming folders (#102) 2025-07-21 10:45:13 -04:00