- Increase deletion wait time to 150s for agent runtime cleanup
- Add retry logic with exponential backoff for MCP rate limiting (429 errors)
- Add session_id and user_id to agent state for memory retrieval
- Filter out /ping endpoint logs to reduce noise
- Increase boto3 read timeout to 5 minutes for long-running operations
- Add clear error messages for agent name conflicts
- Update README to clarify virtual environment requirement for scripts
- Fix session ID generation to meet 33+ character requirement
These changes improve reliability when deploying and invoking agents,
especially under heavy load or with complex queries that take time.
- Remove unused root .env and .env.example files (not referenced by any code)
- Update configuration.md with comprehensive config file documentation
- Add configuration overview table with setup instructions and auto-generation info
- Consolidate specialized-agents.md content into system-components.md
- Update system-components.md with complete AgentCore architecture
- Add detailed sections for AgentCore Runtime, Gateway, and Memory primitives
- Remove cli-reference.md (excessive documentation for limited use)
- Update README.md to reference configuration guide in setup section
- Clean up documentation links and organization
The documentation now provides a clear, consolidated view of the system
architecture and configuration with proper cross-references and setup guidance.
- Added note that memory system takes 10-12 minutes to be ready
- Added steps to check memory status with list command after 10 minutes
- Added instruction to run update command again once memory is ready
- Provides clear workflow for memory system setup and prevents user confusion
## Memory System Enhancements
- **Individual agent memory integration**: Every agent response now triggers automatic memory pattern extraction through on_agent_response() hooks
- **Enhanced conversation logging**: Added detailed message breakdown showing USER/ASSISTANT/TOOL message counts and tool names called
- **Fixed infrastructure extraction**: Resolved hardcoded agent name issues by using SREConstants for agent identification
- **Comprehensive memory persistence**: All agent responses and tool executions stored as conversation memory with proper session tracking
## Tool Architecture Clarification
- **Centralized memory access**: Confirmed only supervisor agent has direct access to memory tools (retrieve_memory, save_*)
- **Individual agent focus**: Individual agents have NO memory tools, only domain-specific tools (5 tools each for metrics, logs, k8s, runbooks)
- **Automatic pattern recognition**: Memory capture happens automatically through hooks, not manual tool calls by individual agents
## Documentation Updates
- **Updated memory-system.md**: Comprehensive design documentation reflecting current implementation
- **Added example analyses**: Created flight-booking-analysis.md and api-response-time-analysis.md in docs/examples/
- **Enhanced README.md**: Added memory system overview and personalized investigation examples
- **Updated .gitignore**: Now ignores entire reports/ folder instead of just .md files
## Implementation Improvements
- **Event ID tracking**: All memory operations generate and log event IDs for verification
- **Pattern extraction confirmation**: Logs confirm pattern extraction working for all agent types
- **Memory save verification**: Comprehensive logging shows successful saves across all memory types
- **Script enhancements**: manage_memories.py now handles duplicate removal and improved user management
- Update README.md to reference /opt/ssl for SSL certificate paths
- Update docs/demo-environment.md to use /opt/ssl paths
- Clean up scripts/configure_gateway.sh SSL fallback paths
- Remove duplicate and outdated SSL path references
- Establish /opt/ssl as the standard SSL certificate location
This ensures consistent SSL certificate management across all
documentation and scripts, supporting the established /opt/ssl
directory with proper ubuntu:ubuntu ownership.
* Add missing credential_provider_name parameter to config.yaml.example
* Fix get_config function to properly parse YAML values with inline comments
* Enhanced get_config to prevent copy-paste whitespace errors in AWS identifiers
* Improve LLM provider configuration and error handling with bedrock as default
* Add OpenAPI templating system and fix hardcoded regions
* Add backend template build to Readme
* delete old yaml files
* Fix Cognito setup with automation script and missing domain creation steps
* docs: Add EC2 instance port configuration documentation
- Document required inbound ports (443, 8011-8014)
- Include SSL/TLS security requirements
- Add AWS security group best practices
- Provide port usage summary table
* docs: Add hyperlinks to prerequisites in README
- Link EC2 port configuration documentation
- Link IAM role authentication setup
- Improve navigation to detailed setup instructions
* docs: Add BACKEND_API_KEY to configuration documentation
- Document gateway environment variables section
- Add BACKEND_API_KEY requirement for credential provider
- Include example .env file format for gateway directory
- Explain usage in create_gateway.sh script
* docs: Add BACKEND_API_KEY to deployment guide environment variables
- Include BACKEND_API_KEY in environment variables reference table
- Mark as required for gateway setup
- Provide quick reference alongside other required variables
* docs: Add BedrockAgentCoreFullAccess policy and trust policy documentation
- Document AWS managed policy BedrockAgentCoreFullAccess
- Add trust policy requirements for bedrock-agentcore.amazonaws.com
- Reorganize IAM permissions for better clarity
- Remove duplicate trust policy section
- Add IAM role requirement to deployment prerequisites
* docs: Document role_name field in gateway config example
- Explain that role_name is used to create and manage the gateway
- Specify BedrockAgentCoreFullAccess policy requirement
- Note trust policy requirement for bedrock-agentcore.amazonaws.com
- Improve clarity for gateway configuration setup
* docs: Add AWS IP address ranges for production security enhancement
- Document AWS IP ranges JSON download for restricting access
- Reference official AWS documentation for IP address ranges
- Provide security alternatives to 0.0.0.0/0 for production
- Include examples of restricted security group configurations
- Enable egress filtering and region-specific access control
* style: Format Python code with black
- Reformat 14 Python files for consistent code style
- Apply PEP 8 formatting standards
- Improve code readability and maintainability
* docs: Update SRE agent prerequisites and setup documentation
- Convert prerequisites section to markdown table format
- Add SSL certificate provider examples (no-ip.com, letsencrypt.org)
- Add Identity Provider (IDP) requirement with setup_cognito.sh reference
- Clarify that all prerequisites must be completed before setup
- Add reference to domain name and cert paths needed for BACKEND_DOMAIN
- Remove Managing OpenAPI Specifications section (covered in use-case setup)
- Add Deployment Guide link to Development to Production section
Addresses issues #171 and #174
* fix: Replace 'AWS Bedrock' with 'Amazon Bedrock' in SRE agent files
- Updated error messages in llm_utils.py
- Updated comments in both .env.example files
- Ensures consistent naming convention across SRE agent codebase
---------
Co-authored-by: dheerajoruganty <dheo@amazon.com>
Co-authored-by: Amit Arora <aroraai@amazon.com>
* feat: Deploy SRE agent on Amazon Bedrock AgentCore Runtime
- Add agent_runtime.py with FastAPI endpoints for AgentCore compatibility
- Create Dockerfile for ARM64-based containerization
- Add deployment scripts for automated ECR push and AgentCore deployment
- Update backend API URLs from placeholders to actual endpoints
- Update gateway configuration for production use
- Add dependencies for AgentCore runtime support
Implements #143
* chore: Add deployment artifacts to .gitignore
- Add deployment/.sre_agent_uri, deployment/.env, and deployment/.agent_arn to .gitignore
- Remove already tracked deployment artifacts from git
* feat: Make ANTHROPIC_API_KEY optional in deployment
- Update deploy_agent_runtime.py to conditionally include ANTHROPIC_API_KEY
- Show info message when using Amazon Bedrock as provider
- Update .env.example to clarify ANTHROPIC_API_KEY is optional
- Only include ANTHROPIC_API_KEY in environment variables if it exists
* fix: Use uv run python instead of python in build script
- Update build_and_deploy.sh to use 'uv run python' for deployment
- Change to parent directory to ensure uv environment is available
- Fixes 'python: command not found' error during deployment
* refactor: Improve deployment script structure and create .env symlink
- Flatten nested if-else blocks in deploy_agent_runtime.py for better readability
- Add 10-second sleep after deletion to ensure cleanup completes
- Create symlink from deployment/.env to sre_agent/.env to avoid duplication
- Move time import to top of file with other imports
* feat: Add debug mode support and comprehensive deployment guide
Add --debug command line flag and DEBUG environment variable support:
- Created shared logging configuration module
- Updated CLI and runtime to support --debug flag
- Made debug traces conditional on DEBUG environment variable
- Added debug mode for container and AgentCore deployments
Enhanced build and deployment script:
- Added command line argument for ECR repository name
- Added help documentation and usage examples
- Added support for local builds (x86_64) vs AgentCore builds (arm64)
- Added environment variable pass-through for DEBUG, LLM_PROVIDER, ANTHROPIC_API_KEY
Created comprehensive deployment guide:
- Step-by-step instructions from local testing to production
- Docker platform documentation (x86_64 vs arm64)
- Environment variable configuration with .env file usage
- Debug mode examples and troubleshooting guide
- Provider configuration for Bedrock and Anthropic
Updated README with AgentCore Runtime deployment section and documentation links.
* docs: Update SRE Agent README with deployment flow diagram and fix directory reference
- Fix reference from 04-SRE-agent to SRE-agent in README
- Add comprehensive flowchart showing development to production deployment flow
- Update overview to mention Amazon Bedrock AgentCore Runtime deployment
- Remove emojis from documentation for professional appearance
* docs: Replace mermaid diagram with ASCII step-by-step flow diagram
- Change from block-style mermaid diagram to ASCII flow diagram
- Show clear step-by-step progression from development to production
- Improve readability with structured boxes and arrows
- Minor text improvements for clarity
* feat: Implement comprehensive prompt management system and enhance deployment guide
- Create centralized prompt template system with external files in config/prompts/
- Add PromptLoader utility class with LRU caching and template variable substitution
- Integrate PromptConfig into SREConstants for centralized configuration management
- Update all agents (nodes, supervisor, output_formatter) to use prompt loader
- Replace 150+ lines of hardcoded prompts with modular, maintainable template system
- Enhance deployment guide with consistent naming (my_custom_sre_agent) throughout
- Add quick-start copy-paste command sequence for streamlined deployment
- Improve constants system with comprehensive model, AWS, timeout, and prompt configs
- Add architectural assessment document to .gitignore for local analysis
- Run black formatting across all updated Python files
* docs: Consolidate deployment and security documentation
- Rename deployment-and-security.md to security.md and remove redundant deployment content
- Enhance security.md with comprehensive production security guidelines including:
- Authentication and authorization best practices
- Encryption and data protection requirements
- Operational security monitoring and logging
- Input validation and prompt security measures
- Infrastructure security recommendations
- Compliance and governance frameworks
- Update README.md to reference new security.md file
- Eliminate redundancy between deployment-guide.md and deployment-and-security.md
- Improve documentation organization with clear separation of concerns
* config: Replace hardcoded endpoints with placeholder domains
- Update OpenAPI specifications to use placeholder domain 'your-backend-domain.com'
- k8s_api.yaml: mcpgateway.ddns.net:8011 -> your-backend-domain.com:8011
- logs_api.yaml: mcpgateway.ddns.net:8012 -> your-backend-domain.com:8012
- metrics_api.yaml: mcpgateway.ddns.net:8013 -> your-backend-domain.com:8013
- runbooks_api.yaml: mcpgateway.ddns.net:8014 -> your-backend-domain.com:8014
- Update agent configuration to use placeholder AgentCore gateway endpoint
- agent_config.yaml: Replace specific gateway ID with 'your-agentcore-gateway-endpoint'
- Improve security by removing hardcoded production endpoints from repository
- Enable template-based configuration that users can customize during setup
- Align with existing documentation patterns for placeholder domain replacement