Amit Arora dff915fabb
fix(SRE Agent)- Deploy SRE Agent on Amazon Bedrock AgentCore Runtime with Enhanced Architecture (#158)
* feat: Deploy SRE agent on Amazon Bedrock AgentCore Runtime

- Add agent_runtime.py with FastAPI endpoints for AgentCore compatibility
- Create Dockerfile for ARM64-based containerization
- Add deployment scripts for automated ECR push and AgentCore deployment
- Update backend API URLs from placeholders to actual endpoints
- Update gateway configuration for production use
- Add dependencies for AgentCore runtime support

Implements #143

* chore: Add deployment artifacts to .gitignore

- Add deployment/.sre_agent_uri, deployment/.env, and deployment/.agent_arn to .gitignore
- Remove already tracked deployment artifacts from git

* feat: Make ANTHROPIC_API_KEY optional in deployment

- Update deploy_agent_runtime.py to conditionally include ANTHROPIC_API_KEY
- Show info message when using Amazon Bedrock as provider
- Update .env.example to clarify ANTHROPIC_API_KEY is optional
- Only include ANTHROPIC_API_KEY in environment variables if it exists

* fix: Use uv run python instead of python in build script

- Update build_and_deploy.sh to use 'uv run python' for deployment
- Change to parent directory to ensure uv environment is available
- Fixes 'python: command not found' error during deployment

* refactor: Improve deployment script structure and create .env symlink

- Flatten nested if-else blocks in deploy_agent_runtime.py for better readability
- Add 10-second sleep after deletion to ensure cleanup completes
- Create symlink from deployment/.env to sre_agent/.env to avoid duplication
- Move time import to top of file with other imports

* feat: Add debug mode support and comprehensive deployment guide

Add --debug command line flag and DEBUG environment variable support:
- Created shared logging configuration module
- Updated CLI and runtime to support --debug flag
- Made debug traces conditional on DEBUG environment variable
- Added debug mode for container and AgentCore deployments

Enhanced build and deployment script:
- Added command line argument for ECR repository name
- Added help documentation and usage examples
- Added support for local builds (x86_64) vs AgentCore builds (arm64)
- Added environment variable pass-through for DEBUG, LLM_PROVIDER, ANTHROPIC_API_KEY

Created comprehensive deployment guide:
- Step-by-step instructions from local testing to production
- Docker platform documentation (x86_64 vs arm64)
- Environment variable configuration with .env file usage
- Debug mode examples and troubleshooting guide
- Provider configuration for Bedrock and Anthropic

Updated README with AgentCore Runtime deployment section and documentation links.

* docs: Update SRE Agent README with deployment flow diagram and fix directory reference

- Fix reference from 04-SRE-agent to SRE-agent in README
- Add comprehensive flowchart showing development to production deployment flow
- Update overview to mention Amazon Bedrock AgentCore Runtime deployment
- Remove emojis from documentation for professional appearance

* docs: Replace mermaid diagram with ASCII step-by-step flow diagram

- Change from block-style mermaid diagram to ASCII flow diagram
- Show clear step-by-step progression from development to production
- Improve readability with structured boxes and arrows
- Minor text improvements for clarity

* feat: Implement comprehensive prompt management system and enhance deployment guide

- Create centralized prompt template system with external files in config/prompts/
- Add PromptLoader utility class with LRU caching and template variable substitution
- Integrate PromptConfig into SREConstants for centralized configuration management
- Update all agents (nodes, supervisor, output_formatter) to use prompt loader
- Replace 150+ lines of hardcoded prompts with modular, maintainable template system
- Enhance deployment guide with consistent naming (my_custom_sre_agent) throughout
- Add quick-start copy-paste command sequence for streamlined deployment
- Improve constants system with comprehensive model, AWS, timeout, and prompt configs
- Add architectural assessment document to .gitignore for local analysis
- Run black formatting across all updated Python files

* docs: Consolidate deployment and security documentation

- Rename deployment-and-security.md to security.md and remove redundant deployment content
- Enhance security.md with comprehensive production security guidelines including:
  - Authentication and authorization best practices
  - Encryption and data protection requirements
  - Operational security monitoring and logging
  - Input validation and prompt security measures
  - Infrastructure security recommendations
  - Compliance and governance frameworks
- Update README.md to reference new security.md file
- Eliminate redundancy between deployment-guide.md and deployment-and-security.md
- Improve documentation organization with clear separation of concerns

* config: Replace hardcoded endpoints with placeholder domains

- Update OpenAPI specifications to use placeholder domain 'your-backend-domain.com'
  - k8s_api.yaml: mcpgateway.ddns.net:8011 -> your-backend-domain.com:8011
  - logs_api.yaml: mcpgateway.ddns.net:8012 -> your-backend-domain.com:8012
  - metrics_api.yaml: mcpgateway.ddns.net:8013 -> your-backend-domain.com:8013
  - runbooks_api.yaml: mcpgateway.ddns.net:8014 -> your-backend-domain.com:8014
- Update agent configuration to use placeholder AgentCore gateway endpoint
  - agent_config.yaml: Replace specific gateway ID with 'your-agentcore-gateway-endpoint'
- Improve security by removing hardcoded production endpoints from repository
- Enable template-based configuration that users can customize during setup
- Align with existing documentation patterns for placeholder domain replacement
2025-07-27 15:05:03 -04:00


Security Considerations

Overview

This document outlines security best practices and considerations for deploying and operating the SRE Multi-Agent System in production environments. Security is critical when handling infrastructure data and operational procedures.

Security Best Practices

Authentication and Authorization

  • Implement API authentication using OAuth2 or API keys for infrastructure endpoints
  • Use AWS IAM roles for Bedrock access instead of long-lived credentials
  • Apply principle of least privilege for API access
  • Implement role-based access control (RBAC) for different user types and permissions
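
As a rough illustration of the API key and RBAC points above, the following Python sketch checks a presented key against a stored digest and applies a deny-by-default permission lookup. The role names, permission names, and function names are hypothetical and do not come from the SRE agent codebase; a production system would more likely delegate this to an identity provider or API gateway.

```python
import hashlib
import hmac

# Hypothetical role-to-permission mapping; names are illustrative only.
ROLE_PERMISSIONS = {
    "viewer": frozenset({"read_metrics", "read_logs"}),
    "operator": frozenset({"read_metrics", "read_logs", "read_runbooks"}),
    "admin": frozenset(
        {"read_metrics", "read_logs", "read_runbooks", "manage_agents"}
    ),
}


def hash_key(api_key: str) -> str:
    """Return a SHA-256 hex digest so raw keys never need to be stored."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_api_key(presented_key: str, stored_digest: str) -> bool:
    """Compare digests in constant time to avoid timing side channels."""
    return hmac.compare_digest(hash_key(presented_key), stored_digest)


def is_allowed(role: str, permission: str) -> bool:
    """Deny by default: unknown roles or permissions get no access."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())
```

Keeping the mapping explicit makes least privilege easy to review: a role can do only what is listed for it, and anything unlisted is refused.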

Encryption and Data Protection

  • Enable TLS encryption for all API communications
  • Encrypt sensitive data at rest and in transit
  • Use secure secret management systems for credential storage
  • Protect personally identifiable information (PII) and sensitive infrastructure details
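
The sketch below shows one possible way to apply two of these points from Python using only the standard library: a client TLS context that verifies certificates and refuses protocol versions below TLS 1.2, and a small redaction helper for text that is about to be logged. The function names and the redaction patterns are assumptions for illustration; they are not part of the project and will not cover every kind of sensitive data.

```python
import re
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Client-side context that verifies certificates and requires TLS 1.2+."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


# Illustrative patterns only; real deployments need patterns for their own data.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text: str) -> str:
    """Mask email addresses and IPv4 addresses before text is logged or stored."""
    text = _EMAIL.sub("[REDACTED_EMAIL]", text)
    return _IPV4.sub("[REDACTED_IP]", text)
```

For example, `redact("user alice@example.com on 10.0.0.12")` returns `"user [REDACTED_EMAIL] on [REDACTED_IP]"`.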

Operational Security

  • Implement comprehensive audit logging for agent actions and investigations
  • Regularly rotate API keys and tokens
  • Monitor for unusual access patterns or suspicious activities
  • Enable logging and monitoring for security events and anomalies
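
Audit logging for agent actions could look something like the following minimal sketch, which writes one JSON line per event with a UTC timestamp. The logger name `sre_agent.audit`, the field names, and the function names are assumptions chosen for this example, not the project's actual logging configuration.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical logger name; route it to a dedicated, access-controlled sink.
audit_logger = logging.getLogger("sre_agent.audit")


def build_audit_record(actor: str, action: str, target: str, outcome: str) -> str:
    """Serialize one audit event as a JSON line with a UTC timestamp."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "target": target,
        "outcome": outcome,
    }
    return json.dumps(record, sort_keys=True)


def log_agent_action(actor: str, action: str, target: str, outcome: str) -> str:
    """Emit the audit record and return it so callers can also store it."""
    line = build_audit_record(actor, action, target, outcome)
    audit_logger.info(line)
    return line
```

Structured records of this kind are easier to search and alert on than free-form messages, which helps when monitoring for unusual access patterns.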

Input Validation and Prompt Security

  • Validate all user inputs to reduce the risk of prompt injection attacks
  • Implement input sanitization for queries and commands
  • Use Amazon Bedrock Guardrails to protect against malicious prompts
  • Restrict agent capabilities based on user authorization levels
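
A basic input validation step might resemble the sketch below. It strips control characters, enforces a length limit, and rejects a few known injection phrases. The limit, the phrase list, and the names `validate_query` and `InvalidQueryError` are assumptions for illustration. A deny-list like this cannot catch every injection attempt, so it should be treated as one layer alongside Amazon Bedrock Guardrails and authorization checks, not as a replacement for them.

```python
import re

MAX_QUERY_LENGTH = 2000  # assumed limit; tune for the deployment

# Illustrative phrases only; attackers can rephrase around any fixed list.
_SUSPICIOUS = re.compile(
    r"ignore (?:all |any )?(?:previous|prior) instructions"
    r"|reveal (?:the |your )?system prompt",
    re.IGNORECASE,
)
# ASCII control characters except tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


class InvalidQueryError(ValueError):
    """Raised when a query fails validation."""


def validate_query(raw: str) -> str:
    """Return a cleaned query, or raise InvalidQueryError if it is rejected."""
    cleaned = _CONTROL_CHARS.sub("", raw).strip()
    if not cleaned:
        raise InvalidQueryError("query is empty")
    if len(cleaned) > MAX_QUERY_LENGTH:
        raise InvalidQueryError("query exceeds maximum length")
    if _SUSPICIOUS.search(cleaned):
        raise InvalidQueryError("query contains a disallowed phrase")
    return cleaned
```

Rejected queries are a useful signal in their own right and are worth recording in the audit log.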

Infrastructure Security

  • Deploy the system in secure network environments with proper firewall rules
  • Use VPC endpoints for AWS service communications when possible
  • Implement network segmentation between different system components
  • Regularly update dependencies and apply security patches

Compliance and Governance

  • Maintain audit trails for compliance requirements
  • Implement data retention policies for logs and investigation records
  • Ensure compliance with organizational security policies and standards
  • Conduct regular security assessments and penetration testing
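
A data retention policy can be enforced with a small periodic job. The sketch below separates records into those still inside the retention window and those that have expired. The 90-day value, the `created_at` field, and the function name are assumptions for this example; the actual period should come from the organization's compliance requirements.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # assumed value; set according to policy


def split_by_retention(records, now, retention_days=RETENTION_DAYS):
    """Return (kept, expired) lists based on each record's 'created_at' datetime."""
    cutoff = now - timedelta(days=retention_days)
    kept = [r for r in records if r["created_at"] >= cutoff]
    expired = [r for r in records if r["created_at"] < cutoff]
    return kept, expired
```

Passing `now` in explicitly, for example `datetime.now(timezone.utc)`, keeps the function deterministic and easy to test. Expired records can then be deleted or archived, with the deletion itself written to the audit trail.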