Amit Arora dff915fabb
fix(SRE Agent)- Deploy SRE Agent on Amazon Bedrock AgentCore Runtime with Enhanced Architecture (#158)
* feat: Deploy SRE agent on Amazon Bedrock AgentCore Runtime

- Add agent_runtime.py with FastAPI endpoints for AgentCore compatibility
- Create Dockerfile for ARM64-based containerization
- Add deployment scripts for automated ECR push and AgentCore deployment
- Update backend API URLs from placeholders to actual endpoints
- Update gateway configuration for production use
- Add dependencies for AgentCore runtime support

Implements #143

* chore: Add deployment artifacts to .gitignore

- Add deployment/.sre_agent_uri, deployment/.env, and deployment/.agent_arn to .gitignore
- Remove already tracked deployment artifacts from git

* feat: Make ANTHROPIC_API_KEY optional in deployment

- Update deploy_agent_runtime.py to conditionally include ANTHROPIC_API_KEY
- Show info message when using Amazon Bedrock as provider
- Update .env.example to clarify ANTHROPIC_API_KEY is optional
- Only include ANTHROPIC_API_KEY in environment variables if it exists
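
A minimal sketch of the conditional inclusion, assuming the runtime environment is assembled as a plain dict before deployment. build_environment_variables is a hypothetical name, not the function in deploy_agent_runtime.py, and logging an info message whenever the key is absent is an assumption about how the Bedrock notice is triggered.

```python
import logging
import os
from typing import Dict, Mapping, Optional

logger = logging.getLogger(__name__)


def build_environment_variables(
    source: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: collect runtime env vars, skipping unset ones."""
    src = os.environ if source is None else source
    env_vars: Dict[str, str] = {}

    # Pass the provider selection through only when it is set
    if src.get("LLM_PROVIDER"):
        env_vars["LLM_PROVIDER"] = src["LLM_PROVIDER"]

    # Include the Anthropic key only if it exists and is non-empty
    api_key = src.get("ANTHROPIC_API_KEY")
    if api_key:
        env_vars["ANTHROPIC_API_KEY"] = api_key
    else:
        # Assumption: a missing key means Amazon Bedrock is the provider
        logger.info("ANTHROPIC_API_KEY not set; using Amazon Bedrock as provider")
    return env_vars
```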

* fix: Use uv run python instead of python in build script

- Update build_and_deploy.sh to use 'uv run python' for deployment
- Change to parent directory to ensure uv environment is available
- Fixes 'python: command not found' error during deployment

* refactor: Improve deployment script structure and create .env symlink

- Flatten nested if-else blocks in deploy_agent_runtime.py for better readability
- Add 10-second sleep after deletion to ensure cleanup completes
- Create symlink from deployment/.env to sre_agent/.env to avoid duplication
- Move time import to top of file with other imports

* feat: Add debug mode support and comprehensive deployment guide

Add --debug command line flag and DEBUG environment variable support:
- Created shared logging configuration module
- Updated CLI and runtime to support --debug flag
- Made debug traces conditional on DEBUG environment variable
- Added debug mode for container and AgentCore deployments
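
A minimal sketch of resolving debug mode from either source, assuming the flag is --debug and the variable is DEBUG as named above. resolve_debug and configure_logging are hypothetical helpers, not the shared logging module from this commit; the log format is copied from the prompt loader file in this view. That file calls logging.basicConfig at INFO level on import, so its logger.debug messages stay hidden unless the root level is lowered afterwards, which is what force=True (Python 3.8+) allows.

```python
import argparse
import logging
import os
from typing import List, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def resolve_debug(argv: Optional[List[str]] = None) -> bool:
    """Return True if --debug was passed or DEBUG holds a truthy value."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--debug", action="store_true")
    # parse_known_args ignores flags this sketch does not define
    args, _ = parser.parse_known_args(argv)
    env_debug = os.getenv("DEBUG", "").strip().lower() in _TRUTHY
    return args.debug or env_debug


def configure_logging(debug: bool) -> int:
    """Set the root log level; force=True replaces earlier basicConfig handlers."""
    level = logging.DEBUG if debug else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s,p%(process)s,{%(filename)s:%(lineno)d},%(levelname)s,%(message)s",
        force=True,
    )
    return level
```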

Enhanced build and deployment script:
- Added command line argument for ECR repository name
- Added help documentation and usage examples
- Added support for local builds (x86_64) vs AgentCore builds (arm64)
- Added environment variable pass-through for DEBUG, LLM_PROVIDER, ANTHROPIC_API_KEY

Created comprehensive deployment guide:
- Step-by-step instructions from local testing to production
- Docker platform documentation (x86_64 vs arm64)
- Environment variable configuration with .env file usage
- Debug mode examples and troubleshooting guide
- Provider configuration for Bedrock and Anthropic

Updated README with AgentCore Runtime deployment section and documentation links.

* docs: Update SRE Agent README with deployment flow diagram and fix directory reference

- Fix reference from 04-SRE-agent to SRE-agent in README
- Add comprehensive flowchart showing development to production deployment flow
- Update overview to mention Amazon Bedrock AgentCore Runtime deployment
- Remove emojis from documentation for professional appearance

* docs: Replace mermaid diagram with ASCII step-by-step flow diagram

- Change from block-style mermaid diagram to ASCII flow diagram
- Show clear step-by-step progression from development to production
- Improve readability with structured boxes and arrows
- Minor text improvements for clarity

* feat: Implement comprehensive prompt management system and enhance deployment guide

- Create centralized prompt template system with external files in config/prompts/
- Add PromptLoader utility class with LRU caching and template variable substitution
- Integrate PromptConfig into SREConstants for centralized configuration management
- Update all agents (nodes, supervisor, output_formatter) to use prompt loader
- Replace 150+ lines of hardcoded prompts with modular, maintainable template system
- Enhance deployment guide with consistent naming (my_custom_sre_agent) throughout
- Add quick-start copy-paste command sequence for streamlined deployment
- Improve constants system with comprehensive model, AWS, timeout, and prompt configs
- Add architectural assessment document to .gitignore for local analysis
- Run black formatting across all updated Python files
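
The template substitution in PromptLoader.load_template (in the file below) relies on str.format(**kwargs). A small sketch of what that implies for prompt authors: unused keyword arguments are ignored, a placeholder with no matching argument raises KeyError (which load_template re-raises as ValueError), and literal braces such as JSON examples must be doubled. render_prompt is a hypothetical stand-in that mirrors that error mapping, not part of the module.

```python
def render_prompt(template: str, **kwargs: object) -> str:
    """Hypothetical stand-in for load_template's formatting step."""
    try:
        return template.format(**kwargs)
    except KeyError as e:
        # Unknown placeholder, including unescaped literal braces
        raise ValueError(f"Missing required template variable: {e}") from e


# Extra variables (e.g. current_step for a non-plan template) are ignored
print(render_prompt("Query: {query}", query="Why are pods crashing?", current_step=1))

# Literal JSON in a prompt file must be written with doubled braces
print(render_prompt('Reply as {{"status": "ok"}} for {query}', query="pod-a"))
```

A prompt file containing a single-brace JSON example such as {"status": "ok"} therefore fails at format time rather than being passed through verbatim.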

* docs: Consolidate deployment and security documentation

- Rename deployment-and-security.md to security.md and remove redundant deployment content
- Enhance security.md with comprehensive production security guidelines including:
  - Authentication and authorization best practices
  - Encryption and data protection requirements
  - Operational security monitoring and logging
  - Input validation and prompt security measures
  - Infrastructure security recommendations
  - Compliance and governance frameworks
- Update README.md to reference new security.md file
- Eliminate redundancy between deployment-guide.md and deployment-and-security.md
- Improve documentation organization with clear separation of concerns

* config: Replace hardcoded endpoints with placeholder domains

- Update OpenAPI specifications to use placeholder domain 'your-backend-domain.com'
  - k8s_api.yaml: mcpgateway.ddns.net:8011 -> your-backend-domain.com:8011
  - logs_api.yaml: mcpgateway.ddns.net:8012 -> your-backend-domain.com:8012
  - metrics_api.yaml: mcpgateway.ddns.net:8013 -> your-backend-domain.com:8013
  - runbooks_api.yaml: mcpgateway.ddns.net:8014 -> your-backend-domain.com:8014
- Update agent configuration to use placeholder AgentCore gateway endpoint
  - agent_config.yaml: Replace specific gateway ID with 'your-agentcore-gateway-endpoint'
- Improve security by removing hardcoded production endpoints from repository
- Enable template-based configuration that users can customize during setup
- Align with existing documentation patterns for placeholder domain replacement
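
One way a user might swap in their own backend host during setup, assuming the OpenAPI specs are *.yaml files in a single directory. replace_placeholder_domain is hypothetical; the repository may provide its own setup step, and this does plain text substitution rather than YAML parsing.

```python
from pathlib import Path
from typing import List, Union

PLACEHOLDER_DOMAIN = "your-backend-domain.com"


def replace_placeholder_domain(
    spec_dir: Union[str, Path], domain: str, placeholder: str = PLACEHOLDER_DOMAIN
) -> List[str]:
    """Replace the placeholder host in every *.yaml file directly under spec_dir.

    Returns the names of the files that were changed.
    """
    changed: List[str] = []
    for spec_path in sorted(Path(spec_dir).glob("*.yaml")):
        text = spec_path.read_text(encoding="utf-8")
        if placeholder not in text:
            continue
        # Plain text substitution keeps the port suffix (e.g. :8011) intact
        spec_path.write_text(text.replace(placeholder, domain), encoding="utf-8")
        changed.append(spec_path.name)
    return changed
```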
2025-07-27 15:05:03 -04:00


#!/usr/bin/env python3
import logging
from functools import lru_cache
from pathlib import Path
from typing import Dict, Any, Optional

# Configure logging with basicConfig
logging.basicConfig(
    level=logging.INFO,  # Set the log level to INFO
    # Define log message format
    format="%(asctime)s,p%(process)s,{%(filename)s:%(lineno)d},%(levelname)s,%(message)s",
)

logger = logging.getLogger(__name__)


class PromptLoader:
    """Utility class for loading and managing prompt templates."""

    def __init__(self, prompts_dir: Optional[str] = None):
        """Initialize the prompt loader.

        Args:
            prompts_dir: Directory containing prompt files. If None, uses default relative path.
        """
        if prompts_dir:
            self.prompts_dir = Path(prompts_dir)
        else:
            # Default to config/prompts relative to this file
            self.prompts_dir = Path(__file__).parent / "config" / "prompts"

        logger.debug(f"PromptLoader initialized with prompts_dir: {self.prompts_dir}")

    # Note: lru_cache on a method keys entries on (self, filename), shares one
    # 32-entry cache across all instances, and keeps cached instances alive.
    # Edited prompt files are not re-read until the cache is cleared.
    @lru_cache(maxsize=32)
    def _load_prompt_file(self, filename: str) -> str:
        """Load a prompt file with caching.

        Args:
            filename: Name of the prompt file to load

        Returns:
            Content of the prompt file

        Raises:
            FileNotFoundError: If the prompt file doesn't exist
            IOError: If there's an error reading the file
        """
        filepath = self.prompts_dir / filename
        if not filepath.exists():
            raise FileNotFoundError(f"Prompt file not found: {filepath}")

        try:
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read().strip()
            logger.debug(f"Loaded prompt file: {filename}")
            return content
        except Exception as e:
            logger.error(f"Error loading prompt file {filename}: {e}")
            raise IOError(f"Failed to read prompt file {filename}: {e}") from e

    def load_prompt(self, prompt_name: str) -> str:
        """Load a prompt by name.

        Args:
            prompt_name: Name of the prompt (without .txt extension)

        Returns:
            Content of the prompt file
        """
        filename = f"{prompt_name}.txt"
        return self._load_prompt_file(filename)

    def load_template(self, template_name: str, **kwargs) -> str:
        """Load a prompt template and substitute variables.

        Args:
            template_name: Name of the template (without .txt extension)
            **kwargs: Variables to substitute in the template

        Returns:
            Template content with variables substituted

        Raises:
            ValueError: If a placeholder has no matching keyword argument
                or formatting fails for another reason
        """
        template_content = self.load_prompt(template_name)

        # str.format semantics: unused kwargs are ignored, and literal braces in a
        # prompt file (e.g., JSON examples) must be escaped as {{ and }}.
        try:
            return template_content.format(**kwargs)
        except KeyError as e:
            logger.error(f"Missing template variable {e} in template {template_name}")
            raise ValueError(f"Missing required template variable: {e}") from e
        except Exception as e:
            logger.error(f"Error formatting template {template_name}: {e}")
            raise ValueError(f"Error formatting template {template_name}: {e}") from e

    def get_agent_prompt(
        self, agent_type: str, agent_name: str, agent_description: str
    ) -> str:
        """Combine base agent prompt with agent-specific prompt.

        Args:
            agent_type: Type of agent (kubernetes, logs, metrics, runbooks)
            agent_name: Display name of the agent
            agent_description: Description of the agent's capabilities

        Returns:
            Complete system prompt for the agent
        """
        try:
            # Load base prompt template
            base_prompt = self.load_template(
                "agent_base_prompt",
                agent_name=agent_name,
                agent_description=agent_description,
            )

            # Load agent-specific prompt if it exists
            try:
                agent_specific_prompt = self.load_prompt(f"{agent_type}_agent_prompt")
                combined_prompt = f"{base_prompt}\n\n{agent_specific_prompt}"
            except FileNotFoundError:
                logger.warning(f"No specific prompt found for agent type: {agent_type}")
                combined_prompt = base_prompt

            return combined_prompt

        except Exception as e:
            logger.error(f"Error building agent prompt for {agent_type}: {e}")
            raise

    def get_supervisor_aggregation_prompt(
        self,
        is_plan_based: bool,
        query: str,
        agent_results: str,
        auto_approve_plan: bool = False,
        **kwargs,
    ) -> str:
        """Get supervisor aggregation prompt based on context.

        Args:
            is_plan_based: Whether this is a plan-based aggregation
            query: Original user query
            agent_results: JSON string of agent results
            auto_approve_plan: Whether to include auto-approve instruction
            **kwargs: Additional template variables (e.g., current_step, total_steps, plan)

        Returns:
            Formatted aggregation prompt
        """
        try:
            # Determine auto-approve instruction
            auto_approve_instruction = ""
            if auto_approve_plan:
                auto_approve_instruction = (
                    "\n\nIMPORTANT: Do not ask any follow-up questions or suggest that "
                    "the user can ask for more details. Provide a complete, conclusive response."
                )

            template_vars = {
                "query": query,
                "agent_results": agent_results,
                "auto_approve_instruction": auto_approve_instruction,
                **kwargs,
            }

            if is_plan_based:
                return self.load_template(
                    "supervisor_plan_aggregation", **template_vars
                )
            else:
                return self.load_template(
                    "supervisor_standard_aggregation", **template_vars
                )

        except Exception as e:
            logger.error(f"Error building supervisor aggregation prompt: {e}")
            raise

    def get_executive_summary_prompts(
        self, query: str, results_text: str
    ) -> tuple[str, str]:
        """Get system and user prompts for executive summary generation.

        Args:
            query: Original user query
            results_text: Formatted investigation results

        Returns:
            Tuple of (system_prompt, user_prompt)
        """
        try:
            system_prompt = self.load_prompt("executive_summary_system")
            user_prompt = self.load_template(
                "executive_summary_user_template",
                query=query,
                results_text=results_text,
            )
            return system_prompt, user_prompt

        except Exception as e:
            logger.error(f"Error building executive summary prompts: {e}")
            raise

    def list_available_prompts(self) -> list[str]:
        """List all available prompt files.

        Returns:
            List of prompt names (without .txt extension)
        """
        try:
            prompt_files = list(self.prompts_dir.glob("*.txt"))
            return [f.stem for f in prompt_files]
        except Exception as e:
            logger.error(f"Error listing prompt files: {e}")
            return []


# Convenience instance for easy import
prompt_loader = PromptLoader()


# Convenience functions for backward compatibility
def load_prompt(prompt_name: str) -> str:
    """Load a prompt by name using the default loader."""
    return prompt_loader.load_prompt(prompt_name)


def load_template(template_name: str, **kwargs) -> str:
    """Load and format a template using the default loader."""
    return prompt_loader.load_template(template_name, **kwargs)


def get_agent_prompt(agent_type: str, agent_name: str, agent_description: str) -> str:
    """Get complete agent prompt using the default loader."""
    return prompt_loader.get_agent_prompt(agent_type, agent_name, agent_description)
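
The @lru_cache decorator on _load_prompt_file keeps a single 32-entry cache for every PromptLoader, keyed on (instance, filename). If a cache owned by each instance is preferred, one common alternative is to wrap the bound method with lru_cache inside __init__. CachedPromptReader below is a hypothetical class sketching that pattern, not part of this module.

```python
from functools import lru_cache
from pathlib import Path
from typing import Union


class CachedPromptReader:
    """Hypothetical reader whose LRU cache belongs to the instance."""

    def __init__(self, prompts_dir: Union[str, Path], maxsize: int = 32):
        self.prompts_dir = Path(prompts_dir)
        # Wrapping the bound method gives each instance its own cache,
        # keyed on filename only
        self.read = lru_cache(maxsize=maxsize)(self._read_uncached)

    def _read_uncached(self, filename: str) -> str:
        filepath = self.prompts_dir / filename
        if not filepath.exists():
            raise FileNotFoundError(f"Prompt file not found: {filepath}")
        return filepath.read_text(encoding="utf-8").strip()

    def clear_cache(self) -> None:
        """Drop cached contents so edited files are re-read."""
        self.read.cache_clear()
```

The wrapper holds a reference back to the instance, forming a cycle that Python's garbage collector reclaims once the reader is no longer used. As with the decorator version, cached text is returned unchanged after a prompt file is edited until clear_cache() is called.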