fix(SRE Agent)- Deploy SRE Agent on Amazon Bedrock AgentCore Runtime with Enhanced Architecture (#158)
* feat: Deploy SRE agent on Amazon Bedrock AgentCore Runtime
- Add agent_runtime.py with FastAPI endpoints for AgentCore compatibility
- Create Dockerfile for ARM64-based containerization
- Add deployment scripts for automated ECR push and AgentCore deployment
- Update backend API URLs from placeholders to actual endpoints
- Update gateway configuration for production use
- Add dependencies for AgentCore runtime support
Implements #143
* chore: Add deployment artifacts to .gitignore
- Add deployment/.sre_agent_uri, deployment/.env, and deployment/.agent_arn to .gitignore
- Remove already tracked deployment artifacts from git
* feat: Make ANTHROPIC_API_KEY optional in deployment
- Update deploy_agent_runtime.py to conditionally include ANTHROPIC_API_KEY
- Show info message when using Amazon Bedrock as provider
- Update .env.example to clarify ANTHROPIC_API_KEY is optional
- Only include ANTHROPIC_API_KEY in environment variables if it exists
* fix: Use uv run python instead of python in build script
- Update build_and_deploy.sh to use 'uv run python' for deployment
- Change to parent directory to ensure uv environment is available
- Fixes 'python: command not found' error during deployment
* refactor: Improve deployment script structure and create .env symlink
- Flatten nested if-else blocks in deploy_agent_runtime.py for better readability
- Add 10-second sleep after deletion to ensure cleanup completes
- Create symlink from deployment/.env to sre_agent/.env to avoid duplication
- Move time import to top of file with other imports
* feat: Add debug mode support and comprehensive deployment guide
Add --debug command line flag and DEBUG environment variable support:
- Created shared logging configuration module
- Updated CLI and runtime to support --debug flag
- Made debug traces conditional on DEBUG environment variable
- Added debug mode for container and AgentCore deployments
Enhanced build and deployment script:
- Added command line argument for ECR repository name
- Added help documentation and usage examples
- Added support for local builds (x86_64) vs AgentCore builds (arm64)
- Added environment variable pass-through for DEBUG, LLM_PROVIDER, ANTHROPIC_API_KEY
Created comprehensive deployment guide:
- Step-by-step instructions from local testing to production
- Docker platform documentation (x86_64 vs arm64)
- Environment variable configuration with .env file usage
- Debug mode examples and troubleshooting guide
- Provider configuration for Bedrock and Anthropic
Updated README with AgentCore Runtime deployment section and documentation links.
* docs: Update SRE Agent README with deployment flow diagram and fix directory reference
- Fix reference from 04-SRE-agent to SRE-agent in README
- Add comprehensive flowchart showing development to production deployment flow
- Update overview to mention Amazon Bedrock AgentCore Runtime deployment
- Remove emojis from documentation for professional appearance
* docs: Replace mermaid diagram with ASCII step-by-step flow diagram
- Change from block-style mermaid diagram to ASCII flow diagram
- Show clear step-by-step progression from development to production
- Improve readability with structured boxes and arrows
- Minor text improvements for clarity
* feat: Implement comprehensive prompt management system and enhance deployment guide
- Create centralized prompt template system with external files in config/prompts/
- Add PromptLoader utility class with LRU caching and template variable substitution
- Integrate PromptConfig into SREConstants for centralized configuration management
- Update all agents (nodes, supervisor, output_formatter) to use prompt loader
- Replace 150+ lines of hardcoded prompts with modular, maintainable template system
- Enhance deployment guide with consistent naming (my_custom_sre_agent) throughout
- Add quick-start copy-paste command sequence for streamlined deployment
- Improve constants system with comprehensive model, AWS, timeout, and prompt configs
- Add architectural assessment document to .gitignore for local analysis
- Run black formatting across all updated Python files
* docs: Consolidate deployment and security documentation
- Rename deployment-and-security.md to security.md and remove redundant deployment content
- Enhance security.md with comprehensive production security guidelines including:
  - Authentication and authorization best practices
  - Encryption and data protection requirements
  - Operational security monitoring and logging
  - Input validation and prompt security measures
  - Infrastructure security recommendations
  - Compliance and governance frameworks
- Update README.md to reference new security.md file
- Eliminate redundancy between deployment-guide.md and deployment-and-security.md
- Improve documentation organization with clear separation of concerns
* config: Replace hardcoded endpoints with placeholder domains
- Update OpenAPI specifications to use placeholder domain 'your-backend-domain.com'
  - k8s_api.yaml: mcpgateway.ddns.net:8011 -> your-backend-domain.com:8011
  - logs_api.yaml: mcpgateway.ddns.net:8012 -> your-backend-domain.com:8012
  - metrics_api.yaml: mcpgateway.ddns.net:8013 -> your-backend-domain.com:8013
  - runbooks_api.yaml: mcpgateway.ddns.net:8014 -> your-backend-domain.com:8014
- Update agent configuration to use placeholder AgentCore gateway endpoint
  - agent_config.yaml: Replace specific gateway ID with 'your-agentcore-gateway-endpoint'
- Improve security by removing hardcoded production endpoints from repository
- Enable template-based configuration that users can customize during setup
- Align with existing documentation patterns for placeholder domain replacement
2025-07-27 15:05:03 -04:00
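The prompt management entry in the commit message above mentions a PromptLoader with LRU caching and template variable substitution. That class is not part of this file, so what follows is only a minimal sketch of the idea, not the repository's implementation. The class layout, the `.txt` suffix, and the `$name` placeholder syntax are all assumptions made for illustration.

```python
from functools import lru_cache
from pathlib import Path
from string import Template


class PromptLoader:
    """Hypothetical loader: reads prompt templates from a directory and caches them."""

    def __init__(self, prompts_dir: str = "config/prompts") -> None:
        self.prompts_dir = Path(prompts_dir)
        # Wrap the bound method per instance so each loader has its own cache
        self._read = lru_cache(maxsize=32)(self._read_uncached)

    def _read_uncached(self, name: str) -> str:
        # Assumed file layout: <prompts_dir>/<name>.txt
        return (self.prompts_dir / f"{name}.txt").read_text(encoding="utf-8")

    def render(self, name: str, **variables: str) -> str:
        # safe_substitute leaves unknown $placeholders untouched instead of raising
        return Template(self._read(name)).safe_substitute(**variables)
```

With a file `config/prompts/supervisor.txt` containing `Route the query: $query`, calling `PromptLoader().render("supervisor", query="pod restarts")` would fill in the placeholder, and repeated calls would reuse the cached file contents.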
#!/usr/bin/env python3

import asyncio
import logging
import os
from datetime import datetime, timezone
from typing import Any, Dict

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain_core.messages import HumanMessage
from langchain_core.tools import BaseTool

from .multi_agent_langgraph import create_multi_agent_system
from .agent_state import AgentState
from .constants import SREConstants

# Import logging config
from .logging_config import configure_logging

# Configure logging based on DEBUG environment variable
# This ensures debug mode works even when not run via __main__
if not logging.getLogger().handlers:
    # Check if DEBUG is already set in environment
    debug_from_env = os.getenv("DEBUG", "false").lower() in ("true", "1", "yes")
    configure_logging(debug_from_env)

# Raise the uvicorn access log level to WARNING so that routine requests,
# such as frequent /ping health checks, do not flood the output
logging.getLogger("uvicorn.access").setLevel(logging.WARNING)

logger = logging.getLogger(__name__)

# Simple FastAPI app
app = FastAPI(title="SRE Agent Runtime", version="1.0.0")


# Simple request/response models
class InvocationRequest(BaseModel):
    input: Dict[str, Any]


class InvocationResponse(BaseModel):
    output: Dict[str, Any]


# Global variables for agent state
agent_graph = None
tools: list[BaseTool] = []


async def initialize_agent():
    """Initialize the SRE agent system using the same method as CLI."""
    global agent_graph, tools

    if agent_graph is not None:
        return  # Already initialized

    try:
        logger.info("Initializing SRE Agent system...")

        # Get provider from environment variable with bedrock as default
        provider = os.getenv("LLM_PROVIDER", "bedrock").lower()

        # Validate provider
        if provider not in ["anthropic", "bedrock"]:
            logger.warning(f"Invalid provider '{provider}', defaulting to 'bedrock'")
            provider = "bedrock"

        logger.info(f"Environment LLM_PROVIDER: {os.getenv('LLM_PROVIDER', 'NOT_SET')}")
        logger.info(f"Using LLM provider: {provider}")
        logger.info(f"Calling create_multi_agent_system with provider: {provider}")

        # Create multi-agent system using the same function as CLI
        agent_graph, tools = await create_multi_agent_system(provider)

        logger.info(
            f"SRE Agent system initialized successfully with {len(tools)} tools"
        )

    except Exception as e:
        from .llm_utils import LLMAuthenticationError, LLMAccessError, LLMProviderError

        if isinstance(e, (LLMAuthenticationError, LLMAccessError, LLMProviderError)):
            logger.error(f"LLM Provider Error: {e}")
            print(f"\n❌ {type(e).__name__}:")
            print(str(e))
            print(f"\n💡 Set LLM_PROVIDER environment variable to switch providers:")
            other_provider = "anthropic" if provider == "bedrock" else "bedrock"
            print(f" export LLM_PROVIDER={other_provider}")
        else:
            logger.error(f"Failed to initialize SRE Agent system: {e}")
        raise


@app.on_event("startup")
async def startup_event():
    """Initialize agent on startup."""
    await initialize_agent()


@app.post("/invocations", response_model=InvocationResponse)
async def invoke_agent(request: InvocationRequest):
    """Main agent invocation endpoint."""
    global agent_graph, tools

    logger.info("Received invocation request")

    try:
        # Ensure agent is initialized
        await initialize_agent()

        # Extract user prompt
        user_prompt = request.input.get("prompt", "")
        if not user_prompt:
            raise HTTPException(
                status_code=400,
                detail="No prompt found in input. Please provide a 'prompt' key in the input.",
            )

        logger.info(f"Processing query: {user_prompt}")

        # Create initial state exactly like the CLI does
        initial_state: AgentState = {
            "messages": [HumanMessage(content=user_prompt)],
            "next": "supervisor",
            "agent_results": {},
            "current_query": user_prompt,
            "metadata": {},
            "requires_collaboration": False,
            "agents_invoked": [],
            "final_response": None,
            "auto_approve_plan": True,  # Always auto-approve plans in runtime mode
        }

        # Process through the agent graph exactly like the CLI
        final_response = ""

        logger.info("Starting agent graph execution")

        async for event in agent_graph.astream(initial_state):
            for node_name, node_output in event.items():
                logger.info(f"Processing node: {node_name}")

                # Log key events from each node
                if node_name == "supervisor":
                    next_agent = node_output.get("next", "")
                    metadata = node_output.get("metadata", {})
                    logger.info(f"Supervisor routing to: {next_agent}")
                    if metadata.get("routing_reasoning"):
                        logger.info(
                            f"Routing reasoning: {metadata['routing_reasoning']}"
                        )

                elif node_name in [
                    "kubernetes_agent",
                    "logs_agent",
                    "metrics_agent",
                    "runbooks_agent",
                ]:
                    agent_results = node_output.get("agent_results", {})
                    logger.info(f"{node_name} completed with results")

                # Capture final response from aggregate node
                elif node_name == "aggregate":
                    final_response = node_output.get("final_response", "")
                    logger.info("Aggregate node completed, final response captured")

        if not final_response:
            logger.warning("No final response received from agent graph")
            final_response = (
                "I encountered an issue processing your request. Please try again."
            )
        else:
            logger.info(f"Final response length: {len(final_response)} characters")

        # Simple response format
        response_data = {
            "message": final_response,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": SREConstants.app.agent_model_name,
        }

        logger.info("Successfully processed agent request")
        logger.info("Returning invocation response")
        return InvocationResponse(output=response_data)

    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Agent processing failed: {e}")
        logger.exception("Full exception details:")
        raise HTTPException(
            status_code=500, detail=f"Agent processing failed: {str(e)}"
        )


@app.get("/ping")
async def ping():
    """Health check endpoint."""
    return {"status": "healthy"}


async def invoke_sre_agent_async(prompt: str, provider: str = "anthropic") -> str:
    """
    Programmatic interface to invoke SRE agent.

    Args:
        prompt: The user prompt/query
        provider: LLM provider ("anthropic" or "bedrock")

    Returns:
        The agent's response as a string
    """
    try:
        # Create the multi-agent system
        graph, tools = await create_multi_agent_system(provider=provider)

        # Create initial state
        initial_state: AgentState = {
            "messages": [HumanMessage(content=prompt)],
            "next": "supervisor",
            "agent_results": {},
            "current_query": prompt,
            "metadata": {},
            "requires_collaboration": False,
            "agents_invoked": [],
            "final_response": None,
        }

        # Execute and get final response
        final_response = ""
        async for event in graph.astream(initial_state):
            for node_name, node_output in event.items():
                if node_name == "aggregate":
                    final_response = node_output.get("final_response", "")

        return final_response or "I encountered an issue processing your request."

    except Exception as e:
        logger.error(f"Agent invocation failed: {e}")
        raise


def invoke_sre_agent(prompt: str, provider: str = "anthropic") -> str:
    """
    Synchronous wrapper for invoke_sre_agent_async.

    Args:
        prompt: The user prompt/query
        provider: LLM provider ("anthropic" or "bedrock")

    Returns:
        The agent's response as a string
    """
    return asyncio.run(invoke_sre_agent_async(prompt, provider))


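Both the endpoint and the programmatic interface drain the graph's event stream and keep only the `final_response` emitted by the `aggregate` node. The pattern can be tried without any LLM by substituting a stub. In this sketch, `FakeGraph` and `collect_final_response` are hypothetical names, and the stub only imitates the `{node_name: node_output}` event shape used above; it is not the real compiled graph.

```python
import asyncio
from typing import Any, AsyncIterator, Dict


class FakeGraph:
    """Hypothetical stand-in for the agent graph; yields {node_name: output} events."""

    async def astream(self, state: Dict[str, Any]) -> AsyncIterator[Dict[str, Any]]:
        yield {"supervisor": {"next": "logs_agent", "metadata": {}}}
        yield {"logs_agent": {"agent_results": {"logs_agent": "2 errors found"}}}
        yield {"aggregate": {"final_response": f"Summary for: {state['current_query']}"}}


async def collect_final_response(graph: Any, state: Dict[str, Any]) -> str:
    final_response = ""
    async for event in graph.astream(state):
        for node_name, node_output in event.items():
            # Only the aggregate node carries the user-facing answer
            if node_name == "aggregate":
                final_response = node_output.get("final_response", "")
    # Fall back to a generic message when no aggregate event was seen
    return final_response or "I encountered an issue processing your request."


result = asyncio.run(
    collect_final_response(FakeGraph(), {"current_query": "pod restarts"})
)
```

Because the loop ignores every node except `aggregate`, a graph that never reaches that node produces the fallback message rather than an empty string.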
if __name__ == "__main__":
    import argparse
    import uvicorn

    parser = argparse.ArgumentParser(description="SRE Agent Runtime")
    parser.add_argument(
        "--provider",
        choices=["anthropic", "bedrock"],
        default=os.getenv("LLM_PROVIDER", "bedrock"),
        help="LLM provider to use (default: bedrock)",
    )
    parser.add_argument("--host", default="0.0.0.0", help="Host to bind to")
    parser.add_argument("--port", type=int, default=8080, help="Port to bind to")
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Enable debug logging and trace output",
    )

    args = parser.parse_args()

    # Configure logging based on debug flag
    from .logging_config import configure_logging

    debug_enabled = configure_logging(args.debug)

    # Set environment variables
    os.environ["LLM_PROVIDER"] = args.provider
    os.environ["DEBUG"] = "true" if debug_enabled else "false"

    logger.info(f"Starting SRE Agent Runtime with provider: {args.provider}")
    if debug_enabled:
        logger.info("Debug logging enabled")
    uvicorn.run(app, host=args.host, port=args.port)
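A client talks to this runtime by POSTing a JSON body of the form `{"input": {"prompt": ...}}` to `/invocations` and reading `output.message` from the reply. The sketch below builds such a request with the standard library only. The helper names are hypothetical, and the `http://localhost:8080` base URL is an assumption based on the default `--port` above.

```python
import json
from urllib import request


def build_invocation_request(
    prompt: str, base_url: str = "http://localhost:8080"
) -> request.Request:
    """Build a POST to /invocations carrying the {"input": {"prompt": ...}} body."""
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return request.Request(
        f"{base_url}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_message(response_body: bytes) -> str:
    """Pull the agent's answer out of an InvocationResponse-shaped JSON body."""
    return json.loads(response_body)["output"]["message"]
```

With a runtime already listening, the request can be sent with `urllib.request.urlopen(req)` and the response bytes passed to `extract_message`. A body without a `prompt` key is rejected by the endpoint with a 400 status.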