SRE Agent Memory System
Overview
The SRE Agent includes a long-term memory system built on Amazon Bedrock AgentCore Memory that enables persistent user preferences, cross-session learning, and personalized investigation experiences. It remembers user preferences, learns from past investigations, and tailors reports to individual user roles and workflows.
The system provides three distinct memory strategies for different types of information and comes pre-configured with user personas to demonstrate personalized investigations.
Pre-configured User Personas
The system comes with two example user personas in scripts/user_config.yaml that demonstrate how personalized investigations work:
Alice - Technical SRE Engineer
- Investigation Style: Detailed, systematic, multi-dimensional investigations with comprehensive analysis
- Communication: Technical team channels (#alice-alerts, #sre-team) with detailed metrics and troubleshooting steps
- Escalation: Technical management (alice.manager@company.com) with 15-minute delay threshold
- Reports: Technical exposition with step-by-step methodologies and complete tool references
- Preferences: Detailed analysis, UTC timezone, includes troubleshooting steps
Carol - Executive/Director
- Investigation Style: Executive-focused with business impact analysis and streamlined presentation
- Communication: Strategic channels (#carol-executive, #strategic-alerts) with filtered notifications (critical only)
- Escalation: Executive team (carol.director@company.com) with a 20-minute timeline
- Reports: Business-focused summaries without detailed technical steps, emphasizing impact and business consequences
- Preferences: Executive summary format, EST timezone, business impact focus
Personalized Investigation Examples
When running investigations with different user IDs, the agent produces similar technical findings but presents them according to each user's preferences:
# Alice's detailed technical investigation
USER_ID=Alice sre-agent --prompt "API response times have degraded 3x in the last hour" --provider bedrock
# Carol's executive-focused investigation
USER_ID=Carol sre-agent --prompt "API response times have degraded 3x in the last hour" --provider bedrock
Both commands identify identical technical issues but present them differently:
- Alice receives detailed technical analysis with step-by-step troubleshooting and comprehensive tool references
- Carol receives executive summaries focused on business impact with rapid escalation timelines
For a detailed comparison showing how the memory system personalizes identical incidents, see: Memory System Report Comparison
Amazon Bedrock AgentCore Memory Architecture
The memory system uses Amazon Bedrock AgentCore Memory's event-based model with automatic namespace routing:
Memory Strategies and Namespaces
When the SRE Agent initializes, it creates three memory strategies with specific namespace patterns:
- User Preferences Strategy: Namespace pattern /sre/users/{user_id}/preferences
- Infrastructure Knowledge Strategy: Namespace pattern /sre/infrastructure/{user_id}/{session_id}
- Investigation Memory Strategy: Namespace pattern /sre/investigations/{user_id}/{session_id}
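The namespace patterns above can be read as plain string templates. The following is a minimal sketch, assuming a hypothetical helper named resolve_namespace (it is not part of the SRE Agent code or the AgentCore SDK). It shows how a user and session fill each pattern, and how leaving out the session segment yields a prefix that would cover all of a user's sessions:

```python
from typing import Optional

# Hypothetical illustration only; these names are not taken from the SRE Agent code.
NAMESPACE_PATTERNS = {
    "preferences": "/sre/users/{user_id}/preferences",
    "infrastructure": "/sre/infrastructure/{user_id}/{session_id}",
    "investigations": "/sre/investigations/{user_id}/{session_id}",
}


def resolve_namespace(
    memory_type: str, user_id: str, session_id: Optional[str] = None
) -> str:
    """Fill a namespace pattern; omit session_id to get a cross-session prefix."""
    pattern = NAMESPACE_PATTERNS[memory_type]
    if session_id is None:
        # Drop the session segment so the result covers every session.
        return pattern.replace("/{session_id}", "").format(user_id=user_id)
    # str.format ignores unused keyword arguments, so this also works
    # for the preferences pattern, which has no session segment.
    return pattern.format(user_id=user_id, session_id=session_id)


print(resolve_namespace("preferences", "Alice"))
print(resolve_namespace("investigations", "Alice", "investigation_2025_01_15"))
```

Note that the user_id is inserted unchanged, so "Alice" and "alice" would resolve to different namespaces.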
How Namespace Routing Works
The key insight is that the SRE Agent only needs to provide the actor_id when calling create_event(). Amazon Bedrock AgentCore Memory automatically:
- Strategy Matching: Examines all strategies associated with the memory resource
- Namespace Resolution: Determines which namespace(s) the event belongs to based on the actor_id
- Automatic Routing: Places the event in the correct strategy's namespace without requiring explicit namespace specification
- Multi-Strategy Storage: A single event can be stored in multiple strategies if the namespaces match
Actor ID Design for Memory Namespace Isolation
The memory system uses a consistent actor_id strategy to ensure proper namespace isolation:
- User preferences: Use user_id as actor_id (e.g., "Alice") for personal namespaces (/sre/users/Alice/preferences)
- Infrastructure knowledge: Use agent-specific actor_ids (e.g., "kubernetes-agent") for domain expertise namespaces
- Investigation summaries: Use user_id as actor_id for personal investigation history (/sre/investigations/Alice)
- Conversation memory: Use user_id to maintain personal conversation context
This design ensures that:
- User-specific data remains isolated to individual users
- Infrastructure knowledge is organized by the agent that discovered it
- Memory operations route to the correct namespaces automatically
- Cross-session memory retrieval works reliably
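The actor_id rules above amount to a small decision table. As a rough sketch only, with a hypothetical function name and memory-type labels chosen for illustration, the selection could look like this:

```python
from typing import Optional


def choose_actor_id(
    memory_type: str, user_id: str, agent_name: Optional[str] = None
) -> str:
    """Pick the actor_id for a memory type, following the rules listed above.

    Hypothetical helper: the function name and the memory_type labels are
    assumptions, not identifiers from the SRE Agent code.
    """
    if memory_type == "infrastructure":
        # Infrastructure knowledge is filed under the agent that discovered it.
        if not agent_name:
            raise ValueError("infrastructure knowledge requires an agent name")
        return agent_name
    if memory_type in {"preferences", "investigations", "conversation"}:
        # Personal data is filed under the user, with case preserved.
        return user_id
    raise ValueError(f"unknown memory type: {memory_type}")


print(choose_actor_id("preferences", "Alice"))
print(choose_actor_id("infrastructure", "Alice", agent_name="kubernetes-agent"))
```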
Event-based Model Benefits
- Immutable Events: All memory entries are stored as immutable events that cannot be modified
- Accumulative Learning: New events accumulate over time without deleting old ones
- Strategy Aggregation: Memory strategies aggregate events from their namespace to provide relevant context
- Automatic Organization: Events are automatically organized by user, session, and memory type
Example Event Flow
# SRE Agent calls create_event with just actor_id and content
memory_client.create_event(
    memory_id="sre_agent_memory-xyz",
    actor_id="Alice",  # Amazon Bedrock AgentCore Memory uses this to route to correct namespace
    session_id="investigation_2025_01_15",
    messages=[("preference_data", "ASSISTANT")]
)
# Amazon Bedrock AgentCore Memory automatically:
# 1. Checks all strategy namespaces for this memory
# 2. Matches actor_id "Alice" to namespace "/sre/users/Alice/preferences"
# 3. Stores event in User Preferences Strategy
# 4. Makes event available for future retrievals
Memory Strategies
These are the three long-term memory strategies supported by Amazon Bedrock AgentCore (see Memory Getting Started Guide):
1. User Preferences Memory
Strategy: Semantic Memory with 90-day retention
Purpose: Remember user-specific operational preferences
Captures:
- Escalation contacts and procedures
- Notification channels (Slack, email, etc.)
- Investigation workflow preferences
- Communication style preferences
Example Usage:
# When user mentions "escalate to ops-team@company.com for database issues"
# The system automatically captures:
{
  "user_id": "user123",
  "preference_type": "escalation",
  "preference_value": {
    "contact": "ops-team@company.com",
    "service_category": "database"
  },
  "context": "Investigation of Redis connection failures"
}
2. Infrastructure Knowledge Memory
Strategy: Semantic Memory with 30-day retention
Purpose: Build understanding of infrastructure patterns and relationships
Captures:
- Service dependencies and relationships
- Failure patterns and common issues
- Configuration insights and best practices
- Performance baselines and thresholds
Example Usage:
# When investigating a service outage, the system learns:
{
  "service_name": "web-api",
  "knowledge_type": "dependency",
  "knowledge_data": {
    "depends_on": "postgres-db",
    "failure_mode": "connection_timeout",
    "typical_recovery_time": "2-5 minutes"
  },
  "confidence": 0.8
}
3. Investigation Summaries Memory
Strategy: Summary Memory with 60-day retention
Purpose: Maintain history of investigations for learning and reference
Captures:
- Investigation timeline and actions taken
- Key findings and root causes
- Resolution strategies and outcomes
- Cross-team collaboration context
Example Usage:
{
  "incident_id": "incident_20250128_1045",
  "query": "Why is the checkout service responding slowly?",
  "timeline": [
    {"time": "10:45", "action": "Started investigation with metrics agent"},
    {"time": "10:47", "action": "Identified high CPU usage"},
    {"time": "10:50", "action": "Checked application logs for errors"}
  ],
  "actions_taken": [
    "Analyzed CPU and memory metrics",
    "Reviewed application error logs",
    "Identified memory leak in payment processing"
  ],
  "resolution_status": "completed",
  "key_findings": [
    "Memory leak in payment service consuming 2GB/hour",
    "Database connection pool exhausted during peak traffic",
    "Missing circuit breaker causing cascade failures"
  ]
}
Memory Flow During Investigation
┌─────────────┐    ┌─────────────────────┐              ┌──────────────────────┐
│    User     │    │     Supervisor      │              │    Amazon Bedrock    │
│             │    │        Agent        │              │   AgentCore Memory   │
└──────┬──────┘    └──────────┬──────────┘              └──────────┬───────────┘
       │                      │                                    │
       │ Investigation Query  │                                    │
       ├─────────────────────►│                                    │
       │                      │                                    │
       │    ┌─────────────────▼─────────────────┐                  │
       │    │ on_investigation_start()          │                  │
       │    │ (memory_hooks)                    │                  │
       │    └─────────────────┬─────────────────┘                  │
       │                      │                                    │
       │                      │ retrieve_memory(preferences)       │
       │                      ├───────────────────────────────────►│
       │                      │◄───────────────────────────────────┤
       │                      │ User preferences (10)              │
       │                      │                                    │
       │                      │ retrieve_memory(infrastructure)    │
       │                      ├───────────────────────────────────►│
       │                      │◄───────────────────────────────────┤
       │                      │ Infrastructure data (50)           │
       │                      │                                    │
       │                      │ retrieve_memory(investigations)    │
       │                      ├───────────────────────────────────►│
       │                      │◄───────────────────────────────────┤
       │                      │ Past investigations (5)            │
       │                      │                                    │
       │    ┌─────────────────▼─────────────────┐                  │
       │    │ Planning Agent with Memory Tools  │                  │
       │    │ (supervisor.py)                   │                  │
       │    └─────────────────┬─────────────────┘                  │
       │                      │                                    │
       │    ┌─────────────────▼─────────────────┐                  │
       │    │ Execute Investigation             │                  │
       │    └─────────────────┬─────────────────┘                  │
       │                      │                                    │
       │                      ├─► Metrics Agent                    │
       │                      ├─► Logs Agent                       │
       │                      ├─► K8s Agent                        │
       │                      ├─► Runbooks Agent                   │
       │                      │                                    │
       │    ┌─────────────────▼─────────────────┐                  │
       │    │ Agent Response Processing         │                  │
       │    │ (pattern extraction & storage)    │                  │
       │    └─────────────────┬─────────────────┘                  │
       │                      │                                    │
       │    ┌─────────────────▼─────────────────┐                  │
       │    │ on_investigation_complete()       │                  │
       │    │ (save investigation summary)      │                  │
       │    └─────────────────┬─────────────────┘                  │
       │                      │                                    │
       │    Final Response    │                                    │
       │◄─────────────────────┤                                    │
       │                      │                                    │
Key Memory Interactions
The memory system integrates at three key points during an investigation. The supervisor.py orchestrates memory retrieval at startup and saves investigation summaries at completion. Individual agent responses are processed by agent_nodes.py, which triggers pattern extraction through memory/hooks.py.
- Investigation Start: Retrieves user preferences, infrastructure knowledge, and past investigations to provide context
- Agent Responses: Automatically extracts patterns like escalation contacts, notification channels, and service dependencies
- Investigation Complete: Saves comprehensive summary with timeline, actions taken, and key findings
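The three integration points above can be sketched as a small hook object. The hook names on_investigation_start, on_agent_response, and on_investigation_complete appear in this document; everything else here (the method signatures, the InMemoryStore stand-in, and the trivial "depends on" check) is an assumption made so the example runs without AWS access:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class InMemoryStore:
    """Stand-in for the memory backend; NOT the AgentCore Memory client."""

    events: Dict[str, List[Any]] = field(default_factory=dict)

    def save(self, memory_type: str, item: Any) -> None:
        self.events.setdefault(memory_type, []).append(item)

    def retrieve(self, memory_type: str) -> List[Any]:
        return list(self.events.get(memory_type, []))


class MemoryHooksSketch:
    """Hypothetical hook object; the method signatures are assumptions."""

    MEMORY_TYPES = ("preferences", "infrastructure", "investigations")

    def __init__(self, store: InMemoryStore) -> None:
        self.store = store

    def on_investigation_start(self) -> Dict[str, List[Any]]:
        # 1. Load context for planning before any agent runs.
        return {t: self.store.retrieve(t) for t in self.MEMORY_TYPES}

    def on_agent_response(self, agent_name: str, response: str) -> None:
        # 2. Placeholder extraction; a real hook would parse patterns here.
        if "depends on" in response:
            self.store.save("infrastructure", {"agent": agent_name, "note": response})

    def on_investigation_complete(self, query: str, findings: List[str]) -> None:
        # 3. Persist a summary so later investigations can reuse it.
        self.store.save("investigations", {"query": query, "key_findings": findings})


hooks = MemoryHooksSketch(InMemoryStore())
hooks.on_agent_response("kubernetes-agent", "web-api depends on postgres-db")
hooks.on_investigation_complete("Why is checkout slow?", ["Connection pool exhausted"])
print(hooks.on_investigation_start())
```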
Memory Tool Architecture and Planning Integration
The memory system uses a centralized architecture where only the supervisor agent has direct access to memory tools:
Tool Distribution Architecture
- Supervisor Agent: Has access to all 4 memory tools (retrieve_memory, save_preference, save_infrastructure, save_investigation)
- Individual Agents: Have NO direct access to memory tools, only domain-specific tools:
  - Kubernetes Agent: 5 k8s-api tools (get_pod_status, get_deployment_status, etc.)
  - Application Logs Agent: 5 logs-api tools (search_logs, get_error_logs, etc.)
  - Performance Metrics Agent: 5 metrics-api tools (get_performance_metrics, analyze_trends, etc.)
  - Operational Runbooks Agent: 5 runbooks-api tools (search_runbooks, get_incident_playbook, etc.)
Centralized Memory Management
This design ensures:
- Memory operations are coordinated through the supervisor
- Individual agents focus on their domain expertise without memory complexity
- Memory context is retrieved once and distributed to agents as needed
- Consistent memory patterns across all investigations
Available Memory Tools (Supervisor Only)
- save_preference: Saves user preferences to long-term memory
- save_infrastructure: Saves infrastructure knowledge to long-term memory
- save_investigation: Saves investigation summaries to long-term memory
- retrieve_memory: Retrieves relevant information from long-term memory
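The tool distribution described above can be pictured as a lookup table. This is only an illustrative sketch: the tool names are the ones listed in this document (two of the five per agent), while the agent keys metrics_agent and runbooks_agent, the "supervisor" key, and the tools_for helper are assumptions:

```python
from typing import Dict, List

# The four memory tools, available to the supervisor only.
MEMORY_TOOLS: List[str] = [
    "retrieve_memory",
    "save_preference",
    "save_infrastructure",
    "save_investigation",
]

# Only the tool names this document mentions; each agent has five in total.
DOMAIN_TOOLS: Dict[str, List[str]] = {
    "kubernetes_agent": ["get_pod_status", "get_deployment_status"],
    "logs_agent": ["search_logs", "get_error_logs"],
    "metrics_agent": ["get_performance_metrics", "analyze_trends"],  # key name assumed
    "runbooks_agent": ["search_runbooks", "get_incident_playbook"],  # key name assumed
}


def tools_for(agent_name: str) -> List[str]:
    """Return the tools an agent may call; memory tools go to the supervisor only."""
    if agent_name == "supervisor":
        return list(MEMORY_TOOLS)
    return list(DOMAIN_TOOLS[agent_name])


for name in ["supervisor", *DOMAIN_TOOLS]:
    print(name, tools_for(name))
```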
Memory Context in Planning
When creating investigation plans, the supervisor agent incorporates memory context from three sources. The planning agent uses the retrieve_memory tool to gather relevant context before creating plans.
Planning Agent Memory Usage Example
Here's a real example from agent.log showing how the planning agent retrieves and uses memory context:
# Memory context retrieval during planning (from agent.log)
2025-08-03 17:48:56,072,p1290668,{supervisor.py:339},INFO,Retrieved memory context for planning: 10 preferences, 50 knowledge items from 1 agents, 5 past investigations
# Planning agent tool calls to gather context
2025-08-03 17:49:01,067,p1290668,{tools.py:317},INFO,retrieve_memory called: type=preference, query='user settings communication escalation notification', actor_id=Alice -> Alice, max_results=5
2025-08-03 17:49:01,067,p1290668,{client.py:236},INFO,Retrieving preferences memories: actor_id=Alice, namespace=/sre/users/Alice/preferences, query='user settings communication escalation notification'
This shows the planning agent:
- Retrieved 10 user preferences from Alice's preference namespace
- Retrieved 50 infrastructure knowledge items from accumulated agent investigations
- Retrieved 5 past investigations for similar query patterns
- Used retrieve_memory tool with structured queries to gather context before planning
Enhanced Planning Prompt with Memory Context
The planning prompt now uses XML structure for better Claude interaction:
<memory_retrieval>
CRITICAL: Before creating the investigation plan, you MUST use the retrieve_memory tool to gather relevant context:
1. Use retrieve_memory("preference", "user settings communication escalation notification", "{user_id}", 5)
2. Use retrieve_memory("infrastructure", "[relevant service terms from query]", "sre-agent", 10, null)
3. Use retrieve_memory("investigation", "[key terms from user query]", "{user_id}", 5, null)
</memory_retrieval>
<planning_guidelines>
After gathering memory context, create a simple, focused investigation plan with 2-3 steps maximum.
Consider user preferences and past investigation patterns from memory.
</planning_guidelines>
<response_format>
MANDATORY: Your response MUST be ONLY valid JSON that matches this exact structure:
{
  "steps": ["Step 1 description", "Step 2 description"],
  "agents_sequence": ["kubernetes_agent", "logs_agent"],
  "complexity": "simple",
  "auto_execute": true,
  "reasoning": "Brief explanation based on retrieved memory context"
}
</response_format>
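Because the prompt demands a JSON-only response with a fixed structure, the caller can check the response before acting on it. A minimal sketch of such a check follows, assuming a hypothetical parse_plan helper; the field names and types are read off the structure in the prompt above, and this is not the validation code the SRE Agent actually uses:

```python
import json
from typing import Any, Dict

# Field names and types taken from the response structure in the prompt.
REQUIRED_FIELDS = {
    "steps": list,
    "agents_sequence": list,
    "complexity": str,
    "auto_execute": bool,
    "reasoning": str,
}


def parse_plan(raw: str) -> Dict[str, Any]:
    """Parse a planning response and verify the expected fields and types."""
    plan = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON text
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(plan.get(name), expected):
            raise ValueError(f"field {name!r} is missing or not a {expected.__name__}")
    return plan


raw_response = """
{
  "steps": ["Check pod status", "Review recent error logs"],
  "agents_sequence": ["kubernetes_agent", "logs_agent"],
  "complexity": "simple",
  "auto_execute": true,
  "reasoning": "Brief explanation based on retrieved memory context"
}
"""
print(parse_plan(raw_response)["agents_sequence"])
```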
Memory-Informed Planning Example
# Enhanced planning prompt includes:
"""
User's query: list kubernetes pods
Retrieved Memory Context:
- User Preferences (10 items): Auto-approval for simple Kubernetes plans, technical detail preference
- Infrastructure Knowledge (50 items): Production namespace layout, pod dependency patterns
- Past Investigations (5 items): Previous successful pod listing investigations
Create an investigation plan considering this context...
"""
The planning agent then creates plans like:
{
  "steps": [
    "Use Kubernetes agent to retrieve current pod status across all namespaces",
    "Analyze pod health and resource utilization",
    "Provide structured technical report with pod details"
  ],
  "agents_sequence": ["kubernetes_agent"],
  "complexity": "simple",
  "auto_execute": true,
  "reasoning": "Based on user preferences for auto-approval of simple Kubernetes plans and past successful investigations, this is a straightforward pod listing task requiring only the Kubernetes agent"
}
Memory Capture and Pattern Recognition
The SRE Agent automatically captures information during investigations by recognizing patterns in agent responses and converting them into structured data:
How Memory Capture Works
The SRE agent code (specifically sre_agent/memory/hooks.py) uses regex patterns to parse agent responses and extract structured information:
- Response Analysis: After each agent response, the system scans the text for specific patterns
- Pattern Matching: Uses regex to identify key information types
- Data Structuring: Converts matched patterns into structured Pydantic models
- Memory Storage: Calls Amazon Bedrock AgentCore Memory's create_event() API to store the structured data
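The first three steps above can be illustrated with a single pattern. This is a sketch under stated assumptions: the regex below is invented for illustration (the actual patterns in hooks.py are not shown in this document), a standard-library dataclass stands in for the Pydantic model, and the final create_event() storage step is left out:

```python
import re
from dataclasses import dataclass
from typing import List

# Illustrative pattern only; not the pattern used in hooks.py.
ESCALATION_RE = re.compile(
    r"escalate to\s+(?P<contact>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)", re.IGNORECASE
)


@dataclass
class PreferenceSketch:
    """Dataclass stand-in for a structured preference model (fields assumed)."""

    user_id: str
    preference_type: str
    preference_value: dict
    context: str


def extract_preferences(user_id: str, response: str) -> List[PreferenceSketch]:
    """Scan a response for escalation contacts and structure each match."""
    return [
        PreferenceSketch(
            user_id=user_id,
            preference_type="escalation",
            preference_value={"contact": match.group("contact")},
            context=response[:100],  # keep a short excerpt as context
        )
        for match in ESCALATION_RE.finditer(response)
    ]


found = extract_preferences(
    "Alice", "Please escalate to ops-team@company.com for database issues."
)
print(found)
```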
SRE Agent Pattern Recognition
Every individual agent response triggers automatic memory pattern extraction through the on_agent_response() hook. This ensures that valuable information discovered during domain-specific investigations is captured and made available for future use.
Infrastructure Knowledge Extraction via Agent JSON Responses
The system uses an agent-based approach for infrastructure knowledge extraction. Each agent is instructed to include infrastructure knowledge in its responses using a structured JSON format:
Agent Response Format
{
  "infrastructure_knowledge": [
    {
      "service_name": "web-app-deployment",
      "knowledge_type": "baseline",
      "knowledge_data": {
        "cpu_usage_normal": "75%",
        "memory_usage_normal": "85%",
        "typical_pods": 1,
        "node_distribution": "node-1"
      },
      "confidence": 0.9,
      "context": "Pod status analysis revealed normal resource usage patterns"
    }
  ]
}
Knowledge Types Captured
- dependency: Service relationships and dependencies
- pattern: Recurring infrastructure patterns and behaviors
- config: Configuration insights and settings
- baseline: Performance baselines and normal operating ranges
Automatic Extraction Process
- Agent Response Processing: Each agent response is scanned for JSON blocks containing infrastructure_knowledge
- JSON Parsing: The system extracts and validates the JSON structure
- Knowledge Storage: Valid knowledge items are stored in the infrastructure memory namespace
- Cross-Session Availability: Knowledge becomes available for future investigations across all sessions
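One way the scan-and-validate steps could work is shown below. This is a sketch under assumptions, not the project's implementation: the function names are hypothetical, and the validation rules (required keys, the four knowledge types listed above) are inferred from the response format shown earlier.

```python
import json

VALID_KNOWLEDGE_TYPES = {"dependency", "pattern", "config", "baseline"}


def _is_valid(item: object) -> bool:
    """Accept only items with the keys shown in the agent response format."""
    return (
        isinstance(item, dict)
        and isinstance(item.get("service_name"), str)
        and item.get("knowledge_type") in VALID_KNOWLEDGE_TYPES
        and isinstance(item.get("knowledge_data"), dict)
    )


def extract_infrastructure_knowledge(response_text: str) -> list[dict]:
    """Find JSON objects embedded in free text and collect valid knowledge items."""
    decoder = json.JSONDecoder()
    items: list[dict] = []
    pos = response_text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(response_text, pos)
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            pos = response_text.find("{", pos + 1)
            continue
        raw = obj.get("infrastructure_knowledge") if isinstance(obj, dict) else None
        if isinstance(raw, list):
            items.extend(item for item in raw if _is_valid(item))
        # Skip past the object just parsed so nested braces are not rescanned.
        pos = response_text.find("{", end)
    return items
```

Items with an unknown `knowledge_type` or a missing `service_name` are dropped rather than raising, so one malformed entry does not discard the rest of the response.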
Enhanced Agent Response Processing and Logging
Comprehensive Response Logging
The system provides detailed logging of agent responses and memory operations:
# From agent.log - Message breakdown logging
2025-08-03 17:45:30,397,p1289365,{agent_nodes.py:347},INFO,Kubernetes Infrastructure Agent - Message breakdown: 1 USER, 1 ASSISTANT, 1 TOOL messages
2025-08-03 17:45:30,397,p1289365,{agent_nodes.py:349},INFO,Kubernetes Infrastructure Agent - Tools called: k8s-api___get_pod_status
# Memory pattern extraction logging
2025-08-03 17:45:30,398,p1289365,{hooks.py:193},INFO,on_agent_response called for agent: Kubernetes Infrastructure Agent, user_id: Alice
2025-08-03 17:45:30,399,p1289365,{hooks.py:383},INFO,Extracted 5 infrastructure knowledge items from agent response
Infrastructure Knowledge Validation
The system includes validation and error handling for infrastructure knowledge extraction:
# Successful extraction logging
2025-08-03 17:45:30,401,p1289365,{hooks.py:387},INFO,Saved infrastructure knowledge: web-app-deployment (baseline) with confidence 0.9
2025-08-03 17:45:30,402,p1289365,{hooks.py:387},INFO,Saved infrastructure knowledge: database-pod (pattern) with confidence 0.8
Automatic Conversation Memory Storage
All agent interactions are automatically stored in conversation memory with message type breakdown:
# Conversation storage with tool tracking
2025-08-03 17:45:30,397,p1289365,{agent_nodes.py:347},INFO,Kubernetes Infrastructure Agent - Message breakdown: 1 USER, 1 ASSISTANT, 1 TOOL messages
2025-08-03 17:45:30,397,p1289365,{agent_nodes.py:349},INFO,Kubernetes Infrastructure Agent - Tools called: k8s-api___get_pod_status
2025-08-03 17:45:30,530,p1289365,{agent_nodes.py:375},INFO,Kubernetes Infrastructure Agent: Successfully stored conversation in memory
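The message breakdown in these log lines can be produced with a simple count per role. The sketch below assumes each message is a dict with a `role` key and, for tool messages, a `tool_name` key; both key names are hypothetical and the agent's actual message objects may look different.

```python
from collections import Counter

ROLE_ORDER = ("USER", "ASSISTANT", "TOOL")


def summarize_messages(messages: list[dict]) -> tuple[str, list[str]]:
    """Return a log-style breakdown string and the names of the tools called."""
    counts = Counter(message["role"] for message in messages)
    breakdown = ", ".join(f"{counts[role]} {role}" for role in ROLE_ORDER)
    tools = [m["tool_name"] for m in messages if m["role"] == "TOOL"]
    return f"Message breakdown: {breakdown} messages", tools
```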
Cross-Session Memory Access
The system provides cross-session memory retrieval for better investigation context:
# Cross-session infrastructure knowledge retrieval
2025-08-03 17:45:30,140,p1289365,{hooks.py:71},INFO,Retrieved infrastructure knowledge for user 'Alice' from 1 different sources: Alice: 50 memories
2025-08-03 17:45:30,140,p1289365,{client.py:245},INFO,Retrieved 50 infrastructure memories for Alice
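The difference between cross-session and session-specific retrieval can be illustrated with a plain filter. This is a conceptual sketch only: `MemoryRecord` and `select_memories` are hypothetical names, and the real system queries Amazon Bedrock AgentCore Memory namespaces instead of filtering an in-memory list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryRecord:
    """Hypothetical record shape used only for this illustration."""

    actor_id: str
    session_id: str
    content: str


def select_memories(
    records: list[MemoryRecord], actor_id: str, session_id: Optional[str] = None
) -> list[MemoryRecord]:
    """Return one user's memories; omit session_id to search across all sessions."""
    return [
        record
        for record in records
        if record.actor_id == actor_id  # case-sensitive: "Alice" is not "alice"
        and (session_id is None or record.session_id == session_id)
    ]
```

Filtering on `actor_id` first keeps one user's memories isolated from another's, and leaving `session_id` unset widens the search to every session for that user.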
Memory Capture Methods
- Supervisor tool calls: retrieve_memory called during planning; save_investigation called via planning agent
- Automatic pattern extraction: Agent responses are processed by the on_agent_response() hook to extract:
  - User preferences (escalation emails, Slack channels)
  - Infrastructure knowledge (service dependencies, baselines)
  - Calls _save_* functions directly (not tool calls)
- Manual configuration: User preferences loaded via manage_memories.py update
- Conversation storage: All agent responses and tool calls stored as conversation memory
Memory Storage Process
- Pattern Detection: SRE agent code identifies relevant information in responses
- Data Conversion: Creates structured objects (UserPreference, InfrastructureKnowledge, etc.)
- Event Creation: Calls create_event() with actor_id and structured data
- Namespace Routing: Amazon Bedrock AgentCore Memory automatically routes the event to the correct namespace based on strategy configuration
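The conversion and event-creation steps might look roughly like this. Everything here is an assumption made for illustration: `RecordingMemoryClient` is a test double, not the AgentCore SDK, its `create_event()` keyword arguments are a guess at a plausible shape, and the dataclass stands in for the project's Pydantic model.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InfrastructureKnowledge:
    """Stand-in for the project's model; fields follow the JSON format above."""

    service_name: str
    knowledge_type: str
    knowledge_data: dict = field(default_factory=dict)
    confidence: float = 0.0


class RecordingMemoryClient:
    """Test double that records calls; the signature is hypothetical."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def create_event(self, *, memory_id, actor_id, session_id, messages) -> None:
        self.events.append(
            {
                "memory_id": memory_id,
                "actor_id": actor_id,
                "session_id": session_id,
                "messages": messages,
            }
        )


def save_infrastructure_knowledge(
    client, memory_id: str, actor_id: str, session_id: str,
    knowledge: InfrastructureKnowledge,
) -> None:
    """Serialize the structured object and hand it to the memory client."""
    payload = json.dumps(asdict(knowledge), sort_keys=True)
    client.create_event(
        memory_id=memory_id,
        actor_id=actor_id,
        session_id=session_id,
        messages=[(payload, "ASSISTANT")],
    )
```

Note that the caller passes no namespace: as described above, routing is left to the memory service's strategy configuration.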
Agent Memory Integration
The memory system integrates with each of the existing SRE agents, which both contribute knowledge to memory and draw on it:
Kubernetes Agent
- Captures: Service dependencies, deployment patterns, resource baselines
- Uses: Past deployment issues, known resource requirements
- Example Knowledge Captured:
{ "service_name": "web-app-deployment", "knowledge_type": "baseline", "knowledge_data": { "cpu_usage_normal": "75%", "memory_usage_normal": "85%", "typical_pods": 1 } }
Logs Agent
- Captures: Common error patterns, log query preferences, resolution strategies
- Uses: Similar error patterns, effective log queries from past investigations
- Example Knowledge Captured:
{ "service_name": "payment-service", "knowledge_type": "pattern", "knowledge_data": { "common_errors": ["connection timeout", "memory leak"], "effective_queries": ["error AND payment AND timeout"] } }
Metrics Agent
- Captures: Performance baselines, alert thresholds, metric correlations
- Uses: Historical baselines, known performance patterns
- Example Knowledge Captured:
{ "service_name": "api-gateway", "knowledge_type": "baseline", "knowledge_data": { "normal_response_time": "200ms", "peak_traffic_hours": "14:00-17:00 UTC" } }
Runbooks Agent
- Captures: Successful resolution procedures, team escalation paths
- Uses: Proven resolution strategies, appropriate runbook recommendations
- Example Knowledge Captured:
{ "service_name": "database", "knowledge_type": "dependency", "knowledge_data": { "escalation_team": "database-team@company.com", "recovery_runbook": "DB-001" } }
Manual Memory Management
Memory management is handled through the manage_memories.py script:
Viewing Memories
# List all memory types
uv run python scripts/manage_memories.py list
# List specific memory type
uv run python scripts/manage_memories.py list --memory-type preferences
# List memories for specific user
uv run python scripts/manage_memories.py list --memory-type preferences --actor-id Alice
Managing User Preferences
# Load user preferences from YAML configuration
uv run python scripts/manage_memories.py update
# Load from custom configuration file
uv run python scripts/manage_memories.py update --config-file custom_users.yaml
Benefits
- Personalized Investigations: Tailors reports and communication to individual user preferences and roles
- Faster Resolution: Leverages historical context and past investigation knowledge
- Knowledge Preservation: Automatically captures and shares tribal knowledge across team changes
- Pattern Recognition: Identifies recurring issues and optimizes escalation routing
- Reduced MTTR: Accelerates problem resolution through accumulated institutional knowledge
Privacy and Data Management
Data Retention
- User preferences: 90 days (configurable)
- Infrastructure knowledge: 30 days (configurable)
- Investigation summaries: 60 days (configurable)
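To make the retention windows concrete, the sketch below checks whether a memory of a given type would fall outside its window. The dictionary keys and the `is_expired` helper are hypothetical; in the real system, expiry is governed by the memory resource's configuration, not by client-side code like this.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Default retention windows from the list above; key names are assumed.
RETENTION_DAYS = {
    "preferences": 90,
    "infrastructure": 30,
    "investigations": 60,
}


def is_expired(
    memory_type: str, created_at: datetime, now: Optional[datetime] = None
) -> bool:
    """Return True when a memory is older than its type's retention window."""
    now = now or datetime.now(timezone.utc)
    return now - created_at > timedelta(days=RETENTION_DAYS[memory_type])
```

For example, a memory created 31 days ago is past the infrastructure window but still within the preference and investigation windows.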
Setting Up Memory System
Initial Setup
The memory system is automatically initialized during the setup process:
# Initialize memory system and load user preferences (included in setup instructions)
uv run python scripts/manage_memories.py update
This command:
- Creates a new memory resource if none exists
- Configures the three memory strategies
- Loads user preferences from scripts/user_config.yaml
- Stores the memory ID in .memory_id for future use
Adding User Preferences
To add new users or modify existing preferences:
- Edit scripts/user_config.yaml to add new user configurations
- Run the update command to load new preferences:
uv run python scripts/manage_memories.py update
Managing Memories
# List all memory types
uv run python scripts/manage_memories.py list
# List specific memory type
uv run python scripts/manage_memories.py list --memory-type preferences
# List preferences for specific user
uv run python scripts/manage_memories.py list --memory-type preferences --actor-id Alice