iSharkFly-Docs/amazon-bedrock-agentcore-samples

mirror of https://github.com/awslabs/amazon-bedrock-agentcore-samples.git synced 2025-09-08 20:50:46 +00:00

2025-08-06 17:49:56 -04:00


SRE Agent Memory System

Overview

The SRE Agent includes a long-term memory system built on Amazon Bedrock AgentCore Memory. It persists user preferences across sessions, learns from past investigations, and tailors reports to each user's role and workflow.

The system provides three distinct memory strategies for different types of information and comes pre-configured with user personas to demonstrate personalized investigations.

Pre-configured User Personas

The system comes with two example user personas in scripts/user_config.yaml that demonstrate how personalized investigations work:

Alice - Technical SRE Engineer

Investigation Style: Detailed, systematic, multi-dimensional investigations with comprehensive analysis
Communication: Technical team channels (#alice-alerts, #sre-team) with detailed metrics and troubleshooting steps
Escalation: Technical management (alice.manager@company.com) with 15-minute delay threshold
Reports: Technical exposition with step-by-step methodologies and complete tool references
Preferences: Detailed analysis, UTC timezone, includes troubleshooting steps

Carol - Executive/Director

Investigation Style: Executive-focused with business impact analysis and streamlined presentation
Communication: Strategic channels (#carol-executive, #strategic-alerts) with filtered notifications (critical only)
Escalation: Executive team (carol.director@company.com) with a 20-minute timeline
Reports: Business-focused summaries without detailed technical steps, emphasizing impact and business consequences
Preferences: Executive summary format, EST timezone, business impact focus

Personalized Investigation Examples

When running investigations with different user IDs, the agent produces similar technical findings but presents them according to each user's preferences:

# Alice's detailed technical investigation
USER_ID=Alice sre-agent --prompt "API response times have degraded 3x in the last hour" --provider bedrock

# Carol's executive-focused investigation  
USER_ID=Carol sre-agent --prompt "API response times have degraded 3x in the last hour" --provider bedrock

Both commands identify identical technical issues but present them differently:

Alice receives detailed technical analysis with step-by-step troubleshooting and comprehensive tool references
Carol receives executive summaries focused on business impact and escalation timelines

For a detailed comparison showing how the memory system personalizes identical incidents, see: Memory System Report Comparison

Amazon Bedrock AgentCore Memory Architecture

The memory system uses Amazon Bedrock AgentCore Memory's event-based model with automatic namespace routing:

Memory Strategies and Namespaces

When the SRE Agent initializes, it creates three memory strategies with specific namespace patterns:

User Preferences Strategy: Namespace pattern /sre/users/{user_id}/preferences
Infrastructure Knowledge Strategy: Namespace pattern /sre/infrastructure/{user_id}/{session_id}
Investigation Memory Strategy: Namespace pattern /sre/investigations/{user_id}/{session_id}
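
As a rough illustration of how these patterns expand into concrete namespaces, the sketch below fills in the three templates with plain string formatting. The `resolve_namespace` helper and the dictionary keys are hypothetical names chosen for this example; they are not taken from the SRE Agent code, and the real expansion happens inside AgentCore Memory.

```python
# Namespace templates copied from the list above.
NAMESPACE_TEMPLATES = {
    "preferences": "/sre/users/{user_id}/preferences",
    "infrastructure": "/sre/infrastructure/{user_id}/{session_id}",
    "investigations": "/sre/investigations/{user_id}/{session_id}",
}


def resolve_namespace(memory_type: str, user_id: str, session_id: str = "") -> str:
    """Fill in a namespace template (hypothetical helper, for illustration only)."""
    template = NAMESPACE_TEMPLATES[memory_type]
    # str.format ignores unused keyword arguments, so the preferences
    # template simply does not use session_id.
    return template.format(user_id=user_id, session_id=session_id)


print(resolve_namespace("preferences", "Alice"))
print(resolve_namespace("investigations", "Alice", "investigation_2025_01_15"))
```

Note that the user ID is inserted without changing its case, so `Alice` and `alice` would resolve to different namespaces.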

How Namespace Routing Works

The key insight is that the SRE Agent only needs to provide the actor_id when calling create_event(). Amazon Bedrock AgentCore Memory automatically:

Strategy Matching: Examines all strategies associated with the memory resource
Namespace Resolution: Determines which namespace(s) the event belongs to based on the actor_id
Automatic Routing: Places the event in the correct strategy's namespace without requiring explicit namespace specification
Multi-Strategy Storage: A single event can be stored in multiple strategies if the namespaces match

Actor ID Design for Memory Namespace Isolation

The memory system uses a consistent actor_id strategy to ensure proper namespace isolation:

User preferences: Use user_id as actor_id (e.g., "Alice") for personal namespaces (/sre/users/Alice/preferences)
Infrastructure knowledge: Use agent-specific actor_ids (e.g., "kubernetes-agent") for domain expertise namespaces
Investigation summaries: Use user_id as actor_id for personal investigation history (/sre/investigations/Alice)
Conversation memory: Use user_id to maintain personal conversation context

This design ensures that:

User-specific data remains isolated to individual users
Infrastructure knowledge is organized by the agent that discovered it
Memory operations route to the correct namespaces automatically
Cross-session memory retrieval works reliably
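
The actor_id rules above can be summarized in a small selection function. This is a minimal sketch: `select_actor_id` and the memory type labels are assumptions made for the example, not identifiers from the SRE Agent code.

```python
# Memory types whose events are keyed by the user (per the list above).
USER_SCOPED_TYPES = {"preferences", "investigations", "conversation"}


def select_actor_id(memory_type: str, user_id: str, agent_name: str) -> str:
    """Pick the actor_id for an event (hypothetical helper, for illustration only)."""
    if memory_type in USER_SCOPED_TYPES:
        # Personal data is stored under the user's own ID, case preserved.
        return user_id
    if memory_type == "infrastructure":
        # Infrastructure knowledge is organized by the agent that discovered it.
        return agent_name
    raise ValueError(f"unknown memory type: {memory_type}")


print(select_actor_id("preferences", "Alice", "kubernetes-agent"))
print(select_actor_id("infrastructure", "Alice", "kubernetes-agent"))
```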

Event-based Model Benefits

Immutable Events: All memory entries are stored as immutable events that cannot be modified
Accumulative Learning: New events accumulate over time without deleting old ones
Strategy Aggregation: Memory strategies aggregate events from their namespace to provide relevant context
Automatic Organization: Events are automatically organized by user, session, and memory type
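
To make the immutable, append-only idea concrete, here is a toy local model. It is only an analogy: `MemoryEvent` and `EventLog` are hypothetical classes written for this example, and the actual storage and aggregation are handled by the AgentCore Memory service.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import List


@dataclass(frozen=True)
class MemoryEvent:
    """An event that cannot be modified after creation."""
    namespace: str
    content: str


class EventLog:
    """Append-only log: events accumulate and are never edited or removed."""

    def __init__(self) -> None:
        self._events: List[MemoryEvent] = []

    def append(self, event: MemoryEvent) -> None:
        self._events.append(event)

    def in_namespace(self, prefix: str) -> List[MemoryEvent]:
        # Aggregate every event whose namespace starts with the given prefix.
        return [e for e in self._events if e.namespace.startswith(prefix)]


log = EventLog()
first = MemoryEvent("/sre/users/Alice/preferences", "escalate to ops-team@company.com")
log.append(first)
log.append(MemoryEvent("/sre/users/Alice/preferences", "notify #sre-team"))

try:
    first.content = "changed"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("events are immutable")

print(len(log.in_namespace("/sre/users/Alice")))
```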

Example Event Flow

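from bedrock_agentcore.memory import MemoryClient

# Region is an assumption; use the region where the memory resource was created
memory_client = MemoryClient(region_name="us-east-1")

# SRE Agent calls create_event with just actor_id and content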
memory_client.create_event(
    memory_id="sre_agent_memory-xyz",
    actor_id="Alice",  # Amazon Bedrock AgentCore Memory uses this to route to correct namespace
    session_id="investigation_2025_01_15",
    messages=[("preference_data", "ASSISTANT")]
)

# Amazon Bedrock AgentCore Memory automatically:
# 1. Checks all strategy namespaces for this memory
# 2. Matches actor_id "Alice" to namespace "/sre/users/Alice/preferences" 
# 3. Stores event in User Preferences Strategy
# 4. Makes event available for future retrievals

Memory Strategies

These are the three long-term memory strategies supported by Amazon Bedrock AgentCore (see Memory Getting Started Guide):

1. User Preferences Memory

Strategy: Semantic Memory with 90-day retention
Purpose: Remember user-specific operational preferences

Captures:

Escalation contacts and procedures
Notification channels (Slack, email, etc.)
Investigation workflow preferences
Communication style preferences

Example Usage:

# When user mentions "escalate to ops-team@company.com for database issues"
# The system automatically captures:
{
  "user_id": "user123",
  "preference_type": "escalation",
  "preference_value": {
    "contact": "ops-team@company.com",
    "service_category": "database"
  },
  "context": "Investigation of Redis connection failures"
}
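
One way to represent the captured record in code is sketched below. It uses a standard-library dataclass so the example runs without extra dependencies. The field names mirror the JSON above, while the `UserPreference` class, the `ALLOWED_TYPES` set, and the `to_payload` method are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict

# Assumed set of preference types, derived from the "Captures" list above.
ALLOWED_TYPES = {"escalation", "notification", "workflow", "style"}


@dataclass
class UserPreference:
    """Structured form of a captured preference (illustrative model)."""
    user_id: str
    preference_type: str
    preference_value: Dict[str, Any]
    context: str = ""

    def __post_init__(self) -> None:
        if self.preference_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported preference type: {self.preference_type}")

    def to_payload(self) -> str:
        # Serialize to JSON text, suitable as the content of a memory event.
        return json.dumps(asdict(self))


pref = UserPreference(
    user_id="user123",
    preference_type="escalation",
    preference_value={"contact": "ops-team@company.com", "service_category": "database"},
    context="Investigation of Redis connection failures",
)
print(pref.to_payload())
```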

2. Infrastructure Knowledge Memory

Strategy: Semantic Memory with 30-day retention
Purpose: Build understanding of infrastructure patterns and relationships

Captures:

Service dependencies and relationships
Failure patterns and common issues
Configuration insights and best practices
Performance baselines and thresholds

Example Usage:

# When investigating a service outage, the system learns:
{
  "service_name": "web-api",
  "knowledge_type": "dependency",
  "knowledge_data": {
    "depends_on": "postgres-db",
    "failure_mode": "connection_timeout",
    "typical_recovery_time": "2-5 minutes"
  },
  "confidence": 0.8
}

3. Investigation Summaries Memory

Strategy: Summary Memory with 60-day retention
Purpose: Maintain history of investigations for learning and reference

Captures:

Investigation timeline and actions taken
Key findings and root causes
Resolution strategies and outcomes
Cross-team collaboration context

Example Usage:

{
  "incident_id": "incident_20250128_1045",
  "query": "Why is the checkout service responding slowly?",
  "timeline": [
    {"time": "10:45", "action": "Started investigation with metrics agent"},
    {"time": "10:47", "action": "Identified high CPU usage"},
    {"time": "10:50", "action": "Checked application logs for errors"}
  ],
  "actions_taken": [
    "Analyzed CPU and memory metrics",
    "Reviewed application error logs",  
    "Identified memory leak in payment processing"
  ],
  "resolution_status": "completed",
  "key_findings": [
    "Memory leak in payment service consuming 2GB/hour",
    "Database connection pool exhausted during peak traffic",
    "Missing circuit breaker causing cascade failures"
  ]
}

Memory Flow During Investigation

┌─────────────┐    ┌─────────────────────┐              ┌──────────────────────┐
│    User     │    │     Supervisor      │              │  Amazon Bedrock      │
│             │    │      Agent          │              │  AgentCore Memory    │
└──────┬──────┘    └──────────┬──────────┘              └──────────┬───────────┘
       │                      │                                    │
       │ Investigation Query  │                                    │
       ├─────────────────────►│                                    │
       │                      │                                    │
       │              ┌───────▼───────┐                            │
       │              │ on_investigation_start()                   │
       │              │ (memory_hooks) │                           │
       │              └───────┬───────┘                            │
       │                      │                                    │
       │                      │ retrieve_memory(preferences)       │
       │                      ├───────────────────────────────────►│
       │                      │◄───────────────────────────────────┤
       │                      │ User preferences (10)              │
       │                      │                                    │
       │                      │ retrieve_memory(infrastructure)    │
       │                      ├───────────────────────────────────►│
       │                      │◄───────────────────────────────────┤
       │                      │ Infrastructure data (50)           │
       │                      │                                    │
       │                      │ retrieve_memory(investigations)    │
       │                      ├───────────────────────────────────►│
       │                      │◄───────────────────────────────────┤
       │                      │ Past investigations (5)            │
       │                      │                                    │
       │              ┌───────▼───────┐                            │
       │              │ Planning Agent with Memory Tools           │
       │              │ (supervisor.py)                            │
       │              └───────┬───────┘                            │
       │                      │                                    │
       │              ┌───────▼───────┐                            │
       │              │ Execute Investigation                      │
       │              └───────┬───────┘                            │
       │                      │                                    │
       │                      ├─► Metrics Agent                    │
       │                      ├─► Logs Agent                       │
       │                      ├─► K8s Agent                        │
       │                      ├─► Runbooks Agent                   │
       │                      │                                    │
       │              ┌───────▼───────┐                            │
       │              │ Agent Response Processing                  │
       │              │ (pattern extraction & storage)             │
       │              └───────┬───────┘                            │
       │                      │                                    │
       │              ┌───────▼───────┐                            │
       │              │ on_investigation_complete()                │
       │              │ (save investigation summary)               │
       │              └───────┬───────┘                            │
       │                      │                                    │
       │ Final Response       │                                    │
       │◄─────────────────────┤                                    │
       │                      │                                    │

Key Memory Interactions

The memory system integrates at three key points during an investigation. supervisor.py orchestrates memory retrieval at startup and saves investigation summaries at completion. Individual agent responses are processed by agent_nodes.py, which triggers pattern extraction through memory/hooks.py.

Investigation Start: Retrieves user preferences, infrastructure knowledge, and past investigations to provide context
Agent Responses: Automatically extracts patterns like escalation contacts, notification channels, and service dependencies
Investigation Complete: Saves comprehensive summary with timeline, actions taken, and key findings
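
The three integration points can be pictured as a small hook object. In the sketch below, the names `on_investigation_start` and `on_investigation_complete` come from the flow diagram; `on_agent_response`, every method signature, the `InMemoryStore` stand-in, and the simple "depends on" check are assumptions made for illustration and do not reflect the real memory/hooks.py.

```python
from typing import Any, Dict, List, Optional


class InMemoryStore:
    """Local stand-in for the memory backend (assumption for this example)."""

    def __init__(self) -> None:
        self.saved: List[Dict[str, Any]] = []

    def save(self, memory_type: str, actor_id: str, data: Dict[str, Any]) -> None:
        self.saved.append({"type": memory_type, "actor_id": actor_id, "data": data})

    def retrieve(self, memory_type: str, actor_id: Optional[str] = None) -> List[Dict[str, Any]]:
        # With no actor_id, return every entry of the given type.
        return [
            e for e in self.saved
            if e["type"] == memory_type and (actor_id is None or e["actor_id"] == actor_id)
        ]


class MemoryHooks:
    def __init__(self, store: InMemoryStore) -> None:
        self.store = store

    def on_investigation_start(self, user_id: str) -> Dict[str, List[Dict[str, Any]]]:
        # Gather context before planning: user-scoped and shared knowledge.
        return {
            "preferences": self.store.retrieve("preferences", user_id),
            "infrastructure": self.store.retrieve("infrastructure"),
            "investigations": self.store.retrieve("investigations", user_id),
        }

    def on_agent_response(self, agent_name: str, response: str) -> None:
        # Placeholder for pattern extraction: keep responses that mention a dependency.
        if "depends on" in response:
            self.store.save("infrastructure", agent_name, {"note": response})

    def on_investigation_complete(self, user_id: str, query: str, findings: List[str]) -> None:
        self.store.save("investigations", user_id, {"query": query, "key_findings": findings})


hooks = MemoryHooks(InMemoryStore())
hooks.on_agent_response("kubernetes-agent", "web-api depends on postgres-db")
hooks.on_investigation_complete("Alice", "Why is checkout slow?", ["High CPU usage"])
context = hooks.on_investigation_start("Alice")
print({name: len(items) for name, items in context.items()})
```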

Memory Tool Architecture and Planning Integration

The memory system uses a centralized architecture where only the supervisor agent has direct access to memory tools:

Tool Distribution Architecture

Supervisor Agent: Has access to all 4 memory tools (retrieve_memory, save_preference, save_infrastructure, save_investigation)
Individual Agents: Have NO direct access to memory tools, only domain-specific tools:
- Kubernetes Agent: 5 k8s-api tools (get_pod_status, get_deployment_status, etc.)
- Application Logs Agent: 5 logs-api tools (search_logs, get_error_logs, etc.)
- Performance Metrics Agent: 5 metrics-api tools (get_performance_metrics, analyze_trends, etc.)
- Operational Runbooks Agent: 5 runbooks-api tools (search_runbooks, get_incident_playbook, etc.)
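
A simple way to check this distribution is to express it as data and assert that only the supervisor holds memory tools. The sketch below is illustrative: the agent keys `metrics_agent` and `runbooks_agent` are assumed names, and only the tools explicitly named above are listed (each agent has five in total).

```python
MEMORY_TOOLS = {
    "retrieve_memory",
    "save_preference",
    "save_infrastructure",
    "save_investigation",
}

# Only the tool names mentioned in the text are listed here.
AGENT_TOOLS = {
    "supervisor": set(MEMORY_TOOLS),
    "kubernetes_agent": {"get_pod_status", "get_deployment_status"},
    "logs_agent": {"search_logs", "get_error_logs"},
    "metrics_agent": {"get_performance_metrics", "analyze_trends"},
    "runbooks_agent": {"search_runbooks", "get_incident_playbook"},
}


def agents_with_memory_access(agent_tools: dict) -> list:
    """Return the names of agents that hold at least one memory tool."""
    return sorted(name for name, tools in agent_tools.items() if tools & MEMORY_TOOLS)


print(agents_with_memory_access(AGENT_TOOLS))
```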

Centralized Memory Management

This design ensures:

Memory operations are coordinated through the supervisor
Individual agents focus on their domain expertise without memory complexity
Memory context is retrieved once and distributed to agents as needed
Consistent memory patterns across all investigations

Available Memory Tools (Supervisor Only)

save_preference: Saves user preferences to long-term memory
save_infrastructure: Saves infrastructure knowledge to long-term memory
save_investigation: Saves investigation summaries to long-term memory
retrieve_memory: Retrieves relevant information from long-term memory

Memory Context in Planning

When creating investigation plans, the supervisor agent incorporates memory context from three sources. The planning agent uses the retrieve_memory tool to gather relevant context before creating plans.

Planning Agent Memory Usage Example

Here's a real example from agent.log showing how the planning agent retrieves and uses memory context:

# Memory context retrieval during planning (from agent.log)
2025-08-03 17:48:56,072,p1290668,{supervisor.py:339},INFO,Retrieved memory context for planning: 10 preferences, 50 knowledge items from 1 agents, 5 past investigations

# Planning agent tool calls to gather context
2025-08-03 17:49:01,067,p1290668,{tools.py:317},INFO,retrieve_memory called: type=preference, query='user settings communication escalation notification', actor_id=Alice -> Alice, max_results=5
2025-08-03 17:49:01,067,p1290668,{client.py:236},INFO,Retrieving preferences memories: actor_id=Alice, namespace=/sre/users/Alice/preferences, query='user settings communication escalation notification'

This shows the planning agent:

Retrieved 10 user preferences from Alice's preference namespace
Retrieved 50 infrastructure knowledge items from accumulated agent investigations
Retrieved 5 past investigations for similar query patterns
Used retrieve_memory tool with structured queries to gather context before planning
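
When verifying memory retrieval from agent.log, the counts can be pulled out of the summary line with a regular expression. This is a minimal sketch written against the single log line shown above; `parse_memory_context` is a hypothetical helper, and the log format may differ between versions.

```python
import re
from typing import Dict, Optional

LOG_LINE = (
    "2025-08-03 17:48:56,072,p1290668,{supervisor.py:339},INFO,"
    "Retrieved memory context for planning: 10 preferences, "
    "50 knowledge items from 1 agents, 5 past investigations"
)

PATTERN = re.compile(
    r"Retrieved memory context for planning: "
    r"(?P<preferences>\d+) preferences, "
    r"(?P<knowledge>\d+) knowledge items from (?P<agents>\d+) agents, "
    r"(?P<investigations>\d+) past investigations"
)


def parse_memory_context(line: str) -> Optional[Dict[str, int]]:
    """Extract the memory counts from a log line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


print(parse_memory_context(LOG_LINE))
```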

Enhanced Planning Prompt with Memory Context

The planning prompt uses XML structure for better interaction with Claude models:

<memory_retrieval>
CRITICAL: Before creating the investigation plan, you MUST use the retrieve_memory tool to gather relevant context:
1. Use retrieve_memory("preference", "user settings communication escalation notification", "{user_id}", 5)
2. Use retrieve_memory("infrastructure", "[relevant service terms from query]", "sre-agent", 10, null)  
3. Use retrieve_memory("investigation", "[key terms from user query]", "{user_id}", 5, null)
</memory_retrieval>

<planning_guidelines>
After gathering memory context, create a simple, focused investigation plan with 2-3 steps maximum.
Consider user preferences and past investigation patterns from memory.
</planning_guidelines>

<response_format>
MANDATORY: Your response MUST be ONLY valid JSON that matches this exact structure:
{
  "steps": ["Step 1 description", "Step 2 description"],
  "agents_sequence": ["kubernetes_agent", "logs_agent"],
  "complexity": "simple",
  "auto_execute": true,
  "reasoning": "Brief explanation based on retrieved memory context"
}
</response_format>
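
Because the prompt demands JSON-only output, the caller can reject any reply that does not match the structure before acting on it. The following is a minimal sketch using only the standard library. The InvestigationPlan dataclass and parse_plan function are hypothetical names chosen for illustration, and the three-step limit mirrors the "2-3 steps maximum" guideline rather than any check known to exist in the agent.

```python
import json
from dataclasses import dataclass, fields
from typing import List


@dataclass
class InvestigationPlan:
    """Mirrors the keys listed in the <response_format> block."""
    steps: List[str]
    agents_sequence: List[str]
    complexity: str
    auto_execute: bool
    reasoning: str


def parse_plan(raw: str, max_steps: int = 3) -> InvestigationPlan:
    """Parse a model reply, raising ValueError if it is not a valid plan."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("plan must be a JSON object")

    expected = {f.name for f in fields(InvestigationPlan)}
    if set(data) != expected:
        raise ValueError(f"unexpected or missing keys: {sorted(set(data) ^ expected)}")

    plan = InvestigationPlan(**data)
    for name in ("steps", "agents_sequence"):
        value = getattr(plan, name)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{name} must be a list of strings")
    if not 1 <= len(plan.steps) <= max_steps:
        raise ValueError(f"expected 1 to {max_steps} steps, got {len(plan.steps)}")
    if not isinstance(plan.auto_execute, bool):
        raise ValueError("auto_execute must be a boolean")
    if not isinstance(plan.complexity, str) or not isinstance(plan.reasoning, str):
        raise ValueError("complexity and reasoning must be strings")
    return plan
```

Failing fast here means a reply wrapped in prose or missing a key surfaces as a single ValueError instead of a malformed plan reaching the agents.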

Memory-Informed Planning Example

# Enhanced planning prompt includes:
"""
User's query: list kubernetes pods

Retrieved Memory Context:
- User Preferences (10 items): Auto-approval for simple Kubernetes plans, technical detail preference
- Infrastructure Knowledge (50 items): Production namespace layout, pod dependency patterns  
- Past Investigations (5 items): Previous successful pod listing investigations

Create an investigation plan considering this context...
"""

The planning agent then creates plans like:

{
  "steps": ["Use Kubernetes agent to retrieve current pod status across all namespaces", "Analyze pod health and resource utilization", "Provide structured technical report with pod details"],
  "agents_sequence": ["kubernetes_agent"],
  "complexity": "simple", 
  "auto_execute": true,
  "reasoning": "Based on user preferences for auto-approval of simple Kubernetes plans and past successful investigations, this is a straightforward pod listing task requiring only the Kubernetes agent"
}

Memory Capture and Pattern Recognition

The SRE Agent automatically captures information during investigations by recognizing patterns in agent responses and converting the matches into structured data:

How Memory Capture Works

The SRE agent code (specifically sre_agent/memory/hooks.py) uses regex patterns to parse agent responses and extract structured information:

Response Analysis: After each agent response, the system scans the text for specific patterns
Pattern Matching: Uses regex to identify key information types
Data Structuring: Converts matched patterns into structured Pydantic models
Memory Storage: Calls Amazon Bedrock AgentCore Memory's create_event() API to store the structured data
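
The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the code in sre_agent/memory/hooks.py: the regular expressions, the UserPreference fields, and the capture_preferences function are hypothetical, a standard-library dataclass stands in for the Pydantic model, and the store callable stands in for the create_event() call, whose real signature is not shown here.

```python
import re
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List

# Hypothetical patterns for two of the preference types mentioned in this document.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SLACK_CHANNEL = re.compile(r"(?<!\w)#[a-z0-9][a-z0-9_-]*")


@dataclass
class UserPreference:
    """Stand-in for the structured model; field names are assumptions."""
    user_id: str
    preference_type: str
    value: str


def capture_preferences(
    response: str,
    user_id: str,
    store: Callable[[Dict[str, str]], None],
) -> List[UserPreference]:
    """Scan a response, structure the matches, and hand each one to `store`."""
    # Steps 1 and 2: response analysis and pattern matching.
    emails = EMAIL.findall(response)
    channels = SLACK_CHANNEL.findall(response)

    # Step 3: data structuring.
    found = [UserPreference(user_id, "escalation_email", e) for e in emails]
    found += [UserPreference(user_id, "slack_channel", c) for c in channels]

    # Step 4: memory storage (placeholder for the real create_event() call).
    for preference in found:
        store(asdict(preference))
    return found
```

Passing the storage function in as a parameter keeps the pattern logic testable without a memory resource: in a test, store can simply be the append method of a list.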

SRE Agent Pattern Recognition

Every individual agent response triggers automatic memory pattern extraction through the on_agent_response() hook. This ensures that valuable information discovered during domain-specific investigations is captured and made available for future use.

Infrastructure Knowledge Extraction via Agent JSON Responses

Infrastructure knowledge is extracted from the agents' own output. Each agent is instructed to include infrastructure knowledge in its responses as structured JSON:

Agent Response Format

{
  "infrastructure_knowledge": [
    {
      "service_name": "web-app-deployment",
      "knowledge_type": "baseline",
      "knowledge_data": {
        "cpu_usage_normal": "75%",
        "memory_usage_normal": "85%",
        "typical_pods": 1,
        "node_distribution": "node-1"
      },
      "confidence": 0.9,
      "context": "Pod status analysis revealed normal resource usage patterns"
    }
  ]
}

Knowledge Types Captured

dependency: Service relationships and dependencies
pattern: Recurring infrastructure patterns and behaviors
config: Configuration insights and settings
baseline: Performance baselines and normal operating ranges

Automatic Extraction Process

Agent Response Processing: Each agent response is scanned for JSON blocks containing infrastructure_knowledge
JSON Parsing: The system extracts and validates the JSON structure
Knowledge Storage: Valid knowledge items are stored in the infrastructure memory namespace
Cross-Session Availability: Knowledge becomes available for future investigations across all sessions
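
The scanning and validation steps above can be sketched with the standard library's JSON decoder. This is an illustrative approach, not the implementation in hooks.py: the function names are hypothetical, and treating confidence as an optional number between 0 and 1 is an assumption based on the example values in this document.

```python
import json
from typing import Any, Dict, List

KNOWLEDGE_TYPES = {"dependency", "pattern", "config", "baseline"}
REQUIRED_KEYS = ("service_name", "knowledge_type", "knowledge_data")


def _is_valid(item: Any) -> bool:
    """Check one knowledge item against the documented response format."""
    if not isinstance(item, dict) or any(k not in item for k in REQUIRED_KEYS):
        return False
    if item["knowledge_type"] not in KNOWLEDGE_TYPES:
        return False
    confidence = item.get("confidence")  # assumed optional
    if confidence is not None:
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            return False
        if not 0 <= confidence <= 1:
            return False
    return True


def extract_infrastructure_knowledge(response: str) -> List[Dict[str, Any]]:
    """Find JSON objects embedded in free text and return the valid items."""
    decoder = json.JSONDecoder()
    items: List[Dict[str, Any]] = []
    pos = response.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(response, pos)
        except json.JSONDecodeError:
            # Not JSON at this brace; try the next one.
            pos = response.find("{", pos + 1)
            continue
        candidates = obj.get("infrastructure_knowledge", [])
        if isinstance(candidates, list):
            items.extend(c for c in candidates if _is_valid(c))
        pos = response.find("{", end)
    return items
```

Using raw_decode instead of a regular expression lets the decoder handle nested braces inside knowledge_data, and invalid items are dropped individually so that one bad entry does not discard the rest of the block.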

Enhanced Agent Response Processing and Logging

Comprehensive Response Logging

The system provides detailed logging of agent responses and memory operations:

# From agent.log - Message breakdown logging
2025-08-03 17:45:30,397,p1289365,{agent_nodes.py:347},INFO,Kubernetes Infrastructure Agent - Message breakdown: 1 USER, 1 ASSISTANT, 1 TOOL messages
2025-08-03 17:45:30,397,p1289365,{agent_nodes.py:349},INFO,Kubernetes Infrastructure Agent - Tools called: k8s-api___get_pod_status

# Memory pattern extraction logging  
2025-08-03 17:45:30,398,p1289365,{hooks.py:193},INFO,on_agent_response called for agent: Kubernetes Infrastructure Agent, user_id: Alice
2025-08-03 17:45:30,399,p1289365,{hooks.py:383},INFO,Extracted 5 infrastructure knowledge items from agent response

Infrastructure Knowledge Validation

The system includes validation and error handling for infrastructure knowledge extraction:

# Successful extraction logging
2025-08-03 17:45:30,401,p1289365,{hooks.py:387},INFO,Saved infrastructure knowledge: web-app-deployment (baseline) with confidence 0.9
2025-08-03 17:45:30,402,p1289365,{hooks.py:387},INFO,Saved infrastructure knowledge: database-pod (pattern) with confidence 0.8

Automatic Conversation Memory Storage

All agent interactions are automatically stored in conversation memory with message type breakdown:

# Conversation storage with tool tracking
2025-08-03 17:45:30,397,p1289365,{agent_nodes.py:347},INFO,Kubernetes Infrastructure Agent - Message breakdown: 1 USER, 1 ASSISTANT, 1 TOOL messages
2025-08-03 17:45:30,397,p1289365,{agent_nodes.py:349},INFO,Kubernetes Infrastructure Agent - Tools called: k8s-api___get_pod_status
2025-08-03 17:45:30,530,p1289365,{agent_nodes.py:375},INFO,Kubernetes Infrastructure Agent: Successfully stored conversation in memory

Cross-Session Memory Access

The system provides cross-session memory retrieval for better investigation context:

# Cross-session infrastructure knowledge retrieval
2025-08-03 17:45:30,140,p1289365,{hooks.py:71},INFO,Retrieved infrastructure knowledge for user 'Alice' from 1 different sources: Alice: 50 memories
2025-08-03 17:45:30,140,p1289365,{client.py:245},INFO,Retrieved 50 infrastructure memories for Alice

Memory Capture Methods

Supervisor tool calls: retrieve_memory called during planning; save_investigation called via planning agent
Automatic pattern extraction: Agent responses are processed by the on_agent_response() hook, which calls the _save_* functions directly (not through tool calls) to store:
- User preferences (escalation emails, Slack channels)
- Infrastructure knowledge (service dependencies, baselines)
Manual configuration: User preferences loaded via manage_memories.py update
Conversation storage: All agent responses and tool calls stored as conversation memory

Memory Storage Process

Pattern Detection: SRE agent code identifies relevant information in responses
Data Conversion: Creates structured objects (UserPreference, InfrastructureKnowledge, etc.)
Event Creation: Calls create_event() with actor_id and structured data
Namespace Routing: Amazon Bedrock AgentCore Memory automatically routes to correct namespace based on strategy configuration
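
Namespace routing happens inside Amazon Bedrock AgentCore Memory, but its effect can be illustrated locally. The sketch below assumes each strategy has a namespace template with an actor placeholder. Only the preferences path is taken from the log output in this document (/sre/users/Alice/preferences); the other two templates, the strategy keys, and the resolve_namespace function are hypothetical.

```python
NAMESPACE_TEMPLATES = {
    # Matches the namespace shown in the agent.log excerpt.
    "preferences": "/sre/users/{actor_id}/preferences",
    # Assumed for illustration; not confirmed by this document.
    "infrastructure": "/sre/infrastructure/{actor_id}",
    "investigations": "/sre/investigations/{actor_id}",
}


def resolve_namespace(strategy: str, actor_id: str) -> str:
    """Return the namespace an event for `actor_id` would land in."""
    try:
        template = NAMESPACE_TEMPLATES[strategy]
    except KeyError:
        raise ValueError(f"unknown strategy: {strategy!r}") from None
    if not actor_id or "/" in actor_id:
        # A slash in the actor ID would escape the intended namespace.
        raise ValueError(f"invalid actor_id: {actor_id!r}")
    return template.format(actor_id=actor_id)
```

With these templates, events created for actor Alice under the preferences strategy resolve to the same path that appears in the retrieval log, which is why the actor_id passed to create_event() determines whose memories a later query can see.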

Agent Memory Integration

Each existing SRE agent both contributes to memory (Captures) and draws on it (Uses):

Kubernetes Agent

Captures: Service dependencies, deployment patterns, resource baselines
Uses: Past deployment issues, known resource requirements

Example Knowledge Captured:

{
  "service_name": "web-app-deployment",
  "knowledge_type": "baseline",
  "knowledge_data": {
    "cpu_usage_normal": "75%",
    "memory_usage_normal": "85%",
    "typical_pods": 1
  }
}

Logs Agent

Captures: Common error patterns, log query preferences, resolution strategies
Uses: Similar error patterns, effective log queries from past investigations

Example Knowledge Captured:

{
  "service_name": "payment-service",
  "knowledge_type": "pattern",
  "knowledge_data": {
    "common_errors": ["connection timeout", "memory leak"],
    "effective_queries": ["error AND payment AND timeout"]
  }
}

Metrics Agent

Captures: Performance baselines, alert thresholds, metric correlations
Uses: Historical baselines, known performance patterns

Example Knowledge Captured:

{
  "service_name": "api-gateway",
  "knowledge_type": "baseline",
  "knowledge_data": {
    "normal_response_time": "200ms",
    "peak_traffic_hours": "14:00-17:00 UTC"
  }
}

Runbooks Agent

Captures: Successful resolution procedures, team escalation paths
Uses: Proven resolution strategies, appropriate runbook recommendations

Example Knowledge Captured:

{
  "service_name": "database",
  "knowledge_type": "dependency",
  "knowledge_data": {
    "escalation_team": "database-team@company.com",
    "recovery_runbook": "DB-001"
  }
}
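
The four examples above share the same shape (service_name, knowledge_type, knowledge_data), so retrieved items can be indexed uniformly regardless of which agent captured them. The following is a minimal sketch of such a lookup, assuming items arrive as plain dictionaries. The function names are hypothetical, and preferring the highest confidence is an illustrative choice rather than documented behavior.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, str]  # (service_name, knowledge_type)


def index_knowledge(items: Iterable[Dict[str, Any]]) -> Dict[Key, List[Dict[str, Any]]]:
    """Group knowledge items by service and knowledge type."""
    index: Dict[Key, List[Dict[str, Any]]] = defaultdict(list)
    for item in items:
        index[(item["service_name"], item["knowledge_type"])].append(item)
    return dict(index)


def best_match(
    index: Dict[Key, List[Dict[str, Any]]],
    service_name: str,
    knowledge_type: str,
) -> Optional[Dict[str, Any]]:
    """Return the highest-confidence item for a service, or None if unknown."""
    candidates = index.get((service_name, knowledge_type), [])
    if not candidates:
        return None
    # Items without a confidence field are treated as confidence 0.0.
    return max(candidates, key=lambda item: item.get("confidence", 0.0))
```

For instance, a metrics investigation could call best_match(index, "api-gateway", "baseline") to recover the normal response time before deciding whether a current reading is anomalous.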

Manual Memory Management

Memory management is handled through the manage_memories.py script:

Viewing Memories

# List all memory types
uv run python scripts/manage_memories.py list

# List specific memory type
uv run python scripts/manage_memories.py list --memory-type preferences

# List memories for specific user
uv run python scripts/manage_memories.py list --memory-type preferences --actor-id Alice

Managing User Preferences

# Load user preferences from YAML configuration
uv run python scripts/manage_memories.py update

# Load from custom configuration file
uv run python scripts/manage_memories.py update --config-file custom_users.yaml

Benefits

Personalized Investigations: Tailors reports and communication to individual user preferences and roles
Faster Resolution: Leverages historical context and past investigation knowledge
Knowledge Preservation: Automatically captures and shares tribal knowledge across team changes
Pattern Recognition: Identifies recurring issues and optimizes escalation routing
Reduced MTTR: Accelerates problem resolution through accumulated institutional knowledge

Privacy and Data Management

Data Retention

User preferences: 90 days (configurable)
Infrastructure knowledge: 30 days (configurable)
Investigation summaries: 60 days (configurable)
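
To make the retention windows concrete, the sketch below computes when a memory of each type would expire. The day counts come from the list above; the dictionary keys and function names are hypothetical, and expiry itself is assumed to be enforced by the memory service rather than by code like this.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Default retention periods in days, as listed above (all configurable).
RETENTION_DAYS = {
    "preferences": 90,
    "infrastructure": 30,
    "investigations": 60,
}


def expires_at(memory_type: str, created_at: datetime) -> datetime:
    """Return the moment a memory created at `created_at` leaves retention."""
    return created_at + timedelta(days=RETENTION_DAYS[memory_type])


def is_expired(
    memory_type: str,
    created_at: datetime,
    now: Optional[datetime] = None,
) -> bool:
    """Report whether a memory is past its retention window."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at(memory_type, created_at)
```

Under these defaults, infrastructure knowledge captured on 2025-08-03 would age out on 2025-09-02, while a user preference saved the same day would remain until 2025-11-01.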

Setting Up Memory System

Initial Setup

The memory system is automatically initialized during the setup process:

# Initialize memory system and load user preferences (included in setup instructions)
uv run python scripts/manage_memories.py update

This command:

Creates a new memory resource if none exists
Configures the three memory strategies
Loads user preferences from scripts/user_config.yaml
Stores the memory ID in .memory_id for future use
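
The "create if none exists, then remember the ID" behavior can be sketched as a small caching helper. This is an assumption about the general pattern, not the code in manage_memories.py: load_or_create_memory_id is a hypothetical name, and the create callable stands in for whatever call actually provisions the memory resource.

```python
from pathlib import Path
from typing import Callable


def load_or_create_memory_id(path: Path, create: Callable[[], str]) -> str:
    """Reuse the cached memory ID if present, otherwise create and cache one."""
    if path.exists():
        cached = path.read_text(encoding="utf-8").strip()
        if cached:
            return cached

    # No usable cache: provision a new resource and persist its ID so later
    # runs reuse it instead of creating a duplicate.
    memory_id = create()
    path.write_text(memory_id, encoding="utf-8")
    return memory_id
```

Because the ID is local state tied to one deployment, it makes sense that .memory_id is listed in .gitignore rather than committed.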

Adding User Preferences

To add new users or modify existing preferences:

Edit scripts/user_config.yaml to add new user configurations
Run the update command to load new preferences:

uv run python scripts/manage_memories.py update

Managing Memories

# List all memory types
uv run python scripts/manage_memories.py list

# List specific memory type
uv run python scripts/manage_memories.py list --memory-type preferences

# List preferences for specific user
uv run python scripts/manage_memories.py list --memory-type preferences --actor-id Alice
