mirror of
https://github.com/awslabs/amazon-bedrock-agentcore-samples.git
synced 2025-09-08 20:50:46 +00:00
113 lines
6.1 KiB
Markdown
113 lines
6.1 KiB
Markdown
|
# API Response Time Failures Investigation Analysis
|
||
|
|
||
|
*Created: 2025-08-02*
|
||
|
|
||
|
## Question Asked
|
||
|
"have i investigated api response time failures before..how many times if so"
|
||
|
|
||
|
## Analysis Summary
|
||
|
|
||
|
The question about API response time failures was answered using a **hybrid approach** - both memory retrieval AND live infrastructure tool calls. This differs from the flight booking question which was purely memory-based.
|
||
|
|
||
|
## Investigation Flow
|
||
|
|
||
|
### 1. Initial Memory Retrieval (Supervisor)
|
||
|
The supervisor first searched memory to understand user context:
|
||
|
|
||
|
**Memory Context Retrieved:**
|
||
|
- **User Preferences**: 10 preference records for Alice
|
||
|
- **Infrastructure Knowledge**: 50 knowledge items from 1 agent
|
||
|
- **Past Investigations**: 5 investigation summaries for Alice
|
||
|
|
||
|
**Key Memory Finding**: The system found that Alice had previously investigated API response time failures, specifically:
|
||
|
> "Based on memory retrieval, the user (Alice) has investigated API response time failures at least once before, with a significant incident on January 15, 2024 affecting the web-service's `/api/users` endpoint."
|
||
|
|
||
|
### 2. Memory Tool Calls (3 calls)
|
||
|
The supervisor made specific memory retrieval calls:
|
||
|
|
||
|
```
|
||
|
retrieve_memory called: type=preference, query='user settings communication escalation notification', actor_id=Alice
|
||
|
retrieve_memory called: type=infrastructure, query='api response time failures degradation', actor_id=sre-agent
|
||
|
retrieve_memory called: type=investigation, query='api response time failures', actor_id=Alice
|
||
|
```
|
||
|
|
||
|
### 3. Live Infrastructure Tool Calls (3 calls)
|
||
|
**Unlike the flight booking question**, the system ALSO called live infrastructure tools:
|
||
|
|
||
|
**Performance Metrics Agent:**
|
||
|
- `metrics-api___get_performance_metrics` (response_time data)
|
||
|
- `metrics-api___get_error_rates` (24h error rate data)
|
||
|
- `metrics-api___analyze_trends` (response time trend analysis)
|
||
|
|
||
|
**Application Logs Agent:**
|
||
|
- Called to search for historical log patterns
|
||
|
|
||
|
## Key Findings - Hybrid Memory + Live Tools
|
||
|
|
||
|
### 1. **Memory-Based Historical Context**
|
||
|
- ✓ Found past investigation on January 15, 2024 affecting web-service `/api/users` endpoint
|
||
|
- ✓ Retrieved Alice's user preferences and communication settings
|
||
|
- ✓ Accessed 50 infrastructure knowledge items for context
|
||
|
|
||
|
### 2. **Live Tool Verification**
|
||
|
- ✓ Called `get_performance_metrics` to get current response time data
|
||
|
- ✓ Called `get_error_rates` to check current error rates
|
||
|
- ✓ Called `analyze_trends` to identify response time patterns
|
||
|
- ✓ Searched application logs for additional historical evidence
|
||
|
|
||
|
### 3. **Intelligent Tool Selection**
|
||
|
The system distinguished between:
|
||
|
- **Historical count question** → Use memory retrieval first
|
||
|
- **Performance investigation** → Also verify with live tools
|
||
|
|
||
|
## Why Both Memory AND Live Tools?
|
||
|
|
||
|
This question was more complex than the flight booking question:
|
||
|
|
||
|
1. **"How many times"** → Requires memory search for historical count
|
||
|
2. **"API response time failures"** → Performance question that benefits from current metrics verification
|
||
|
3. **Investigation context** → Current tools help validate if past patterns still exist
|
||
|
|
||
|
## Tool Usage Analysis
|
||
|
|
||
|
| Tool Type | Called? | User ID | Agent | Purpose | Result |
|
||
|
|-----------|---------|---------|-------|---------|---------|
|
||
|
| `retrieve_memory` (preferences) | ✓ Yes | Alice | Supervisor | Get Alice's preferences | Found 10 preference records |
|
||
|
| `retrieve_memory` (infrastructure) | ✓ Yes | Alice | Supervisor | Get past infrastructure knowledge | Found 50 knowledge items |
|
||
|
| `retrieve_memory` (investigation) | ✓ Yes | Alice | Supervisor | Get past investigations | Found 5 investigation summaries |
|
||
|
| `metrics-api___get_performance_metrics` | ✓ Yes | Alice | Performance Metrics Agent | Current response time data | Retrieved current metrics |
|
||
|
| `metrics-api___get_error_rates` | ✓ Yes | Alice | Performance Metrics Agent | Current error rates | Retrieved error rate data |
|
||
|
| `metrics-api___analyze_trends` | ✓ Yes | Alice | Performance Metrics Agent | Response time trends | Analyzed trend patterns |
|
||
|
| `logs-api` tools | ✓ Yes | Alice | Application Logs Agent | Historical log analysis | Searched for patterns |
|
||
|
| `k8s-api` tools | ✗ No | - | - | Cluster status | Not needed for this query |
|
||
|
| `runbooks-api` tools | ✗ No | - | - | Operational procedures | Not needed for this query |
|
||
|
|
||
|
## System Intelligence Demonstrated
|
||
|
|
||
|
1. **Hybrid Approach**: Correctly combined memory retrieval with live verification
|
||
|
2. **Question Complexity Recognition**: Identified this as both historical AND performance question
|
||
|
3. **Tool Prioritization**: Started with memory, then used live tools for validation
|
||
|
4. **Comprehensive Search**: Used multiple memory types and infrastructure tools
|
||
|
5. **Context Integration**: Combined historical context with current system state
|
||
|
|
||
|
## Comparison with Flight Booking Question
|
||
|
|
||
|
| Aspect | Flight Booking | API Response Time |
|
||
|
|--------|----------------|------------------|
|
||
|
| **Question Type** | Pure historical ("have i investigated") | Historical + Performance ("failures before") |
|
||
|
| **User ID** | Alice | Alice |
|
||
|
| **Memory Tools** | ✓ Used (3 calls by Supervisor) | ✓ Used (3 calls by Supervisor) |
|
||
|
| **Infrastructure Tools** | ✗ Not used | ✓ Used (3+ calls by Metrics & Logs Agents) |
|
||
|
| **Reasoning** | Historical question only | Performance investigation requires validation |
|
||
|
| **Result** | Memory-only answer | Memory + live verification |
|
||
|
|
||
|
## Conclusion
|
||
|
|
||
|
The API response time question demonstrates the SRE agent's **sophisticated intelligence** in choosing appropriate tools:
|
||
|
|
||
|
- **Pure historical questions** → Memory-only approach (flight booking)
|
||
|
- **Performance + historical questions** → Hybrid memory + live tools approach (API response times)
|
||
|
|
||
|
The system correctly identified that API response time failures are an ongoing operational concern that benefits from both historical context AND current system verification, unlike the flight booking service which was purely a historical lookup.
|
||
|
|
||
|
This shows the memory system working in **harmony with live monitoring tools** rather than replacing them, providing the best of both historical institutional knowledge and current system state.
|