.issue.db
/highway-workflow-engine
Edit Issue #636
Description
## Root Cause Analysis

When a synchronous tool (such as the LLM tool) blocks forever, it bypasses every protection layer:

1. **LLM Tool**: Used the `with ThreadPoolExecutor()` context manager, which blocks on `shutdown(wait=True)` when the block exits.
2. **Bulkhead**: `timeout_seconds=600` only works for async operations; it cannot cancel a blocking sync call.
3. **Orchestrator**: `as_completed(futures)` had no timeout, so it blocked forever on stuck futures.

## Evidence

- Worker-2 was stuck for 40+ minutes while its process stayed alive (NOTIFY received, but nothing claimed).
- Worker-1 recovered only because Docker restarted it (RestartCount=1).
- The health check reports 'healthy' because it only checks whether the process exists.

## Fixes Applied

1. **orchestrator.py**: Added `timeout=630.0` to `as_completed()`, so stuck tasks are handled gracefully instead of blocking the loop.
2. **llm.py**: Changed to `executor.shutdown(wait=False, cancel_futures=True)`.

## Remaining Work

- [ ] Improve the health check to detect zombie workers (not just process existence)
- [ ] Add a worker watchdog mechanism
- [ ] Check other tools for similar patterns
- [ ] Consider process isolation for untrusted code
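
The two fixes can be illustrated with a minimal sketch. This is not the code from `orchestrator.py` or `llm.py`: the helper `run_with_deadline`, the task names, and the 0.5-second timeout are hypothetical and chosen only to keep the demo short (the issue uses `timeout=630.0`, presumably to sit just above the bulkhead's 600 seconds). It assumes Python 3.9+ for `cancel_futures`.

```python
import concurrent.futures as cf
import threading


def run_with_deadline(tasks, timeout):
    """Run callables in a thread pool and report which finish within `timeout` seconds.

    Returns (results, stuck): `results` maps task name -> return value,
    `stuck` is the sorted list of task names that had not completed in time.
    """
    # Deliberately not `with ThreadPoolExecutor()`: leaving the block calls
    # shutdown(wait=True), which would wait on the stuck thread forever.
    executor = cf.ThreadPoolExecutor(max_workers=max(1, len(tasks)))
    futures = {executor.submit(fn): name for name, fn in tasks.items()}
    results = {}
    try:
        # With a timeout, as_completed() raises TimeoutError once the deadline
        # passes with futures still pending, instead of blocking indefinitely.
        for fut in cf.as_completed(futures, timeout=timeout):
            results[futures[fut]] = fut.result()
    except cf.TimeoutError:
        pass  # stuck tasks are reported below rather than awaited
    finally:
        # cancel_futures=True cancels queued futures only; a task that is
        # already running keeps its thread until it returns on its own.
        executor.shutdown(wait=False, cancel_futures=True)
    stuck = sorted(name for fut, name in futures.items() if not fut.done())
    return results, stuck


# Demo: one task returns immediately, the other stands in for a blocking sync call.
release = threading.Event()
tasks = {
    "fast": lambda: "ok",
    "stuck": lambda: release.wait(),
}
results, stuck = run_with_deadline(tasks, timeout=0.5)
print(results, stuck)
release.set()  # unblock the worker thread so the interpreter can exit cleanly
```

Note that the stuck thread is abandoned, not killed: Python threads cannot be forcibly interrupted, which is why the remaining-work items (watchdog, process isolation) still matter.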