#636 CRITICAL: Bulkhead timeout bypass - sync blocking causes worker freeze
Description
Edit## Root Cause Analysis
When a sync tool (like LLM) blocks forever, it bypasses all protection layers:
1. **LLM Tool**: Used `with ThreadPoolExecutor()` context manager - blocks on `shutdown(wait=True)`
2. **Bulkhead**: `timeout_seconds=600` only works for async operations, can't cancel sync blocking
3. **Orchestrator**: `as_completed(futures)` had NO TIMEOUT, blocks forever on stuck futures
## Evidence
- Worker-2 stuck for 40+ minutes while process alive (NOTIFY received, no claiming)
- Worker-1 recovered only because Docker restarted it (RestartCount=1)
- Health check shows 'healthy' (only checks if process exists)
## Fixes Applied
1. **orchestrator.py**: Added `timeout=630.0` to `as_completed()`, handles stuck tasks gracefully
2. **llm.py**: Changed to `executor.shutdown(wait=False, cancel_futures=True)`
## Remaining Work
- [ ] Improve health check to detect zombie workers (not just process existence)
- [ ] Add worker watchdog mechanism
- [ ] Check other tools for similar patterns
- [ ] Consider process isolation for untrusted code
Comments
Loading comments...
Context
Loading context...
Audit History
View AllLoading audit history...