#636 CRITICAL: Bulkhead timeout bypass - sync blocking causes worker freeze

closed high Created 2025-12-20 17:22 · Updated 2025-12-25 03:45

Description

## Root Cause Analysis

When a sync tool (like LLM) blocks forever, it bypasses all protection layers:

1. **LLM Tool**: Used `with ThreadPoolExecutor()` context manager, which blocks on `shutdown(wait=True)`
2. **Bulkhead**: `timeout_seconds=600` only works for async operations, can't cancel sync blocking
3. **Orchestrator**: `as_completed(futures)` had no timeout, blocks forever on stuck futures

## Evidence

- Worker-2 stuck for 40+ minutes while process alive (NOTIFY received, no claiming)
- Worker-1 recovered only because Docker restarted it (RestartCount=1)
- Health check shows 'healthy' (only checks if process exists)

## Fixes Applied

1. **orchestrator.py**: Added `timeout=630.0` to `as_completed()`, handles stuck tasks gracefully
2. **llm.py**: Changed to `executor.shutdown(wait=False, cancel_futures=True)`

## Remaining Work

- [ ] Improve health check to detect zombie workers (not just process existence)
- [ ] Add worker watchdog mechanism
- [ ] Check other tools for similar patterns
- [ ] Consider process isolation for untrusted code
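
## Illustrative Sketch

The two fixes above can be sketched together as follows. This is a minimal illustration, not the actual code in `orchestrator.py` or `llm.py`: the helper name `run_with_deadline`, the task dictionary, and the short timeout are assumptions made for the example. It relies only on standard-library behavior: `as_completed(..., timeout=...)` raises a timeout error when futures are still pending at the deadline, and `shutdown(wait=False, cancel_futures=True)` (Python 3.9+) returns without joining worker threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def run_with_deadline(tasks, timeout_s):
    """Run sync callables in threads; never wait past timeout_s for stuck ones.

    `tasks` maps a (hypothetical) task name to a zero-argument callable.
    Returns (results, stuck_names).
    """
    # No `with` block here: the context manager's exit calls
    # shutdown(wait=True), which would block on a stuck thread.
    executor = ThreadPoolExecutor(max_workers=len(tasks))
    futures = {executor.submit(fn): name for name, fn in tasks.items()}
    results, stuck = {}, []
    try:
        # The timeout bounds the whole iteration, not each future.
        for fut in as_completed(futures, timeout=timeout_s):
            results[futures[fut]] = fut.result()
    except FuturesTimeout:
        stuck = [name for fut, name in futures.items() if not fut.done()]
    finally:
        # Return immediately and cancel anything still queued.
        # Already-running threads are NOT killed; they are only abandoned.
        executor.shutdown(wait=False, cancel_futures=True)
    return results, stuck


# Demo: one task finishes, one blocks until released.
release = threading.Event()
results, stuck = run_with_deadline(
    {"fast": lambda: 42, "stuck": release.wait},
    timeout_s=0.2,
)
print(results, stuck)
release.set()  # let the abandoned thread exit so the interpreter can shut down
```

Note that an abandoned thread keeps running until its blocking call returns, so this pattern only unblocks the caller. A call that truly never returns still occupies a thread, which is why process isolation remains on the list of remaining work.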
