#453 LLM tool asyncio.run() causes worker zombie state via anyio corruption

closed critical Created 2025-12-16 11:24 · Updated 2025-12-16 11:33

Description

In `llm.py` lines 573-577, when no event loop is running:

```python
try:
    asyncio.get_running_loop()
except RuntimeError:
    return asyncio.run(coro)  # No timeout!
```

When anyio (used by httpx) has corrupted internal state (e.g., after sandbox fallback), `asyncio.run()` can hang indefinitely because:

1. anyio callbacks fail with 'no running event loop'
2. HTTP request never completes
3. `asyncio.run()` waits forever

**Evidence from logs:**

- 10:01:07 asyncio error: `RuntimeError: no running event loop` -> `InvalidStateError`
- Workers stopped claiming tasks entirely after this
- LISTEN thread still ran (daemon), making workers appear 'alive'
- 590 pending + 7000+ completed tasks, zero being processed

**Root cause:** Mixing sync sandbox fallback -> direct execution -> LLM async call corrupts anyio state

**Fix options:**

1. Add timeout to `asyncio.run()` path (wrap in thread with timeout)
2. Ensure httpx client lifecycle is properly managed
3. Add watchdog to detect zombie main loop
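
**Sketch for fix option 1:** a minimal, untested-in-production illustration of bounding the `asyncio.run()` path, not the actual patch. The helper name `run_with_deadline`, the `grace` parameter, and the default values are hypothetical. It combines a cooperative timeout inside the loop (`asyncio.wait_for`) with a hard timeout on the calling thread, so the caller is released even if the event loop itself is wedged.

```python
import asyncio
import concurrent.futures
import threading


def run_with_deadline(coro, timeout=60.0, grace=5.0):
    """Run `coro` on a fresh event loop in a daemon thread, bounded in time.

    Hypothetical helper: `timeout` is enforced inside the loop via
    asyncio.wait_for; `timeout + grace` is enforced on the calling thread
    in case the loop never gets a chance to cancel the coroutine.
    """
    future = concurrent.futures.Future()

    def _runner():
        try:
            result = asyncio.run(asyncio.wait_for(coro, timeout))
        except BaseException as exc:  # forward every failure to the caller
            future.set_exception(exc)
        else:
            future.set_result(result)

    # Daemon thread: a wedged loop cannot block interpreter shutdown.
    threading.Thread(target=_runner, daemon=True, name="llm-async-run").start()

    try:
        return future.result(timeout=timeout + grace)
    except (asyncio.TimeoutError, concurrent.futures.TimeoutError) as exc:
        # Normalize to the builtin TimeoutError across Python versions.
        raise TimeoutError(f"async call exceeded {timeout}s") from exc
```

Under these assumptions, the `except RuntimeError:` branch would call `run_with_deadline(coro)` instead of `asyncio.run(coro)`. If the hard timeout fires, the daemon thread may still be stuck, so this only unblocks the worker; it does not repair the corrupted anyio state (fix options 2 and 3 still apply).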
