#721 Workers become zombies after asyncio crash - container healthy but all threads dead

closed critical Created 2025-12-27 03:27 · Updated 2025-12-27 03:51

Description

OBSERVED: Workers can crash with asyncio.exceptions.CancelledError during HTTP operations, causing all internal threads (TimeoutService, task processing loop, etc.) to die while the container remains 'running' and 'healthy'.

Symptoms:
- Container status: running/healthy
- Zero log activity
- Stuck workflows not cleaned up
- TimeoutService not running

Root cause: asyncio task group error during HTTP connection:
asyncio.exceptions.CancelledError: Cancelled via cancel scope ...

Impact: Workflows get stuck in 'running' state indefinitely because TimeoutService (which cleans up expired claims) is dead.

Workaround: Restart workers with 'docker compose restart worker'.

Suggested fix: Add a watchdog that monitors thread activity and restarts the worker if all threads are dead. Alternatively, improve asyncio error handling to prevent cascade failures.
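A minimal sketch of the watchdog idea from the suggested fix, assuming each internal thread can report a heartbeat on every loop iteration. The names Watchdog, beat, stale_after, run_watchdog and on_zombie are hypothetical and do not come from the worker codebase; the thresholds are placeholders.

```python
import threading
import time
from typing import Callable, Dict, List


class Watchdog:
    """Tracks per-thread heartbeats and reports threads that have gone silent."""

    def __init__(self, stale_after: float,
                 clock: Callable[[], float] = time.monotonic):
        self._stale_after = stale_after  # seconds without a beat before a thread counts as dead
        self._clock = clock              # injectable so the logic can be tested without sleeping
        self._beats: Dict[str, float] = {}
        self._lock = threading.Lock()

    def beat(self, name: str) -> None:
        """Called by a worker thread (hypothetically TimeoutService, the task loop) each iteration."""
        with self._lock:
            self._beats[name] = self._clock()

    def stale(self) -> List[str]:
        """Names of registered threads whose last heartbeat is older than stale_after."""
        now = self._clock()
        with self._lock:
            return sorted(n for n, t in self._beats.items()
                          if now - t > self._stale_after)

    def all_dead(self) -> bool:
        """True when at least one thread is registered and every one of them is stale."""
        with self._lock:
            registered = len(self._beats)
        return registered > 0 and len(self.stale()) == registered


def run_watchdog(dog: Watchdog, on_zombie: Callable[[], None],
                 interval: float, stop: threading.Event) -> None:
    """Poll every `interval` seconds; call on_zombie once if all threads are stale."""
    while not stop.wait(interval):
        if dog.all_dead():
            on_zombie()
            return
```

One possible wiring is to run run_watchdog in its own daemon thread and pass an on_zombie callback that terminates the process (for example via os._exit(1)), so that the container exits and a restart policy, if one is configured, brings the worker back. The same stale() result could instead back the container health check, so the container stops reporting 'healthy' while its threads are dead.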
