#725 CRITICAL: Zombie transactions block task recovery when worker pods crash
## Summary
When a worker pod crashes during task execution, its PostgreSQL connection can remain open with an uncommitted (zombie) transaction, holding exclusive row locks that block all recovery attempts from other workers.
## Root Cause Analysis
1. Worker pod highway-worker-d7db8c577-twqhs claimed the task at 16:13:01
2. The worker crashed/died during execution, before committing (the claim pattern is sketched after this list)
3. The PostgreSQL connection stayed open as a half-open TCP connection; the server never detected the client's death
4. idle_in_transaction_session_timeout = 0 (disabled), so zombie transactions live forever
5. The zombie transaction holds an exclusive lock on a jumper.r_highway_default row
6. 10+ recovery attempts from both clusters (minikube and srv2) are blocked waiting for this lock
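The sequence hinges on step 2: the claim happens inside a transaction that only commits when the task finishes, so a dead worker leaves the row lock held until the server notices the dead connection. Below is a minimal sketch of that pattern, assuming the claim takes the row lock with SELECT ... FOR UPDATE; the actual worker code may use an uncommitted UPDATE instead, and the column name, DSN, and run_task() placeholder are hypothetical. The table name and run id are taken from this report.

```python
# Minimal sketch of the assumed claim pattern. Table name and run id come from
# the report; the column name, DSN, and run_task() are hypothetical.
import psycopg2


def run_task():
    """Placeholder for the long-running task body."""


conn = psycopg2.connect("dbname=highway")  # hypothetical DSN
with conn:  # the transaction commits only when this block exits normally
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM jumper.r_highway_default WHERE run_id = %s FOR UPDATE",
            ("019b65bc-0b7f-7246-9afd-73d889451639",),
        )
        # If the pod is OOM-killed or the node dies here, the commit never
        # happens, the row lock stays granted to this backend, and every
        # recovery UPDATE from other workers queues up behind it.
        run_task()
conn.close()
```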
## Evidence
- pg_stat_activity shows backend PID 69623 idle in transaction for over 31 minutes
- pg_locks shows multiple UPDATE statements with granted = false, i.e. waiting on the lock (the diagnostic queries after this list reproduce these checks)
- Multiple workers logged `Failing stuck run` with no success or error follow-up
- Task run 019b65bc-0b7f-7246-9afd-73d889451639 is still in the `running` state
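The checks above can be reproduced from the standard catalog views. A short sketch, assuming a psycopg2 connection to the same database (the DSN is illustrative); pg_blocking_pids() requires PostgreSQL 9.6 or later:

```python
# Diagnostic sketch: find zombie "idle in transaction" sessions and the
# sessions blocked behind them. DSN is a hypothetical placeholder.
import psycopg2

conn = psycopg2.connect("dbname=highway")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # Zombie candidates: open transactions with no statement currently running.
    cur.execute("""
        SELECT pid, now() - xact_start AS xact_age, query
        FROM pg_stat_activity
        WHERE state = 'idle in transaction'
        ORDER BY xact_age DESC
    """)
    for pid, age, query in cur.fetchall():
        print(f"zombie? pid={pid} open for {age}: {query}")

    # Sessions waiting on locks held by those zombies.
    cur.execute("""
        SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
    """)
    for pid, blocked_by, query in cur.fetchall():
        print(f"blocked pid={pid} waiting on {blocked_by}: {query}")
conn.close()
```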
## Impact
- CRITICAL: A single crashed worker can block task recovery INDEFINITELY
- All workers trying to recover the task are blocked
- Multi-cluster deployment makes this more likely
## Fix Required
1. IMMEDIATE: Set idle_in_transaction_session_timeout to 300000 ms (5 minutes) in PostgreSQL
2. Set statement_timeout and lock_timeout to reasonable values
3. Configure TCP keepalive on database connections
4. Add a connection pool idle timeout in engine/db.py (fixes 1-4 are sketched together after this list)
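A combined sketch of the four fixes follows. It is not the actual engine/db.py change: the DSN, the SQLAlchemy URL, and the specific timeout values are assumptions, and the client-side part assumes engine/db.py builds its pool with SQLAlchemy (with raw psycopg2, the same connect_args can be passed to psycopg2.connect). SQLAlchemy has no direct "idle timeout" knob; pool_recycle plus pool_pre_ping is the closest standard combination.

```python
# Sketch of fixes 1-4. All connection strings and timeout values below are
# illustrative assumptions, not values taken from the deployment.
import psycopg2
from sqlalchemy import create_engine

# --- Fixes 1-2: server-side timeouts (requires superuser) --------------------
admin = psycopg2.connect("dbname=highway user=postgres")  # hypothetical DSN
admin.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with admin.cursor() as cur:
    cur.execute("ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min'")
    cur.execute("ALTER SYSTEM SET statement_timeout = '60s'")  # value TBD per workload
    cur.execute("ALTER SYSTEM SET lock_timeout = '10s'")       # value TBD per workload
    cur.execute("SELECT pg_reload_conf()")
admin.close()

# --- Fixes 3-4: client-side keepalives and pool hygiene ----------------------
engine = create_engine(
    "postgresql+psycopg2://highway@db/highway",  # hypothetical URL
    pool_pre_ping=True,   # detect dead connections before handing them out
    pool_recycle=300,     # replace connections older than 5 minutes at checkout
    connect_args={
        # libpq TCP keepalives so the kernel notices a dead peer quickly
        "keepalives": 1,
        "keepalives_idle": 30,
        "keepalives_interval": 10,
        "keepalives_count": 3,
        # per-session timeouts, independent of the server-wide defaults
        "options": "-c idle_in_transaction_session_timeout=300000 "
                   "-c lock_timeout=10000",
    },
)
```

With these settings in place, a crashed worker's session is terminated by the server after at most the idle-in-transaction timeout, the kernel keepalives detect the half-open socket sooner, and blocked recovery attempts fail fast with lock_timeout instead of queueing indefinitely.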