#725 CRITICAL: Zombie transactions block task recovery when worker pods crash

closed critical Created 2025-12-28 16:45 · Updated 2025-12-28 17:08

Description

## Summary

When a worker pod crashes during task execution, the PostgreSQL connection may stay open as a zombie transaction, holding exclusive locks that block ALL recovery attempts from other workers.

## Root Cause Analysis

1. Worker pod highway-worker-d7db8c577-twqhs claimed the task at 16:13:01.
2. The worker crashed/died during execution, before committing.
3. The PostgreSQL connection stayed open (TCP half-open; the server never detected the client's death).
4. `idle_in_transaction_session_timeout = 0` (disabled), so zombie transactions live forever.
5. The zombie transaction holds an exclusive lock on the jumper.r_highway_default row.
6. 10+ recovery attempts from both clusters (minikube + srv2) are blocked waiting for this lock.

## Evidence

- `pg_stat_activity` shows PID 69623 idle in transaction for 31+ minutes (see the diagnostic sketch below).
- `pg_locks` shows multiple waiting UPDATE queries with `granted = false`.
- Multiple workers logged `Failing stuck run` but never logged a success or error follow-up.
- Task run 019b65bc-0b7f-7246-9afd-73d889451639 is still in the `running` state.

## Impact

- CRITICAL: a single crashed worker can block task recovery indefinitely.
- All workers trying to recover the task are blocked.
- The multi-cluster deployment makes this more likely.

## Fix Required

1. IMMEDIATE: set `idle_in_transaction_session_timeout` to 300000 (5 minutes) in PostgreSQL.
2. Set `statement_timeout` and `lock_timeout` to reasonable values.
3. Configure TCP keepalive on database connections.
4. Add a connection pool idle timeout in engine/db.py (see the connection-configuration sketch below).
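
For triage, here is a minimal diagnostic sketch (not part of the codebase) that lists idle-in-transaction sessions older than a threshold and can optionally terminate them with `pg_terminate_backend`. The DSN, the 5-minute threshold, and the auto-terminate flag are assumptions for illustration, not policy.

```python
"""Sketch: find (and optionally kill) zombie idle-in-transaction sessions.

Assumes psycopg2 and a DSN with sufficient privileges; the threshold and
the terminate flag are illustrative.
"""
import psycopg2

DSN = "postgresql://postgres@localhost:5432/highway"  # hypothetical DSN

FIND_ZOMBIES = """
    SELECT pid, state, now() - state_change AS idle_for, query
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
      AND now() - state_change > interval '5 minutes'
"""

def list_zombies(terminate: bool = False) -> None:
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(FIND_ZOMBIES)
        for pid, state, idle_for, query in cur.fetchall():
            print(f"PID {pid}: {state} for {idle_for} -- {(query or '')[:80]}")
            if terminate:
                # pg_terminate_backend() closes the session and releases its locks.
                cur.execute("SELECT pg_terminate_backend(%s)", (pid,))

if __name__ == "__main__":
    list_zombies(terminate=False)
```

Terminating the zombie backend (PID 69623 in this incident) should release the lock on jumper.r_highway_default and let the blocked recovery UPDATEs proceed.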
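
For the fix, a minimal sketch of the proposed connection settings, assuming engine/db.py builds its engine with SQLAlchemy over psycopg2. The factory name `make_engine`, the DSN argument, and the exact timeout values are illustrative, not the module's current contents.

```python
"""Sketch of proposed connection settings for engine/db.py (SQLAlchemy + psycopg2 assumed)."""
from sqlalchemy import create_engine

def make_engine(dsn: str):
    return create_engine(
        dsn,
        connect_args={
            # TCP keepalive (libpq parameters): detect dead peers so a crashed
            # worker does not leave a half-open connection holding locks.
            "keepalives": 1,
            "keepalives_idle": 30,      # seconds of idle before the first probe
            "keepalives_interval": 10,  # seconds between probes
            "keepalives_count": 3,      # failed probes before the connection is dropped
            # Server-side safety nets for this session (milliseconds).
            "options": (
                "-c idle_in_transaction_session_timeout=300000 "
                "-c statement_timeout=60000 "
                "-c lock_timeout=10000"
            ),
        },
        pool_pre_ping=True,  # validate pooled connections before handing them out
        pool_recycle=300,    # replace pooled connections older than 5 minutes
    )
```

The same timeouts can also be applied cluster-wide in postgresql.conf (or via ALTER SYSTEM), which would additionally cover any client that does not go through this engine factory.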
