.issue.db
/highway-workflow-engine
Edit Issue #725
Description
## Summary

When a worker pod crashes during task execution, the PostgreSQL connection may stay open as a zombie transaction, holding exclusive locks that block ALL recovery attempts from other workers.

## Root Cause Analysis

1. Worker pod `highway-worker-d7db8c577-twqhs` claimed task at 16:13:01
2. Worker crashed/died during execution (before committing)
3. PostgreSQL connection stayed open (TCP half-open: server didn't detect client death)
4. `idle_in_transaction_session_timeout = 0` (disabled) means zombie transactions live forever
5. Zombie transaction holds exclusive lock on `jumper.r_highway_default` row
6. 10+ recovery attempts from both clusters (minikube + srv2) are BLOCKED waiting for this lock

## Evidence

- `pg_stat_activity` shows PID 69623 idle in transaction for 31+ minutes
- `pg_locks` shows multiple UPDATE queries with `granted=false` (waiting for lock)
- Multiple workers logged "Failing stuck run" but no success/error follow-up
- Task run `019b65bc-0b7f-7246-9afd-73d889451639` still in running state

## Impact

- CRITICAL: A single crashed worker can block task recovery INDEFINITELY
- All workers trying to recover the task are blocked
- Multi-cluster deployment makes this more likely

## Fix Required

1. IMMEDIATE: Set `idle_in_transaction_session_timeout` to 300000 (5 minutes) in PostgreSQL
2. Set `statement_timeout` and `lock_timeout` to reasonable values
3. Configure TCP keepalive on database connections
4. Add connection pool idle timeout in `engine/db.py`
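
## Example: connection settings sketch

The fixes above could be applied per connection roughly as follows. This is a minimal sketch, not the actual contents of `engine/db.py`: it assumes the engine connects through a libpq-based driver that accepts standard libpq keyword parameters. The helper name `build_connect_kwargs`, the constant `ZOMBIE_QUERY`, and every timeout value other than the 5-minute `idle_in_transaction_session_timeout` are hypothetical placeholders.

```python
# Hypothetical helper; names and most values are assumptions for illustration.

def build_connect_kwargs(
    idle_in_tx_timeout_ms: int = 300_000,  # 5 minutes, as proposed in this issue
    statement_timeout_ms: int = 60_000,    # assumed value, tune per workload
    lock_timeout_ms: int = 10_000,         # assumed value, tune per workload
) -> dict:
    """Build libpq keyword parameters that bound how long a session may
    idle inside a transaction, run a statement, or wait for a lock."""
    # Session-level server settings, passed through libpq's "options" parameter.
    session_options = " ".join(
        [
            f"-c idle_in_transaction_session_timeout={idle_in_tx_timeout_ms}",
            f"-c statement_timeout={statement_timeout_ms}",
            f"-c lock_timeout={lock_timeout_ms}",
        ]
    )
    return {
        "options": session_options,
        # Client-side TCP keepalive (standard libpq parameters, in seconds).
        "keepalives": 1,
        "keepalives_idle": 30,      # assumed: idle time before the first probe
        "keepalives_interval": 10,  # assumed: gap between probes
        "keepalives_count": 3,      # assumed: failed probes before giving up
    }


# Diagnostic query matching the evidence above: sessions that have been
# idle in transaction for longer than five minutes.
ZOMBIE_QUERY = """
SELECT pid, state, xact_start, now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '5 minutes'
"""

if __name__ == "__main__":
    for key, value in build_connect_kwargs().items():
        print(f"{key}={value!r}")
```

If the engine uses psycopg2 (an assumption), the result could be passed as `psycopg2.connect(dsn, **build_connect_kwargs())`. Note that client-side keepalives mainly help the worker notice a dead server; for PostgreSQL itself to notice a dead client, the server-side `tcp_keepalives_idle`, `tcp_keepalives_interval`, and `tcp_keepalives_count` settings are the relevant ones, and `idle_in_transaction_session_timeout` remains the backstop for a half-open connection.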