#752 Zombie transaction blocks stuck task detection causing cascading lock convoy
Description
EditROOT CAUSE ANALYSIS:
A worker crash (SystemExit, OOM, etc.) leaves a transaction in 'idle in transaction' state. This zombie transaction holds locks on workflow_state rows. When fail_stuck_tasks cron job tries to UPDATE r_highway_default to mark runs as failed, it blocks waiting for these locks. All subsequent cron job executions pile up, creating a lock convoy. Result: stuck task detection completely broken - orphaned tasks never get cleaned up.
EVIDENCE FROM PRODUCTION:
- PID 532152: idle in transaction, holding transactionid lock 4974063
- 16+ fail_stuck_tasks queries blocked for 2-18+ minutes
- Workflow run 2adf633f-4824-4f49-a66b-9311035c006a stuck in pending forever
FIXES REQUIRED:
1. Add PostgreSQL timeout settings (idle_in_transaction_session_timeout, lock_timeout)
2. Add SET LOCAL lock_timeout in fail_stuck_tasks function
3. Ensure transaction cleanup on worker crash (handle SystemExit properly)
Comments
Loading comments...
Context
Loading context...
Audit History
View AllLoading audit history...