#126 Increase max_attempts for workflow tasks - too low for disruption resilience
Description
Increase max_attempts from 3 to 64 for workflow tasks to handle disruption scenarios.
RISK ANALYSIS:
- With 3 attempts and 30s claim_timeout, a task has ~90s total before permanent failure
- Under disruption, 3 workers could crash sequentially, exhausting retries in <2 minutes
- With 64 attempts, tasks can survive extended outages (64 * 30s = ~32 minutes at the current 30s timeout, or 16 minutes at the proposed 15s timeout)
CONFIGURATION:
- max_attempts: 64 (from 3)
- claim_timeout: 15s (reduced from 30s for faster failover)
- Total survival window: 64 * 15s = 16 minutes of continuous disruption
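The survival-window figures above can be checked with a short calculation. This is a minimal sketch, assuming the worst case where every attempt is lost and the task only becomes claimable again after the full claim_timeout elapses. The function names are hypothetical and are not part of the workflow code.

```python
def survival_window_seconds(max_attempts: int, claim_timeout_s: int) -> int:
    """Worst-case time before a task fails permanently.

    Assumption (not confirmed by the ticket): each attempt is lost and the
    task is reclaimed only after the full claim timeout has expired.
    """
    if max_attempts < 1 or claim_timeout_s < 0:
        raise ValueError("max_attempts must be >= 1 and claim_timeout_s >= 0")
    return max_attempts * claim_timeout_s


def format_window(seconds: int) -> str:
    """Render a duration in seconds as minutes and seconds."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m {secs:02d}s"


# (max_attempts, claim_timeout in seconds) for the cases named in this ticket
SCENARIOS = {
    "current: 3 attempts, 30s timeout": (3, 30),
    "64 attempts, 30s timeout": (64, 30),
    "proposed: 64 attempts, 15s timeout": (64, 15),
}

if __name__ == "__main__":
    for label, (attempts, timeout) in SCENARIOS.items():
        window = survival_window_seconds(attempts, timeout)
        print(f"{label}: {format_window(window)}")
    # current: 3 attempts, 30s timeout: 1m 30s
    # 64 attempts, 30s timeout: 32m 00s
    # proposed: 64 attempts, 15s timeout: 16m 00s
```

Note that reducing claim_timeout from 30s to 15s halves the window that 64 attempts would otherwise provide, so the two settings should be tuned together.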
TRADEOFF:
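- A higher retry count delays detection of permanent failures: a task that can never succeed may keep retrying for the full survival window before it is marked failed
- This cost is acceptable because the extra retries are essential for enterprise resilience during rolling deployments, network partitions, etc.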
Files to update:
- Default max_attempts in workflow submission
- Worker claim_timeout in systemd service