#126 Increase max_attempts for workflow tasks - too low for disruption resilience

closed high Created 2025-11-27 06:07 · Updated 2025-11-27 07:00

Description

Increase max_attempts from 3 to 64 for workflow tasks to handle disruption scenarios.

RISK ANALYSIS:
- With 3 attempts and 30s claim_timeout, a task has ~90s total before permanent failure
- Under disruption, 3 workers could crash sequentially, exhausting retries in under 2 minutes
- With 64 attempts, tasks can survive extended outages (up to ~32 minutes with 30s timeout)

CONFIGURATION:
- max_attempts: 64 (from 3)
- claim_timeout: 15s (reduced from 30s for faster failover)
- Total survival window: 64 * 15s = 16 minutes of continuous disruption

TRADEOFF:
- Higher retry count = longer before permanent failure detection
- But essential for enterprise resilience during rolling deployments, network partitions, etc.

Files to update:
- Default max_attempts in workflow submission
- Worker claim_timeout in systemd service
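
The survival-window figures above can be checked with a short sketch. It assumes the worst case in which every attempt is lost to a worker crash and the task is only reclaimed once claim_timeout expires, so the window is simply max_attempts * claim_timeout. The function name survival_window_seconds is hypothetical and is not part of the workflow codebase.

```python
def survival_window_seconds(max_attempts: int, claim_timeout_s: int) -> int:
    """Worst-case time before permanent failure.

    Assumption: each attempt is lost to a crash and is only retried
    after the full claim_timeout has elapsed.
    """
    return max_attempts * claim_timeout_s


# (max_attempts, claim_timeout in seconds) for the cases named in this issue
configs = {
    "current (3 attempts, 30s)": (3, 30),
    "64 attempts, 30s": (64, 30),
    "proposed (64 attempts, 15s)": (64, 15),
}

for label, (attempts, timeout) in configs.items():
    total = survival_window_seconds(attempts, timeout)
    print(f"{label}: {total}s (~{total / 60:.1f} min)")
```

Under these assumptions the three cases come out to 90s, 1920s (32 minutes), and 960s (16 minutes), matching the numbers in the description. The same arithmetic shows the tradeoff: a task that can never succeed takes up to 16 minutes to be marked permanently failed.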
