#719 Parallel branch retry creates orphaned waits due to absurd_run_id in join_event_name

closed critical Created 2025-12-27 02:52 · Updated 2025-12-27 02:52

Description

FIXED: When a workflow with parallel branches is retried (e.g., after a worker crash), join_event_name was built from absurd_run_id, which changes on each retry. Branches spawned before the crash emit events under the OLD absurd_run_id, but after the retry wait_for_branch listens for the NEW absurd_run_id, so the wait is orphaned and ends in a timeout or deadlock.

Root cause: operators.py line 644 used ctx.absurd_run_id (changes on retry) instead of ctx.workflow_run_id (stable).

Fix: Changed parallel_id generation to use workflow_run_id:
- engine/interpreters/operators.py:644
- engine/replay/replay_context.py:207

The idempotency_key at line 685 already used workflow_run_id correctly, so branches won't be duplicated on retry. The fix ensures join_event_name is stable across retries. All 447 tests pass.
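The mismatch described above can be illustrated with a minimal sketch. It is not the engine's actual code: the Ctx class, the retry helper, the event name format, and both join_event_name_* functions are hypothetical stand-ins. Only the field names absurd_run_id and workflow_run_id come from this issue.

```python
from dataclasses import dataclass, replace
import uuid


@dataclass(frozen=True)
class Ctx:
    """Hypothetical stand-in for the engine's run context."""
    workflow_run_id: str  # stable for the whole workflow run
    absurd_run_id: str    # reassigned on every retry


def retry(ctx: Ctx) -> Ctx:
    # Simulate a retry: same workflow run, fresh per-attempt id.
    return replace(ctx, absurd_run_id=uuid.uuid4().hex)


def join_event_name_old(ctx: Ctx, step: str) -> str:
    # Pre-fix behaviour: keyed on the per-attempt id (format is assumed).
    return f"parallel:{ctx.absurd_run_id}:{step}:join"


def join_event_name_fixed(ctx: Ctx, step: str) -> str:
    # Post-fix behaviour: keyed on the stable workflow run id.
    return f"parallel:{ctx.workflow_run_id}:{step}:join"


first_attempt = Ctx(workflow_run_id="wf-1", absurd_run_id=uuid.uuid4().hex)
second_attempt = retry(first_attempt)

# Old scheme: branches emit under the first name, the retried waiter
# listens on the second, so the two never meet.
old_names_match = (
    join_event_name_old(first_attempt, "fanout")
    == join_event_name_old(second_attempt, "fanout")
)

# Fixed scheme: emitter and waiter agree across retries.
fixed_names_match = (
    join_event_name_fixed(first_attempt, "fanout")
    == join_event_name_fixed(second_attempt, "fanout")
)
```

Under these assumptions, old_names_match is False and fixed_names_match is True, which matches the behaviour described above: only a name derived from workflow_run_id lets a retried waiter observe events emitted before the crash.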
