#263 Code Versioning: Enforce Artifact Loading for Deterministic Replay
Description
Edit## Problem
tools.python.run imports code dynamically from sys.path (local disk). If a worker is updated with new code while processing an old workflow event, it may execute new logic against old state, causing non-deterministic replay failures.
## Impact
- Replay debugging becomes unreliable if code changes between execution and replay
- Production workflows may behave differently on retry if code was deployed mid-execution
- Non-deterministic failures that are impossible to reproduce
## Technical Details
- Current: tools.python.run imports from local sys.path
- Problem: Code on disk may differ from code that originally created the workflow
- Scenario:
1. Workflow v1.0 starts execution
2. Code deployed: v1.1 with breaking changes
3. Worker restarts mid-workflow
4. Workflow resumes with v1.1 code but v1.0 state
5. Non-deterministic behavior or crash
## Proposed Solution: Strict Artifact Loading
Workers should prioritize loading the code version hash referenced in the workflow_run (stored in DataShard) over local disk code.
## Implementation Steps
### Phase 1: Code Hash Tracking
1. On workflow submission, compute SHA256 of relevant Python modules
2. Store code_version_hash in workflow_runs table
3. Store actual code artifacts in DataShard
### Phase 2: Artifact Loading
1. Modify tools.python.run to check for workflow code version
2. If version mismatch detected:
- Option A: Load code from DataShard artifact
- Option B: Fail with clear error message
3. Add configuration flag: strict_code_versioning (default: warn)
### Phase 3: Validation
1. Add startup check comparing worker code hash vs active workflows
2. Log warning if mismatched workflows detected
3. Add CLI command: hwe validate-code-versions
## Configuration Options
strict_code_versioning:
- disabled: No version checking (current behavior)
- warn: Log warning but continue execution
- strict: Fail on version mismatch
- artifact: Load from DataShard artifact
## Migration Strategy
1. Deploy with strict_code_versioning=warn initially
2. Monitor for version mismatch warnings
3. Enable strict mode once confident
## Acceptance Criteria
- [ ] Code version hash stored on workflow creation
- [ ] Version mismatch detection implemented
- [ ] Warning mode logs mismatches
- [ ] Strict mode fails on mismatch
- [ ] Documentation updated with versioning strategy
## Tags: reliability, replay, versioning, determinism
Comments
Loading comments...
Context
Loading context...
Audit History
View AllLoading audit history...