Lessons Learned [highway-workflow-engine]

Lessons

Category	Lesson	Issue	Date
security	Security audit pattern: After fixing one bug class (missing filter/prefix), always search codebase for similar patterns. Issues #795-802 found 7 similar bugs by searching for: (1) SQL function calls without schema prefix (2) queries without workflow_run_id filter (3) queries without tenant_id filter (4) fallback code paths that bypass security checks.	-	2026-01-02
database	Always use explicit schema prefix for PostgreSQL function calls. When calling PostgreSQL functions from Python, use explicit schema qualification (e.g., 'highway.function_name') rather than relying on search_path. Different database connections may have different search_path settings. Issue #796 was caused by calling release_rate_token without the highway prefix.	#796	2026-01-02
architecture	SINGLETON PATTERN: When using singletons with configuration parameters (like queue_name), the singleton caches the FIRST configuration and ignores subsequent calls. Use per-configuration singletons (dict[config_key, instance]) or factory pattern instead.	#794	2026-01-02
general	DISTRIBUTED SYSTEMS AUDIT CHECKLIST: 1) Always check FOR UPDATE vs SKIP LOCKED semantics in heartbeat/timeout pairs 2) Event emit must always include workflow_run_id after scoping migration 3) DLQ operations need idempotency keys 4) Secret caches need short TTLs for security rotation 5) TOCTOU races occur when SELECT and UPDATE are not in same atomic operation 6) Alert cooldowns must be persisted to survive restarts 7) Approval tokens need lifecycle management tied to workflow completion	-	2026-01-02
concurrency	Make Jumper task completion idempotent. When complete_task() encounters 'not currently running' error, check if already completed and treat as success. Race conditions between parallel branch wake events are expected.	#775	2026-01-02
testing	Docker volume file sync can take 8-30 seconds. When validating files from workflow containers, poll for expected content marker (e.g. FINAL_VERIFICATION_PASSED) with 30s timeout instead of just checking file existence.	-	2026-01-02
general	workflow_run.status must be synced from jumper task state via database trigger (sync_workflow_run_status). Without this trigger, running workflows show as 'pending'. The trigger fires on state changes in jumper.r_highway_* tables and updates workflow_run accordingly.	-	2026-01-02
general	Platform tenants (_platform, platform) must be exempt from rate limits. These are internal system tenants that run bootstrap, cron jobs, and internal operations. RATE_LIMIT_EXEMPT_TENANTS constant in tool_rate_limiter.py and tenant_rate_limiter.py controls this.	-	2026-01-02
architecture	State machine validation: Always use WHERE clauses that check both current state AND ownership. Jumper sets state=running in claim_task, so checks must use state=running not claimed.	#765	2026-01-02
general	Circuit breaker per-workflow isolation: When circuit_breaker_per_workflow=true, each workflow gets isolated circuit breakers (key format: resource_wf_XXXXX). Circuit breaker states are stored in resilient_circuit_db.public.rc_circuit_breakers. Cooldown is typically 30-60s. If a workflow completes while circuit is OPEN, subsequent workflows are NOT affected - each has its own circuit. Check open_until timestamp to verify cooldown status.	-	2026-01-02
cron	Hybrid cron watchdog pattern: Use a meta-cron (cron_watchdog) to monitor all other crons. On worker startup, only check if watchdog is alive (1 query). Watchdog runs every 5 minutes and recovers dead crons. This eliminates N workers doing N startup checks.	#763	2026-01-02
general	Durable cron jobs need recovery on worker restart. When the internal worker dies during the spawn_next step, the cron chain dies completely. Added _recover_dead_cron_chains() function to worker.py that checks for dead chains on startup and respawns them via internal.platform.spawn_cron_jobs.	#762	2026-01-01
sandbox	Sandbox restrictions apply to ALL imports during workflow execution, including engine internal code. Function-level imports of banned modules (time, random, datetime, etc.) in engine code will fail if called during workflow execution. Always use module-level imports for banned modules in engine code.	#760	2026-01-01
architecture	HeartbeatService MUST update both last_heartbeat AND claim_expires_at. The claim is a renewable lease extended on each heartbeat. Cleanup cron checks BOTH fields before marking task as failed.	#759	2026-01-01
architecture	demo_artifact_bootstrap: Demo artifacts are automatically uploaded during platform bootstrap for demo tenant. Use compute_artifact_id_from_source() for deterministic IDs. Source files in engine/platform_core/demo_artifacts/. Registry table highway.demo_artifact_registry tracks name-to-ID mappings.	#753	2026-01-01
general	Zombie transactions and lock convoys: When workers crash (SystemExit, OOM), transactions may be left in 'idle in transaction' state, holding locks indefinitely. This blocks cleanup jobs, causing lock convoy cascades. Defense: (1) Set idle_in_transaction_session_timeout at connection level to auto-kill zombies after 60s, (2) Use SET LOCAL lock_timeout in cleanup operations to fail fast, (3) Catch BaseException (not just Exception) and explicitly rollback.	-	2026-01-01
architecture	DBOS vs Highway: DBOS uses deterministic replay while Highway uses atomic checkpointing. Highway's atomic transaction model is superior for mission-critical workloads. Gaps: Kafka (#744), priority queues (#745), cron (#746), debouncing (#747), forking (#748), partitioned queues (#749). DO NOT add: SQLite, async, separate app DB. See docs/HIGHWAY_VS_DBOS_ANALYSIS.md	-	2025-12-30
general	Integration test categories failing on production K8s: (1) Docker tools - 9 tests fail, workers lack Docker socket, (2) Artifacts - 2 tests pending, worker queue issue, (3) Circuit breaker - 3 tests, returns UNKNOWN, (4) Datashard logging - 3 tests, shell commands fail, (5) Event/IPC - 3 tests, workflows fail immediately. Quick run: HIGHWAY_API_URL=https://highway.solutions pytest tests/integration/ -n 2 -v	-	2025-12-29
Deployment	Docker-compose env_file with explicit environment: block causes issues - the environment: block interpolates from HOST at parse time, overriding env_file values with empty strings. Solution: Remove explicit vault token mappings from environment: block, let env_file handle them entirely.	#726	2025-12-28
general	PostgreSQL zombie transactions can block distributed task recovery indefinitely. When worker pods crash during DB transactions, TCP connections may stay half-open. PostgreSQL won't detect client death, leaving transaction 'idle in transaction' forever holding exclusive locks. FIX: Set idle_in_transaction_session_timeout=300000 (5 min), configure TCP keepalive. DETECT: SELECT * FROM pg_stat_activity WHERE state='idle in transaction'	-	2025-12-28
general	Signal handler pattern for blocking waits: When signal handler sets a shutdown flag, also set any Event objects that threads may be waiting on. Otherwise threads blocked on Event.wait() won't wake until timeout.	-	2025-12-28
general	Docker signal handling: When using shell expansion in docker-compose command, add 'exec' before the actual command to make the process PID 1 and receive signals directly. Without exec, the shell is PID 1 and absorbs SIGTERM.	-	2025-12-28
best-practice	Use sync httpx.Client not async httpx.AsyncClient in workers - asyncio can cause zombie workers where container shows healthy but all threads are dead. httpx.Client is equally capable and avoids asyncio concurrency bugs. ThreadPoolExecutor workarounds for asyncio are fragile.	#721	2025-12-27
architecture	Docker-in-Docker temp files must use shared mount path: When spawning containers from inside containers (DinD with Docker socket), temp files created in /tmp exist only inside the parent container. Volume mounts use HOST paths. Use a shared mount path (same absolute path on host and container) for files that need to be mounted into child containers.	#720	2025-12-27
architecture	Parallel IDs must use workflow_run_id not absurd_run_id: When generating identifiers that need to persist across retries (like parallel branch join_event_name), ALWAYS use workflow_run_id (stable across retries) not absurd_run_id (changes on each retry). absurd_run_id represents one execution attempt, workflow_run_id represents the entire workflow lifecycle.	#719	2025-12-27
architecture	ActivityOperator is queue-only: When creating internal operators (like ReflexiveOperator creating ActivityOperator), execute_activity_operator only QUEUES the activity and returns immediately. Internal operators must add their own await_event() call to wait for completion. See engine/interpreters/inline_executor.py _execute_reflexive_operator	#713	2025-12-25
security	Python DSL must execute in isolated container: No DB libs (psycopg), no network libs (requests), no secrets. Use separate Docker network (internal:true). API calls http://dsl-compiler:8080/compile.	#678	2025-12-24
api/visibility	Terminal workflow states (completed, compensated, failed) should always show progress_percentage=100. Switch/conditional branches that aren't executed shouldn't reduce progress for finished workflows.	#677	2025-12-24
api/visibility	Progress calculation requires unlimited query: (1) Fork/join workflows have branch step_succeeded events not in task_ids - use set intersection. (2) LIMIT 50 drops early events - split into limited query for UI and unlimited query for progress count. See workflows.py:1424-1471.	#675	2025-12-24
PostgreSQL	JSONB column data extraction: When querying PostgreSQL tables with JSONB columns (like absurd_event_log.payload), data nested in JSONB must be extracted using arrow operators (payload->>'key' for text, payload->'key' for JSONB). Always verify the schema with \d in psql before writing SQL - event logs often store step names and results inside payload JSONB, not as separate columns.	#674	2025-12-24
database	Always use row_factory=dict_row when creating psycopg3 cursors for dict access. Use get_db_cursor() helper which sets this by default.	#673	2025-12-24
general	RAG cache must use PostgreSQL not file system - /tmp is container-local in Docker, so each container computes embeddings separately causing hundreds of /embedding API calls. Use rag_embedding_cache table with TTL for shared cache across containers.	-	2025-12-24
general	Branch tasks must NEVER update parent workflow_run status. Always check 'task_name != BRANCH_EXECUTION_TASK' before updating workflow_run for any status (running, completed, sleeping, failed). The parent workflow controls its own lifecycle - branches only emit events.	-	2025-12-24
orchestrator	CRITICAL: Always sync workflow_run.status with absurd task state. When task state changes (sleeping, cancelled, retrying, abandoned), update workflow_run accordingly. Absurd is source of truth, workflow_run is UI display layer.	-	2025-12-23
security	Security: When implementing IPC proxies for sandboxed code, never blindly forward all method calls. Always enforce a strict whitelist of allowed methods on the server side to prevent sandboxed code from accessing dangerous internal APIs.	-	2025-12-22
ScriptPlan	Working Hours Correct Usage: (1) Use project.dateToIdx(naive_datetime) - datetime MUST be NAIVE; (2) Get WorkingHours from shift.get('workinghours', 0) not resource.workinghours; (3) Use resource.get('leaves', 0) not resource.leaves which is a method; (4) Project must be scheduled with schedule=True	-	2025-12-21
architecture	For distributed rate limiting, use PostgreSQL atomic operations (INSERT ON CONFLICT DO UPDATE RETURNING). Two-phase architecture: CPU-heavy scheduling done once on config change (parse TJP, cache report in JSONB), fast runtime checks compare cached windows. Always support fail_open for graceful degradation.	#635	2025-12-20
general	Dockerfile pre-installed apps: The find command that deletes .py files after bytecode compilation must exclude engine/apps/pre_installed/*.py because the bootstrap command reads source code from .py files to store in the database. Added an exclusion pattern to the Dockerfile at lines 85-88.	-	2025-12-20
general	ScriptPlan Tool Architecture: Use wait_for_event() with timeout for interruptible scheduling instead of plain ctx.sleep(). This allows schedule updates to wake the sleeping scheduler immediately via emit_event(). State is stored entirely in workflow variables (ctx.get/set_variable) - no custom DB tables needed. The state reset pattern (delete_checkpoint + loop) prevents history accumulation in long-running schedulers.	-	2025-12-19
code-duplication	SandboxedDurableContext has TWO implementations that must stay in sync: (1) engine/sandbox/sandboxed_context.py (for reference/typing), (2) inline string in engine/sandbox/sandboxed_executor.py:218-325 (actually injected into containers). When adding methods to DurableContext API, both must be updated.	-	2025-12-19
exception-handling	AbsurdSleepError in operators.py _execute_with_retry() must be explicitly re-raised BEFORE any generic Exception handler. It is a suspension signal, not a failure. If caught as generic Exception, parallel branch workflows fail incorrectly. Fix: add 'except AbsurdSleepError: raise' before 'except Exception'.	-	2025-12-19
general	PostgreSQL idle_in_transaction_session_timeout kills connections held open during long task execution. The inline_executor creates SAVEPOINTs before tasks and releases after - if task runs longer than timeout, connection dies. Quick fix: increase timeout in docker-compose.yml. Proper fix: don't hold transactions during execution (ticket #625).	-	2025-12-19
general	Cython Optimization - Added three Cython modules for performance: variable_resolver_cy.pyx (30-50% faster), schema_hash_cy.pyx (40-60% faster), chunking_cy.pyx. Key learnings: 1) Use cpdef for functions callable from Python and C. 2) Closures not supported in cpdef - use def instead. 3) Create compat layers for graceful fallback. 4) I/O-bound code (like loop operators with checkpoint saves) has limited Cython ROI.	-	2025-12-18
parallel_execution	Parallel branch completion events can cause 'thundering herd' problem where multiple workers pick up the same workflow. Each branch completion emits a NOTIFY, waking up all workers. Without SELECT FOR UPDATE or similar locking, multiple workers execute the same workflow causing duplicate task execution. This was discovered by the ultimate_correctness.py test which uses atomic counters to detect exactly-once violations.	#615	2025-12-18
tools	tools.python.run expects a MODULE PATH (e.g., 'mypackage.module.function'), NOT inline Python code. For inline code execution, use tools.code.exec instead. The key difference: tools.python.run has DurableContext access, tools.code.exec runs in a sandboxed Docker container with no context access.	#614	2025-12-18
code-quality	Proactive code hardening patterns: (1) INSERT ON CONFLICT for idempotent inserts (Fix #458, #495), (2) Atomic UPDATE with WHERE for check-and-modify, (3) Pre-compiled regex at module level (#481, #499, #528), (4) frozenset for immutable constants (#529), (5) TTL caches with proactive eviction (#489), (6) Double-check locking for singletons (#480). Code review found 22 potential issues but 18 were already fixed in the codebase.	-	2025-12-17
code-review	Comprehensive Code Review 2025-12-17: Created 22 issues from expert agent review of engine/ and api/. Critical patterns found: (1) Unbounded caches without proactive TTL cleanup cause memory growth (2) Singleton initialization needs double-check locking (3) Check-then-act patterns in DB operations cause race conditions - use INSERT ON CONFLICT or WHERE clause in UPDATE (4) Regex compilation in hot paths - always pre-compile at module level (5) Approval/signal processing needs optimistic locking to prevent double-processing (6) JWT config fetched from Vault on every request - add TTL cache. Priority fixes: #517 DataShard memory, #520 approval race, #525 N+1 analytics, #526 JWT caching.	-	2025-12-17
caching	TTL Cache Pattern: Module-level caches keyed by workflow_run_id/tenant_id MUST use TTL+max_size+LRU eviction+thread lock. Store (value,timestamp) tuples. Cleanup when size>max/2, evict oldest 10% when full.	-	2025-12-17
architecture	Durable workflow engines intentionally trade performance for durability - each variable operation is a DB round-trip to survive crashes. Caching can improve performance but requires careful design to maintain crash consistency guarantees.	#460	2025-12-17
performance	Use tuple() instead of list() for immutable sequences from dict.keys(), dict.values(), reversed(). Signals immutability and avoids the over-allocation that list() may perform.	#459	2025-12-17
memory	Cache eviction: Unbounded caches are memory leaks. Add TTL + max_size + eviction policy. Store (value, timestamp) tuples.	#456	2025-12-17
threading	Thread-safe singleton: Always use double-check locking with threading.Lock() - check None, acquire lock, check None again, then create.	#455	2025-12-17
general	Multi-tenant isolation is enforced at data access layer, not task claiming. tenant_id is derived from immutable task record in orchestrator.py:376-395, not from worker identity. All data operations (secrets, events, DB) use this tenant_id. RLS would be defense-in-depth, not a security fix. Workers are neutral infrastructure like AWS Lambda.	-	2025-12-16
general	Test failures root causes (Dec 2025): (1) JWT token generation bug - config.get_secret() returns dict, use ['value'] not second arg; (2) Docker entrypoint vs command - use entrypoint so it always runs even with custom args; (3) Internal queue saturation - scale internal-worker replicas for parallel test load; (4) Async log timing - increase retry/timeout for tests that wait on async operations	-	2025-12-15
general	Docker-in-Docker sandbox path mapping: When workers run inside containers and spawn sandbox containers via Docker socket, bind mounts fail because paths exist only in worker container, not on host. Solution: (1) Use put_archive() API to copy files INTO container instead of bind mounts, (2) For workspace mounts, use PWD:PWD volume mount so path is identical on host and container.	-	2025-12-15
checkpoint-system	Multi-tenant checkpoint isolation: tenant_id must be passed through entire save/load chain. If tenant_id defaults to 'default' on save but filters by actual tenant on load, checkpoints are never found, causing ctx.step() to re-execute. See Issue #436. Fix: z_absurd_0.0.9_fix_checkpoint_tenant_id.sql	#436	2025-12-14
general	Platform roles vs tenant roles are defined in separate files (platform.py vs rbac_roles.py). When adding new permissions, ensure they are added to BOTH PLATFORM_ROLES and PREDEFINED_ROLES where applicable. Use sync_platform_permissions() and sync_tenant_role_permissions() to update existing installations.	-	2025-12-12
general	Multi-tenant security requires sandboxing ALL user code execution. Never pass raw DB connections to user code. Use HTTP API callbacks with scoped execution tokens for tenant isolation.	-	2025-12-12
general	Security Review Pattern: When reviewing engine/ security, check: (1) SQL injection - verify sql.Identifier() for dynamic table names, (2) SSRF - verify DNS pinning not just validation, (3) Sandbox - verify sys.modules cleared, (4) Thread safety - verify locks around caches, (5) Temp files - verify unpredictable names via tempfile module.	-	2025-12-11
data-integrity	DataShard storage must use content-addressable keys (definition_hash) instead of mutable identifiers like workflow_name+version. This prevents stale data collisions after DB recreation. See Issue #360 for details.	#360	2025-12-11
workflow-dsl	For TRUE parallel execution in workflows, use ParallelOperator (builder.parallel) NOT ForEachOperator (builder.foreach). ForEach is sequential - each iteration blocks until complete. ParallelOperator spawns independent Absurd tasks that run concurrently. Use wait_for_parallel_branches() to sync after fork.	#359	2025-12-11
tools	Python tools (tools.python.run) MUST have ctx: DurableContext as first parameter in their function signature, even if not used. The tool executor always passes ctx as the first argument when invoking the function.	#358	2025-12-11
workflow-engine	ForEach result storage: Store results in BOTH flat key format (task_id.result) AND dict structure ({task_id: {result: value}}) to support both legacy path resolution (dot-separated) and dict-based Jinja2 templates. See _store_execution_result in inline_executor.py.	#358	2025-12-11
workflow-dsl	Template variable collision: Use alternative placeholder syntax like %CONTENT% in LLM prompts that need runtime substitution. Highway's Jinja2 template engine processes {{variable}} before tool execution, so use content_storage_ref parameter with %CONTENT% placeholder for large content that should be fetched during tool execution.	#358	2025-12-11
workflow-engine	ForEach result storage: Must store as BOTH flat key (task_id.result) AND dict structure ({task_id: {result: ...}}) for path resolution. resolve_variable_path splits by dots. See inline_executor.py:_store_execution_result	-	2025-12-11
s3	S3 Presigned URL SignatureDoesNotMatch: When generating presigned URLs for user uploads where content type is unknown, do NOT include ContentType in signing params. The upload request must send exact same Content-Type as signed, otherwise S3 returns SignatureDoesNotMatch. Fix: omit ContentType from generate_presigned_url params.	#358	2025-12-11
general	HTTP tools must validate URLs against SSRF attacks - always block localhost, private IPs (10.x, 172.16-31.x, 192.168.x), and link-local addresses (169.254.x, used for cloud metadata). Use the ipaddress module to check whether the resolved IP is_private, is_loopback, is_link_local, or is_reserved.	-	2025-12-10
general	ActivityOperator pattern: When an operator queues async work (activities), it must wait for completion by calling absurd_client.await_event() and raise AbsurdSleepError if not ready. Check payload.status for failed and raise RuntimeError to propagate failure to workflow. Add async operators to sleep-state exclusion list in inline_executor.py.	-	2025-12-10
tools	Reflexive Loop Verification: Use tools.code.exec not tools.python.run for LLM-generated code. tools.python.run expects function_name (module.function path), not raw code. tools.code.exec runs in subprocess with timeout, captures stdout/stderr/exit_code.	#331	2025-12-09
security	CRITICAL: App code_loader ALLOWED_IMPORTS must NOT include engine.config, engine.tools., engine.services., engine.db, engine.durable_context. These give direct Vault/DB access. Apps use ONLY engine.apps.sdk.base and engine.apps.sdk.context. All functionality via AppContext methods.	#328	2025-12-09
design	waits_for_event actions CANNOT run as activities. ActivityContext is atomic (no durable waits). wait_for_event requires DurableContext to suspend and wait. Approval workflows must run in normal worker mode without timeout_policy.	#328	2025-12-09
general	Database Code Execution Security: When executing user-provided code from the database, use multiple layers of security: 1) AST validation for forbidden imports BEFORE execution, 2) Runtime import whitelist via custom __import__, 3) Restricted __builtins__ (no eval/exec/open), 4) Unique module names per execution to prevent caching, 5) Code hash verification for integrity.	#326	2025-12-09
general	App versioning with python_module entrypoint provides FALSE isolation - code is always loaded from disk, not per-version. For true multi-tenant version isolation, must implement python_code entrypoint that stores actual code in database.	-	2025-12-09
general	UUID-only Secret Lookups: Secrets MUST be referenced by UUID only. get_secret() requires secret_id (UUID format), validates with _validate_uuid(), rejects secret_name/secret_path. Returns invalid_uuid for non-UUID values. Provides multi-tenant isolation.	#321	2025-12-08
testing	Python unittest mock patches: When mocking functions imported inside other functions, patch the original location (engine.config.get_config) not where it appears to be used. Mock target must be where the import comes from.	#263	2025-12-05
observability	Sidecar pattern for crash-safe logging: Use a separate autocommit connection pool for lifecycle events: a small pool (2-5 connections), autocommit=True for immediate persistence, and a fire-and-forget design so telemetry failures never crash the workflow. Log before the transaction starts and after commit.	#262	2025-12-05
security	OAuth authentication != authorization. After OAuth verifies identity, ALWAYS check user has actual permissions in the system before issuing JWT tokens. Never trust OAuth state parameters for authorization decisions.	#260	2025-12-04
security	API key rotation security: Default behavior must be SECURE (immediate invalidation), with opt-in grace period. Never leave compromised credentials valid by default.	#257	2025-12-04
security	API Security Hardening: Use @require_permission decorators (after @spec.validate), g.tenant_id not headers, AST validation for code execution (whitelist not blacklist)	#248	2025-12-04
general	SECURITY: All API endpoints must have @require_permission decorator. New endpoints require security review before deployment. Tenant isolation must use g.tenant_id from middleware, NEVER request headers directly.	-	2025-12-04
general	Worker kill and activity retry: When a worker is killed, the shell process it spawned dies with it (process group termination); this is expected behavior. For long-running activities to survive worker restarts, configure a retry policy. With a retry policy in place, a restart proceeds as follows: 1. Worker dies and the shell command gets SIGTERM. 2. Activity is marked as failed (attempt 1 of 5). 3. After a 5-second delay, a new worker picks it up. 4. The HTTP server restarts automatically. Without a retry policy, the activity fails permanently on worker crash. With a retry policy, the activity auto-recovers when a new worker starts.	-	2025-12-03
architecture	ActivityContext pattern for long-running activities: Use ActivityContext (not DurableContext) for activities to avoid holding DB connections. ActivityContext.get_connection() provides on-demand short-lived connections that auto-commit. DurableContext.get_connection() just yields its held connection. This prevents connection pool exhaustion during high concurrency of long-running activities.	#244	2025-12-03
general	DB Connection Management: Never hold database connections during long-running operations (shell commands, HTTP requests, etc.). Connections should be acquired, used briefly, and released. For long-running activities, acquire connection only for: 1) initial setup/variable resolution, 2) status updates. Use separate short-lived connections for periodic updates (heartbeat, PID storage). This prevents connection pool exhaustion when scaling to hundreds of concurrent activities.	-	2025-12-03
architecture	MCP integration: DurableMCPClient wraps ctx.step() for checkpointing. Tools: tools.mcp.invoke, tools.mcp.list_tools, tools.mcp.read_resource. DSL: WorkflowBuilder.mcp_tool(). DB: mcp_server_config with tenant isolation. Transports: stdio and HTTP. Credentials in Vault.	#240	2025-12-03
general	Spectree validates responses when resp= specified. UUIDs must be str(uuid). Response must match Pydantic models exactly.	-	2025-12-02
replay	Saga pattern implemented in DurableContext: (1) step_with_compensation() registers compensation functions, (2) run_compensations() executes in reverse order (LIFO), (3) saga() context manager auto-runs compensations on failure. All compensations are durable via ctx.step() for idempotency.	-	2025-12-02
replay	Highway replay uses two modes: Display Mode (historical data from checkpoints + audit log) and Simulation Mode (time-travel debugging re-executing code with mocked side effects). Determinism NOT enforced at runtime - relies on developer discipline to use ctx.now, ctx.get_random, ctx.step.	-	2025-12-02
general	Scheduling a workflow: When you "schedule" a workflow (e.g., via POST /v1/workflows), the following happens: 1. API Endpoint: The request hits api/blueprints/v1/workflows.py. 2. Versioning: The WorkflowVersioningService hashes the definition and stores it in the workflow_definition table. 3. Tracking: A workflow_run record is created in the database with status pending. 4. Enqueuing: The API calls absurd_client.spawn_task to insert a new task into the absurd task queue (specifically t_{queue_name}). The task name is typically tools.workflow.execute. Atomicity: This insertion happens within a database transaction. Once committed, the workflow is effectively "scheduled" for immediate execution.	-	2025-11-30
general	The execution flow is as follows: 1. The orchestrator claims a task, which corresponds to the execute_workflow function. 2. The orchestrator instantiates a DurableContext, which involves initiating a database transaction and creating an AbsurdClient. 3. The orchestrator then calls execute_workflow, passing in the newly created ctx. 4. execute_workflow instantiates a WorkflowInterpreter. 5. interpreter.start_workflow(ctx=ctx, ...) is called. 6. This, in turn, calls inline_executor.start_workflow(ctx=ctx, ...). 7. Finally, the InlineExecutor executes the workflow graph, passing the same ctx object to all execute_task_inline calls.	-	2025-11-30
general	DRAIN LOOP BUG PATTERN: When using while-True drain loops with LIMIT queries, always track seen IDs to prevent infinite loops. If query returns same rows (race condition prevents state change), loop runs forever. Fix: seen_ids set + break if all rows duplicates + MAX_PER_CYCLE limit.	-	2025-11-30
general	DSL_VARIABLE_INTERPOLATION: Never echo large variable content like {{result.stdout}} or {{result}} in shell tasks. The full content gets interpolated causing 'Argument list too long' errors. Use specific small fields like {{result.returncode}}, {{result.status_code}}, or truncate large outputs.	-	2025-11-29
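
The DRAIN LOOP BUG PATTERN lesson above (seen_ids set, break when all rows are duplicates, MAX_PER_CYCLE limit) can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the drain() function, the fetch_batch and handle callables, the "id" row key, and the MAX_PER_CYCLE value of 1000 are all hypothetical names and numbers chosen for the example. fetch_batch stands in for the LIMIT query.

```python
from typing import Callable

# Hypothetical cap on rows handled per drain cycle (value chosen for illustration).
MAX_PER_CYCLE = 1000


def drain(
    fetch_batch: Callable[[], list[dict]],
    handle: Callable[[dict], None],
    max_per_cycle: int = MAX_PER_CYCLE,
) -> int:
    """Drain batches until the source is empty, stuck, or the cap is reached.

    fetch_batch stands in for a LIMIT query; each row is assumed to carry
    an "id" key. Returns the number of rows handled in this cycle.
    """
    seen_ids: set = set()
    processed = 0
    while True:
        rows = fetch_batch()
        # Keep only rows that were not already handled in this cycle.
        fresh = [row for row in rows if row["id"] not in seen_ids]
        if not fresh:
            # Either the query returned nothing, or every row is a duplicate
            # (for example, a race prevented the state change). Stop here
            # instead of spinning forever.
            break
        for row in fresh:
            seen_ids.add(row["id"])
            handle(row)
            processed += 1
            if processed >= max_per_cycle:
                # Hard cap per cycle; remaining rows wait for the next cycle.
                return processed
    return processed


if __name__ == "__main__":
    # A "stuck" source that keeps returning the same two rows.
    stuck_source = lambda: [{"id": 1}, {"id": 2}]
    handled: list[dict] = []
    count = drain(stuck_source, handled.append)
    print(f"handled {count} rows before detecting duplicates")
```

Tracking seen IDs makes the loop terminate even when the underlying rows never change state, and the per-cycle cap bounds the work done in one pass when the source keeps producing new rows.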