Edit Issue #262
Description
## Problem

`_spawn_logging_task_if_needed` writes through the main atomic connection, so its log rows are part of the main transaction. If the worker crashes hard (SEGFAULT/OOM) before commit, or the transaction rolls back because of a DB constraint violation, the 'Task Started' log rows are lost along with it.

## Impact

Blind spots in observability for catastrophic failures. When investigating production issues, there is no audit trail showing that a task even started before the crash.

## Technical Details

- Current: Logging happens within the main transaction
- Problem: Transaction rollback = log rollback
- Failure scenarios:
  - Worker OOM kill
  - SEGFAULT in Python extension
  - DB constraint violation causing rollback
  - Network partition during commit

## Proposed Solution

Implement a 'Sidecar' pattern for critical lifecycle logs:

1. Use a separate autocommit connection, outside the main transaction, for Task Started/Task Failed events
2. Alternatively, use PostgreSQL UNLOGGED tables for high-speed logging. Note that UNLOGGED tables are not crash-safe: they skip the WAL and are truncated after a PostgreSQL crash, and rows inserted inside the main transaction still roll back with it
3. Either way, these logs must persist even if the main transaction dies

## Implementation Options

### Option A: Separate Connection

- Create dedicated logging connection pool (small, 2-3 connections)
- Fire-and-forget Task Started event before main transaction
- Pro: Simple implementation
- Con: Additional connection overhead

### Option B: UNLOGGED Tables

- Create `highway_crash_safe_log` (UNLOGGED)
- Insert lifecycle events directly
- Pro: Very fast (no WAL writes)
- Con: Lost on PostgreSQL crash (acceptable for telemetry). The insert only survives a rollback if it is issued outside the main transaction, so a separate connection is still needed

### Option C: Async Queue (Redis/Kafka)

- Push lifecycle events to external queue
- Pro: Fully decoupled
- Con: Additional infrastructure dependency

## Recommended: Option A (Separate Connection)

## Implementation Steps

1. Add `logging_pool` to db module (min 2, max 5 connections)
2. Create `log_lifecycle_event()` function using an autocommit connection from that pool
3. Call before transaction starts for Task Started
4. Call after transaction (success or fail) for Task Completed/Failed
5. Add integration test simulating transaction rollback

## Acceptance Criteria

- [ ] Task Started logs persist even on transaction rollback
- [ ] Task Failed logs persist even on worker crash (a worker killed by SEGFAULT/OOM cannot write this event itself, so it has to be recorded by whatever detects the dead worker)
- [ ] Minimal performance impact (<5ms additional latency)
- [ ] Integration test verifies log persistence on rollback

## Tags

observability, reliability, logging, crash-recovery
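## Sketch: Option A with a rollback test

Option A and implementation steps 2 to 5 can be sketched as follows. This is a minimal illustration, not the project's implementation: it uses `sqlite3` from the Python standard library as a stand-in for PostgreSQL so that it runs without a server, a single connection instead of a pool, and hypothetical table names (`lifecycle_log`, `task_results`) and helpers (`open_logging_connection`, `run_task`, `failing_work`). Only `log_lifecycle_event` is a name taken from the steps above, and its signature here is an assumption.

```python
import os
import sqlite3
import tempfile


def open_logging_connection(db_path):
    """Open a dedicated autocommit connection for lifecycle events (hypothetical helper)."""
    # isolation_level=None puts sqlite3 in autocommit mode: each statement
    # is committed on its own, independently of the main connection.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lifecycle_log ("
        "task_id TEXT NOT NULL, event TEXT NOT NULL)"
    )
    return conn


def log_lifecycle_event(log_conn, task_id, event):
    """Write one lifecycle event outside the main transaction."""
    log_conn.execute(
        "INSERT INTO lifecycle_log (task_id, event) VALUES (?, ?)",
        (task_id, event),
    )


def run_task(main_conn, log_conn, task_id, work):
    """Log Task Started before the main transaction, then Completed or Failed after it."""
    log_lifecycle_event(log_conn, task_id, "task_started")
    try:
        with main_conn:  # commits on success, rolls back on exception
            work(main_conn)
    except Exception:
        log_lifecycle_event(log_conn, task_id, "task_failed")
        raise
    log_lifecycle_event(log_conn, task_id, "task_completed")


def failing_work(conn):
    """Simulate a DB constraint violation in the middle of the main transaction."""
    conn.execute("INSERT INTO task_results VALUES ('t1', 'partial')")
    conn.execute("INSERT INTO task_results VALUES ('t1', 'duplicate')")  # PRIMARY KEY clash


def demo():
    """Run a failing task and return (logged events, rows left in task_results)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        main_conn = sqlite3.connect(db_path)
        log_conn = open_logging_connection(db_path)
        main_conn.execute(
            "CREATE TABLE task_results (task_id TEXT PRIMARY KEY, result TEXT)"
        )
        main_conn.commit()
        try:
            run_task(main_conn, log_conn, "t1", failing_work)
        except sqlite3.IntegrityError:
            pass  # expected: the main transaction was rolled back
        events = [
            row[0]
            for row in log_conn.execute(
                "SELECT event FROM lifecycle_log WHERE task_id = 't1' ORDER BY rowid"
            )
        ]
        results = main_conn.execute("SELECT COUNT(*) FROM task_results").fetchone()[0]
        main_conn.close()
        log_conn.close()
        return events, results


if __name__ == "__main__":
    print(demo())
```

In this sketch the rolled-back transaction leaves `task_results` empty while both lifecycle events remain in `lifecycle_log`. Because Task Started is committed before the main work begins, it would also survive a hard worker crash; a Task Failed event for that case cannot come from `run_task` and would have to be written by a separate supervisor. With PostgreSQL, the same structure would presumably use a small pool of autocommit connections in place of `open_logging_connection`.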