#262 Audit Log Visibility: Implement Sidecar Telemetry for Hard Crash Scenarios
Description
## Problem
_spawn_logging_task_if_needed writes through the main atomic connection, so lifecycle logs share the task's transaction. If the worker crashes hard (SEGFAULT/OOM) before the commit, or the transaction rolls back due to DB constraints, the 'Task Started' log is lost along with the rest of the transaction.
## Impact
Blind spots in observability for catastrophic failures. When investigating production issues, there's no audit trail showing that a task even started before the crash.
## Technical Details
- Current: Logging happens within the main transaction
- Problem: Transaction rollback = log rollback
- Failure scenarios:
  - Worker OOM kill
  - SEGFAULT in Python extension
  - DB constraint violation causing rollback
  - Network partition during commit
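
The rollback behaviour can be reproduced in a few lines. This is only a sketch: SQLite (from the standard library) stands in for PostgreSQL, and the table names `task_log` and `results` as well as the helper `run_task_with_inline_logging` are hypothetical, not taken from the codebase.

```python
import sqlite3

# SQLite stands in for PostgreSQL; all table names here are hypothetical.
conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit BEGIN/COMMIT
conn.execute("CREATE TABLE task_log (task_id TEXT, event TEXT)")
conn.execute("CREATE TABLE results (task_id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO results VALUES ('t1')")  # existing row forces a conflict


def run_task_with_inline_logging(conn, task_id):
    """Log 'Task Started' inside the same transaction as the task's own writes."""
    conn.execute("BEGIN")
    try:
        conn.execute("INSERT INTO task_log VALUES (?, 'Task Started')", (task_id,))
        conn.execute("INSERT INTO results VALUES (?)", (task_id,))  # PK violation
        conn.execute("COMMIT")
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # discards the 'Task Started' row as well


run_task_with_inline_logging(conn, "t1")
surviving_logs = conn.execute("SELECT COUNT(*) FROM task_log").fetchone()[0]
print(surviving_logs)  # 0: the audit row vanished with the rollback
```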
## Proposed Solution
Implement a 'Sidecar' pattern for critical lifecycle logs:
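1. Use a separate autocommit connection, outside the main transaction, for Task Started/Task Failed events
2. Alternatively, use PostgreSQL UNLOGGED tables for high-speed logging; they skip the WAL, but they are truncated after a PostgreSQL crash and their inserts still roll back with the enclosing transaction
3. These logs persist even if the main transaction rolls back or the worker dies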
## Implementation Options
### Option A: Separate Connection
- Create dedicated logging connection pool (small, 2-3 connections)
- Fire-and-forget Task Started event before main transaction
- Pro: Simple implementation
- Con: Additional connection overhead
### Option B: UNLOGGED Tables
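- Create highway_crash_safe_log (UNLOGGED)
- Insert lifecycle events directly
- Pro: Very fast writes, since UNLOGGED tables skip the WAL
- Con: Truncated after a PostgreSQL crash or unclean shutdown (acceptable for telemetry)
- Con: Inserts are still transactional, so events written on the main connection roll back with it; surviving a rollback still requires a separate connection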
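
One possible shape for the table, shown as SQL held in Python constants. Only the table name comes from this ticket; the columns, the helper names, and the `%s` placeholder style (which assumes a psycopg-style DB-API driver) are assumptions. Note that UNLOGGED only affects write-ahead logging: an insert issued inside a transaction is still undone if that transaction rolls back.

```python
# Hypothetical schema: only the table name comes from this ticket.
CREATE_CRASH_LOG = """
CREATE UNLOGGED TABLE IF NOT EXISTS highway_crash_safe_log (
    id          BIGSERIAL PRIMARY KEY,
    task_id     TEXT        NOT NULL,
    event       TEXT        NOT NULL,
    recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
)
"""

INSERT_EVENT = (
    "INSERT INTO highway_crash_safe_log (task_id, event) VALUES (%s, %s)"
)


def ensure_crash_log(cursor):
    """Create the table if it is missing; cursor is any DB-API cursor."""
    cursor.execute(CREATE_CRASH_LOG)


def insert_lifecycle_event(cursor, task_id, event):
    """Insert one lifecycle row; values are passed as bound parameters."""
    cursor.execute(INSERT_EVENT, (task_id, event))
```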
### Option C: Async Queue (Redis/Kafka)
- Push lifecycle events to external queue
- Pro: Fully decoupled
- Con: Additional infrastructure dependency
## Recommended: Option A (Separate Connection)
## Implementation Steps
1. Add logging_pool to db module (min 2, max 5 connections)
2. Create log_lifecycle_event() function using non-transactional connection
3. Call before transaction starts for Task Started
4. Call after transaction (success or fail) for Task Completed/Failed
5. Add integration test simulating transaction rollback
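
The steps above could be wired together roughly as follows. This is a sketch under assumptions: SQLite stands in for PostgreSQL, `run_task` and the table names are hypothetical, and a single autocommit connection replaces the pool for brevity. The last lines play the role of the rollback test from step 5.

```python
import os
import sqlite3
import tempfile


def log_lifecycle_event(log_conn, task_id, event):
    # log_conn is in autocommit mode, so this INSERT commits on its own.
    log_conn.execute("INSERT INTO lifecycle_log VALUES (?, ?)", (task_id, event))


def run_task(main_conn, log_conn, task_id, work):
    """Run work(main_conn) in a transaction, logging around it, not inside it."""
    log_lifecycle_event(log_conn, task_id, "Task Started")  # before the transaction
    try:
        with main_conn:  # commits on success, rolls back on exception
            work(main_conn)
    except Exception:
        log_lifecycle_event(log_conn, task_id, "Task Failed")
        return False
    log_lifecycle_event(log_conn, task_id, "Task Completed")
    return True


db_path = os.path.join(tempfile.mkdtemp(), "tasks.db")
main_conn = sqlite3.connect(db_path)                       # transactional
log_conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit sidecar
main_conn.executescript(
    "CREATE TABLE results (task_id TEXT PRIMARY KEY);"
    "CREATE TABLE lifecycle_log (task_id TEXT, event TEXT);"
)


def failing_work(conn):
    conn.execute("INSERT INTO results VALUES ('dup')")
    conn.execute("INSERT INTO results VALUES ('dup')")  # PK violation -> rollback


ok = run_task(main_conn, log_conn, "t1", failing_work)
events = [
    row[0]
    for row in log_conn.execute("SELECT event FROM lifecycle_log ORDER BY rowid")
]
result_rows = main_conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

On a hard crash the `except` branch never runs, so in this design a 'Task Started' row with no matching terminal event is the signal that the worker died mid-task.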
## Acceptance Criteria
- [ ] Task Started logs persist even on transaction rollback
- [ ] Task Failed logs persist even on worker crash
- [ ] Minimal performance impact (<5ms additional latency)
- [ ] Integration test verifies log persistence on rollback
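
The latency criterion could be checked with a small timing helper along these lines. Using the median as the statistic, the repeat count, and the helper names are all assumptions; the ticket only states the 5 ms budget.

```python
import statistics
import time

LATENCY_BUDGET_MS = 5.0  # from the acceptance criteria


def median_latency_ms(fn, repeats=200):
    """Median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def within_budget(fn, budget_ms=LATENCY_BUDGET_MS, repeats=200):
    """True if the median latency of fn() stays under the budget."""
    return median_latency_ms(fn, repeats) < budget_ms
```

In a real check, `fn` would be a closure around the lifecycle logging call, measured against a database comparable to production.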
## Tags: observability, reliability, logging, crash-recovery