#264 Connection Pool Starvation: Link Bulkhead Semaphore to DB Pool Size

closed medium Created 2025-12-05 09:01 · Updated 2025-12-05 09:15

Description

## Problem

The Orchestrator Bulkhead limit (max_concurrent_tasks) is decoupled from the database connection pool size. If max_concurrent_tasks exceeds the pool size, worker threads will starve waiting for DB connections.

## Impact

- Worker threads block indefinitely waiting for connections
- Cascading timeouts across the system
- Potential deadlock scenarios
- Reduced throughput despite available CPU/memory

## Technical Details

- Current config:
  - max_concurrent_tasks = 32 (from config.ini)
  - pool_max_size = 50 (from config.ini)
  - Safe margin exists currently, but no enforcement
- Each task execution requires at least one connection
- Long-running tasks may hold connections for extended periods

## Failure Scenario

1. max_concurrent_tasks = 100
2. pool_max_size = 50
3. All 100 tasks start simultaneously
4. First 50 tasks get connections
5. Remaining 50 tasks block on connection acquisition
6. Timeout errors or deadlock

## Proposed Solutions

### Option A: Startup Validation (Recommended)

Add validation during worker startup:

- Assert: max_concurrent_tasks <= pool_max_size - safety_margin
- safety_margin = 5 (for heartbeat, registration, etc.)
- Fail fast with clear error message if misconfigured

### Option B: Hard-Link Semaphores

Create a shared semaphore that gates both:

- Task acquisition from Bulkhead
- Connection acquisition from pool

This ensures no task starts without guaranteed connection

### Option C: Dynamic Pool Sizing

Automatically size pool based on max_concurrent_tasks:

- pool_max_size = max_concurrent_tasks + safety_margin
- Less flexible but eliminates misconfiguration

## Recommended: Option A + Option C

## Implementation Steps

1. Add startup validation in worker.py:
   - Read pool_max_size from config
   - Read max_concurrent_tasks from config
   - Assert pool_max_size >= max_concurrent_tasks + 5
   - Log clear error and exit if violated
2. Add automatic sizing option:
   - New config: pool_auto_size = true/false
   - If true: pool_max_size = max_concurrent_tasks + 10
   - Document the setting
3. Add monitoring:
   - Log pool utilization during heartbeat
   - Warn if pool utilization > 80%

## Configuration Changes

    [database]
    pool_max_size = 50
    pool_auto_size = false  # If true, override pool_max_size based on max_concurrent_tasks

    [worker]
    max_concurrent_tasks = 32
    pool_safety_margin = 5  # Reserved connections for system operations

## Acceptance Criteria

- [ ] Startup validation prevents misconfiguration
- [ ] Clear error message on pool/bulkhead mismatch
- [ ] Optional auto-sizing implemented
- [ ] Pool utilization monitoring added
- [ ] Documentation updated with sizing guidance

## Tags: reliability, connection-pool, configuration, worker
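
## Illustrative Sketch

The startup validation and auto-sizing rules from the implementation steps could look roughly like the following. This is a minimal sketch, not the actual worker.py code: the names `load_config`, `resolve_pool_size`, `PoolConfigError`, and `AUTO_SIZE_HEADROOM` are hypothetical, and it assumes the settings live in the `[database]` and `[worker]` sections shown under Configuration Changes and are read with Python's standard `configparser`.

```python
import configparser

# Assumed defaults, taken from the numbers quoted in this ticket.
DEFAULT_SAFETY_MARGIN = 5   # reserved for heartbeat, registration, etc.
AUTO_SIZE_HEADROOM = 10     # headroom used when pool_auto_size = true


class PoolConfigError(ValueError):
    """Raised when the pool cannot serve the configured bulkhead limit."""


def load_config(text: str) -> configparser.ConfigParser:
    # inline_comment_prefixes lets "false  # comment" parse as "false";
    # configparser does not strip inline comments by default.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    return parser


def resolve_pool_size(config: configparser.ConfigParser) -> int:
    """Return the effective pool_max_size or raise PoolConfigError."""
    max_tasks = config.getint("worker", "max_concurrent_tasks")
    margin = config.getint(
        "worker", "pool_safety_margin", fallback=DEFAULT_SAFETY_MARGIN
    )
    auto_size = config.getboolean("database", "pool_auto_size", fallback=False)

    if auto_size:
        # Auto-sizing: derive the pool size from the bulkhead limit.
        return max_tasks + AUTO_SIZE_HEADROOM

    pool_size = config.getint("database", "pool_max_size")
    required = max_tasks + margin
    if pool_size < required:
        # Fail fast with a message that names both settings.
        raise PoolConfigError(
            f"pool_max_size={pool_size} is too small for "
            f"max_concurrent_tasks={max_tasks} plus "
            f"pool_safety_margin={margin}; need at least {required}"
        )
    return pool_size
```

With the current values (32 tasks, pool of 50, margin of 5) the check passes and the pool size is returned unchanged. With the failure scenario (100 tasks, pool of 50) it raises `PoolConfigError`, which a worker entry point would presumably log before exiting with a non-zero status.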
