#264 Connection Pool Starvation: Link Bulkhead Semaphore to DB Pool Size
Description
## Problem
The Orchestrator Bulkhead limit (max_concurrent_tasks) is decoupled from the database connection pool size. If max_concurrent_tasks exceeds the pool size, worker threads will starve waiting for DB connections.
## Impact
- Worker threads block waiting for connections (indefinitely if no acquire timeout is set)
- Cascading timeouts across the system
- Potential deadlock scenarios
- Reduced throughput despite available CPU/memory
## Technical Details
- Current config:
  - max_concurrent_tasks = 32 (from config.ini)
  - pool_max_size = 50 (from config.ini)
- A safe margin exists today (50 - 32 = 18 spare connections), but nothing enforces it
- Each task execution requires at least one connection
- Long-running tasks may hold connections for extended periods
## Failure Scenario
1. max_concurrent_tasks = 100
2. pool_max_size = 50
3. All 100 tasks start simultaneously
4. First 50 tasks get connections
5. Remaining 50 tasks block on connection acquisition
6. Blocked tasks fail with timeout errors, or deadlock if tasks holding connections wait on the blocked ones
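
The scenario above can be reproduced without a database. The following is a minimal sketch, assuming the pool behaves like a counting semaphore with an acquire timeout; `simulate` and `acquire_timeout` are hypothetical names used only for illustration, and the real pool may behave differently.

```python
import threading


def simulate(max_concurrent_tasks, pool_max_size, acquire_timeout=0.1):
    """Return (tasks_that_got_a_connection, tasks_that_timed_out)."""
    # Stand-in for the DB pool: one permit per connection.
    pool = threading.BoundedSemaphore(pool_max_size)
    # Every task holds its connection until all tasks have tried to acquire,
    # which models long-running tasks that do not release early.
    all_attempted = threading.Barrier(max_concurrent_tasks)
    results = []
    results_lock = threading.Lock()

    def task():
        got_connection = pool.acquire(timeout=acquire_timeout)
        with results_lock:
            results.append(got_connection)
        all_attempted.wait()
        if got_connection:
            pool.release()

    threads = [threading.Thread(target=task) for _ in range(max_concurrent_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), results.count(False)
```

With `simulate(100, 50)` the first 50 tasks obtain a connection and the other 50 time out, matching steps 3 to 6.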
## Proposed Solutions
### Option A: Startup Validation (Recommended)
Add validation during worker startup:
- Assert: max_concurrent_tasks <= pool_max_size - safety_margin
- safety_margin = 5 (for heartbeat, registration, etc.)
- Fail fast with clear error message if misconfigured
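
A minimal sketch of the check, assuming both values have already been read from config. `PoolConfigError` and `validate_pool_config` are hypothetical names, not existing code in worker.py.

```python
class PoolConfigError(ValueError):
    """Raised when the bulkhead limit does not fit inside the DB pool."""


def validate_pool_config(max_concurrent_tasks, pool_max_size, safety_margin=5):
    # Connections left for tasks once the system reserve is set aside.
    usable = pool_max_size - safety_margin
    if max_concurrent_tasks > usable:
        raise PoolConfigError(
            f"max_concurrent_tasks={max_concurrent_tasks} exceeds usable "
            f"connections ({pool_max_size} pool - {safety_margin} reserved "
            f"= {usable}); raise pool_max_size or lower max_concurrent_tasks"
        )
```

The worker would call this once at startup, log the message, and exit with a non-zero status if it raises.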
### Option B: Hard-Link Semaphores
Create a shared semaphore that gates both:
- Task acquisition from Bulkhead
- Connection acquisition from pool
This ensures no task starts without a guaranteed connection.
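
One way this could look, as a sketch only: a single semaphore whose permit count never exceeds the number of connections, so holding a permit implies a connection is available. `LinkedBulkhead` and `task_slot` are hypothetical names, and the in-memory queue stands in for the real pool.

```python
import contextlib
import queue
import threading


class LinkedBulkhead:
    def __init__(self, connections, max_concurrent_tasks):
        connections = list(connections)
        # Never hand out more task permits than there are connections.
        self.permits = min(max_concurrent_tasks, len(connections))
        self._permits = threading.BoundedSemaphore(self.permits)
        self._idle = queue.SimpleQueue()
        for conn in connections:
            self._idle.put(conn)

    @contextlib.contextmanager
    def task_slot(self, timeout=None):
        """Acquire a task permit and a connection together."""
        if not self._permits.acquire(timeout=timeout):
            raise TimeoutError("bulkhead full: no task slot available")
        try:
            # Cannot be empty: permits <= connections, one connection per permit.
            conn = self._idle.get_nowait()
            try:
                yield conn
            finally:
                self._idle.put(conn)  # return the connection before the permit
        finally:
            self._permits.release()
```

A task body would then run as `with bulkhead.task_slot() as conn: ...`, and a caller that cannot get a slot fails at the bulkhead rather than inside the pool.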
### Option C: Dynamic Pool Sizing
Automatically size pool based on max_concurrent_tasks:
- pool_max_size = max_concurrent_tasks + safety_margin
- Less flexible but eliminates misconfiguration
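
A sketch of the sizing rule, assuming it is applied before the pool is created. `resolve_pool_max_size` and its `headroom` parameter are hypothetical; the default of 5 follows safety_margin here, while the Implementation Steps use 10.

```python
def resolve_pool_max_size(max_concurrent_tasks, configured_pool_max_size,
                          pool_auto_size, headroom=5):
    """Return the pool size to use at startup."""
    if pool_auto_size:
        # Derived size: one connection per task plus reserved headroom.
        return max_concurrent_tasks + headroom
    # Manual sizing: keep the configured value (validation still applies).
    return configured_pool_max_size
```

For example, `resolve_pool_max_size(100, 50, pool_auto_size=True)` yields 105, so the configured 50 is overridden.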
## Recommended: Option A + Option C
## Implementation Steps
1. Add startup validation in worker.py:
   - Read pool_max_size from config
   - Read max_concurrent_tasks from config
   - Assert pool_max_size >= max_concurrent_tasks + pool_safety_margin (default 5)
   - Log a clear error and exit if violated
2. Add automatic sizing option:
   - New config: pool_auto_size = true/false
   - If true: pool_max_size = max_concurrent_tasks + 10
   - Document the setting
3. Add monitoring:
   - Log pool utilization during heartbeat
   - Warn if pool utilization > 80%
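
The monitoring step could be sketched as below, assuming the pool can report how many connections are checked out. `report_pool_utilization`, the `in_use` argument, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("worker.pool")


def report_pool_utilization(in_use, pool_max_size, warn_threshold=0.80):
    """Log pool utilization and warn above the threshold; returns the ratio."""
    if pool_max_size <= 0:
        raise ValueError("pool_max_size must be positive")
    utilization = in_use / pool_max_size
    logger.info("db pool utilization: %d/%d (%.0f%%)",
                in_use, pool_max_size, utilization * 100)
    if utilization > warn_threshold:
        logger.warning("db pool utilization above %.0f%%: %d/%d in use",
                       warn_threshold * 100, in_use, pool_max_size)
    return utilization
```

Calling this from the heartbeat handler keeps the check on an existing periodic path instead of adding a new timer.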
## Configuration Changes
[database]
pool_max_size = 50
# If true, override pool_max_size based on max_concurrent_tasks
pool_auto_size = false
[worker]
max_concurrent_tasks = 32
# Reserved connections for system operations
pool_safety_margin = 5
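
A sketch of reading these settings, assuming config.ini is parsed with Python's standard `configparser`; `load_pool_settings` is a hypothetical helper. Note that `configparser` does not strip trailing `#` comments by default, so the sketch enables `inline_comment_prefixes` to tolerate them.

```python
import configparser


def load_pool_settings(text):
    """Parse the [database] and [worker] pool settings from INI text."""
    # Without inline_comment_prefixes, "false # ..." would be read as the
    # whole string and getboolean() would raise ValueError.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    return {
        "pool_max_size": parser.getint("database", "pool_max_size"),
        "pool_auto_size": parser.getboolean(
            "database", "pool_auto_size", fallback=False),
        "max_concurrent_tasks": parser.getint("worker", "max_concurrent_tasks"),
        "pool_safety_margin": parser.getint(
            "worker", "pool_safety_margin", fallback=5),
    }
```

The fallbacks mean an existing config.ini without the two new keys keeps working with the defaults described in this issue.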
## Acceptance Criteria
- [ ] Startup validation prevents misconfiguration
- [ ] Clear error message on pool/bulkhead mismatch
- [ ] Optional auto-sizing implemented
- [ ] Pool utilization monitoring added
- [ ] Documentation updated with sizing guidance
## Tags: reliability, connection-pool, configuration, worker