#169 Docker Tool for Workflow Engine
## Summary
Implement a full-featured Docker tool (`tools.docker.*`) for Highway Workflow Engine that enables workflows to spawn, manage, and clean up Docker containers as durable activities.
## Motivation
Enterprise workflows require containerized task execution for:
- Isolated execution environments (dependencies, security)
- Reproducible builds and data processing
- Running third-party tools without polluting host
- Ephemeral compute for batch jobs
- Durable cron jobs that spin up containers on schedule
## Requirements
### 1. Core Container Operations
```python
self.register("tools.docker.run", docker_run) # Run container (main tool)
self.register("tools.docker.exec", docker_exec) # Exec into running container
self.register("tools.docker.stop", docker_stop) # Stop container
self.register("tools.docker.remove", docker_remove) # Remove container
self.register("tools.docker.logs", docker_logs) # Get container logs
self.register("tools.docker.inspect", docker_inspect) # Inspect container
```
### 2. docker_run - Full Parameter Support
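`docker_run` must support the following parameters. Most map directly onto Docker SDK container-run arguments; `timeout` and `pull_policy` are handled by the tool itself: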
- image (REQUIRED), command, entrypoint
- environment, volumes, ports, network, network_mode
- cpu_limit, memory_limit, user, working_dir, labels
- auto_remove (default: True for GC), detach, timeout (default: 3600)
- pull_policy: always, never, if_not_present
- privileged (default: False), cap_add, cap_drop
- devices, shm_size, tmpfs, read_only
- hostname, dns, extra_hosts
- init (default: True), oom_kill_disable, pids_limit, ulimits
- log_config, healthcheck, restart_policy
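The parameter list above could be translated into Docker SDK keyword arguments roughly as follows. This is a minimal sketch, not the engine's implementation: the helper name `build_run_kwargs` is hypothetical, and it assumes `cpu_limit` is a number of CPUs and `memory_limit` a value the SDK accepts for `mem_limit` (the requirements do not define units). The defaults come from the list above and from section 9.

```python
from typing import Any, Dict, Tuple

# Tool-level names that differ from the Docker SDK keyword (assumed mapping).
_RENAMED = {"memory_limit": "mem_limit"}

# Parameters consumed by the tool itself rather than passed to the SDK.
_TOOL_ONLY = {"timeout", "pull_policy"}

# Defaults stated in the requirements.
_DEFAULTS: Dict[str, Any] = {
    "auto_remove": True,
    "privileged": False,
    "init": True,
    "timeout": 3600,
    "pull_policy": "if_not_present",
}


def build_run_kwargs(image: str, **params: Any) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split tool parameters into (sdk_kwargs, tool_options)."""
    if not image:
        raise ValueError("image is required")
    merged = {**_DEFAULTS, **params}
    sdk_kwargs: Dict[str, Any] = {"image": image}
    tool_options: Dict[str, Any] = {}
    for name, value in merged.items():
        if value is None:
            continue  # unset parameters are left to the daemon's defaults
        if name in _TOOL_ONLY:
            tool_options[name] = value
        elif name == "cpu_limit":
            # Assumption: cpu_limit counts CPUs; the SDK's nano_cpus is in units of 1e-9 CPUs.
            sdk_kwargs["nano_cpus"] = int(float(value) * 1_000_000_000)
        else:
            sdk_kwargs[_RENAMED.get(name, name)] = value
    return sdk_kwargs, tool_options
```

Keeping `timeout` and `pull_policy` in a separate dictionary makes it harder to pass them to the SDK by accident, where they would be rejected as unknown arguments.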
### 3. CRITICAL: Garbage Collection (GC)
Containers MUST be cleaned up in ALL scenarios:
1. Normal completion - Container exits normally → remove
2. Container failure - Container exits non-zero → remove + log error
3. Timeout - Execution exceeds timeout → SIGTERM → wait 10s → SIGKILL → remove
4. Workflow cancellation - Parent workflow cancelled → stop all containers → remove
5. Worker crash - Worker process dies → orphan detection → cleanup
6. Activity timeout - Activity heartbeat timeout → stop container → remove
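Scenarios 1 to 3 above can be covered by a single wait-then-cleanup routine. The sketch below is written against a small `ContainerLike` protocol rather than the real SDK so that it stays self-contained. The function name, the protocol, and the `timeout_errors` / `gone_errors` parameters are assumptions made for illustration.

```python
from typing import Optional, Protocol, Tuple, Type


class ContainerLike(Protocol):
    """Subset of a container handle's interface used below (assumed shape)."""

    def wait(self, timeout: Optional[float] = None) -> dict: ...
    def stop(self, timeout: int = 10) -> None: ...
    def remove(self, force: bool = False) -> None: ...


def wait_and_cleanup(
    container: ContainerLike,
    timeout: float = 3600,
    grace_seconds: int = 10,
    timeout_errors: Tuple[Type[BaseException], ...] = (TimeoutError,),
    gone_errors: Tuple[Type[BaseException], ...] = (),
) -> Optional[int]:
    """Wait for the container to exit and always remove it afterwards.

    Returns the exit code, or None if the timeout was hit.
    """
    try:
        result = container.wait(timeout=timeout)
        return int(result["StatusCode"])
    except timeout_errors:
        # stop() is expected to send SIGTERM, then SIGKILL after grace_seconds.
        container.stop(timeout=grace_seconds)
        return None
    finally:
        # Runs on normal exit, non-zero exit, timeout, and any propagating exception.
        try:
            container.remove(force=True)
        except gone_errors:
            pass  # already removed, for example by auto_remove=True
```

With the Docker Python SDK, `timeout_errors` would likely need to include the `requests` timeout exceptions raised by `Container.wait`, and `gone_errors` would likely be `docker.errors.NotFound`. Scenarios 4 to 6 still need the worker shutdown handler and the GC daemon, because a `finally` block does not run if the worker process is killed.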
Container Tracking Table:
```sql
CREATE TABLE highway.docker_containers (
container_id TEXT PRIMARY KEY,
short_id TEXT NOT NULL,
activity_id UUID NOT NULL,
workflow_run_id UUID NOT NULL,
tenant_id TEXT NOT NULL,
image TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'created',
exit_code INTEGER,
created_at TIMESTAMPTZ DEFAULT now(),
started_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
removed_at TIMESTAMPTZ,
labels JSONB DEFAULT '{}',
gc_protected_until TIMESTAMPTZ
);
```
GC Daemon:
- DockerGarbageCollector runs as background service
- _cleanup_orphaned_containers(): containers where activity no longer exists
- _cleanup_stale_containers(): force-remove after max_age
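The selection logic behind the two cleanup passes might look like the sketch below, which operates on rows already loaded from `highway.docker_containers`. The `TrackedContainer` class and `select_for_gc` function are hypothetical names. Treating `gc_protected_until` as a guard against both orphan and stale cleanup is an assumption, since the requirements do not say which passes it applies to.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class TrackedContainer:
    """Subset of the highway.docker_containers columns needed for GC decisions."""

    container_id: str
    activity_id: str
    created_at: datetime
    removed_at: Optional[datetime] = None
    gc_protected_until: Optional[datetime] = None


def select_for_gc(
    rows: Iterable[TrackedContainer],
    live_activity_ids: Set[str],
    max_age: timedelta,
    now: datetime,
) -> List[str]:
    """Return IDs of containers that are orphaned or older than max_age."""
    doomed: List[str] = []
    for row in rows:
        if row.removed_at is not None:
            continue  # already cleaned up
        if row.gc_protected_until is not None and now < row.gc_protected_until:
            continue  # still inside its protection window (assumed semantics)
        orphaned = row.activity_id not in live_activity_ids
        stale = now - row.created_at > max_age
        if orphaned or stale:
            doomed.append(row.container_id)
    return doomed
```

Passing `now` in explicitly keeps the function deterministic, so the daemon's decisions can be unit-tested without a database or a Docker daemon.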
### 4. Activity Worker Integration
Docker tools MUST run on activity workers:
- Worker shutdown handler stops owned containers
- Worker crash triggers GC daemon cleanup
- Container labels include highway_worker_id for tracking
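One way to support the shutdown handler is for each worker to keep a registry of the containers it started. The sketch below is illustrative only: `build_labels` and `OwnedContainers` are hypothetical names, only the `highway_worker_id` label key comes from the requirements, and the `stop_container` callback stands in for whatever wrapper the engine uses around the Docker client.

```python
from typing import Callable, Dict, List, Set


def build_labels(worker_id: str, tenant_id: str, workflow_run_id: str) -> Dict[str, str]:
    """Labels attached to every container this worker starts."""
    return {
        "highway_worker_id": worker_id,              # key named in the requirements
        "highway_tenant_id": tenant_id,              # hypothetical key
        "highway_workflow_run_id": workflow_run_id,  # hypothetical key
    }


class OwnedContainers:
    """Tracks containers started by this worker so shutdown can stop them."""

    def __init__(self, stop_container: Callable[[str], None]) -> None:
        self._stop_container = stop_container
        self._ids: Set[str] = set()

    def add(self, container_id: str) -> None:
        self._ids.add(container_id)

    def discard(self, container_id: str) -> None:
        self._ids.discard(container_id)

    def stop_all(self) -> List[str]:
        """Stop every owned container; return the IDs that could not be stopped."""
        failed: List[str] = []
        for container_id in sorted(self._ids):
            try:
                self._stop_container(container_id)
            except Exception:
                # Leave it for the GC daemon rather than blocking shutdown.
                failed.append(container_id)
            else:
                self._ids.discard(container_id)
        return failed
```

A worker could call `stop_all()` from its SIGTERM handler. Containers that fail to stop, or that outlive a crashed worker, remain discoverable by the GC daemon through the `highway_worker_id` label.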
### 5. Durable Cron Mode
Support ephemeral containers for cron jobs:
- Container wakes up, executes short task, shuts down
- Works with tools.cron.durable_cron scheduler
- Auto-remove after completion
### 6. Output Handling
Return structure:
- container_id, short_id, status, exit_code
- stdout, stderr (truncated if too large)
- duration_ms, image, logs_truncated
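The return structure could be modelled as a small dataclass with a truncation helper. In this sketch the 64 KiB cap and the choice to keep the tail of the output (where errors usually appear) are assumptions, as the requirements only say "truncated if too large". The names `DockerRunResult`, `truncate_output`, and `build_result` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_OUTPUT_BYTES = 64 * 1024  # assumed cap; the requirements do not give a number


def truncate_output(data: bytes, limit: int = MAX_OUTPUT_BYTES) -> Tuple[str, bool]:
    """Keep at most the last `limit` bytes and report whether anything was cut."""
    truncated = len(data) > limit
    if truncated:
        data = data[len(data) - limit:]
    # errors="replace" tolerates a multi-byte character split by the cut.
    return data.decode("utf-8", errors="replace"), truncated


@dataclass
class DockerRunResult:
    container_id: str
    short_id: str
    status: str
    exit_code: Optional[int]
    stdout: str
    stderr: str
    duration_ms: int
    image: str
    logs_truncated: bool


def build_result(
    container_id: str,
    short_id: str,
    image: str,
    status: str,
    exit_code: Optional[int],
    stdout: bytes,
    stderr: bytes,
    duration_ms: int,
) -> DockerRunResult:
    """Assemble the tool's return value from raw container output."""
    out, out_cut = truncate_output(stdout)
    err, err_cut = truncate_output(stderr)
    return DockerRunResult(
        container_id=container_id,
        short_id=short_id,
        status=status,
        exit_code=exit_code,
        stdout=out,
        stderr=err,
        duration_ms=duration_ms,
        image=image,
        logs_truncated=out_cut or err_cut,
    )
```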
### 7. Security Considerations
- No privileged by default
- Resource limits recommended (warn if missing)
- Network restrictions optional
- Audit logging to DataShard
- Tenant isolation via labels
- Secrets injection via tools.secrets.get_secret
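The first two bullets lend themselves to a pre-flight check that returns warnings instead of failing the run. This is a minimal sketch: the function name `check_security` and the warning wording are hypothetical, and it assumes the tool-level parameter names from section 2.

```python
from typing import Any, Dict, List


def check_security(params: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for risky or missing settings."""
    warnings: List[str] = []
    if params.get("privileged", False):
        warnings.append("privileged=True was requested explicitly; the default is False")
    if params.get("cpu_limit") is None:
        warnings.append("no cpu_limit set; resource limits are recommended")
    if params.get("memory_limit") is None:
        warnings.append("no memory_limit set; resource limits are recommended")
    return warnings
```

The caller would then decide how to surface the warnings, for example by writing them to the audit log alongside the run record.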
### 8. Circuit Breaker Protection
Calls to the Docker daemon must be wrapped in a circuit breaker, following the pattern used in shell_command.py.
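The contents of shell_command.py are not reproduced in this issue, so the following is a generic circuit breaker sketch rather than that file's actual pattern. The class names, the thresholds, and the injected `clock` are assumptions, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        """Run fn unless the breaker is open; track consecutive failures."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("Docker daemon circuit is open")
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

A Docker client wrapper could route each daemon call through `breaker.call(lambda: ...)` so that a failing daemon is not hit with repeated requests from every activity.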
### 9. Image Pull Behavior
- always: Pull on every run
- if_not_present: Pull only if missing (default)
- never: Error if not present
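The three policies reduce to a small decision function. In this sketch the names `should_pull` and `ImageNotPresentError` are hypothetical, and the caller is assumed to have already checked whether the image exists locally.

```python
class ImageNotPresentError(RuntimeError):
    """Raised when pull_policy is 'never' and the image is missing locally."""


def should_pull(policy: str, image_present: bool) -> bool:
    """Decide whether to pull before running, per the policies above."""
    if policy == "always":
        return True
    if policy == "if_not_present":
        return not image_present
    if policy == "never":
        if not image_present:
            raise ImageNotPresentError("image not present and pull_policy is 'never'")
        return False
    raise ValueError(f"unknown pull_policy: {policy!r}")
```

With the Docker Python SDK, local presence could likely be checked through `client.images.get(...)` and the pull performed through `client.images.pull(...)`. Rejecting unknown policy strings avoids silently falling back to a default.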
### 10. Implementation Files
```
engine/tools/docker/
├── __init__.py
├── container.py # docker_run, docker_exec, docker_stop, docker_remove
├── images.py # docker_pull, docker_images
├── gc.py # DockerGarbageCollector
└── client.py # Docker SDK wrapper with retries
engine/migrations/sql/
└── highway_X.X.X_docker_containers.sql
engine/services/
└── docker_gc_worker.py # Standalone GC daemon
```
## Acceptance Criteria
- [ ] tools.docker.run executes containers with all Docker SDK parameters
- [ ] Containers automatically removed on completion/failure/timeout
- [ ] Worker crash does not leave orphaned containers (GC daemon handles)
- [ ] Activity timeout triggers container stop+remove
- [ ] Workflow cancellation stops all associated containers
- [ ] Container output captured and returned
- [ ] Image pull happens based on policy
- [ ] Resource limits enforced
- [ ] All operations logged to DataShard
- [ ] Circuit breaker protects against Docker daemon issues
- [ ] Integration tests with real Docker daemon
- [ ] Durable cron mode works for scheduled container jobs
## Dependencies
- docker Python SDK
- Docker daemon accessible from activity workers
- Migration for highway.docker_containers table
## Future Considerations
- docker-compose support for multi-container workflows
- Kubernetes support for cloud-native deployments
- GPU support via device_requests parameter
## Key Features Specified
1. Core Operations: tools.docker.run, tools.docker.exec, tools.docker.stop, tools.docker.remove, tools.docker.logs, tools.docker.inspect
2. Full Docker SDK Parameters: image, command, environment, volumes, ports, network, CPU/memory limits, timeout, pull policy, capabilities, security options
3. Garbage Collection (Critical):
   - Layer 1: Immediate cleanup via auto_remove=True + try/finally
   - Layer 2: Worker shutdown handler
   - Layer 3: Background GC daemon for orphan detection
4. Container Tracking Table: highway.docker_containers with full lifecycle tracking
5. Security Defaults: no-new-privileges, cap_drop=ALL, no privileged by default
6. Durable Cron Mode: Containers wake up on schedule, execute, shut down
## Implementation Phases
- Phase 1 (MVP): Basic docker_run, tracking table, inline timeout
- Phase 2 (Production): Full params, GC daemon, activity worker integration, circuit breaker
- Phase 3 (Enterprise): exec/logs/inspect, network policies, compose support