Edit Issue #169
## Summary

Implement a full-featured Docker tool (`tools.docker.*`) for Highway Workflow Engine that enables workflows to spawn, manage, and clean up Docker containers as durable activities.

## Motivation

Enterprise workflows require containerized task execution for:

- Isolated execution environments (dependencies, security)
- Reproducible builds and data processing
- Running third-party tools without polluting host
- Ephemeral compute for batch jobs
- Durable cron jobs that spin up containers on schedule

## Requirements

### 1. Core Container Operations

```python
self.register("tools.docker.run", docker_run)          # Run container (main tool)
self.register("tools.docker.exec", docker_exec)        # Exec into running container
self.register("tools.docker.stop", docker_stop)        # Stop container
self.register("tools.docker.remove", docker_remove)    # Remove container
self.register("tools.docker.logs", docker_logs)        # Get container logs
self.register("tools.docker.inspect", docker_inspect)  # Inspect container
```

### 2. docker_run - Full Parameter Support

All Docker SDK parameters must be supported:

- image (REQUIRED), command, entrypoint
- environment, volumes, ports, network, network_mode
- cpu_limit, memory_limit, user, working_dir, labels
- auto_remove (default: True for GC), detach, timeout (default: 3600)
- pull_policy: always, never, if_not_present
- privileged (default: False), cap_add, cap_drop
- devices, shm_size, tmpfs, read_only
- hostname, dns, extra_hosts
- init (default: True), oom_kill_disable, pids_limit, ulimits
- log_config, healthcheck, restart_policy

### 3. CRITICAL: Garbage Collection (GC)

Containers MUST be cleaned up in ALL scenarios:

1. Normal completion - Container exits normally → remove
2. Container failure - Container exits non-zero → remove + log error
3. Timeout - Execution exceeds timeout → SIGTERM → wait 10s → SIGKILL → remove
4. Workflow cancellation - Parent workflow cancelled → stop all containers → remove
5. Worker crash - Worker process dies → orphan detection → cleanup
6. Activity timeout - Activity heartbeat timeout → stop container → remove

Container Tracking Table:

```sql
CREATE TABLE highway.docker_containers (
    container_id        TEXT PRIMARY KEY,
    short_id            TEXT NOT NULL,
    activity_id         UUID NOT NULL,
    workflow_run_id     UUID NOT NULL,
    tenant_id           TEXT NOT NULL,
    image               TEXT NOT NULL,
    status              TEXT NOT NULL DEFAULT 'created',
    exit_code           INTEGER,
    created_at          TIMESTAMPTZ DEFAULT now(),
    started_at          TIMESTAMPTZ,
    finished_at         TIMESTAMPTZ,
    removed_at          TIMESTAMPTZ,
    labels              JSONB DEFAULT '{}',
    gc_protected_until  TIMESTAMPTZ
);
```

GC Daemon:

- DockerGarbageCollector runs as background service
- _cleanup_orphaned_containers(): containers where activity no longer exists
- _cleanup_stale_containers(): force-remove after max_age

### 4. Activity Worker Integration

Docker tools MUST run on activity workers:

- Worker shutdown handler stops owned containers
- Worker crash triggers GC daemon cleanup
- Container labels include highway_worker_id for tracking

### 5. Durable Cron Mode

Support ephemeral containers for cron jobs:

- Container wakes up, executes short task, shuts down
- Works with tools.cron.durable_cron scheduler
- Auto-remove after completion

### 6. Output Handling

Return structure:

- container_id, short_id, status, exit_code
- stdout, stderr (truncated if too large)
- duration_ms, image, logs_truncated

### 7. Security Considerations

- No privileged by default
- Resource limits recommended (warn if missing)
- Network restrictions optional
- Audit logging to DataShard
- Tenant isolation via labels
- Secrets injection via tools.secrets.get_secret

### 8. Circuit Breaker Protection

Follow shell_command.py pattern for Docker daemon protection.

### 9. Image Pull Behavior

- always: Pull on every run
- if_not_present: Pull only if missing (default)
- never: Error if not present

### 10. Implementation Files

```
engine/tools/docker/
├── __init__.py
├── container.py    # docker_run, docker_exec, docker_stop, docker_remove
├── images.py       # docker_pull, docker_images
├── gc.py           # DockerGarbageCollector
└── client.py       # Docker SDK wrapper with retries

engine/migrations/sql/
└── highway_X.X.X_docker_containers.sql

engine/services/
└── docker_gc_worker.py    # Standalone GC daemon
```

## Acceptance Criteria

- [ ] tools.docker.run executes containers with all Docker SDK parameters
- [ ] Containers automatically removed on completion/failure/timeout
- [ ] Worker crash does not leave orphaned containers (GC daemon handles)
- [ ] Activity timeout triggers container stop+remove
- [ ] Workflow cancellation stops all associated containers
- [ ] Container output captured and returned
- [ ] Image pull happens based on policy
- [ ] Resource limits enforced
- [ ] All operations logged to DataShard
- [ ] Circuit breaker protects against Docker daemon issues
- [ ] Integration tests with real Docker daemon
- [ ] Durable cron mode works for scheduled container jobs

## Dependencies

- docker Python SDK
- Docker daemon accessible from activity workers
- Migration for highway.docker_containers table

## Future Considerations

- docker-compose support for multi-container workflows
- Kubernetes support for cloud-native deployments
- GPU support via device_requests parameter

## Key Features Specified

1. Core Operations: tools.docker.run, docker.exec, docker.stop, docker.remove, docker.logs, docker.inspect
2. Full Docker SDK Parameters: image, command, environment, volumes, ports, network, cpu/memory limits, timeout, pull policy, capabilities, security options
3. Garbage Collection (Critical):
   - Layer 1: Immediate cleanup via auto_remove=True + try/finally
   - Layer 2: Worker shutdown handler
   - Layer 3: Background GC daemon for orphan detection
4. Container Tracking Table: highway.docker_containers with full lifecycle tracking
5. Security Defaults: no-new-privileges, cap_drop=ALL, no privileged by default
6. Durable Cron Mode: Containers wake up on schedule, execute, shut down

## Implementation Phases

- Phase 1 (MVP): Basic docker_run, tracking table, inline timeout
- Phase 2 (Production): Full params, GC daemon, activity worker integration, circuit breaker
- Phase 3 (Enterprise): exec/logs/inspect, network policies, compose support
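
A minimal sketch of the Layer 1 cleanup path (GC scenarios 1 to 3: normal exit, failure, and timeout), to make the intended try/finally behaviour concrete. The names `wait_and_cleanup`, `ContainerLike`, and `timeout_errors` are hypothetical and not part of the engine. The sketch assumes the container object exposes `wait`, `stop`, and `remove` the way the docker Python SDK's `Container` does, where `stop(timeout=N)` sends SIGTERM and escalates to SIGKILL after N seconds. Which exception signals a wait timeout is left as a parameter, since that depends on the client in use (the docker SDK documents `requests.exceptions.ReadTimeout` for `wait`).

```python
from typing import Any, Dict, Optional, Protocol, Tuple, Type


class ContainerLike(Protocol):
    """Hypothetical subset of the docker SDK Container interface used below."""

    def wait(self, timeout: Optional[float] = None) -> Dict[str, Any]: ...
    def stop(self, timeout: int = 10) -> None: ...
    def remove(self, force: bool = False) -> None: ...


def wait_and_cleanup(
    container: ContainerLike,
    timeout: float = 3600,
    grace: int = 10,
    timeout_errors: Tuple[Type[BaseException], ...] = (TimeoutError,),
) -> Dict[str, Any]:
    """Wait for a container to exit, enforce the timeout, and always remove it.

    Assumption: `timeout_errors` lists the exception types the client raises
    when `wait` exceeds `timeout`; pass the SDK's own type when using docker.
    """
    result: Dict[str, Any] = {"exit_code": None, "timed_out": False}
    try:
        try:
            status = container.wait(timeout=timeout)
            # Non-zero exit codes are returned, not raised; the caller logs them.
            result["exit_code"] = status.get("StatusCode")
        except timeout_errors:
            result["timed_out"] = True
            # SIGTERM, then SIGKILL once the grace period has elapsed.
            container.stop(timeout=grace)
    finally:
        # Runs on success, failure, timeout, and unexpected errors alike.
        container.remove(force=True)
    return result
```

If the container was started with `auto_remove=True`, the final `remove` may find it already gone; a real implementation would likely treat a "not found" error there as success. Worker crashes bypass `finally` entirely, which is why the issue also requires the Layer 2 shutdown handler and the Layer 3 GC daemon.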