ID Title Status Priority Created Due Date Actions
#728 Tiltfile + docker-compose for local dev with hot reload
## Task Create Tiltfile and docker-compose.dev.yml for local development with hot reload. ## Requir...
closed high 2025-12-28 17:50 -
#727 DockerHub private repo setup for highway images
## Task Set up private DockerHub repository for highway images. ## Details - DockerHub username: rod...
closed high 2025-12-28 17:49 -
#726 EPIC: Deployment Infrastructure Overhaul - Tilt + Docker Compose + Production Split
## Overview Restructure deployment infrastructure to separate production and development environment...
closed high 2025-12-28 17:49 -
#725 CRITICAL: Zombie transactions block task recovery when worker pods crash
## Summary When a worker pod crashes during task execution, the PostgreSQL connection may stay open ...
closed critical 2025-12-28 16:45 -
#724
## Problem
Two issues with worker signal handling prevent proper graceful shutdown:

### Issue 1: Delayed Shutdown Response
- Signal handler sets `_shutdown_requested = True`
- But `wakeup_event` is a local variable in `worker_command()`
- Signal handler can't access it to wake up blocked `Event.wait()`
- Worker waits up to `poll_interval` (1-5s) before noticing SIGTERM

### Issue 2: In-Flight Tasks Abandoned
- After main loop breaks, `finally` block stops services immediately
- No wait for tasks in `orchestrator.bulkhead` to complete
- Tasks get killed mid-execution
- They stay claimed until heartbeat timeout, then re-execute from scratch

## Impact
- K8s rolling updates will kill tasks mid-execution
- Wasted compute from task re-execution
- Potential data corruption for non-idempotent tasks

## Solution
1. Make `wakeup_event` module-level so signal handler can access it
2. Add graceful drain logic like activity_worker has:
   - Wait up to 30s for in-flight tasks
   - Call `bulkhead.shutdown(wait=True)`

## Files
- engine/cli/worker.py

## Testing
- Send SIGTERM to worker with active tasks
- Verify tasks complete before worker exits
- Verify shutdown happens within seconds, not waiting for poll_interval

closed high 2025-12-28 00:21 -
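The solution proposed in #724 might look roughly like the sketch below. It is not the code in `engine/cli/worker.py`: `run_worker`, `claim_task`, `install_signal_handlers`, and the `ThreadPoolExecutor` standing in for `orchestrator.bulkhead` are all assumptions made for illustration. The sketch keeps the wakeup event at module level so the handler can set it, and it waits up to a drain timeout for in-flight tasks before shutting the pool down.

```python
import signal
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Module-level so the signal handler can reach them (names are illustrative).
wakeup_event = threading.Event()
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request and wake the polling loop right away."""
    shutdown_requested.set()
    wakeup_event.set()


def install_signal_handlers():
    """Register the handler; signal.signal() must run in the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    signal.signal(signal.SIGINT, handle_sigterm)


def run_worker(pool, claim_task, poll_interval=5.0, drain_timeout=30.0):
    """Poll for work until shutdown, then give in-flight tasks time to finish.

    `pool` stands in for orchestrator.bulkhead and `claim_task` is a
    hypothetical callable returning a task callable or None.
    Returns the number of tasks still running when the drain timed out.
    """
    in_flight = []
    try:
        while not shutdown_requested.is_set():
            task = claim_task()
            if task is not None:
                in_flight.append(pool.submit(task))
            # Returns early when the signal handler sets the event,
            # instead of sleeping for the full poll_interval.
            wakeup_event.wait(timeout=poll_interval)
            wakeup_event.clear()
    finally:
        # Drain: wait up to drain_timeout for in-flight tasks to complete.
        done, not_done = wait(in_flight, timeout=drain_timeout)
        # Block on shutdown only when everything finished in time, so a
        # stuck task cannot hold the process past the drain window.
        pool.shutdown(wait=not not_done)
    return len(not_done)
```

One design choice here differs slightly from the issue text: the pool is shut down with `wait=True` only when the drain finished in time, because an unconditional blocking shutdown would let a hung task outlive the 30s window.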
#723
## Objective
Deploy Highway Workflow Engine on Kubernetes for production use.

## Current State
- Docker Compose: Working (local dev)
- Kubernetes: Not supported

## Required Changes

### 1. Secrets Management
**Current**: .env file with VAULT_TOKEN_ADMIN, Vault client reads tokens from env

**K8s Options**:
- Option A: Vault Agent sidecar (recommended) - K8s auth method, secrets injected as files
- Option B: K8s Secrets + External Secrets Operator - map K8s secrets to config paths
- Option C: Environment variable injection from K8s Secrets

**Code Changes Needed**:
- Support file-based secrets at /vault/secrets/* path
- Support HIGHWAY_* env var overrides for config values
- Graceful fallback chain: env vars → file secrets → Vault API

### 2. Storage
**Current**: Local filesystem via bind mounts
- /app/artifacts/ - workflow artifacts
- /app/highway-test-logs/ - datashard logs
- /app/highway-test-logs/uploads/ - file uploads

**K8s Options**:
- Option A: S3/MinIO (recommended for multi-replica) - re-enable S3StorageProvider with IAM auth
- Option B: PersistentVolumeClaim with ReadWriteMany (requires NFS/EFS)
- Option C: PVC per worker with node affinity (limits scaling)

**Code Changes Needed**:
- Re-enable S3 provider in s3_provider.py
- Add IAM/IRSA authentication for S3
- Config: storage_type = auto (detect S3 creds, fall back to local)

### 3. Docker-in-Docker Sandboxing
**Current**: Mounts /var/run/docker.sock for python_sandbox

**K8s Options**:
- Option A: Disable sandboxing (python_sandbox.mode = disabled) - acceptable for trusted tenants
- Option B: DinD sidecar container per worker pod
- Option C: Kaniko/Tekton for isolated execution
- Option D: gVisor/Kata for pod-level isolation

### 4. Configuration Delivery
**Current**: Bind-mounted config.ini from host

**K8s**: ConfigMap mounted as /etc/highway/config.ini

**Code Changes Needed**:
- Support HIGHWAY_DATABASE_HOST style env var overrides
- Environment variables take precedence over config.ini

### 5. Service Discovery
**Current**: Docker DNS (postgres, api, ollama, dsl-compiler)

**K8s**: K8s Service DNS - same pattern, just need Service manifests

### 6. Database
**Current**: Docker PostgreSQL container

**K8s Options**:
- Managed DB: RDS, CloudSQL, Azure Database (recommended)
- StatefulSet with PVC (self-managed)

Required PostgreSQL extensions: uuid-ossp, pgcrypto
Required databases: highway_db_v2, resilient_circuit_db

## Deliverables

### Helm Chart Structure
```
highway/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── configmap.yaml    # config.ini
│   ├── secrets.yaml      # JWT, DB password, encryption key
│   ├── deployment-api.yaml
│   ├── deployment-worker.yaml
│   ├── deployment-activity-worker.yaml
│   ├── deployment-internal-worker.yaml
│   ├── deployment-dsl-compiler.yaml
│   ├── deployment-ollama.yaml (optional)
│   ├── service-api.yaml
│   ├── service-dsl-compiler.yaml
│   ├── ingress.yaml
│   ├── pvc.yaml (if not using S3)
│   └── hpa.yaml (horizontal pod autoscaler)
```

### Services to Deploy
| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| api | Deployment | 1+ | Ingress, port 7822 |
| worker | Deployment | 2+ | HPA based on queue depth |
| activity-worker | Deployment | 1+ | |
| internal-worker | Deployment | 2 | Async logging |
| dsl-compiler | Deployment | 1+ | Isolated, no DB access |
| ollama | Deployment | 0-1 | Optional, GPU preferred |
| postgres | External/StatefulSet | 1 | Prefer managed DB |

### Health Checks (from docker-compose)
- API: GET /api/v1/health
- Workers: Jumper heartbeat mechanism
- DSL Compiler: GET /health

## Migration Path
1. Create Helm chart structure
2. Implement config env var overrides
3. Implement file-based secrets support
4. Re-enable S3 with IAM auth (or configure PVC)
5. Deploy to K8s cluster
6. Debug and iterate
7. Document deployment process

## Testing
- Deploy all services
- Run platform bootstrap workflow
- Run demo workflows (v2, disaster, matrix)
- Verify multi-replica worker scaling
- Test pod restart recovery

## References
- Issue #721: Local storage support (completed)
- Issue #722: JoinMode consistency (completed)
- docker-compose.yml: Current service definitions
- docker/config.ini: Configuration reference

closed high 2025-12-27 23:38 -
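The fallback chain named in #723 (env vars → file secrets → Vault API) could be sketched as below. This is a minimal illustration, not the engine's code: `resolve_secret`, the `vault_lookup` callable, and the one-file-per-secret layout under `/vault/secrets` are assumptions. Only the `HIGHWAY_` prefix and the `/vault/secrets/*` path come from the issue.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_secret(
    name: str,
    secrets_dir: str = "/vault/secrets",
    vault_lookup: Optional[Callable[[str], Optional[str]]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve a secret by trying env vars, then files, then Vault.

    `vault_lookup` is a hypothetical callable wrapping whatever Vault
    client the application already uses; it is only called as a last resort.
    """
    env = os.environ if env is None else env

    # 1. Environment variable override, e.g. HIGHWAY_DATABASE_PASSWORD.
    env_key = "HIGHWAY_" + name.upper()
    if env_key in env:
        return env[env_key]

    # 2. File-based secret (assumed layout: one file per secret name).
    path = Path(secrets_dir) / name
    if path.is_file():
        return path.read_text().strip()

    # 3. Vault API, if a lookup function was provided.
    if vault_lookup is not None:
        return vault_lookup(name)

    return None
```

Passing `env` and `secrets_dir` as parameters keeps the function testable without touching the real process environment or filesystem paths.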
#722 Refactor wait_for_parallel_branches to use JoinMode for consistency
Currently wait_for_parallel_branches uses ad-hoc fail_on_error=True boolean while JoinOperator has p...
closed high 2025-12-27 21:09 -
#721 Workers become zombies after asyncio crash - container healthy but all threads dead
OBSERVED: Workers can crash with asyncio.exceptions.CancelledError during HTTP operations, causing a...
closed critical 2025-12-27 03:27 -
#720 Sandbox temp files fail in Docker-in-Docker: can't find __main__ module
FIXED: When running sandboxed code execution (tools.code.exec) inside Docker containers with Docker ...
closed high 2025-12-27 03:27 -
#719 Parallel branch retry creates orphaned waits due to absurd_run_id in join_event_name
FIXED: When a workflow with parallel branches is retried (e.g., due to worker crash), the join_event...
closed critical 2025-12-27 02:52 -
#718 Branch tasks now inherit retry policy from ParallelOperator
Fixed branch execution tasks to inherit max_attempts and retry_strategy from the ParallelOperator's ...
closed high 2025-12-26 23:58 -
#717 Branch tasks now retry on worker crash
Fixed branch execution tasks to have max_attempts=3 with exponential backoff (10s, 30s). Previously ...
closed high 2025-12-26 23:53 -
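As an illustration of the retry schedule mentioned in #717, the sketch below reproduces the 10s and 30s delays for `max_attempts=3`. The function name `retry_delays` and the assumption that the schedule is geometric with base 10s and factor 3 are illustrative only; the issue summary states just the resulting delays.

```python
def retry_delays(
    max_attempts: int = 3,
    base_seconds: float = 10.0,  # assumed base delay
    factor: float = 3.0,         # assumed multiplier between retries
) -> list[float]:
    """Return the delay before each retry; the first attempt runs immediately."""
    return [base_seconds * factor ** i for i in range(max_attempts - 1)]


print(retry_delays())  # [10.0, 30.0]
```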
#716 Skip propagation bug in switch/condition operators
When switch/condition marks unselected branches as executed, downstream tasks depending on skipped b...
closed high 2025-12-26 16:17 -
#715 BUG: python_task.py swallows AbsurdSleepError in sandbox fallback path
In engine/tools/python_task.py lines 373-376, the except Exception block catches AbsurdSleepError wh...
closed critical 2025-12-26 00:06 -
#714 Add generic configurable update handler tool
Current state: The tools.testing.update_handler is hardcoded to handle 'approve_request' updates wit...
closed high 2025-12-25 21:46 -
#713 ReflexiveOperator completes workflow before activity finishes
ReflexiveOperator calls execute_activity_operator which is queue-only, but doesn't wait for the acti...
closed high 2025-12-25 21:02 -
#712 Agentic Bot Demo - Conference Presentation
## Overview Implement a full agentic conversational bot demo for conference presentation, showcasin...
closed high 2025-12-25 18:32 -
#711 CFG-01: No schema validation for configuration
Location: engine/config.py. Issue: ConfigParser used without schema validation. No type/range/requir...
closed high 2025-12-25 02:56 -
#710 VAL-05: JoinOperator join_tasks not validated
Location: highway_dsl/workflow_dsl.py. Issue: JoinOperator.join_tasks references not validated again...
closed high 2025-12-25 02:56 -
#709 TEST-01: Missing unit tests for core orchestrator components
Location: tests/unit/. Issue: No unit tests for orchestrator.py, durable_context.py, absurd_client.p...
closed high 2025-12-25 02:56 -