| #740 |
Verify Vault has LLM API keys for K8s deployment
After adding LLM API keys to Helm templates (api-deployment.yaml, worker-deployment.yaml), need to v...
|
closed |
medium |
2025-12-29 09:36 |
- |
|
| #739 |
API auto-extracts app metadata from source code
Fixed bad API design where UI had to parse Python to extract class_name and actions. API now auto-ex...
|
closed |
medium |
2025-12-29 06:26 |
- |
|
| #738 |
Shell security check incorrectly rejects commands in K8s pods
When workers run inside K8s pods (not using Docker-in-Docker sandbox), shell commands with && or || ...
|
closed |
high |
2025-12-29 04:23 |
- |
|
| #737 |
Enable network access in sandbox containers after Sysbox deployment
After Sysbox+DinD is deployed on srv2 workers, update sandbox to allow network access. Current: netw...
|
closed |
medium |
2025-12-29 01:24 |
- |
|
| #736 |
Event and IPC tests failing in production
Event/workflow tests fail with 'failed' status. Tests: test_standard_sleep_wake_event, test_final_ip...
|
closed |
high |
2025-12-29 01:00 |
- |
|
| #735 |
Datashard logging tests failing in production
3 datashard logging tests fail because workflows complete with 'failed' status. Tests: test_simple_w...
|
closed |
medium |
2025-12-29 01:00 |
- |
|
| #734 |
Circuit breaker returns UNKNOWN state in production
Circuit breaker tests fail because state returns 'UNKNOWN' instead of 'OPEN'. Tests: test_circuit_br...
|
closed |
low |
2025-12-29 01:00 |
- |
|
| #733 |
Artifact workflow tests stuck in pending state
2 artifact workflow tests stuck in 'pending' state after 30s timeout. Tests: test_workflow_with_arti...
|
closed |
high |
2025-12-29 01:00 |
- |
|
| #732 |
Docker tools tests failing in production K8s
All 9 docker tools integration tests fail with 'failed' status on production K8s. Workers likely don...
|
closed |
medium |
2025-12-29 01:00 |
- |
|
| #731 |
Production ansible playbook updates for multi-cluster deployment
## Task
Update ansible playbooks for production deployment across both clusters.
## Current State
-...
|
closed |
high |
2025-12-28 17:50 |
- |
|
| #730 |
Domain setup: highway.solutions + tilt.highway.rodmena.app
## Domains to configure
### Production: highway.solutions
- Points to: Production API + Dashboard
-...
|
closed |
high |
2025-12-28 17:50 |
- |
|
| #729 |
Vault/secrets handling for docker-compose dev environment
## Problem
Production uses Vault sidecar injection in K8s:
- vault.hashicorp.com/agent-inject annota...
|
closed |
high |
2025-12-28 17:50 |
- |
|
| #728 |
Tiltfile + docker-compose for local dev with hot reload
## Task
Create Tiltfile and docker-compose.dev.yml for local development with hot reload.
## Requir...
|
closed |
high |
2025-12-28 17:50 |
- |
|
| #727 |
DockerHub private repo setup for highway images
## Task
Set up a private DockerHub repository for highway images.
## Details
- DockerHub username: rod...
|
closed |
high |
2025-12-28 17:49 |
- |
|
| #726 |
EPIC: Deployment Infrastructure Overhaul - Tilt + Docker Compose + Production Split
## Overview
Restructure deployment infrastructure to separate production and development environment...
|
closed |
high |
2025-12-28 17:49 |
- |
|
| #725 |
CRITICAL: Zombie transactions block task recovery when worker pods crash
## Summary
When a worker pod crashes during task execution, the PostgreSQL connection may stay open ...
|
closed |
critical |
2025-12-28 16:45 |
- |
|
| #724 |
## Problem
Two issues with worker signal handling prevent proper graceful shutdown:
### Issue 1: Delayed Shutdown Response
- Signal handler sets `_shutdown_requested = True`
- But `wakeup_event` is a local variable in `worker_command()`
- Signal handler can't access it to wake up blocked `Event.wait()`
- Worker waits up to `poll_interval` (1-5s) before noticing SIGTERM
### Issue 2: In-Flight Tasks Abandoned
- After main loop breaks, `finally` block stops services immediately
- No wait for tasks in `orchestrator.bulkhead` to complete
- Tasks get killed mid-execution
- They stay claimed until heartbeat timeout, then re-execute from scratch
## Impact
- K8s rolling updates will kill tasks mid-execution
- Wasted compute from task re-execution
- Potential data corruption for non-idempotent tasks
## Solution
1. Make `wakeup_event` module-level so signal handler can access it
2. Add graceful drain logic like activity_worker has:
- Wait up to 30s for in-flight tasks
- Call `bulkhead.shutdown(wait=True)`
## Files
- engine/cli/worker.py
## Testing
- Send SIGTERM to worker with active tasks
- Verify tasks complete before worker exits
- Verify shutdown happens within seconds, not waiting for poll_interval
|
closed |
high |
2025-12-28 00:21 |
- |
|
| #723 |
## Objective
Deploy Highway Workflow Engine on Kubernetes for production use.
## Current State
- Docker Compose: Working (local dev)
- Kubernetes: Not supported
## Required Changes
### 1. Secrets Management
**Current**: .env file with VAULT_TOKEN_ADMIN, Vault client reads tokens from env
**K8s Options**:
- Option A: Vault Agent sidecar (recommended) - K8s auth method, secrets injected as files
- Option B: K8s Secrets + External Secrets Operator - map K8s secrets to config paths
- Option C: Environment variable injection from K8s Secrets
**Code Changes Needed**:
- Support file-based secrets at /vault/secrets/* path
- Support HIGHWAY_* env var overrides for config values
- Graceful fallback chain: env vars → file secrets → Vault API
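The fallback chain above could look roughly like this. The `HIGHWAY_*` prefix and `/vault/secrets/` path come from the issue text; `get_secret` and the `vault_client` interface are illustrative, not an existing API.

```python
import os
from pathlib import Path

VAULT_SECRETS_DIR = Path("/vault/secrets")

def get_secret(name, vault_client=None):
    """Resolve a secret: env var -> injected file -> Vault API."""
    # 1. HIGHWAY_* environment variable override
    env_val = os.environ.get(f"HIGHWAY_{name.upper()}")
    if env_val is not None:
        return env_val
    # 2. File injected by the Vault Agent sidecar
    secret_file = VAULT_SECRETS_DIR / name
    if secret_file.is_file():
        return secret_file.read_text().strip()
    # 3. Graceful fallback to a direct Vault API read
    if vault_client is not None:
        return vault_client.read(name)
    raise KeyError(f"secret {name!r} not found in env, files, or Vault")
```

The same lookup works unchanged under Option A (sidecar writes the file), Option C (K8s Secret sets the env var), or plain Docker Compose (Vault API fallback).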
### 2. Storage
**Current**: Local filesystem via bind mounts
- /app/artifacts/ - workflow artifacts
- /app/highway-test-logs/ - datashard logs
- /app/highway-test-logs/uploads/ - file uploads
**K8s Options**:
- Option A: S3/MinIO (recommended for multi-replica) - re-enable S3StorageProvider with IAM auth
- Option B: PersistentVolumeClaim with ReadWriteMany (requires NFS/EFS)
- Option C: PVC per worker with node affinity (limits scaling)
**Code Changes Needed**:
- Re-enable S3 provider in s3_provider.py
- Add IAM/IRSA authentication for S3
- Config: storage_type = auto (detect S3 creds, fall back to local)
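A sketch of the `storage_type = auto` detection, assuming the standard AWS credential env vars (static keys, or the web-identity token file that IRSA injects) are the signal for S3 availability:

```python
import os

def detect_storage_type(configured="auto"):
    """Pick 's3' when AWS credentials are present, else fall back to 'local'."""
    if configured != "auto":
        return configured  # explicit config wins
    has_s3_creds = (
        "AWS_ACCESS_KEY_ID" in os.environ
        or "AWS_WEB_IDENTITY_TOKEN_FILE" in os.environ  # IRSA-injected token
    )
    return "s3" if has_s3_creds else "local"
```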
### 3. Docker-in-Docker Sandboxing
**Current**: Mounts /var/run/docker.sock for python_sandbox
**K8s Options**:
- Option A: Disable sandboxing (python_sandbox.mode = disabled) - acceptable for trusted tenants
- Option B: DinD sidecar container per worker pod
- Option C: Kaniko/Tekton for isolated execution
- Option D: gVisor/Kata for pod-level isolation
### 4. Configuration Delivery
**Current**: Bind-mounted config.ini from host
**K8s**: ConfigMap mounted as /etc/highway/config.ini
**Code Changes Needed**:
- Support HIGHWAY_DATABASE_HOST style env var overrides
- Environment variables take precedence over config.ini
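The precedence rule could be implemented with a thin wrapper over `configparser`, assuming a `HIGHWAY_<SECTION>_<KEY>` naming convention (the `HIGHWAY_DATABASE_HOST` style named above); the helper name is illustrative.

```python
import configparser
import os

def config_get(cfg: configparser.ConfigParser, section: str, key: str):
    """Read a config value, letting HIGHWAY_<SECTION>_<KEY> env vars win."""
    env_name = f"HIGHWAY_{section.upper()}_{key.upper()}"
    if env_name in os.environ:
        return os.environ[env_name]  # env var takes precedence
    return cfg.get(section, key, fallback=None)  # else config.ini (ConfigMap)
```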
### 5. Service Discovery
**Current**: Docker DNS (postgres, api, ollama, dsl-compiler)
**K8s**: K8s Service DNS - same pattern, just need Service manifests
### 6. Database
**Current**: Docker PostgreSQL container
**K8s Options**:
- Managed DB: RDS, CloudSQL, Azure Database (recommended)
- StatefulSet with PVC (self-managed)
Required PostgreSQL extensions: uuid-ossp, pgcrypto
Required databases: highway_db_v2, resilient_circuit_db
## Deliverables
### Helm Chart Structure
```
highway/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── configmap.yaml                  # config.ini
│   ├── secrets.yaml                    # JWT, DB password, encryption key
│   ├── deployment-api.yaml
│   ├── deployment-worker.yaml
│   ├── deployment-activity-worker.yaml
│   ├── deployment-internal-worker.yaml
│   ├── deployment-dsl-compiler.yaml
│   ├── deployment-ollama.yaml          (optional)
│   ├── service-api.yaml
│   ├── service-dsl-compiler.yaml
│   ├── ingress.yaml
│   ├── pvc.yaml                        (if not using S3)
│   └── hpa.yaml                        (horizontal pod autoscaler)
```
### Services to Deploy
| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| api | Deployment | 1+ | Ingress, port 7822 |
| worker | Deployment | 2+ | HPA based on queue depth |
| activity-worker | Deployment | 1+ | |
| internal-worker | Deployment | 2 | Async logging |
| dsl-compiler | Deployment | 1+ | Isolated, no DB access |
| ollama | Deployment | 0-1 | Optional, GPU preferred |
| postgres | External/StatefulSet | 1 | Prefer managed DB |
### Health Checks (from docker-compose)
- API: GET /api/v1/health
- Workers: Jumper heartbeat mechanism
- DSL Compiler: GET /health
## Migration Path
1. Create Helm chart structure
2. Implement config env var overrides
3. Implement file-based secrets support
4. Re-enable S3 with IAM auth (or configure PVC)
5. Deploy to K8s cluster
6. Debug and iterate
7. Document deployment process
## Testing
- Deploy all services
- Run platform bootstrap workflow
- Run demo workflows (v2, disaster, matrix)
- Verify multi-replica worker scaling
- Test pod restart recovery
## References
- Issue #721: Local storage support (completed)
- Issue #722: JoinMode consistency (completed)
- docker-compose.yml: Current service definitions
- docker/config.ini: Configuration reference
|
closed |
high |
2025-12-27 23:38 |
- |
|
| #722 |
Refactor wait_for_parallel_branches to use JoinMode for consistency
Currently wait_for_parallel_branches uses ad-hoc fail_on_error=True boolean while JoinOperator has p...
|
closed |
high |
2025-12-27 21:09 |
- |
|
| #721 |
Workers become zombies after asyncio crash - container healthy but all threads dead
OBSERVED: Workers can crash with asyncio.exceptions.CancelledError during HTTP operations, causing a...
|
closed |
critical |
2025-12-27 03:27 |
- |
|