ID Title Status Priority Created Due Date Actions
#740 Verify Vault has LLM API keys for K8s deployment
After adding LLM API keys to Helm templates (api-deployment.yaml, worker-deployment.yaml), need to v...
closed medium 2025-12-29 09:36 -
#739 API auto-extracts app metadata from source code
Fixed bad API design where UI had to parse Python to extract class_name and actions. API now auto-ex...
closed medium 2025-12-29 06:26 -
#738 Shell security check incorrectly rejects commands in K8s pods
When workers run inside K8s pods (not using Docker-in-Docker sandbox), shell commands with && or || ...
closed high 2025-12-29 04:23 -
#737 Enable network access in sandbox containers after Sysbox deployment
After Sysbox+DinD is deployed on srv2 workers, update sandbox to allow network access. Current: netw...
closed medium 2025-12-29 01:24 -
#736 Event and IPC tests failing in production
Event/workflow tests fail with 'failed' status. Tests: test_standard_sleep_wake_event, test_final_ip...
closed high 2025-12-29 01:00 -
#735 Datashard logging tests failing in production
3 datashard logging tests fail because workflows complete with 'failed' status. Tests: test_simple_w...
closed medium 2025-12-29 01:00 -
#734 Circuit breaker returns UNKNOWN state in production
Circuit breaker tests fail because state returns 'UNKNOWN' instead of 'OPEN'. Tests: test_circuit_br...
closed low 2025-12-29 01:00 -
#733 Artifact workflow tests stuck in pending state
2 artifact workflow tests stuck in 'pending' state after 30s timeout. Tests: test_workflow_with_arti...
closed high 2025-12-29 01:00 -
#732 Docker tools tests failing in production K8s
All 9 docker tools integration tests fail with 'failed' status on production K8s. Workers likely don...
closed medium 2025-12-29 01:00 -
#731 Production ansible playbook updates for multi-cluster deployment
## Task Update ansible playbooks for production deployment across both clusters. ## Current State -...
closed high 2025-12-28 17:50 -
#730 Domain setup: highway.solutions + tilt.highway.rodmena.app
## Domains to configure ### Production: highway.solutions - Points to: Production API + Dashboard -...
closed high 2025-12-28 17:50 -
#729 Vault/secrets handling for docker-compose dev environment
## Problem Production uses Vault sidecar injection in K8s: - vault.hashicorp.com/agent-inject annota...
closed high 2025-12-28 17:50 -
#728 Tiltfile + docker-compose for local dev with hot reload
## Task Create Tiltfile and docker-compose.dev.yml for local development with hot reload. ## Requir...
closed high 2025-12-28 17:50 -
#727 DockerHub private repo setup for highway images
## Task Setup private DockerHub repository for highway images. ## Details - DockerHub username: rod...
closed high 2025-12-28 17:49 -
#726 EPIC: Deployment Infrastructure Overhaul - Tilt + Docker Compose + Production Split
## Overview Restructure deployment infrastructure to separate production and development environment...
closed high 2025-12-28 17:49 -
#725 CRITICAL: Zombie transactions block task recovery when worker pods crash
## Summary When a worker pod crashes during task execution, the PostgreSQL connection may stay open ...
closed critical 2025-12-28 16:45 -
#724 Worker signal handling prevents proper graceful shutdown
## Problem
Two issues with worker signal handling prevent proper graceful shutdown:

### Issue 1: Delayed Shutdown Response
- Signal handler sets `_shutdown_requested = True`
- But `wakeup_event` is a local variable in `worker_command()`
- Signal handler can't access it to wake up blocked `Event.wait()`
- Worker waits up to `poll_interval` (1-5s) before noticing SIGTERM

### Issue 2: In-Flight Tasks Abandoned
- After main loop breaks, `finally` block stops services immediately
- No wait for tasks in `orchestrator.bulkhead` to complete
- Tasks get killed mid-execution
- They stay claimed until heartbeat timeout, then re-execute from scratch

## Impact
- K8s rolling updates will kill tasks mid-execution
- Wasted compute from task re-execution
- Potential data corruption for non-idempotent tasks

## Solution
1. Make `wakeup_event` module-level so signal handler can access it
2. Add graceful drain logic like activity_worker has:
   - Wait up to 30s for in-flight tasks
   - Call `bulkhead.shutdown(wait=True)`

## Files
- engine/cli/worker.py

## Testing
- Send SIGTERM to worker with active tasks
- Verify tasks complete before worker exits
- Verify shutdown happens within seconds, not waiting for poll_interval
closed high 2025-12-28 00:21 -
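The two fixes proposed in #724 can be sketched together. This is a minimal, hypothetical sketch, not the actual engine/cli/worker.py: the names `wakeup_event`, `_shutdown_requested`, `worker_command`, and `poll_interval` come from the issue text, while the `_in_flight` counter and the drain loop are illustrative stand-ins for `orchestrator.bulkhead`.

```python
import signal
import threading
import time

# Module-level shutdown state, so the signal handler can reach it.
# (Issue 1: a local Event inside worker_command() is invisible to the
# handler, forcing a full poll_interval wait before noticing SIGTERM.)
_shutdown_requested = False
wakeup_event = threading.Event()

# Hypothetical stand-in for orchestrator.bulkhead's in-flight task count.
_in_flight = 0


def _handle_sigterm(signum, frame):
    global _shutdown_requested
    _shutdown_requested = True
    wakeup_event.set()  # wake any blocked Event.wait() immediately


# Registration must happen in the main thread.
signal.signal(signal.SIGTERM, _handle_sigterm)


def worker_command(poll_interval=5.0, drain_timeout=30.0):
    """Main poll loop with graceful drain, per the issue's solution."""
    while not _shutdown_requested:
        # ... claim and dispatch tasks here ...
        wakeup_event.wait(timeout=poll_interval)  # returns early on SIGTERM
        wakeup_event.clear()
    # Issue 2 fix: give in-flight tasks up to drain_timeout seconds to
    # finish instead of killing them mid-execution.
    deadline = time.monotonic() + drain_timeout
    while _in_flight > 0 and time.monotonic() < deadline:
        time.sleep(0.1)
    # In the real worker, bulkhead.shutdown(wait=True) would go here.
```

With this shape, SIGTERM both flips the flag and sets the event, so the loop exits within milliseconds rather than waiting out `poll_interval`, matching the issue's testing criteria.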
#723 Deploy Highway Workflow Engine on Kubernetes for production use
## Objective
Deploy Highway Workflow Engine on Kubernetes for production use.

## Current State
- Docker Compose: Working (local dev)
- Kubernetes: Not supported

## Required Changes

### 1. Secrets Management
**Current**: .env file with VAULT_TOKEN_ADMIN, Vault client reads tokens from env

**K8s Options**:
- Option A: Vault Agent sidecar (recommended) - K8s auth method, secrets injected as files
- Option B: K8s Secrets + External Secrets Operator - map K8s secrets to config paths
- Option C: Environment variable injection from K8s Secrets

**Code Changes Needed**:
- Support file-based secrets at /vault/secrets/* path
- Support HIGHWAY_* env var overrides for config values
- Graceful fallback chain: env vars → file secrets → Vault API

### 2. Storage
**Current**: Local filesystem via bind mounts
- /app/artifacts/ - workflow artifacts
- /app/highway-test-logs/ - datashard logs
- /app/highway-test-logs/uploads/ - file uploads

**K8s Options**:
- Option A: S3/MinIO (recommended for multi-replica) - re-enable S3StorageProvider with IAM auth
- Option B: PersistentVolumeClaim with ReadWriteMany (requires NFS/EFS)
- Option C: PVC per worker with node affinity (limits scaling)

**Code Changes Needed**:
- Re-enable S3 provider in s3_provider.py
- Add IAM/IRSA authentication for S3
- Config: storage_type = auto (detect S3 creds, fall back to local)

### 3. Docker-in-Docker Sandboxing
**Current**: Mounts /var/run/docker.sock for python_sandbox

**K8s Options**:
- Option A: Disable sandboxing (python_sandbox.mode = disabled) - acceptable for trusted tenants
- Option B: DinD sidecar container per worker pod
- Option C: Kaniko/Tekton for isolated execution
- Option D: gVisor/Kata for pod-level isolation

### 4. Configuration Delivery
**Current**: Bind-mounted config.ini from host
**K8s**: ConfigMap mounted as /etc/highway/config.ini

**Code Changes Needed**:
- Support HIGHWAY_DATABASE_HOST style env var overrides
- Environment variables take precedence over config.ini

### 5. Service Discovery
**Current**: Docker DNS (postgres, api, ollama, dsl-compiler)
**K8s**: K8s Service DNS - same pattern, just need Service manifests

### 6. Database
**Current**: Docker PostgreSQL container

**K8s Options**:
- Managed DB: RDS, CloudSQL, Azure Database (recommended)
- StatefulSet with PVC (self-managed)

Required PostgreSQL extensions: uuid-ossp, pgcrypto
Required databases: highway_db_v2, resilient_circuit_db

## Deliverables

### Helm Chart Structure
```
highway/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── configmap.yaml                    # config.ini
│   ├── secrets.yaml                      # JWT, DB password, encryption key
│   ├── deployment-api.yaml
│   ├── deployment-worker.yaml
│   ├── deployment-activity-worker.yaml
│   ├── deployment-internal-worker.yaml
│   ├── deployment-dsl-compiler.yaml
│   ├── deployment-ollama.yaml            (optional)
│   ├── service-api.yaml
│   ├── service-dsl-compiler.yaml
│   ├── ingress.yaml
│   ├── pvc.yaml                          (if not using S3)
│   └── hpa.yaml                          (horizontal pod autoscaler)
```

### Services to Deploy
| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| api | Deployment | 1+ | Ingress, port 7822 |
| worker | Deployment | 2+ | HPA based on queue depth |
| activity-worker | Deployment | 1+ | |
| internal-worker | Deployment | 2 | Async logging |
| dsl-compiler | Deployment | 1+ | Isolated, no DB access |
| ollama | Deployment | 0-1 | Optional, GPU preferred |
| postgres | External/StatefulSet | 1 | Prefer managed DB |

### Health Checks (from docker-compose)
- API: GET /api/v1/health
- Workers: Jumper heartbeat mechanism
- DSL Compiler: GET /health

## Migration Path
1. Create Helm chart structure
2. Implement config env var overrides
3. Implement file-based secrets support
4. Re-enable S3 with IAM auth (or configure PVC)
5. Deploy to K8s cluster
6. Debug and iterate
7. Document deployment process

## Testing
- Deploy all services
- Run platform bootstrap workflow
- Run demo workflows (v2, disaster, matrix)
- Verify multi-replica worker scaling
- Test pod restart recovery

## References
- Issue #721: Local storage support (completed)
- Issue #722: JoinMode consistency (completed)
- docker-compose.yml: Current service definitions
- docker/config.ini: Configuration reference
closed high 2025-12-27 23:38 -
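The "graceful fallback chain" called for in the Secrets Management section of #723 (env vars → file secrets → Vault API) could look roughly like this. It is a sketch under assumptions, not Highway's actual config code: `resolve_config`, the `HIGHWAY_SECRETS_DIR` override, and the `vault_client` callable are hypothetical names; only the `HIGHWAY_*` naming convention and the `/vault/secrets` path come from the issue.

```python
import os
from pathlib import Path


def resolve_config(key, vault_client=None, default=None):
    """Resolve a config value, e.g. key='database_host', in order:
    1. HIGHWAY_DATABASE_HOST environment variable (highest precedence)
    2. /vault/secrets/database_host file injected by the Vault Agent sidecar
    3. live Vault API lookup via the optional vault_client callable
    4. caller-supplied default
    """
    env_name = "HIGHWAY_" + key.upper()
    if env_name in os.environ:
        return os.environ[env_name]
    # HIGHWAY_SECRETS_DIR is a hypothetical override to ease local testing.
    secrets_dir = Path(os.environ.get("HIGHWAY_SECRETS_DIR", "/vault/secrets"))
    secret_file = secrets_dir / key
    if secret_file.is_file():
        return secret_file.read_text().strip()
    if vault_client is not None:
        value = vault_client(key)
        if value is not None:
            return value
    return default
```

Keeping env vars at the top of the chain satisfies both section 1 and section 4 of the issue ("Environment variables take precedence over config.ini") with a single lookup path, and the same function works unchanged under docker-compose (env vars), under the Vault Agent sidecar (injected files), and with a direct Vault client fallback.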
#722 Refactor wait_for_parallel_branches to use JoinMode for consistency
Currently wait_for_parallel_branches uses ad-hoc fail_on_error=True boolean while JoinOperator has p...
closed high 2025-12-27 21:09 -
#721 Workers become zombies after asyncio crash - container healthy but all threads dead
OBSERVED: Workers can crash with asyncio.exceptions.CancelledError during HTTP operations, causing a...
closed critical 2025-12-27 03:27 -
Page 5 of 40