| #740 |
Verify Vault has LLM API keys for K8s deployment
After adding LLM API keys to Helm templates (api-deployment.yaml, worker-deployment.yaml), need to v...
|
closed |
medium |
2025-12-29 09:36 |
- |
|
| #739 |
API auto-extracts app metadata from source code
Fixed bad API design where UI had to parse Python to extract class_name and actions. API now auto-ex...
|
closed |
medium |
2025-12-29 06:26 |
- |
|
| #738 |
Shell security check incorrectly rejects commands in K8s pods
When workers run inside K8s pods (not using Docker-in-Docker sandbox), shell commands with && or || ...
|
closed |
high |
2025-12-29 04:23 |
- |
|
| #737 |
Enable network access in sandbox containers after Sysbox deployment
After Sysbox+DinD is deployed on srv2 workers, update sandbox to allow network access. Current: netw...
|
closed |
medium |
2025-12-29 01:24 |
- |
|
| #736 |
Event and IPC tests failing in production
Event/workflow tests fail with 'failed' status. Tests: test_standard_sleep_wake_event, test_final_ip...
|
closed |
high |
2025-12-29 01:00 |
- |
|
| #735 |
Datashard logging tests failing in production
3 datashard logging tests fail because workflows complete with 'failed' status. Tests: test_simple_w...
|
closed |
medium |
2025-12-29 01:00 |
- |
|
| #734 |
Circuit breaker returns UNKNOWN state in production
Circuit breaker tests fail because state returns 'UNKNOWN' instead of 'OPEN'. Tests: test_circuit_br...
|
closed |
low |
2025-12-29 01:00 |
- |
|
| #733 |
Artifact workflow tests stuck in pending state
2 artifact workflow tests stuck in 'pending' state after 30s timeout. Tests: test_workflow_with_arti...
|
closed |
high |
2025-12-29 01:00 |
- |
|
| #732 |
Docker tools tests failing in production K8s
All 9 docker tools integration tests fail with 'failed' status on production K8s. Workers likely don...
|
closed |
medium |
2025-12-29 01:00 |
- |
|
| #731 |
Production ansible playbook updates for multi-cluster deployment
## Task
Update ansible playbooks for production deployment across both clusters.
## Current State
-...
|
closed |
high |
2025-12-28 17:50 |
- |
|
| #730 |
Domain setup: highway.solutions + tilt.highway.rodmena.app
## Domains to configure
### Production: highway.solutions
- Points to: Production API + Dashboard
-...
|
closed |
high |
2025-12-28 17:50 |
- |
|
| #729 |
Vault/secrets handling for docker-compose dev environment
## Problem
Production uses Vault sidecar injection in K8s:
- vault.hashicorp.com/agent-inject annota...
|
closed |
high |
2025-12-28 17:50 |
- |
|
| #728 |
Tiltfile + docker-compose for local dev with hot reload
## Task
Create Tiltfile and docker-compose.dev.yml for local development with hot reload.
## Requir...
|
closed |
high |
2025-12-28 17:50 |
- |
|
| #727 |
DockerHub private repo setup for highway images
## Task
Set up a private DockerHub repository for highway images.
## Details
- DockerHub username: rod...
|
closed |
high |
2025-12-28 17:49 |
- |
|
| #726 |
EPIC: Deployment Infrastructure Overhaul - Tilt + Docker Compose + Production Split
## Overview
Restructure deployment infrastructure to separate production and development environment...
|
closed |
high |
2025-12-28 17:49 |
- |
|
| #725 |
CRITICAL: Zombie transactions block task recovery when worker pods crash
## Summary
When a worker pod crashes during task execution, the PostgreSQL connection may stay open ...
|
closed |
critical |
2025-12-28 16:45 |
- |
|
| #724 |
## Problem
Two issues with worker signal handling prevent proper graceful shutdown:
### Issue 1: Delayed Shutdown Response
- Signal handler sets `_shutdown_requested = True`
- But `wakeup_event` is a local variable in `worker_command()`
- Signal handler can't access it to wake up blocked `Event.wait()`
- Worker waits up to `poll_interval` (1-5s) before noticing SIGTERM
### Issue 2: In-Flight Tasks Abandoned
- After main loop breaks, `finally` block stops services immediately
- No wait for tasks in `orchestrator.bulkhead` to complete
- Tasks get killed mid-execution
- They stay claimed until heartbeat timeout, then re-execute from scratch
## Impact
- K8s rolling updates will kill tasks mid-execution
- Wasted compute from task re-execution
- Potential data corruption for non-idempotent tasks
## Solution
1. Make `wakeup_event` module-level so signal handler can access it
2. Add graceful drain logic like activity_worker has:
- Wait up to 30s for in-flight tasks
- Call `bulkhead.shutdown(wait=True)`
## Files
- engine/cli/worker.py
## Testing
- Send SIGTERM to worker with active tasks
- Verify tasks complete before worker exits
- Verify shutdown happens within seconds, not waiting for poll_interval
|
closed |
high |
2025-12-28 00:21 |
- |
|
| #723 |
## Objective
Deploy Highway Workflow Engine on Kubernetes for production use.
## Current State
- Docker Compose: Working (local dev)
- Kubernetes: Not supported
## Required Changes
### 1. Secrets Management
**Current**: .env file with VAULT_TOKEN_ADMIN, Vault client reads tokens from env
**K8s Options**:
- Option A: Vault Agent sidecar (recommended) - K8s auth method, secrets injected as files
- Option B: K8s Secrets + External Secrets Operator - map K8s secrets to config paths
- Option C: Environment variable injection from K8s Secrets
**Code Changes Needed**:
- Support file-based secrets at /vault/secrets/* path
- Support HIGHWAY_* env var overrides for config values
- Graceful fallback chain: env vars → file secrets → Vault API
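The fallback chain above could look roughly like this. The `HIGHWAY_*` prefix and `/vault/secrets/` path come from the issue text; `get_secret` and the `vault_client` interface are illustrative, not an existing API.

```python
import os
from pathlib import Path

VAULT_SECRETS_DIR = Path("/vault/secrets")

def get_secret(name, vault_client=None):
    """Resolve a secret: env var -> injected file -> Vault API."""
    # 1. HIGHWAY_* environment variable override
    env_val = os.environ.get(f"HIGHWAY_{name.upper()}")
    if env_val is not None:
        return env_val
    # 2. File injected by the Vault Agent sidecar
    secret_file = VAULT_SECRETS_DIR / name
    if secret_file.is_file():
        return secret_file.read_text().strip()
    # 3. Graceful fallback to a direct Vault API read
    if vault_client is not None:
        return vault_client.read(name)
    raise KeyError(f"secret {name!r} not found in env, files, or Vault")
```

The same lookup works unchanged under Option A (sidecar writes the file), Option C (K8s Secret sets the env var), or plain Docker Compose (Vault API fallback).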
### 2. Storage
**Current**: Local filesystem via bind mounts
- /app/artifacts/ - workflow artifacts
- /app/highway-test-logs/ - datashard logs
- /app/highway-test-logs/uploads/ - file uploads
**K8s Options**:
- Option A: S3/MinIO (recommended for multi-replica) - re-enable S3StorageProvider with IAM auth
- Option B: PersistentVolumeClaim with ReadWriteMany (requires NFS/EFS)
- Option C: PVC per worker with node affinity (limits scaling)
**Code Changes Needed**:
- Re-enable S3 provider in s3_provider.py
- Add IAM/IRSA authentication for S3
- Config: storage_type = auto (detect S3 creds, fall back to local)
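A sketch of the `storage_type = auto` detection, assuming the standard AWS credential env vars (static keys, or the web-identity token file that IRSA injects) are the signal for S3 availability:

```python
import os

def detect_storage_type(configured="auto"):
    """Pick 's3' when AWS credentials are present, else fall back to 'local'."""
    if configured != "auto":
        return configured  # explicit config wins
    has_s3_creds = (
        "AWS_ACCESS_KEY_ID" in os.environ
        or "AWS_WEB_IDENTITY_TOKEN_FILE" in os.environ  # IRSA-injected token
    )
    return "s3" if has_s3_creds else "local"
```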
### 3. Docker-in-Docker Sandboxing
**Current**: Mounts /var/run/docker.sock for python_sandbox
**K8s Options**:
- Option A: Disable sandboxing (python_sandbox.mode = disabled) - acceptable for trusted tenants
- Option B: DinD sidecar container per worker pod
- Option C: Kaniko/Tekton for isolated execution
- Option D: gVisor/Kata for pod-level isolation
### 4. Configuration Delivery
**Current**: Bind-mounted config.ini from host
**K8s**: ConfigMap mounted as /etc/highway/config.ini
**Code Changes Needed**:
- Support HIGHWAY_DATABASE_HOST style env var overrides
- Environment variables take precedence over config.ini
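The precedence rule could be implemented with a thin wrapper over `configparser`, assuming a `HIGHWAY_<SECTION>_<KEY>` naming convention (the `HIGHWAY_DATABASE_HOST` style named above); the helper name is illustrative.

```python
import configparser
import os

def config_get(cfg: configparser.ConfigParser, section: str, key: str):
    """Read a config value, letting HIGHWAY_<SECTION>_<KEY> env vars win."""
    env_name = f"HIGHWAY_{section.upper()}_{key.upper()}"
    if env_name in os.environ:
        return os.environ[env_name]  # env var takes precedence
    return cfg.get(section, key, fallback=None)  # else config.ini (ConfigMap)
```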
### 5. Service Discovery
**Current**: Docker DNS (postgres, api, ollama, dsl-compiler)
**K8s**: K8s Service DNS - same pattern, just need Service manifests
### 6. Database
**Current**: Docker PostgreSQL container
**K8s Options**:
- Managed DB: RDS, CloudSQL, Azure Database (recommended)
- StatefulSet with PVC (self-managed)
Required PostgreSQL extensions: uuid-ossp, pgcrypto
Required databases: highway_db_v2, resilient_circuit_db
## Deliverables
### Helm Chart Structure
```
highway/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── configmap.yaml                  # config.ini
│   ├── secrets.yaml                    # JWT, DB password, encryption key
│   ├── deployment-api.yaml
│   ├── deployment-worker.yaml
│   ├── deployment-activity-worker.yaml
│   ├── deployment-internal-worker.yaml
│   ├── deployment-dsl-compiler.yaml
│   ├── deployment-ollama.yaml          (optional)
│   ├── service-api.yaml
│   ├── service-dsl-compiler.yaml
│   ├── ingress.yaml
│   ├── pvc.yaml                        (if not using S3)
│   └── hpa.yaml                        (horizontal pod autoscaler)
```
### Services to Deploy
| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| api | Deployment | 1+ | Ingress, port 7822 |
| worker | Deployment | 2+ | HPA based on queue depth |
| activity-worker | Deployment | 1+ | |
| internal-worker | Deployment | 2 | Async logging |
| dsl-compiler | Deployment | 1+ | Isolated, no DB access |
| ollama | Deployment | 0-1 | Optional, GPU preferred |
| postgres | External/StatefulSet | 1 | Prefer managed DB |
### Health Checks (from docker-compose)
- API: GET /api/v1/health
- Workers: Jumper heartbeat mechanism
- DSL Compiler: GET /health
## Migration Path
1. Create Helm chart structure
2. Implement config env var overrides
3. Implement file-based secrets support
4. Re-enable S3 with IAM auth (or configure PVC)
5. Deploy to K8s cluster
6. Debug and iterate
7. Document deployment process
## Testing
- Deploy all services
- Run platform bootstrap workflow
- Run demo workflows (v2, disaster, matrix)
- Verify multi-replica worker scaling
- Test pod restart recovery
## References
- Issue #721: Local storage support (completed)
- Issue #722: JoinMode consistency (completed)
- docker-compose.yml: Current service definitions
- docker/config.ini: Configuration reference
|
closed |
high |
2025-12-27 23:38 |
- |
|
| #722 |
Refactor wait_for_parallel_branches to use JoinMode for consistency
Currently wait_for_parallel_branches uses ad-hoc fail_on_error=True boolean while JoinOperator has p...
|
closed |
high |
2025-12-27 21:09 |
- |
|
| #721 |
Workers become zombies after asyncio crash - container healthy but all threads dead
OBSERVED: Workers can crash with asyncio.exceptions.CancelledError during HTTP operations, causing a...
|
closed |
critical |
2025-12-27 03:27 |
- |
|