| #727 |
DockerHub private repo setup for highway images
## Task
Set up a private DockerHub repository for highway images.
## Details
- DockerHub username: rod...
|
closed |
high |
2025-12-28 17:49 |
- |
|
| #726 |
EPIC: Deployment Infrastructure Overhaul - Tilt + Docker Compose + Production Split
## Overview
Restructure deployment infrastructure to separate production and development environment...
|
closed |
high |
2025-12-28 17:49 |
- |
|
| #724 |
## Problem
Two issues with worker signal handling prevent proper graceful shutdown:
### Issue 1: Delayed Shutdown Response
- Signal handler sets `_shutdown_requested = True`
- But `wakeup_event` is a local variable in `worker_command()`
- Signal handler can't access it to wake up blocked `Event.wait()`
- Worker waits up to `poll_interval` (1-5s) before noticing SIGTERM
### Issue 2: In-Flight Tasks Abandoned
- After main loop breaks, `finally` block stops services immediately
- No wait for tasks in `orchestrator.bulkhead` to complete
- Tasks get killed mid-execution
- They stay claimed until heartbeat timeout, then re-execute from scratch
## Impact
- K8s rolling updates will kill tasks mid-execution
- Wasted compute from task re-execution
- Potential data corruption for non-idempotent tasks
## Solution
1. Make `wakeup_event` module-level so signal handler can access it
2. Add graceful drain logic like `activity_worker` has:
   - Wait up to 30s for in-flight tasks
   - Call `bulkhead.shutdown(wait=True)`
## Files
- engine/cli/worker.py
## Testing
- Send SIGTERM to worker with active tasks
- Verify tasks complete before worker exits
- Verify the worker reacts to SIGTERM immediately instead of waiting out the remaining `poll_interval`
|
closed |
high |
2025-12-28 00:21 |
- |
|
| #723 |
## Objective
Deploy Highway Workflow Engine on Kubernetes for production use.
## Current State
- Docker Compose: Working (local dev)
- Kubernetes: Not supported
## Required Changes
### 1. Secrets Management
**Current**: `.env` file with `VAULT_TOKEN_ADMIN`; the Vault client reads tokens from the environment
**K8s Options**:
- Option A: Vault Agent sidecar (recommended) - K8s auth method, secrets injected as files
- Option B: K8s Secrets + External Secrets Operator - map K8s secrets to config paths
- Option C: Environment variable injection from K8s Secrets
**Code Changes Needed**:
- Support file-based secrets at /vault/secrets/* path
- Support HIGHWAY_* env var overrides for config values
- Graceful fallback chain: env vars → file secrets → Vault API
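A minimal sketch of the fallback chain above, under stated assumptions: the function name `resolve_secret`, the mapping from a secret name to `HIGHWAY_<NAME>` and to a file named after the secret, and the `vault_lookup` callable are all hypothetical. Only the `/vault/secrets/` path, the `HIGHWAY_` prefix, and the lookup order come from the issue.

```python
import os
from pathlib import Path


def resolve_secret(name, *, env=None, secrets_dir="/vault/secrets",
                   vault_lookup=None, env_prefix="HIGHWAY_"):
    """Resolve a secret: env var, then file, then Vault API.

    Returns None if no source provides a value.
    """
    env = os.environ if env is None else env

    # 1. Environment variable override (assumed naming: HIGHWAY_<NAME>).
    value = env.get(f"{env_prefix}{name.upper()}")
    if value:
        return value

    # 2. File-based secret, e.g. one written by a Vault Agent sidecar.
    path = Path(secrets_dir) / name
    if path.is_file():
        return path.read_text(encoding="utf-8").strip()

    # 3. Vault API, injected as a callable (hypothetical interface).
    if vault_lookup is not None:
        return vault_lookup(name)

    return None
```

Passing `env` and `vault_lookup` as parameters keeps each tier testable without a running Vault instance.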
### 2. Storage
**Current**: Local filesystem via bind mounts
- /app/artifacts/ - workflow artifacts
- /app/highway-test-logs/ - datashard logs
- /app/highway-test-logs/uploads/ - file uploads
**K8s Options**:
- Option A: S3/MinIO (recommended for multi-replica) - re-enable S3StorageProvider with IAM auth
- Option B: PersistentVolumeClaim with ReadWriteMany (requires NFS/EFS)
- Option C: PVC per worker with node affinity (limits scaling)
**Code Changes Needed**:
- Re-enable S3 provider in s3_provider.py
- Add IAM/IRSA authentication for S3
- Config: storage_type = auto (detect S3 creds, fall back to local)
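One way `storage_type = auto` could be resolved is sketched below. The function name and the returned values `"s3"` and `"local"` are assumptions for illustration. The environment variables checked are the standard AWS static-key variables and the pair that IRSA injects into EKS pods.

```python
import os


def detect_storage_type(configured="auto", env=None):
    """Return the storage backend to use.

    An explicit setting wins; "auto" selects S3 only when
    credentials appear to be available, else local storage.
    """
    env = os.environ if env is None else env
    if configured != "auto":
        return configured

    # Static credentials (e.g. MinIO or an IAM user).
    has_static_keys = bool(
        env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY")
    )
    # IRSA: EKS injects these two variables into the pod.
    has_irsa = bool(
        env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    )
    return "s3" if (has_static_keys or has_irsa) else "local"
```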
### 3. Docker-in-Docker Sandboxing
**Current**: Mounts /var/run/docker.sock for python_sandbox
**K8s Options**:
- Option A: Disable sandboxing (python_sandbox.mode = disabled) - acceptable for trusted tenants
- Option B: DinD sidecar container per worker pod
- Option C: Kaniko/Tekton for isolated execution
- Option D: gVisor/Kata for pod-level isolation
### 4. Configuration Delivery
**Current**: Bind-mounted config.ini from host
**K8s**: ConfigMap mounted as /etc/highway/config.ini
**Code Changes Needed**:
- Support HIGHWAY_DATABASE_HOST style env var overrides
- Environment variables take precedence over config.ini
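The override rule could look roughly like this, assuming the naming scheme `HIGHWAY_<SECTION>_<KEY>` implied by `HIGHWAY_DATABASE_HOST`. `load_config` is a hypothetical helper, and this sketch only overrides keys that already exist in `config.ini`, which avoids ambiguity when section or key names contain underscores.

```python
import configparser
import os


def load_config(ini_text, env=None, prefix="HIGHWAY_"):
    """Parse config.ini text, then let env vars override its values."""
    env = os.environ if env is None else env
    # interpolation=None so values containing '%' are stored verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)

    for section in parser.sections():
        for key in parser.options(section):
            var = f"{prefix}{section}_{key}".upper()
            if var in env:
                # Environment variables take precedence over the file.
                parser.set(section, key, env[var])
    return parser
```

With a ConfigMap mounted at `/etc/highway/config.ini`, the file supplies defaults and per-environment values come from the pod's `env` entries.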
### 5. Service Discovery
**Current**: Docker DNS (postgres, api, ollama, dsl-compiler)
**K8s**: K8s Service DNS - same pattern, just need Service manifests
### 6. Database
**Current**: Docker PostgreSQL container
**K8s Options**:
- Managed DB: RDS, CloudSQL, Azure Database (recommended)
- StatefulSet with PVC (self-managed)
Required PostgreSQL extensions: uuid-ossp, pgcrypto
Required databases: highway_db_v2, resilient_circuit_db
## Deliverables
### Helm Chart Structure
```
highway/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── configmap.yaml                   # config.ini
    ├── secrets.yaml                     # JWT, DB password, encryption key
    ├── deployment-api.yaml
    ├── deployment-worker.yaml
    ├── deployment-activity-worker.yaml
    ├── deployment-internal-worker.yaml
    ├── deployment-dsl-compiler.yaml
    ├── deployment-ollama.yaml           # optional
    ├── service-api.yaml
    ├── service-dsl-compiler.yaml
    ├── ingress.yaml
    ├── pvc.yaml                         # if not using S3
    └── hpa.yaml                         # horizontal pod autoscaler
```
### Services to Deploy
| Service | Type | Replicas | Notes |
|---------|------|----------|-------|
| api | Deployment | 1+ | Ingress, port 7822 |
| worker | Deployment | 2+ | HPA based on queue depth |
| activity-worker | Deployment | 1+ | |
| internal-worker | Deployment | 2 | Async logging |
| dsl-compiler | Deployment | 1+ | Isolated, no DB access |
| ollama | Deployment | 0-1 | Optional, GPU preferred |
| postgres | External/StatefulSet | 1 | Prefer managed DB |
### Health Checks (from docker-compose)
- API: GET /api/v1/health
- Workers: Jumper heartbeat mechanism
- DSL Compiler: GET /health
## Migration Path
1. Create Helm chart structure
2. Implement config env var overrides
3. Implement file-based secrets support
4. Re-enable S3 with IAM auth (or configure PVC)
5. Deploy to K8s cluster
6. Debug and iterate
7. Document deployment process
## Testing
- Deploy all services
- Run platform bootstrap workflow
- Run demo workflows (v2, disaster, matrix)
- Verify multi-replica worker scaling
- Test pod restart recovery
## References
- Issue #721: Local storage support (completed)
- Issue #722: JoinMode consistency (completed)
- docker-compose.yml: Current service definitions
- docker/config.ini: Configuration reference
|
closed |
high |
2025-12-27 23:38 |
- |
|
| #722 |
Refactor wait_for_parallel_branches to use JoinMode for consistency
Currently wait_for_parallel_branches uses ad-hoc fail_on_error=True boolean while JoinOperator has p...
|
closed |
high |
2025-12-27 21:09 |
- |
|
| #720 |
Sandbox temp files fail in Docker-in-Docker: can't find __main__ module
FIXED: When running sandboxed code execution (tools.code.exec) inside Docker containers with Docker ...
|
closed |
high |
2025-12-27 03:27 |
- |
|
| #718 |
Branch tasks now inherit retry policy from ParallelOperator
Fixed branch execution tasks to inherit max_attempts and retry_strategy from the ParallelOperator's ...
|
closed |
high |
2025-12-26 23:58 |
- |
|
| #717 |
Branch tasks now retry on worker crash
Fixed branch execution tasks to have max_attempts=3 with exponential backoff (10s, 30s). Previously ...
|
closed |
high |
2025-12-26 23:53 |
- |
|
| #716 |
Skip propagation bug in switch/condition operators
When switch/condition marks unselected branches as executed, downstream tasks depending on skipped b...
|
closed |
high |
2025-12-26 16:17 |
- |
|
| #714 |
Add generic configurable update handler tool
Current state: The tools.testing.update_handler is hardcoded to handle 'approve_request' updates wit...
|
closed |
high |
2025-12-25 21:46 |
- |
|
| #713 |
ReflexiveOperator completes workflow before activity finishes
ReflexiveOperator calls execute_activity_operator which is queue-only, but doesn't wait for the acti...
|
closed |
high |
2025-12-25 21:02 |
- |
|
| #712 |
Agentic Bot Demo - Conference Presentation
## Overview
Implement a full agentic conversational bot demo for conference presentation, showcasin...
|
closed |
high |
2025-12-25 18:32 |
- |
|
| #711 |
CFG-01: No schema validation for configuration
Location: engine/config.py. Issue: ConfigParser used without schema validation. No type/range/requir...
|
closed |
high |
2025-12-25 02:56 |
- |
|
| #710 |
VAL-05: JoinOperator join_tasks not validated
Location: highway_dsl/workflow_dsl.py. Issue: JoinOperator.join_tasks references not validated again...
|
closed |
high |
2025-12-25 02:56 |
- |
|
| #709 |
TEST-01: Missing unit tests for core orchestrator components
Location: tests/unit/. Issue: No unit tests for orchestrator.py, durable_context.py, absurd_client.p...
|
closed |
high |
2025-12-25 02:56 |
- |
|
| #708 |
RETRY-01: No jitter in Absurd retry delays
Location: a_absurd.sql:639-654. Issue: Retry delays calculated without jitter. Causes thundering her...
|
closed |
high |
2025-12-25 02:56 |
- |
|
| #707 |
ERR-02: Non-deterministic time.sleep() inside transaction
Location: operators.py:247-262. Issue: Inline retry uses time.sleep() holding DB connection open. Vi...
|
closed |
high |
2025-12-25 02:56 |
- |
|
| #706 |
ERR-01: _fail_run exceptions logged but not propagated
Location: orchestrator.py:1198-1199. Issue: Fatal failure to log a run failure only logged, not esca...
|
closed |
high |
2025-12-25 02:56 |
- |
|
| #705 |
OBS-03: No trace export to Jaeger/Zipkin
Location: engine/utils/tracing.py. Issue: W3C Trace Context implemented but traces only logged to st...
|
closed |
high |
2025-12-25 02:56 |
- |
|
| #704 |
OBS-02: No Prometheus metrics endpoint
Location: Not implemented. Issue: Cannot build Grafana dashboards or set up Prometheus alerting. Met...
|
closed |
high |
2025-12-25 02:56 |
- |
|