
# Helix Engage — Weekly Status Update

**Period:** April 6 – April 11, 2026
**Team:** Engineering

---

## Executive Summary

Major infrastructure milestone: Helix Engage now runs on AWS EC2 with a multi-tenant architecture that serves both Ramaiah Hospitals and Global Hospital from a single instance. A full CI/CD pipeline with automated E2E testing and Teams notifications is operational. Of the 17 QA defects triaged, 9 were resolved and deployed (8 code fixes, plus #547 resolved by seeding configuration). A cross-tenant security vulnerability in the telephony layer was also discovered and patched.

---

## 1. AWS EC2 Deployment (Multi-Tenant)

**Status: Live**

Migrated from single-tenant VPS to multi-tenant EC2 architecture:

- **Instance:** m6i.xlarge (4 vCPU, 16 GiB RAM), Mumbai (ap-south-1)
- **14 Docker containers** running, including: platform, 2 sidecars, telephony dispatcher, 4 Redis instances, Caddy, PostgreSQL, ClickHouse, Redpanda, MinIO
- **Strict tenant isolation:** each hospital has its own sidecar container, Redis instance, and data volume
- **Host-routed Caddy:** each hostname routes only to its own tenant's sidecar, so a webhook addressed to one hospital cannot be delivered to the other's
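Host routing means the upstream is chosen from the request's hostname alone. In production Caddy's site blocks make that decision; the TypeScript below only models it, to show why an unknown or mismatched host has no path to another tenant. The container names and ports are assumptions, not the real deployment values.

```typescript
// Hypothetical tenant table: hostname -> sidecar upstream.
// Container names and ports are illustrative assumptions.
const TENANT_UPSTREAMS: ReadonlyMap<string, string> = new Map([
  ["ramaiah.engage.healix360.net", "http://sidecar-ramaiah:4100"],
  ["global.engage.healix360.net", "http://sidecar-global:4100"],
]);

/**
 * Resolve the upstream purely from the Host header.
 * Unknown hosts get no upstream at all, so a request can never
 * "fall through" to another tenant's sidecar.
 */
function resolveUpstream(hostHeader: string | undefined): string | null {
  if (!hostHeader) return null;
  // Drop an optional ":port" suffix and normalise case before lookup.
  const host = hostHeader.split(":")[0].toLowerCase();
  return TENANT_UPSTREAMS.get(host) ?? null;
}
```

The key property is the absence of a default route: there is no catch-all upstream that a mis-addressed webhook could land on.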

**URLs deployed:**
- ramaiah.engage.healix360.net (Ramaiah Hospitals)
- global.engage.healix360.net (Global Hospital)
- ramaiah.app.healix360.net / global.app.healix360.net (Platform)
- telephony.engage.healix360.net (Event dispatcher)
- operations.healix360.net (CI/CD dashboard)
- git.healix360.net (Git forge)

---

## 2. Telephony Event Dispatcher

**Status: Live**

Built a NestJS service that routes Ozonetel agent/call events to the correct hospital's sidecar:

- Ozonetel event subscriptions are **account-level** (not per-campaign) — one URL for all agents
- Dispatcher receives all events, looks up `agentId` in Redis, forwards to the correct sidecar
- Sidecars self-register on boot with their agent list; heartbeat every 30s, TTL 90s
- No manual configuration needed when adding new hospitals
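A minimal sketch of the registry logic described above, assuming an in-memory `Map` in place of Redis (a real deployment would store each `agentId` key in Redis with an expiry). The class and function names and the event shape are hypothetical; only the 30s heartbeat and 90s TTL come from this update.

```typescript
// Timings from the status update: heartbeat every 30s, TTL 90s.
const HEARTBEAT_MS = 30_000;
// The TTL spans three heartbeat intervals, so a sidecar survives a couple
// of missed heartbeats before its agents are dropped.
const REGISTRATION_TTL_MS = 3 * HEARTBEAT_MS;

interface Registration {
  sidecarUrl: string;
  expiresAt: number; // epoch ms after which the mapping is considered stale
}

/** In-memory stand-in for the Redis agentId -> sidecar mapping. */
class AgentRegistry {
  private readonly entries = new Map<string, Registration>();

  /** Called by a sidecar on boot and on every heartbeat; refreshes the TTL. */
  register(sidecarUrl: string, agentIds: string[], now: number): void {
    for (const agentId of agentIds) {
      this.entries.set(agentId, { sidecarUrl, expiresAt: now + REGISTRATION_TTL_MS });
    }
  }

  /** Returns the sidecar owning agentId, or null if unknown or expired. */
  lookup(agentId: string, now: number): string | null {
    const entry = this.entries.get(agentId);
    if (!entry || entry.expiresAt <= now) return null;
    return entry.sidecarUrl;
  }
}

// Assumed minimal event shape: only agentId matters for routing.
interface OzonetelEvent {
  agentId: string;
  [field: string]: unknown;
}

/**
 * Pick the destination for an account-level event. The dispatcher would
 * then forward the raw event to that URL; null means "drop and log".
 */
function routeEvent(registry: AgentRegistry, event: OzonetelEvent, now: number): string | null {
  return registry.lookup(event.agentId, now);
}
```

Because registrations expire on their own, a sidecar that crashes stops receiving events after the TTL without any manual cleanup, which is what makes adding or removing hospitals configuration-free.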

---

## 3. Cross-Tenant Security Fix (defaultAgentId)

**Status: Fixed and deployed**

Discovered that 6 sidecar endpoints used a hardcoded `OZONETEL_AGENT_ID` env var as a fallback when `agentId` wasn't provided by the frontend. In a multi-tenant setup, this caused Ramaiah sidecar operations to silently affect Global Hospital's agent.

**Impact:** Agent state changes, call disposition, outbound dialing, performance metrics, and maintenance commands could operate on the wrong hospital's agent with no error or warning.

**Fix:**
- Removed `defaultAgentId` getter and all hardcoded fallbacks (`agent3`, `Test123$`, `521814`)
- All 6 endpoints now require `agentId` from the caller (400 if missing)
- Frontend updated to send `agentId` from `localStorage.helix_agent_config` in all calls
- `OZONETEL_AGENT_ID` removed from env config entirely
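The new rule ("`agentId` is required, 400 if missing") can be sketched as a framework-free guard. The function name and response shape are hypothetical; they only illustrate that the missing-`agentId` branch now fails loudly instead of falling back to an environment variable.

```typescript
interface HttpResult {
  status: number;
  body: { error?: string; agentId?: string };
}

/**
 * Guard applied by each agent-scoped endpoint after the fix:
 * agentId must come from the caller; there is no env-var fallback.
 */
function requireAgentId(body: { agentId?: unknown }): HttpResult {
  const { agentId } = body;
  if (typeof agentId !== "string" || agentId.trim() === "") {
    // Before the fix, this branch silently used OZONETEL_AGENT_ID.
    return { status: 400, body: { error: "agentId is required" } };
  }
  return { status: 200, body: { agentId: agentId.trim() } };
}
```

In the NestJS sidecar the 400 branch would more idiomatically be `throw new BadRequestException("agentId is required")` from `@nestjs/common`, which Nest maps to a 400 response.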

---

## 4. Defect Fixes (9 of 17)

| Bug | Title | Status |
|-----|-------|--------|
| #527 | Appointment creation updates existing patient incorrectly | Fixed |
| #529 | Break/Training status doesn't block outbound calls | Fixed |
| #531 | Agent can log out during active call | Fixed |
| #533 | Redundant "Call History" header | Fixed |
| #534 | Redundant "Patients" header | Fixed |
| #536 | My Performance shows wrong agent's data | Fixed |
| #538 | Supervisor dashboard metrics incorrect | Fixed |
| #540 | Ghost calls visible for logged-out agents | Fixed |
| #547 | SLA rules not reflected in Call Desk | Fixed (config seeded) |

**Deferred (by product):** #516 (recordings real-time), #517/#548 (AI transcription), #519 (supervisor call — needs SIP seat), #539 (missed calls real-time), #541 (whisper/barge/listen)

---

## 5. E2E Test Suite (Playwright)

**Status: 40 tests, all passing**

Automated smoke tests covering every page for both hospitals:

- **Login (4):** branding, invalid creds, supervisor login, auth guard
- **Ramaiah CC Agent (10):** call desk, call history, patients, appointments, my performance, sidebar, sign-out
- **Ramaiah Supervisor (12):** dashboard, team performance, live monitor, leads, patients, appointments, call log, recordings, missed calls, campaigns, settings, sidebar
- **Global CC Agent (7):** all pages + sign-out
- **Global Supervisor (5):** all pages

Self-healing: the suite clears any stale agent session lock before logging in and completes sign-out after the tests finish.

---

## 6. CI/CD Pipeline (Woodpecker + Gitea)

**Status: Operational**

End-to-end CI/CD on EC2:

- **Gitea** mirrors Azure DevOps repos every 15 minutes
- **Woodpecker CI** triggers pipelines on push or manual run
- **Frontend pipeline:** TypeScript typecheck → 40 E2E tests → HTML report published to MinIO → Teams notification
- **Sidecar pipeline:** Jest unit tests → Teams notification
- **Reports:** Playwright HTML reports with screenshots at `operations.healix360.net/reports/{run}/index.html`
- **Teams notifications:** Adaptive Cards to "Deployment updates" channel with pass/fail summary + report link
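A sketch of the kind of payload a Teams webhook accepts for the frontend pipeline's pass/fail message. The envelope (`type: "message"` with an `application/vnd.microsoft.card.adaptive` attachment) and the card elements (`TextBlock`, `FactSet`, `Action.OpenUrl`) follow the public Adaptive Cards format; the `RunSummary` shape, the `https://` scheme, and the wording are assumptions.

```typescript
// Hypothetical summary of one frontend pipeline run.
interface RunSummary {
  runId: string;
  passed: number;
  failed: number;
}

/** Build a Teams webhook message carrying an Adaptive Card. */
function buildTeamsMessage(run: RunSummary) {
  const ok = run.failed === 0;
  // Report path pattern from this update: /reports/{run}/index.html
  const reportUrl = `https://operations.healix360.net/reports/${encodeURIComponent(run.runId)}/index.html`;
  const card = {
    $schema: "http://adaptivecards.io/schemas/adaptive-card.json",
    type: "AdaptiveCard",
    version: "1.4",
    body: [
      {
        type: "TextBlock",
        text: `${ok ? "PASSED" : "FAILED"}: frontend E2E run ${run.runId}`,
        weight: "Bolder",
        color: ok ? "Good" : "Attention", // green vs red headline
        wrap: true,
      },
      {
        type: "FactSet",
        facts: [
          { title: "Passed", value: String(run.passed) },
          { title: "Failed", value: String(run.failed) },
        ],
      },
    ],
    actions: [{ type: "Action.OpenUrl", title: "Open report", url: reportUrl }],
  };
  return {
    type: "message",
    attachments: [{ contentType: "application/vnd.microsoft.card.adaptive", content: card }],
  };
}
```

The result would be sent as JSON in a `POST` to the channel's webhook URL (not shown here, since it is a secret).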

---

## 7. Documentation

Three docs committed to the repo:

- **architecture.md** — Multi-tenant topology with Mermaid diagram, telephony dispatcher, failure modes
- **developer-operations-runbook.md** — SSH access, accounts, deploy steps, Redis ops, DB access, troubleshooting
- **ci-cd-operations.md** — Gitea, Woodpecker, MinIO, Teams notification setup and troubleshooting

---

## 8. Data Seeding

- **Ramaiah:** 195 real doctors scraped from msrmh.com, clinics, visit slots, campaign data
- **Global:** CC agent accounts (rekha.cc, ganesh.cc), marketing (sanjay), supervisor (dr.ramesh) created with proper roles
- **Rules engine:** 6 priority scoring rules seeded (missed call, follow-up, campaign lead, 2nd/3rd attempt, spam deprioritize)
- **Seed script:** idempotent `mkMember`, cleanup phase before seeding, runs against any workspace via env vars
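"Idempotent" here means a re-run converges on the same state rather than duplicating accounts. A minimal find-or-create sketch of that idea, assuming a hypothetical in-memory store keyed by username; the real seed script works against the platform, and its actual `mkMember` signature is not shown in this update.

```typescript
interface Member {
  username: string;
  role: string;
}

/** Hypothetical in-memory stand-in for the workspace member store. */
class MemberStore {
  private readonly byUsername = new Map<string, Member>();
  createCount = 0;

  get size(): number {
    return this.byUsername.size;
  }

  find(username: string): Member | undefined {
    return this.byUsername.get(username);
  }

  create(member: Member): Member {
    this.createCount++;
    const stored = { ...member };
    this.byUsername.set(stored.username, stored);
    return stored;
  }
}

/** Find-or-create: running the seed twice creates each member once. */
function mkMember(store: MemberStore, member: Member): Member {
  const existing = store.find(member.username);
  if (existing) {
    // Re-apply the role so a re-run also corrects any drift.
    existing.role = member.role;
    return existing;
  }
  return store.create(member);
}
```

For example, calling `mkMember(store, { username: "rekha.cc", role: "cc-agent" })` twice leaves exactly one member (the role string is illustrative). Keying on a stable identifier rather than on insertion order is what lets the script run safely against any workspace.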

---

## 9. Other Improvements

- **SIP agent tracing:** Browser console logs `agent=ramaiahadmin ext=524435` on every SIP connect/disconnect/state change for multi-agent debugging
- **ACW (after-call work) 3-layer protection:** `beforeunload` warning → `sendBeacon` auto-dispose → server-side 30s timer
- **Maintenance endpoints:** `force-ready` and `unlock-agent` now accept `agentId` in the request body (previously hardcoded)
- **Security group automation:** the SSH ingress IP is updated automatically via the AWS CLI when the ISP-assigned IP changes
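The first two ACW layers run in the browser (`beforeunload` and `navigator.sendBeacon`); the third is a server-side fallback. A sketch of that fallback, assuming a hypothetical tracker that records when each agent entered ACW and auto-disposes anything still open after 30s. Time is passed in explicitly so the logic is deterministic; a real service would call `sweep` from a periodic timer.

```typescript
const ACW_TIMEOUT_MS = 30_000; // server-side fallback window from this update

/** Tracks open after-call work per agent (hypothetical sketch). */
class AcwTracker {
  private readonly openSince = new Map<string, number>(); // agentId -> ACW start (epoch ms)

  enterAcw(agentId: string, now: number): void {
    this.openSince.set(agentId, now);
  }

  /** Normal path, or the sendBeacon path: the disposition arrived in time. */
  dispose(agentId: string): boolean {
    return this.openSince.delete(agentId);
  }

  /** Auto-dispose every agent whose ACW exceeded the timeout; returns their ids. */
  sweep(now: number): string[] {
    const expired: string[] = [];
    this.openSince.forEach((since, agentId) => {
      if (now - since >= ACW_TIMEOUT_MS) expired.push(agentId);
    });
    // Delete after iterating so the Map is not mutated mid-traversal.
    for (const agentId of expired) this.openSince.delete(agentId);
    return expired;
  }
}
```

Whichever layer fires first wins: once `dispose` removes an agent, a later `sweep` has nothing left to auto-dispose for that call.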

---

## Metrics

| Metric | Value |
|--------|-------|
| Commits (frontend) | 35 |
| Commits (sidecar) | 20 |
| Commits (SDK app) | 2 |
| Bugs fixed | 9 |
| E2E tests | 40 |
| Docker containers | 17 (14 app + 3 CI) |
| DNS records | 6 |
| Uptime | EC2 live since Apr 9 |

---

## Next Week Priorities

1. Merge `feature/omnichannel-widget` → `master` (frontend)
2. Frontend Docker image (stop rsync, bake into image)
3. Appointment date validation (no past dates, auto-tomorrow after hours)
4. Pre-built CI Docker image (skip `yarn install` on every run)
5. Deferred defects: #516, #539 (real-time updates)
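One possible shape for the planned appointment date validation (item 3), sketched under assumptions: dates are compared as local calendar days, and the after-hours cutoff (18:00 here) is a hypothetical parameter, since the actual cutoff has not been specified.

```typescript
/** Calendar-day key in local time, so comparisons ignore time of day. */
function dayKey(d: Date): number {
  return d.getFullYear() * 10_000 + (d.getMonth() + 1) * 100 + d.getDate();
}

/** Rule 1: no appointments on a day that has already passed. */
function isPastDate(appointment: Date, now: Date): boolean {
  return dayKey(appointment) < dayKey(now);
}

/**
 * Rule 2: default the picker to today, or to tomorrow once the
 * (hypothetical) after-hours cutoff has passed.
 */
function defaultAppointmentDate(now: Date, cutoffHour = 18): Date {
  const d = new Date(now.getFullYear(), now.getMonth(), now.getDate());
  // setDate handles month/year rollover (e.g. Apr 30 -> May 1).
  if (now.getHours() >= cutoffHour) d.setDate(d.getDate() + 1);
  return d;
}
```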