helix-engage/docs/weekly-status-apr06-11.md
2026-04-15 06:49:41 +05:30
# Helix Engage — Weekly Status Update
**Period:** April 6 to April 11, 2026
**Team:** Engineering
---
## Executive Summary
Major infrastructure milestone — Helix Engage is now running on AWS EC2 with multi-tenant architecture supporting both Ramaiah Hospitals and Global Hospital on a single instance. A full CI/CD pipeline with automated E2E testing and Teams notifications is operational. 17 defects from QA were triaged, 8 fixed and deployed, and a cross-tenant security vulnerability in the telephony layer was discovered and patched.
---
## 1. AWS EC2 Deployment (Multi-Tenant)
**Status: Live**
Migrated from single-tenant VPS to multi-tenant EC2 architecture:
- **Instance:** m6i.xlarge, Mumbai (ap-south-1), 15GB RAM
- **14 Docker containers** running: platform, 2 sidecars, telephony dispatcher, 4 Redis instances, Caddy, PostgreSQL, ClickHouse, Redpanda, MinIO
- **Strict tenant isolation:** each hospital has its own sidecar container, Redis instance, and data volume
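- **Host-routed Caddy:** requests are routed by hostname, so a webhook sent to one hospital's hostname can only reach that hospital's sidecar; there is no route from one tenant's hostname to another tenant's container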
**URLs deployed:**
- ramaiah.engage.healix360.net (Ramaiah Hospitals)
- global.engage.healix360.net (Global Hospital)
- ramaiah.app.healix360.net / global.app.healix360.net (Platform)
- telephony.engage.healix360.net (Event dispatcher)
- operations.healix360.net (CI/CD dashboard)
- git.healix360.net (Git forge)
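
The host-based isolation described above can be illustrated with a minimal sketch. This is not the Caddy configuration itself, only the lookup rule it is assumed to implement: the hostname is the sole routing key, and an unknown hostname resolves to nothing. The upstream names and port (`sidecar-ramaiah:4100`, `sidecar-global:4100`) are hypothetical; only the hostnames come from this report.

```typescript
// Hypothetical host -> upstream table. The hostnames are from this report;
// the container names and port are assumptions for illustration only.
const upstreamByHost = new Map<string, string>([
  ["ramaiah.engage.healix360.net", "sidecar-ramaiah:4100"],
  ["global.engage.healix360.net", "sidecar-global:4100"],
]);

// Resolve the upstream for a request purely from its Host header.
// Unknown hosts return undefined, so nothing falls through to a default tenant.
function resolveUpstream(hostHeader: string): string | undefined {
  // Normalize: trim, lowercase, and drop an optional ":port" suffix.
  const host = hostHeader.trim().toLowerCase().replace(/:\d+$/, "");
  return upstreamByHost.get(host);
}
```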
---
## 2. Telephony Event Dispatcher
**Status: Live**
Built a NestJS service that routes Ozonetel agent/call events to the correct hospital's sidecar:
- Ozonetel event subscriptions are **account-level** (not per-campaign) — one URL for all agents
- Dispatcher receives all events, looks up `agentId` in Redis, forwards to the correct sidecar
- Sidecars self-register on boot with their agent list; heartbeat every 30s, TTL 90s
- No manual configuration needed when adding new hospitals
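
The registration and lookup flow above can be sketched as follows. This is a simplified illustration, not the dispatcher's code: an in-memory `Map` stands in for Redis, and the class and method names (`AgentRegistry`, `register`, `lookup`) are hypothetical. Only the 90s TTL and the 30s heartbeat interval come from this report.

```typescript
// TTL from the report; sidecars are expected to heartbeat every 30 seconds,
// so an entry survives two missed heartbeats before expiring.
const REGISTRATION_TTL_MS = 90 * 1000;

interface Registration {
  sidecarUrl: string;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for the Redis-backed agentId -> sidecar mapping.
class AgentRegistry {
  private readonly entries = new Map<string, Registration>();

  // Called when a sidecar boots and again on every heartbeat.
  register(agentIds: readonly string[], sidecarUrl: string, now: number): void {
    for (const agentId of agentIds) {
      this.entries.set(agentId, { sidecarUrl, expiresAt: now + REGISTRATION_TTL_MS });
    }
  }

  // Returns the sidecar URL for an agent, or undefined if unknown or expired.
  lookup(agentId: string, now: number): string | undefined {
    const entry = this.entries.get(agentId);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(agentId); // drop stale registrations lazily
      return undefined;
    }
    return entry.sidecarUrl;
  }
}
```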
---
## 3. Cross-Tenant Security Fix (defaultAgentId)
**Status: Fixed and deployed**
Discovered that 6 sidecar endpoints used a hardcoded `OZONETEL_AGENT_ID` env var as a fallback when `agentId` wasn't provided by the frontend. In a multi-tenant setup, this caused Ramaiah sidecar operations to silently affect Global Hospital's agent.
**Impact:** Agent state changes, call disposition, outbound dialing, performance metrics, and maintenance commands could operate on the wrong hospital's agent with no error or warning.
**Fix:**
- Removed `defaultAgentId` getter and all hardcoded fallbacks (`agent3`, `Test123$`, `521814`)
- All 6 endpoints now require `agentId` from the caller (400 if missing)
- Frontend updated to send `agentId` from `localStorage.helix_agent_config` in all calls
- `OZONETEL_AGENT_ID` removed from env config entirely
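
A minimal sketch of the "require `agentId`, no fallback" rule, assuming the check runs on the parsed request body. The function name `requireAgentId`, the result shape, and the error message are hypothetical; the actual sidecar may enforce this through NestJS validation instead.

```typescript
// Result of validating a request body: either a usable agentId or a 400.
type AgentIdResult =
  | { ok: true; agentId: string }
  | { ok: false; status: number; message: string };

// Extract agentId from an untrusted request body.
// There is deliberately no default: a missing or blank value is rejected.
function requireAgentId(body: unknown): AgentIdResult {
  const candidate =
    typeof body === "object" && body !== null
      ? (body as Record<string, unknown>)["agentId"]
      : undefined;
  if (typeof candidate !== "string" || candidate.trim() === "") {
    return { ok: false, status: 400, message: "agentId is required" };
  }
  return { ok: true, agentId: candidate.trim() };
}
```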
---
## 4. Defect Fixes (8 of 17)
| Bug | Title | Status |
|-----|-------|--------|
| #527 | Appointment creation updates existing patient incorrectly | Fixed |
| #529 | Break/Training status doesn't block outbound calls | Fixed |
| #531 | Agent can log out during active call | Fixed |
| #533 | Redundant "Call History" header | Fixed |
| #534 | Redundant "Patients" header | Fixed |
| #536 | My Performance shows wrong agent's data | Fixed |
| #538 | Supervisor dashboard metrics incorrect | Fixed |
| #540 | Ghost calls visible for logged-out agents | Fixed |
| #547 | SLA rules not reflected in Call Desk | Fixed (config seeded) |
**Deferred (by product):** #516 (recordings real-time), #517/#548 (AI transcription), #519 (supervisor call — needs SIP seat), #539 (missed calls real-time), #541 (whisper/barge/listen)
---
## 5. E2E Test Suite (Playwright)
**Status: 40 tests, all passing**
Automated smoke tests covering every page for both hospitals:
- **Login (4):** branding, invalid creds, supervisor login, auth guard
- **Ramaiah CC Agent (10):** call desk, call history, patients, appointments, my performance, sidebar, sign-out
- **Ramaiah Supervisor (12):** dashboard, team performance, live monitor, leads, patients, appointments, call log, recordings, missed calls, campaigns, settings, sidebar
- **Global CC Agent (7):** all pages + sign-out
- **Global Supervisor (5):** all pages
Self-healing: auto-clears agent session locks before login, completes sign-out after tests.
---
## 6. CI/CD Pipeline (Woodpecker + Gitea)
**Status: Operational**
End-to-end CI/CD on EC2:
- **Gitea** mirrors Azure DevOps repos every 15 minutes
- **Woodpecker CI** triggers pipelines on push or manual run
- **Frontend pipeline:** TypeScript typecheck → 40 E2E tests → HTML report published to MinIO → Teams notification
- **Sidecar pipeline:** Jest unit tests → Teams notification
- **Reports:** Playwright HTML reports with screenshots at `operations.healix360.net/reports/{run}/index.html`
- **Teams notifications:** Adaptive Cards to "Deployment updates" channel with pass/fail summary + report link
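
As an illustration of the notification step, the sketch below builds an Adaptive Card payload with a pass/fail summary and a report link. It is an assumption about the payload shape, not the pipeline's actual script: the function name `buildTeamsCard`, the card wording, and the `https://` scheme on the report URL are hypothetical, while the report path pattern comes from the bullet above.

```typescript
interface RunSummary {
  run: string; // pipeline run identifier used in the report path
  passed: number;
  failed: number;
}

// Build a Teams message carrying one Adaptive Card attachment.
function buildTeamsCard(summary: RunSummary) {
  const status = summary.failed === 0 ? "PASSED" : "FAILED";
  // Path pattern from the report; the https scheme is an assumption.
  const reportUrl =
    "https://operations.healix360.net/reports/" +
    encodeURIComponent(summary.run) +
    "/index.html";
  return {
    type: "message",
    attachments: [
      {
        contentType: "application/vnd.microsoft.card.adaptive",
        content: {
          type: "AdaptiveCard",
          version: "1.4",
          body: [
            { type: "TextBlock", text: `E2E run ${summary.run}: ${status}`, weight: "Bolder", wrap: true },
            { type: "TextBlock", text: `${summary.passed} passed, ${summary.failed} failed`, wrap: true },
          ],
          actions: [{ type: "Action.OpenUrl", title: "Open report", url: reportUrl }],
        },
      },
    ],
  };
}
```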
---
## 7. Documentation
Three docs committed to the repo:
- **architecture.md** — Multi-tenant topology with Mermaid diagram, telephony dispatcher, failure modes
- **developer-operations-runbook.md** — SSH access, accounts, deploy steps, Redis ops, DB access, troubleshooting
- **ci-cd-operations.md** — Gitea, Woodpecker, MinIO, Teams notification setup and troubleshooting
---
## 8. Data Seeding
- **Ramaiah:** 195 real doctors scraped from msrmh.com, clinics, visit slots, campaign data
- **Global:** CC agent accounts (rekha.cc, ganesh.cc), marketing (sanjay), supervisor (dr.ramesh) created with proper roles
- **Rules engine:** 6 priority scoring rules seeded (missed call, follow-up, campaign lead, 2nd/3rd attempt, spam deprioritize)
- **Seed script:** idempotent `mkMember`, cleanup phase before seeding, runs against any workspace via env vars
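
The idempotency property of the seed script can be sketched as a create-or-update keyed on username, so that re-running the seed never duplicates an account. This is not the real `mkMember`: the `Workspace` class, the `ensureMember` method, and the role strings are hypothetical, and an in-memory `Map` stands in for the platform API.

```typescript
interface Member {
  username: string;
  role: string;
}

// In-memory stand-in for a workspace's member store.
class Workspace {
  private readonly members = new Map<string, Member>();

  // Create the member if absent; otherwise update the role in place.
  // Calling this repeatedly with the same arguments leaves one record.
  ensureMember(username: string, role: string): Member {
    const key = username.toLowerCase();
    const existing = this.members.get(key);
    if (existing !== undefined) {
      existing.role = role;
      return existing;
    }
    const created: Member = { username: key, role };
    this.members.set(key, created);
    return created;
  }

  count(): number {
    return this.members.size;
  }
}
```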
---
## 9. Other Improvements
- **SIP agent tracing:** Browser console logs `agent=ramaiahadmin ext=524435` on every SIP connect/disconnect/state change for multi-agent debugging
- **ACW 3-layer protection:** beforeunload warning → sendBeacon auto-dispose → server 30s timer
- **Maint endpoints:** `force-ready` and `unlock-agent` now accept `agentId` from body (was hardcoded)
- **Security group automation:** SSH IP auto-updated via AWS CLI when ISP changes
---
## Metrics
| Metric | Value |
|--------|-------|
| Commits (frontend) | 35 |
| Commits (sidecar) | 20 |
| Commits (SDK app) | 2 |
| Bugs fixed | 9 (8 code fixes + #547 resolved by config seeding) |
| E2E tests | 40 |
| Docker containers | 17 (14 app + 3 CI) |
| DNS records | 6 |
| Uptime | EC2 live since Apr 9 |
---
## Next Week Priorities
1. Merge `feature/omnichannel-widget` → `master` (frontend)
2. Frontend Docker image (stop rsync, bake into image)
3. Appointment date validation (no past dates, auto-tomorrow after hours)
4. Pre-built CI Docker image (skip `yarn install` on every run)
5. Deferred defects: #516, #539 (real-time updates)