helix-engage/docs/weekly-status-apr06-11.md
2026-04-15 06:49:41 +05:30


Helix Engage — Weekly Status Update

Period: April 6 to April 11, 2026
Team: Engineering


Executive Summary

Major infrastructure milestone: Helix Engage is now running on AWS EC2 with a multi-tenant architecture that serves both Ramaiah Hospitals and Global Hospital from a single instance. A full CI/CD pipeline with automated E2E testing and Teams notifications is operational. QA reported 17 defects, all of which were triaged; 9 are fixed and deployed (8 code fixes plus #547, resolved by seeding config). A cross-tenant security vulnerability in the telephony layer was also discovered and patched.


1. AWS EC2 Deployment (Multi-Tenant)

Status: Live

Migrated from single-tenant VPS to multi-tenant EC2 architecture:

  • Instance: m6i.xlarge, Mumbai (ap-south-1), 16 GiB RAM
  • 14 Docker containers running: platform, 2 sidecars, telephony dispatcher, 4 Redis instances, Caddy, PostgreSQL, ClickHouse, Redpanda, MinIO
  • Strict tenant isolation: each hospital has its own sidecar container, Redis instance, and data volume
  • Host-routed Caddy: each hostname maps to exactly one tenant's sidecar, so a webhook sent to one tenant's hostname cannot be routed to another tenant
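
The host-based routing idea can be sketched in a few lines. This is an illustration only, not the deployed Caddy configuration: the hostnames are taken from this report, while the upstream container names and the port are assumptions made up for the example.

```typescript
// Hypothetical host-to-upstream table. Hostnames come from this report;
// the upstream names and port 4100 are invented for illustration.
const UPSTREAMS: ReadonlyMap<string, string> = new Map([
  ["ramaiah.engage.healix360.net", "sidecar-ramaiah:4100"],
  ["global.engage.healix360.net", "sidecar-global:4100"],
]);

// Resolve the upstream for a request's Host header.
// Unknown hosts return undefined so the caller can answer 404
// instead of falling back to a default tenant.
function resolveUpstream(hostHeader: string): string | undefined {
  // Strip an optional port and normalise case before the lookup.
  const host = (hostHeader.split(":")[0] ?? "").toLowerCase();
  return UPSTREAMS.get(host);
}
```

The point of the design is that there is no default branch: a request either names a known tenant hostname or is rejected.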

URLs deployed:

  • ramaiah.engage.healix360.net (Ramaiah Hospitals)
  • global.engage.healix360.net (Global Hospital)
  • ramaiah.app.healix360.net / global.app.healix360.net (Platform)
  • telephony.engage.healix360.net (Event dispatcher)
  • operations.healix360.net (CI/CD dashboard)
  • git.healix360.net (Git forge)

2. Telephony Event Dispatcher

Status: Live

Built a NestJS service that routes Ozonetel agent/call events to the correct hospital's sidecar:

  • Ozonetel event subscriptions are account-level (not per-campaign) — one URL for all agents
  • Dispatcher receives all events, looks up agentId in Redis, forwards to the correct sidecar
  • Sidecars self-register on boot with their agent list; heartbeat every 30s, TTL 90s
  • No manual configuration needed when adding new hospitals
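
The register, heartbeat, and lookup flow described above might look roughly like the sketch below. It assumes an in-memory map in place of Redis and an injected clock so that expiry is easy to reason about; `AgentRegistry`, `routeEvent`, and the sidecar URLs are hypothetical names, not the service's actual code.

```typescript
type Registration = { sidecarUrl: string; expiresAt: number };

const TTL_MS = 90_000; // registration lifetime from the report (90s)

// In-memory stand-in for the Redis-backed agent registry (assumption).
class AgentRegistry {
  private readonly byAgent = new Map<string, Registration>();

  // Called when a sidecar boots and again on each heartbeat (every 30s per the report).
  register(sidecarUrl: string, agentIds: string[], now: number): void {
    for (const agentId of agentIds) {
      this.byAgent.set(agentId, { sidecarUrl, expiresAt: now + TTL_MS });
    }
  }

  // Returns the sidecar that owns the agent, or undefined if unknown or expired.
  lookup(agentId: string, now: number): string | undefined {
    const entry = this.byAgent.get(agentId);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.byAgent.delete(agentId); // the sidecar missed its heartbeats
      return undefined;
    }
    return entry.sidecarUrl;
  }
}

// Decide where an incoming event should be forwarded.
function routeEvent(
  registry: AgentRegistry,
  event: { agentId?: string },
  now: number,
): string | undefined {
  if (!event.agentId) return undefined; // nothing to route on
  return registry.lookup(event.agentId, now);
}
```

With a 30s heartbeat and a 90s TTL, a sidecar can miss two consecutive heartbeats before its agents drop out of the registry.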

3. Cross-Tenant Security Fix (defaultAgentId)

Status: Fixed and deployed

Discovered that 6 sidecar endpoints used a hardcoded OZONETEL_AGENT_ID env var as a fallback when agentId wasn't provided by the frontend. In a multi-tenant setup, this caused Ramaiah sidecar operations to silently affect Global Hospital's agent.

Impact: Agent state changes, call disposition, outbound dialing, performance metrics, and maintenance commands could operate on the wrong hospital's agent with no error or warning.

Fix:

  • Removed defaultAgentId getter and all hardcoded fallbacks (agent3, Test123$, 521814)
  • All 6 endpoints now require agentId from the caller (400 if missing)
  • Frontend updated to send agentId from localStorage.helix_agent_config in all calls
  • OZONETEL_AGENT_ID removed from env config entirely
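
A minimal sketch of the "require agentId, otherwise 400" rule follows. It is written as a plain function rather than a NestJS guard or pipe so that it stays self-contained; the function name, result shape, and error message are assumptions for illustration.

```typescript
type AgentIdResult =
  | { ok: true; agentId: string }
  | { ok: false; status: 400; message: string };

// Extract agentId from a request body and refuse to guess when it is absent.
// There is deliberately no fallback value: a missing id is a client error.
function requireAgentId(body: unknown): AgentIdResult {
  const value =
    typeof body === "object" && body !== null
      ? (body as Record<string, unknown>)["agentId"]
      : undefined;
  if (typeof value !== "string" || value.trim() === "") {
    return { ok: false, status: 400, message: "agentId is required" };
  }
  return { ok: true, agentId: value.trim() };
}
```

Failing loudly here is what turns the earlier silent cross-tenant behaviour into a visible 400 response.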

4. Defect Fixes (9 of 17: 8 code fixes, 1 config seed)

| Bug | Title | Status |
| --- | --- | --- |
| #527 | Appointment creation updates existing patient incorrectly | Fixed |
| #529 | Break/Training status doesn't block outbound calls | Fixed |
| #531 | Agent can log out during active call | Fixed |
| #533 | Redundant "Call History" header | Fixed |
| #534 | Redundant "Patients" header | Fixed |
| #536 | My Performance shows wrong agent's data | Fixed |
| #538 | Supervisor dashboard metrics incorrect | Fixed |
| #540 | Ghost calls visible for logged-out agents | Fixed |
| #547 | SLA rules not reflected in Call Desk | Fixed (config seeded) |

Deferred (by product): #516 (recordings real-time), #517/#548 (AI transcription), #519 (supervisor call — needs SIP seat), #539 (missed calls real-time), #541 (whisper/barge/listen)


5. E2E Test Suite (Playwright)

Status: 40 tests, all passing

Automated smoke tests covering every page for both hospitals:

  • Login (4): branding, invalid creds, supervisor login, auth guard
  • Ramaiah CC Agent (10): call desk, call history, patients, appointments, my performance, sidebar, sign-out
  • Ramaiah Supervisor (12): dashboard, team performance, live monitor, leads, patients, appointments, call log, recordings, missed calls, campaigns, settings, sidebar
  • Global CC Agent (7): all pages + sign-out
  • Global Supervisor (5): all pages

Self-healing: auto-clears agent session locks before login, completes sign-out after tests.


6. CI/CD Pipeline (Woodpecker + Gitea)

Status: Operational

End-to-end CI/CD on EC2:

  • Gitea mirrors Azure DevOps repos every 15 minutes
  • Woodpecker CI triggers pipelines on push or manual run
  • Frontend pipeline: TypeScript typecheck → 40 E2E tests → HTML report published to MinIO → Teams notification
  • Sidecar pipeline: Jest unit tests → Teams notification
  • Reports: Playwright HTML reports with screenshots at operations.healix360.net/reports/{run}/index.html
  • Teams notifications: Adaptive Cards to "Deployment updates" channel with pass/fail summary + report link
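
For reference, a pass/fail notification with a report link could be assembled roughly as below. The envelope follows the Adaptive Card attachment format used by Teams incoming webhooks; the `RunSummary` fields, the card wording, the card version, and the `https` scheme on the report URL are assumptions, not the pipeline's actual payload.

```typescript
interface RunSummary {
  run: string; // pipeline run identifier (hypothetical field)
  passed: number;
  failed: number;
}

// Report location as described in this report; the https scheme is assumed.
function reportUrl(run: string): string {
  return `https://operations.healix360.net/reports/${encodeURIComponent(run)}/index.html`;
}

// Build the JSON body to POST to the Teams webhook.
function buildTeamsMessage(summary: RunSummary) {
  const ok = summary.failed === 0;
  return {
    type: "message",
    attachments: [
      {
        contentType: "application/vnd.microsoft.card.adaptive",
        content: {
          $schema: "http://adaptivecards.io/schemas/adaptive-card.json",
          type: "AdaptiveCard",
          version: "1.4",
          body: [
            { type: "TextBlock", weight: "Bolder", text: ok ? "E2E passed" : "E2E failed" },
            { type: "TextBlock", wrap: true, text: `${summary.passed} passed, ${summary.failed} failed` },
          ],
          actions: [
            { type: "Action.OpenUrl", title: "Open report", url: reportUrl(summary.run) },
          ],
        },
      },
    ],
  };
}
```

Posting is then a single HTTP request with `JSON.stringify(buildTeamsMessage(summary))` as the body.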

7. Documentation

Three docs committed to the repo:

  • architecture.md — Multi-tenant topology with Mermaid diagram, telephony dispatcher, failure modes
  • developer-operations-runbook.md — SSH access, accounts, deploy steps, Redis ops, DB access, troubleshooting
  • ci-cd-operations.md — Gitea, Woodpecker, MinIO, Teams notification setup and troubleshooting

8. Data Seeding

  • Ramaiah: 195 real doctors scraped from msrmh.com, clinics, visit slots, campaign data
  • Global: CC agent accounts (rekha.cc, ganesh.cc), marketing (sanjay), supervisor (dr.ramesh) created with proper roles
  • Rules engine: 6 priority scoring rules seeded (missed call, follow-up, campaign lead, 2nd/3rd attempt, spam deprioritize)
  • Seed script: idempotent mkMember, cleanup phase before seeding, runs against any workspace via env vars
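
What "idempotent" means for the seed script can be shown with a small sketch. The real `mkMember` signature is not given in this report, so the version below is a hypothetical stand-in that uses an in-memory map instead of the platform API; the account name and role strings are placeholders.

```typescript
interface Member {
  email: string;
  role: string;
}

// In-memory stand-in for a workspace (assumption; the real script talks to the platform).
class Workspace {
  private readonly members = new Map<string, Member>();

  // Create the member if absent, otherwise overwrite the role in place.
  // Running it twice with the same input leaves exactly one record.
  mkMember(email: string, role: string): Member {
    const key = email.toLowerCase();
    const member: Member = { email: key, role };
    this.members.set(key, member);
    return member;
  }

  roleOf(email: string): string | undefined {
    return this.members.get(email.toLowerCase())?.role;
  }

  count(): number {
    return this.members.size;
  }
}

// Seeding is a loop over mkMember, so re-running it is safe.
function seed(ws: Workspace, members: Member[]): void {
  for (const m of members) ws.mkMember(m.email, m.role);
}
```

Keying on a normalised identifier is what makes a second run a no-op rather than a source of duplicates.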

9. Other Improvements

  • SIP agent tracing: Browser console logs agent=ramaiahadmin ext=524435 on every SIP connect/disconnect/state change for multi-agent debugging
  • ACW 3-layer protection: beforeunload warning → sendBeacon auto-dispose → server 30s timer
  • Maint endpoints: force-ready and unlock-agent now accept agentId from body (was hardcoded)
  • Security group automation: SSH IP auto-updated via AWS CLI when ISP changes
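
The third ACW layer (the server-side 30s timer) can be sketched as a small watchdog. This reads the bullet above as "if no disposition arrives within 30 seconds of after-call work starting, the server disposes the call itself"; that interpretation, the class name, and the injected clock are all assumptions.

```typescript
const ACW_TIMEOUT_MS = 30_000; // server-side limit from the report (30s)

class AcwWatchdog {
  private readonly startedAt = new Map<string, number>();

  // The agent entered after-call work.
  start(agentId: string, now: number): void {
    this.startedAt.set(agentId, now);
  }

  // A disposition arrived in time, from the UI or from the sendBeacon request.
  dispose(agentId: string): void {
    this.startedAt.delete(agentId);
  }

  // Return the agents whose ACW exceeded the limit and stop tracking them.
  // The caller would auto-dispose the call for each returned agent.
  sweep(now: number): string[] {
    const expired: string[] = [];
    this.startedAt.forEach((started, agentId) => {
      if (now - started >= ACW_TIMEOUT_MS) expired.push(agentId);
    });
    for (const agentId of expired) this.startedAt.delete(agentId);
    return expired;
  }
}
```

The first two layers run in the browser and can fail if the tab crashes or the network drops, which is why a server-side backstop of this kind is useful.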

Metrics

| Metric | Value |
| --- | --- |
| Commits (frontend) | 35 |
| Commits (sidecar) | 20 |
| Commits (SDK app) | 2 |
| Bugs fixed | 9 |
| E2E tests | 40 |
| Docker containers | 17 (14 app + 3 CI) |
| DNS records | 6 |
| Uptime | EC2 live since Apr 9 |

Next Week Priorities

  1. Merge feature/omnichannel-widget into master (frontend)
  2. Frontend Docker image (stop rsync, bake into image)
  3. Appointment date validation (no past dates, auto-tomorrow after hours)
  4. Pre-built CI Docker image (skip yarn install on every run)
  5. Deferred defects: #516, #539 (real-time updates)