helix-engage/docs/architecture.md
saridsa2 1cdb7fe9e7 feat: add E2E smoke tests, architecture docs, and operations runbook
- 27 Playwright E2E tests covering login (3 roles), CC Agent pages
  (call desk, call history, patients, appointments, my performance,
  sidebar, sign-out), and Supervisor pages (all 11 pages + sidebar)
- Tests run against live EC2 at ramaiah.engage.healix360.net
- Last test completes sign-out to release agent session for next run
- Architecture doc with updated Mermaid diagram including telephony
  dispatcher, service discovery, and multi-tenant topology
- Operations runbook with SSH access (VPS + EC2), accounts, container
  reference, deploy steps, Redis ops, and troubleshooting guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:54:20 +05:30

Helix Engage — Architecture

Single EC2 instance (Mumbai ap-south-1) hosting two isolated Helix Engage workspaces on top of a shared FortyTwo platform. Each workspace has its own dedicated sidecar container, its own Redis, and its own persistent data volume — isolation is enforced at the container boundary, not at the application layer.

Host: 13.234.31.194 (m6i.xlarge, Ubuntu 22.04)
DNS: Cloudflare zone healix360.net
TLS: Caddy + Let's Encrypt, HTTP-01 challenge per hostname


Principles

  1. Platform is multi-tenant by design. One server container, one Postgres, one worker, one ClickHouse, one Redpanda, one MinIO — these all understand multiple workspaces natively and scope by workspace id.

  2. Sidecar is single-tenant by design. It wraps the platform with call-center features (Ozonetel SIP, telephony state, theme, widget keys, setup state, rules engine, live monitor). Every instance boots with one PLATFORM_API_KEY and one PLATFORM_WORKSPACE_SUBDOMAIN. We run one instance per workspace.

  3. Caddy is strictly host-routed. No default or catchall tenant. A request lands on a host block or it 404s. The apex engage.healix360.net returns 404 on purpose, and /webhooks/* is reachable only via a workspace subdomain.

  4. Redis is per-sidecar. The sidecar's Redis key names carry no workspace dimension, so two sidecars pointed at one Redis would read and overwrite each other's keys. Each sidecar therefore gets its own Redis container: hard isolation at the database level, with zero code changes.

  5. Telephony dispatcher routes events by agent. Ozonetel event subscriptions are account-level (not per-campaign). A single dispatcher receives all agent/call events and routes them to the correct sidecar using Redis-backed service discovery.
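
Principle 2 can be illustrated with a minimal sketch of how a sidecar might validate its two tenant-binding variables at boot. Only the variable names PLATFORM_API_KEY and PLATFORM_WORKSPACE_SUBDOMAIN come from this document; the `SidecarConfig` type, the `loadSidecarConfig` function, and the subdomain pattern are hypothetical and are not taken from the sidecar's source.

```typescript
// Hypothetical shape; the real sidecar's config module may differ.
interface SidecarConfig {
  platformApiKey: string;
  workspaceSubdomain: string;
}

type Env = Record<string, string | undefined>;

// Reads the two variables named in principle 2 and refuses to boot without them.
function loadSidecarConfig(env: Env): SidecarConfig {
  const platformApiKey = env["PLATFORM_API_KEY"]?.trim();
  const workspaceSubdomain = env["PLATFORM_WORKSPACE_SUBDOMAIN"]?.trim();

  if (!platformApiKey) {
    throw new Error("PLATFORM_API_KEY is required");
  }
  if (!workspaceSubdomain) {
    throw new Error("PLATFORM_WORKSPACE_SUBDOMAIN is required");
  }
  // Assumed rule: a single lowercase DNS label, since one process serves one workspace.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(workspaceSubdomain)) {
    throw new Error(`invalid workspace subdomain: ${workspaceSubdomain}`);
  }
  return { platformApiKey, workspaceSubdomain };
}
```

Failing fast here keeps a misconfigured container from starting with no tenant binding at all.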


URL Layout

| Who | URL | Routes to |
|---|---|---|
| Ramaiah platform UI | https://ramaiah.app.healix360.net | server:4000 |
| Ramaiah Helix Engage | https://ramaiah.engage.healix360.net | sidecar-ramaiah:4100 |
| Global platform UI | https://global.app.healix360.net | server:4000 |
| Global Helix Engage | https://global.engage.healix360.net | sidecar-global:4100 |
| Telephony dispatcher | https://telephony.engage.healix360.net | telephony:4200 |
| Apex (dead-end) | https://engage.healix360.net | 404 |
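
The host-routing rule behind this layout can be modelled in a few lines. This is a sketch of the behaviour, not the Caddy configuration: the hostnames and upstreams are copied from the layout above, while `HOST_ROUTES`, `RouteResult`, and `resolveUpstream` are illustrative names. It models the host level only and ignores path handling such as static SPA files.

```typescript
// Upstreams copied from the URL layout; the lookup itself is illustrative.
const HOST_ROUTES: ReadonlyMap<string, string> = new Map<string, string>([
  ["ramaiah.app.healix360.net", "server:4000"],
  ["ramaiah.engage.healix360.net", "sidecar-ramaiah:4100"],
  ["global.app.healix360.net", "server:4000"],
  ["global.engage.healix360.net", "sidecar-global:4100"],
  ["telephony.engage.healix360.net", "telephony:4200"],
]);

type RouteResult =
  | { status: 200; upstream: string }
  | { status: 404 };

// Exact host match or 404: there is deliberately no default or catch-all tenant.
function resolveUpstream(hostHeader: string): RouteResult {
  // Normalise case and drop an optional ":port" suffix.
  const host = hostHeader.toLowerCase().replace(/:\d+$/, "");
  const upstream = HOST_ROUTES.get(host);
  if (upstream === undefined) {
    return { status: 404 };
  }
  return { status: 200, upstream };
}
```

Under this model `resolveUpstream("engage.healix360.net")` yields a 404 because the apex has no entry, which matches the dead-end row in the layout.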

Ozonetel campaign webhook URLs — per tenant:

| Campaign | DID | Webhook URL |
|---|---|---|
| Inbound_918041763400 | Ramaiah | https://ramaiah.engage.healix360.net/webhooks/ozonetel/missed-call |
| Inbound_918041763265 | Global (on VPS until cutover) | https://global.engage.healix360.net/webhooks/ozonetel/missed-call |

Ozonetel event subscription (account-level):

| Event | URL |
|---|---|
| Agent events | https://telephony.engage.healix360.net/api/supervisor/agent-event |
| Call events | https://telephony.engage.healix360.net/api/supervisor/call-event |

Diagram

flowchart TB
    subgraph Internet
        OZO[Ozonetel<br/>CCaaS]
        USR_R[Ramaiah users]
        USR_G[Global users]
    end

    subgraph EC2 ["EC2 — 13.234.31.194 (ap-south-1)"]
        CADDY{{"caddy<br/>host-routed<br/>Let's Encrypt"}}

        subgraph TEL ["Telephony Dispatcher"]
            DISP["telephony<br/>NestJS:4200<br/>routes by agentId"]
            RD_T[("redis-telephony")]
        end

        subgraph PLATFORM ["Platform (shared, multi-tenant)"]
            SRV["server<br/>NestJS:4000<br/>platform API + SPA"]
            WKR["worker<br/>BullMQ"]
            DB[("db<br/>postgres:16<br/>schema-per-workspace")]
            CH[("clickhouse<br/>analytics")]
            RP[("redpanda<br/>event bus")]
            MINIO[("minio<br/>S3 storage")]
        end

        subgraph RAMAIAH ["Ramaiah tenant (isolated)"]
            SC_R["sidecar-ramaiah<br/>NestJS:4100<br/>API_KEY=ramaiah admin"]
            RD_R[("redis-ramaiah")]
            VOL_R[/"data-ramaiah volume<br/>theme, telephony,<br/>widget, rules,<br/>setup-state"/]
        end

        subgraph GLOBAL ["Global tenant (isolated)"]
            SC_G["sidecar-global<br/>NestJS:4100<br/>API_KEY=global admin"]
            RD_G[("redis-global")]
            VOL_G[/"data-global volume<br/>theme, telephony,<br/>widget, rules,<br/>setup-state"/]
        end
    end

    USR_R -->|"ramaiah.app.healix360.net"| CADDY
    USR_R -->|"ramaiah.engage.healix360.net"| CADDY
    USR_G -->|"global.app.healix360.net"| CADDY
    USR_G -->|"global.engage.healix360.net"| CADDY

    OZO -->|"webhooks/ozonetel/missed-call<br/>(Ramaiah DID 918041763400)"| CADDY
    OZO -.->|"webhooks/ozonetel/missed-call<br/>(Global DID 918041763265<br/>— still on VPS today)"| CADDY
    OZO -->|"agent + call events<br/>(account-level subscription)"| CADDY
    CADDY -->|"telephony.engage.*"| DISP

    CADDY -->|"*.app.healix360.net<br/>/graphql, /auth/*, SPA"| SRV
    CADDY -->|"ramaiah.engage.*<br/>/api/*, /webhooks/*, SPA"| SC_R
    CADDY -->|"global.engage.*<br/>/api/*, /webhooks/*, SPA"| SC_G

    DISP -->|"agentId lookup<br/>→ forward to sidecar"| SC_R
    DISP -->|"agentId lookup<br/>→ forward to sidecar"| SC_G
    DISP --- RD_T

    SC_R -->|"self-register on boot<br/>heartbeat 30s"| DISP
    SC_G -->|"self-register on boot<br/>heartbeat 30s"| DISP

    SC_R -->|"GraphQL<br/>Origin: ramaiah.app.*"| SRV
    SC_G -->|"GraphQL<br/>Origin: global.app.*"| SRV

    SC_R --- RD_R
    SC_G --- RD_G
    SC_R --- VOL_R
    SC_G --- VOL_G

    SRV --- DB
    SRV --- CH
    SRV --- RP
    SRV --- MINIO
    WKR --- DB
    WKR --- RP

    classDef shared fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef ramaiah fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000
    classDef global fill:#e8f5e9,stroke:#388e3c,stroke-width:2px,color:#000
    classDef external fill:#f5f5f5,stroke:#757575,stroke-width:1px,color:#000
    classDef edge fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#000
    classDef telephony fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000

    class SRV,WKR,DB,CH,RP,MINIO shared
    class SC_R,RD_R,VOL_R ramaiah
    class SC_G,RD_G,VOL_G global
    class OZO,USR_R,USR_G external
    class CADDY edge
    class DISP,RD_T telephony

Components

| Component | Scope | Container count | Purpose |
|---|---|---|---|
| Caddy | Shared | 1 | Host-routed reverse proxy, TLS terminator |
| Platform server | Shared | 1 | Natively multi-tenant by Origin/subdomain |
| Platform worker | Shared | 1 | BullMQ jobs carry workspace context per-job |
| Postgres | Shared | 1 | Multi-tenant via per-workspace schemas |
| ClickHouse | Shared | 1 | Analytics: workspace dimension per event |
| Redpanda | Shared | 1 | Event bus: workspace dimension per message |
| MinIO | Shared | 1 | S3-compatible storage |
| Telephony dispatcher | Shared | 1 | Routes Ozonetel events to correct sidecar by agentId |
| Redis (telephony) | Shared | 1 | Service discovery registry for dispatcher |
| Sidecar | Per-tenant | 2 | Call center layer (Ramaiah + Global) |
| Redis (sidecar) | Per-tenant | 2 | Session, agent state, theme, rules cache |
| Data volume | Per-tenant | 2 | File-based config in /app/data/ |

Telephony Event Flow

Ozonetel event subscriptions are account-level — one subscription per Ozonetel account, not per campaign. All agent login/logout/state events and call events are POSTed to a single URL.

Ozonetel → POST telephony.engage.healix360.net/api/supervisor/agent-event
  → Dispatcher receives { agentId: "ramaiahadmin", action: "incall", ... }
  → Redis lookup: agentId "ramaiahadmin" → sidecar-ramaiah:4100
  → Forward event to sidecar-ramaiah
  → sidecar-ramaiah updates SupervisorService state, emits SSE

Service discovery: Each sidecar self-registers on boot via POST /api/supervisor/register with its agent list, then sends a heartbeat every 30s. Registry entries carry a 90s TTL, so a sidecar can miss two consecutive heartbeats before it is dropped. If a sidecar goes down, its entries expire and the dispatcher stops routing to it.
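
The registration and lookup behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: an in-memory `Map` stands in for the Redis-backed registry, the names `AgentRegistry`, `AgentEvent`, and `routeEvent` are hypothetical, and the clock is passed in explicitly so that expiry is easy to reason about. Only the 90s TTL, the agentId-based lookup, and the example payload fields come from this document.

```typescript
const TTL_MS = 90 * 1000; // 90s TTL from the text; heartbeats arrive every 30s

interface Registration {
  sidecarUrl: string;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for the Redis-backed registry (hypothetical class).
class AgentRegistry {
  private readonly byAgent = new Map<string, Registration>();

  // Called on boot and on every heartbeat; refreshes the TTL for each agent.
  register(sidecarUrl: string, agentIds: readonly string[], now: number): void {
    for (const agentId of agentIds) {
      this.byAgent.set(agentId, { sidecarUrl, expiresAt: now + TTL_MS });
    }
  }

  // Returns the owning sidecar, or undefined if the agent is unknown or expired.
  lookup(agentId: string, now: number): string | undefined {
    const entry = this.byAgent.get(agentId);
    if (entry === undefined) {
      return undefined;
    }
    if (entry.expiresAt <= now) {
      this.byAgent.delete(agentId);
      return undefined;
    }
    return entry.sidecarUrl;
  }
}

interface AgentEvent {
  agentId: string;
  action: string;
}

// Picks the forwarding target for an event; undefined means "no live sidecar".
function routeEvent(registry: AgentRegistry, event: AgentEvent, now: number): string | undefined {
  return registry.lookup(event.agentId, now);
}
```

With Redis, the expiry check would normally be delegated to key TTLs rather than done in application code; the explicit comparison here only makes the rule visible.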


Request Flow

Agent opens Ramaiah Helix Engage

Browser → https://ramaiah.engage.healix360.net/
  → Caddy (TLS, Host=ramaiah.engage.healix360.net)
  → static SPA from /srv/engage

Browser → POST /api/auth/login { email, password }
  → Caddy → sidecar-ramaiah:4100
  → sidecar calls platform with:
      Origin: https://ramaiah.app.healix360.net
      Authorization: Bearer <Ramaiah API key>
  → platform resolves workspace by Origin → Ramaiah
  → JWT returned
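
The tenant-binding headers in this flow can be sketched as a small helper. The header names and the Origin pattern follow the trace above; the function name `platformHeaders` is hypothetical, and the sketch does not show how the sidecar actually issues the request.

```typescript
// Hypothetical helper: builds the headers that tie a platform call to one workspace.
function platformHeaders(workspaceSubdomain: string, apiKey: string): Record<string, string> {
  return {
    // The platform resolves the workspace from this Origin.
    Origin: `https://${workspaceSubdomain}.app.healix360.net`,
    // Per-workspace API key, set once per sidecar instance.
    Authorization: `Bearer ${apiKey}`,
  };
}
```

Because both values are fixed at boot for a given sidecar, every platform call it makes is scoped to the same workspace.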

Ozonetel POSTs a missed-call webhook

Ozonetel → POST https://ramaiah.engage.healix360.net/webhooks/ozonetel/missed-call
  → Caddy (Host=ramaiah.engage.healix360.net)
  → sidecar-ramaiah:4100 ONLY
  → writes call row into Ramaiah workspace via platform

Cross-tenant webhook leakage is ruled out at the routing layer: Caddy forwards a request only to the upstream of the host block it matched, so a webhook addressed to ramaiah.engage.healix360.net can never reach sidecar-global.


Failure Modes

| Failure | Blast radius |
|---|---|
| sidecar-ramaiah crashes | Ramaiah Engage 502s. Global + platform unaffected. |
| sidecar-global crashes | Global Engage 502s. Ramaiah + platform unaffected. |
| redis-ramaiah crashes | Ramaiah agents kicked from SIP. Global unaffected. |
| telephony crashes | Agent/call state events stop routing. Sidecars still serve UI. |
| server (platform) crashes | Both workspaces down for data. |
| db crashes | Same as above. |
| Caddy crashes | Nothing reachable until restart. |

Adding a New Hospital

  1. Add sidecar container + Redis + data volume in docker-compose.yml
  2. Add Caddy host block for newhospital.engage.healix360.net
  3. Create workspace on platform, generate API key
  4. Set sidecar env: PLATFORM_API_KEY, PLATFORM_WORKSPACE_SUBDOMAIN
  5. Configure Ozonetel campaign webhook to newhospital.engage.healix360.net/webhooks/ozonetel/missed-call
  6. Sidecar self-registers with telephony dispatcher on boot — no dispatcher config needed
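
The names involved in these steps can be derived from a single tenant slug, as in the sketch below. It assumes the new tenant follows the same naming pattern as the existing ones (sidecar-ramaiah, redis-ramaiah, data-ramaiah) and that the slug doubles as the workspace subdomain; `TenantPlan` and `planTenant` are hypothetical names, and the function only computes values rather than touching docker-compose.yml, Caddy, or Ozonetel.

```typescript
interface TenantPlan {
  sidecarContainer: string;
  redisContainer: string;
  dataVolume: string;
  engageHost: string;
  missedCallWebhook: string;
  env: { PLATFORM_API_KEY: string; PLATFORM_WORKSPACE_SUBDOMAIN: string };
}

// Derives per-tenant names for steps 1, 2, 4 and 5 from one slug (assumed convention).
function planTenant(slug: string, apiKey: string): TenantPlan {
  // Assumed rule: the slug must be a single lowercase DNS label.
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(slug)) {
    throw new Error(`invalid tenant slug: ${slug}`);
  }
  const engageHost = `${slug}.engage.healix360.net`;
  return {
    sidecarContainer: `sidecar-${slug}`,
    redisContainer: `redis-${slug}`,
    dataVolume: `data-${slug}`,
    engageHost,
    missedCallWebhook: `https://${engageHost}/webhooks/ozonetel/missed-call`,
    env: {
      PLATFORM_API_KEY: apiKey,
      PLATFORM_WORKSPACE_SUBDOMAIN: slug,
    },
  };
}
```

Generating every name from one slug keeps the compose service, the Caddy host block, and the Ozonetel webhook URL from drifting apart.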