feat: add E2E smoke tests, architecture docs, and operations runbook

- 27 Playwright E2E tests covering login (invalid credentials, CC Agent,
  Supervisor), CC Agent pages (call desk, call history, patients,
  appointments, my performance, sidebar, sign-out), and Supervisor pages
  (all 11 pages + sidebar)
- Tests run against live EC2 at ramaiah.engage.healix360.net
- Last test completes sign-out to release agent session for next run
- Architecture doc with updated Mermaid diagram including telephony
  dispatcher, service discovery, and multi-tenant topology
- Operations runbook with SSH access (VPS + EC2), accounts, container
  reference, deploy steps, Redis ops, and troubleshooting guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:54:20 +05:30
parent a1598716ee
commit 1cdb7fe9e7
11 changed files with 1056 additions and 0 deletions

docs/architecture.md

@@ -0,0 +1,248 @@
# Helix Engage — Architecture
Single EC2 instance (Mumbai `ap-south-1`) hosting two isolated Helix Engage
workspaces on top of a shared FortyTwo platform. Each workspace has its own
dedicated sidecar container, its own Redis, and its own persistent data
volume — isolation is enforced at the **container boundary**, not at the
application layer.

**Host:** `13.234.31.194` (m6i.xlarge, Ubuntu 22.04)

**DNS:** Cloudflare zone `healix360.net`

**TLS:** Caddy + Let's Encrypt, HTTP-01 challenge per hostname

---
## Principles
1. **Platform is multi-tenant by design.** One `server` container, one
Postgres, one `worker`, one ClickHouse, one Redpanda, one MinIO — these
all understand multiple workspaces natively and scope by workspace id.
2. **Sidecar is single-tenant by design.** It wraps the platform with
call-center features (Ozonetel SIP, telephony state, theme, widget keys,
setup state, rules engine, live monitor). Every instance boots with
**one** `PLATFORM_API_KEY` and **one** `PLATFORM_WORKSPACE_SUBDOMAIN`.
We run one instance per workspace.
3. **Caddy is strictly host-routed.** No default or catchall tenant.
A request lands on a host block or it 404s. The apex
`engage.healix360.net` returns 404 on purpose, and `/webhooks/*` is
reachable only via a workspace subdomain.
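4. **Redis is per-sidecar.** Sidecar Redis keys carry no workspace
dimension (for example `agent:session:<agentId>`), so two sidecars sharing
one Redis would overwrite each other's keys. Each sidecar therefore gets its
own Redis container: hard isolation at the database level, zero code changes.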
5. **Telephony dispatcher routes events by agent.** Ozonetel event
subscriptions are account-level (not per-campaign). A single dispatcher
receives all agent/call events and routes them to the correct sidecar
using Redis-backed service discovery.
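
The collision that Principle 4 avoids can be shown with a small model. In this sketch a `Map` stands in for one Redis database. The key `masterdata:departments` is hypothetical and, like the real sidecar keys, has no workspace prefix.

```typescript
// A Map stands in for one Redis database. "masterdata:departments" is a
// hypothetical key that, like the real sidecar keys, has no workspace prefix.
type RedisLike = Map<string, string>;

// Each sidecar writes the same key name; only the store it writes to differs.
function cacheDepartments(redis: RedisLike, tenant: string): void {
  redis.set("masterdata:departments", `${tenant} departments`);
}

// One shared Redis: the second sidecar silently overwrites the first.
const shared: RedisLike = new Map();
cacheDepartments(shared, "ramaiah");
cacheDepartments(shared, "global");

// One Redis per sidecar: identical key names no longer collide.
const redisRamaiah: RedisLike = new Map();
const redisGlobal: RedisLike = new Map();
cacheDepartments(redisRamaiah, "ramaiah");
cacheDepartments(redisGlobal, "global");

console.log(shared.get("masterdata:departments")); // global departments
console.log(redisRamaiah.get("masterdata:departments")); // ramaiah departments
```
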
---
## URL Layout
| Who | URL | Routes to |
|---|---|---|
| Ramaiah platform UI | `https://ramaiah.app.healix360.net` | `server:4000` |
| Ramaiah Helix Engage | `https://ramaiah.engage.healix360.net` | `sidecar-ramaiah:4100` |
| Global platform UI | `https://global.app.healix360.net` | `server:4000` |
| Global Helix Engage | `https://global.engage.healix360.net` | `sidecar-global:4100` |
| Telephony dispatcher | `https://telephony.engage.healix360.net` | `telephony:4200` |
| Apex (dead-end) | `https://engage.healix360.net` | `404` |

Ozonetel campaign webhook URLs, per tenant:

| Campaign | Tenant | Webhook URL |
|---|---|---|
| `Inbound_918041763400` | Ramaiah | `https://ramaiah.engage.healix360.net/webhooks/ozonetel/missed-call` |
| `Inbound_918041763265` | Global (on VPS until cutover) | `https://global.engage.healix360.net/webhooks/ozonetel/missed-call` |

Ozonetel event subscription (account-level):

| Event | URL |
|---|---|
| Agent events | `https://telephony.engage.healix360.net/api/supervisor/agent-event` |
| Call events | `https://telephony.engage.healix360.net/api/supervisor/call-event` |
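
The host-routing rule (Principle 3) applied to this table can be modeled as follows. This is a hypothetical model, not the real Caddyfile. It shows that routing depends only on the Host header and has no fallback entry.

```typescript
// Hypothetical model of Caddy's host routing, taken from the URL Layout table.
// It mirrors the rules only; it is not the real Caddyfile.
const ROUTES = new Map<string, string>([
  ["ramaiah.app.healix360.net", "server:4000"],
  ["global.app.healix360.net", "server:4000"],
  ["ramaiah.engage.healix360.net", "sidecar-ramaiah:4100"],
  ["global.engage.healix360.net", "sidecar-global:4100"],
  ["telephony.engage.healix360.net", "telephony:4200"],
]);

// Returns the upstream for a Host header, or 404 when no host block matches.
// There is deliberately no fallback entry, so the apex and unknown hosts 404
// whatever the path, which is why /webhooks/* only works on a tenant host.
function route(host: string): string | 404 {
  // Host headers are case-insensitive and may carry a port.
  const hostname = host.toLowerCase().replace(/:\d+$/, "");
  return ROUTES.get(hostname) ?? 404;
}

console.log(route("ramaiah.engage.healix360.net")); // sidecar-ramaiah:4100
console.log(route("engage.healix360.net")); // 404
```
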
---
## Diagram
```mermaid
flowchart TB
subgraph Internet
OZO[Ozonetel<br/>CCaaS]
USR_R[Ramaiah users]
USR_G[Global users]
end
subgraph EC2 ["EC2 — 13.234.31.194 (ap-south-1)"]
CADDY{{"caddy<br/>host-routed<br/>Let's Encrypt"}}
subgraph TEL ["Telephony Dispatcher"]
DISP["telephony<br/>NestJS:4200<br/>routes by agentId"]
RD_T[("redis-telephony")]
end
subgraph PLATFORM ["Platform (shared, multi-tenant)"]
SRV["server<br/>NestJS:4000<br/>platform API + SPA"]
WKR["worker<br/>BullMQ"]
DB[("db<br/>postgres:16<br/>workspace-per-schema")]
CH[("clickhouse<br/>analytics")]
RP[("redpanda<br/>event bus")]
MINIO[("minio<br/>S3 storage")]
end
subgraph RAMAIAH ["Ramaiah tenant (isolated)"]
SC_R["sidecar-ramaiah<br/>NestJS:4100<br/>API_KEY=ramaiah admin"]
RD_R[("redis-ramaiah")]
VOL_R[/"data-ramaiah volume<br/>theme, telephony,<br/>widget, rules,<br/>setup-state"/]
end
subgraph GLOBAL ["Global tenant (isolated)"]
SC_G["sidecar-global<br/>NestJS:4100<br/>API_KEY=global admin"]
RD_G[("redis-global")]
VOL_G[/"data-global volume<br/>theme, telephony,<br/>widget, rules,<br/>setup-state"/]
end
end
USR_R -->|"ramaiah.app.healix360.net"| CADDY
USR_R -->|"ramaiah.engage.healix360.net"| CADDY
USR_G -->|"global.app.healix360.net"| CADDY
USR_G -->|"global.engage.healix360.net"| CADDY
OZO -->|"webhooks/ozonetel/missed-call<br/>(Ramaiah DID 918041763400)"| CADDY
OZO -.->|"webhooks/ozonetel/missed-call<br/>(Global DID 918041763265<br/>— still on VPS today)"| CADDY
OZO -->|"agent + call events<br/>(account-level subscription)"| CADDY
CADDY -->|"telephony.engage.*"| DISP
CADDY -->|"*.app.healix360.net<br/>/graphql, /auth/*, SPA"| SRV
CADDY -->|"ramaiah.engage.*<br/>/api/*, /webhooks/*, SPA"| SC_R
CADDY -->|"global.engage.*<br/>/api/*, /webhooks/*, SPA"| SC_G
DISP -->|"agentId lookup<br/>→ forward to sidecar"| SC_R
DISP -->|"agentId lookup<br/>→ forward to sidecar"| SC_G
DISP --- RD_T
SC_R -->|"self-register on boot<br/>heartbeat 30s"| DISP
SC_G -->|"self-register on boot<br/>heartbeat 30s"| DISP
SC_R -->|"GraphQL<br/>Origin: ramaiah.app.*"| SRV
SC_G -->|"GraphQL<br/>Origin: global.app.*"| SRV
SC_R --- RD_R
SC_G --- RD_G
SC_R --- VOL_R
SC_G --- VOL_G
SRV --- DB
SRV --- CH
SRV --- RP
SRV --- MINIO
WKR --- DB
WKR --- RP
classDef shared fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
classDef ramaiah fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000
classDef global fill:#e8f5e9,stroke:#388e3c,stroke-width:2px,color:#000
classDef external fill:#f5f5f5,stroke:#757575,stroke-width:1px,color:#000
classDef edge fill:#fce4ec,stroke:#c2185b,stroke-width:2px,color:#000
classDef telephony fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
class SRV,WKR,DB,CH,RP,MINIO shared
class SC_R,RD_R,VOL_R ramaiah
class SC_G,RD_G,VOL_G global
class OZO,USR_R,USR_G external
class CADDY edge
class DISP,RD_T telephony
```
---
## Components
| Component | Scope | Container count | Purpose |
|---|---|---|---|
| Caddy | Shared | 1 | Host-routed reverse proxy, TLS terminator |
| Platform server | Shared | 1 | Natively multi-tenant by Origin/subdomain |
| Platform worker | Shared | 1 | BullMQ jobs carry workspace context per-job |
| Postgres | Shared | 1 | Multi-tenant via per-workspace schemas |
| ClickHouse | Shared | 1 | Analytics — workspace dimension per event |
| Redpanda | Shared | 1 | Event bus — workspace dimension per message |
| MinIO | Shared | 1 | S3-compatible storage |
| Redis (platform) | Shared | 1 | Platform Redis; queue backend for the BullMQ worker |
| **Telephony dispatcher** | **Shared** | **1** | Routes Ozonetel events to correct sidecar by agentId |
| **Redis (telephony)** | **Shared** | **1** | Service discovery registry for dispatcher |
| **Sidecar** | **Per-tenant** | **2** | Call center layer (Ramaiah + Global) |
| **Redis (sidecar)** | **Per-tenant** | **2** | Session, agent state, theme, rules cache |
| **Data volume** | **Per-tenant** | **2** | File-based config in `/app/data/` |
---
## Telephony Event Flow
Ozonetel event subscriptions are **account-level** — one subscription per Ozonetel account, not per campaign. All agent login/logout/state events and call events are POSTed to a single URL.
```
Ozonetel → POST telephony.engage.healix360.net/api/supervisor/agent-event
→ Dispatcher receives { agentId: "ramaiahadmin", action: "incall", ... }
→ Redis lookup: agentId "ramaiahadmin" → sidecar-ramaiah:4100
→ Forward event to sidecar-ramaiah
→ sidecar-ramaiah updates SupervisorService state, emits SSE
```
**Service discovery:** Each sidecar self-registers with the dispatcher on boot via `POST /api/supervisor/register`, sending its agent list, and then sends a heartbeat every 30s. Registry entries have a 90s TTL. If a sidecar goes down, its entries expire within 90s of its last heartbeat and the dispatcher stops routing to it.

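
The registry's TTL behaviour can be sketched in memory. The class and method names here are hypothetical. A `Map` with explicit expiry times stands in for Redis keys written with a TTL, and time is passed in explicitly so the expiry can be checked without waiting.

```typescript
// In-memory sketch of the dispatcher's registry; a Map with expiry times stands
// in for Redis keys written with a TTL. Class and method names are hypothetical.
const HEARTBEAT_MS = 30_000; // sidecar heartbeat interval
const TTL_MS = 90_000; // lifetime of each registry entry

interface RegistryEntry {
  upstream: string; // e.g. "sidecar-ramaiah:4100"
  expiresAt: number; // epoch ms
}

class Registry {
  private readonly byAgent = new Map<string, RegistryEntry>();

  // Boot registration and every heartbeat rewrite the entries with a fresh TTL,
  // much like `SET <key> <upstream> PX 90000` per agent would in Redis.
  register(upstream: string, agentIds: string[], now: number): void {
    for (const id of agentIds) {
      this.byAgent.set(id, { upstream, expiresAt: now + TTL_MS });
    }
  }

  // Dispatcher lookup for an incoming Ozonetel event. Returns undefined for an
  // unknown agent or one whose sidecar has stopped heartbeating.
  resolve(agentId: string, now: number): string | undefined {
    const entry = this.byAgent.get(agentId);
    if (entry === undefined || entry.expiresAt <= now) {
      this.byAgent.delete(agentId);
      return undefined;
    }
    return entry.upstream;
  }
}

const registry = new Registry();
registry.register("sidecar-ramaiah:4100", ["ramaiahadmin"], 0);
console.log(registry.resolve("ramaiahadmin", HEARTBEAT_MS)); // sidecar-ramaiah:4100
console.log(registry.resolve("ramaiahadmin", TTL_MS)); // undefined (no heartbeat for 90 s)
```

Because a heartbeat simply refreshes the TTL, neither a sidecar restart nor a crash needs dispatcher-side cleanup. Stale entries age out on their own.
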
---
## Request Flow
### Agent opens Ramaiah Helix Engage
```
Browser → https://ramaiah.engage.healix360.net/
→ Caddy (TLS, Host=ramaiah.engage.healix360.net)
→ static SPA from /srv/engage
Browser → POST /api/auth/login { email, password }
→ Caddy → sidecar-ramaiah:4100
→ sidecar calls platform with:
Origin: https://ramaiah.app.healix360.net
Authorization: Bearer <Ramaiah API key>
→ platform resolves workspace by Origin → Ramaiah
→ JWT returned
```
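
The two headers in this flow can be sketched as a request builder. `PLATFORM_API_KEY` and `PLATFORM_WORKSPACE_SUBDOMAIN` are the env vars this doc names. The internal base URL `http://server:4000` and the `/graphql` path are assumptions based on the component ports and the diagram. The function name is hypothetical.

```typescript
// Hypothetical sketch of a sidecar's platform request. The env var names come
// from this doc; "http://server:4000/graphql" is an assumed internal URL.
interface SidecarEnv {
  PLATFORM_API_KEY: string;
  PLATFORM_WORKSPACE_SUBDOMAIN: string;
}

function platformRequest(env: SidecarEnv, query: string) {
  return {
    url: "http://server:4000/graphql",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // The platform resolves the workspace from Origin...
        Origin: `https://${env.PLATFORM_WORKSPACE_SUBDOMAIN}.app.healix360.net`,
        // ...and authenticates the sidecar with that workspace's API key.
        Authorization: `Bearer ${env.PLATFORM_API_KEY}`,
      },
      body: JSON.stringify({ query }),
    },
  };
}

const req = platformRequest(
  { PLATFORM_API_KEY: "<Ramaiah API key>", PLATFORM_WORKSPACE_SUBDOMAIN: "ramaiah" },
  "{ __typename }",
);
console.log(req.init.headers.Origin); // https://ramaiah.app.healix360.net
```

Since both values come from the sidecar's own environment, a sidecar booted with Ramaiah's key and subdomain can only ever address the Ramaiah workspace.
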
### Ozonetel POSTs a missed-call webhook
```
Ozonetel → POST https://ramaiah.engage.healix360.net/webhooks/ozonetel/missed-call
→ Caddy (Host=ramaiah.engage.healix360.net)
→ sidecar-ramaiah:4100 ONLY
→ writes call row into Ramaiah workspace via platform
```
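For webhooks, cross-tenant leakage is ruled out by routing. Caddy matches on the Host header, so a request to `ramaiah.engage.healix360.net` can reach only `sidecar-ramaiah`, never `sidecar-global`. Agent and call events take a different path through the shared telephony dispatcher. Their isolation relies on each Ozonetel agentId being registered by exactly one sidecar.
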
---
## Failure Modes
| Failure | Blast radius |
|---|---|
| `sidecar-ramaiah` crashes | Ramaiah Engage 502s. Global + platform unaffected. |
| `sidecar-global` crashes | Global Engage 502s. Ramaiah + platform unaffected. |
| `redis-ramaiah` crashes | Ramaiah agents kicked from SIP. Global unaffected. |
| `telephony` crashes | Agent/call state events stop routing. Sidecars still serve UI. |
| `server` (platform) crashes | **Both workspaces** down for data. |
| `db` crashes | Same as above. |
| Caddy crashes | Nothing reachable until restart. |
---
## Adding a New Hospital
1. Add sidecar container + Redis + data volume in `docker-compose.yml`
2. Add Caddy host block for `newhospital.engage.healix360.net`
3. Create workspace on platform, generate API key
4. Set sidecar env: `PLATFORM_API_KEY`, `PLATFORM_WORKSPACE_SUBDOMAIN`
5. Configure Ozonetel campaign webhook to `newhospital.engage.healix360.net/webhooks/ozonetel/missed-call`
6. Sidecar self-registers with telephony dispatcher on boot — no dispatcher config needed
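
The per-tenant names in these steps follow one pattern, so they can be derived from a single subdomain. The helper below is hypothetical and not part of the real tooling. It mirrors the existing `sidecar-ramaiah` / `redis-ramaiah` / `data-ramaiah` naming and the webhook path above.

```typescript
// Hypothetical helper deriving the per-tenant names used in the steps above.
// Naming mirrors sidecar-ramaiah / redis-ramaiah / data-ramaiah.
interface TenantPlan {
  sidecarService: string;
  redisService: string;
  dataVolume: string;
  engageHost: string;
  webhookUrl: string;
  sidecarEnv: Record<string, string>;
}

function planTenant(subdomain: string, apiKey: string): TenantPlan {
  // The subdomain becomes a DNS label and part of compose service names,
  // so restrict it to lowercase letters, digits and inner hyphens.
  if (!/^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/.test(subdomain)) {
    throw new Error(`invalid subdomain: ${subdomain}`);
  }
  const engageHost = `${subdomain}.engage.healix360.net`;
  return {
    sidecarService: `sidecar-${subdomain}`,
    redisService: `redis-${subdomain}`,
    dataVolume: `data-${subdomain}`,
    engageHost,
    webhookUrl: `https://${engageHost}/webhooks/ozonetel/missed-call`,
    sidecarEnv: {
      PLATFORM_API_KEY: apiKey,
      PLATFORM_WORKSPACE_SUBDOMAIN: subdomain,
    },
  };
}

const plan = planTenant("newhospital", "<new workspace API key>");
console.log(plan.webhookUrl); // https://newhospital.engage.healix360.net/webhooks/ozonetel/missed-call
```
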

docs/runbook.md

@@ -0,0 +1,322 @@
# Helix Engage — Operations Runbook
Day-to-day operations guide for deploying, debugging, and maintaining Helix Engage.

---
## Environments
| | **VPS (Global)** | **EC2 (Ramaiah)** |
|---|---|---|
| **Host** | `148.230.67.184` | `13.234.31.194` |
| **Domain** | `engage-api.srv1477139.hstgr.cloud` | `*.engage.healix360.net` |
| **Docker path** | `/opt/fortytwo` | `/opt/fortytwo` |
| **Topology** | Single-tenant | Multi-tenant (2 sidecars + telephony) |
---
## SSH Access
### VPS (Global)
```bash
sshpass -p 'SasiSuman@2007' ssh -o StrictHostKeyChecking=no root@148.230.67.184
```
### EC2 (Ramaiah)
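The SSH key is at `~/Downloads/fortytwoai_hostinger` (passphrase-protected).
The EC2 commands in this runbook pass `-i /tmp/ramaiah-ec2-key`, so they expect a
decrypted copy there (Option A). Recreate it after a reboot, since `/tmp` is usually cleared.

**First-time setup (one of these):**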
```bash
# Option A: Decrypt key file (openssl prompts for the passphrase: SasiSuman@2007)
openssl pkey -in ~/Downloads/fortytwoai_hostinger -out /tmp/ramaiah-ec2-key
chmod 600 /tmp/ramaiah-ec2-key
# Option B: Add to ssh-agent (interactive — prompts for passphrase)
ssh-add ~/Downloads/fortytwoai_hostinger
```
**After setup** (with Option B, ssh-agent supplies the key, so `-i` can be dropped):
```bash
ssh -i /tmp/ramaiah-ec2-key -o StrictHostKeyChecking=no ubuntu@13.234.31.194
```
**Quick alias for repeated use:**
```bash
alias ec2="ssh -i /tmp/ramaiah-ec2-key -o StrictHostKeyChecking=no ubuntu@13.234.31.194"
alias vps="sshpass -p 'SasiSuman@2007' ssh -o StrictHostKeyChecking=no root@148.230.67.184"
```
---
## Accounts
### Ramaiah (EC2)
| Role | Email | Password | Notes |
|------|-------|----------|-------|
| Marketing Executive | `marketing@ramaiahcare.com` | `AdRamaiah@2026` | Landing: Lead Workspace |
| Marketing Executive | `supervisor@ramaiahcare.com` | `MrRamaiah@2026` | Landing: Lead Workspace |
| CC Agent | `ccagent@ramaiahcare.com` | `CcRamaiah@2026` | Ozonetel agent: `ramaiahadmin` |
| Platform Admin | `dev@fortytwo.dev` | `tim@apple.dev` | Break-glass admin. **NEVER delete.** |
### Ozonetel
| Field | Value |
|-------|-------|
| API Key | `KK8110e6c3de02527f7243ffaa924fa93e` |
| Username | `global_healthx` |
| Ramaiah Campaign | `Inbound_918041763400` |
| Global Campaign | `Inbound_918041763265` |
| Ramaiah Agent | `ramaiahadmin` / ext `524435` |
---
## EC2 Containers
```bash
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker ps --format 'table {{.Names}}\t{{.Status}}'"
```
| Container | Purpose | Container port |
|-----------|---------|------|
| `ramaiah-prod-caddy-1` | Reverse proxy + TLS | 80, 443 |
| `ramaiah-prod-server-1` | Platform API | 4000 |
| `ramaiah-prod-worker-1` | BullMQ worker | — |
| `ramaiah-prod-sidecar-ramaiah-1` | Ramaiah sidecar | 4100 |
| `ramaiah-prod-sidecar-global-1` | Global sidecar | 4100 |
| `ramaiah-prod-telephony-1` | Event dispatcher | 4200 |
| `ramaiah-prod-redis-ramaiah-1` | Ramaiah Redis | 6379 |
| `ramaiah-prod-redis-global-1` | Global Redis | 6379 |
| `ramaiah-prod-redis-telephony-1` | Telephony Redis | 6379 |
| `ramaiah-prod-redis-1` | Platform Redis | 6379 |
| `ramaiah-prod-db-1` | PostgreSQL | 5432 |
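
`docker compose` commands take service names such as `sidecar-ramaiah`. `docker logs` and `docker exec` take container names. Compose v2 names containers `<project>-<service>-<index>`, and the sketch below shows that mapping. The project name `ramaiah-prod` is inferred from the table above, not read from the compose file. The helpers are hypothetical.

```typescript
// Compose v2 names containers <project>-<service>-<index>. The project name
// "ramaiah-prod" is inferred from the container table, not the compose file.
function containerName(service: string, project = "ramaiah-prod", index = 1): string {
  return `${project}-${service}-${index}`;
}

// Builds the remote half of the commands used in "Checking Logs".
function logsCommand(service: string, tail = 30): string {
  return `docker logs ${containerName(service)} --tail ${tail} 2>&1`;
}

console.log(containerName("sidecar-ramaiah")); // ramaiah-prod-sidecar-ramaiah-1
console.log(logsCommand("telephony")); // docker logs ramaiah-prod-telephony-1 --tail 30 2>&1
```
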
---
## Checking Logs
```bash
# EC2 — Ramaiah sidecar
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker logs ramaiah-prod-sidecar-ramaiah-1 --tail 30 2>&1"
# EC2 — Telephony dispatcher
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker logs ramaiah-prod-telephony-1 --tail 30 2>&1"
# EC2 — Platform server
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker logs ramaiah-prod-server-1 --tail 30 2>&1"
# EC2 — Caddy
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker logs ramaiah-prod-caddy-1 --tail 20 2>&1"
# EC2 — Filter errors only
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker logs ramaiah-prod-sidecar-ramaiah-1 --tail 100 2>&1" | grep -iE "error|fail|crash"
# VPS — Sidecar
sshpass -p 'SasiSuman@2007' ssh root@148.230.67.184 \
"docker logs fortytwo-staging-sidecar-1 --tail 30 2>&1"
```
**Healthy sidecar output:**
- `Nest application successfully started`
- `Helix Engage Server running on port 4100`
- `SessionService Redis connected`
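
The three healthy markers above can be checked mechanically. The function below is a hypothetical sketch that reports which markers are missing from a log dump.

```typescript
// Hypothetical check for the three "healthy sidecar" log lines listed above.
const HEALTHY_MARKERS = [
  "Nest application successfully started",
  "Helix Engage Server running on port 4100",
  "SessionService Redis connected",
] as const;

// Returns the markers absent from a log dump, e.g. the output of
// `docker logs ramaiah-prod-sidecar-ramaiah-1 --tail 30 2>&1`.
function missingHealthMarkers(log: string): string[] {
  return HEALTHY_MARKERS.filter((marker) => !log.includes(marker));
}
```

These lines are printed at startup. On a container that has been up for a while they may have scrolled out of a `--tail 30` window, so a non-empty result from a long-running container does not by itself mean the sidecar is unhealthy.
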
---
## Deploying
### Pre-flight checks
```bash
# Frontend type check (subshells keep each cd local, so both run from the same parent dir)
(cd helix-engage && npx tsc --noEmit)
# Sidecar build check
(cd helix-engage-server && npm run build)
```
### Frontend (EC2)
```bash
cd helix-engage && npm run build
rsync -avz -e "ssh -i /tmp/ramaiah-ec2-key -o StrictHostKeyChecking=no" \
dist/ ubuntu@13.234.31.194:/opt/fortytwo/helix-engage-frontend/
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"cd /opt/fortytwo && sudo docker compose restart caddy"
```
### Sidecar (EC2 — via ECR)
```bash
cd helix-engage-server
# ECR login + build + push
aws ecr get-login-password --region ap-south-1 | \
docker login --username AWS --password-stdin 043728036361.dkr.ecr.ap-south-1.amazonaws.com
docker buildx build --platform linux/amd64 \
-t 043728036361.dkr.ecr.ap-south-1.amazonaws.com/fortytwo-eap/helix-engage-sidecar:alpha \
--push .
# Pull + restart on EC2
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"cd /opt/fortytwo && sudo docker compose pull sidecar-ramaiah sidecar-global && sudo docker compose up -d sidecar-ramaiah sidecar-global"
```
### VPS (Global)
```bash
cd /Users/satyasumansaridae/Downloads/fortytwo-eap
bash deploy.sh frontend # Frontend only
bash deploy.sh sidecar # Sidecar only
bash deploy.sh all # Both
```
---
## Post-Deploy: E2E Smoke Tests
```bash
cd helix-engage
# Run against EC2 (default)
npx playwright test
# Run against VPS
E2E_BASE_URL=https://engage-api.srv1477139.hstgr.cloud npx playwright test
```
27 tests covering login (invalid credentials, CC Agent, Supervisor), every page
for both roles, and sign-out. The last test completes sign-out so the agent
session is released for the next run (the sidecar allows only one session per
Ozonetel agent).

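
The default-target behaviour above can be sketched as a config helper. The helper and its https-only check are hypothetical. `baseURL` is Playwright's real `use` option, and the default host is the EC2 host this commit names.

```typescript
// Hypothetical helper for playwright.config.ts: E2E_BASE_URL overrides the
// default target, which this commit states is the live EC2 Ramaiah host.
const DEFAULT_E2E_BASE_URL = "https://ramaiah.engage.healix360.net";

function resolveE2eBaseUrl(env: Record<string, string | undefined>): string {
  const raw = (env["E2E_BASE_URL"] ?? "").trim() || DEFAULT_E2E_BASE_URL;
  // Refuse plain HTTP so smoke tests never send credentials unencrypted.
  if (!raw.startsWith("https://")) {
    throw new Error(`E2E_BASE_URL must start with https://, got ${raw}`);
  }
  return raw.replace(/\/+$/, ""); // drop trailing slashes
}

// In the config this would feed Playwright's `use: { baseURL }` option, e.g.
// use: { baseURL: resolveE2eBaseUrl(process.env) }
console.log(resolveE2eBaseUrl({})); // https://ramaiah.engage.healix360.net
```
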
---
## Redis Operations
### EC2 (Ramaiah sidecar Redis)
```bash
# Arrays keep each argument intact in both bash and zsh (zsh does not word-split a plain $SSH)
SSH=(ssh -i /tmp/ramaiah-ec2-key -o StrictHostKeyChecking=no ubuntu@13.234.31.194)
REDIS="docker exec ramaiah-prod-redis-ramaiah-1 redis-cli"
# Clear agent session lock (fixes "already logged in from another device")
"${SSH[@]}" "$REDIS DEL agent:session:ramaiahadmin"
# List all keys (--scan iterates with SCAN; KEYS '*' blocks Redis while it runs)
"${SSH[@]}" "$REDIS --scan"
# Clear caller cache (stale patient names)
"${SSH[@]}" "$REDIS --scan --pattern 'caller:*' | xargs -r docker exec -i ramaiah-prod-redis-ramaiah-1 redis-cli DEL"
# Clear masterdata cache
"${SSH[@]}" "$REDIS --scan --pattern 'masterdata:*' | xargs -r docker exec -i ramaiah-prod-redis-ramaiah-1 redis-cli DEL"
# Clear agent name cache
"${SSH[@]}" "$REDIS --scan --pattern 'agent:name:*' | xargs -r docker exec -i ramaiah-prod-redis-ramaiah-1 redis-cli DEL"
# Nuclear: flush all
"${SSH[@]}" "$REDIS FLUSHDB"
```
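
Before piping a `--scan --pattern` result into `DEL`, it helps to preview what a pattern will hit. A broad `agent:*` also matches the `agent:session:*` login locks. This minimal sketch handles only the `*` wildcard. Real Redis glob patterns also support `?`, `[...]` and `\` escapes. The sample key suffixes are made up.

```typescript
// Minimal preview of which keys a redis-cli --scan --pattern would match.
// Only `*` is supported; Redis globs also allow `?`, `[...]` and escapes.
function globToRegExp(pattern: string): RegExp {
  const body = pattern
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape literals
    .join(".*");
  return new RegExp(`^${body}$`);
}

function previewDeletion(keys: string[], pattern: string): string[] {
  const re = globToRegExp(pattern);
  return keys.filter((key) => re.test(key));
}

// Prefixes follow this section; the suffixes are made-up examples.
const sampleKeys = [
  "agent:session:ramaiahadmin",
  "agent:name:ramaiahadmin",
  "caller:+919800000000",
  "masterdata:doctors",
];
console.log(previewDeletion(sampleKeys, "agent:name:*"));
console.log(previewDeletion(sampleKeys, "agent:*")); // also hits the session lock
```
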
### VPS (Global sidecar Redis)
```bash
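# Use an array: in a plain string the quotes around the password would reach
# sshpass literally (bash) or the whole string would be one word (zsh)
SSH=(sshpass -p 'SasiSuman@2007' ssh -o StrictHostKeyChecking=no root@148.230.67.184)
REDIS="docker exec fortytwo-staging-redis-1 redis-cli"
"${SSH[@]}" "$REDIS DEL agent:session:<agentId>"
"${SSH[@]}" "$REDIS FLUSHDB"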
```
---
## Troubleshooting
### "Already logged in from another device"
The sidecar enforces single-session per Ozonetel agent. Clear the lock:
```bash
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker exec ramaiah-prod-redis-ramaiah-1 redis-cli DEL agent:session:ramaiahadmin"
```
### Agent stuck in ACW / Wrapping Up
Three protection layers exist (beforeunload → sendBeacon → server 30s timer).
If all fail, force-ready:
```bash
curl -X POST https://ramaiah.engage.healix360.net/api/maint/force-ready \
-H "Content-Type: application/json" \
-d '{"agentId": "ramaiahadmin"}'
```
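
The third protection layer, the server-side 30s timer, can be sketched as a watchdog. The class and method names are hypothetical. The force-ready action is passed in as a callback (in practice it would do what the `force-ready` call above does). Time is passed in explicitly so the logic can be checked without waiting. A real service would drive `tick()` from a timer.

```typescript
// Hypothetical sketch of the server-side ACW timer: if no disposition arrives
// within 30 s of entering ACW, the agent is forced back to ready.
const ACW_TIMEOUT_MS = 30_000;

class AcwWatchdog {
  private readonly acwSince = new Map<string, number>();
  private readonly forceReady: (agentId: string) => void;

  constructor(forceReady: (agentId: string) => void) {
    this.forceReady = forceReady;
  }

  enterAcw(agentId: string, now: number): void {
    this.acwSince.set(agentId, now);
  }

  // Disposition arrived normally (or via sendBeacon): cancel the timer.
  complete(agentId: string): void {
    this.acwSince.delete(agentId);
  }

  // Fires forceReady once per agent whose ACW has lasted 30 s or more.
  tick(now: number): void {
    this.acwSince.forEach((since, agentId) => {
      if (now - since >= ACW_TIMEOUT_MS) {
        this.acwSince.delete(agentId); // safe during Map.forEach
        this.forceReady(agentId);
      }
    });
  }
}

const forced: string[] = [];
const watchdog = new AcwWatchdog((id) => forced.push(id));
watchdog.enterAcw("ramaiahadmin", 0);
watchdog.tick(29_999); // still wrapping up
watchdog.tick(30_000); // timer fires for ramaiahadmin
console.log(forced);
```
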
### Container restart loop
```bash
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker logs ramaiah-prod-sidecar-ramaiah-1 --tail 50 2>&1" | grep -iE "error|fail|crash"
```
Common causes:
- `Cannot find module` → the image lacks new dependencies; rebuild and push it to ECR, then pull and restart
- `UndefinedModuleException` → circular dependency between modules in the code
- `ECONNREFUSED` to Redis → Redis container is down; run `sudo docker compose up -d redis-ramaiah` in `/opt/fortytwo`
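
The list above can be encoded as a small triage helper for grepped log output. The helper is hypothetical. The Redis pattern assumes Node's usual `connect ECONNREFUSED <ip>:6379` message with the default Redis port.

```typescript
// Hypothetical triage helper for the crash-loop signatures listed above.
// The Redis pattern assumes the error names the default port 6379.
interface Cause {
  pattern: RegExp;
  fix: string;
}

const CAUSES: readonly Cause[] = [
  { pattern: /Cannot find module/, fix: "Rebuild the sidecar image, push it to ECR, then pull and restart" },
  { pattern: /UndefinedModuleException/, fix: "Look for a circular dependency between modules" },
  { pattern: /ECONNREFUSED.*:6379/, fix: "Start Redis: sudo docker compose up -d redis-ramaiah (in /opt/fortytwo)" },
];

// Returns every matching fix, in list order, for a log dump.
function diagnose(log: string): string[] {
  return CAUSES.filter((cause) => cause.pattern.test(log)).map((cause) => cause.fix);
}

console.log(diagnose("Error: connect ECONNREFUSED 172.18.0.5:6379"));
```
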
### Theme/branding reset after sidecar restart
Config is in Redis. If flushed, re-apply:
```bash
curl -X PUT https://ramaiah.engage.healix360.net/api/config/theme \
-H "Content-Type: application/json" \
-d '{"defaults": {"brandName": "Helix Engage", "hospitalName": "Ramaiah Hospitals"}}'
```
### Telephony events not routing
Check dispatcher logs and verify sidecar registration:
```bash
# Dispatcher logs
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker logs ramaiah-prod-telephony-1 --tail 30 2>&1"
# Check service discovery registry
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 \
"docker exec ramaiah-prod-redis-telephony-1 redis-cli KEYS '*'"
```
### Full DB Reset (nuclear — destroys all data)
Only when field metadata is missing (0 rows in `core.fieldMetadata`):
```bash
ssh -i /tmp/ramaiah-ec2-key ubuntu@13.234.31.194 << 'EOF'
cd /opt/fortytwo
sudo docker compose stop server worker
sudo docker exec ramaiah-prod-db-1 psql -U fortytwo -d fortytwo_eap -c "DELETE FROM core.workspace;"
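# Find and drop orphaned workspace schemas (\_ makes LIKE match the underscore literally)
for s in $(sudo docker exec ramaiah-prod-db-1 psql -U fortytwo -d fortytwo_eap -Atc "SELECT schema_name FROM information_schema.schemata WHERE schema_name LIKE 'workspace\_%';"); do
  sudo docker exec ramaiah-prod-db-1 psql -U fortytwo -d fortytwo_eap -c "DROP SCHEMA \"$s\" CASCADE;"
done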
sudo docker exec ramaiah-prod-redis-1 redis-cli FLUSHALL
sudo docker compose up -d server worker
EOF
```