AWS Architecture
All backend infrastructure is managed by Terraform. Production runs in ap-south-1 (Mumbai). Staging runs in ap-southeast-1 (Singapore). Frontend SPAs are on Cloudflare Pages and are not in Terraform scope.
Environments
| | Production | Staging |
|---|---|---|
| Region | ap-south-1 (Mumbai) | ap-southeast-1 (Singapore) |
| Terraform root | terraform/ (root) | terraform/environments/staging/ |
| State file | s3://zelly-terraform-state/production/terraform.tfstate | s3://zelly-terraform-state/staging/terraform.tfstate |
| VPC CIDR | 10.0.0.0/16 | 10.1.0.0/16 |
| Aurora | Private subnet only (VPN required) | Publicly accessible (no VPN needed) |
| Aurora max ACU | 8 | 2 |
| Redis | cache.t4g.small | cache.t4g.micro |
| ClickHouse | t3.medium / 50 GB | t3.small / 20 GB |
| ECS max tasks | fastify-nova: 4, storefront: 4 | fastify-nova: 2, storefront: 2 |
| deletion_protection | true | false |
| ECR repos | Creates them | Reads existing (data source) |
| Est. monthly cost | ~$460/mo | ~$320/mo |
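Because the two environments live in different Terraform roots and regions, a small wrapper can reduce the chance of running a plan in the wrong directory. The following is a minimal sketch that only encodes the table above. The function names (`tf_root_for`, `tf_region_for`, `tf_env`) are hypothetical and not part of the repository, and the paths assume you run it from the repository root.

```shell
# Map an environment name to its Terraform root (from the Environments table).
tf_root_for() {
  case "$1" in
    production) echo "terraform" ;;
    staging)    echo "terraform/environments/staging" ;;
    *)          echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Map an environment name to its AWS region (from the Environments table).
tf_region_for() {
  case "$1" in
    production) echo "ap-south-1" ;;
    staging)    echo "ap-southeast-1" ;;
    *)          echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: run a Terraform subcommand in the matching root.
# Example: tf_env staging plan
tf_env() {
  root="$(tf_root_for "$1")" || return 1
  shift
  # -chdir is a global Terraform option and must precede the subcommand.
  terraform -chdir="$root" "$@"
}
```

With these defined, `tf_env staging plan` runs the plan inside terraform/environments/staging, and an unrecognized environment name fails before Terraform is invoked.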
Network Layout
Each environment has a VPC with 3 public and 3 private subnets across availability zones (a, b, c), one NAT Gateway in public subnet 1a, and one Internet Gateway.
ECS Services
| Service | vCPU | Memory | Min→Max tasks | Scaling | Ingress |
|---|---|---|---|---|---|
| fastify-nova | 1 | 2048 MB | 1→4 (prod) / 1→2 (stg) | CPU 70% | ALB HTTPS |
| customer-panel | 0.5 | 1024 MB | 1→2 | CPU 70% | ALB HTTPS |
| orion-backend | 0.5 | 1024 MB | 1→2 | CPU 70% | ALB HTTPS |
| events-consumer | 0.25 | 512 MB | 1→1 | — | None (queue worker) |
| storefront (Astro+Caddy) | 1 | 2048 MB | 1→4 (prod) / 1→2 (stg) | CPU 70% | NLB TCP passthrough |
| internal_tools | 0.25 | 512 MB | 1→1 | — | None (VPN only) |
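To compare the Min→Max values in this table with what is actually registered in AWS, the Application Auto Scaling API can be queried per service. This is a sketch under assumptions: the ECS cluster name is not given in this document, so it is passed in as a parameter, and the function name `ecs_scaling_limits` is hypothetical.

```shell
# Print the registered minimum and maximum task counts for one ECS service.
# $1 = ECS cluster name (assumption: not stated in this document)
# $2 = service name, e.g. fastify-nova
ecs_scaling_limits() {
  aws application-autoscaling describe-scalable-targets \
    --service-namespace ecs \
    --resource-ids "service/$1/$2" \
    --query 'ScalableTargets[0].[MinCapacity,MaxCapacity]' \
    --output text
}
```

Services listed without a scaling policy (events-consumer, internal_tools) may have no scalable target registered at all, in which case the query prints `None`.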
Load Balancers
| LB | Type | Routes to | Notes |
|---|---|---|---|
| alb-nova | ALB | fastify-nova:3000 | HTTP→HTTPS redirect, ACM cert |
| alb-customer | ALB | customer-panel:5174 | HTTP→HTTPS redirect, ACM cert |
| alb-orion | ALB | orion-backend:3022 | HTTP→HTTPS redirect, ACM cert. Public: the React SPA calls it from the browser. |
| nlb-storefront | NLB | Caddy:80 and :443 | TCP passthrough, Elastic IP. Preserves client IP for ACME challenges. |
Databases
Aurora MySQL (RDS)
Aurora MySQL 8.0 Serverless v2. Capacity scales automatically between the configured minimum and maximum ACU as query load arrives, and drops to near-zero ACU when idle.
- Master credentials stored in Secrets Manager: zelly/aurora/master
- Deletion protection enabled in production
- Schemas: astro_primary, ecom_store_front, backoffice (created by app migrations, not Terraform)
- Staging: publicly accessible (connect directly without VPN)
- Production: private subnet only (requires WireGuard VPN via bastion)
ElastiCache Redis
Redis 7.x single-node cluster in private subnet. Used for BullMQ job queues and application caching (tenant tokens, sessions).
- Auth token in Secrets Manager: zelly/redis/auth
- Queue names: store-events, SHOPIFY_WEBHOOK, SHOPIFY_SETTINGS_PUSH, SHOPIFY_SETTINGS_FETCH, SHOPIFY_CATALOG_SYNC
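For a quick look at queue backlog without opening the BullMQ dashboard, the waiting list of a queue can be read straight from Redis. This is a rough sketch that assumes BullMQ's default key prefix (`bull`) and that non-prioritized waiting jobs sit in a list key ending in `:wait`. If the application sets a custom prefix, the key names will differ. The function names are hypothetical.

```shell
# Build the Redis key for one state of a BullMQ queue.
# Assumes the default BullMQ key prefix "bull".
# $1 = queue name, $2 = state (e.g. wait, active)
bullmq_key() {
  printf 'bull:%s:%s\n' "$1" "$2"
}

# Print the number of waiting jobs in a queue.
# $1 = ElastiCache endpoint, $2 = queue name
# Reads the auth token from REDISCLI_AUTH so it stays out of the process list.
# Add --tls if in-transit encryption is enabled on the cluster.
bullmq_waiting() {
  redis-cli -h "$1" LLEN "$(bullmq_key "$2" wait)"
}
```

For example, after connecting to the VPN and exporting `REDISCLI_AUTH`, `bullmq_waiting <elasticache_endpoint> store-events` would print the length of the waiting list for the store-events queue.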
ClickHouse (EC2)
ClickHouse 24.3 running in Docker on a private EC2 instance. Used exclusively for analytics — events-consumer writes, orion-backend reads.
- HTTP API on port 8123, native protocol on port 9000
- Init SQL from store-events-consumer/docker/clickhouse/init/
- EBS volume for persistence (gp3)
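Ad-hoc queries against the HTTP API on port 8123 are easier when curl handles the URL encoding. A minimal sketch, assuming the instance accepts unauthenticated requests from inside the VPN (add `--user` if credentials are required). The function name `ch_query` is hypothetical.

```shell
# Run one read-only query against the ClickHouse HTTP interface.
# $1 = ClickHouse private IP, $2 = SQL text
ch_query() {
  # -G sends the data as a URL query string; --data-urlencode escapes the SQL.
  curl -sS --fail -G "http://$1:8123/" --data-urlencode "query=$2"
}
```

For example, `ch_query <clickhouse_private_ip> 'SELECT 1'` is the encoded equivalent of requesting `/?query=SELECT+1` by hand.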
Bastion & WireGuard VPN
A t3.micro EC2 bastion in public subnet 1a provides access to private resources.
- SSH key pair required (provide public key in terraform.tfvars)
- WireGuard VPN server on UDP port 51820; VPN subnet: 10.10.0.0/24
- Once connected to VPN, all private resources are reachable by private IP; no per-resource SSH tunnels needed
- Client config generated as Terraform output (one peer config per developer)
```shell
# After terraform apply, get your client config:
terraform output wireguard_client_config_peer1
# Save to /etc/wireguard/wg0.conf, then:
sudo wg-quick up wg0

# Now private resources are reachable:
mysql -h <aurora_endpoint> -u root -p
redis-cli -h <elasticache_endpoint> -a <token>
# Quote the URL so the shell does not treat "?" as a glob character
curl 'http://<clickhouse_private_ip>:8123/?query=SELECT+1'
```
Secrets Management
All sensitive values live in AWS Secrets Manager. The ECS task execution role has secretsmanager:GetSecretValue on zelly/*.
Do not put application secrets in terraform.tfvars or in the ECS task definition JSON environment block. All sensitive values (DB_PASSWORD, JWT_SECRET, API keys, etc.) must live in Secrets Manager only.
| Secret path | Contents |
|---|---|
| zelly/aurora/master | DB_HOST, DB_USER, DB_PASSWORD |
| zelly/redis/auth | REDIS_HOST, REDIS_PORT, REDIS_PASSWORD |
| zelly/fastify-nova/env | JWT_SECRET, RAZORPAY keys, SLACK_WEBHOOK_URL, Firebase config, CLICKHOUSE creds |
| zelly/customer-panel/env | JWT_SECRET, COOKIE_SECRET, SSO config |
| zelly/orion-backend/env | JWT_SECRET_KEY, CORS_ORIGINS |
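When a value is needed on a workstation (for example DB_HOST before connecting over the VPN), it can be read from Secrets Manager instead of being copied into a local file. This sketch assumes each secret is stored as a flat JSON object whose keys match the Contents column, and that `jq` is installed. The function name `secret_field` is hypothetical.

```shell
# Print one field from a JSON-formatted secret.
# $1 = secret path, e.g. zelly/aurora/master
# $2 = key inside the secret, e.g. DB_HOST
# Assumes the secret value is a flat JSON object and that jq is available.
secret_field() {
  aws secretsmanager get-secret-value \
    --secret-id "$1" \
    --query SecretString \
    --output text |
    jq -r --arg key "$2" '.[$key]'
}
```

For example, `mysql -h "$(secret_field zelly/aurora/master DB_HOST)" -u root -p` avoids pasting the endpoint by hand. The caller needs its own IAM permission for secretsmanager:GetSecretValue; the ECS task execution role does not apply to workstation sessions.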
Terraform Execution
Prerequisites
Bootstrap the S3 state backend once (run manually before the first terraform init):
```shell
aws s3api create-bucket \
  --bucket zelly-terraform-state \
  --region ap-south-1 \
  --create-bucket-configuration LocationConstraint=ap-south-1

aws s3api put-bucket-versioning \
  --bucket zelly-terraform-state \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name zelly-terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-south-1
```
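Before the first terraform init, it can help to confirm that the bootstrap actually succeeded. The following is a sketch of such a pre-flight check using the bucket and table names from the commands above. The function name `check_tf_backend` is hypothetical.

```shell
# Succeed only if the state bucket has versioning enabled
# and the lock table exists and is ACTIVE.
check_tf_backend() {
  versioning="$(aws s3api get-bucket-versioning \
    --bucket zelly-terraform-state \
    --query Status --output text)" || return 1
  table="$(aws dynamodb describe-table \
    --table-name zelly-terraform-locks \
    --region ap-south-1 \
    --query Table.TableStatus --output text)" || return 1
  [ "$versioning" = "Enabled" ] && [ "$table" = "ACTIVE" ]
}
```

A typical use would be `check_tf_backend && terraform init`, so that init is skipped if either resource is missing or not ready.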
Apply staging
```shell
cd terraform/environments/staging
terraform init

# Create terraform.tfvars from the example and fill in real values
cp ../../terraform.tfvars.example terraform.tfvars
# Edit: bastion_ssh_public_key, cloudflare_api_token, domain names, alert email

terraform plan
terraform apply
```
Apply production
```shell
cd terraform  # root directory, NOT environments/production/
terraform init

cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars

terraform plan
terraform apply
```
terraform.tfvars is gitignored. Never commit it: it contains SSH keys and Cloudflare API tokens.
EFS (Caddy Certificate Storage)
EFS is mounted to the Caddy container in the storefront ECS task at /data. Caddy uses this to persist ACME certificates across task restarts and replacements. Without EFS, every new task would request fresh certificates and could quickly hit Let's Encrypt rate limits.
Observability
| Tool | Access | Notes |
|---|---|---|
| CloudWatch Logs | AWS Console | Log group: /zelly/ecs/<service-name>, 30-day retention |
| CloudWatch Dashboard | AWS Console | ECS CPU/Memory, ALB request count + latency + 5xx, Redis hits |
| BullMQ Dashboard | WireGuard VPN → <internal_tools_ip>:3000 | Queue depth, job states, retries |
| RedisInsight | WireGuard VPN → <internal_tools_ip>:5540 | Key browser, memory, slow log |
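For quick checks from a terminal, CloudWatch Logs can also be streamed with the AWS CLI instead of the console. This sketch assumes AWS CLI v2 (which provides `aws logs tail`) and the log group naming shown in the table. The function name `tail_service_logs` is hypothetical.

```shell
# Stream recent logs for one ECS service.
# $1 = service name, e.g. fastify-nova
# $2 = look-back window such as 15m or 2h (defaults to 1h)
tail_service_logs() {
  aws logs tail "/zelly/ecs/$1" --since "${2:-1h}" --follow
}
```

For example, `tail_service_logs events-consumer 15m` follows the events-consumer log group starting from the last 15 minutes. Pass `--region` or set `AWS_REGION` to choose between production and staging.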