Infrastructure

AWS Architecture

All backend infrastructure is managed by Terraform. Production runs in ap-south-1 (Mumbai). Staging runs in ap-southeast-1 (Singapore). Frontend SPAs are on Cloudflare Pages and are not in Terraform scope.

Environments

|  | Production | Staging |
|---|---|---|
| Region | ap-south-1 (Mumbai) | ap-southeast-1 (Singapore) |
| Terraform root | terraform/ (root) | terraform/environments/staging/ |
| State file | s3://zelly-terraform-state/production/terraform.tfstate | s3://zelly-terraform-state/staging/terraform.tfstate |
| VPC CIDR | 10.0.0.0/16 | 10.1.0.0/16 |
| Aurora | Private subnet only (VPN required) | Publicly accessible (no VPN needed) |
| Aurora max ACU | 8 | 2 |
| Redis | cache.t4g.small | cache.t4g.micro |
| ClickHouse | t3.medium / 50 GB | t3.small / 20 GB |
| ECS max tasks | fastify-nova: 4, storefront: 4 | fastify-nova: 2, storefront: 2 |
| deletion_protection | true | false |
| ECR repos | Creates them | Reads existing (data source) |
| Est. monthly cost | ~$460/mo | ~$320/mo |

Network Layout

Each environment has a VPC with 3 public and 3 private subnets across availability zones (a, b, c), one NAT Gateway in public subnet 1a, and one Internet Gateway.

VPC (10.0.0.0/16)
├── Public subnets:  10.0.1.0/24  10.0.2.0/24  10.0.3.0/24
│   └── Bastion EC2, NAT Gateway, ALB/NLB
└── Private subnets: 10.0.10.0/24 10.0.11.0/24 10.0.12.0/24
    └── ECS tasks, Aurora, ElastiCache, ClickHouse EC2
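
Note: All ECS tasks, Redis, and ClickHouse run in private subnets, as does Aurora in production (the Environments table lists staging Aurora as publicly accessible). They can reach the internet through the NAT Gateway but are not directly reachable from outside. Access is via the ALBs and NLB (ECS services) or the WireGuard VPN (databases).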
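
The addressing scheme in the diagram can be written as a small shell function. This is only a sketch: `subnet_plan` is a hypothetical helper, the pairing of each CIDR with AZ a, b, c in order is an assumption, and the diagram shows the 10.0.0.0/16 VPC only, so applying the same pattern to the staging range 10.1.0.0/16 is also an assumption.

```shell
# Hypothetical helper: print the subnet plan for one VPC.
# Usage: subnet_plan <first-two-octets>, for example: subnet_plan 10.0
# The body runs in a subshell ( ... ) so its variables stay local.
subnet_plan() (
  prefix="$1"
  i=0
  for az in a b c; do
    # Public subnets use third octets 1, 2, 3.
    echo "public ${az} ${prefix}.$((i + 1)).0/24"
    i=$((i + 1))
  done
  i=0
  for az in a b c; do
    # Private subnets use third octets 10, 11, 12.
    echo "private ${az} ${prefix}.$((i + 10)).0/24"
    i=$((i + 1))
  done
)
```

Running `subnet_plan 10.0` lists the six production subnets from the diagram, one per line.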

ECS Services

| Service | vCPU | Memory | Min→Max tasks | Scaling | Ingress |
|---|---|---|---|---|---|
| fastify-nova | 1 | 2048 MB | 1→4 (prod) / 1→2 (stg) | CPU 70% | ALB HTTPS |
| customer-panel | 0.5 | 1024 MB | 1→2 | CPU 70% | ALB HTTPS |
| orion-backend | 0.5 | 1024 MB | 1→2 | CPU 70% | ALB HTTPS |
| events-consumer | 0.25 | 512 MB | 1→1 |  | None (queue worker) |
| storefront (Astro+Caddy) | 1 | 2048 MB | 1→4 (prod) / 1→2 (stg) | CPU 70% | NLB TCP passthrough |
| internal_tools | 0.25 | 512 MB | 1→1 |  | None (VPN only) |
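
ECS task definitions express CPU in units, where 1024 units equal 1 vCPU, so the vCPU column above maps to task-definition values as sketched below. `vcpu_to_units` is a hypothetical helper name, not something taken from the repository.

```shell
# Hypothetical helper: convert a vCPU value from the table into ECS CPU units.
# ECS counts 1024 CPU units per vCPU.
vcpu_to_units() {
  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1024 }'
}
```

For example, `vcpu_to_units 0.25` gives the value for events-consumer and internal_tools, and `vcpu_to_units 1` the value for fastify-nova and storefront.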

Load Balancers

| LB | Type | Routes to | Notes |
|---|---|---|---|
| alb-nova | ALB | fastify-nova:3000 | HTTP→HTTPS redirect, ACM cert |
| alb-customer | ALB | customer-panel:5174 | HTTP→HTTPS redirect, ACM cert |
| alb-orion | ALB | orion-backend:3022 | HTTP→HTTPS redirect, ACM cert. Public: the React SPA calls it from the browser. |
| nlb-storefront | NLB | Caddy:80 and :443 | TCP passthrough, Elastic IP. Preserves client IP for ACME challenges. |

Databases

Aurora MySQL (RDS)

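Aurora MySQL 8.0 Serverless v2. Capacity scales automatically between the configured minimum and maximum ACU as query load changes, and drops back to the near-zero minimum when idle. The maximum is 8 ACU in production and 2 ACU in staging.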

ElastiCache Redis

Redis 7.x single-node cluster in a private subnet. Used for BullMQ job queues and application caching (tenant tokens, sessions).

ClickHouse (EC2)

ClickHouse 24.3 running in Docker on a private EC2 instance. Used exclusively for analytics: events-consumer writes to it and orion-backend reads from it.

Bastion & WireGuard VPN

A t3.micro EC2 bastion in public subnet 1a provides access to private resources.

bash — connect WireGuard
# After terraform apply, get your client config:
# -raw prints the plain string, without the quoting or heredoc wrapper
terraform output -raw wireguard_client_config_peer1

# Save to /etc/wireguard/wg0.conf, then:
sudo wg-quick up wg0

# Now private resources are reachable:
mysql -h <aurora_endpoint> -u root -p
redis-cli -h <elasticache_endpoint> -a <token>
# Quote the URL so the shell does not treat "?" as a glob character
curl "http://<clickhouse_private_ip>:8123/?query=SELECT+1"
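
Before opening a client, a quick TCP probe can confirm that the tunnel actually reaches each service. The sketch below is an assumption-laden helper, not part of the repository: `vpn_targets` and `check_vpn` are hypothetical names, the hosts come from environment variables you set yourself, ports 3306 and 6379 are the MySQL and Redis defaults (assumed to be unchanged), and it relies on a netcat build that supports `-z` and `-w`.

```shell
# Hypothetical helper: list the endpoints to probe as "name host port".
# 8123 is the ClickHouse HTTP port used in the commands above;
# 3306 and 6379 are assumed defaults for MySQL and Redis.
vpn_targets() {
  echo "aurora ${AURORA_HOST:-unset} 3306"
  echo "redis ${REDIS_HOST:-unset} 6379"
  echo "clickhouse ${CLICKHOUSE_HOST:-unset} 8123"
}

# Try a TCP connect to each target with a 3-second timeout and report the result.
check_vpn() {
  vpn_targets | while read -r name host port; do
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
      echo "ok   $name $host:$port"
    else
      echo "FAIL $name $host:$port"
    fi
  done
}
```

With `AURORA_HOST`, `REDIS_HOST`, and `CLICKHOUSE_HOST` exported and `wg0` up, running `check_vpn` prints one line per service.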

Secrets Management

All sensitive values live in AWS Secrets Manager. The ECS task execution role has secretsmanager:GetSecretValue on zelly/*.

NEVER put secrets in terraform.tfvars or in the ECS task definition JSON environment block. All sensitive values (DB_PASSWORD, JWT_SECRET, API keys, etc.) must live in Secrets Manager only.

| Secret path | Contents |
|---|---|
| zelly/aurora/master | DB_HOST, DB_USER, DB_PASSWORD |
| zelly/redis/auth | REDIS_HOST, REDIS_PORT, REDIS_PASSWORD |
| zelly/fastify-nova/env | JWT_SECRET, RAZORPAY keys, SLACK_WEBHOOK_URL, Firebase config, CLICKHOUSE creds |
| zelly/customer-panel/env | JWT_SECRET, COOKIE_SECRET, SSO config |
| zelly/orion-backend/env | JWT_SECRET_KEY, CORS_ORIGINS |
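
For ad-hoc inspection, a value can be read with the AWS CLI rather than copied into a file. The following is a minimal sketch under stated assumptions: `secret_path` and `secret_value` are hypothetical helpers, each secret is assumed to be stored as a JSON object keyed by the names in the table, and the AWS CLI (with a default region and credentials that can read `zelly/*`) and `jq` are assumed to be installed.

```shell
# Hypothetical helper: map a short name to its Secrets Manager path from the table.
secret_path() {
  case "$1" in
    aurora) echo "zelly/aurora/master" ;;
    redis)  echo "zelly/redis/auth" ;;
    fastify-nova|customer-panel|orion-backend) echo "zelly/$1/env" ;;
    *) echo "unknown secret name: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print one key of a secret.
# Assumes SecretString holds a JSON object; uses the CLI's default region.
secret_value() {
  secret_id="$(secret_path "$1")" || return 1
  aws secretsmanager get-secret-value \
    --secret-id "$secret_id" \
    --query SecretString \
    --output text | jq -r --arg key "$2" '.[$key]'
}
```

Usage would look like `secret_value aurora DB_HOST`. Avoid redirecting the output into files that could end up in version control.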

Terraform Execution

Prerequisites

Bootstrap the S3 state backend once (run manually before the first terraform init):

bash — bootstrap state backend (once ever)
aws s3api create-bucket \
  --bucket zelly-terraform-state \
  --region ap-south-1 \
  --create-bucket-configuration LocationConstraint=ap-south-1

aws s3api put-bucket-versioning \
  --bucket zelly-terraform-state \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name zelly-terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-south-1
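
The bucket and lock table created above are what the S3 backend settings of each Terraform root would point at. The repository's actual backend block is not shown in this document, so the sketch below is an assumption: `backend_config` is a hypothetical helper that prints a partial backend configuration using only the bucket, state keys, region, and table name that appear in this page.

```shell
# Hypothetical helper: print partial S3 backend settings for one environment.
# Usage: backend_config production|staging
backend_config() {
  case "$1" in
    production|staging) ;;
    *) echo "usage: backend_config production|staging" >&2; return 1 ;;
  esac
  # The state bucket and lock table both live in ap-south-1 (see the bootstrap commands).
  cat <<EOF
bucket         = "zelly-terraform-state"
key            = "$1/terraform.tfstate"
region         = "ap-south-1"
dynamodb_table = "zelly-terraform-locks"
EOF
}
```

If a root declared an empty `backend "s3" {}` block (again, an assumption), the output could be saved with `backend_config staging > backend.hcl` and passed to `terraform init -backend-config=backend.hcl`.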

Apply staging

bash
cd terraform/environments/staging

terraform init

# Create terraform.tfvars from the example and fill in real values
cp ../../terraform.tfvars.example terraform.tfvars
# Edit: bastion_ssh_public_key, cloudflare_api_token, domain names, alert email

terraform plan
terraform apply

Apply production

bash
cd terraform    # root directory — NOT environments/production/

terraform init

cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars

terraform plan
terraform apply

Warning: terraform.tfvars is gitignored. Never commit it. It contains SSH keys and Cloudflare API tokens.
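
The .gitignore entry can be backed by a pre-commit check that refuses any staged file named terraform.tfvars. This is a sketch, not an existing hook in the repository: `reject_tfvars` is a hypothetical function that reads file paths from standard input, one per line.

```shell
# Hypothetical helper: fail if any path on stdin is a terraform.tfvars file.
# terraform.tfvars.example is allowed because only the exact basename matches.
reject_tfvars() {
  while IFS= read -r path; do
    case "${path##*/}" in
      terraform.tfvars)
        echo "refusing to commit: $path" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

In a `.git/hooks/pre-commit` script it could be wired up as `git diff --cached --name-only | reject_tfvars || exit 1`.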

EFS (Caddy Certificate Storage)

EFS is mounted to the Caddy container in the storefront ECS task at /data. Caddy uses this to persist ACME certificates across task restarts and replacements. Without EFS, every new task would request a fresh cert and hit Let's Encrypt rate limits.

Observability

| Tool | Access | Notes |
|---|---|---|
| CloudWatch Logs | AWS Console | Log group: /zelly/ecs/<service-name>, 30-day retention |
| CloudWatch Dashboard | AWS Console | ECS CPU/Memory, ALB request count + latency + 5xx, Redis hits |
| BullMQ Dashboard | WireGuard VPN → <internal_tools_ip>:3000 | Queue depth, job states, retries |
| RedisInsight | WireGuard VPN → <internal_tools_ip>:5540 | Key browser, memory, slow log |
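
Logs can also be followed from a terminal instead of the console. The sketch below assumes AWS CLI v2 (which provides `aws logs tail`) and a CLI region set to the environment you are inspecting. `log_group` and `tail_logs` are hypothetical helper names.

```shell
# Hypothetical helper: build the CloudWatch log group name for a service,
# following the /zelly/ecs/<service-name> pattern from the table.
log_group() {
  echo "/zelly/ecs/$1"
}

# Hypothetical helper: follow a service's logs, starting one hour back by default.
# Usage: tail_logs <service-name> [since], for example: tail_logs fastify-nova 30m
tail_logs() {
  aws logs tail "$(log_group "$1")" --since "${2:-1h}" --follow
}
```

For example, `tail_logs events-consumer` streams the queue worker's recent log lines until interrupted.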