Infrastructure

First-Time Deployment

This guide walks through bringing up a fresh environment from zero — no existing AWS infrastructure, no Docker images, nothing. Follow the phases in order. Staging and production share ECR repositories (both live in the production AWS account in ap-south-1), so ECR is created once and reused.

Time estimate: ~45 minutes for staging, ~60 minutes for production. Most of that is waiting for AWS to provision Aurora, ACM cert validation, and ECS tasks to stabilise.

Read the Gotchas section before you start. Several of the errors below will eat 30+ minutes if you hit them cold.

Phase 0 — Prerequisites

Tools

Tool	Minimum version	Install
Terraform	1.6	`brew install terraform` or terraform.io/install
AWS CLI	v2	`brew install awscli`
Docker	24+	Docker Desktop
WireGuard	any	`brew install wireguard-tools` — production access only
MySQL client	8.0	`brew install mysql-client`

AWS credentials

bash

# Verify you are authenticated to the correct account
aws sts get-caller-identity
# Should return the Zelly production AWS account ID

Use a profile with AdministratorAccess (or a scoped policy covering EC2, ECS, RDS, ElastiCache, Secrets Manager, IAM, ECR). Set AWS_PROFILE=zelly if needed. Both staging and production Terraform runs use the same account — staging is just a separate environment in ap-southeast-1.

One-time: Terraform state backend

Terraform state is stored in S3 with a DynamoDB lock table. These must exist before terraform init. Run once, ever.

bash — run once, ever

# Create the state bucket (versioning + encryption)
aws s3api create-bucket \
  --bucket zelly-terraform-state \
  --region ap-south-1 \
  --create-bucket-configuration LocationConstraint=ap-south-1

aws s3api put-bucket-versioning \
  --bucket zelly-terraform-state \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-encryption \
  --bucket zelly-terraform-state \
  --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Create the DynamoDB lock table
aws dynamodb create-table \
  --table-name zelly-terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-south-1

Phase 1 — ECR Repositories

ECR repositories live in ap-south-1 in the production account and are shared between staging and production. Create them first — ECS cannot start tasks without images, and images cannot be pushed without repos.

bash

cd d:/zelly/terraform

# Copy and fill in your tfvars (see terraform.tfvars.example)
cp terraform.tfvars.example terraform.tfvars

terraform init

# Apply only ECR — takes ~30 seconds
terraform apply -target module.ecr

# Verify repos were created
aws ecr describe-repositories \
  --region ap-south-1 \
  --query 'repositories[*].repositoryName' \
  --output table

You should see 8 repos: zelly/fastify-nova, zelly/customer-panel, zelly/orion-backend, zelly/events-consumer, zelly/astro-storefront, zelly/caddy, zelly/bull-studio, zelly/redis-insight.
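The list can also be checked mechanically rather than by eye. The following is a minimal sketch, assuming the eight names above are exact; `missing_repos` is a hypothetical helper, not something that exists in the repo. It reads repository names on stdin, one per line, and prints every expected name that is absent.

```bash
# Hypothetical helper: prints each expected repo that is NOT present on stdin.
EXPECTED_REPOS="zelly/fastify-nova
zelly/customer-panel
zelly/orion-backend
zelly/events-consumer
zelly/astro-storefront
zelly/caddy
zelly/bull-studio
zelly/redis-insight"

missing_repos() {
  local actual repo
  actual="$(cat)"   # repository names, one per line
  while IFS= read -r repo; do
    # -F fixed string, -x whole-line match, -q quiet
    printf '%s\n' "$actual" | grep -Fxq -- "$repo" || printf '%s\n' "$repo"
  done <<< "$EXPECTED_REPOS"
}

# Usage (needs AWS credentials, so it is left commented out):
# aws ecr describe-repositories --region ap-south-1 \
#   --query 'repositories[*].repositoryName' --output text \
#   | tr '\t' '\n' | missing_repos
```

Empty output means all eight repositories exist.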

Phase 2 — Staging

Step 1 — Fill in terraform.tfvars

bash

cd d:/zelly/terraform/environments/staging
cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars and fill in:

bastion_ssh_public_key = "ssh-rsa AAAA..."   # your public key: cat ~/.ssh/id_rsa.pub
bastion_allowed_cidrs  = ["YOUR_IP/32"]       # curl ifconfig.me

wireguard_peers = [{
  name        = "your-laptop"
  public_key  = "..."                          # wg genkey | tee privkey | wg pubkey
  allowed_ips = "10.10.0.2/32"
}]

cloudflare_api_token = "..."                   # dash.cloudflare.com → API Tokens → DNS:Edit
cloudflare_zone_id   = "..."                   # Cloudflare dashboard → zone overview, right column

nova_domain     = "api.staging.zelly.in"
customer_domain = "customer.staging.zelly.in"
orion_domain    = "backoffice-api.staging.zelly.in"

alert_email = "dev@zelly.in"

subdomain_base defaults to zelly.in — do not override it.

Step 2 — Terraform init & plan

bash

terraform init
terraform plan -out=staging.tfplan

Expected: ~85 resources to create. Review and confirm nothing looks wrong before applying.

Step 3 — Apply

bash — takes ~15 minutes

terraform apply staging.tfplan

Terraform will block on aws_acm_certificate_validation until ACM confirms the Cloudflare DNS CNAMEs — usually 2–5 minutes after the records are created automatically. If it times out, see Gotchas → ACM validation stuck.

Step 4 — Save outputs

bash

terraform output
# Key values you'll need:
#   aurora_endpoint      — DB host for migrations
#   bastion_public_ip    — SSH / WireGuard entry point
#   clickhouse_private_ip — needed for schema apply
#   nova_alb_dns         — verify after DNS propagation

Step 5 — Populate Secrets Manager

Infrastructure secrets (Aurora password, Redis auth token, ClickHouse password) are auto-generated and stored by Terraform. You only need to populate the application-level secrets that Terraform cannot know.

Secrets must exist in Secrets Manager before you deploy the ECS task. If they are missing, the task will fail to start with a CannotPullSecrets or ResourceInitializationError. See Gotchas → Secret missing at task start.

bash — staging, ap-southeast-1

REGION=ap-southeast-1

# fastify-nova app secrets
aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/fastify-nova/env" \
  --secret-string '{
    "JWT_SECRET":               "CHANGE_ME_strong_random_string",
    "COOKIE_SECRET":            "CHANGE_ME_strong_random_string",
    "RAZORPAY_KEY_ID":          "rzp_test_...",
    "RAZORPAY_KEY_SECRET":      "...",
    "RAZORPAY_WEBHOOK_SECRET":  "...",
    "SLACK_WEBHOOK_URL":        "https://hooks.slack.com/services/...",
    "ID_ENCRYPTION_KEY":        "CHANGE_ME_32_hex_chars",
    "LOG_URL_SECRET":           "CHANGE_ME_64_hex_chars",
    "MARKETING_SECRET_KEY":     "CHANGE_ME_64_hex_chars",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'

# customer-panel (neptune) app secrets
aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/customer-panel/env" \
  --secret-string '{
    "JWT_SECRET":               "CHANGE_ME_strong_random_string",
    "COOKIE_SECRET":            "CHANGE_ME_strong_random_string",
    "RAZORPAY_KEY_ID":          "rzp_test_...",
    "RAZORPAY_KEY_SECRET":      "...",
    "RAZORPAY_WEBHOOK_SECRET":  "...",
    "SLACK_WEBHOOK_URL":        "https://hooks.slack.com/services/...",
    "EXTERNAL_ADDRESS_API_KEY": "...",
    "SESSION_COOKIE_DOMAIN":    "staging.zelly.in",
    "ID_ENCRYPTION_KEY":        "CHANGE_ME_32_hex_chars",
    "LOG_URL_SECRET":           "CHANGE_ME_64_hex_chars",
    "MARKETING_SECRET_KEY":     "CHANGE_ME_64_hex_chars",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'

# orion-backend app secrets
aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/orion/env" \
  --secret-string '{
    "JWT_SECRET_KEY":           "CHANGE_ME_strong_random_string",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'

Generate strong random strings with openssl rand -hex 32. Use test/sandbox Razorpay credentials in staging — never live keys. CORS_ORIGINS for orion-backend is a Terraform variable (orion_cors_origins), not a secret — set it in terraform.tfvars.
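Note that openssl rand -hex N emits N random bytes, which is 2×N hex characters, so -hex 32 yields 64 characters. The sketch below assumes the placeholder suffixes (32_hex_chars, 64_hex_chars) state the required length in characters; `hex_chars` is a hypothetical helper that takes the character count and does the halving for you.

```bash
# Hypothetical helper: print a random hex string of the requested length.
# N random bytes encode to 2*N hex characters, so we ask for chars/2 bytes.
hex_chars() {
  local chars="$1"
  local bytes=$(( chars / 2 ))
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex "$bytes"
  else
    # Fallback for machines without openssl
    head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
  fi
}

ID_ENCRYPTION_KEY="$(hex_chars 32)"   # 32 hex characters (16 bytes)
LOG_URL_SECRET="$(hex_chars 64)"      # 64 hex characters (32 bytes)

# Print only the lengths, never the values
echo "ID_ENCRYPTION_KEY: ${#ID_ENCRYPTION_KEY} chars, LOG_URL_SECRET: ${#LOG_URL_SECRET} chars"
```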

Step 6 — Run DB migrations

Staging Aurora is publicly accessible — connect directly from your machine without WireGuard.

bash

# Get the Aurora endpoint
AURORA=$(terraform output -raw aurora_endpoint)

# Get the generated master password from Secrets Manager
DB_PASS=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/aurora/staging \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")

# Create databases (one-time)
mysql -h "$AURORA" -u root -p"$DB_PASS" -e "
  CREATE DATABASE IF NOT EXISTS astro_primary;
  CREATE DATABASE IF NOT EXISTS ecom_store_front;
"

# Run migrations for each service.
# Subshells keep the current directory on the Terraform config,
# so the `terraform output` calls in later steps still work.
(cd d:/zelly/backend-api-fastify-nova && \
  DB_HOST="$AURORA" DB_USER=root DB_PASSWORD="$DB_PASS" DB_NAME=astro_primary npm run migrate)

(cd d:/zelly/customer-panel-neptune && \
  DB_HOST="$AURORA" DB_USER=root DB_PASSWORD="$DB_PASS" DB_NAME=ecom_store_front npm run migrate)

Step 7 — Apply ClickHouse schema

The ClickHouse EC2 is in a private subnet. Connect via SSH tunnel through the staging bastion.

bash

BASTION=$(terraform output -raw bastion_public_ip)
CH_IP=$(terraform output -raw clickhouse_private_ip)

CH_PASS=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/clickhouse/credentials \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")

# Open tunnel, apply schema, close tunnel
ssh -o StrictHostKeyChecking=no -L 8124:"$CH_IP":8123 ec2-user@"$BASTION" -N &
TUNNEL_PID=$!
sleep 3

cat d:/zelly/store-events-consumer/docker/clickhouse/init/clickhouse_init.sql \
  | curl -s -X POST \
    "http://127.0.0.1:8124/?user=default&password=${CH_PASS}" \
    --data-binary @-

# Verify
curl -s "http://127.0.0.1:8124/?user=default&password=${CH_PASS}&query=SELECT+1"

kill $TUNNEL_PID

Step 8 — Build & push Docker images

bash

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ECR="${ACCOUNT}.dkr.ecr.ap-south-1.amazonaws.com"
TAG="staging-initial"

# Authenticate Docker to ECR (token lasts 12h)
aws ecr get-login-password --region ap-south-1 \
  | docker login --username AWS --password-stdin "$ECR"

# Build and push each service
for SERVICE_DIR in \
  "backend-api-fastify-nova:zelly/fastify-nova" \
  "customer-panel-neptune:zelly/customer-panel" \
  "internal-admin-panel-orion/backend:zelly/orion-backend" \
  "store-events-consumer:zelly/events-consumer" \
  "storefront-astro-titan:zelly/astro-storefront"; do
  DIR="${SERVICE_DIR%%:*}"
  REPO="${SERVICE_DIR##*:}"
  IMAGE="${ECR}/${REPO}:${TAG}"
  echo "==> Building ${REPO}"
  docker build -t "$IMAGE" "d:/zelly/${DIR}"
  docker push "$IMAGE"
done
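The loop packs each source directory and its ECR repository into one "dir:repo" string and splits it with shell parameter expansion. A small illustration of just that step (`split_service_dir` is a name made up for this example):

```bash
# "%%:*" strips the longest suffix matching ":*"  -> everything before the first colon
# "##*:" strips the longest prefix matching "*:"  -> everything after the last colon
split_service_dir() {
  local pair="$1"
  local dir="${pair%%:*}"
  local repo="${pair##*:}"
  printf '%s %s\n' "$dir" "$repo"
}

split_service_dir "internal-admin-panel-orion/backend:zelly/orion-backend"
# prints: internal-admin-panel-orion/backend zelly/orion-backend
```

Slashes are untouched because only the colon acts as the separator, which is why nested paths such as internal-admin-panel-orion/backend work.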

Step 9 — Update tfvars with image tags & re-apply

terraform.tfvars (staging)

nova_image_tag       = "staging-initial"
customer_image_tag   = "staging-initial"
orion_image_tag      = "staging-initial"
consumer_image_tag   = "staging-initial"
storefront_image_tag = "staging-initial"
caddy_image_tag      = "staging-initial"

bash

terraform apply -auto-approve

Because ECS services use lifecycle { ignore_changes = [task_definition] }, Terraform registers a new task definition revision but does NOT automatically redeploy running containers. Force a fresh deployment after apply — see How-To: Force ECS redeployment.

Step 10 — Verify staging

bash

# Lists task ARNs only; list-tasks returns tasks with desired status RUNNING by default
aws ecs list-tasks \
  --cluster zelly-staging \
  --region ap-southeast-1 \
  --query 'taskArns' --output table

# API health check
curl https://api.staging.zelly.in/health

# Tail logs live
aws logs tail /zelly/ecs/fastify-nova \
  --region ap-southeast-1 \
  --follow \
  --since 5m
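Because list-tasks only returns ARNs, comparing each service's running count with its desired count gives a clearer pass/fail signal. This is a sketch under assumptions: `unstable_services` is a hypothetical helper, and the service names in the usage comment are the ones this guide uses elsewhere.

```bash
# Reads "name desired running" rows on stdin and prints every service whose
# running count is below its desired count. Returns non-zero if any are found.
unstable_services() {
  local name desired running bad=0
  while read -r name desired running; do
    [ -n "$name" ] || continue
    if [ "$running" -lt "$desired" ]; then
      printf '%s: %s/%s running\n' "$name" "$running" "$desired"
      bad=1
    fi
  done
  return "$bad"
}

# Usage (needs AWS credentials, so it is left commented out):
# aws ecs describe-services --cluster zelly-staging --region ap-southeast-1 \
#   --services fastify-nova customer-panel orion-backend events-consumer \
#   --query 'services[*].[serviceName,desiredCount,runningCount]' --output text \
#   | unstable_services && echo "All services at desired count."
```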

Phase 3 — Production

Do not apply production until staging has been running stably for at least one day. All migrations must be validated on staging first. Never copy staging secrets to production.

Step 1 — Fill in terraform.tfvars

bash

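cd d:/zelly/terraform
# terraform.tfvars already exists here if you followed Phase 1; edit it rather than overwriting it
[ -f terraform.tfvars ] || cp terraform.tfvars.example terraform.tfvars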

Use production Razorpay (live) keys, production domain names, and strong unique secrets.

Step 2 — Plan & apply

bash — takes ~20 minutes

terraform init
terraform plan -out=prod.tfplan
terraform apply prod.tfplan

Step 3 — Connect WireGuard VPN

Production Aurora and ClickHouse are in private subnets. All post-apply steps (migrations, ClickHouse schema) require WireGuard to be connected.

bash

# Get your WireGuard client config from Terraform outputs
# (-raw prints the config without the <<EOT ... EOT heredoc wrapper)
terraform output -raw wireguard_client_config_peer1
# Save the output to /etc/wireguard/wg0.conf (or paste into the WireGuard app)

sudo wg-quick up wg0

# Verify VPN is active — bastion VPN address
ping 10.10.0.1

Step 4 — Populate Secrets Manager (production)

bash — ap-south-1

REGION=ap-south-1

aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/fastify-nova/env" \
  --secret-string '{
    "JWT_SECRET":               "CHANGE_ME_production_secret",
    "COOKIE_SECRET":            "CHANGE_ME_production_secret",
    "RAZORPAY_KEY_ID":          "rzp_live_...",
    "RAZORPAY_KEY_SECRET":      "...",
    "RAZORPAY_WEBHOOK_SECRET":  "...",
    "SLACK_WEBHOOK_URL":        "https://hooks.slack.com/services/...",
    "ID_ENCRYPTION_KEY":        "CHANGE_ME_32_hex_chars",
    "LOG_URL_SECRET":           "CHANGE_ME_64_hex_chars",
    "MARKETING_SECRET_KEY":     "CHANGE_ME_64_hex_chars",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'

aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/customer-panel/env" \
  --secret-string '{
    "JWT_SECRET":               "CHANGE_ME_production_secret",
    "COOKIE_SECRET":            "CHANGE_ME_production_secret",
    "RAZORPAY_KEY_ID":          "rzp_live_...",
    "RAZORPAY_KEY_SECRET":      "...",
    "RAZORPAY_WEBHOOK_SECRET":  "...",
    "SLACK_WEBHOOK_URL":        "https://hooks.slack.com/services/...",
    "EXTERNAL_ADDRESS_API_KEY": "...",
    "SESSION_COOKIE_DOMAIN":    "zelly.in",
    "ID_ENCRYPTION_KEY":        "CHANGE_ME_32_hex_chars",
    "LOG_URL_SECRET":           "CHANGE_ME_64_hex_chars",
    "MARKETING_SECRET_KEY":     "CHANGE_ME_64_hex_chars",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'

aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/orion/env" \
  --secret-string '{
    "JWT_SECRET_KEY":           "CHANGE_ME_production_secret",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'

Step 5 — Run DB migrations

Production Aurora is private — WireGuard must be connected.

bash — requires WireGuard active

AURORA=$(terraform output -raw aurora_endpoint)

DB_PASS=$(aws secretsmanager get-secret-value \
  --region ap-south-1 \
  --secret-id zelly/aurora/production \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")

mysql -h "$AURORA" -u root -p"$DB_PASS" -e "
  CREATE DATABASE IF NOT EXISTS astro_primary;
  CREATE DATABASE IF NOT EXISTS ecom_store_front;
"

# Subshells keep the current directory on the Terraform config for later steps
(cd d:/zelly/backend-api-fastify-nova && \
  DB_HOST="$AURORA" DB_USER=root DB_PASSWORD="$DB_PASS" DB_NAME=astro_primary npm run migrate)

(cd d:/zelly/customer-panel-neptune && \
  DB_HOST="$AURORA" DB_USER=root DB_PASSWORD="$DB_PASS" DB_NAME=ecom_store_front npm run migrate)

Step 6 — Apply ClickHouse schema

bash — requires WireGuard active

CH_IP=$(terraform output -raw clickhouse_private_ip)

CH_PASS=$(aws secretsmanager get-secret-value \
  --region ap-south-1 \
  --secret-id zelly/clickhouse/credentials \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")

# Reachable directly via WireGuard — no tunnel needed
cat d:/zelly/store-events-consumer/docker/clickhouse/init/clickhouse_init.sql \
  | curl -s -X POST \
    "http://${CH_IP}:8123/?user=default&password=${CH_PASS}" \
    --data-binary @-

Step 7 — Build & push Docker images

bash

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ECR="${ACCOUNT}.dkr.ecr.ap-south-1.amazonaws.com"
TAG="prod-initial"

aws ecr get-login-password --region ap-south-1 \
  | docker login --username AWS --password-stdin "$ECR"

for SERVICE_DIR in \
  "backend-api-fastify-nova:zelly/fastify-nova" \
  "customer-panel-neptune:zelly/customer-panel" \
  "internal-admin-panel-orion/backend:zelly/orion-backend" \
  "store-events-consumer:zelly/events-consumer" \
  "storefront-astro-titan:zelly/astro-storefront"; do
  DIR="${SERVICE_DIR%%:*}"
  REPO="${SERVICE_DIR##*:}"
  IMAGE="${ECR}/${REPO}:${TAG}"
  echo "==> Building ${REPO}"
  docker build -t "$IMAGE" "d:/zelly/${DIR}"
  docker push "$IMAGE"
done

# Update tfvars with prod tag, re-apply
terraform apply -auto-approve

Step 8 — Verify production

bash

# All tasks running
aws ecs list-tasks \
  --cluster zelly-production \
  --region ap-south-1

# Health checks
curl https://api.zelly.in/health
curl https://customer.zelly.in/health

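# Events pipeline: verify ClickHouse has rows (via WireGuard)
# Reuses CH_PASS from Step 6; the default user needs the generated password
CH_IP=$(terraform output -raw clickhouse_private_ip)
curl -s "http://${CH_IP}:8123/?user=default&password=${CH_PASS}&query=SELECT+count()+FROM+analytics.store_events"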

Post-deployment checklist

Check	Staging	Production
All ECS tasks show `RUNNING`	✓	✓
ALB health checks green (all 3 target groups)	✓	✓
ACM certs show `ISSUED` in AWS Console	✓	✓
`GET /health` returns 200 on nova, customer, orion ALB domains	✓	✓
Storefront: custom domain → valid cert served by Caddy	✓	✓
BullMQ: enqueue a job → job visible in Redis → consumed by events-consumer within 3s	✓	✓
ClickHouse: `SELECT count() FROM analytics.store_events` returns without error	✓	✓
CloudWatch log groups receiving logs under `/zelly/ecs/`	✓	✓
GitHub Actions secrets: `AWS_ROLE_ARN` set in each service repo	✓	✓
Secrets Manager: all `zelly/*/env` secrets populated with real values	✓	✓

GitHub Actions — set AWS_ROLE_ARN

CI/CD uses GitHub OIDC — no long-lived AWS keys. After terraform apply, set the role ARN as a repository secret in each service repo.

bash — requires GitHub CLI

ROLE_ARN=$(cd d:/zelly/terraform && terraform output -raw github_actions_role_arn)

for REPO in \
  zelly-in/backend-api-fastify-nova \
  zelly-in/customer-panel-neptune \
  zelly-in/internal-admin-panel-orion \
  zelly-in/store-events-consumer \
  zelly-in/storefront-astro-titan; do
  gh secret set AWS_ROLE_ARN --body "$ROLE_ARN" --repo "$REPO"
done

After this, pushing to main triggers a production deploy; pushing to staging triggers a staging deploy. No manual image builds needed.

Production apply — pending steps (July 2026)

Staging was stabilised and hardened on 2026-07-03. The following Terraform changes have been applied to staging but not yet to production. Apply them after staging has been stable for at least one week (target: ~2026-07-10).

Do not run terraform apply on production without completing the import step first. The github_oidc module was moved from staging state to production config on 2026-07-03. The IAM role exists in AWS but is not yet in production Terraform state. Applying without importing will fail with EntityAlreadyExists.

What changed in the Terraform code

Module	Change
`module.github_oidc`	Added to production `main.tf`. Removed from staging (account-wide resource should live in production state). IAM role already exists — must import before apply.
`module.elasticache`	`num_cache_clusters` increased from 1 to 2 with `automatic_failover_enabled = true` and `multi_az_enabled = true`. Production gets replication; staging stays single-node.
`module.fastify_nova`	`health_check_grace_period_seconds = 60` and `deployment_circuit_breaker` added.
`module.customer_panel`	Same — grace period 60s + circuit breaker.
`module.orion_backend`	Same — grace period 60s + circuit breaker.
`module.events_consumer`	`deployment_circuit_breaker` added (no ALB so no grace period).
`module.internal_tools`	`deployment_circuit_breaker` added to both `bull-studio` and `redis-insight` services.
`module.storefront`	Already applied to staging on 2026-07-03 (rev 6). Production apply registers a new task def — force-redeploy needed after.

Step 1 — Import the github_oidc resources into production state

The zelly-github-actions IAM role was previously managed by staging state. After removing it from staging, it is orphaned. Import it into production state before applying.

bash — run from d:/zelly/terraform

cd d:/zelly/terraform
terraform init

# Import the IAM role
terraform import module.github_oidc.aws_iam_role.github_actions zelly-github-actions

# Import the inline policy attached to that role
terraform import module.github_oidc.aws_iam_role_policy.github_actions \
  zelly-github-actions:zelly-github-actions-policy

# Verify — plan should show 0 changes for github_oidc after import
terraform plan -target=module.github_oidc

PowerShell — run from d:\zelly\terraform

Set-Location "d:\zelly\terraform"
$tf = "C:\tools\terraform\terraform.exe"
& $tf init

& $tf import module.github_oidc.aws_iam_role.github_actions zelly-github-actions
& $tf import module.github_oidc.aws_iam_role_policy.github_actions `
  zelly-github-actions:zelly-github-actions-policy

& $tf plan -target=module.github_oidc

The plan after import should show 0 resources to add/change/destroy for module.github_oidc. If it shows any change, review the diff before applying — the policy content must not drift from what was deployed by staging.
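The "0 changes" check can be made scriptable with terraform plan -detailed-exitcode, which exits 0 when the plan is empty, 2 when changes are present, and 1 on error. A minimal sketch; `plan_verdict` is a hypothetical helper and the wording of its messages is illustrative.

```bash
# Map the exit code of `terraform plan -detailed-exitcode` to a verdict.
#   0 = succeeded, no changes   1 = error   2 = succeeded, changes present
plan_verdict() {
  case "$1" in
    0) echo "clean: import matches the deployed resources" ;;
    2) echo "drift: review the diff before applying" ;;
    *) echo "error: plan failed" ;;
  esac
}

# Usage (needs an initialised Terraform directory, so it is left commented out):
# terraform plan -detailed-exitcode -target=module.github_oidc
# plan_verdict "$?"
```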

Step 2 — Apply all pending modules

bash

# ElastiCache: adds 1 read replica + enables automatic failover (~5 min, no downtime)
# ECS services: in-place update to add grace period + circuit breaker (no task restart)
# github_oidc: no-op after import
terraform apply \
  -target=module.github_oidc \
  -target=module.elasticache \
  -target=module.fastify_nova \
  -target=module.customer_panel \
  -target=module.orion_backend \
  -target=module.events_consumer \
  -target=module.internal_tools \
  -target=module.storefront

PowerShell

$tf = "C:\tools\terraform\terraform.exe"
$tfArgs = @("apply",
  "-target=module.github_oidc",
  "-target=module.elasticache",
  "-target=module.fastify_nova",
  "-target=module.customer_panel",
  "-target=module.orion_backend",
  "-target=module.events_consumer",
  "-target=module.internal_tools",
  "-target=module.storefront")
# No -auto-approve, matching the bash variant: review the plan before confirming on production
& $tf @tfArgs

ElastiCache will scale from 1 to 2 nodes and enable Multi-AZ. This is an in-place modification (apply_immediately = true) — expect ~5 minutes for the replica to initialise. Redis connections are not interrupted during this change.

Step 3 — Force-redeploy storefront to pick up the new task definition

Storefront task definition gets a new revision (adds AUTH_HUB_URL env var). Because of lifecycle { ignore_changes = [task_definition] }, the service stays on the old revision until force-redeployed.

bash

aws ecs update-service \
  --cluster zelly-production \
  --service storefront \
  --force-new-deployment \
  --region ap-south-1

# Wait for stable
aws ecs wait services-stable \
  --cluster zelly-production \
  --services storefront \
  --region ap-south-1 && echo "Stable."

PowerShell

$aws = "C:\Program Files\Amazon\AWSCLIV2\aws.exe"
& $aws ecs update-service `
  --cluster zelly-production `
  --service storefront `
  --force-new-deployment `
  --region ap-south-1

Production apply checklist

Step	Command	Done
Import `github_oidc` IAM role	`terraform import module.github_oidc.aws_iam_role.github_actions zelly-github-actions`	☐
Import `github_oidc` inline policy	`terraform import module.github_oidc.aws_iam_role_policy.github_actions zelly-github-actions:zelly-github-actions-policy`	☐
Verify plan shows 0 changes for github_oidc	`terraform plan -target=module.github_oidc`	☐
Apply all modules	`terraform apply -target=module.github_oidc -target=module.elasticache ...`	☐
Force-redeploy storefront	`aws ecs update-service --cluster zelly-production --service storefront --force-new-deployment --region ap-south-1`	☐
Confirm storefront stable	`aws ecs wait services-stable --cluster zelly-production --services storefront --region ap-south-1`	☐
Confirm ElastiCache has 2 nodes	AWS Console → ElastiCache → `zelly-production` → Nodes: 2	☐
Confirm circuit breaker enabled on all services	`aws ecs describe-services --cluster zelly-production --services fastify-nova customer-panel orion-backend events-consumer --region ap-south-1 --query 'services[*].{s:serviceName,cb:deploymentConfiguration.deploymentCircuitBreaker.enable}'`	☐

Gotchas

Things that have bitten us, collected here so they don't bite you.

ECR is always in ap-south-1, even for staging

Staging runs in ap-southeast-1 but ECR repos live in ap-south-1 (the production account). When building and pushing images for staging, always target the ap-south-1 registry. When Terraform builds image URIs for staging, it uses the account ID but hardcodes ap-south-1 as the registry region. If you push to a registry in any other region, the image will not exist at the URI ECS tries to pull, and the task stops with CannotPullContainerError.

Always authenticate with: aws ecr get-login-password --region ap-south-1
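As an illustration of the URI shape described above, the sketch below builds an image URI with the registry region pinned. `ecr_image_uri` is a hypothetical helper and 123456789012 is a placeholder account ID.

```bash
ECR_REGION="ap-south-1"   # registry region is fixed, regardless of where ECS runs

# Build <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ecr_image_uri() {
  local account="$1" repo="$2" tag="$3"
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$account" "$ECR_REGION" "$repo" "$tag"
}

ecr_image_uri 123456789012 zelly/fastify-nova staging-initial
# prints: 123456789012.dkr.ecr.ap-south-1.amazonaws.com/zelly/fastify-nova:staging-initial
```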

subdomain_base must always be zelly.in — never storego.in

The SUBDOMAIN_BASE env var in fastify-nova controls how storefront subdomain routing works. It defaults to zelly.in in Terraform. If you accidentally set it to storego.in anywhere, storefront tenant lookups will break silently — the app won't error, it just won't find the right tenant.

ACM validation CNAMEs must NOT be proxied through Cloudflare

Terraform automatically creates the Cloudflare CNAME records for ACM DNS validation with proxied = false. If you manually flip those records to proxied (orange cloud), ACM cannot complete validation and the cert will stay in PENDING_VALIDATION forever.

The records have the comment "ACM certificate validation — do not delete". Leave them alone after creation.

Secrets must exist before ECS tasks start — task fails with ResourceInitializationError

ECS pulls secrets from Secrets Manager at task startup via the execution role. If a referenced secret does not exist, the task fails immediately with:

ECS stopped reason

ResourceInitializationError: unable to pull secrets or
registry auth: execution role does not have permissions
or the secret does not exist

Always populate zelly/fastify-nova/env, zelly/customer-panel/env, and zelly/orion/env before running terraform apply for the ECS services.

To check what stopped a task: see How-To: Debug a stopped task.

Terraform apply won't redeploy running ECS tasks

All ECS services have lifecycle { ignore_changes = [task_definition] }. This means terraform apply registers a new task definition revision but the running service stays on the old one. You must force a redeployment to roll out changes.

bash

aws ecs update-service \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service fastify-nova \
  --force-new-deployment

This is intentional — it prevents Terraform from accidentally restarting services during infrastructure-only changes (e.g., updating a security group rule).

Secrets Manager: can't immediately recreate a deleted secret

When you delete a secret, AWS schedules it for deletion after a recovery window (30 days by default, 7 days minimum) instead of removing it immediately. Until that window ends, creating a secret with the same name fails. To force-delete without the recovery window:

bash — destructive, irreversible

aws secretsmanager delete-secret \
  --region ap-southeast-1 \
  --secret-id "zelly/fastify-nova/env" \
  --force-delete-without-recovery

To update a secret's value without deleting it, use put-secret-value instead — see How-To: Update a secret.
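A non-destructive alternative is to restore the secret with restore-secret and then overwrite its value. The sketch below picks the right action from the DeletedDate field that describe-secret reports; `secret_action` and the env.json file are hypothetical. It treats empty output as "secret not found", which is an assumption: a permissions or network error also produces empty output.

```bash
# Hypothetical helper: choose an action from the value printed by
# `describe-secret --query DeletedDate --output text`.
#   ""      -> describe-secret failed; assume the secret does not exist
#   "None"  -> the secret exists and is not scheduled for deletion
#   a date  -> the secret is inside its recovery window
secret_action() {
  case "$1" in
    "")   echo "create" ;;
    None) echo "update" ;;
    *)    echo "restore-then-update" ;;
  esac
}

# Usage (needs AWS credentials, so it is left commented out):
# NAME="zelly/fastify-nova/env"; REGION="ap-southeast-1"
# DELETED=$(aws secretsmanager describe-secret --region "$REGION" \
#   --secret-id "$NAME" --query DeletedDate --output text 2>/dev/null)
# case "$(secret_action "$DELETED")" in
#   create) aws secretsmanager create-secret --region "$REGION" \
#             --name "$NAME" --secret-string file://env.json ;;
#   update) aws secretsmanager put-secret-value --region "$REGION" \
#             --secret-id "$NAME" --secret-string file://env.json ;;
#   restore-then-update)
#           aws secretsmanager restore-secret --region "$REGION" --secret-id "$NAME"
#           aws secretsmanager put-secret-value --region "$REGION" \
#             --secret-id "$NAME" --secret-string file://env.json ;;
# esac
```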

Aurora version 8.0.mysql_aurora.3.05.2 is not available in ap-southeast-1

Aurora MySQL Serverless v2 version 8.0.mysql_aurora.3.05.2 is not available in Singapore (ap-southeast-1). The Terraform module uses 8.0.mysql_aurora.3.04.1 for staging. Do not try to upgrade this — the apply will fail with Cannot find version.

Production (ap-south-1) can use the higher version if needed, but this has not been tested.

AL2023 root volume minimum is 30 GB

Amazon Linux 2023 AMI snapshots require a minimum root EBS volume of 30 GB. If you set volume_size = 20 for bastion or ClickHouse EC2 instances, the apply fails with InvalidBlockDeviceMapping. Both modules are correctly set to 30 GB — do not lower them.

ElastiCache and Security Group descriptions: ASCII only

AWS does not accept non-ASCII characters in ElastiCache cluster descriptions or Security Group descriptions. Em-dashes (—), smart quotes, and other Unicode punctuation cause InvalidParameterValue errors. Always use plain hyphens in these fields.

This most commonly bites when copy-pasting descriptions from a Markdown file or a macOS editor that auto-converts hyphens to em-dashes.
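A description can be checked before it goes into a Terraform file. This is a minimal sketch: `is_ascii` is a hypothetical helper, and it is stricter than AWS strictly requires because it also rejects control characters.

```bash
# Succeeds only if the string consists of printable ASCII (space through tilde).
is_ascii() {
  local count
  # Delete every printable ASCII byte, then count what is left over.
  count="$(printf '%s' "$1" | LC_ALL=C tr -d ' -~' | wc -c | tr -d ' ')"
  [ "$count" -eq 0 ]
}

is_ascii "Zelly staging Redis - BullMQ" && echo "ok"
```

A string containing a pasted Unicode dash or smart quote makes the function return non-zero, so it can gate a pre-commit check.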

ElastiCache Redis TLS: port 6379, but the client needs tls: {} option

ElastiCache with transit_encryption_enabled = true still listens on port 6379; the port doesn't change. What changes is that the Redis client must use TLS when connecting. In ioredis, pass tls: {} in the connection options. node-redis (@redis/client) takes the setting under socket instead (socket: { tls: true }), or accepts a rediss:// URL. Without TLS enabled on the client, the connection hangs silently because the server expects TLS but receives plaintext.

// ioredis
import Redis from "ioredis";

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  password: process.env.REDIS_PASSWORD,
  tls: {}, // an empty object is enough to switch the connection to TLS
})

Cloudflare Terraform provider v4: use content, not value

Recent 4.x releases of the cloudflare/cloudflare provider deprecate the DNS record field value in favour of content. Depending on the provider version, using value produces either a deprecation warning or a validation error. All records in the Terraform modules already use content; if you add a new record manually, use content as well.

Never put secrets in terraform.tfvars or ECS environment blocks

All sensitive values live in AWS Secrets Manager only. ECS task definitions reference them via the secrets block (which injects them as env vars at container start via the execution role). Plaintext env vars in the environment block are visible in the ECS console, CloudWatch logs, and Terraform state — never put passwords, tokens, or keys there.

Non-sensitive config that happens to look like a secret (e.g. CLICKHOUSE_USER=default) is fine in environment.

Fargate does not use AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY

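ECS Fargate tasks use an IAM task role for AWS API access. ECS serves temporary credentials from the container credentials endpoint (advertised to the container via AWS_CONTAINER_CREDENTIALS_RELATIVE_URI), and the AWS SDK's default credential provider chain picks them up automatically; no static keys are needed or wanted. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are not set inside the container, so app code should not read them directly. Never add these to the task definition environment or Secrets Manager.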
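To see which credential source a container would use, you can inspect its environment (for example in an entrypoint script). The sketch below is a simplification of the SDK's real provider chain and looks at environment variables only; `credential_source` is a hypothetical helper.

```bash
# Report the likely credential source based on environment variables alone.
# AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set by ECS when a task role is attached.
credential_source() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ]; then
    echo "static keys in environment (should not happen on Fargate)"
  elif [ -n "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI:-}" ]; then
    echo "ECS task role via container credentials endpoint"
  else
    echo "no container credentials detected"
  fi
}

credential_source
```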

Terraform for_each fails with "keys derived from resource attributes unknown until apply"

This error occurs when ACM certificate domain_validation_options are passed through a module output and then used as for_each keys. AWS doesn't know the CNAME names until the cert is created, so the keys are "unknown at plan time" and Terraform refuses to plan.

The fix: ACM certificates must be created in the root config (not inside the alb module), so the domain_name — which IS known at plan time — can be used as the for_each key. This is already the case in both terraform/main.tf and environments/staging/main.tf. Do not move cert resources back into the alb module.

orion-backend CORS_ORIGINS is a Terraform variable, not a secret

CORS_ORIGINS for the orion-backend is the URL of the Cloudflare Pages frontend (seller panel). It is not sensitive, so it lives as a Terraform variable (orion_cors_origins) rather than in Secrets Manager. Set it in terraform.tfvars:

orion_cors_origins = "https://seller.zelly.in"   # your Cloudflare Pages URL

For staging you may want to point it at the Cloudflare Pages preview URL instead. The default is https://seller.zelly.in.

Neptune (customer-panel) shares almost all env vars with fastify-nova

Neptune runs the same backend codebase as fastify-nova in a split-deployment topology (DEPLOYMENT_TOPOLOGY=split, APP_ROLE=api). As a result, its task definition is nearly identical to nova — same Aurora/Redis/ClickHouse secrets, same payment, Slack, logging, and marketing env vars. Its zelly/customer-panel/env secret in Secrets Manager must contain all the same keys as zelly/fastify-nova/env.

Kafka is completely removed — do not re-add it

The analytics pipeline previously used Kafka (KafkaJS in fastify-nova, a Kafka consumer in store-events-consumer). Kafka and all Kafka-related env vars (KAFKA_BROKERS, KAFKA_CLIENT_ID, KAFKA_TOPIC, KAFKA_SSL, etc.) have been removed. The pipeline is now BullMQ + ElastiCache Redis. If you see Kafka config in old .env files, ignore it.

How-Tos

Common operational tasks after the initial deployment.

Update a secret value in Secrets Manager

Use put-secret-value (not create-secret) to update an existing secret. After updating, force-redeploy the affected service so it picks up the new value.

bash

# Fetch the current value into a private temp file (mktemp creates it with mode 600)
SECRET_FILE=$(mktemp)
aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id "zelly/fastify-nova/env" \
  --query SecretString --output text > "$SECRET_FILE"

# Edit "$SECRET_FILE", then:
aws secretsmanager put-secret-value \
  --region ap-southeast-1 \
  --secret-id "zelly/fastify-nova/env" \
  --secret-string "file://$SECRET_FILE"

rm "$SECRET_FILE"
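A malformed edit is easy to push by accident, so it can help to validate the edited file before calling put-secret-value. This is a minimal sketch, assuming the secret is stored as a single JSON object (as the .json filename suggests); is_valid_secret_json is a hypothetical helper name.

```shell
# is_valid_secret_json FILE
# Succeeds only if FILE parses as a JSON object; otherwise prints the reason
# to stderr and returns non-zero. (Hypothetical helper, not part of the repo.)
is_valid_secret_json() {
  python3 - "$1" <<'PY'
import json, sys
try:
    with open(sys.argv[1]) as f:
        data = json.load(f)
except (OSError, ValueError) as exc:
    sys.exit(f"invalid secret JSON: {exc}")
if not isinstance(data, dict):
    sys.exit("invalid secret JSON: top level must be an object")
PY
}

# Usage (not executed here), with EDITED_FILE pointing at the file you edited:
#   is_valid_secret_json "$EDITED_FILE" && aws secretsmanager put-secret-value ...
```

Chaining with && means the put only runs when the file parses, so a stray trailing comma never reaches Secrets Manager.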

Force ECS service redeployment

Needed after: updating Secrets Manager, changing env vars in the task definition, pushing a new Docker image without updating the image tag in tfvars.

bash

SERVICE="fastify-nova"     # or customer-panel, orion-backend, etc.
CLUSTER="zelly-staging"    # or zelly-production
REGION="ap-southeast-1"    # or ap-south-1

aws ecs update-service \
  --region "$REGION" \
  --cluster "$CLUSTER" \
  --service "$SERVICE" \
  --force-new-deployment

# Block until the rollout is stable (the waiter polls every 15 seconds and
# gives up after 40 attempts, roughly 10 minutes)
aws ecs wait services-stable \
  --region "$REGION" \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  && echo "Stable."
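When several services need the same redeploy (for example, after changing a secret they share), the two commands above can be wrapped in a loop. The following is a sketch only: region_for_cluster and redeploy are hypothetical helper names, and the cluster-to-region mapping is taken from the comments in the block above.

```shell
# region_for_cluster CLUSTER
# Maps a cluster name to its AWS region (staging vs production).
region_for_cluster() {
  case "$1" in
    zelly-staging)    echo "ap-southeast-1" ;;
    zelly-production) echo "ap-south-1" ;;
    *) echo "unknown cluster: $1" >&2; return 1 ;;
  esac
}

# redeploy CLUSTER SERVICE...
# Forces a new deployment of each service, then waits for all of them at once.
redeploy() {
  local cluster="$1" region svc
  shift
  region=$(region_for_cluster "$cluster") || return 1
  for svc in "$@"; do
    aws ecs update-service --region "$region" --cluster "$cluster" \
      --service "$svc" --force-new-deployment \
      --query 'service.serviceName' --output text || return 1
  done
  aws ecs wait services-stable --region "$region" --cluster "$cluster" \
    --services "$@"
}

# Usage (not executed here):
#   redeploy zelly-staging fastify-nova customer-panel
```

The redeploys are triggered first and awaited together, so the services roll in parallel rather than one after another.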

Debug a stopped ECS task

When an ECS task fails to start, its stoppedReason and the container exit codes usually tell you why.

bash

CLUSTER="zelly-staging"
REGION="ap-southeast-1"

# List up to five stopped tasks (ECS only lists stopped tasks for a short time after they stop,
# and the order is not guaranteed to be newest-first)
aws ecs list-tasks \
  --region "$REGION" \
  --cluster "$CLUSTER" \
  --desired-status STOPPED \
  --query 'taskArns[0:5]' --output json

# Describe a stopped task to see stoppedReason + container exit codes
TASK_ARN="arn:aws:ecs:..."   # paste from above

aws ecs describe-tasks \
  --region "$REGION" \
  --cluster "$CLUSTER" \
  --tasks "$TASK_ARN" \
  --query 'tasks[0].{stopped:stoppedReason,containers:containers[*].{name:name,exit:exitCode,reason:reason}}'

Common stopped reasons

Stopped reason	Cause	Fix
`ResourceInitializationError: unable to pull secrets`	Secret doesn't exist in Secrets Manager or execution role lacks permission	Create the secret; check the execution role policy covers `zelly/*`
`CannotPullContainerError`	Image doesn't exist in ECR, or ECR is in the wrong region	Push the image; verify image URI uses `ap-south-1` ECR endpoint
`Essential container exited` / exit code 1	App crashed on startup — bad config, failed DB connection, etc.	Check CloudWatch logs: `aws logs tail /zelly/ecs/fastify-nova --follow`
`Task failed ELB health checks`	ALB health check path returns non-2xx or task isn't listening on the expected port	Ensure `/health` endpoint returns 200; check container port mapping
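For scripted triage, the table above can be expressed as a simple pattern match on the stoppedReason string. This is an illustrative sketch; triage_stopped_reason is a hypothetical helper, and the patterns are substrings of the reasons listed in the table, not an exhaustive list of what ECS can return.

```shell
# triage_stopped_reason "REASON TEXT"
# Prints the suggested next step for the stopped reasons in the table above.
triage_stopped_reason() {
  case "$1" in
    *ResourceInitializationError*)
      echo "Create the secret; check the execution role policy covers zelly/*" ;;
    *CannotPullContainerError*)
      echo "Push the image; verify the image URI uses the ap-south-1 ECR endpoint" ;;
    *"Essential container"*)
      echo "Check the service's CloudWatch logs with aws logs tail" ;;
    *"ELB health checks"*)
      echo "Ensure /health returns 200; check the container port mapping" ;;
    *)
      echo "Unrecognised reason; inspect container exit codes and logs" ;;
  esac
}

# Usage (not executed here): pass the stoppedReason from describe-tasks, e.g.
#   triage_stopped_reason "$(aws ecs describe-tasks ... --query 'tasks[0].stoppedReason' --output text)"
```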

Tail ECS logs live

bash

# Staging
aws logs tail /zelly/ecs/fastify-nova \
  --region ap-southeast-1 \
  --follow \
  --since 10m

# Production
aws logs tail /zelly/ecs/fastify-nova \
  --region ap-south-1 \
  --follow \
  --since 10m

# Filter to errors only
aws logs tail /zelly/ecs/fastify-nova \
  --region ap-southeast-1 \
  --follow \
  --filter-pattern "ERROR"

Roll back to a previous task definition revision

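ECS keeps every task definition revision until it is explicitly deregistered. To roll back, update the service to use an older revision. Bear in mind that a later terraform apply may point the service back at whichever revision Terraform manages, so fix the underlying config before the next apply.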

bash

# List recent revisions for a service
aws ecs list-task-definitions \
  --region ap-southeast-1 \
  --family-prefix "zelly-staging-fastify-nova" \
  --sort DESC \
  --query 'taskDefinitionArns[0:5]' --output table

# Roll back to a specific revision
aws ecs update-service \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service fastify-nova \
  --task-definition "zelly-staging-fastify-nova:42"   # replace 42 with revision number
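If the bad deploy is the latest revision, the rollback target is often simply the one before it. The helper below is a sketch that derives that name from either a family:revision string or a full task definition ARN. previous_revision is a hypothetical name, and it assumes revision N-1 was never deregistered, which is worth confirming with aws ecs describe-task-definition before using the result.

```shell
# previous_revision TASK_DEF
# Accepts "family:42" or a full task definition ARN and prints "family:41".
# Assumption: revision N-1 still exists; this function does not check.
previous_revision() {
  local ref="${1##*/}"      # drop the ARN prefix up to the last "/"
  local family="${ref%:*}"  # text before the final ":"
  local rev="${ref##*:}"    # text after the final ":"
  case "$rev" in
    ''|*[!0-9]*) echo "no numeric revision in: $1" >&2; return 1 ;;
  esac
  if [ "$rev" -le 1 ]; then
    echo "revision $rev has no predecessor" >&2
    return 1
  fi
  echo "$family:$((rev - 1))"
}

# Usage (not executed here):
#   aws ecs update-service ... --task-definition "$(previous_revision "$CURRENT_ARN")"
```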

Scale an ECS service manually

Auto-scaling handles traffic-based scaling, but you can also set desired count directly. Useful for scaling to zero during off-hours on staging.

bash

# Scale down staging overnight
aws ecs update-service \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service fastify-nova \
  --desired-count 0

# Scale back up
aws ecs update-service \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service fastify-nova \
  --desired-count 1

Scaling to 0 on production will cause downtime. Only do this on staging, and only if you know what you're doing.
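To make that warning harder to ignore, manual scaling can go through a small guard that refuses a desired count of zero anywhere except staging. This is a sketch under assumptions: safe_scale is a hypothetical wrapper, and the cluster names and regions are the ones used elsewhere in this guide.

```shell
# safe_scale CLUSTER SERVICE COUNT
# Sets the desired count, but refuses to scale to 0 outside zelly-staging.
safe_scale() {
  local cluster="$1" service="$2" count="$3" region
  case "$count" in
    ''|*[!0-9]*) echo "count must be a non-negative integer" >&2; return 2 ;;
  esac
  if [ "$count" -eq 0 ] && [ "$cluster" != "zelly-staging" ]; then
    echo "refusing to scale $service to 0 on $cluster" >&2
    return 1
  fi
  case "$cluster" in
    zelly-staging)    region="ap-southeast-1" ;;
    zelly-production) region="ap-south-1" ;;
    *) echo "unknown cluster: $cluster" >&2; return 2 ;;
  esac
  aws ecs update-service --region "$region" --cluster "$cluster" \
    --service "$service" --desired-count "$count"
}

# Usage (not executed here):
#   safe_scale zelly-staging fastify-nova 0      # allowed
#   safe_scale zelly-production fastify-nova 0   # refused
```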

SSH into the bastion host

bash

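# The d:/zelly path is one machine's checkout location; adjust it to wherever the repo lives on yours
BASTION=$(cd d:/zelly/terraform/environments/staging && terraform output -raw bastion_public_ip)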

ssh -i ~/.ssh/id_rsa ec2-user@"$BASTION"

# Or use SSM Session Manager (no open port 22 needed)
INSTANCE_ID=$(cd d:/zelly/terraform/environments/staging && terraform output -raw bastion_instance_id)
aws ssm start-session \
  --region ap-southeast-1 \
  --target "$INSTANCE_ID"

Access Bull Studio (BullMQ UI) via SSH tunnel

Bull Studio runs as an internal ECS Fargate task with no public ALB. Access it via an SSH tunnel through the bastion.

bash

BASTION=$(cd d:/zelly/terraform/environments/staging && terraform output -raw bastion_public_ip)

# Find the Bull Studio task private IP
BULL_IP=$(aws ecs list-tasks \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service-name bull-studio \
  --query 'taskArns[0]' --output text \
  | xargs -I{} aws ecs describe-tasks \
    --region ap-southeast-1 \
    --cluster zelly-staging \
    --tasks {} \
    --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
    --output text)

# Open tunnel — then visit http://localhost:3001 in your browser
ssh -L 3001:"$BULL_IP":3000 ec2-user@"$BASTION" -N

Access RedisInsight via SSH tunnel

bash

BASTION=$(cd d:/zelly/terraform/environments/staging && terraform output -raw bastion_public_ip)

REDIS_INSIGHT_IP=$(aws ecs list-tasks \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service-name redis-insight \
  --query 'taskArns[0]' --output text \
  | xargs -I{} aws ecs describe-tasks \
    --region ap-southeast-1 \
    --cluster zelly-staging \
    --tasks {} \
    --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
    --output text)

# Open tunnel — then visit http://localhost:5540 in your browser
ssh -L 5540:"$REDIS_INSIGHT_IP":5540 ec2-user@"$BASTION" -N
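The Bull Studio and RedisInsight tunnels repeat the same IP lookup, and both silently produce a bad tunnel target if the service has no running task (the CLI prints None). A shared helper can handle that case. This is a sketch; task_private_ip is a hypothetical function name, and it uses the same list-tasks and describe-tasks queries as the blocks above.

```shell
# task_private_ip REGION CLUSTER SERVICE
# Prints the private IPv4 address of the first task found for SERVICE,
# or fails with a message if there is no task or no address yet.
task_private_ip() {
  local region="$1" cluster="$2" service="$3" task_arn ip
  task_arn=$(aws ecs list-tasks --region "$region" --cluster "$cluster" \
    --service-name "$service" --query 'taskArns[0]' --output text) || return 1
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "no running task for $service in $cluster" >&2
    return 1
  fi
  ip=$(aws ecs describe-tasks --region "$region" --cluster "$cluster" \
    --tasks "$task_arn" \
    --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
    --output text) || return 1
  if [ -z "$ip" ] || [ "$ip" = "None" ]; then
    echo "task has no private IP yet" >&2
    return 1
  fi
  echo "$ip"
}

# Usage (not executed here):
#   BULL_IP=$(task_private_ip ap-southeast-1 zelly-staging bull-studio) \
#     && ssh -L 3001:"$BULL_IP":3000 ec2-user@"$BASTION" -N
```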

Connect to Redis from the bastion

bash — run on the bastion

REDIS_HOST=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/redis/auth \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['host'])")

REDIS_PASS=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/redis/auth \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['auth_token'])")

redis-cli -h "$REDIS_HOST" -p 6379 -a "$REDIS_PASS" --tls PING
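The block above calls Secrets Manager twice for the same secret. A small field extractor lets you fetch the secret once and reuse it. This is a sketch; secret_field is a hypothetical helper, and the host and auth_token key names are the ones used above.

```shell
# secret_field KEY
# Reads a JSON object on stdin and prints the value stored under KEY.
# Fails (non-zero exit) if the key is missing.
secret_field() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)[sys.argv[1]])' "$1"
}

# Usage (not executed here):
#   REDIS_JSON=$(aws secretsmanager get-secret-value --region ap-southeast-1 \
#     --secret-id zelly/redis/auth --query SecretString --output text)
#   REDIS_HOST=$(printf '%s' "$REDIS_JSON" | secret_field host)
#   export REDISCLI_AUTH=$(printf '%s' "$REDIS_JSON" | secret_field auth_token)
#   redis-cli -h "$REDIS_HOST" -p 6379 --tls PING
```

Recent redis-cli versions read the password from the REDISCLI_AUTH environment variable, which avoids the warning that -a prints about passing a password on the command line. If the redis-cli on the bastion is too old to support it, keep using -a as shown above.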

Connect to Aurora from the bastion

bash — run on the bastion

# Production values shown; for staging use ap-southeast-1 and zelly/aurora/staging
AURORA_HOST=$(aws secretsmanager get-secret-value \
  --region ap-south-1 \
  --secret-id zelly/aurora/production \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['host'])")

DB_PASS=$(aws secretsmanager get-secret-value \
  --region ap-south-1 \
  --secret-id zelly/aurora/production \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")

mysql -h "$AURORA_HOST" -u root -p"$DB_PASS"

Run a one-off migration against staging Aurora

For ad-hoc SQL you need to run after the initial migration (schema patches, data fixes).

bash — staging only (publicly accessible)

AURORA=$(cd d:/zelly/terraform/environments/staging && terraform output -raw aurora_endpoint)

DB_PASS=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/aurora/staging \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")

# Run a SQL file
mysql -h "$AURORA" -u root -p"$DB_PASS" astro_primary < patch.sql

# Or interactive session
mysql -h "$AURORA" -u root -p"$DB_PASS"
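Passing the password with -p on the command line triggers a mysql client warning and leaves it in your shell history. One alternative is a temporary client option file. The sketch below assumes the password contains no double quotes or backslashes (those would need escaping in an option file); write_mysql_cnf is a hypothetical helper name.

```shell
# write_mysql_cnf HOST USER PASSWORD
# Writes a private [client] option file and prints its path.
# Assumption: PASSWORD contains no double quotes or backslashes.
write_mysql_cnf() {
  local cnf
  cnf=$(mktemp) || return 1
  chmod 600 "$cnf"
  printf '[client]\nhost=%s\nuser=%s\npassword="%s"\n' "$1" "$2" "$3" > "$cnf"
  echo "$cnf"
}

# Usage (not executed here); --defaults-extra-file must be the first option:
#   CNF=$(write_mysql_cnf "$AURORA" root "$DB_PASS")
#   mysql --defaults-extra-file="$CNF" astro_primary < patch.sql
#   rm -f "$CNF"
```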

Check BullMQ queue depth from Redis CLI

bash — run on the bastion

# After connecting to redis-cli (see above), run:

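# List all BullMQ queues
# (KEYS walks the entire keyspace and blocks Redis while it runs; on a busy instance
#  prefer SCAN 0 MATCH bull:*:meta COUNT 1000, repeating with the returned cursor until it is 0)
KEYS bull:*:meta

# Count jobs per state in a specific queue
# (wait and active are lists; delayed and failed are sorted sets)
LLEN bull:store-events:wait
LLEN bull:store-events:active
ZCOUNT bull:store-events:delayed -inf +inf
ZCOUNT bull:store-events:failed -inf +inf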
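To check several queues without retyping the key names, the commands can be generated and piped into redis-cli, which reads commands from stdin. This is a sketch assuming the default bull key prefix shown above; queue_depth_cmds is a hypothetical helper, and it uses ZCARD, which returns the same count as ZCOUNT over -inf to +inf.

```shell
# queue_depth_cmds QUEUE
# Prints the Redis commands that count jobs per state for one BullMQ queue.
queue_depth_cmds() {
  local q="bull:$1"
  printf 'LLEN %s:wait\n'     "$q"
  printf 'LLEN %s:active\n'   "$q"
  printf 'ZCARD %s:delayed\n' "$q"
  printf 'ZCARD %s:failed\n'  "$q"
}

# Usage on the bastion (not executed here):
#   queue_depth_cmds store-events | redis-cli -h "$REDIS_HOST" -p 6379 -a "$REDIS_PASS" --tls
```

The four numbers come back in the order wait, active, delayed, failed.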

Terraform: apply changes to a single module

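When you change only one service's task definition, use -target to limit the apply scope and avoid accidentally touching unrelated resources. Terraform treats -target as an exceptional-use flag and prints a warning when it is used, so run a full terraform plan afterwards to confirm nothing else is pending.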

bash

# Apply only the fastify-nova service module
terraform apply -target module.fastify_nova

# Apply only the ECS cluster (e.g. after adding a log group)
terraform apply -target module.ecs_cluster

# Apply only security groups
terraform apply -target module.security_groups
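When a change spans two or three modules, the -target flags can be built from a list of module names. This is a sketch; target_args is a hypothetical helper, and the module names are the ones used in the examples above.

```shell
# target_args MODULE...
# Prints one -target=module.NAME flag per argument, separated by spaces.
target_args() {
  local out="" m
  for m in "$@"; do
    out="${out:+$out }-target=module.$m"
  done
  printf '%s\n' "$out"
}

# Usage (not executed here). The command substitution is left unquoted on
# purpose so the shell splits it into separate flags:
#   terraform plan  $(target_args fastify_nova ecs_cluster)
#   terraform apply $(target_args fastify_nova ecs_cluster)
#   terraform plan   # full plan afterwards, to confirm nothing else is pending
```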