First-Time Deployment
This guide walks through bringing up a fresh environment from zero — no existing AWS infrastructure, no Docker images, nothing. Follow the phases in order. Staging and production share ECR repositories (both live in the production AWS account in ap-south-1), so ECR is created once and reused.
Phase 0 — Prerequisites
Tools
| Tool | Minimum version | Install |
|---|---|---|
| Terraform | 1.6 | brew install terraform or terraform.io/install |
| AWS CLI | v2 | brew install awscli |
| Docker | 24+ | Docker Desktop |
| WireGuard | any | brew install wireguard-tools — production access only |
| MySQL client | 8.0 | brew install mysql-client |
AWS credentials
# Verify you are authenticated to the correct account
aws sts get-caller-identity
# Should return the Zelly production AWS account ID
Use a profile with AdministratorAccess (or a scoped policy covering EC2, ECS, RDS, ElastiCache, Secrets Manager, IAM, ECR). Set AWS_PROFILE=zelly if needed. Both staging and production Terraform runs use the same account — staging is just a separate environment in ap-southeast-1.
One-time: Terraform state backend
Terraform state is stored in S3 with a DynamoDB lock table. These must exist before terraform init. Run once, ever.
# Create the state bucket (versioning + encryption)
aws s3api create-bucket \
  --bucket zelly-terraform-state \
  --region ap-south-1 \
  --create-bucket-configuration LocationConstraint=ap-south-1
aws s3api put-bucket-versioning \
  --bucket zelly-terraform-state \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
  --bucket zelly-terraform-state \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
# Create the DynamoDB lock table
aws dynamodb create-table \
  --table-name zelly-terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-south-1
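Because these commands are meant to run exactly once, it can help to check whether the backend already exists before creating it or running terraform init. A minimal sketch, assuming the bucket and table names used above; state_backend_ready is a hypothetical helper, not something that ships with the repository:

```bash
# Returns 0 only if both the state bucket and the lock table already exist.
# Arguments are optional and default to the names used in this guide.
state_backend_ready() {
  bucket="${1:-zelly-terraform-state}"
  table="${2:-zelly-terraform-locks}"
  region="${3:-ap-south-1}"

  aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1 || {
    echo "missing bucket: $bucket" >&2
    return 1
  }
  aws dynamodb describe-table --table-name "$table" --region "$region" >/dev/null 2>&1 || {
    echo "missing lock table: $table" >&2
    return 1
  }
  echo "state backend ready"
}
```

Calling state_backend_ready && terraform init would then skip init until both resources are in place.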
Phase 1 — ECR Repositories
ECR repositories live in ap-south-1 in the production account and are shared between staging and production. Create them first — ECS cannot start tasks without images, and images cannot be pushed without repos.
cd d:/zelly/terraform
# Copy and fill in your tfvars (see terraform.tfvars.example)
cp terraform.tfvars.example terraform.tfvars
terraform init
# Apply only ECR (takes ~30 seconds)
terraform apply -target=module.ecr
# Verify repos were created
aws ecr describe-repositories \
  --region ap-south-1 \
  --query 'repositories[*].repositoryName' \
  --output table
You should see 8 repos: zelly/fastify-nova, zelly/customer-panel, zelly/orion-backend, zelly/events-consumer, zelly/astro-storefront, zelly/caddy, zelly/bull-studio, zelly/redis-insight.
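If the table is hard to scan, the comparison can be scripted. The sketch below assumes the eight names listed above; missing_repos and EXPECTED_REPOS are illustrative names only:

```bash
# Expected repository names, copied from the list above.
EXPECTED_REPOS="zelly/fastify-nova zelly/customer-panel zelly/orion-backend zelly/events-consumer zelly/astro-storefront zelly/caddy zelly/bull-studio zelly/redis-insight"

# Reads the actual repository names (whitespace-separated) on stdin and prints
# each expected name that is absent. Prints nothing when all are present.
missing_repos() {
  actual=" $(tr -s '[:space:]' ' ') "
  for repo in $EXPECTED_REPOS; do
    case "$actual" in
      *" $repo "*) ;;
      *) echo "$repo" ;;
    esac
  done
}
```

To use it, pipe the describe-repositories command above into it with --output text instead of --output table, for example: aws ecr describe-repositories --region ap-south-1 --query 'repositories[*].repositoryName' --output text | missing_repos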
Phase 2 — Staging
Step 1 — Fill in terraform.tfvars
cd d:/zelly/terraform/environments/staging
cp terraform.tfvars.example terraform.tfvars
Edit terraform.tfvars and fill in:
subdomain_base defaults to zelly.in. Do not override it.
Step 2 — Terraform init & plan
terraform init
terraform plan -out=staging.tfplan
Expected: ~85 resources to create. Review and confirm nothing looks wrong before applying.
Step 3 — Apply
terraform apply staging.tfplan
Terraform will block on aws_acm_certificate_validation until ACM confirms the Cloudflare DNS CNAMEs — usually 2–5 minutes after the records are created automatically. If it times out, see Gotchas → ACM validation stuck.
Step 4 — Save outputs
terraform output
# Key values you'll need:
#   aurora_endpoint        DB host for migrations
#   bastion_public_ip      SSH / WireGuard entry point
#   clickhouse_private_ip  needed for schema apply
#   nova_alb_dns           verify after DNS propagation
Step 5 — Populate Secrets Manager
Infrastructure secrets (Aurora password, Redis auth token, ClickHouse password) are auto-generated and stored by Terraform. You only need to populate the application-level secrets that Terraform cannot know.
If these secrets do not exist when ECS tasks start, the tasks fail with CannotPullSecrets or ResourceInitializationError. See Gotchas → Secret missing at task start.
REGION=ap-southeast-1
# fastify-nova app secrets
aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/fastify-nova/env" \
  --secret-string '{
    "JWT_SECRET": "CHANGE_ME_strong_random_string",
    "COOKIE_SECRET": "CHANGE_ME_strong_random_string",
    "RAZORPAY_KEY_ID": "rzp_test_...",
    "RAZORPAY_KEY_SECRET": "...",
    "RAZORPAY_WEBHOOK_SECRET": "...",
    "SLACK_WEBHOOK_URL": "https://hooks.slack.com/services/...",
    "ID_ENCRYPTION_KEY": "CHANGE_ME_32_hex_chars",
    "LOG_URL_SECRET": "CHANGE_ME_64_hex_chars",
    "MARKETING_SECRET_KEY": "CHANGE_ME_64_hex_chars",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'
# customer-panel (neptune) app secrets
aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/customer-panel/env" \
  --secret-string '{
    "JWT_SECRET": "CHANGE_ME_strong_random_string",
    "COOKIE_SECRET": "CHANGE_ME_strong_random_string",
    "RAZORPAY_KEY_ID": "rzp_test_...",
    "RAZORPAY_KEY_SECRET": "...",
    "RAZORPAY_WEBHOOK_SECRET": "...",
    "SLACK_WEBHOOK_URL": "https://hooks.slack.com/services/...",
    "EXTERNAL_ADDRESS_API_KEY": "...",
    "SESSION_COOKIE_DOMAIN": "staging.zelly.in",
    "ID_ENCRYPTION_KEY": "CHANGE_ME_32_hex_chars",
    "LOG_URL_SECRET": "CHANGE_ME_64_hex_chars",
    "MARKETING_SECRET_KEY": "CHANGE_ME_64_hex_chars",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'
# orion-backend app secrets
aws secretsmanager create-secret \
  --region $REGION \
  --name "zelly/orion/env" \
  --secret-string '{
    "JWT_SECRET_KEY": "CHANGE_ME_strong_random_string",
    "MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
  }'
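The placeholders above encode the expected length of each value. Note that openssl rand -hex takes a byte count, and each byte becomes two hex characters, so a 32-character value needs 16 bytes. One possible helper (gen_hex is a hypothetical name) that takes the character count directly:

```bash
# Prints N random hex characters followed by a newline. N must be even.
gen_hex() {
  chars="$1"
  if [ $((chars % 2)) -ne 0 ]; then
    echo "gen_hex: length must be even" >&2
    return 1
  fi
  bytes=$((chars / 2))
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex "$bytes"
  else
    # Fallback for machines without openssl: read from the kernel RNG.
    od -An -tx1 -N "$bytes" /dev/urandom | tr -d ' \n'
    echo
  fi
}
```

For example, gen_hex 32 would suit ID_ENCRYPTION_KEY and gen_hex 64 the 64-character placeholders.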
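Generate random values with openssl rand -hex 32, which prints 64 hex characters (use openssl rand -hex 16 for the 32-character placeholders). Use test/sandbox Razorpay credentials in staging, never live keys. CORS_ORIGINS for orion-backend is a Terraform variable (orion_cors_origins), not a secret; set it in terraform.tfvars.
Step 6 — Run DB migrations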
Staging Aurora is publicly accessible — connect directly from your machine without WireGuard.
# Get the Aurora endpoint
AURORA=$(terraform output -raw aurora_endpoint)
# Get the generated master password from Secrets Manager
DB_PASS=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/aurora/staging \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")
# Create databases (one-time)
mysql -h "$AURORA" -u root -p"$DB_PASS" -e "
  CREATE DATABASE IF NOT EXISTS astro_primary;
  CREATE DATABASE IF NOT EXISTS ecom_store_front;
"
# Run migrations for each service
cd d:/zelly/backend-api-fastify-nova
DB_HOST="$AURORA" DB_USER=root DB_PASSWORD="$DB_PASS" DB_NAME=astro_primary npm run migrate
cd d:/zelly/customer-panel-neptune
DB_HOST="$AURORA" DB_USER=root DB_PASSWORD="$DB_PASS" DB_NAME=ecom_store_front npm run migrate
Step 7 — Apply ClickHouse schema
The ClickHouse EC2 is in a private subnet. Connect via SSH tunnel through the staging bastion.
# Return to the staging Terraform directory (Step 6 changed into the service repos)
cd d:/zelly/terraform/environments/staging
BASTION=$(terraform output -raw bastion_public_ip)
CH_IP=$(terraform output -raw clickhouse_private_ip)
CH_PASS=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/clickhouse/credentials \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")
# Open tunnel, apply schema, close tunnel
ssh -o StrictHostKeyChecking=no -L 8124:"$CH_IP":8123 ec2-user@"$BASTION" -N &
TUNNEL_PID=$!
sleep 3
cat d:/zelly/store-events-consumer/docker/clickhouse/init/clickhouse_init.sql \
  | curl -s -X POST \
    "http://127.0.0.1:8124/?user=default&password=${CH_PASS}" \
    --data-binary @-
# Verify
curl -s "http://127.0.0.1:8124/?user=default&password=${CH_PASS}&query=SELECT+1"
kill $TUNNEL_PID
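The fixed sleep 3 above can race with a slow SSH handshake. One alternative is to poll until the tunnel answers. This sketch assumes the ClickHouse HTTP /ping endpoint is reachable through the tunnel (it is enabled by default); wait_for_http is a hypothetical helper:

```bash
# Polls a URL until curl succeeds or the attempt budget runs out.
# Usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_http() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-1}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors as well as connection failures.
    if curl -sf -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "wait_for_http: $url not reachable after $attempts attempts" >&2
  return 1
}
```

In the script above, wait_for_http "http://127.0.0.1:8124/ping" 10 1 could stand in for the sleep, so the schema is posted only once the tunnel is actually up.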
Step 8 — Build & push Docker images
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ECR="${ACCOUNT}.dkr.ecr.ap-south-1.amazonaws.com"
TAG="staging-initial"
# Authenticate Docker to ECR (token lasts 12h)
aws ecr get-login-password --region ap-south-1 \
| docker login --username AWS --password-stdin "$ECR"
# Build and push each service
for SERVICE_DIR in \
"backend-api-fastify-nova:zelly/fastify-nova" \
"customer-panel-neptune:zelly/customer-panel" \
"internal-admin-panel-orion/backend:zelly/orion-backend" \
"store-events-consumer:zelly/events-consumer" \
"storefront-astro-titan:zelly/astro-storefront"; do
DIR="${SERVICE_DIR%%:*}"
REPO="${SERVICE_DIR##*:}"
IMAGE="${ECR}/${REPO}:${TAG}"
echo "==> Building ${REPO}"
docker build -t "$IMAGE" "d:/zelly/${DIR}"
docker push "$IMAGE"
done
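The loop relies on two shell parameter expansions to split each directory:repository pair. A small standalone illustration (split_pair is a hypothetical name used only here):

```bash
# Splits a "directory:repository" pair the same way the loop above does.
split_pair() {
  pair="$1"
  dir="${pair%%:*}"    # remove the longest suffix that starts at a ':'
  repo="${pair##*:}"   # remove the longest prefix that ends at a ':'
  printf '%s\n%s\n' "$dir" "$repo"
}

split_pair "internal-admin-panel-orion/backend:zelly/orion-backend"
# prints:
#   internal-admin-panel-orion/backend
#   zelly/orion-backend
```

Because the directory part may contain slashes but never a colon, a single colon is a safe separator for these pairs.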
Step 9 — Update tfvars with image tags & re-apply
nova_image_tag = "staging-initial"
customer_image_tag = "staging-initial"
orion_image_tag = "staging-initial"
consumer_image_tag = "staging-initial"
storefront_image_tag = "staging-initial"
# The Step 8 loop does not build zelly/caddy: push a caddy image with this tag before applying
caddy_image_tag = "staging-initial"
terraform apply -auto-approve
Because every ECS service sets lifecycle { ignore_changes = [task_definition] }, Terraform registers a new task definition revision but does NOT automatically redeploy running containers. Force a fresh deployment after apply; see How-To: Force ECS redeployment.
Step 10 — Verify staging
# All ECS services should show RUNNING
aws ecs list-tasks \
  --cluster zelly-staging \
  --region ap-southeast-1 \
  --query 'taskArns' --output table
# API health check
curl https://api.staging.zelly.in/health
# Tail logs live
aws logs tail /zelly/ecs/fastify-nova \
  --region ap-southeast-1 \
  --follow \
  --since 5m
Phase 3 — Production
Step 1 — Fill in terraform.tfvars
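cd d:/zelly/terraform
# terraform.tfvars was already created here in Phase 1. Copy the example only if the
# file is missing; running cp again would overwrite the values you filled in.
[ -f terraform.tfvars ] || cp terraform.tfvars.example terraform.tfvars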
Use production Razorpay (live) keys, production domain names, and strong unique secrets.
Step 2 — Plan & apply
terraform init
terraform plan -out=prod.tfplan
terraform apply prod.tfplan
Step 3 — Connect WireGuard VPN
Production Aurora and ClickHouse are in private subnets. All post-apply steps (migrations, ClickHouse schema) require WireGuard to be connected.
# Get your WireGuard client config from Terraform outputs (-raw prints it without quoting)
terraform output -raw wireguard_client_config_peer1
# Save the output to /etc/wireguard/wg0.conf (or paste into the WireGuard app)
sudo wg-quick up wg0
# Verify VPN is active (10.10.0.1 is the bastion VPN address)
ping 10.10.0.1
Step 4 — Populate Secrets Manager (production)
REGION=ap-south-1
aws secretsmanager create-secret \
--region $REGION \
--name "zelly/fastify-nova/env" \
--secret-string '{
"JWT_SECRET": "CHANGE_ME_production_secret",
"COOKIE_SECRET": "CHANGE_ME_production_secret",
"RAZORPAY_KEY_ID": "rzp_live_...",
"RAZORPAY_KEY_SECRET": "...",
"RAZORPAY_WEBHOOK_SECRET": "...",
"SLACK_WEBHOOK_URL": "https://hooks.slack.com/services/...",
"ID_ENCRYPTION_KEY": "CHANGE_ME_32_hex_chars",
"LOG_URL_SECRET": "CHANGE_ME_64_hex_chars",
"MARKETING_SECRET_KEY": "CHANGE_ME_64_hex_chars",
"MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
}'
aws secretsmanager create-secret \
--region $REGION \
--name "zelly/customer-panel/env" \
--secret-string '{
"JWT_SECRET": "CHANGE_ME_production_secret",
"COOKIE_SECRET": "CHANGE_ME_production_secret",
"RAZORPAY_KEY_ID": "rzp_live_...",
"RAZORPAY_KEY_SECRET": "...",
"RAZORPAY_WEBHOOK_SECRET": "...",
"SLACK_WEBHOOK_URL": "https://hooks.slack.com/services/...",
"EXTERNAL_ADDRESS_API_KEY": "...",
"SESSION_COOKIE_DOMAIN": "zelly.in",
"ID_ENCRYPTION_KEY": "CHANGE_ME_32_hex_chars",
"LOG_URL_SECRET": "CHANGE_ME_64_hex_chars",
"MARKETING_SECRET_KEY": "CHANGE_ME_64_hex_chars",
"MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
}'
aws secretsmanager create-secret \
--region $REGION \
--name "zelly/orion/env" \
--secret-string '{
"JWT_SECRET_KEY": "CHANGE_ME_production_secret",
"MARKETING_INTERNAL_TOKEN": "CHANGE_ME_64_hex_chars"
}'
Step 5 — Run DB migrations
Production Aurora is private — WireGuard must be connected.
AURORA=$(terraform output -raw aurora_endpoint)
DB_PASS=$(aws secretsmanager get-secret-value \
  --region ap-south-1 \
  --secret-id zelly/aurora/production \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")
mysql -h "$AURORA" -u root -p"$DB_PASS" -e "
  CREATE DATABASE IF NOT EXISTS astro_primary;
  CREATE DATABASE IF NOT EXISTS ecom_store_front;
"
cd d:/zelly/backend-api-fastify-nova
DB_HOST="$AURORA" DB_USER=root DB_PASSWORD="$DB_PASS" DB_NAME=astro_primary npm run migrate
cd d:/zelly/customer-panel-neptune
DB_HOST="$AURORA" DB_USER=root DB_PASSWORD="$DB_PASS" DB_NAME=ecom_store_front npm run migrate
Step 6 — Apply ClickHouse schema
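# Return to the production Terraform directory (Step 5 changed into the service repos)
cd d:/zelly/terraform
CH_IP=$(terraform output -raw clickhouse_private_ip)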
CH_PASS=$(aws secretsmanager get-secret-value \
--region ap-south-1 \
--secret-id zelly/clickhouse/credentials \
--query SecretString --output text \
| python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")
# Reachable directly via WireGuard — no tunnel needed
cat d:/zelly/store-events-consumer/docker/clickhouse/init/clickhouse_init.sql \
| curl -s -X POST \
"http://${CH_IP}:8123/?user=default&password=${CH_PASS}" \
--data-binary @-
Step 7 — Build & push Docker images
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ECR="${ACCOUNT}.dkr.ecr.ap-south-1.amazonaws.com"
TAG="prod-initial"
aws ecr get-login-password --region ap-south-1 \
| docker login --username AWS --password-stdin "$ECR"
for SERVICE_DIR in \
"backend-api-fastify-nova:zelly/fastify-nova" \
"customer-panel-neptune:zelly/customer-panel" \
"internal-admin-panel-orion/backend:zelly/orion-backend" \
"store-events-consumer:zelly/events-consumer" \
"storefront-astro-titan:zelly/astro-storefront"; do
DIR="${SERVICE_DIR%%:*}"
REPO="${SERVICE_DIR##*:}"
IMAGE="${ECR}/${REPO}:${TAG}"
echo "==> Building ${REPO}"
docker build -t "$IMAGE" "d:/zelly/${DIR}"
docker push "$IMAGE"
done
# Update tfvars with prod tag, re-apply
terraform apply -auto-approve
Step 8 — Verify production
# All tasks running
aws ecs list-tasks \
  --cluster zelly-production \
  --region ap-south-1
# Health checks
curl https://api.zelly.in/health
curl https://customer.zelly.in/health
# Events pipeline: verify ClickHouse has rows (via WireGuard).
# Reuses CH_PASS from Step 6, since the default user is password-protected.
CH_IP=$(terraform output -raw clickhouse_private_ip)
curl -s "http://${CH_IP}:8123/?user=default&password=${CH_PASS}&query=SELECT+count()+FROM+analytics.store_events"
Post-deployment checklist
| Check | Staging | Production |
|---|---|---|
| All ECS tasks show RUNNING | ✓ | ✓ |
| ALB health checks green (all 3 target groups) | ✓ | ✓ |
| ACM certs show ISSUED in AWS Console | ✓ | ✓ |
| GET /health returns 200 on nova, customer, orion ALB domains | ✓ | ✓ |
| Storefront: custom domain → valid cert served by Caddy | ✓ | ✓ |
| BullMQ: enqueue a job → job visible in Redis → consumed by events-consumer within 3s | ✓ | ✓ |
| ClickHouse: SELECT count() FROM analytics.store_events returns without error | ✓ | ✓ |
| CloudWatch log groups receiving logs under /zelly/ecs/ | ✓ | ✓ |
| GitHub Actions secrets: AWS_ROLE_ARN set in each service repo | ✓ | ✓ |
| Secrets Manager: all zelly/*/env secrets populated with real values | ✓ | ✓ |
GitHub Actions — set AWS_ROLE_ARN
CI/CD uses GitHub OIDC — no long-lived AWS keys. After terraform apply, set the role ARN as a repository secret in each service repo.
ROLE_ARN=$(cd d:/zelly/terraform && terraform output -raw github_actions_role_arn)
for REPO in \
  zelly-in/backend-api-fastify-nova \
  zelly-in/customer-panel-neptune \
  zelly-in/internal-admin-panel-orion \
  zelly-in/store-events-consumer \
  zelly-in/storefront-astro-titan; do
  gh secret set AWS_ROLE_ARN --body "$ROLE_ARN" --repo "$REPO"
done
After this, pushing to main triggers a production deploy; pushing to staging triggers a staging deploy. No manual image builds needed.
Production apply — pending steps (July 2026)
Staging was stabilised and hardened on 2026-07-03. The following Terraform changes have been applied to staging but not yet to production. Apply them after staging has been stable for at least one week (target: ~2026-07-10).
Do not run terraform apply on production without completing the import step first. The github_oidc module was moved from staging state to production config on 2026-07-03. The IAM role exists in AWS but is not yet in production Terraform state. Applying without importing will fail with EntityAlreadyExists.
What changed in the Terraform code
| Module | Change |
|---|---|
| module.github_oidc | Added to production main.tf. Removed from staging (account-wide resource should live in production state). IAM role already exists and must be imported before apply. |
| module.elasticache | num_cache_clusters increased from 1 to 2 with automatic_failover_enabled = true and multi_az_enabled = true. Production gets replication; staging stays single-node. |
| module.fastify_nova | health_check_grace_period_seconds = 60 and deployment_circuit_breaker added. |
| module.customer_panel | Same: grace period 60s + circuit breaker. |
| module.orion_backend | Same: grace period 60s + circuit breaker. |
| module.events_consumer | deployment_circuit_breaker added (no ALB so no grace period). |
| module.internal_tools | deployment_circuit_breaker added to both bull-studio and redis-insight services. |
| module.storefront | Already applied to staging on 2026-07-03 (rev 6). Production apply registers a new task def; force-redeploy needed after. |
Step 1 — Import the github_oidc resources into production state
The zelly-github-actions IAM role was previously managed by staging state. After removing it from staging, it is orphaned. Import it into production state before applying.
cd d:/zelly/terraform
terraform init
# Import the IAM role
terraform import module.github_oidc.aws_iam_role.github_actions zelly-github-actions
# Import the inline policy attached to that role
terraform import module.github_oidc.aws_iam_role_policy.github_actions \
  zelly-github-actions:zelly-github-actions-policy
# Verify: plan should show 0 changes for github_oidc after import
terraform plan -target=module.github_oidc
Set-Location "d:\zelly\terraform"
$tf = "C:\tools\terraform\terraform.exe"
& $tf init
& $tf import module.github_oidc.aws_iam_role.github_actions zelly-github-actions
& $tf import module.github_oidc.aws_iam_role_policy.github_actions `
  zelly-github-actions:zelly-github-actions-policy
# Quoted so PowerShell passes the flag to terraform as a single argument
& $tf plan "-target=module.github_oidc"
The plan should show 0 changes for module.github_oidc. If it shows any change, review the diff before applying: the policy content must not drift from what was deployed by staging.
Step 2 — Apply all pending modules
# ElastiCache: adds 1 read replica + enables automatic failover (~5 min, no downtime)
# ECS services: in-place update to add grace period + circuit breaker (no task restart)
# github_oidc: no-op after import
terraform apply \
  -target=module.github_oidc \
  -target=module.elasticache \
  -target=module.fastify_nova \
  -target=module.customer_panel \
  -target=module.orion_backend \
  -target=module.events_consumer \
  -target=module.internal_tools \
  -target=module.storefront
$tf = "C:\tools\terraform\terraform.exe"
$tfArgs = @("apply",
"-target=module.github_oidc",
"-target=module.elasticache",
"-target=module.fastify_nova",
"-target=module.customer_panel",
"-target=module.orion_backend",
"-target=module.events_consumer",
"-target=module.internal_tools",
"-target=module.storefront",
"-auto-approve")
& $tf @tfArgs
The ElastiCache change is applied immediately (apply_immediately = true); expect ~5 minutes for the replica to initialise. Redis connections are not interrupted during this change.
Step 3 — Force-redeploy storefront to pick up the new task definition
Storefront task definition gets a new revision (adds AUTH_HUB_URL env var). Because of lifecycle { ignore_changes = [task_definition] }, the service stays on the old revision until force-redeployed.
aws ecs update-service \
--cluster zelly-production \
--service storefront \
--force-new-deployment \
--region ap-south-1
# Wait for stable
aws ecs wait services-stable \
--cluster zelly-production \
--services storefront \
--region ap-south-1 && echo "Stable."
$aws = "C:\Program Files\Amazon\AWSCLIV2\aws.exe"
& $aws ecs update-service `
  --cluster zelly-production `
  --service storefront `
  --force-new-deployment `
  --region ap-south-1
Production apply checklist
| Step | Command | Done |
|---|---|---|
| Import github_oidc IAM role | terraform import module.github_oidc.aws_iam_role.github_actions zelly-github-actions | ☐ |
| Import github_oidc inline policy | terraform import module.github_oidc.aws_iam_role_policy.github_actions zelly-github-actions:zelly-github-actions-policy | ☐ |
| Verify plan shows 0 changes for github_oidc | terraform plan -target=module.github_oidc | ☐ |
| Apply all modules | terraform apply -target=module.github_oidc -target=module.elasticache ... | ☐ |
| Force-redeploy storefront | aws ecs update-service --cluster zelly-production --service storefront --force-new-deployment --region ap-south-1 | ☐ |
| Confirm storefront stable | aws ecs wait services-stable --cluster zelly-production --services storefront --region ap-south-1 | ☐ |
| Confirm ElastiCache has 2 nodes | AWS Console → ElastiCache → zelly-production → Nodes: 2 | ☐ |
| Confirm circuit breaker enabled on all services | aws ecs describe-services --cluster zelly-production --services fastify-nova customer-panel orion-backend events-consumer --region ap-south-1 --query 'services[*].{s:serviceName,cb:deploymentConfiguration.deploymentCircuitBreaker.enable}' | ☐ |
Gotchas
Things that have bitten us, collected here so they don't bite you.
ECR is always in ap-south-1, even for staging
Staging runs in ap-southeast-1 but ECR repos live in ap-south-1 (the production account). When building and pushing images for staging, always target the ap-south-1 registry. When Terraform builds image URIs for staging, it uses the account ID but hardcodes ap-south-1 as the registry region. If you run docker push to the wrong region, the image won't exist.
Always authenticate with: aws ecr get-login-password --region ap-south-1
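A mistyped region or tag only shows up later as CannotPullContainerError, so it can be worth checking that the tag exists before applying. A sketch, assuming the registry region stated above; ecr_image_uri and image_exists are hypothetical helpers:

```bash
ECR_REGION="ap-south-1"   # registry region for both environments, per this guide

# Prints the full image URI for an account ID, repository, and tag.
ecr_image_uri() {
  account="$1"; repo="$2"; tag="$3"
  echo "${account}.dkr.ecr.${ECR_REGION}.amazonaws.com/${repo}:${tag}"
}

# Returns 0 if the tag exists in the ap-south-1 registry, non-zero otherwise.
image_exists() {
  aws ecr describe-images \
    --region "$ECR_REGION" \
    --repository-name "$1" \
    --image-ids imageTag="$2" >/dev/null 2>&1
}
```

For example, image_exists zelly/fastify-nova staging-initial || echo "push the image first" would flag a missing tag before ECS tries to pull it.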
subdomain_base must always be zelly.in — never storego.in
The SUBDOMAIN_BASE env var in fastify-nova controls how storefront subdomain routing works. It defaults to zelly.in in Terraform. If you accidentally set it to storego.in anywhere, storefront tenant lookups will break silently — the app won't error, it just won't find the right tenant.
ACM validation CNAMEs must NOT be proxied through Cloudflare
Terraform automatically creates the Cloudflare CNAME records for ACM DNS validation with proxied = false. If you manually flip those records to proxied (orange cloud), ACM cannot complete validation and the cert will stay in PENDING_VALIDATION forever.
The records have the comment "ACM certificate validation — do not delete". Leave them alone after creation.
Secrets must exist before ECS tasks start — task fails with ResourceInitializationError
ECS pulls secrets from Secrets Manager at task startup via the execution role. If a referenced secret does not exist, the task fails immediately with:
ResourceInitializationError: unable to pull secrets or registry auth: execution role does not have permissions or the secret does not exist
Always populate zelly/fastify-nova/env, zelly/customer-panel/env, and zelly/orion/env before running terraform apply for the ECS services.
To check what stopped a task: see How-To: Debug a stopped task.
Terraform apply won't redeploy running ECS tasks
All ECS services have lifecycle { ignore_changes = [task_definition] }. This means terraform apply registers a new task definition revision but the running service stays on the old one. You must force a redeployment to roll out changes.
aws ecs update-service \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service fastify-nova \
  --force-new-deployment
This is intentional — it prevents Terraform from accidentally restarting services during infrastructure-only changes (e.g., updating a security group rule).
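When several services need the same rollout, the command above can be wrapped in a loop. This is a sketch only; redeploy_services is a hypothetical helper, and the service names in the usage line are the ones this guide mentions:

```bash
# Forces a new deployment for each named service, then waits for all of them.
# Usage: redeploy_services CLUSTER REGION SERVICE [SERVICE...]
redeploy_services() {
  cluster="$1"; region="$2"; shift 2
  for svc in "$@"; do
    echo "==> redeploying $svc"
    aws ecs update-service \
      --region "$region" --cluster "$cluster" \
      --service "$svc" --force-new-deployment >/dev/null || return 1
  done
  # Blocks until every listed service reports a steady state.
  aws ecs wait services-stable \
    --region "$region" --cluster "$cluster" --services "$@"
}
```

A typical call might be redeploy_services zelly-staging ap-southeast-1 fastify-nova customer-panel orion-backend.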
Secrets Manager: can't immediately recreate a deleted secret
When you delete a secret, AWS does not remove it at once: it schedules deletion after a recovery window (7 to 30 days; the AWS default is 30). Creating a secret with the same name during that window fails. To force-delete without the recovery window:
aws secretsmanager delete-secret \
  --region ap-southeast-1 \
  --secret-id "zelly/fastify-nova/env" \
  --force-delete-without-recovery
To update a secret's value without deleting it, use put-secret-value instead — see How-To: Update a secret.
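For scripts that should work whether or not the secret already exists, one approach is to probe first and then choose between create-secret and put-secret-value. A sketch; upsert_secret is a hypothetical helper:

```bash
# Creates the secret if it does not exist, otherwise writes a new version.
# Usage: upsert_secret REGION NAME JSON_FILE
upsert_secret() {
  region="$1"; name="$2"; file="$3"
  if aws secretsmanager describe-secret \
       --region "$region" --secret-id "$name" >/dev/null 2>&1; then
    aws secretsmanager put-secret-value \
      --region "$region" --secret-id "$name" \
      --secret-string "file://$file"
  else
    aws secretsmanager create-secret \
      --region "$region" --name "$name" \
      --secret-string "file://$file"
  fi
}
```

This does not handle a secret that is still inside its recovery window; in that case restore it first with aws secretsmanager restore-secret, or force-delete it as shown above.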
Aurora version 8.0.mysql_aurora.3.05.2 is not available in ap-southeast-1
Aurora MySQL Serverless v2 version 8.0.mysql_aurora.3.05.2 is not available in Singapore (ap-southeast-1). The Terraform module uses 8.0.mysql_aurora.3.04.1 for staging. Do not try to upgrade this — the apply will fail with Cannot find version.
Production (ap-south-1) can use the higher version if needed, but this has not been tested.
AL2023 root volume minimum is 30 GB
The Amazon Linux 2023 AMI used by the bastion and ClickHouse modules has a 30 GB root snapshot, and an EBS root volume cannot be smaller than the snapshot it is created from. If you set volume_size = 20 for bastion or ClickHouse EC2 instances, the apply fails with InvalidBlockDeviceMapping. Both modules are correctly set to 30 GB; do not lower them.
ElastiCache and Security Group descriptions: ASCII only
AWS does not accept non-ASCII characters in ElastiCache cluster descriptions or Security Group descriptions. Em-dashes (—), smart quotes, and other Unicode punctuation cause InvalidParameterValue errors. Always use plain hyphens in these fields.
This most commonly bites when copy-pasting descriptions from a Markdown file or a macOS editor that auto-converts hyphens to em-dashes.
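A quick pre-flight check can catch this before terraform apply does. The helper below (is_printable_ascii, a hypothetical name) rejects anything outside the printable ASCII range, which may be stricter than what AWS actually requires:

```bash
# Returns 0 if the string contains only printable ASCII (space through tilde).
is_printable_ascii() {
  # In the C locale the bracket range is evaluated byte by byte, so every byte
  # of a multi-byte UTF-8 character falls outside it and triggers a match.
  if printf '%s' "$1" | LC_ALL=C grep -q '[^ -~]'; then
    return 1
  fi
  return 0
}
```

For example, is_printable_ascii "$description" || echo "non-ASCII character in description" could run over each description string before it is pasted into a module.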
ElastiCache Redis TLS: port 6379, but the client needs tls: {} option
ElastiCache with transit_encryption_enabled = true still listens on port 6379; the port doesn't change. What changes is that the Redis client must use TLS when connecting. In Node.js, pass tls: {} in the ioredis connection options; with @redis/client (node-redis v4), use a rediss:// URL or socket: { tls: true }. Without TLS enabled on the client, the connection hangs silently because the server expects TLS but receives plaintext.
Cloudflare Terraform provider v4: use content, not value
The cloudflare/cloudflare provider v4 renamed the DNS record field from value to content. Using value will cause a validation error or a deprecation warning that breaks plans. All records in the Terraform modules already use content — if you add a new record manually, make sure to use content.
Never put secrets in terraform.tfvars or ECS environment blocks
All sensitive values live in AWS Secrets Manager only. ECS task definitions reference them via the secrets block (which injects them as env vars at container start via the execution role). Plaintext env vars in the environment block are visible in the ECS console, CloudWatch logs, and Terraform state — never put passwords, tokens, or keys there.
Non-sensitive config that happens to look like a secret (e.g. CLICKHOUSE_USER=default) is fine in environment.
Fargate does not use AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
ECS Fargate tasks use an IAM task role for AWS API access. Temporary credentials are served by the ECS container credentials endpoint, and the AWS SDK's default credential chain picks them up automatically, so no static keys are needed or wanted. The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are not set inside the task; if app code requires them explicitly, change it to rely on the SDK's default chain. Never add these to the task definition environment or Secrets Manager.
Terraform for_each fails with "keys derived from resource attributes unknown until apply"
This error occurs when ACM certificate domain_validation_options are passed through a module output and then used as for_each keys. AWS doesn't know the CNAME names until the cert is created, so the keys are "unknown at plan time" and Terraform refuses to plan.
The fix: ACM certificates must be created in the root config (not inside the alb module), so the domain_name — which IS known at plan time — can be used as the for_each key. This is already the case in both terraform/main.tf and environments/staging/main.tf. Do not move cert resources back into the alb module.
orion-backend CORS_ORIGINS is a Terraform variable, not a secret
CORS_ORIGINS for the orion-backend is the URL of the Cloudflare Pages frontend (seller panel). It is not sensitive, so it lives as a Terraform variable (orion_cors_origins) rather than in Secrets Manager. Set it in terraform.tfvars; the default is https://seller.zelly.in.
For staging you may want to point it at the Cloudflare Pages preview URL instead.
Neptune (customer-panel) shares almost all env vars with fastify-nova
Neptune runs the same backend codebase as fastify-nova in a split-deployment topology (DEPLOYMENT_TOPOLOGY=split, APP_ROLE=api). As a result, its task definition is nearly identical to nova — same Aurora/Redis/ClickHouse secrets, same payment, Slack, logging, and marketing env vars. Its zelly/customer-panel/env secret in Secrets Manager must contain all the same keys as zelly/fastify-nova/env.
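One way to confirm that requirement is to compare the key sets of the two secrets after saving each SecretString to a file. A sketch, assuming python3 is available (this guide already uses it to parse secrets); missing_keys is a hypothetical helper:

```bash
# Prints every top-level key present in the first JSON file but absent from
# the second. Prints nothing when the second file covers all of them.
missing_keys() {
  python3 - "$1" "$2" <<'PY'
import json
import sys

with open(sys.argv[1]) as f:
    reference = json.load(f)
with open(sys.argv[2]) as f:
    candidate = json.load(f)

for key in sorted(set(reference) - set(candidate)):
    print(key)
PY
}
```

Running missing_keys nova-env.json customer-env.json (file names are illustrative) should print nothing when the customer-panel secret is complete. The check is one-directional on purpose, since customer-panel also carries keys such as SESSION_COOKIE_DOMAIN that nova does not have. Delete the temporary files afterwards.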
Kafka is completely removed — do not re-add it
The analytics pipeline previously used Kafka (KafkaJS in fastify-nova, a Kafka consumer in store-events-consumer). Kafka and all Kafka-related env vars (KAFKA_BROKERS, KAFKA_CLIENT_ID, KAFKA_TOPIC, KAFKA_SSL, etc.) have been removed. The pipeline is now BullMQ + ElastiCache Redis. If you see Kafka config in old .env files, ignore it.
How-Tos
Common operational tasks after the initial deployment.
Update a secret value in Secrets Manager
Use put-secret-value (not create-secret) to update an existing secret. After updating, force-redeploy the affected service so it picks up the new value.
# Fetch current value, edit, put back
aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id "zelly/fastify-nova/env" \
  --query SecretString --output text > /tmp/nova-env.json
# Edit /tmp/nova-env.json, then:
aws secretsmanager put-secret-value \
  --region ap-southeast-1 \
  --secret-id "zelly/fastify-nova/env" \
  --secret-string file:///tmp/nova-env.json
rm /tmp/nova-env.json
Force ECS service redeployment
Needed after: updating Secrets Manager, changing env vars in the task definition, pushing a new Docker image without updating the image tag in tfvars.
SERVICE="fastify-nova"    # or customer-panel, orion-backend, etc.
CLUSTER="zelly-staging"   # or zelly-production
REGION="ap-southeast-1"   # or ap-south-1
aws ecs update-service \
  --region $REGION \
  --cluster $CLUSTER \
  --service $SERVICE \
  --force-new-deployment
# Watch the rollout; "Stable." prints only if the wait succeeds
aws ecs wait services-stable \
  --region $REGION \
  --cluster $CLUSTER \
  --services $SERVICE && echo "Stable."
Debug a stopped ECS task
When ECS tasks fail to start, the stopped reason tells you exactly why.
CLUSTER="zelly-staging"
REGION="ap-southeast-1"
# List the most recently stopped tasks
aws ecs list-tasks \
  --region $REGION \
  --cluster $CLUSTER \
  --desired-status STOPPED \
  --query 'taskArns[0:5]' --output json
# Describe a stopped task to see stoppedReason + container exit codes
TASK_ARN="arn:aws:ecs:..."   # paste from above
aws ecs describe-tasks \
  --region $REGION \
  --cluster $CLUSTER \
  --tasks "$TASK_ARN" \
  --query 'tasks[0].{stopped:stoppedReason,containers:containers[*].{name:name,exit:exitCode,reason:reason}}'
Common stopped reasons
| Stopped reason | Cause | Fix |
|---|---|---|
| ResourceInitializationError: unable to pull secrets | Secret doesn't exist in Secrets Manager or execution role lacks permission | Create the secret; check the execution role policy covers zelly/* |
| CannotPullContainerError | Image doesn't exist in ECR, or ECR is in the wrong region | Push the image; verify image URI uses ap-south-1 ECR endpoint |
| Essential container exited / exit code 1 | App crashed on startup (bad config, failed DB connection, etc.) | Check CloudWatch logs: aws logs tail /zelly/ecs/fastify-nova --follow |
| Task failed ELB health checks | ALB health check path returns non-2xx or task isn't listening on the expected port | Ensure /health endpoint returns 200; check container port mapping |
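The table can also be encoded as a small lookup for use in scripts. A sketch; stopped_reason_hint is a hypothetical helper and the hints simply restate the table:

```bash
# Maps an ECS stoppedReason string to the first thing to check.
stopped_reason_hint() {
  case "$1" in
    *ResourceInitializationError*)
      echo "check that the secret exists and the execution role can read zelly/*" ;;
    *CannotPullContainerError*)
      echo "check that the image tag exists in the ap-south-1 ECR registry" ;;
    *"ELB health checks"*)
      echo "check that /health returns 200 and the container port mapping is correct" ;;
    *"Essential container"*)
      echo "check the service's CloudWatch logs for a startup crash" ;;
    *)
      echo "no known hint; read the full stoppedReason and container exit codes" ;;
  esac
}
```

It can be fed the stopped field returned by the describe-tasks query above.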
Tail ECS logs live
# Staging
aws logs tail /zelly/ecs/fastify-nova \
  --region ap-southeast-1 \
  --follow \
  --since 10m
# Production
aws logs tail /zelly/ecs/fastify-nova \
  --region ap-south-1 \
  --follow \
  --since 10m
# Filter to errors only
aws logs tail /zelly/ecs/fastify-nova \
  --region ap-southeast-1 \
  --follow \
  --filter-pattern "ERROR"
Roll back to a previous task definition revision
ECS keeps all task definition revisions. To roll back, update the service to use an older revision.
# List recent revisions for a service
aws ecs list-task-definitions \
  --region ap-southeast-1 \
  --family-prefix "zelly-staging-fastify-nova" \
  --sort DESC \
  --query 'taskDefinitionArns[0:5]' --output table
# Roll back to a specific revision (replace 42 with the revision number)
aws ecs update-service \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service fastify-nova \
  --task-definition "zelly-staging-fastify-nova:42"
Scale an ECS service manually
Auto-scaling handles traffic-based scaling, but you can also set desired count directly. Useful for scaling to zero during off-hours on staging.
# Scale down staging overnight
aws ecs update-service \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service fastify-nova \
  --desired-count 0
# Scale back up
aws ecs update-service \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service fastify-nova \
  --desired-count 1
SSH into the bastion host
BASTION=$(cd d:/zelly/terraform/environments/staging && terraform output -raw bastion_public_ip)
ssh -i ~/.ssh/id_rsa ec2-user@"$BASTION"
# Or use SSM Session Manager (no open port 22 needed)
INSTANCE_ID=$(cd d:/zelly/terraform/environments/staging && terraform output -raw bastion_instance_id)
aws ssm start-session \
--region ap-southeast-1 \
--target "$INSTANCE_ID"
Access Bull Studio (BullMQ UI) via SSH tunnel
Bull Studio runs as an internal ECS Fargate task with no public ALB. Access it via an SSH tunnel through the bastion.
BASTION=$(cd d:/zelly/terraform/environments/staging && terraform output -raw bastion_public_ip)
# Find the Bull Studio task private IP
BULL_IP=$(aws ecs list-tasks \
  --region ap-southeast-1 \
  --cluster zelly-staging \
  --service-name bull-studio \
  --query 'taskArns[0]' --output text \
  | xargs -I{} aws ecs describe-tasks \
    --region ap-southeast-1 \
    --cluster zelly-staging \
    --tasks {} \
    --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
    --output text)
# Open tunnel, then visit http://localhost:3001 in your browser
ssh -L 3001:"$BULL_IP":3000 ec2-user@"$BASTION" -N
Access RedisInsight via SSH tunnel
BASTION=$(cd d:/zelly/terraform/environments/staging && terraform output -raw bastion_public_ip)
REDIS_INSIGHT_IP=$(aws ecs list-tasks \
--region ap-southeast-1 \
--cluster zelly-staging \
--service-name redis-insight \
--query 'taskArns[0]' --output text \
| xargs -I{} aws ecs describe-tasks \
--region ap-southeast-1 \
--cluster zelly-staging \
--tasks {} \
--query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
--output text)
# Open tunnel — then visit http://localhost:5540 in your browser
ssh -L 5540:"$REDIS_INSIGHT_IP":5540 ec2-user@"$BASTION" -N
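Both tunnel recipes repeat the same two-step lookup (list the service's tasks, then describe the first one). A minimal sketch of a reusable helper follows; the function name `task_private_ip` and its argument order are hypothetical. Unlike the `xargs` pipeline, it stops with an error when the service has no running task, a case in which the CLI prints `None` for `--output text`.

```shell
# Hypothetical helper: print the private IP of the first task of an ECS service.
# Usage: task_private_ip <region> <cluster> <service>
task_private_ip() {
  tp_arn=$(aws ecs list-tasks \
    --region "$1" --cluster "$2" --service-name "$3" \
    --query 'taskArns[0]' --output text) || return 1
  # An empty task list yields the literal string "None" in text output.
  if [ -z "$tp_arn" ] || [ "$tp_arn" = "None" ]; then
    echo "no running task for service $3 in cluster $2" >&2
    return 1
  fi
  aws ecs describe-tasks \
    --region "$1" --cluster "$2" --tasks "$tp_arn" \
    --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
    --output text
}

# Usage:
#   REDIS_INSIGHT_IP=$(task_private_ip ap-southeast-1 zelly-staging redis-insight) || exit 1
#   ssh -L 5540:"$REDIS_INSIGHT_IP":5540 ec2-user@"$BASTION" -N
```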
Connect to Redis from the bastion
REDIS_HOST=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/redis/auth \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['host'])")
REDIS_PASS=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/redis/auth \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['auth_token'])")
redis-cli -h "$REDIS_HOST" -p 6379 -a "$REDIS_PASS" --tls PING
Connect to Aurora from the bastion
AURORA_HOST=$(aws secretsmanager get-secret-value \
  --region ap-south-1 \
  --secret-id zelly/aurora/production \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['host'])")
DB_PASS=$(aws secretsmanager get-secret-value \
  --region ap-south-1 \
  --secret-id zelly/aurora/production \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")
mysql -h "$AURORA_HOST" -u root -p"$DB_PASS"
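The Redis and Aurora recipes each call Secrets Manager twice for the same secret. One way to avoid the duplicate call is to fetch the secret once and extract fields from the cached JSON. A minimal sketch, assuming `python3` is available as in the commands above; the function name `json_field` is hypothetical.

```shell
# Hypothetical helper: read a JSON object on stdin and print one top-level field.
json_field() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)[sys.argv[1]])' "$1"
}

# Usage (one Secrets Manager call, two fields):
#   SECRET=$(aws secretsmanager get-secret-value \
#     --region ap-south-1 \
#     --secret-id zelly/aurora/production \
#     --query SecretString --output text)
#   AURORA_HOST=$(printf '%s' "$SECRET" | json_field host)
#   DB_PASS=$(printf '%s' "$SECRET" | json_field password)
```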
Run a one-off migration against staging Aurora
Use this for ad-hoc SQL that must run after the initial migration, such as schema patches or data fixes.
AURORA=$(cd d:/zelly/terraform/environments/staging && terraform output -raw aurora_endpoint)
DB_PASS=$(aws secretsmanager get-secret-value \
  --region ap-southeast-1 \
  --secret-id zelly/aurora/staging \
  --query SecretString --output text \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['password'])")
# Run a SQL file
mysql -h "$AURORA" -u root -p"$DB_PASS" astro_primary < patch.sql
# Or interactive session
mysql -h "$AURORA" -u root -p"$DB_PASS"
Check BullMQ queue depth from Redis CLI
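# After connecting to redis-cli (see above), run:
# List all BullMQ queues (KEYS walks the entire keyspace, so avoid it on a busy production instance)
KEYS bull:*:meta
# Count jobs in a specific queue, by state
# Waiting and active jobs are lists
LLEN bull:store-events:wait
LLEN bull:store-events:active
# Delayed and failed jobs are sorted sets
ZCOUNT bull:store-events:delayed -inf +inf
ZCOUNT bull:store-events:failed -inf +inf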
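The four counts can also be collected non-interactively from the bastion shell. A minimal sketch follows; the function name `queue_depth` and the `REDIS_CLI` override variable are hypothetical, and it assumes the same key layout as the commands above. Connection options are passed through after the queue name.

```shell
# Hypothetical helper: print the job count for each state of one BullMQ queue.
# Usage: queue_depth <queue> [redis-cli connection options...]
queue_depth() {
  qd_queue="$1"
  shift
  # wait and active are lists, so LLEN gives their length.
  for qd_state in wait active; do
    printf '%s=%s\n' "$qd_state" \
      "$("${REDIS_CLI:-redis-cli}" "$@" LLEN "bull:$qd_queue:$qd_state")"
  done
  # delayed and failed are sorted sets, counted across the full score range.
  for qd_state in delayed failed; do
    printf '%s=%s\n' "$qd_state" \
      "$("${REDIS_CLI:-redis-cli}" "$@" ZCOUNT "bull:$qd_queue:$qd_state" -inf +inf)"
  done
}

# Usage:
#   queue_depth store-events -h "$REDIS_HOST" -p 6379 -a "$REDIS_PASS" --tls
```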
Terraform: apply changes to a single module
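When you change only one service's task definition, use -target to limit the apply scope and avoid accidentally touching unrelated resources. Terraform itself warns that -target is meant for exceptional situations, so run a full terraform plan afterwards to confirm nothing else has drifted.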
# Apply only the fastify-nova service module
terraform apply -target module.fastify_nova
# Apply only the ECS cluster (e.g. after adding a log group)
terraform apply -target module.ecs_cluster
# Apply only security groups
terraform apply -target module.security_groups
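To target several modules at once and review the result before anything changes, the plan can be saved to a file and applied separately. A minimal sketch, assuming it is run from an environment directory such as the staging one used above; the function name `plan_modules`, the `TERRAFORM` override variable, and the plan file name `targeted.tfplan` are all hypothetical.

```shell
# Hypothetical helper: write a saved plan limited to the named modules.
# Usage: plan_modules <module> [<module>...]
plan_modules() {
  if [ "$#" -eq 0 ]; then
    echo "usage: plan_modules <module> [<module>...]" >&2
    return 1
  fi
  pm_flags=""
  for pm_module in "$@"; do
    pm_flags="$pm_flags -target=module.$pm_module"
  done
  # $pm_flags is left unquoted on purpose so each -target flag becomes its own argument.
  "${TERRAFORM:-terraform}" plan $pm_flags -out=targeted.tfplan
}

# Usage:
#   plan_modules fastify_nova security_groups
#   terraform apply targeted.tfplan   # after reviewing the plan output
```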