Runbooks & Troubleshooting
Viewing logs
Production / Staging (CloudWatch)
All ECS services write to CloudWatch. Log groups follow the pattern /zelly/ecs/<service-name> (30-day retention).
# Tail and follow logs from fastify-nova
aws logs tail /zelly/ecs/fastify-nova \
  --follow \
  --region ap-south-1

# Filter for errors in the last hour (date -d requires GNU date)
aws logs filter-log-events \
  --log-group-name /zelly/ecs/fastify-nova \
  --filter-pattern "ERROR" \
  --region ap-south-1 \
  --start-time $(date -d '1 hour ago' +%s000)
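The `date -d '1 hour ago'` form is a GNU extension and fails with the BSD `date` shipped on macOS. A minimal portable sketch follows; `ms_ago` and `log_group` are hypothetical helper names, not part of the existing tooling.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the epoch time in milliseconds for N seconds ago.
# Uses shell arithmetic, so it works with both GNU and BSD date.
ms_ago() {
  echo "$(( ($(date +%s) - $1) * 1000 ))"
}

# Build the CloudWatch log group name from the /zelly/ecs/<service-name> pattern.
log_group() {
  printf '/zelly/ecs/%s\n' "$1"
}

# Usage (requires AWS credentials):
#   aws logs filter-log-events \
#     --log-group-name "$(log_group fastify-nova)" \
#     --filter-pattern "ERROR" \
#     --region ap-south-1 \
#     --start-time "$(ms_ago 3600)"
```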
Local dev
Connecting to staging Aurora (direct)
Staging Aurora is publicly accessible — no VPN needed.
# Get endpoint from terraform outputs
cd terraform/environments/staging && terraform output aurora_endpoint

# Connect
mysql -h <aurora_endpoint> -u root -p
# Password is in terraform/local-dev/.env (DB_PASSWORD) or Secrets Manager
Connecting to production Aurora (via WireGuard)
Production Aurora is in a private subnet. You must connect WireGuard first.
# Get your WireGuard client config from terraform outputs
cd terraform && terraform output wireguard_client_config_peer1

# Save to /etc/wireguard/wg0.conf, then connect:
sudo wg-quick up wg0

# Now connect to Aurora via private IP
mysql -h <aurora_private_endpoint> -u root -p
Runbook — Restart an ECS service
aws ecs update-service \
  --cluster zelly-production \
  --service zelly-production-fastify-nova \
  --force-new-deployment \
  --region ap-south-1
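The update call returns as soon as the deployment is registered, not when the new tasks are healthy. A sketch that follows the restart with `aws ecs wait services-stable`, which polls until the service settles or the waiter times out; `restart_and_wait` is a hypothetical wrapper name.

```shell
# Sketch: force a new deployment, then block until the service is stable.
# Arguments: <cluster> <service>. The region matches the runbook above.
restart_and_wait() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --force-new-deployment \
    --region ap-south-1 \
    --query 'service.serviceName' \
    --output text &&
  aws ecs wait services-stable \
    --cluster "$1" \
    --services "$2" \
    --region ap-south-1
}

# Usage:
#   restart_and_wait zelly-production zelly-production-fastify-nova
```

A non-zero exit from the waiter means the service did not stabilise in time; check the service events and CloudWatch logs in that case.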
Runbook — Clear a stuck BullMQ queue
# Connect to Redis
redis-cli -h <elasticache_endpoint> -a <REDIS_PASSWORD>

# List all BullMQ keys for a queue (run inside redis-cli)
KEYS bull:store-events:*

# Drain failed jobs (run inside redis-cli)
DEL bull:store-events:failed

# Or use the BullMQ dashboard (internal_tools service, access via WireGuard)
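`KEYS` walks the whole keyspace in one blocking call, which can stall a busy Redis. A gentler sketch uses `redis-cli --scan`, which iterates incrementally. `REDIS_HOST` and `REDIS_PASSWORD` are assumed environment variables and the function names are hypothetical. The `ZCARD` call assumes the failed set is a sorted set, as in current BullMQ versions.

```shell
# Sketch: inspect a BullMQ queue without blocking Redis.
# REDISCLI_AUTH keeps the password out of the process argument list.

# List keys for a queue using incremental SCAN instead of KEYS.
queue_keys() {
  REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -h "$REDIS_HOST" \
    --scan --pattern "bull:$1:*"
}

# Count entries in the failed set before deleting it (assumes a sorted set).
failed_count() {
  REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -h "$REDIS_HOST" \
    ZCARD "bull:$1:failed"
}

# Usage:
#   queue_keys store-events
#   failed_count store-events
```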
Runbook — Run DB migrations
# Staging: connect directly (Aurora is public)
cd backend-api-fastify-nova
DB_HOST=<staging_aurora_endpoint> npm run migrate

# Production: connect WireGuard first, then run locally
DB_HOST=<prod_aurora_private_endpoint> npm run migrate
For ALTER TABLE on large tables, use pt-online-schema-change or gh-ost to avoid lock contention.
Runbook — Apply ClickHouse schema
The init SQL only runs on a fresh ClickHouse volume. To re-apply manually:
cat store-events-consumer/docker/clickhouse/init/clickhouse_init.sql \
| curl -s -X POST \
"http://<clickhouse_host>:8123/?user=default&password=<CLICKHOUSE_PASSWORD>" \
--data-binary @-
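After re-applying the schema it helps to confirm that the tables exist. The sketch below runs a query over the same HTTP interface, passing credentials in the `X-ClickHouse-User` and `X-ClickHouse-Key` headers so that the password stays out of the URL. `CLICKHOUSE_HOST` and `CLICKHOUSE_PASSWORD` are assumed environment variables and `ch_query` is a hypothetical helper name.

```shell
# Sketch: run a single query against the ClickHouse HTTP interface.
# Credentials go in headers rather than the query string.
ch_query() {
  curl -sS -X POST "http://$CLICKHOUSE_HOST:8123/" \
    -H "X-ClickHouse-User: default" \
    -H "X-ClickHouse-Key: $CLICKHOUSE_PASSWORD" \
    --data-binary "$1"
}

# Usage:
#   ch_query "SHOW TABLES"
```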
Troubleshooting — Local Dev
Service shows MISSING REQUIRED CONFIG on startup
terraform/local-dev/.env is either missing or has empty values for DB_HOST, DB_USER, or DB_PASSWORD.
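A quick way to see which value is the culprit is to scan the file for the three keys. This is a minimal sketch, assuming plain `KEY=value` lines with optional quotes; `check_env` is a hypothetical helper, not part of `zdev`.

```shell
# Sketch: print each required key that is missing or empty in an env file.
# Returns non-zero if the file is absent or any key is unset.
check_env() {
  env_file=${1:-terraform/local-dev/.env}
  missing=0
  if [ ! -f "$env_file" ]; then
    echo "missing file: $env_file"
    return 1
  fi
  for key in DB_HOST DB_USER DB_PASSWORD; do
    # A key counts as set only if KEY= is followed by a real character,
    # so KEY= and KEY="" are both reported as empty.
    if ! grep -Eq "^${key}=[\"']?[^\"' ]" "$env_file"; then
      echo "$key"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_env                      # checks terraform/local-dev/.env
#   check_env path/to/other.env
```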
fastify-nova exits with ERR_MODULE_NOT_FOUND (service-account-creds-private.json)
The Firebase service account file is missing from backend-api-fastify-nova/. Get it from 1Password, place it at backend-api-fastify-nova/service-account-creds-private.json, then rebuild:
ClickHouse unhealthy — nova / seller-panel / events-consumer stuck in "Created"
ClickHouse listens on IPv4 (0.0.0.0:8123) but Docker Desktop resolves localhost to ::1 (IPv6) first. The healthcheck already uses 127.0.0.1. Verify ClickHouse is responding:
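ClickHouse's HTTP interface has a built-in `/ping` endpoint that answers `Ok.` when the server is up. A minimal sketch, assuming port 8123 is published on the host; `ch_ping` is a hypothetical helper name.

```shell
# Sketch: query the ClickHouse /ping endpoint over IPv4.
# Uses 127.0.0.1 rather than localhost to sidestep IPv6 resolution.
# Optional argument: host port (defaults to 8123).
ch_ping() {
  curl -s --max-time 5 "http://127.0.0.1:${1:-8123}/ping"
}

# Usage:
#   ch_ping
```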
If it prints Ok. but Docker still shows unhealthy, force-recreate:
*.test domains return "connection refused" or don't resolve
- Confirm the hosts file entry is present: ping zelly-nova.test should resolve to 127.0.0.1.
- Confirm port 80 is free (Windows: stop IIS; see Local Dev setup).
- Confirm the proxy container is running: zdev ps should show zelly-proxy as Up.
- Check Caddy logs: zdev logs proxy.
zelly-seller.test or zelly-admin.test returns 403 Forbidden
Vite 5+ rejects requests whose Host header is not in its allowlist. The Caddyfile already contains header_up Host "localhost" for those routes. If you see 403, reload Caddy:
seller-panel or orion-frontend crashes with SIGBUS in a loop
node:22-alpine (musl libc) conflicts with esbuild's pre-built native binary. Both services use node:22-slim (Debian/glibc). If the named volume has a stale esbuild binary from a previous Alpine run:
Port 80 already in use on Windows — proxy container won't start
IIS binds port 80 by default. Run once as Administrator:
If neither service exists, check with netstat -ano | findstr :80 and kill the PID shown.
npm install runs on every start
This should only happen after the node_modules named volume was deleted (e.g. after zdev down -v). It is a one-time cost per fresh volume: the install takes 2–5 minutes and is cached until the volume is deleted again.
ClickHouse init SQL not running
The init SQL only runs on a fresh volume. If ClickHouse has already started once, the schema is skipped. To force re-init:
Troubleshooting — ECS / Production
ECS task fails to start — "CannotPullContainerError"
The task execution role lacks permission to pull from ECR, or the image tag doesn't exist in ECR. Check:
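One way to rule out a missing tag is to ask ECR for it directly. This is a sketch only: the repository name is not given in this runbook, so take it and the tag from the task definition. `ecr_image_exists` is a hypothetical helper name.

```shell
# Sketch: print the push timestamp of <tag> in ECR repository <repo>.
# The AWS CLI exits non-zero if the repository or tag does not exist.
# Arguments: <repository-name> <image-tag>
ecr_image_exists() {
  aws ecr describe-images \
    --repository-name "$1" \
    --image-ids "imageTag=$2" \
    --region ap-south-1 \
    --query 'imageDetails[0].imagePushedAt' \
    --output text
}

# Usage:
#   ecr_image_exists <repository-name> <image-tag>
```

If the tag exists, the remaining suspect is the execution role; confirm it has ECR pull permissions, for example through the AWS-managed AmazonECSTaskExecutionRolePolicy.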
ECS task starts then exits immediately
Check the CloudWatch logs for the crash message:
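A minimal sketch of two checks, assuming the log group pattern described under Viewing logs. `recent_logs` and `stopped_reason` are hypothetical helper names, and the task ID has to be copied from the ECS console or `aws ecs list-tasks`.

```shell
# Sketch: show recent log lines for a service without following.
# Arguments: <service-name> [window, e.g. 15m or 1h]
recent_logs() {
  aws logs tail "/zelly/ecs/$1" \
    --since "${2:-15m}" \
    --region ap-south-1
}

# Sketch: print why ECS stopped a task and the first container's reason.
# Arguments: <cluster> <task-id-or-arn>
stopped_reason() {
  aws ecs describe-tasks \
    --cluster "$1" \
    --tasks "$2" \
    --region ap-south-1 \
    --query 'tasks[0].[stoppedReason, containers[0].reason]' \
    --output text
}

# Usage:
#   recent_logs fastify-nova 30m
#   stopped_reason zelly-production <task-id>
```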
Common causes: missing Secrets Manager secret, wrong DB_HOST, or a required env var that is empty in the task definition.
ALB health check failing — service shows "draining" or 0 healthy targets
The ALB health check path for each service must return 200. Common issue: the service hasn't exposed a /health endpoint on the expected path. Check the target group health check path in the AWS console and compare with what the service actually serves.
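The same comparison can be done from the CLI. A sketch using `aws elbv2`; the function names are hypothetical and the target group ARN comes from the output of the first call.

```shell
# Sketch: list every target group with its health check path and expected status code.
tg_health_paths() {
  aws elbv2 describe-target-groups \
    --region ap-south-1 \
    --query 'TargetGroups[].[TargetGroupName, HealthCheckPath, Matcher.HttpCode, TargetGroupArn]' \
    --output table
}

# Sketch: show the state of each target and the reason it is unhealthy, if any.
# Argument: <target-group-arn>
tg_target_health() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --region ap-south-1 \
    --query 'TargetHealthDescriptions[].[Target.Id, TargetHealth.State, TargetHealth.Reason]' \
    --output table
}

# Usage:
#   tg_health_paths
#   tg_target_health <target-group-arn>
```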
Storefront custom domain not getting a certificate
On-demand TLS requires:
- The domain DNS A record points to the NLB Elastic IP.
- Caddy can reach Let's Encrypt on port 443 (NLB must pass TCP 443 through to Caddy).
- The allow-cert callback succeeds: check that fastify-nova's /validate_tenant_domain/:domain returns 200 for that domain.
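The DNS and callback requirements can be checked from a shell. This is a sketch under stated assumptions: `NOVA_BASE_URL` is an assumed variable holding a base URL at which fastify-nova is reachable from your machine, and both function names are hypothetical.

```shell
# Sketch: print the A records for a domain; compare them with the NLB Elastic IP.
domain_a_records() {
  dig +short A "$1"
}

# Sketch: print the HTTP status fastify-nova returns for the domain validation route.
# Expect 200 for a domain that should receive a certificate.
tenant_domain_status() {
  curl -s -o /dev/null -w '%{http_code}\n' \
    "$NOVA_BASE_URL/validate_tenant_domain/$1"
}

# Usage:
#   domain_a_records shop.example.com
#   NOVA_BASE_URL=http://<fastify_nova_host> tenant_domain_status shop.example.com
```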
Cost management
Estimated monthly spend: ~$460/mo production, ~$320/mo staging. Major cost drivers:
- ALBs — $65/mo fixed (3 ALB + 1 NLB) regardless of traffic
- NAT Gateway — $35/mo fixed per environment
- ECS Fargate — largest variable cost; use Fargate Spot for events-consumer and internal_tools to save ~70%
To reduce staging costs, scale ECS to 0 outside business hours. Aurora Serverless v2 and ElastiCache scale down automatically when idle.
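Scaling a service to zero is one `update-service` call per service. In this sketch the cluster and service names in the usage lines are assumptions modelled on the production naming, and `scale_service` is a hypothetical helper name.

```shell
# Sketch: set the desired task count for one ECS service and print the new value.
# Arguments: <cluster> <service> <desired-count>
scale_service() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --desired-count "$3" \
    --region ap-south-1 \
    --query 'service.desiredCount' \
    --output text
}

# Usage (names assumed by analogy with production):
#   scale_service zelly-staging zelly-staging-fastify-nova 0   # evening
#   scale_service zelly-staging zelly-staging-fastify-nova 1   # morning
```

If a service has Application Auto Scaling attached with a minimum capacity above zero, it may be scaled back up on its own; lower the minimum as well in that case.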