Operations

Runbooks & Troubleshooting

Viewing logs

Production / Staging (CloudWatch)

All ECS services write to CloudWatch. Log groups follow the pattern /zelly/ecs/<service-name> (30-day retention).

bash — tail logs via AWS CLI

# Stream fastify-nova logs (starts 10 minutes back by default; add --since 1h for a wider window)
aws logs tail /zelly/ecs/fastify-nova \
  --follow \
  --region ap-south-1

# Filter for errors in the last hour (date -d is GNU date syntax)
aws logs filter-log-events \
  --log-group-name /zelly/ecs/fastify-nova \
  --filter-pattern "ERROR" \
  --region ap-south-1 \
  --start-time $(date -d '1 hour ago' +%s000)
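
The date -d form above is GNU-specific and fails with the BSD date shipped on macOS. A minimal portable sketch, assuming a helper named ms_minutes_ago (hypothetical, not part of the Zelly tooling), computes the same millisecond timestamp with shell arithmetic:

```shell
# Hypothetical helper: print the epoch timestamp, in milliseconds, for
# N minutes before now. It relies only on `date +%s` and shell arithmetic,
# which both GNU and BSD date support.
ms_minutes_ago() {
  minutes=$1
  now_s=$(date +%s)
  echo $(( (now_s - minutes * 60) * 1000 ))
}
```

Usage: pass --start-time "$(ms_minutes_ago 60)" to the filter-log-events call above.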

Local dev

zdev logs -f                       # all services
zdev logs -f fastify-nova          # one service
zdev logs --tail 50 orion-backend  # last 50 lines

Connecting to staging Aurora (direct)

Staging Aurora is publicly accessible — no VPN needed.

bash

# Get endpoint from terraform outputs
cd terraform/environments/staging && terraform output aurora_endpoint

# Connect
mysql -h <aurora_endpoint> -u root -p
# Password is in terraform/local-dev/.env (DB_PASSWORD) or Secrets Manager

Connecting to production Aurora (via WireGuard)

Production Aurora is in a private subnet. You must connect WireGuard first.

bash

# Get your WireGuard client config from terraform outputs
cd terraform && terraform output wireguard_client_config_peer1

# Save to /etc/wireguard/wg0.conf, then connect:
sudo wg-quick up wg0

# Now connect to Aurora via its private endpoint
mysql -h <aurora_private_endpoint> -u root -p

Runbook — Restart an ECS service

bash

aws ecs update-service \
  --cluster zelly-production \
  --service zelly-production-fastify-nova \
  --force-new-deployment \
  --region ap-south-1
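
update-service returns as soon as the deployment is registered, not when the new tasks are healthy. A minimal sketch that also waits for the rollout, assuming a wrapper named restart_service (hypothetical) around the standard aws ecs subcommands:

```shell
# Sketch: force a new deployment, then block until the service is stable.
# restart_service is a hypothetical name; the aws subcommands are standard.
restart_service() {  # usage: restart_service <cluster> <service> [region]
  cluster=$1
  service=$2
  region=${3:-ap-south-1}
  aws ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --force-new-deployment \
    --region "$region" \
    --query 'service.serviceArn' \
    --output text || return 1
  # The waiter polls every 15 seconds and exits non-zero after 40 failed checks.
  aws ecs wait services-stable \
    --cluster "$cluster" \
    --services "$service" \
    --region "$region"
}
```

Usage: restart_service zelly-production zelly-production-fastify-nova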

Runbook — Clear a stuck BullMQ queue

bash — via Redis CLI (connect WireGuard first in prod)

# Connect to Redis
redis-cli -h <elasticache_endpoint> -a <REDIS_PASSWORD>

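# List all BullMQ keys for a queue (KEYS blocks Redis while it scans; avoid it on a busy production instance)
KEYS bull:store-events:*

# Delete the failed-jobs set (run inside redis-cli). This removes the index of failed job IDs, not the individual job hashes.
DEL bull:store-events:failed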

# Or use the BullMQ dashboard (internal_tools service, access via WireGuard)
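
A non-blocking way to inspect the queue before deleting anything, as a minimal sketch. The helper names are hypothetical, and the sketch assumes BullMQ keeps failed job IDs in a sorted set at bull:<queue>:failed. Export REDISCLI_AUTH beforehand so the password does not appear on the command line.

```shell
# Sketch only: helper names are hypothetical. redis-cli reads the password
# from the REDISCLI_AUTH environment variable when it is set.
list_queue_keys() {    # usage: list_queue_keys <redis_host> <queue>
  # --scan iterates with SCAN, so Redis keeps serving other clients.
  redis-cli -h "$1" --scan --pattern "bull:$2:*"
}

count_failed_jobs() {  # usage: count_failed_jobs <redis_host> <queue>
  # Assumes the failed set is a sorted set, hence ZCARD.
  redis-cli -h "$1" ZCARD "bull:$2:failed"
}
```

Usage: count_failed_jobs <elasticache_endpoint> store-events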

Runbook — Run DB migrations

bash — run migrations against staging

# Staging: connect directly (Aurora is public)
cd backend-api-fastify-nova
DB_HOST=<staging_aurora_endpoint> npm run migrate

# Production: connect WireGuard first, then run from the same directory on your machine
DB_HOST=<prod_aurora_private_endpoint> npm run migrate

Never run a migration against production until it has been run and verified on staging. For ALTER TABLE on large tables, use pt-online-schema-change or gh-ost to avoid lock contention.
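
For the large-table case, a minimal sketch of a pt-online-schema-change wrapper that defaults to a dry run. The function name online_alter and the example database and table names are hypothetical; DB_HOST and DB_USER are read from the environment as in the commands above.

```shell
# Hypothetical wrapper around pt-online-schema-change. It performs a dry run
# unless the third argument is "execute". Database and table names passed in
# are placeholders, not names taken from the Zelly schema.
online_alter() {  # usage: online_alter <db.table> "<alter clause>" [dry-run|execute]
  db_table=$1
  alter=$2
  mode=${3:-dry-run}
  case "$mode" in
    dry-run|execute) ;;
    *) echo "mode must be dry-run or execute" >&2; return 2 ;;
  esac
  if [ -z "$DB_HOST" ]; then
    echo "DB_HOST is not set" >&2
    return 2
  fi
  db=${db_table%%.*}     # text before the first dot
  table=${db_table#*.}   # text after the first dot
  pt-online-schema-change \
    --alter "$alter" \
    --ask-pass \
    "--$mode" \
    "D=$db,t=$table,h=$DB_HOST,u=${DB_USER:-root}"
}
```

Usage: DB_HOST=<staging_aurora_endpoint> online_alter mydb.mytable "ADD COLUMN note VARCHAR(255) NULL", then repeat with execute as the third argument once the dry run is clean.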

Runbook — Apply ClickHouse schema

The init SQL only runs on a fresh ClickHouse volume. To re-apply manually:

bash — via WireGuard (prod) or direct (staging)

cat store-events-consumer/docker/clickhouse/init/clickhouse_init.sql \
  | curl -s -X POST \
    "http://<clickhouse_host>:8123/?user=default&password=<CLICKHOUSE_PASSWORD>" \
    --data-binary @-
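
To confirm the schema landed, a minimal sketch of a query helper. The name ch_query and the CLICKHOUSE_HOST / CLICKHOUSE_USER variables are assumptions; the helper sends credentials in ClickHouse's X-ClickHouse-User and X-ClickHouse-Key headers so the password stays out of the URL.

```shell
# Sketch: ch_query is a hypothetical helper. It posts one SQL statement to
# the ClickHouse HTTP interface and fails on a non-2xx response.
ch_query() {  # usage: ch_query "<sql>"
  curl -sS --fail -X POST "http://${CLICKHOUSE_HOST}:8123/" \
    -H "X-ClickHouse-User: ${CLICKHOUSE_USER:-default}" \
    -H "X-ClickHouse-Key: ${CLICKHOUSE_PASSWORD}" \
    --data-binary "$1"
}
```

Usage: CLICKHOUSE_HOST=<clickhouse_host> CLICKHOUSE_PASSWORD=<CLICKHOUSE_PASSWORD> ch_query 'SHOW TABLES'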

Troubleshooting — Local Dev

Service shows MISSING REQUIRED CONFIG on startup

terraform/local-dev/.env is either missing or has empty values for DB_HOST, DB_USER, or DB_PASSWORD.

# Create if missing
cp terraform/local-dev/.env.example terraform/local-dev/.env
# Edit: fill in DB_HOST, DB_USER, DB_PASSWORD
# Then restart the affected service
zdev up fastify-nova
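
To see which of the three values is the problem before restarting, a minimal sketch of a checker. The function name check_env is hypothetical, and the sketch treats a quoted empty value such as DB_USER="" as set.

```shell
# Hypothetical helper: report which required keys are missing or empty in
# an env file. Returns 0 only when all three keys have a value.
check_env() {  # usage: check_env <path-to-.env>
  env_file=$1
  missing=0
  if [ ! -f "$env_file" ]; then
    echo "missing file: $env_file"
    return 1
  fi
  for key in DB_HOST DB_USER DB_PASSWORD; do
    # A key counts as set only if a line "KEY=<at least one character>" exists.
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "empty or missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: check_env terraform/local-dev/.env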

fastify-nova exits with ERR_MODULE_NOT_FOUND (service-account-creds-private.json)

The Firebase service account file is missing from backend-api-fastify-nova/. Get it from 1Password, place it at backend-api-fastify-nova/service-account-creds-private.json, then rebuild:

zdev up --build fastify-nova

ClickHouse unhealthy — nova / seller-panel / events-consumer stuck in "Created"

ClickHouse listens on IPv4 (0.0.0.0:8123), but Docker Desktop resolves localhost to ::1 (IPv6) first, so a healthcheck against localhost would fail. The healthcheck already uses 127.0.0.1 for that reason. Verify ClickHouse is responding:

docker exec zelly-clickhouse wget -qO- http://127.0.0.1:8123/ping
# Expected: Ok.

If it prints Ok. but Docker still shows unhealthy, force-recreate:

zdev up clickhouse --force-recreate -d

*.test domains return "connection refused" or don't resolve

Confirm the hosts file entry is present: ping zelly-nova.test should resolve to 127.0.0.1.
Confirm port 80 is free (Windows: stop IIS — see Local Dev setup).
Confirm the proxy container is running: zdev ps should show zelly-proxy as Up.
Check Caddy logs: zdev logs proxy.
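
The first check can be scripted without relying on ping. A minimal sketch, assuming a helper named has_hosts_entry (hypothetical) and a Unix-style hosts file; on Windows the file is C:\Windows\System32\drivers\etc\hosts.

```shell
# Hypothetical helper: succeed if the hosts file maps <name> to 127.0.0.1
# on an uncommented line.
has_hosts_entry() {  # usage: has_hosts_entry <name> [hosts_file]
  awk -v name="$1" '
    { sub(/#.*/, "") }            # drop comments before inspecting fields
    $1 == "127.0.0.1" {
      for (i = 2; i <= NF; i++) if ($i == name) found = 1
    }
    END { exit found ? 0 : 1 }
  ' "${2:-/etc/hosts}"
}
```

Usage: has_hosts_entry zelly-nova.test || echo "add zelly-nova.test to the hosts file"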

zelly-seller.test or zelly-admin.test returns 403 Forbidden

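Vite rejects requests whose Host header is not in its allowlist (server.allowedHosts, enforced since the 5.4.12 and 6.0.9 releases) and answers 403. The Caddyfile already contains header_up Host "localhost" for those routes, so a 403 usually means Caddy is running a stale config. Reload Caddy: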

docker exec zelly-proxy caddy reload --config /etc/caddy/Caddyfile

seller-panel or orion-frontend crashes with SIGBUS in a loop

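node:22-alpine (musl libc) conflicts with esbuild's pre-built native binary, so both services now use node:22-slim (Debian/glibc). If the node_modules named volume still holds an esbuild binary from a previous Alpine run, remove the volumes and restart:

# docker volume rm fails with "volume is in use" while any container, even a stopped one, still references the volume; remove those containers first
docker volume rm zelly_seller_panel_nm zelly_orion_frontend_nm
zdev up seller-panel orion-frontend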

Port 80 already in use on Windows — proxy container won't start

IIS binds port 80 by default. Run once as Administrator:

net stop "World Wide Web Publishing Service"
net stop "IIS Admin Service"

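If neither service exists, find what is listening with netstat -ano | findstr :80 and stop the process with the PID shown (taskkill /PID <pid> /F). Note that findstr :80 also matches ports such as 8080, so check the local address column.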

npm install runs on every start

This is expected only on the first start after the node_modules named volume was deleted (e.g. after zdev down -v). The install takes 2–5 minutes, and the result stays cached in the volume until the volume is deleted again.

ClickHouse init SQL not running

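The init SQL only runs on a fresh volume. If ClickHouse has already started once, the init scripts are skipped. To force re-init, delete the data volume. This also deletes all local ClickHouse data:

# Remove the container first; docker volume rm fails while a container still references the volume
docker rm -f zelly-clickhouse
docker volume rm zelly_clickhouse_data
zdev up clickhouse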

Troubleshooting — ECS / Production

ECS task fails to start — "CannotPullContainerError"

The task execution role lacks permission to pull from ECR, or the image tag doesn't exist in ECR. Check:

# Verify image exists
aws ecr list-images \
  --repository-name zelly/fastify-nova \
  --region ap-south-1 \
  --query 'imageIds[*].imageTag'

# Check that the task execution role's inline policy grants ecr:GetAuthorizationToken, ecr:BatchGetImage, and ecr:GetDownloadUrlForLayer
aws iam get-role-policy \
  --role-name zelly-ecs-task-execution \
  --policy-name ecr-pull

ECS task starts then exits immediately

Check the CloudWatch logs for the crash message:

aws logs tail /zelly/ecs/fastify-nova \
  --region ap-south-1 \
  --since 10m

Common causes: missing Secrets Manager secret, wrong DB_HOST, or a required env var that is empty in the task definition.
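
When the container dies before it logs anything, ECS's own stop reason is often more informative than CloudWatch. A minimal sketch, assuming a helper named last_stopped_reason (hypothetical); ECS keeps stopped tasks for a limited time only, so run it soon after the failure.

```shell
# Sketch: print the stop code, stop reason, and first container's exit code
# for one stopped task of a service. The function name is hypothetical; the
# queried fields are standard describe-tasks output.
last_stopped_reason() {  # usage: last_stopped_reason <cluster> <service> [region]
  region=${3:-ap-south-1}
  task_arn=$(aws ecs list-tasks \
    --cluster "$1" \
    --service-name "$2" \
    --desired-status STOPPED \
    --region "$region" \
    --query 'taskArns[0]' \
    --output text) || return 1
  # With --output text an empty list is printed as "None".
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "no stopped tasks found for $2" >&2
    return 1
  fi
  aws ecs describe-tasks \
    --cluster "$1" \
    --tasks "$task_arn" \
    --region "$region" \
    --query 'tasks[0].{stopCode:stopCode,reason:stoppedReason,exitCode:containers[0].exitCode,containerReason:containers[0].reason}' \
    --output table
}
```

Usage: last_stopped_reason zelly-production zelly-production-fastify-nova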

ALB health check failing — service shows "draining" or 0 healthy targets

Every service must answer its ALB health check path with HTTP 200. The usual cause of failure is a mismatch: the target group checks a path (for example /health) that the service does not serve. Compare the health check path configured on the target group in the AWS console with the routes the service actually exposes.
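
The same comparison can be made from the CLI. A minimal sketch, assuming a helper named target_group_health (hypothetical) and that you already have the target group ARN:

```shell
# Sketch: show what the target group checks, then how each target is doing.
# The function name is hypothetical; both elbv2 subcommands are standard.
target_group_health() {  # usage: target_group_health <target_group_arn> [region]
  region=${2:-ap-south-1}
  # Configured health check path, port, and expected status codes.
  aws elbv2 describe-target-groups \
    --target-group-arns "$1" \
    --region "$region" \
    --query 'TargetGroups[0].[HealthCheckPath,HealthCheckPort,Matcher.HttpCode]' \
    --output text || return 1
  # Per-target state, plus a reason code when a target is not healthy.
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --region "$region" \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
    --output text
}
```

A Reason of Target.ResponseCodeMismatch points at the path or status code; Target.Timeout points at the port, security group, or a service that is not listening.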

Storefront custom domain not getting a certificate

On-demand TLS requires:

The domain DNS A record points to the NLB Elastic IP.
Let's Encrypt can reach Caddy on port 443 to validate the domain (the NLB must pass TCP 443 through to Caddy), and Caddy has outbound HTTPS access to Let's Encrypt.
The allow-cert callback succeeds — check that fastify-nova's /validate_tenant_domain/:domain returns 200 for that domain.

# Test domain validation endpoint directly
curl https://<nova_alb_endpoint>/validate_tenant_domain/merchant.example.com
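
Only the status code matters for the allow-cert callback, so it helps to print just that. A minimal sketch, assuming a helper named check_tenant_domain (hypothetical) that wraps the curl call above:

```shell
# Sketch: report whether the validation endpoint accepts a domain.
# check_tenant_domain is a hypothetical name; the route comes from the
# checklist above.
check_tenant_domain() {  # usage: check_tenant_domain <nova_base_url> <domain>
  status=$(curl -s -o /dev/null -w '%{http_code}' "$1/validate_tenant_domain/$2")
  if [ "$status" = "200" ]; then
    echo "$2: allowed (HTTP 200)"
  else
    echo "$2: rejected (HTTP $status)"
    return 1
  fi
}
```

Usage: check_tenant_domain https://<nova_alb_endpoint> merchant.example.com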

Cost management

Estimated monthly spend: ~$460/mo production, ~$320/mo staging. Major cost drivers:

ALBs — $65/mo fixed (3 ALB + 1 NLB) regardless of traffic
NAT Gateway — $35/mo fixed per environment
ECS Fargate — largest variable cost; use Fargate Spot for events-consumer and internal_tools to save ~70%

To reduce staging costs, scale ECS to 0 outside business hours. Aurora Serverless v2 and ElastiCache scale down automatically when idle.
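
Scaling staging to 0 can be done with update-service. A minimal sketch, assuming a helper named scale_services (hypothetical); the cluster name zelly-staging and the staging service names in the usage line are assumptions modelled on the production names used earlier on this page.

```shell
# Sketch: set the desired task count for several ECS services in one call.
# The function name is hypothetical. If a service has auto scaling with a
# minimum capacity above 0, auto scaling may raise the count again.
scale_services() {  # usage: scale_services <cluster> <count> <service>...
  cluster=$1
  count=$2
  shift 2
  for service in "$@"; do
    aws ecs update-service \
      --cluster "$cluster" \
      --service "$service" \
      --desired-count "$count" \
      --region ap-south-1 \
      --query 'service.[serviceName,desiredCount]' \
      --output text || return 1
  done
}
```

Usage: scale_services zelly-staging 0 zelly-staging-fastify-nova in the evening, and the same call with the normal count in the morning. Running both from a scheduler is left to whatever the team already uses.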