Debugging & Error Reference

Detailed error scenarios, root causes, and step-by-step solutions organized by system layer. Each section includes the exact error message to grep for, what causes it, and how to fix it.

Quick orientation: Start by checking CloudWatch Logs for the failing service, then the ECS stopped task reason, then the Secrets Manager values. Those three steps resolve ~80% of issues.

ECS / Container Errors

ResourceInitializationError: unable to pull secrets or registry auth

Root cause: The ECS task execution role lacks permission to read from Secrets Manager, or the secret ARN referenced in the task definition doesn't exist in the correct region.

Diagnose:

bash

# Check the stopped task reason
aws ecs describe-tasks \
  --cluster zelly-staging \
  --tasks "$(aws ecs list-tasks --cluster zelly-staging --desired-status STOPPED --region ap-southeast-1 --query 'taskArns[0]' --output text)" \
  --region ap-southeast-1 \
  --query 'tasks[0].stoppedReason'

# Verify the secret exists in the right region
aws secretsmanager describe-secret \
  --secret-id zelly/fastify-nova/env \
  --region ap-southeast-1

Fix:

Confirm the secret exists in the environment's region (staging = ap-southeast-1, prod = ap-south-1)
If missing, create it: aws secretsmanager create-secret --name zelly/fastify-nova/env --secret-string '{}' --region ap-southeast-1
Check the ECS task execution role has secretsmanager:GetSecretValue on the secret ARN
The Terraform module attaches this automatically — re-run terraform apply if the role is misconfigured

CannotPullContainerError: pull image manifest has been retried 5 time(s)

Root cause: ECS could not pull the image from ECR. The task execution role can't authenticate to ECR, the image tag doesn't exist, or the task has no working network route to ECR (NAT gateway or VPC endpoint issue).

Common causes:

Image tag in task definition doesn't exist in ECR (check with aws ecr describe-images)
ECR repo is in ap-south-1; a task in another region, or in a VPC without ECR VPC endpoints, can only reach it through the NAT GW
Task execution role missing ecr:GetAuthorizationToken, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer, or ecr:BatchCheckLayerAvailability

bash

# Verify the image tag exists
aws ecr describe-images \
  --repository-name zelly/fastify-nova \
  --image-ids imageTag=staging-abc12345 \
  --region ap-south-1

# Check if NAT gateway is working (from a bastion in the VPC)
curl -s https://api.ecr.ap-south-1.amazonaws.com/ --max-time 5

Fix: Ensure the image tag you're deploying actually exists in ECR. Trigger a fresh CI build if the tag is missing. If the NAT gateway is down, check the VPC route table has a default route via NAT GW.

Task starts then immediately stops (exit code 1 or 2)

Root cause: Application crash on startup. Usually a missing required environment variable, syntax error in config, or port already in use.

Diagnose:

bash

# Get logs from the crashed container (adjust time range)
aws logs filter-log-events \
  --log-group-name /zelly/ecs/fastify-nova \
  --start-time $(( $(date +%s) * 1000 - 300000 )) \
  --region ap-southeast-1 \
  --query 'events[*].message' \
  --output text | tail -50

Common startup crash reasons:

Log message	Cause	Fix
`Cannot read properties of undefined (reading 'X')`	Env var not set	Check Secrets Manager, force-redeploy after populating
`Error: connect ECONNREFUSED`	DB/Redis not reachable	Check VPN, security group, Redis TLS config
`listen EADDRINUSE :::3000`	Port conflict	Should not happen in ECS; check containerPort in task def
`SyntaxError: Unexpected token`	Bad JSON in env var	Check Secrets Manager values for malformed JSON
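The first row of the table is easier to diagnose when the service fails fast and names the missing variable. A minimal sketch of such a startup guard follows; the helper names `findMissingEnv` and `assertEnv` are hypothetical, and the `REQUIRED` list is only an assumption borrowed from the checklist at the end of this page, not the actual list any service uses.

```javascript
// Hypothetical startup guard: report missing env vars by name instead of
// crashing later with "Cannot read properties of undefined".
function findMissingEnv(requiredNames, env) {
  return requiredNames.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}

// Assumed list for illustration; use the keys your service actually reads.
const REQUIRED = ['DB_HOST', 'REDIS_HOST', 'FIREBASE_SERVICE_ACCOUNT'];

function assertEnv(env) {
  const missing = findMissingEnv(REQUIRED, env);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }
}
```

Calling `assertEnv(process.env)` as the first statement of the entry point would turn an opaque crash into a single log line that points straight at the Secrets Manager key to populate.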

ALB health check failing — service stuck at 0/N healthy

Root cause: The container is running but the ALB cannot reach the health check endpoint, or the app is not listening on the correct port.

Diagnose:

bash — check target group health

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:ap-southeast-1:279391564627:targetgroup/fastify-nova-tg/xxx \
  --region ap-southeast-1

Common causes & fixes:

App not listening on correct port: fastify-nova should listen on 0.0.0.0:3000. Check HOST env var — some apps default to 127.0.0.1 which ALB can't reach.
Health check path wrong: Check the ALB target group health check path. Should be /health or / returning 200.
App takes too long to start: Increase health check grace period in the ECS service (Terraform: health_check_grace_period_seconds).
Security group blocking: ALB SG must be in the inbound rules of the task SG on the container port.
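The bind-address and health-path causes above can be illustrated with a small sketch. It uses Node's built-in http module as a stand-in for the real Fastify app; `HOST` comes from the text above, while `PORT`, `resolveListenOptions`, and `createHealthServer` are assumed names for illustration only.

```javascript
const http = require('node:http');

// Resolve the bind address. The defaults match the 0.0.0.0:3000
// expectation described above; PORT is an assumed variable name.
function resolveListenOptions(env) {
  const port = Number.parseInt(env.PORT ?? '3000', 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  return { host: env.HOST || '0.0.0.0', port };
}

// Plain Node http stand-in for the real app: answers /health with 200.
function createHealthServer() {
  return http.createServer((req, res) => {
    if (req.url === '/health') {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ status: 'ok' }));
      return;
    }
    res.writeHead(404);
    res.end();
  });
}

// Usage (not executed here):
// const { host, port } = resolveListenOptions(process.env);
// createHealthServer().listen(port, host);
```

If `HOST` resolves to 127.0.0.1, the container accepts connections only from inside itself, which looks exactly like a failing health check from the ALB's point of view.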

Exit code 137 — container OOM killed

Root cause: Container exceeded its memory limit and was killed with SIGKILL, which ECS reports as exit code 137 (128 + signal 9).

bash — check stopped task

aws ecs describe-tasks \
  --cluster zelly-production \
  --tasks TASK_ARN \
  --region ap-south-1 \
  --query 'tasks[0].{reason:stoppedReason,containers:containers[*].{name:name,exitCode:exitCode,reason:reason}}'

Fix: Increase the memory value in the ECS task definition. For fastify-nova, default is 2048 MB. Register a new task definition with higher memory and force-redeploy. Check for memory leaks if this happens repeatedly.
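The exit-code arithmetic can be sketched as follows. `describeExitCode` is a hypothetical helper, and the signal table is a small subset of the standard Linux signal numbers rather than a complete list.

```javascript
// Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
// Small subset of Linux signal numbers, enough for common container exits.
const SIGNALS = { 2: 'SIGINT', 9: 'SIGKILL', 11: 'SIGSEGV', 15: 'SIGTERM' };

function describeExitCode(code) {
  if (code > 128) {
    const signal = code - 128;
    return { signal, name: SIGNALS[signal] ?? 'unknown' };
  }
  // 128 and below: the process exited on its own with this status.
  return { signal: null, name: null };
}
```

Exit code 137 alone does not prove an OOM kill, because a container that ignores SIGTERM past its stop timeout is also ended with SIGKILL. The container `reason` field returned by the describe-tasks query above is the more reliable indicator.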

Deployment stuck — new tasks never reach RUNNING

Root cause: Insufficient cluster capacity, task placement failure, or the new tasks are crashing before reaching RUNNING state.

bash

# Check deployment events
aws ecs describe-services \
  --cluster zelly-staging \
  --services fastify-nova \
  --region ap-southeast-1 \
  --query 'services[0].{events:events[0:10],deployments:deployments}'

Fix steps:

Check events — look for service fastify-nova was unable to place a task
For capacity issues: the cluster uses Fargate — there is no capacity to provision. Check AWS service quotas.
If tasks are crashing: see "Task immediately stops" above
Force the old deployment to drain: aws ecs update-service --cluster ... --service ... --desired-count 0 --region ..., then set desired count back

lifecycle { ignore_changes = [task_definition] } — Terraform reverts my deployment

Root cause: All ECS services have this lifecycle rule so Terraform never overwrites CI-deployed images. This is intentional.

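What this means: After changing Terraform variables (e.g. memory, env vars), terraform apply registers a new task definition revision but does not force a redeployment. You must do that manually. Because the service ignores task_definition changes, it may still reference the previous revision; if so, add --task-definition FAMILY to the command below (when no revision number is given, ECS uses the latest ACTIVE revision).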

bash

aws ecs update-service \
  --cluster zelly-production \
  --service fastify-nova \
  --force-new-deployment \
  --region ap-south-1

Database Errors

ECONNREFUSED 10.0.x.x:3306 or connect ETIMEDOUT

Root cause: Can't reach Aurora. In production, Aurora is in a private subnet — requires WireGuard VPN. In staging, Aurora is publicly accessible.

Diagnose:

bash

# Production: check WireGuard is connected
wg show wg0

# Test connectivity to Aurora endpoint (get endpoint from Terraform outputs or AWS console)
nc -zv -w 5 aurora-prod-endpoint.cluster-xxx.ap-south-1.rds.amazonaws.com 3306

# Staging: test direct (no VPN needed)
mysql -h aurora-staging-endpoint.cluster-xxx.ap-southeast-1.rds.amazonaws.com \
  -u zellymaster -p -e "SELECT 1"

Fix:

Production: Connect to WireGuard VPN first. See VPN setup.
ECS services: Check the task security group allows outbound to the Aurora security group on port 3306. Terraform handles this, but re-apply if SG rules are missing.
Staging: Aurora is public — if it's refusing, the Aurora cluster may be paused (Serverless v2 cold start). Wait 10–30s and retry.

Access denied for user 'zellymaster'@'10.x.x.x'

Root cause: Wrong password, wrong username, or the user doesn't have access to the target schema.

bash

# Check the secret value (read from Secrets Manager)
aws secretsmanager get-secret-value \
  --secret-id zelly/aurora/master \
  --region ap-south-1 \
  --query 'SecretString' --output text | python3 -m json.tool

Fix:

Verify the password in Secrets Manager matches the Aurora master password set during provisioning
If passwords drifted, update either the Aurora master password via RDS console or the secret value
ECS does not hot-reload secrets — force-redeploy after updating the secret

Too many connections — Aurora max_connections exceeded

Root cause: The application connection pool is exhausted, or too many services are connecting without pooling.

Aurora Serverless v2 max_connections scales with ACU. At 0.5 ACU ≈ 90 connections. At 8 ACU ≈ 1000+.

bash — check current connections

# Via WireGuard VPN (production) or direct (staging)
mysql -h AURORA_ENDPOINT -u zellymaster -p -e \
  "SELECT USER, HOST, COUNT(*) as cnt FROM information_schema.PROCESSLIST GROUP BY USER, HOST ORDER BY cnt DESC"

Fix:

Check if Aurora has auto-scaled (should scale to handle load)
Kill idle connections: KILL CONNECTION_ID;
Reduce connection pool size in service env vars (DB_POOL_MAX or equivalent)
Consider Amazon RDS Proxy in front of Aurora if connection storms are recurring
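Sizing the pool is simple arithmetic: the pools of all running tasks together must stay under max_connections. A rough sketch, where `maxPoolPerTask` and the `reserved` headroom of 10 connections are assumptions for illustration:

```javascript
// Largest per-task pool size that keeps the whole fleet under max_connections.
// reserved = connections held back for admin sessions and migrations (assumed).
function maxPoolPerTask(maxConnections, taskCount, reserved = 10) {
  if (taskCount < 1) throw new Error('taskCount must be at least 1');
  return Math.max(0, Math.floor((maxConnections - reserved) / taskCount));
}

// With the ~90 connections quoted above for 0.5 ACU:
//   maxPoolPerTask(90, 4)  -> 20
//   maxPoolPerTask(90, 12) -> 6
```

Count every task of every service that connects to the cluster in `taskCount`, including tasks that exist only briefly during a rolling deployment.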

SSL connection error: SSL is required

Root cause: Aurora requires SSL connections. The database client is not configured to use TLS.

Fix: Add SSL options to the database connection config:

Node.js (mysql2 / TypeORM)

// mysql2 / Sequelize
{
  ssl: { rejectUnauthorized: true }
}

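// TypeORM
{
  ssl: true,
  // rejectUnauthorized: false skips certificate verification. Treat it as a
  // temporary workaround; prefer trusting the Amazon RDS CA bundle instead.
  extra: { ssl: { rejectUnauthorized: false } }
}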

The environment variable DB_SSL=true is often what controls this. Check the Secrets Manager secret for the relevant service.
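How `DB_SSL` maps onto the driver's `ssl` option depends on each service. One plausible wiring is sketched below; `buildSslOptions` and the `DB_SSL_CA` variable are hypothetical and not taken from any service's code.

```javascript
// Assumed convention: DB_SSL=true enables TLS with certificate verification.
// DB_SSL_CA (hypothetical) optionally carries a PEM-encoded CA bundle.
function buildSslOptions(env) {
  if (env.DB_SSL !== 'true') return undefined; // no TLS requested
  const ssl = { rejectUnauthorized: true };
  if (env.DB_SSL_CA) ssl.ca = env.DB_SSL_CA;
  return ssl;
}

// Usage (not executed here): spread into the connection config, e.g.
// const config = { host: process.env.DB_HOST, ssl: buildSslOptions(process.env) };
```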

Migration failure — Table 'X' already exists or Migration X has already been run

Root cause: A migration was run against a database that's already in that state, or a partial migration left the schema in an inconsistent state.

bash

# Check migration state (TypeORM example)
mysql -h AURORA_ENDPOINT -u zellymaster -p -e \
  "SELECT * FROM astro_primary.migrations ORDER BY timestamp DESC LIMIT 10"

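# Manually mark a migration as run without executing it (TypeORM).
# This is SQL, so it has to go through the mysql client rather than the shell.
mysql -h AURORA_ENDPOINT -u zellymaster -p -e \
  "INSERT INTO astro_primary.migrations (timestamp, name) VALUES (1700000000000, 'MigrationName1700000000000')"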

Fix:

Never run migrations against production without testing on staging first
If a migration is stuck, check for locked tables: SHOW ENGINE INNODB STATUS\G
For Aurora Serverless, migrations that take >30s may hit the Aurora connection timeout — use NOWAIT hints or batch the migration

Aurora cold start — connection timeout on first request after idle period

Root cause: With a minimum capacity of 0 ACUs, Aurora Serverless v2 pauses when idle. The first request after a long idle period takes 10–30 seconds while the instance resumes and scales up.

Staging only — production is set to min 1 ACU.

Fix: Either set a minimum ACU capacity of 0.5 or 1 in staging (Terraform: serverlessv2_scaling_configuration.min_capacity), or implement a connection retry with backoff in the application.
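The retry-with-backoff option could look roughly like this. It is a sketch under assumptions: `connectWithRetry` and `backoffDelays` are illustrative names, and the default schedule (1s, 2s, 4s, 8s, 15s) was chosen only so that the total wait of 30 seconds covers the cold-start window quoted above.

```javascript
// Exponential backoff schedule: base, 2*base, 4*base, ... capped at maxDelayMs.
function backoffDelays(retries, baseDelayMs, maxDelayMs) {
  return Array.from({ length: retries }, (_, i) =>
    Math.min(baseDelayMs * 2 ** i, maxDelayMs));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call connect() and retry on failure, waiting per the schedule above.
// Rethrows the last error once all retries are used up.
async function connectWithRetry(
  connect,
  { retries = 5, baseDelayMs = 1000, maxDelayMs = 15000 } = {},
) {
  const delays = backoffDelays(retries, baseDelayMs, maxDelayMs);
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await connect();
    } catch (err) {
      if (attempt >= delays.length) throw err;
      await sleep(delays[attempt]);
    }
  }
}
```

In practice `connect` would wrap whatever the service uses to open its first database connection. Retrying only on connection-level errors such as ETIMEDOUT, rather than on every error, keeps a bad password from being retried for 30 seconds.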

Network & TLS Errors

ALB 502 Bad Gateway

Root cause: The ALB received a bad response from the target (container). The target is unhealthy, crashed mid-request, or the container exited.

Diagnose:

Check CloudWatch logs for the service — look for crash or unhandled errors near the request time
Check ECS stopped tasks for any containers that crashed
Check the ALB access logs in S3 (if enabled) for the exact error

Fix: Usually means the service crashed. Check logs and fix the underlying app error. If the service is overloaded, increase desired task count.

ALB 503 Service Unavailable

Root cause: No healthy targets in the target group. All containers are unhealthy or there are zero running tasks.

bash

aws elbv2 describe-target-health \
  --target-group-arn TARGET_GROUP_ARN \
  --region ap-south-1

Common causes:

ECS service desired count is 0 — scale up
All tasks are failing health checks — fix the health check endpoint or grace period
Deployment in progress — wait for the rolling update to complete
ALB security group was changed and is now blocking the health check port

ALB 504 Gateway Timeout

Root cause: The application did not respond within the ALB idle timeout (default 60s).

Fix:

Find and optimize the slow endpoint in the application logs
Increase ALB idle timeout in Terraform (idle_timeout on the ALB resource) if the request is legitimately long-running
For DB queries, add indexes or optimize the query

Caddy TLS — challenge failed or no valid ACME CA

Root cause: Caddy failed to obtain a Let's Encrypt certificate for a merchant's custom domain, most often because the domain's DNS does not point to the NLB Elastic IP.

Diagnose:

bash — check caddy logs

aws logs filter-log-events \
  --log-group-name /zelly/ecs/storefront \
  --filter-pattern "challenge" \
  --start-time $(( $(date +%s) * 1000 - 3600000 )) \
  --region ap-south-1 \
  --query 'events[*].message' --output text

Fix:

Verify the merchant's domain A record points to the NLB Elastic IP (check terraform output nlb_elastic_ip)
Wait for DNS propagation (up to 24h for some registrars)
Check that the /allow-cert endpoint returns 200 for the domain (Caddy calls it before issuing a cert)
Let's Encrypt rate limit: max 5 failed validations per hostname, per account, per hour. Wait before retrying.
For testing, switch to Let's Encrypt staging CA temporarily to avoid rate limits

Caddy /allow-cert returns 403 — custom domain blocked

Root cause: The domain is not registered as a valid tenant domain in the astro_primary database. Caddy calls ${CORE_API_URL}/validate_tenant_domain/{domain} before issuing a cert.

Fix:

Verify the merchant's domain is saved in the database
Check fastify-nova logs around the validate_tenant_domain endpoint
Ensure CORE_API_URL is correctly set in the storefront service secret

CORS error — blocked by CORS policy: No 'Access-Control-Allow-Origin'

Root cause: The CORS_ORIGINS variable on orion-backend doesn't include the requesting frontend origin.

Fix:

terraform — update variable

# terraform.tfvars (or staging equivalent)
orion_cors_origins = "https://admin.zelly.in,https://seller.zelly.in,http://localhost:5175,http://localhost:5173"

Then run terraform apply and force-redeploy orion-backend. Note: CORS_ORIGINS is a Terraform variable, not a Secrets Manager key.
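For reference, a comma-separated origin list like the one above is typically consumed along these lines. This is a sketch, not orion-backend's actual code; `parseCorsOrigins` and `isOriginAllowed` are hypothetical helpers.

```javascript
// Split the comma-separated CORS_ORIGINS value into a set of exact origins.
// Whitespace and trailing slashes are stripped, because the browser's Origin
// header never ends in a slash and a stray one in config would never match.
function parseCorsOrigins(value) {
  return new Set(
    (value ?? '')
      .split(',')
      .map((origin) => origin.trim().replace(/\/+$/, ''))
      .filter(Boolean),
  );
}

// Origins are compared exactly: scheme, host, and port must all match.
function isOriginAllowed(origin, allowed) {
  return allowed.has(origin);
}
```

Because the comparison is exact, http://localhost:5173 and http://localhost:5175 are different origins, which is why both appear in the variable above.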

BullMQ & Redis Errors

Error: connect ECONNREFUSED or connect ETIMEDOUT on Redis

Root cause: Redis TLS is not configured in the client, or the security group is blocking port 6379.

ElastiCache Redis requires TLS. Clients must pass tls: {}:

Node.js — correct Redis config

const Redis = require('ioredis');
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  password: process.env.REDIS_AUTH_TOKEN,
  tls: {},           // REQUIRED for ElastiCache TLS
  maxRetriesPerRequest: null,  // Required for BullMQ
  enableReadyCheck: false,
});

Fix: Ensure tls: {} is present. Check REDIS_HOST is the ElastiCache endpoint (not localhost). Locally, Redis runs in Docker without TLS.
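Since local Redis runs without TLS while ElastiCache requires it, the options are usually built from the environment. A possible sketch follows; `REDIS_TLS` and `REDIS_PORT` are hypothetical variables and `buildRedisOptions` is an illustrative helper, not existing code.

```javascript
// TLS is on unless explicitly disabled, so a forgotten variable in AWS
// cannot silently produce a plaintext client. Local Docker sets
// REDIS_TLS=false (hypothetical switch).
function buildRedisOptions(env) {
  const options = {
    host: env.REDIS_HOST || 'localhost',
    port: Number(env.REDIS_PORT || 6379),
    maxRetriesPerRequest: null, // required for BullMQ
    enableReadyCheck: false,
  };
  if (env.REDIS_AUTH_TOKEN) options.password = env.REDIS_AUTH_TOKEN;
  if (env.REDIS_TLS !== 'false') options.tls = {};
  return options;
}

// Usage (not executed here):
// const redis = new Redis(buildRedisOptions(process.env));
```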

Jobs stuck in waiting state — queue not draining

Root cause: The events-consumer worker is not running or is not connected to the same Redis instance.

bash — check consumer health

# Check if events-consumer is running
aws ecs describe-services \
  --cluster zelly-staging \
  --services events-consumer \
  --region ap-southeast-1 \
  --query 'services[0].{desired:desiredCount,running:runningCount}'

# Check consumer logs for errors
aws logs filter-log-events \
  --log-group-name /zelly/ecs/events-consumer \
  --start-time $(( $(date +%s) * 1000 - 600000 )) \
  --region ap-southeast-1 \
  --query 'events[*].message' --output text | tail -30

Fix: If the consumer is down, force-redeploy it. If it's running but not processing, check the queue name — BullMQ queue names must match exactly between producer and consumer (store-events).

Jobs failing repeatedly and landing in failed state

Root cause: The job processor is throwing an error. Once the job's configured attempts are exhausted, BullMQ moves the job to the failed set.

bash — find failed jobs

# Via Bull Studio (access over WireGuard VPN)
# VPN URL: http://10.0.x.x:3000 (check wg-config.json for bastion IP)
# Bull Studio shows failed jobs with their error stack traces

Fix:

Check consumer logs for the error thrown during job processing
Common causes: ClickHouse unreachable, bad data shape, missing env var
Once fixed, retry failed jobs from Bull Studio UI or via BullMQ API

WRONGTYPE Operation against a key holding the wrong kind of value

Root cause: A Redis key collision — a non-BullMQ key exists at the same path as a queue's internal key, usually from a previous deployment with a different naming scheme.

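Fix: Delete the conflicting key (NOT the entire DB). Connect to Redis via RedisInsight (over WireGuard VPN) and look for the conflicting key under the bull:* prefix. Prefer SCAN 0 MATCH bull:* over KEYS bull:* on production, since KEYS blocks Redis while it walks the whole keyspace. Confirm the key's type with TYPE, then delete it with DEL.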

Warning: Never run FLUSHALL on production Redis — it clears all BullMQ queues.

Shopify webhook queue backlog building up

Root cause: fastify-nova is receiving webhooks faster than events-consumer can process them, or consumer is down.

bash — check queue depth via Redis

# Access Redis via RedisInsight over WireGuard VPN, or:
redis-cli -h ELASTICACHE_ENDPOINT -p 6379 -a AUTH_TOKEN --tls \
  LLEN "bull:SHOPIFY_WEBHOOK:wait"

Fix:

Scale up events-consumer desired task count (it's fixed at 1 — change in Terraform events_consumer_desired_count)
The job processor uses batch inserts — increase batch size or concurrency in the worker config

Service-Specific Errors

fastify-nova

Firebase auth error — Error: Service account object must contain a string "project_id" field

Root cause: The Firebase service account JSON stored in Secrets Manager is malformed or missing.

Fix:

Download the service account JSON from Firebase Console → Project Settings → Service Accounts
Store it as a JSON-escaped string in the FIREBASE_SERVICE_ACCOUNT key in zelly/fastify-nova/env
The value must be the entire JSON object as a single-line string: {"type":"service_account","project_id":"..."}
Force-redeploy fastify-nova after updating the secret
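Two small helpers illustrate the single-line requirement and give a clearer failure than the Firebase SDK's message. Both `toSingleLine` and `parseServiceAccount` are hypothetical names; this is a sketch of the idea, not fastify-nova's actual loading code.

```javascript
// Convert the downloaded (pretty-printed) JSON into the single-line form.
function toSingleLine(jsonText) {
  return JSON.stringify(JSON.parse(jsonText));
}

// Parse FIREBASE_SERVICE_ACCOUNT and fail with a message that says what is wrong.
function parseServiceAccount(raw) {
  if (!raw) throw new Error('FIREBASE_SERVICE_ACCOUNT is empty or not set');
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`FIREBASE_SERVICE_ACCOUNT is not valid JSON: ${err.message}`);
  }
  // A doubly-encoded value parses to a string first; unwrap it once.
  if (typeof parsed === 'string') parsed = JSON.parse(parsed);
  if (!parsed || typeof parsed.project_id !== 'string') {
    throw new Error('FIREBASE_SERVICE_ACCOUNT has no string "project_id" field');
  }
  return parsed;
}
```

Running the downloaded file through `toSingleLine` before pasting it into Secrets Manager avoids the most common mistake, which is a value broken by literal newlines.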

Razorpay BAD_REQUEST_ERROR: Signature Verification Failed

Root cause: RAZORPAY_KEY_SECRET in Secrets Manager doesn't match the key secret in the Razorpay dashboard.

Fix: Log into the Razorpay dashboard, copy the key secret, update zelly/fastify-nova/env in Secrets Manager, force-redeploy.
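To confirm a suspected key mismatch, the expected signature can be recomputed by hand. Razorpay's payment signature is an HMAC-SHA256 (hex) over the order ID and payment ID joined by a pipe, keyed with the key secret. The function names below are illustrative, and whether fastify-nova verifies exactly this way is an assumption.

```javascript
const crypto = require('node:crypto');

// Signature Razorpay is expected to send for this order/payment pair.
function expectedRazorpaySignature(orderId, paymentId, keySecret) {
  return crypto
    .createHmac('sha256', keySecret)
    .update(`${orderId}|${paymentId}`)
    .digest('hex');
}

// Constant-time comparison against the signature received from the client.
function isValidRazorpaySignature(orderId, paymentId, signature, keySecret) {
  const expected = Buffer.from(expectedRazorpaySignature(orderId, paymentId, keySecret));
  const received = Buffer.from(String(signature));
  return expected.length === received.length
    && crypto.timingSafeEqual(expected, received);
}
```

If a real order/payment pair verifies with the key secret copied from the dashboard but not with the value in Secrets Manager, the secret has drifted.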

Shopify webhook signature verification failing

Root cause: SHOPIFY_WEBHOOK_SECRET doesn't match the secret set in the Shopify Partner Dashboard.

Fix: Regenerate the webhook secret in Shopify Partner Dashboard, update the secret, redeploy. Also check that the webhook URL is correct (should point to the ALB endpoint for fastify-nova).
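Shopify signs each webhook with a base64-encoded HMAC-SHA256 of the raw request body, sent in the X-Shopify-Hmac-Sha256 header. A minimal verification sketch is shown below; `isValidShopifyWebhook` is a hypothetical helper, not the service's actual handler.

```javascript
const crypto = require('node:crypto');

// rawBody must be the unparsed request body (Buffer or string). Verifying
// against re-serialized JSON fails even when the secret is correct.
function isValidShopifyWebhook(rawBody, hmacHeader, secret) {
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(String(hmacHeader), 'base64');
  return expected.length === received.length
    && crypto.timingSafeEqual(expected, received);
}
```

When the secret is known to be correct and verification still fails, a body parser that consumed or re-encoded the body before the check is worth ruling out.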

customer-panel-neptune

Session not persisting — users logged out on every request

Root cause: SESSION_SECRET changes between deployments (if it's randomly generated), or Redis is not reachable for session storage.

Fix:

Ensure SESSION_SECRET is a fixed value in Secrets Manager (not dynamically generated)
Check Redis connectivity from the customer-panel container
If behind a load balancer, ensure sticky sessions are not required (session should be stored in Redis, not memory)

orion-backend

NestJS crash on startup — Cannot read properties of undefined (reading 'forRoot')

Root cause: TypeORM or NestJS module config is missing required environment variables (DB_HOST, DB_PASSWORD, etc.).

Fix: Check that all keys defined in zelly/orion/env are present. Compare with the secrets-schema.json in zelly-ops.

ClickHouse queries returning no data

Root cause: orion-backend reads from ClickHouse for analytics. If CH_HOST is wrong or the ClickHouse EC2 is down, queries silently return empty.

bash — test ClickHouse connectivity

# Via WireGuard VPN
curl -s 'http://10.0.x.x:8123/?query=SELECT+1'

Fix: Verify the ClickHouse EC2 is running (aws ec2 describe-instances --filters Name=tag:Name,Values=zelly-clickhouse) and Docker is running on it (ssh ec2-user@... "docker ps" via bastion).

events-consumer

ClickHouse insert failing — Code: 60. DB::Exception: Table X doesn't exist

Root cause: The ClickHouse schema hasn't been initialized, or CH_DATABASE points to the wrong database.

bash — check ClickHouse schema

# Via WireGuard VPN: list the tables to confirm the schema exists
curl -X POST http://10.0.x.x:8123/ \
  --data "SHOW TABLES FROM zelly_analytics"

Fix: Run the ClickHouse schema initialization script. Check store-events-consumer/src/clickhouse/schema.sql and apply it via HTTP API.

storefront-astro-titan

Astro SSR crash — TypeError: fetch failed on server-side API calls

Root cause: The storefront makes SSR API calls to fastify-nova via CORE_API_URL. If this env var points to a non-existent URL or fastify-nova is down, all SSR pages fail.

Fix:

Check CORE_API_URL in zelly/storefront/env — should be the internal ALB DNS for fastify-nova
From inside the ECS task network, test: curl http://fastify-nova-internal-alb.../health
Ensure fastify-nova is healthy before the storefront picks up traffic

Cloudflare Errors

Cloudflare Pages deployment failed — build error

Root cause: Build command failure, missing environment variable in Pages settings, or Node.js version mismatch.

Fix:

Check the Pages deployment log in Cloudflare Dashboard → Pages → Project → Deployments
Ensure all required env vars are set in Pages Settings → Environment Variables
Check the Node.js version: add NODE_VERSION=22 to environment variables
Test the build locally first: npm run build

ACM certificate stuck validating — PENDING_VALIDATION forever

Root cause: The Cloudflare DNS CNAME for ACM validation has proxied = true. ACM validation CNAMEs must be non-proxied (proxied = false).

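Fix: In the Cloudflare dashboard, find the ACM validation CNAME record (its name starts with an underscore followed by a random string, and its target ends in .acm-validations.aws) and set it to DNS-only (grey cloud). ACM usually validates within 30 minutes.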

Never proxy ACM validation CNAMEs. This is a hard rule — proxied breaks cert validation.

Cloudflare Worker returning 1101 — Worker threw an unhandled exception

Root cause: JavaScript error in the Worker code. Cloudflare swallows the actual error and shows 1101.

Fix:

Open Cloudflare Dashboard → Workers → zelly-checkout → Logs tab
Reproduce the request to see the real error
Or use: wrangler tail to stream live Worker logs

subdomain_base should always be zelly.in

Root cause: The Terraform variable subdomain_base, which controls the base domain for all ALB/service subdomains, was set to something other than zelly.in.

Symptoms: Services unreachable, ACM cert for wrong domain, DNS records pointing nowhere.

Fix: Check terraform.tfvars — subdomain_base = "zelly.in". Never set this to storego.in or any other domain.

General Debugging Tips

The 5-minute drill

When something is broken and you don't know where to start:

CloudWatch Logs: aws logs filter-log-events --log-group-name /zelly/ecs/SERVICE_NAME --start-time ... --query 'events[*].message' --output text | tail -50
ECS service events: aws ecs describe-services --cluster CLUSTER --services SERVICE --query 'services[0].events[0:5]'
Stopped tasks: aws ecs list-tasks --cluster CLUSTER --service-name SERVICE --desired-status STOPPED, then aws ecs describe-tasks on the returned ARNs to read stoppedReason
Secret values: Verify with aws secretsmanager get-secret-value --secret-id zelly/SERVICE/env
Compare environments: Use zelly-ops Compare tab to see staging vs production image tags and revisions side by side

CloudWatch Insights — useful queries

CloudWatch Logs Insights

# All errors in the last hour across all services
fields @timestamp, @logStream, @message
| filter @message like /(?i)(error|exception|fatal)/
| sort @timestamp desc
| limit 50

# Slow requests (if app logs request timing)
fields @timestamp, @message
| filter @message like /duration/
| parse @message "duration: *ms" as durationMs
| filter durationMs > 1000
| sort durationMs desc
| limit 20

# Health check failures
fields @timestamp, @message
| filter @message like /health/
| sort @timestamp desc
| limit 20

Useful environment variables checklist

Service	Critical env vars	Common mistakes
fastify-nova	`DB_HOST`, `REDIS_HOST`, `FIREBASE_SERVICE_ACCOUNT`	FIREBASE JSON malformed, REDIS without TLS
customer-panel	`DB_HOST`, `SESSION_SECRET`, `REDIS_HOST`	SESSION_SECRET randomly generated, Redis no TLS
orion-backend	`DB_HOST`, `CORS_ORIGINS`, `CH_HOST`	CORS_ORIGINS missing frontend domains
events-consumer	`REDIS_HOST`, `CH_HOST`, `CH_DATABASE`	CH schema not initialized
storefront	`CORE_API_URL`, `DB_HOST`	CORE_API_URL pointing to wrong env