Debugging & Error Reference

Detailed error scenarios, root causes, and step-by-step solutions organized by system layer. Each section includes the exact error message to grep for, what causes it, and how to fix it.

Quick orientation: Start by checking CloudWatch Logs for the failing service, then the ECS stopped task reason, then the Secrets Manager values. Those three steps resolve ~80% of issues.

ECS / Container Errors

ResourceInitializationError: unable to pull secrets or registry auth

Root cause: The ECS task execution role lacks permission to read from Secrets Manager, or the secret ARN referenced in the task definition doesn't exist in the correct region.

Diagnose:

bash
# Check the stopped task reason
aws ecs describe-tasks \
  --cluster zelly-staging \
  --tasks $(aws ecs list-tasks --cluster zelly-staging --desired-status STOPPED --query 'taskArns[0]' --output text) \
  --region ap-southeast-1 \
  --query 'tasks[0].stoppedReason'

# Verify the secret exists in the right region
aws secretsmanager describe-secret \
  --secret-id zelly/fastify-nova/env \
  --region ap-southeast-1
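A region mismatch is easy to miss by eye when comparing ARNs. The following Node.js sketch extracts the region from a secret ARN and compares it with the region expected for the environment. The function names and the sample ARN (account ID and suffix) are made up for illustration; only the staging/production region mapping comes from this page.

```javascript
// Region each environment is expected to use (staging vs production, per this page).
const EXPECTED_REGION = {
  staging: 'ap-southeast-1',
  production: 'ap-south-1',
};

// ARN layout: arn:partition:service:region:account-id:resource...
function regionFromArn(arn) {
  const parts = arn.split(':');
  if (parts.length < 6 || parts[0] !== 'arn') {
    throw new Error(`Not an ARN: ${arn}`);
  }
  return parts[3];
}

// Hypothetical helper: true when the secret lives in the environment's region.
function secretInExpectedRegion(arn, environment) {
  const expected = EXPECTED_REGION[environment];
  if (!expected) {
    throw new Error(`Unknown environment: ${environment}`);
  }
  return regionFromArn(arn) === expected;
}

// Example ARN with a placeholder account ID and suffix.
const arn = 'arn:aws:secretsmanager:ap-south-1:123456789012:secret:zelly/fastify-nova/env-AbCdEf';
console.log(secretInExpectedRegion(arn, 'staging'));    // false
console.log(secretInExpectedRegion(arn, 'production')); // true
```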

Fix:

  1. Confirm the secret exists in the environment's region (staging = ap-southeast-1, prod = ap-south-1)
  2. If missing, create it: aws secretsmanager create-secret --name zelly/fastify-nova/env --secret-string '{}' --region ap-southeast-1
  3. Check the ECS task execution role has secretsmanager:GetSecretValue on the secret ARN
  4. The Terraform module attaches this automatically; re-run terraform apply if the role is misconfigured

CannotPullContainerError: pull image manifest has been retried 5 time(s)

Root cause: ECS cannot pull the image from ECR. Either the task execution role can't authenticate to ECR, the image tag doesn't exist, or the task has no route to ECR because of a NAT gateway routing issue.

Diagnose:

bash
# Verify the image tag exists
aws ecr describe-images \
  --repository-name zelly/fastify-nova \
  --image-ids imageTag=staging-abc12345 \
  --region ap-south-1

# Check if NAT gateway is working (from a bastion in the VPC)
curl -s https://api.ecr.ap-south-1.amazonaws.com/ --max-time 5

Fix: Ensure the image tag you're deploying actually exists in ECR. Trigger a fresh CI build if the tag is missing. If the NAT gateway is down, check the VPC route table has a default route via NAT GW.

Task starts then immediately stops (exit code 1 or 2)

Root cause: Application crash on startup. Usually a missing required environment variable, syntax error in config, or port already in use.

Diagnose:

bash
# Get logs from the crashed container (adjust time range)
aws logs filter-log-events \
  --log-group-name /zelly/ecs/fastify-nova \
  --start-time $(( $(date +%s) * 1000 - 300000 )) \
  --region ap-southeast-1 \
  --query 'events[*].message' \
  --output text | tail -50

Common startup crash reasons:

| Log message | Cause | Fix |
| --- | --- | --- |
| Cannot read properties of undefined (reading 'X') | Env var not set | Check Secrets Manager, force-redeploy after populating |
| Error: connect ECONNREFUSED | DB/Redis not reachable | Check VPN, security group, Redis TLS config |
| listen EADDRINUSE :::3000 | Port conflict | Should not happen in ECS; check containerPort in task def |
| SyntaxError: Unexpected token | Bad JSON in env var | Check Secrets Manager values for malformed JSON |

ALB health check failing — service stuck at 0/N healthy

Root cause: The container is running but the ALB cannot reach the health check endpoint, or the app is not listening on the correct port.

Diagnose:

bash — check target group health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:ap-southeast-1:279391564627:targetgroup/fastify-nova-tg/xxx \
  --region ap-southeast-1

Common causes & fixes:

Exit code 137 — container OOM killed

Root cause: The container exceeded its memory limit and was killed with SIGKILL (exit code 137 = 128 + 9).

bash — check stopped task
aws ecs describe-tasks \
  --cluster zelly-production \
  --tasks TASK_ARN \
  --region ap-south-1 \
  --query 'tasks[0].{reason:stoppedReason,containers:containers[*].{name:name,exitCode:exitCode,reason:reason}}'

Fix: Increase the memory value in the ECS task definition. For fastify-nova, default is 2048 MB. Register a new task definition with higher memory and force-redeploy. Check for memory leaks if this happens repeatedly.
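Exit codes above 128 conventionally mean the process was terminated by a signal, with the signal number equal to the exit code minus 128. The sketch below spells out that arithmetic for the codes seen most often; the helper name and the small signal table are illustrative only.

```javascript
// Signal numbers for the two signals most often seen when a container is stopped.
const SIGNAL_NAMES = { 9: 'SIGKILL', 15: 'SIGTERM' };

// Hypothetical helper: turn a container exit code into a readable description.
function describeExitCode(code) {
  if (code === 0) return 'exited cleanly';
  if (code > 128) {
    const signal = code - 128;
    const name = SIGNAL_NAMES[signal] || 'unknown signal';
    return `terminated by signal ${signal} (${name})`;
  }
  return `application error (exit code ${code})`;
}

console.log(describeExitCode(137)); // terminated by signal 9 (SIGKILL)
console.log(describeExitCode(143)); // terminated by signal 15 (SIGTERM)
console.log(describeExitCode(1));   // application error (exit code 1)
```

An exit code of 137 only says that SIGKILL was delivered. The stopped-task reason returned by describe-tasks is what confirms an out-of-memory kill.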

Deployment stuck — new tasks never reach RUNNING

Root cause: Insufficient cluster capacity, task placement failure, or the new tasks are crashing before reaching RUNNING state.

bash
# Check deployment events
aws ecs describe-services \
  --cluster zelly-staging \
  --services fastify-nova \
  --region ap-southeast-1 \
  --query 'services[0].{events:events[0:10],deployments:deployments}'

Fix steps:

  1. Check events — look for service fastify-nova was unable to place a task
  2. For capacity issues: the cluster uses Fargate, so there is no capacity to provision. Check AWS service quotas.
  3. If tasks are crashing: see "Task starts then immediately stops" above
  4. Force the old deployment to drain: aws ecs update-service --cluster ... --service ... --desired-count 0 --region ..., then set desired count back

lifecycle { ignore_changes = [task_definition] } — Terraform reverts my deployment

Root cause: All ECS services have this lifecycle rule so Terraform never overwrites CI-deployed images. This is intentional.

What this means: After changing Terraform variables (e.g. memory, env vars), terraform apply registers a new task definition revision but does not force a redeployment. You must do that manually.

bash
aws ecs update-service \
  --cluster zelly-production \
  --service fastify-nova \
  --force-new-deployment \
  --region ap-south-1

Database Errors

ECONNREFUSED 10.0.x.x:3306 or connect ETIMEDOUT

Root cause: Can't reach Aurora. In production, Aurora is in a private subnet — requires WireGuard VPN. In staging, Aurora is publicly accessible.

Diagnose:

bash
# Production: check WireGuard is connected
wg show wg0

# Test connectivity to Aurora endpoint (get endpoint from Terraform outputs or AWS console)
nc -zv -w 5 aurora-prod-endpoint.cluster-xxx.ap-south-1.rds.amazonaws.com 3306

# Staging: test direct (no VPN needed)
mysql -h aurora-staging-endpoint.cluster-xxx.ap-southeast-1.rds.amazonaws.com \
  -u zellymaster -p -e "SELECT 1"

Fix:

Access denied for user 'zellymaster'@'10.x.x.x'

Root cause: Wrong password, wrong username, or the user doesn't have access to the target schema.

bash
# Check the secret value (read from Secrets Manager)
aws secretsmanager get-secret-value \
  --secret-id zelly/aurora/master \
  --region ap-south-1 \
  --query 'SecretString' --output text | python3 -m json.tool

Fix:

Too many connections — Aurora max_connections exceeded

Root cause: The application connection pool is exhausted, or too many services are connecting without pooling.

Aurora Serverless v2 max_connections scales with ACU. At 0.5 ACU ≈ 90 connections. At 8 ACU ≈ 1000+.
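Whether the pools can exhaust Aurora is a matter of arithmetic: every running task can open up to its pool maximum. A rough sketch of that budget check follows. The task counts and pool sizes are invented numbers, and the limit of 90 is the 0.5 ACU estimate above.

```javascript
// Hypothetical per-service settings: running task count and pool maximum per task.
const services = [
  { name: 'fastify-nova', tasks: 4, poolMax: 10 },
  { name: 'orion-backend', tasks: 2, poolMax: 10 },
  { name: 'customer-panel', tasks: 2, poolMax: 20 },
];

// Worst case: every task fills its pool at the same time.
function worstCaseConnections(serviceList) {
  return serviceList.reduce((sum, s) => sum + s.tasks * s.poolMax, 0);
}

const maxConnections = 90; // approximate limit at 0.5 ACU
const total = worstCaseConnections(services);

console.log(`${total} possible connections against a limit of ${maxConnections}`);
console.log(total > maxConnections ? 'over budget' : 'within budget');
```

With these sample numbers the worst case is 100 connections, which is over the 0.5 ACU estimate even though no single service looks excessive on its own.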

bash — check current connections
# Via WireGuard VPN (production) or direct (staging)
mysql -h AURORA_ENDPOINT -u zellymaster -p -e \
  "SELECT USER, HOST, COUNT(*) as cnt FROM information_schema.PROCESSLIST GROUP BY USER, HOST ORDER BY cnt DESC"

Fix:

  1. Check if Aurora has auto-scaled (should scale to handle load)
  2. Kill idle connections: KILL CONNECTION_ID;
  3. Reduce connection pool size in service env vars (DB_POOL_MAX or equivalent)
  4. Consider RDS Proxy if connection storms are recurring

SSL connection error: SSL is required

Root cause: Aurora requires SSL connections. The database client is not configured to use TLS.

Fix: Add SSL options to the database connection config:

Node.js (mysql2 / TypeORM)
// mysql2 / Sequelize
{
  ssl: { rejectUnauthorized: true }
}

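// TypeORM
{
  ssl: true,
  // rejectUnauthorized: false skips certificate verification; prefer supplying
  // the Amazon RDS CA bundle via ssl.ca and keeping verification on
  extra: { ssl: { rejectUnauthorized: false } }
}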

The environment variable DB_SSL=true is often what controls this. Check the Secrets Manager secret for the relevant service.

Migration failure — Table 'X' already exists or Migration X has already been run

Root cause: A migration was run against a database that's already in that state, or a partial migration left the schema in an inconsistent state.

bash
# Check migration state (TypeORM example)
mysql -h AURORA_ENDPOINT -u zellymaster -p -e \
  "SELECT * FROM astro_primary.migrations ORDER BY timestamp DESC LIMIT 10"

# Manually mark migration as run without executing (TypeORM)
mysql -h AURORA_ENDPOINT -u zellymaster -p -e \
  "INSERT INTO astro_primary.migrations (timestamp, name) VALUES (1700000000000, 'MigrationName1700000000000')"

Fix:

Aurora cold start — connection timeout on first request after idle period

Root cause: With a minimum capacity of 0 ACUs, Aurora Serverless v2 pauses when idle. The first request after a long idle period takes 10–30 seconds while the instance resumes and scales up.

Staging only — production is set to min 1 ACU.

Fix: Either set a minimum ACU capacity of 0.5 or 1 in staging (Terraform: serverlessv2_scaling_configuration.min_capacity), or implement a connection retry with backoff in the application.
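One way to implement the retry option is exponential backoff around whatever function opens the connection. This is a minimal sketch, not the services' actual code: connectWithRetry, its option names, and the default delays are assumptions, and the defaults should be tuned so the total wait covers the 10–30 second resume time.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `connect` until it resolves or `maxAttempts` is reached.
// `connect` is any async function, e.g. one that opens a database connection.
async function connectWithRetry(connect, { maxAttempts = 6, baseMs = 500, maxMs = 8000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs, maxMs));
      }
    }
  }
  throw lastError;
}

// Delays between attempts with the defaults: 15.5 s of waiting in total.
console.log([0, 1, 2, 3, 4].map((n) => backoffDelayMs(n))); // [ 500, 1000, 2000, 4000, 8000 ]
```

The waiting time is in addition to each attempt's own connection timeout, so the overall window is longer than the sum of the delays.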

Network & TLS Errors

ALB 502 Bad Gateway

Root cause: The ALB received a bad response from the target (container). The target is unhealthy, crashed mid-request, or the container exited.

Diagnose:

  1. Check CloudWatch logs for the service — look for crash or unhandled errors near the request time
  2. Check ECS stopped tasks for any containers that crashed
  3. Check the ALB access logs in S3 (if enabled) for the exact error

Fix: Usually means the service crashed. Check logs and fix the underlying app error. If the service is overloaded, increase desired task count.

ALB 503 Service Unavailable

Root cause: No healthy targets in the target group. All containers are unhealthy or there are zero running tasks.

bash
aws elbv2 describe-target-health \
  --target-group-arn TARGET_GROUP_ARN \
  --region ap-south-1

Common causes:

ALB 504 Gateway Timeout

Root cause: The application did not respond within the ALB idle timeout (default 60s).

Fix:

Caddy TLS — challenge failed or no valid ACME CA

Root cause: Caddy failed to obtain a Let's Encrypt certificate for a merchant's custom domain, most often because the domain's DNS is not pointing to the NLB Elastic IP.

Diagnose:

bash — check caddy logs
aws logs filter-log-events \
  --log-group-name /zelly/ecs/storefront \
  --filter-pattern "challenge" \
  --start-time $(( $(date +%s) * 1000 - 3600000 )) \
  --region ap-south-1 \
  --query 'events[*].message' --output text

Fix:

  1. Verify the merchant's domain A record points to the NLB Elastic IP (check terraform output nlb_elastic_ip)
  2. Wait for DNS propagation (up to 24h for some registrars)
  3. Check that the /allow-cert endpoint returns 200 for the domain (Caddy calls it before issuing a cert)
  4. Let's Encrypt rate limit: max 5 failed challenges per domain per hour. Wait before retrying.
  5. For testing, switch to Let's Encrypt staging CA temporarily to avoid rate limits

Caddy /allow-cert returns 403 — custom domain blocked

Root cause: The domain is not registered as a valid tenant domain in the astro_primary database. Caddy calls ${CORE_API_URL}/validate_tenant_domain/{domain} before issuing a cert.
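The decision behind that check reduces to a lookup: answer 200 when the requested hostname is a known tenant domain, 403 otherwise. The sketch below shows only that decision, with an in-memory set standing in for the astro_primary query. The function names, the normalisation rules, and the sample domains are assumptions, not the actual fastify-nova code.

```javascript
// Normalise a hostname before comparing: trim, lower-case, drop a trailing dot.
function normaliseDomain(domain) {
  return domain.trim().toLowerCase().replace(/\.$/, '');
}

// Hypothetical decision function: 200 lets Caddy request a certificate, 403 blocks it.
function allowCertStatus(domain, tenantDomains) {
  return tenantDomains.has(normaliseDomain(domain)) ? 200 : 403;
}

// Stand-in for the tenant domains stored in the database.
const tenantDomains = new Set(['shop.example.com', 'store.example.org']);

console.log(allowCertStatus('Shop.Example.com.', tenantDomains));   // 200
console.log(allowCertStatus('unknown.example.net', tenantDomains)); // 403
```

A 403 for a domain that should be valid can come down to how it was saved (upper-case letters, stray whitespace, a different subdomain), so compare the stored value character by character.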

Fix:

  1. Verify the merchant's domain is saved in the database
  2. Check fastify-nova logs around the validate_tenant_domain endpoint
  3. Ensure CORE_API_URL is correctly set in the storefront service secret

CORS error — blocked by CORS policy: No 'Access-Control-Allow-Origin'

Root cause: The CORS_ORIGINS variable on orion-backend doesn't include the requesting frontend origin.

Fix:

terraform — update variable
# terraform.tfvars (or staging equivalent)
orion_cors_origins = "https://admin.zelly.in,https://seller.zelly.in,http://localhost:5175,http://localhost:5173"

Then run terraform apply and force-redeploy orion-backend. Note: CORS_ORIGINS is a Terraform variable, not a Secrets Manager key.
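Origins are compared as exact strings, so a different scheme or port counts as a different origin. The sketch below illustrates this with the value shown above. How orion-backend actually parses CORS_ORIGINS is not documented here, so the comma-splitting and the function names are assumptions.

```javascript
// Split a comma-separated CORS_ORIGINS value into a set of exact origins.
function parseCorsOrigins(value) {
  return new Set(
    (value || '')
      .split(',')
      .map((origin) => origin.trim())
      .filter((origin) => origin.length > 0),
  );
}

// An origin matches only if scheme, host, and port are all identical.
function isOriginAllowed(origin, allowed) {
  return allowed.has(origin);
}

const allowed = parseCorsOrigins(
  'https://admin.zelly.in,https://seller.zelly.in,http://localhost:5175,http://localhost:5173',
);

console.log(isOriginAllowed('https://admin.zelly.in', allowed)); // true
console.log(isOriginAllowed('http://admin.zelly.in', allowed));  // false: scheme differs
console.log(isOriginAllowed('http://localhost:5174', allowed));  // false: port not listed
```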

BullMQ & Redis Errors

Error: connect ECONNREFUSED or connect ETIMEDOUT on Redis

Root cause: Redis TLS is not configured in the client, or the security group is blocking port 6379.

ElastiCache Redis requires TLS. Clients must pass tls: {}:

Node.js — correct Redis config
const Redis = require('ioredis');
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6379,
  password: process.env.REDIS_AUTH_TOKEN,
  tls: {},           // REQUIRED for ElastiCache TLS
  maxRetriesPerRequest: null,  // Required for BullMQ
  enableReadyCheck: false,
});

Fix: Ensure tls: {} is present. Check REDIS_HOST is the ElastiCache endpoint (not localhost). Locally, Redis runs in Docker without TLS.
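Because the local Docker Redis has no TLS while ElastiCache requires it, one option is to build the connection options from the environment instead of hard-coding tls. This is a sketch under assumptions: REDIS_TLS is a hypothetical flag, and buildRedisOptions is not an existing helper in the services.

```javascript
// Build ioredis connection options. `useTls` should be true for ElastiCache
// and false for the local Docker Redis.
function buildRedisOptions({ host, authToken, useTls }) {
  const options = {
    host,
    port: 6379,
    maxRetriesPerRequest: null, // required by BullMQ
    enableReadyCheck: false,
  };
  if (authToken) options.password = authToken;
  if (useTls) options.tls = {};
  return options;
}

// REDIS_TLS is a hypothetical flag; when unset, anything other than localhost gets TLS.
const host = process.env.REDIS_HOST || 'localhost';
const useTls = process.env.REDIS_TLS
  ? process.env.REDIS_TLS === 'true'
  : host !== 'localhost';

console.log(buildRedisOptions({ host, authToken: process.env.REDIS_AUTH_TOKEN, useTls }));
```

The resulting object can be passed to the ioredis constructor in place of the literal options shown above.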

Jobs stuck in waiting state — queue not draining

Root cause: The events-consumer worker is not running or is not connected to the same Redis instance.

bash — check consumer health
# Check if events-consumer is running
aws ecs describe-services \
  --cluster zelly-staging \
  --services events-consumer \
  --region ap-southeast-1 \
  --query 'services[0].{desired:desiredCount,running:runningCount}'

# Check consumer logs for errors
aws logs filter-log-events \
  --log-group-name /zelly/ecs/events-consumer \
  --start-time $(( $(date +%s) * 1000 - 600000 )) \
  --region ap-southeast-1 \
  --query 'events[*].message' --output text | tail -30

Fix: If the consumer is down, force-redeploy it. If it's running but not processing, check the queue name — BullMQ queue names must match exactly between producer and consumer (store-events).
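BullMQ stores each queue under Redis keys derived from the queue name, using the default prefix bull, so two names that differ in any character address different keys. The helper below only illustrates that naming and is not part of the BullMQ API; the misspelled consumer name is a made-up example.

```javascript
// Redis key BullMQ uses for a queue's waiting list, assuming the default "bull" prefix.
function waitKey(queueName, prefix = 'bull') {
  return `${prefix}:${queueName}:wait`;
}

const producerQueue = 'store-events';
const consumerQueue = 'store_events'; // hypothetical typo on the consumer side

console.log(waitKey(producerQueue)); // bull:store-events:wait
console.log(waitKey(consumerQueue)); // bull:store_events:wait
console.log(waitKey(producerQueue) === waitKey(consumerQueue)); // false: the consumer never sees the jobs
```

Defining the queue name once in a shared module and importing it on both sides removes this class of mismatch.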

Jobs failing repeatedly and landing in failed state

Root cause: The job processor is throwing an error. Once the configured number of attempts (the attempts job option) is exhausted, BullMQ moves the job to the failed set.

bash — find failed jobs
# Via Bull Studio (access over WireGuard VPN)
# VPN URL: http://10.0.x.x:3000 (check wg-config.json for bastion IP)
# Bull Studio shows failed jobs with their error stack traces

Fix:

  1. Check consumer logs for the error thrown during job processing
  2. Common causes: ClickHouse unreachable, bad data shape, missing env var
  3. Once fixed, retry failed jobs from Bull Studio UI or via BullMQ API

WRONGTYPE Operation against a key holding the wrong kind of value

Root cause: A Redis key collision — a non-BullMQ key exists at the same path as a queue's internal key, usually from a previous deployment with a different naming scheme.

Fix: Delete the conflicting key (NOT the entire DB). Connect to Redis via RedisInsight (over WireGuard VPN), find the conflicting key with KEYS bull:* (on a large keyspace prefer SCAN with MATCH bull:*, since KEYS blocks the server while it runs), and delete it.

Warning: Never run FLUSHALL on production Redis; it clears all BullMQ queues.

Shopify webhook queue backlog building up

Root cause: fastify-nova is receiving webhooks faster than events-consumer can process them, or the consumer is down.

bash — check queue depth via Redis
# Access Redis via RedisInsight over WireGuard VPN, or:
redis-cli -h ELASTICACHE_ENDPOINT -p 6379 -a AUTH_TOKEN --tls \
  LLEN "bull:SHOPIFY_WEBHOOK:wait"

Fix:

Service-Specific Errors

fastify-nova

Firebase auth error — Error: Service account object must contain a string "project_id" field

Root cause: The Firebase service account JSON stored in Secrets Manager is malformed or missing.
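Before redeploying, the stored value can be checked offline. The sketch below parses the string and reports which fields are missing. The field list is a subset of a standard service account key file, and checkServiceAccount is a hypothetical helper, not part of fastify-nova or the Firebase SDK.

```javascript
// Fields checked here are a subset of what a service account key file contains.
const REQUIRED_FIELDS = ['type', 'project_id', 'private_key', 'client_email'];

// Returns a list of problems; an empty list means the value looks usable.
function checkServiceAccount(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    return ['value is missing or empty'];
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return [`not valid JSON: ${err.message}`];
  }
  // A doubly escaped value parses to a string rather than an object.
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    return ['parsed value is not a JSON object'];
  }
  return REQUIRED_FIELDS
    .filter((field) => typeof parsed[field] !== 'string' || parsed[field] === '')
    .map((field) => `"${field}" is missing or not a string`);
}

// A truncated value: reports project_id, private_key, and client_email as missing.
console.log(checkServiceAccount('{"type":"service_account"}'));
// The value the service would actually see, if set in this shell.
console.log(checkServiceAccount(process.env.FIREBASE_SERVICE_ACCOUNT));
```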

Fix:

  1. Download the service account JSON from Firebase Console → Project Settings → Service Accounts
  2. Store it as a JSON-escaped string in the FIREBASE_SERVICE_ACCOUNT key in zelly/fastify-nova/env
  3. The value must be the entire JSON object as a single-line string: {"type":"service_account","project_id":"..."}
  4. Force-redeploy fastify-nova after updating the secret

Razorpay BAD_REQUEST_ERROR: Signature Verification Failed

Root cause: RAZORPAY_KEY_SECRET in Secrets Manager doesn't match the key secret in the Razorpay dashboard.

Fix: Log into the Razorpay dashboard, copy the key secret, update zelly/fastify-nova/env in Secrets Manager, force-redeploy.
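To see why a stale secret produces this error, it helps to reproduce the check locally. The sketch assumes the failing check is Razorpay's standard payment signature, a hex HMAC-SHA256 over the order ID and payment ID joined by a pipe. The IDs, secrets, and function names are placeholders.

```javascript
const crypto = require('node:crypto');

// Expected signature for a payment: hex HMAC-SHA256 of "<order_id>|<payment_id>".
function expectedPaymentSignature(orderId, paymentId, keySecret) {
  return crypto
    .createHmac('sha256', keySecret)
    .update(`${orderId}|${paymentId}`)
    .digest('hex');
}

// Constant-time comparison; timingSafeEqual requires equal-length buffers.
function isPaymentSignatureValid(orderId, paymentId, signature, keySecret) {
  const expected = Buffer.from(expectedPaymentSignature(orderId, paymentId, keySecret));
  const received = Buffer.from(String(signature));
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}

// Placeholder IDs and secrets, for illustration only.
const signature = expectedPaymentSignature('order_123', 'pay_456', 'correct-secret');
console.log(isPaymentSignatureValid('order_123', 'pay_456', signature, 'correct-secret')); // true
console.log(isPaymentSignatureValid('order_123', 'pay_456', signature, 'stale-secret'));   // false
```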

Shopify webhook signature verification failing

Root cause: SHOPIFY_WEBHOOK_SECRET doesn't match the secret set in the Shopify Partner Dashboard.

Fix: Regenerate the webhook secret in Shopify Partner Dashboard, update the secret, redeploy. Also check that the webhook URL is correct (should point to the ALB endpoint for fastify-nova).
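Shopify signs the raw request body with HMAC-SHA256 and sends the base64 digest in the X-Shopify-Hmac-Sha256 header. The sketch below reproduces that check with placeholder values; the function name is hypothetical and this is not the fastify-nova handler.

```javascript
const crypto = require('node:crypto');

// Compare the header against a base64 HMAC-SHA256 of the raw body bytes.
function isShopifyWebhookValid(rawBody, hmacHeader, webhookSecret) {
  const expected = crypto
    .createHmac('sha256', webhookSecret)
    .update(rawBody)
    .digest();
  const received = Buffer.from(String(hmacHeader), 'base64');
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}

// Placeholder body and secret.
const rawBody = Buffer.from('{"id":1}');
const header = crypto.createHmac('sha256', 'shared-secret').update(rawBody).digest('base64');

console.log(isShopifyWebhookValid(rawBody, header, 'shared-secret'));                  // true
console.log(isShopifyWebhookValid(Buffer.from('{"id": 1}'), header, 'shared-secret')); // false: body bytes differ
```

The digest must be computed over the bytes exactly as received. If the framework parses the JSON and the handler re-serialises it, verification can fail even with the correct secret.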

customer-panel-neptune

Session not persisting — users logged out on every request

Root cause: SESSION_SECRET changes between deployments (if it's randomly generated), or Redis is not reachable for session storage.

Fix:

orion-backend

NestJS crash on startup — Cannot read properties of undefined (reading 'forRoot')

Root cause: TypeORM or NestJS module config is missing required environment variables (DB_HOST, DB_PASSWORD, etc.).

Fix: Check that all keys defined in zelly/orion/env are present. Compare with the secrets-schema.json in zelly-ops.
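The comparison can be scripted. The sketch below assumes the schema boils down to a list of required key names, which may not match the real layout of secrets-schema.json; the key list, the sample secret, and missingKeys are illustrative.

```javascript
// Hypothetical list of keys the schema requires for this secret.
const requiredKeys = ['DB_HOST', 'DB_PASSWORD', 'CH_HOST'];

// Returns required keys that are absent or empty in the parsed secret.
function missingKeys(required, secret) {
  return required.filter((key) => {
    const value = secret[key];
    return value === undefined || value === null || String(value).trim() === '';
  });
}

// SecretString as returned by get-secret-value, with placeholder values.
const secret = JSON.parse('{"DB_HOST":"db.internal","DB_PASSWORD":""}');

console.log(missingKeys(requiredKeys, secret)); // [ 'DB_PASSWORD', 'CH_HOST' ]
```

Empty strings are treated as missing because an empty value typically fails in the same way as an absent one at startup.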

ClickHouse queries returning no data

Root cause: orion-backend reads from ClickHouse for analytics. If CH_HOST is wrong or the ClickHouse EC2 is down, queries silently return empty.

bash — test ClickHouse connectivity
# Via WireGuard VPN
curl -s 'http://10.0.x.x:8123/?query=SELECT+1'

Fix: Verify the ClickHouse EC2 is running (aws ec2 describe-instances --filters Name=tag:Name,Values=zelly-clickhouse) and Docker is running on it (ssh ec2-user@... "docker ps" via bastion).

events-consumer

ClickHouse insert failing — Code: 60. DB::Exception: Table X doesn't exist

Root cause: The ClickHouse schema hasn't been initialized, or CH_DATABASE points to the wrong database.

bash — check ClickHouse schema
# Via WireGuard VPN: list the tables in the analytics database
curl -X POST http://10.0.x.x:8123/ \
  --data "SHOW TABLES FROM zelly_analytics"

Fix: Run the ClickHouse schema initialization script. Check store-events-consumer/src/clickhouse/schema.sql and apply it via HTTP API.

storefront-astro-titan

Astro SSR crash — TypeError: fetch failed on server-side API calls

Root cause: The storefront makes SSR API calls to fastify-nova via CORE_API_URL. If this env var points to a non-existent URL or fastify-nova is down, all SSR pages fail.

Fix:

Cloudflare Errors

Cloudflare Pages deployment failed — build error

Root cause: Build command failure, missing environment variable in Pages settings, or Node.js version mismatch.

Fix:

  1. Check the Pages deployment log in Cloudflare Dashboard → Pages → Project → Deployments
  2. Ensure all required env vars are set in Pages Settings → Environment Variables
  3. Check the Node.js version: add NODE_VERSION=22 to environment variables
  4. Test the build locally first: npm run build

ACM certificate stuck validating — PENDING_VALIDATION forever

Root cause: The Cloudflare DNS CNAME for ACM validation has proxied = true. ACM validation CNAMEs must be non-proxied (proxied = false).

Fix: In the Cloudflare dashboard, find the ACM validation CNAME record (its target ends in acm-validations.aws) and set it to DNS-only (grey cloud). ACM will validate within 30 minutes.

Never proxy ACM validation CNAMEs. This is a hard rule: proxying breaks cert validation.

Cloudflare Worker returning 1101 — Worker threw an unhandled exception

Root cause: JavaScript error in the Worker code. Cloudflare swallows the actual error and shows 1101.

Fix:

  1. Open Cloudflare Dashboard → Workers → zelly-checkout → Logs tab
  2. Reproduce the request to see the real error
  3. Or use: wrangler tail to stream live Worker logs

subdomain_base should always be zelly.in

Root cause: The Terraform variable subdomain_base controls the base domain for all ALB/service subdomains. It must always be zelly.in.

Symptoms: Services unreachable, ACM cert for wrong domain, DNS records pointing nowhere.

Fix: Check terraform.tfvars: subdomain_base = "zelly.in". Never set this to storego.in or any other domain.

General Debugging Tips

The 5-minute drill

When something is broken and you don't know where to start:

  1. CloudWatch Logs: aws logs filter-log-events --log-group-name /zelly/ecs/SERVICE_NAME --start-time ... --query 'events[*].message' --output text | tail -50
  2. ECS service events: aws ecs describe-services --cluster CLUSTER --services SERVICE --query 'services[0].events[0:5]'
  3. Stopped tasks: aws ecs list-tasks --cluster CLUSTER --service-name SERVICE --desired-status STOPPED, then aws ecs describe-tasks on the returned task ARNs
  4. Secret values: Verify with aws secretsmanager get-secret-value --secret-id zelly/SERVICE/env
  5. Compare environments: Use zelly-ops Compare tab to see staging vs production image tags and revisions side by side
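The --start-time value elided in step 1 is a Unix timestamp in milliseconds. For scripts that build the command from Node.js, a small helper (the name is hypothetical) gives the same window as the shell arithmetic used elsewhere on this page:

```javascript
// Epoch milliseconds for `minutesAgo` minutes before `nowMs`, the unit --start-time expects.
function startTimeMs(minutesAgo, nowMs = Date.now()) {
  return nowMs - minutesAgo * 60 * 1000;
}

// Same window as $(( $(date +%s) * 1000 - 300000 )): the last five minutes.
console.log(startTimeMs(5));
```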

CloudWatch Insights — useful queries

CloudWatch Logs Insights
# All errors in the last hour across all services
fields @timestamp, @logStream, @message
| filter @message like /(?i)(error|exception|fatal)/
| sort @timestamp desc
| limit 50

# Slow requests (if app logs request timing)
fields @timestamp, @message
| filter @message like /duration/
| parse @message "duration: *ms" as durationMs
| filter durationMs > 1000
| sort durationMs desc
| limit 20

# Health check failures
fields @timestamp, @message
| filter @message like /health/
| sort @timestamp desc
| limit 20

Useful environment variables checklist

| Service | Critical env vars | Common mistakes |
| --- | --- | --- |
| fastify-nova | DB_HOST, REDIS_HOST, FIREBASE_SERVICE_ACCOUNT | FIREBASE JSON malformed, REDIS without TLS |
| customer-panel | DB_HOST, SESSION_SECRET, REDIS_HOST | SESSION_SECRET randomly generated, Redis no TLS |
| orion-backend | DB_HOST, CORS_ORIGINS, CH_HOST | CORS_ORIGINS missing frontend domains |
| events-consumer | REDIS_HOST, CH_HOST, CH_DATABASE | CH schema not initialized |
| storefront | CORE_API_URL, DB_HOST | CORE_API_URL pointing to wrong env |