Debugging & Error Reference
Detailed error scenarios, root causes, and step-by-step solutions organized by system layer. Each section includes the exact error message to grep for, what causes it, and how to fix it.
ECS / Container Errors
ResourceInitializationError: unable to pull secrets or registry auth
Root cause: The ECS task execution role lacks permission to read from Secrets Manager, or the secret ARN referenced in the task definition doesn't exist in the task's region.
Diagnose:
# Check the stopped task reason
aws ecs describe-tasks \
  --cluster zelly-staging \
  --tasks $(aws ecs list-tasks --cluster zelly-staging --desired-status STOPPED --region ap-southeast-1 --query 'taskArns[0]' --output text) \
  --region ap-southeast-1 \
  --query 'tasks[0].stoppedReason'

# Verify the secret exists in the right region
aws secretsmanager describe-secret \
  --secret-id zelly/fastify-nova/env \
  --region ap-southeast-1
Fix:
- Confirm the secret exists in the environment's region (staging = ap-southeast-1, prod = ap-south-1)
- If missing, create it:
  aws secretsmanager create-secret --name zelly/fastify-nova/env --secret-string '{}' --region ap-southeast-1
- Check the ECS task execution role has secretsmanager:GetSecretValue on the secret ARN
- The Terraform module attaches this automatically; re-run terraform apply if the role is misconfigured
CannotPullContainerError: pull image manifest has been retried 5 time(s)
Root cause: ECS could not pull the image from ECR. Either the task execution role can't authenticate to ECR, the image tag doesn't exist, or the task has no network route to ECR (for example, a NAT gateway routing issue).
Common causes:
- Image tag in the task definition doesn't exist in ECR (check with aws ecr describe-images)
- ECR repo is in ap-south-1 but the task is in a region without a VPC endpoint, so the pull requires the NAT GW
- Task execution role is missing ecr:GetAuthorizationToken / ecr:BatchGetImage
# Verify the image tag exists
aws ecr describe-images \
  --repository-name zelly/fastify-nova \
  --image-ids imageTag=staging-abc12345 \
  --region ap-south-1

# Check if NAT gateway is working (from a bastion in the VPC)
curl -s https://api.ecr.ap-south-1.amazonaws.com/ --max-time 5
Fix: Ensure the image tag you're deploying actually exists in ECR. Trigger a fresh CI build if the tag is missing. If the NAT gateway is down, check the VPC route table has a default route via NAT GW.
Task starts then immediately stops (exit code 1 or 2)
Root cause: Application crash on startup. Usually a missing required environment variable, syntax error in config, or port already in use.
Diagnose:
# Get logs from the crashed container (adjust time range)
aws logs filter-log-events \
  --log-group-name /zelly/ecs/fastify-nova \
  --start-time $(( $(date +%s) * 1000 - 300000 )) \
  --region ap-southeast-1 \
  --query 'events[*].message' \
  --output text | tail -50
Common startup crash reasons:
| Log message | Cause | Fix |
|---|---|---|
| Cannot read properties of undefined (reading 'X') | Env var not set | Check Secrets Manager, force-redeploy after populating |
| Error: connect ECONNREFUSED | DB/Redis not reachable | Check VPN, security group, Redis TLS config |
| listen EADDRINUSE :::3000 | Port conflict | Should not happen in ECS; check containerPort in task def |
| SyntaxError: Unexpected token | Bad JSON in env var | Check Secrets Manager values for malformed JSON |
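A missing variable usually surfaces as the cryptic "Cannot read properties of undefined" error far from where the value was expected. One way to get an actionable message instead is to validate required variables before anything else runs. The sketch below is illustrative and not the services' actual startup code: REQUIRED_ENV, findMissingEnv and assertEnv are hypothetical names, and the variable list is borrowed from fastify-nova's row in the environment variables checklist.

```javascript
// Hypothetical list of variables a service needs before it can boot.
const REQUIRED_ENV = ['DB_HOST', 'REDIS_HOST', 'FIREBASE_SERVICE_ACCOUNT'];

// Return the names that are unset or blank in the given environment object.
function findMissingEnv(required, env) {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || String(value).trim() === '';
  });
}

// Throw a single readable error that lists every missing variable.
function assertEnv(required, env) {
  const missing = findMissingEnv(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

// Usage at the very top of the entry point:
// assertEnv(REQUIRED_ENV, process.env);
```

With a check like this, the container still exits on startup, but the CloudWatch log names the missing keys directly.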
ALB health check failing — service stuck at 0/N healthy
Root cause: The container is running but the ALB cannot reach the health check endpoint, or the app is not listening on the correct port.
Diagnose:
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:ap-southeast-1:279391564627:targetgroup/fastify-nova-tg/xxx \
  --region ap-southeast-1
Common causes & fixes:
- App not listening on correct port: fastify-nova should listen on 0.0.0.0:3000. Check the HOST env var; some apps default to 127.0.0.1, which the ALB can't reach.
- Health check path wrong: Check the ALB target group health check path. It should be /health or /, returning 200.
- App takes too long to start: Increase the health check grace period in the ECS service (Terraform: health_check_grace_period_seconds).
- Security group blocking: The ALB SG must be in the inbound rules of the task SG on the container port.
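To make the bind address explicit rather than relying on a framework default, the listen options can be derived from the environment in one place. This is a minimal sketch under assumptions: HOST is the variable named above, PORT is an assumed variable name, and resolveListenOptions is a hypothetical helper, not code from fastify-nova.

```javascript
// Resolve the address and port to bind to from environment variables.
// HOST is named in this guide; PORT is an assumed variable name.
function resolveListenOptions(env) {
  const port = Number.parseInt(env.PORT || '3000', 10);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT value: ${env.PORT}`);
  }
  // Default to all interfaces so the ALB can reach the container.
  return { host: env.HOST || '0.0.0.0', port };
}

// Example wiring with Fastify, whose listen() accepts { port, host }:
// await fastify.listen(resolveListenOptions(process.env));
```

If HOST is set to 127.0.0.1 in the secret, this helper will still honour it, so the value in Secrets Manager remains the first thing to check.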
Exit code 137 — container OOM killed
Root cause: The container exceeded its memory limit and was killed with SIGKILL (exit code 137 = 128 + signal 9).
aws ecs describe-tasks \
--cluster zelly-production \
--tasks TASK_ARN \
--region ap-south-1 \
--query 'tasks[0].{reason:stoppedReason,containers:containers[*].{name:name,exitCode:exitCode,reason:reason}}'
Fix: Increase the memory value in the ECS task definition. For fastify-nova, default is 2048 MB. Register a new task definition with higher memory and force-redeploy. Check for memory leaks if this happens repeatedly.
Deployment stuck — new tasks never reach RUNNING
Root cause: Insufficient cluster capacity, task placement failure, or the new tasks are crashing before reaching RUNNING state.
# Check deployment events
aws ecs describe-services \
--cluster zelly-staging \
--services fastify-nova \
--region ap-southeast-1 \
--query 'services[0].{events:events[0:10],deployments:deployments}'
Fix steps:
- Check events: look for "service fastify-nova was unable to place a task"
- For capacity issues: the cluster uses Fargate, so there is no capacity to provision. Check AWS service quotas.
- If tasks are crashing: see "Task starts then immediately stops" above
- Force the old deployment to drain:
  aws ecs update-service --cluster ... --service ... --desired-count 0 --region ...
  then set the desired count back
lifecycle { ignore_changes = [task_definition] } — Terraform reverts my deployment
Root cause: All ECS services have this lifecycle rule so Terraform never overwrites CI-deployed images. This is intentional.
What this means: After changing Terraform variables (e.g. memory, env vars), terraform apply registers a new task definition revision but does not force a redeployment. You must do that manually.
aws ecs update-service \
  --cluster zelly-production \
  --service fastify-nova \
  --force-new-deployment \
  --region ap-south-1
Database Errors
ECONNREFUSED 10.0.x.x:3306 or connect ETIMEDOUT
Root cause: Can't reach Aurora. In production, Aurora is in a private subnet — requires WireGuard VPN. In staging, Aurora is publicly accessible.
Diagnose:
# Production: check WireGuard is connected
wg show wg0

# Test connectivity to Aurora endpoint (get endpoint from Terraform outputs or AWS console)
# Options are placed before the host so the command also works with BSD/macOS nc
nc -zv -w 5 aurora-prod-endpoint.cluster-xxx.ap-south-1.rds.amazonaws.com 3306

# Staging: test direct (no VPN needed)
mysql -h aurora-staging-endpoint.cluster-xxx.ap-southeast-1.rds.amazonaws.com \
  -u zellymaster -p -e "SELECT 1"
Fix:
- Production: Connect to WireGuard VPN first. See VPN setup.
- ECS services: Check the task security group allows outbound to the Aurora security group on port 3306. Terraform handles this, but re-apply if SG rules are missing.
- Staging: Aurora is public — if it's refusing, the Aurora cluster may be paused (Serverless v2 cold start). Wait 10–30s and retry.
Access denied for user 'zellymaster'@'10.x.x.x'
Root cause: Wrong password, wrong username, or the user doesn't have access to the target schema.
# Check the secret value (read from Secrets Manager)
aws secretsmanager get-secret-value \
  --secret-id zelly/aurora/master \
  --region ap-south-1 \
  --query 'SecretString' --output text | python3 -m json.tool
Fix:
- Verify the password in Secrets Manager matches the Aurora master password set during provisioning
- If passwords drifted, update either the Aurora master password via RDS console or the secret value
- ECS does not hot-reload secrets — force-redeploy after updating the secret
Too many connections — Aurora max_connections exceeded
Root cause: The application connection pool is exhausted, or too many services are connecting without pooling.
Aurora Serverless v2 max_connections scales with ACU. At 0.5 ACU ≈ 90 connections. At 8 ACU ≈ 1000+.
# Via WireGuard VPN (production) or direct (staging)
mysql -h AURORA_ENDPOINT -u zellymaster -p -e \
  "SELECT USER, HOST, COUNT(*) as cnt FROM information_schema.PROCESSLIST GROUP BY USER, HOST ORDER BY cnt DESC"
Fix:
- Check if Aurora has auto-scaled (should scale to handle load)
- Kill idle connections:
  KILL CONNECTION_ID;
- Reduce connection pool size in service env vars (DB_POOL_MAX or equivalent)
- Consider RDS Proxy if connection storms are recurring
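When choosing a pool size, it helps to budget against the connection ceiling rather than guess. The sketch below shows the arithmetic only; poolMaxPerTask is a hypothetical helper, the reserve of 10 connections for admin and migration sessions is an assumption, and taskCount must include every running task of every service that connects to the cluster.

```javascript
// Split the usable connection budget evenly across all connecting tasks.
// maxConnections: the ceiling at the current capacity (e.g. roughly 90 at 0.5 ACU)
// taskCount:      total running tasks across all services using this cluster
// reserved:       connections held back for admin sessions (assumed value)
function poolMaxPerTask(maxConnections, taskCount, reserved = 10) {
  if (!Number.isInteger(taskCount) || taskCount < 1) {
    throw new RangeError('taskCount must be a positive integer');
  }
  const usable = Math.max(maxConnections - reserved, 0);
  return Math.floor(usable / taskCount);
}

// Four tasks sharing a 90-connection ceiling with 10 held in reserve.
console.log(poolMaxPerTask(90, 4)); // 20
```

If the result is lower than the current DB_POOL_MAX (or equivalent), the pools can collectively exceed the ceiling during a traffic spike or a rolling deployment, when old and new tasks overlap.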
SSL connection error: SSL is required
Root cause: Aurora requires SSL connections. The database client is not configured to use TLS.
Fix: Add SSL options to the database connection config:
// mysql2 / Sequelize
{
ssl: { rejectUnauthorized: true }
}
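// TypeORM
{
  ssl: true,
  // rejectUnauthorized: false turns off certificate verification; use it only as a
  // stopgap when the Amazon RDS cert chain is not trusted by the client
  extra: { ssl: { rejectUnauthorized: false } }
}
The environment variable DB_SSL=true is often what controls this. Check the Secrets Manager secret for the relevant service.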
Migration failure — Table 'X' already exists or Migration X has already been run
Root cause: A migration was run against a database that's already in that state, or a partial migration left the schema in an inconsistent state.
# Check migration state (TypeORM example)
mysql -h AURORA_ENDPOINT -u zellymaster -p -e \
  "SELECT * FROM astro_primary.migrations ORDER BY timestamp DESC LIMIT 10"

# Manually mark migration as run without executing (TypeORM)
INSERT INTO migrations (timestamp, name) VALUES (1700000000000, 'MigrationName1700000000000');
Fix:
- Never run migrations against production without testing on staging first
- If a migration is stuck, check for locked tables:
  SHOW ENGINE INNODB STATUS\G
- For Aurora Serverless, migrations that take >30s may hit the Aurora connection timeout; use NOWAIT hints or batch the migration
Aurora cold start — connection timeout on first request after idle period
Root cause: When idle, Aurora Serverless v2 scales down to its configured minimum capacity, which can be zero ACUs. The first request after a long idle period takes 10–30 seconds while the instance scales back up.
Staging only — production is set to min 1 ACU.
Fix: Either set a minimum ACU capacity of 0.5 or 1 in staging (Terraform: serverlessv2_scaling_configuration.min_capacity), or implement a connection retry with backoff in the application.
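A connection retry with exponential backoff could look like the sketch below. It is not taken from any of the services: connectWithRetry and backoffDelayMs are hypothetical names, and connect stands for whatever async function opens the database connection in the application.

```javascript
// Delay before the next try after failed attempt number `attempt` (0-based):
// baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 16000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `connect` until it succeeds or maxAttempts is reached.
// Rethrows the last error if every attempt fails.
async function connectWithRetry(connect, maxAttempts = 6, baseMs = 1000) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs));
      }
    }
  }
  throw lastError;
}
```

With the defaults, the waits between the six attempts are 1, 2, 4, 8 and 16 seconds (31 seconds in total), which spans the 10–30 second scale-up window described above.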
Network & TLS Errors
ALB 502 Bad Gateway
Root cause: The ALB received a bad response from the target (container). The target is unhealthy, crashed mid-request, or the container exited.
Diagnose:
- Check CloudWatch logs for the service — look for crash or unhandled errors near the request time
- Check ECS stopped tasks for any containers that crashed
- Check the ALB access logs in S3 (if enabled) for the exact error
Fix: Usually means the service crashed. Check logs and fix the underlying app error. If the service is overloaded, increase desired task count.
ALB 503 Service Unavailable
Root cause: No healthy targets in the target group. All containers are unhealthy or there are zero running tasks.
aws elbv2 describe-target-health \
  --target-group-arn TARGET_GROUP_ARN \
  --region ap-south-1
Common causes:
- ECS service desired count is 0 — scale up
- All tasks are failing health checks — fix the health check endpoint or grace period
- Deployment in progress — wait for the rolling update to complete
- ALB security group was changed and is now blocking the health check port
ALB 504 Gateway Timeout
Root cause: The application did not respond within the ALB idle timeout (default 60s).
Fix:
- Find and optimize the slow endpoint in the application logs
- Increase the ALB idle timeout in Terraform (idle_timeout on the ALB resource) if the request is legitimately long-running
- For DB queries, add indexes or optimize the query
Caddy TLS — challenge failed or no valid ACME CA
Root cause: Caddy failed to obtain a Let's Encrypt certificate for a merchant's custom domain. The most common reason is that the domain's DNS is not pointing to the NLB Elastic IP.
Diagnose:
aws logs filter-log-events \
  --log-group-name /zelly/ecs/storefront \
  --filter-pattern "challenge" \
  --start-time $(( $(date +%s) * 1000 - 3600000 )) \
  --region ap-south-1 \
  --query 'events[*].message' --output text
Fix:
- Verify the merchant's domain A record points to the NLB Elastic IP (check terraform output nlb_elastic_ip)
- Wait for DNS propagation (up to 24h for some registrars)
- Check that the /allow-cert endpoint returns 200 for the domain (Caddy calls it before issuing a cert)
- Let's Encrypt rate limit: max 5 failed challenges per domain per hour. Wait before retrying.
- For testing, switch to Let's Encrypt staging CA temporarily to avoid rate limits
Caddy /allow-cert returns 403 — custom domain blocked
Root cause: The domain is not registered as a valid tenant domain in the astro_primary database. Caddy calls ${CORE_API_URL}/validate_tenant_domain/{domain} before issuing a cert.
Fix:
- Verify the merchant's domain is saved in the database
- Check fastify-nova logs around the validate_tenant_domain endpoint
- Ensure CORE_API_URL is correctly set in the storefront service secret
CORS error — blocked by CORS policy: No 'Access-Control-Allow-Origin'
Root cause: The CORS_ORIGINS variable on orion-backend doesn't include the requesting frontend origin.
Fix:
# terraform.tfvars (or staging equivalent)
orion_cors_origins = "https://admin.zelly.in,https://seller.zelly.in,http://localhost:5175,http://localhost:5173"
Then run terraform apply and force-redeploy orion-backend. Note: CORS_ORIGINS is a Terraform variable, not a Secrets Manager key.
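Because the value is a single comma-separated string, a stray space or trailing slash is enough to make an origin fail to match. How orion-backend actually parses CORS_ORIGINS is not shown in this guide, so the following is only a sketch of one tolerant way to do it; parseCorsOrigins and isOriginAllowed are hypothetical names.

```javascript
// Split a comma-separated origin list, trimming whitespace and trailing
// slashes, and dropping empty entries.
function parseCorsOrigins(raw) {
  return (raw || '')
    .split(',')
    .map((origin) => origin.trim().replace(/\/+$/, ''))
    .filter(Boolean);
}

// Exact match against the allow-list: scheme, host and port must all agree.
function isOriginAllowed(origin, raw) {
  return parseCorsOrigins(raw).includes(origin);
}

const configured =
  'https://admin.zelly.in,https://seller.zelly.in,http://localhost:5175,http://localhost:5173';
console.log(isOriginAllowed('https://admin.zelly.in', configured)); // true
console.log(isOriginAllowed('http://admin.zelly.in', configured)); // false
```

The second call fails because the scheme differs, which is a common reason a frontend that "is in the list" still gets blocked.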
BullMQ & Redis Errors
Error: connect ECONNREFUSED or connect ETIMEDOUT on Redis
Root cause: Redis TLS is not configured in the client, or the security group is blocking port 6379.
ElastiCache Redis requires TLS. Clients must pass tls: {}:
const Redis = require('ioredis');
const redis = new Redis({
host: process.env.REDIS_HOST,
port: 6379,
password: process.env.REDIS_AUTH_TOKEN,
tls: {}, // REQUIRED for ElastiCache TLS
maxRetriesPerRequest: null, // Required for BullMQ
enableReadyCheck: false,
});
Fix: Ensure tls: {} is present. Check REDIS_HOST is the ElastiCache endpoint (not localhost). Locally, Redis runs in Docker without TLS.
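Since ElastiCache needs TLS and the local Docker Redis does not, one option is to build the client options from the environment so the same code runs in both places. This is a sketch under assumptions: REDIS_TLS is a hypothetical variable (TLS stays on unless it is set to the string "false"), and buildRedisOptions is an illustrative helper whose return value would be passed to new Redis(...).

```javascript
// Build ioredis connection options from environment variables.
// TLS is enabled unless REDIS_TLS is explicitly "false" (assumed variable).
function buildRedisOptions(env) {
  const options = {
    host: env.REDIS_HOST || '127.0.0.1',
    port: 6379,
    maxRetriesPerRequest: null, // required for BullMQ
    enableReadyCheck: false,
  };
  if (env.REDIS_AUTH_TOKEN) {
    options.password = env.REDIS_AUTH_TOKEN;
  }
  if (env.REDIS_TLS !== 'false') {
    options.tls = {}; // required for ElastiCache TLS
  }
  return options;
}

// Usage:
// const Redis = require('ioredis');
// const redis = new Redis(buildRedisOptions(process.env));
```

Defaulting to TLS on means a forgotten variable fails safe in staging and production; only a local setup has to opt out.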
Jobs stuck in waiting state — queue not draining
Root cause: The events-consumer worker is not running or is not connected to the same Redis instance.
# Check if events-consumer is running
aws ecs describe-services \
--cluster zelly-staging \
--services events-consumer \
--region ap-southeast-1 \
--query 'services[0].{desired:desiredCount,running:runningCount}'
# Check consumer logs for errors
aws logs filter-log-events \
--log-group-name /zelly/ecs/events-consumer \
--start-time $(( $(date +%s) * 1000 - 600000 )) \
--region ap-southeast-1 \
--query 'events[*].message' --output text | tail -30
Fix: If the consumer is down, force-redeploy it. If it's running but not processing, check the queue name: BullMQ queue names must match exactly between producer and consumer (store-events).
Jobs failing repeatedly and landing in failed state
Root cause: The job processor is throwing an error. Once the configured number of attempts (the attempts job option) is exhausted, BullMQ moves the job to the failed set.
# Via Bull Studio (access over WireGuard VPN)
# VPN URL: http://10.0.x.x:3000 (check wg-config.json for bastion IP)
# Bull Studio shows failed jobs with their error stack traces
Fix:
- Check consumer logs for the error thrown during job processing
- Common causes: ClickHouse unreachable, bad data shape, missing env var
- Once fixed, retry failed jobs from Bull Studio UI or via BullMQ API
WRONGTYPE Operation against a key holding the wrong kind of value
Root cause: A Redis key collision — a non-BullMQ key exists at the same path as a queue's internal key, usually from a previous deployment with a different naming scheme.
Fix: Flush the conflicting key (NOT the entire DB). Connect to Redis via RedisInsight (over WireGuard VPN), find the conflicting key with KEYS bull:*, and delete it.
Never run FLUSHALL on production Redis: it clears all BullMQ queues.
Shopify webhook queue backlog building up
Root cause: fastify-nova is receiving webhooks faster than events-consumer can process them, or consumer is down.
# Access Redis via RedisInsight over WireGuard VPN, or:
redis-cli -h ELASTICACHE_ENDPOINT -p 6379 -a AUTH_TOKEN --tls \
  LLEN "bull:SHOPIFY_WEBHOOK:wait"
Fix:
- Scale up the events-consumer desired task count (it's fixed at 1; change events_consumer_desired_count in Terraform)
- The job processor uses batch inserts; increase batch size or concurrency in the worker config
Service-Specific Errors
fastify-nova
Firebase auth error — Error: Service account object must contain a string "project_id" field
Root cause: The Firebase service account JSON stored in Secrets Manager is malformed or missing.
Fix:
- Download the service account JSON from Firebase Console → Project Settings → Service Accounts
- Store it as a JSON-escaped string in the FIREBASE_SERVICE_ACCOUNT key in zelly/fastify-nova/env
- The value must be the entire JSON object as a single-line string:
  {"type":"service_account","project_id":"..."}
- Force-redeploy fastify-nova after updating the secret
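Before force-redeploying, the stored value can be sanity-checked with a few lines of Node. The sketch below is an illustration, not fastify-nova code: parseServiceAccount is a hypothetical helper, and the three fields it checks are the ones a downloaded service account key normally carries.

```javascript
// Parse and validate the raw FIREBASE_SERVICE_ACCOUNT string.
// Throws a descriptive error instead of failing later inside the SDK.
function parseServiceAccount(raw) {
  if (!raw) {
    throw new Error('FIREBASE_SERVICE_ACCOUNT is empty or unset');
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`FIREBASE_SERVICE_ACCOUNT is not valid JSON: ${err.message}`);
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    throw new Error('FIREBASE_SERVICE_ACCOUNT must be a JSON object');
  }
  for (const field of ['project_id', 'client_email', 'private_key']) {
    if (typeof parsed[field] !== 'string' || parsed[field] === '') {
      throw new Error(`FIREBASE_SERVICE_ACCOUNT is missing a string "${field}" field`);
    }
  }
  return parsed;
}

// Usage:
// const serviceAccount = parseServiceAccount(process.env.FIREBASE_SERVICE_ACCOUNT);
```

A value that was pasted with line breaks inside the private key, or truncated by a shell quoting mistake, fails at the JSON.parse step with a message that points at the secret rather than at Firebase.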
Razorpay BAD_REQUEST_ERROR: Signature Verification Failed
Root cause: RAZORPAY_KEY_SECRET in Secrets Manager doesn't match the key secret in the Razorpay dashboard.
Fix: Log into the Razorpay dashboard, copy the key secret, update zelly/fastify-nova/env in Secrets Manager, force-redeploy.
Shopify webhook signature verification failing
Root cause: SHOPIFY_WEBHOOK_SECRET doesn't match the secret set in the Shopify Partner Dashboard.
Fix: Regenerate the webhook secret in Shopify Partner Dashboard, update the secret, redeploy. Also check that the webhook URL is correct (should point to the ALB endpoint for fastify-nova).
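When the secret is known to be correct and verification still fails, the usual culprit is hashing a parsed-and-reserialized body instead of the raw request bytes. Shopify signs the raw body with HMAC-SHA256 and sends the base64 digest in the X-Shopify-Hmac-Sha256 header. The sketch below shows that check in isolation; verifyShopifyHmac is a hypothetical helper, and how fastify-nova captures the raw body is not covered here.

```javascript
const crypto = require('crypto');

// Verify a Shopify webhook signature.
// rawBody:    the request body exactly as received (Buffer or string)
// hmacHeader: value of the X-Shopify-Hmac-Sha256 header (base64)
// secret:     the webhook secret (SHOPIFY_WEBHOOK_SECRET)
function verifyShopifyHmac(rawBody, hmacHeader, secret) {
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(hmacHeader || '', 'base64');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length
    && crypto.timingSafeEqual(received, expected);
}
```

Logging whether the header is present and the length of the raw body (never the secret itself) is often enough to tell a wrong secret from a body-parsing problem.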
customer-panel-neptune
Session not persisting — users logged out on every request
Root cause: SESSION_SECRET changes between deployments (if it's randomly generated), or Redis is not reachable for session storage.
Fix:
- Ensure SESSION_SECRET is a fixed value in Secrets Manager (not dynamically generated)
- Check Redis connectivity from the customer-panel container
- If behind a load balancer, ensure sticky sessions are not required (session should be stored in Redis, not memory)
orion-backend
NestJS crash on startup — Cannot read properties of undefined (reading 'forRoot')
Root cause: TypeORM or NestJS module config is missing required environment variables (DB_HOST, DB_PASSWORD, etc.).
Fix: Check that all keys defined in zelly/orion/env are present. Compare with the secrets-schema.json in zelly-ops.
ClickHouse queries returning no data
Root cause: orion-backend reads from ClickHouse for analytics. If CH_HOST is wrong or the ClickHouse EC2 is down, queries silently return empty.
# Via WireGuard VPN (URL quoted so the shell does not interpret the ?)
curl -s 'http://10.0.x.x:8123/?query=SELECT+1'
Fix: Verify the ClickHouse EC2 is running (aws ec2 describe-instances --filters Name=tag:Name,Values=zelly-clickhouse) and Docker is running on it (ssh ec2-user@... "docker ps" via bastion).
events-consumer
ClickHouse insert failing — Code: 60. DB::Exception: Table X doesn't exist
Root cause: The ClickHouse schema hasn't been initialized, or CH_DATABASE points to the wrong database.
# Via WireGuard VPN: list the tables that currently exist
curl -X POST http://10.0.x.x:8123/ \
  --data "SHOW TABLES FROM zelly_analytics"
Fix: Run the ClickHouse schema initialization script. Check store-events-consumer/src/clickhouse/schema.sql and apply it via HTTP API.
storefront-astro-titan
Astro SSR crash — TypeError: fetch failed on server-side API calls
Root cause: The storefront makes SSR API calls to fastify-nova via CORE_API_URL. If this env var points to a non-existent URL or fastify-nova is down, all SSR pages fail.
Fix:
- Check CORE_API_URL in zelly/storefront/env; it should be the internal ALB DNS for fastify-nova
- From inside the ECS task network, test:
  curl http://fastify-nova-internal-alb.../health
- Ensure fastify-nova is healthy before the storefront picks up traffic
Cloudflare Errors
Cloudflare Pages deployment failed — build error
Root cause: Build command failure, missing environment variable in Pages settings, or Node.js version mismatch.
Fix:
- Check the Pages deployment log in Cloudflare Dashboard → Pages → Project → Deployments
- Ensure all required env vars are set in Pages Settings → Environment Variables
- Check the Node.js version: add NODE_VERSION=22 to environment variables
- Test the build locally first:
  npm run build
ACM certificate stuck validating — PENDING_VALIDATION forever
Root cause: The Cloudflare DNS CNAME for ACM validation has proxied = true. ACM validation CNAMEs must be non-proxied (proxied = false).
Fix: In the Cloudflare dashboard, find the ACM validation CNAME record (its name starts with an underscore and its target ends in .acm-validations.aws) and set it to DNS-only (grey cloud). ACM typically validates within 30 minutes.
Cloudflare Worker returning 1101 — Worker threw an unhandled exception
Root cause: JavaScript error in the Worker code. Cloudflare swallows the actual error and shows 1101.
Fix:
- Open Cloudflare Dashboard → Workers → zelly-checkout → Logs tab
- Reproduce the request to see the real error
- Or use wrangler tail to stream live Worker logs
subdomain_base should always be zelly.in
Root cause: The Terraform variable subdomain_base controls the base domain for all ALB/service subdomains. It must always be zelly.in.
Symptoms: Services unreachable, ACM cert for wrong domain, DNS records pointing nowhere.
Fix: Check terraform.tfvars — subdomain_base = "zelly.in". Never set this to storego.in or any other domain.
General Debugging Tips
The 5-minute drill
When something is broken and you don't know where to start:
- CloudWatch Logs:
  aws logs filter-log-events --log-group-name /zelly/ecs/SERVICE_NAME --start-time ... --query 'events[*].message' --output text | tail -50
- ECS service events:
  aws ecs describe-services --cluster CLUSTER --services SERVICE --query 'services[0].events[0:5]'
- Stopped tasks:
  aws ecs list-tasks --cluster CLUSTER --service-name SERVICE --desired-status STOPPED
  then describe the returned task ARNs
- Secret values: Verify with
  aws secretsmanager get-secret-value --secret-id zelly/SERVICE/env
- Compare environments: Use zelly-ops Compare tab to see staging vs production image tags and revisions side by side
CloudWatch Insights — useful queries
# All errors in the last hour across all services
fields @timestamp, @logStream, @message
| filter @message like /(?i)(error|exception|fatal)/
| sort @timestamp desc
| limit 50

# Slow requests (if app logs request timing)
fields @timestamp, @message
| filter @message like /duration/
| parse @message "duration: *ms" as durationMs
| filter durationMs > 1000
| sort durationMs desc
| limit 20

# Health check failures
fields @timestamp, @message
| filter @message like /health/
| sort @timestamp desc
| limit 20
Useful environment variables checklist
| Service | Critical env vars | Common mistakes |
|---|---|---|
| fastify-nova | DB_HOST, REDIS_HOST, FIREBASE_SERVICE_ACCOUNT | FIREBASE JSON malformed, REDIS without TLS |
| customer-panel | DB_HOST, SESSION_SECRET, REDIS_HOST | SESSION_SECRET randomly generated, Redis no TLS |
| orion-backend | DB_HOST, CORS_ORIGINS, CH_HOST | CORS_ORIGINS missing frontend domains |
| events-consumer | REDIS_HOST, CH_HOST, CH_DATABASE | CH schema not initialized |
| storefront | CORE_API_URL, DB_HOST | CORE_API_URL pointing to wrong env |