Incident Response

This guide defines severity levels, response procedures, and communication templates for incidents affecting the Zelly platform. Every on-call engineer should know the P0 and P1 procedures before taking a shift.

Severity Levels

Severity	Definition	Response SLA	Escalation	Examples
P0 · Critical	Production completely down or data loss/security breach	15 minutes to acknowledge, all-hands	Immediately wake all engineers + founders	All services 503, DB data corruption, creds leaked
P1 · High	Major feature broken, ≥20% users affected, or revenue impacted	30 minutes to acknowledge	Alert on-call + tech lead	Checkout broken, payments failing, storefront 502, login broken
P2 · Medium	Single service degraded, <20% users affected	2 hours to acknowledge	Alert on-call only	Admin panel slow, analytics delayed, single merchant domain cert expired
P3 · Low	Minor degradation, non-production, or cosmetic	Next business day	Log as GitHub issue, no page	Staging environment flapping, UI typo, non-critical log spam
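
Read top to bottom, the table works as a decision rule. The sketch below is one possible encoding of it, not an official tool: the function name classify_severity and its four arguments are hypothetical, and borderline cases (for example, a cosmetic bug seen by every user) still need human judgment.

```bash
# classify_severity: map incident facts to a severity level from the table above.
# All argument names are hypothetical and exist only for this sketch.
#   $1  critical:   "yes" if production is fully down, data was lost, or security was breached
#   $2  production: "yes" if the problem is in production, "no" for staging and similar
#   $3  pct_users:  integer percentage of users affected (0-100)
#   $4  major:      "yes" if a major feature is broken or revenue is impacted
classify_severity() {
  critical=$1 production=$2 pct_users=$3 major=$4
  if [ "$critical" = "yes" ]; then
    echo "P0"
  elif [ "$production" != "yes" ]; then
    echo "P3"
  elif [ "$major" = "yes" ] || [ "$pct_users" -ge 20 ]; then
    echo "P1"
  elif [ "$pct_users" -gt 0 ]; then
    echo "P2"
  else
    echo "P3"
  fi
}
```

For example, `classify_severity no yes 5 no` describes a production problem touching 5% of users with no major feature down, which the table treats as P2.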

P0 · Critical — Production Down / Data Loss / Security

P0 = all-hands. Drop everything. Wake everyone if it's 3am. Time-to-resolution matters more than anything else.

P0-A: All services returning 503 / production completely down

Triage (0–5 min): Confirm the scope. Hit https://api.zelly.in/health, https://app.zelly.in, and the storefront. Determine whether it's one service or all.

Check ECS cluster health:

bash

aws ecs describe-services \
  --cluster zelly-production \
  --services fastify-nova customer-panel orion-backend storefront \
  --region ap-south-1 \
  --query 'services[*].{name:serviceName,desired:desiredCount,running:runningCount,events:events[0].message}'

Check ALB target health: If the running count is 0, the services have crashed; get logs immediately (see "Get crash logs" below). If tasks are running but requests still return 503, check ALB target group health.

Get crash logs:

bash

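# --query 'events[*].[message]' prints one message per line in text output,
# so tail can trim the result to the most recent 100 entries
aws logs filter-log-events \
  --log-group-name /zelly/ecs/fastify-nova \
  --start-time $(( $(date +%s) * 1000 - 600000 )) \
  --region ap-south-1 \
  --query 'events[*].[message]' --output text | tail -n 100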

Attempt rollback: If a bad deployment caused this, roll back immediately:

bash

# Get current task def revision
CURRENT=$(aws ecs describe-services --cluster zelly-production \
  --services fastify-nova --region ap-south-1 \
  --query 'services[0].taskDefinition' --output text)

# Roll back to the previous revision
# (assumes revision N-1 exists and is the last known-good one)
PREV="${CURRENT%:*}:$(( ${CURRENT##*:} - 1 ))"
aws ecs update-service --cluster zelly-production \
  --service fastify-nova --task-definition "$PREV" \
  --force-new-deployment --region ap-south-1

Scale check: If a memory or CPU spike caused the crash, scale up first and debug second: aws ecs update-service --cluster zelly-production --service fastify-nova --desired-count 4 --region ap-south-1
Communicate: Post a status update to the team within 10 minutes of declaring the P0 (see the P0 communication template below).
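
The revision arithmetic in the rollback step yields revision 0 when the service is still on its first revision, and it fails on any value that does not end in a number. A minimal sketch of a guarded version follows. The helper name prev_task_def is hypothetical, and the account ID in the usage line is a placeholder.

```bash
# prev_task_def: given a task definition ARN (or a family:revision string),
# print the same identifier with the revision decremented by one.
# Returns status 1 if the revision is not a number or has no predecessor.
prev_task_def() {
  current=$1
  family=${current%:*}      # everything before the final colon
  revision=${current##*:}   # everything after the final colon
  case $revision in
    ''|*[!0-9]*)
      echo "not a revision number: $revision" >&2
      return 1
      ;;
  esac
  if [ "$revision" -le 1 ]; then
    echo "no revision earlier than $revision" >&2
    return 1
  fi
  echo "$family:$((revision - 1))"
}

# Usage (placeholder ARN):
#   PREV=$(prev_task_def "arn:aws:ecs:ap-south-1:123456789012:task-definition/fastify-nova:42") || exit 1
```

This still assumes that revision N-1 is the one you want to run. Running `aws ecs describe-task-definition --task-definition "$PREV"` before the update confirms that the revision exists and has not been deregistered.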

P0-B: Database corruption or data loss

Stop all writes immediately before any further action.

Scale all ECS services to 0 desired count to stop writes: aws ecs update-service --cluster zelly-production --service SERVICE --desired-count 0 --region ap-south-1
Take an immediate Aurora snapshot: aws rds create-db-cluster-snapshot --db-cluster-identifier zelly-aurora-production --db-cluster-snapshot-identifier emergency-$(date +%Y%m%d-%H%M) --region ap-south-1
Connect via WireGuard VPN and assess the damage: mysql -h AURORA_ENDPOINT -u admin -p
Check Aurora automated backups: point-in-time restore can target any second within the last 7 days
Do NOT restore production in place — restore to a new cluster, verify, then cut over
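
Scaling to zero has to be repeated for every service that can write to the database. One way to avoid missing a service under pressure is to generate the commands first and review them before running anything. This is a sketch only: stop_writes_plan is a hypothetical helper, and the service list in the usage line is taken from the cluster health check earlier in this guide, so it may be incomplete.

```bash
# stop_writes_plan: print (do not run) the scale-to-zero command for each
# service name passed as an argument, so the plan can be reviewed first.
stop_writes_plan() {
  for service in "$@"; do
    printf 'aws ecs update-service --cluster zelly-production --service %s --desired-count 0 --region ap-south-1\n' "$service"
  done
}

# Usage: review the output, then pipe it to sh to execute it.
#   stop_writes_plan fastify-nova customer-panel orion-backend storefront
#   stop_writes_plan fastify-nova customer-panel orion-backend storefront | sh
```

It may also be worth recording each service's current desired count before scaling down, so the services can be restored to the same size after the cutover.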

P0-C: Security breach — credentials leaked or unauthorized access

Revoke immediately: Rotate the leaked credential first, ask questions later

bash — rotate leaked secret

# Rotate Aurora master password
aws rds modify-db-cluster \
  --db-cluster-identifier zelly-aurora-production \
  --master-user-password NEW_SECURE_PASSWORD \
  --apply-immediately \
  --region ap-south-1

# Update the secret to match
aws secretsmanager put-secret-value \
  --secret-id zelly/aurora/master \
  --secret-string '{"password":"NEW_SECURE_PASSWORD","username":"admin"}' \
  --region ap-south-1

If AWS access keys were leaked: deactivate via IAM immediately, then rotate
Check CloudTrail for unauthorized API calls: aws cloudtrail lookup-events --lookup-attributes AttributeKey=Username,AttributeValue=LEAKED_USER --region ap-south-1
Check Aurora audit logs for suspicious queries
After credentials rotated, force-redeploy all services to pick up new secrets
File a post-mortem within 24 hours
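
For the leaked AWS access key case, the steps above can be sketched as two small wrappers around the AWS CLI. The function names are hypothetical, and the user name and key ID in the usage lines are placeholders. CloudTrail event lookup only covers recent management events in the queried region, so treat an empty result as inconclusive.

```bash
# deactivate_access_key: mark a leaked IAM access key Inactive so that it
# stops authenticating. Delete it later, once a replacement is deployed.
#   $1  IAM user name that owns the key
#   $2  access key ID
deactivate_access_key() {
  aws iam update-access-key \
    --user-name "$1" \
    --access-key-id "$2" \
    --status Inactive
}

# key_activity: list CloudTrail events recorded for that access key ID.
#   $1  access key ID
key_activity() {
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=AccessKeyId,AttributeValue=$1" \
    --region ap-south-1
}

# Usage (placeholder values):
#   deactivate_access_key LEAKED_USER AKIAEXAMPLEKEYID
#   key_activity AKIAEXAMPLEKEYID
```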

P0 Communication Template

Slack / WhatsApp

🚨 P0 INCIDENT — [SHORT DESCRIPTION]

Status: INVESTIGATING / MITIGATING / RESOLVED
Started: [TIME UTC]
Impact: [What is broken, how many users affected]
Current action: [What we're doing right now]
Next update: [in X minutes]

CC: @everyone

P1 · High — Major Feature Broken / Revenue Impact

P1 = 30-minute acknowledgment. Revenue or major user flows affected. Immediately loop in tech lead.

P1-A: Checkout / payment flow broken

Check fastify-nova logs for Razorpay API errors, signature failures, or 5xx responses from the payment gateway
Verify RAZORPAY_KEY_ID and RAZORPAY_KEY_SECRET in Secrets Manager are correct and match the Razorpay dashboard
Check if Razorpay has an outage: status.razorpay.com
If a code change broke it: roll back fastify-nova to the previous task definition
Test checkout manually on a test order after the fix
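
The log check in the first step can be narrowed with a CloudWatch Logs filter pattern. The sketch below is an assumption about what the fastify-nova logs contain: the terms Razorpay, razorpay, and signature are guesses, filter patterns are case-sensitive, and payment_errors is a hypothetical helper name.

```bash
# payment_errors: print fastify-nova log lines from the last N seconds
# (default 900) that contain any of the listed terms.
# In a filter pattern, terms prefixed with ? are combined with OR.
payment_errors() {
  aws logs filter-log-events \
    --log-group-name /zelly/ecs/fastify-nova \
    --start-time "$(( ($(date +%s) - ${1:-900}) * 1000 ))" \
    --filter-pattern '?Razorpay ?razorpay ?signature' \
    --region ap-south-1 \
    --query 'events[*].[message]' --output text
}

# Usage: payment_errors 1800 | tail -n 50
```

If the service logs structured JSON, a JSON filter pattern on the relevant field would be more precise than these free-text terms.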

P1-B: Storefront 502/503 for all or many merchants

Check ECS storefront service running/desired counts
Check Caddy container logs — Caddy runs as a sidecar in the same ECS task
Check whether DNS is the problem: the NLB Elastic IPs are static and will not have changed, so verify that the affected domains still resolve to them
Check if fastify-nova (which storefront SSRs against) is healthy
If a Caddy config change caused it, roll back the storefront task definition
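
The running/desired comparison in the first step is easy to misread in raw JSON. A small filter, sketched here with the hypothetical name under_capacity, reduces the output to the services that are short of tasks. It expects one "name desired running" row per line, which is what the text output of the query in the usage line produces.

```bash
# under_capacity: read "name desired running" rows on stdin and print
# only the services whose running count is below the desired count.
under_capacity() {
  awk '$3 < $2 { print $1 ": running " $3 " of " $2 }'
}

# Usage:
#   aws ecs describe-services --cluster zelly-production \
#     --services storefront fastify-nova --region ap-south-1 \
#     --query 'services[*].[serviceName,desiredCount,runningCount]' \
#     --output text | under_capacity
```

No output means every listed service is running its desired number of tasks, which points the investigation toward Caddy, DNS, or the upstream fastify-nova health instead.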

P1-C: All users locked out (login/auth broken)

Check customer-panel logs for errors
Verify that the Firebase service account JSON stored in the zelly/fastify-nova/env secret is valid
Check if Redis (for sessions) is reachable from customer-panel
Check if any Firebase project settings changed (Console → Settings)

P1 Communication Template

Slack / WhatsApp

⚠️ P1 INCIDENT — [SHORT DESCRIPTION]

Status: INVESTIGATING / MITIGATING / RESOLVED
Started: [TIME UTC]
Impact: [e.g., "Checkout broken for all merchants", "~40% of login attempts failing"]
Current action: [What we're doing]
Next update: [in X minutes]

CC: @tech-lead @on-call

P2 · Medium — Single Service Degraded

P2 incidents affect a subset of users or a non-critical feature. Acknowledge within 2 hours, fix within the same business day where possible.

Common P2 scenarios

Admin panel (orion) slow or timing out

Check orion-backend CloudWatch metrics (CPU/memory)
Look for slow database queries in orion-backend logs
Check ClickHouse query performance — analytics reads can be slow
Scale up orion-backend desired count if it's memory pressure
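
The CPU and memory check in the first step can also be done from the CLI rather than the console. This is a sketch under the assumption that the service publishes the standard ECS service-level metrics; service_load is a hypothetical helper name.

```bash
# service_load: print average and maximum CPU and memory utilization for
# an ECS service over the last N minutes (default 60), in 5-minute buckets.
#   $1  service name
#   $2  look-back window in minutes (optional)
service_load() {
  service=$1 minutes=${2:-60}
  now=$(date +%s)
  for metric in CPUUtilization MemoryUtilization; do
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ECS --metric-name "$metric" \
      --dimensions Name=ClusterName,Value=zelly-production "Name=ServiceName,Value=$service" \
      --start-time "$((now - minutes * 60))" --end-time "$now" \
      --period 300 --statistics Average Maximum \
      --region ap-south-1
  done
}

# Usage: service_load orion-backend 30
```

Sustained memory utilization near 100% supports the scale-up in the last step, while high CPU with normal memory points back to slow queries.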

Analytics data delayed or missing

Check events-consumer is running: aws ecs describe-services --cluster zelly-production --services events-consumer --region ap-south-1
Check BullMQ queue depth via Bull Studio (WireGuard VPN required)
Check consumer logs for ClickHouse insert errors
Verify ClickHouse EC2 is running and Docker container is up

Single merchant's storefront unreachable

Check if the merchant's domain DNS is pointing to the NLB EIP
Check Caddy logs for TLS challenge failures for that domain
Check if the /allow-cert endpoint returns 200 for the domain
Verify the merchant's domain is in the tenant DB
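
The DNS check in the first step amounts to comparing the addresses a domain resolves to against the known NLB Elastic IPs. A minimal sketch follows. The helper name points_to_nlb, the example domain, and the IP addresses are all placeholders; the real EIPs are not listed in this guide.

```bash
# points_to_nlb: succeed only if at least one address was resolved and
# every resolved address appears in the expected list.
#   $1  resolved addresses, separated by spaces or newlines
#   $2  expected NLB Elastic IPs, separated by single spaces
points_to_nlb() {
  resolved=$1 expected=$2
  [ -n "$resolved" ] || return 1
  for ip in $resolved; do
    case " $expected " in
      *" $ip "*) ;;          # this address is one of the NLB EIPs
      *) return 1 ;;         # this address points somewhere else
    esac
  done
  return 0
}

# Usage (placeholder domain and addresses):
#   resolved=$(dig +short A shop.example.com | grep -E '^[0-9.]+$')
#   points_to_nlb "$resolved" "203.0.113.10 203.0.113.11" || echo "DNS mismatch"
```

The grep in the usage line drops CNAME targets that dig prints alongside addresses. A domain that sits behind a proxy resolves to the proxy's addresses, so a mismatch there is expected and not a fault.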

Shopify webhook processing delayed

Check BullMQ queue depth for SHOPIFY_WEBHOOK
Check events-consumer logs for processing errors
If queue is deeply backlogged, increase consumer concurrency temporarily
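
When Bull Studio is not at hand, queue depth can be read directly from Redis. The sketch below rests on several assumptions: BullMQ is using its default key prefix "bull", the queue is named exactly SHOPIFY_WEBHOOK, Redis needs no password or TLS from where you run it, and the threshold of 1000 is an arbitrary example. All function names and the host are hypothetical.

```bash
# bullmq_wait_key: Redis key of the list holding waiting jobs for a queue,
# assuming the default "bull" prefix.
bullmq_wait_key() {
  echo "bull:$1:wait"
}

# queue_depth: number of waiting jobs in a queue.
#   $1  Redis host
#   $2  queue name
queue_depth() {
  redis-cli -h "$1" LLEN "$(bullmq_wait_key "$2")"
}

# is_backlogged: succeed if depth ($1) is above the threshold ($2).
is_backlogged() {
  [ "$1" -gt "$2" ]
}

# Usage (placeholder host, example threshold):
#   depth=$(queue_depth REDIS_HOST SHOPIFY_WEBHOOK)
#   is_backlogged "$depth" 1000 && echo "backlog: $depth waiting jobs"
```

This counts only waiting jobs. Delayed and prioritized jobs are stored under separate keys and are not included.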

P3 · Low — Minor / Non-Production

P3s are logged as GitHub issues and triaged in the next sprint. No on-call page required.

Common P3 scenarios

Scenario	Action
Staging environment flapping or down	Log issue, fix during business hours. Staging downtime doesn't affect customers.
Non-critical log noise / warning spam	Log issue, fix in next sprint
Seller panel UI bug (non-blocking)	Log GitHub issue with a screenshot and reproduction steps
Staging CI pipeline red	Check GitHub Actions log, fix or rerun
Analytics numbers off by small margin	Investigate in next sprint
Internal docs typo or missing content	PR to internal.zelly.in

Post-Mortem Template

Every P0 and major P1 incident requires a post-mortem written within 24–48 hours. Blameless post-mortems focus on systems and processes, not individuals.

markdown — Post-Mortem Template

# Incident Post-Mortem: [Title]

**Date:** YYYY-MM-DD
**Severity:** P0 / P1
**Duration:** X hours Y minutes (HH:MM UTC → HH:MM UTC)
**Author:** [name]
**Status:** DRAFT / FINAL

## Summary
[1–3 sentences describing what broke, the user impact, and the root cause]

## Impact
- **Users affected:** [number or percentage]
- **Revenue impact:** [estimate if known]
- **Services affected:** [list]
- **Detection method:** [monitoring alert / user report / internal discovery]

## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | [first sign of issue] |
| HH:MM | [incident declared] |
| HH:MM | [diagnosis reached] |
| HH:MM | [mitigation applied] |
| HH:MM | [incident resolved] |

## Root Cause
[Technical explanation of what actually went wrong]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Something that worked during the response]

## What Went Poorly
- [Something that slowed us down]

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| [Preventive measure] | [name] | [date] |
| [Monitoring improvement] | [name] | [date] |
| [Documentation update] | [name] | [date] |

## Lessons Learned
[1–2 paragraphs on what the team learned from this incident]
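
The Duration field in the template is plain arithmetic on the two UTC timestamps, but it is easy to get wrong when an incident crosses midnight. A small sketch of the calculation follows. The function names are hypothetical, and it assumes the incident lasted less than 24 hours.

```bash
# to_minutes: convert HH:MM to minutes since midnight.
to_minutes() {
  h=${1%:*}; m=${1#*:}
  h=${h#0}; m=${m#0}    # drop one leading zero so 08 and 09 are not read as octal
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# incident_duration: print "X hours Y minutes" between two HH:MM times.
#   $1  start time (UTC)
#   $2  end time (UTC)
incident_duration() {
  start=$(to_minutes "$1"); end=$(to_minutes "$2")
  # An end time earlier than the start means the incident crossed midnight once.
  [ "$end" -ge "$start" ] || end=$((end + 1440))
  total=$((end - start))
  echo "$((total / 60)) hours $((total % 60)) minutes"
}

# Usage: incident_duration 03:12 05:47
```

For incidents longer than a day, compute the duration from full dates instead, since HH:MM alone cannot represent it.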

Escalation Contacts

Role	When to escalate	How
On-call engineer	Any P0/P1 alert, immediate	PagerDuty / direct message
Tech lead	P1 unresolved after 30 min, any P0	Phone call
Founders	P0, revenue impact >1hr, security breach	Phone call
AWS Support	AWS infrastructure issue (RDS, ECS not responding to API)	AWS Console → Support
Cloudflare Support	Cloudflare network issue affecting storefront	Cloudflare Dashboard → Support

Quick Links

Link	Contents	Location
ECS Runbook	Logs, debugging stopped tasks, force redeploy, scaling commands.	operations.html → Debugging
Error Reference	Specific error messages mapped to root causes and fixes.	debugging.html
Commands Cheatsheet	AWS CLI, ECS, Redis, ClickHouse, Terraform: one-liners for fast response.	cheatsheet.html
Secrets & VPN	Access credentials, connect WireGuard, update Secrets Manager.	operations.html → Secrets