Incident Response
This guide defines severity levels, response procedures, and communication templates for incidents affecting the Zelly platform. Every on-call engineer should know the P0 and P1 procedures before taking a shift.
Severity Levels
| Severity | Definition | Response SLA | Escalation | Examples |
|---|---|---|---|---|
| P0 · Critical | Production completely down or data loss/security breach | 15 minutes to acknowledge, all-hands | Immediately wake all engineers + founders | All services 503, DB data corruption, creds leaked |
| P1 · High | Major feature broken, ≥20% users affected, or revenue impacted | 30 minutes to acknowledge | Alert on-call + tech lead | Checkout broken, payments failing, storefront 502, login broken |
| P2 · Medium | Single service degraded, <20% users affected | 2 hours to acknowledge | Alert on-call only | Admin panel slow, analytics delayed, single merchant domain cert expired |
| P3 · Low | Minor degradation, non-production, or cosmetic | Next business day | Log as GitHub issue, no page | Staging environment flapping, UI typo, non-critical log spam |
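The rules in the table can be encoded as a small helper for alerting or triage scripts. The sketch below is one possible encoding, not an existing Zelly tool: the function name `classify_severity`, its arguments, and the `KIND` labels are hypothetical. Only the 20% threshold, the revenue rule, and the P0 to P3 labels come from the table.

```bash
# classify_severity KIND PERCENT_USERS REVENUE_IMPACT
#   KIND            outage | data-loss | security | feature | non-prod | cosmetic
#                   (hypothetical labels, chosen for this sketch)
#   PERCENT_USERS   integer 0-100, share of users affected
#   REVENUE_IMPACT  yes | no
classify_severity() {
  kind=$1
  pct=$2
  revenue=$3
  case "$kind" in
    outage|data-loss|security) echo P0; return ;;  # production down, data loss, breach
    non-prod|cosmetic)         echo P3; return ;;  # no customer impact
  esac
  # Remaining cases are customer-facing degradations: split on the 20% line.
  if [ "$pct" -ge 20 ] || [ "$revenue" = yes ]; then
    echo P1
  else
    echo P2
  fi
}
```

For example, `classify_severity feature 5 yes` prints `P1`, because revenue impact alone is enough to raise a small-blast-radius incident.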
P0 · Critical — Production Down / Data Loss / Security
P0 = all-hands. Drop everything. Wake everyone if it's 3am. Time-to-resolution matters more than anything else.
P0-A: All services returning 503 / production completely down
1. Triage (0–5 min): Confirm the scope. Hit `https://api.zelly.in/health`, `https://app.zelly.in`, and the storefront. Determine whether it's one service or all.
2. Check ECS cluster health:

   ```bash
   aws ecs describe-services \
     --cluster zelly-production \
     --services fastify-nova customer-panel orion-backend storefront \
     --region ap-south-1 \
     --query 'services[*].{name:serviceName,desired:desiredCount,running:runningCount,events:events[0].message}'
   ```
3. Check ALB target health: If running count is 0, services crashed. Get logs immediately (see step 4). If running but 503, check ALB target group health.
4. Get crash logs:

   ```bash
   # Last 10 minutes of logs. Text output joins messages with tabs,
   # so split them onto separate lines before taking the tail.
   aws logs filter-log-events \
     --log-group-name /zelly/ecs/fastify-nova \
     --start-time $(( $(date +%s) * 1000 - 600000 )) \
     --region ap-south-1 \
     --query 'events[*].message' --output text | tr '\t' '\n' | tail -100
   ```
5. Attempt rollback: If a bad deployment caused this, roll back immediately:

   ```bash
   # Get current task def revision
   CURRENT=$(aws ecs describe-services --cluster zelly-production \
     --services fastify-nova --region ap-south-1 \
     --query 'services[0].taskDefinition' --output text)

   # Roll back to previous revision
   PREV="${CURRENT%:*}:$(( ${CURRENT##*:} - 1 ))"
   aws ecs update-service --cluster zelly-production \
     --service fastify-nova --task-definition "$PREV" \
     --force-new-deployment --region ap-south-1
   ```
6. Scale check: If a memory/CPU spike caused the crash, scale up first, debug second:

   ```bash
   aws ecs update-service --cluster zelly-production --service fastify-nova \
     --desired-count 4 --region ap-south-1
   ```
7. Communicate: Post status to the team within 10 minutes of P0 declaration (see communication template below).
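The rollback step derives the previous task definition by subtracting one from the current revision number, which yields revision 0 when the service is still on its first revision. A guarded version of that arithmetic might look like the sketch below. The function name `previous_task_definition` is hypothetical, and the sketch assumes that revision N-1 is the last deployed version, which only holds if revisions are registered once per deploy and the earlier revision has not been deregistered.

```bash
# previous_task_definition TASK_DEF_ARN
# Prints the ARN with its revision decremented by one. Fails when the input
# does not end in ":<number>" or when there is no earlier revision.
previous_task_definition() {
  arn=$1
  family=${arn%:*}      # everything before the final ":<revision>"
  revision=${arn##*:}   # the trailing revision number
  case "$revision" in
    ''|*[!0-9]*)
      echo "not a task definition ARN: $arn" >&2
      return 1
      ;;
  esac
  if [ "$revision" -le 1 ]; then
    echo "revision $revision has no predecessor" >&2
    return 1
  fi
  echo "$family:$((revision - 1))"
}
```

In the rollback commands this would replace the inline arithmetic, for example `PREV=$(previous_task_definition "$CURRENT") || exit 1`, so that a failed lookup stops the script before `update-service` runs.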
P0-B: Database corruption or data loss
Stop all writes immediately before any further action.
- Scale all ECS services to 0 desired count to stop writes:

  ```bash
  # Run once per service, replacing SERVICE with the service name
  aws ecs update-service --cluster zelly-production --service SERVICE \
    --desired-count 0 --region ap-south-1
  ```
- Take an immediate Aurora snapshot:

  ```bash
  aws rds create-db-cluster-snapshot \
    --db-cluster-identifier zelly-aurora-production \
    --db-cluster-snapshot-identifier emergency-$(date +%Y%m%d-%H%M) \
    --region ap-south-1
  ```
- Connect via WireGuard VPN and assess the damage:

  ```bash
  mysql -h AURORA_ENDPOINT -u admin -p
  ```
- Check Aurora automated backups: the restore point can be any second within the last 7 days
- Do NOT restore production in place — restore to a new cluster, verify, then cut over
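Scaling every service to zero by hand is slow under pressure, so the first step could be scripted. The sketch below only prints the commands so they can be reviewed before anything runs. The function name `stop_writes_commands` and the `WRITE_SERVICES` list are assumptions: the list is collected from service names mentioned elsewhere in this guide and may not match the set of services that actually write to Aurora.

```bash
# Assumed list of services to stop; adjust to match the production cluster.
WRITE_SERVICES="fastify-nova customer-panel orion-backend storefront events-consumer"

# stop_writes_commands
# Prints one scale-to-zero command per service. Nothing is executed here.
stop_writes_commands() {
  for svc in $WRITE_SERVICES; do
    echo "aws ecs update-service --cluster zelly-production --service $svc --desired-count 0 --region ap-south-1"
  done
}
```

After checking the printed commands, `stop_writes_commands | sh` would execute them. Noting each service's desired count beforehand makes it easier to scale back up after the restore.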
P0-C: Security breach — credentials leaked or unauthorized access
- Revoke immediately: Rotate the leaked credential first, ask questions later

  ```bash
  # Rotate Aurora master password
  aws rds modify-db-cluster \
    --db-cluster-identifier zelly-aurora-production \
    --master-user-password NEW_SECURE_PASSWORD \
    --apply-immediately \
    --region ap-south-1

  # Update the secret to match
  aws secretsmanager put-secret-value \
    --secret-id zelly/aurora/master \
    --secret-string '{"password":"NEW_SECURE_PASSWORD","username":"admin"}' \
    --region ap-south-1
  ```
- If AWS access keys were leaked: deactivate via IAM immediately, then rotate
- Check CloudTrail for unauthorized API calls:

  ```bash
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=Username,AttributeValue=LEAKED_USER \
    --region ap-south-1
  ```
- Check Aurora audit logs for suspicious queries
- After the credentials are rotated, force-redeploy all services so they pick up the new secrets
- File a post-mortem within 24 hours
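The rotation commands use a `NEW_SECURE_PASSWORD` placeholder, and the same value has to go into both the cluster and the secret. One way to produce it is sketched below. Both function names are hypothetical. The password is limited to letters and digits on the assumption that this avoids shell-quoting problems and any punctuation RDS rejects in master passwords, and the JSON shape is copied from the `put-secret-value` call in this guide.

```bash
# generate_password [LENGTH]
# Prints a random alphanumeric password (default 32 characters).
generate_password() {
  len=${1:-32}
  # 4096 random bytes leave roughly 1000 alphanumeric characters after filtering.
  pw=$(head -c 4096 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c "1-$len")
  if [ "${#pw}" -ne "$len" ]; then
    echo "could not generate $len characters" >&2
    return 1
  fi
  echo "$pw"
}

# secret_json USERNAME PASSWORD
# Prints the secret payload. Only safe for values without quotes or
# backslashes, which holds for generate_password output.
secret_json() {
  printf '{"password":"%s","username":"%s"}\n' "$2" "$1"
}
```

A rotation script could then run `NEW=$(generate_password)` once and pass `"$NEW"` to `--master-user-password` and `"$(secret_json admin "$NEW")"` to `--secret-string`, so the two values cannot drift apart.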
P0 Communication Template
Slack / WhatsApp
```text
🚨 P0 INCIDENT — [SHORT DESCRIPTION]
Status: INVESTIGATING / MITIGATING / RESOLVED
Started: [TIME UTC]
Impact: [What is broken, how many users affected]
Current action: [What we're doing right now]
Next update: [in X minutes]
CC: @everyone
```
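Filling the template by hand during an incident invites skipped fields. The sketch below renders the same fields from arguments so that nothing is left out. The function name `incident_update` and its argument order are hypothetical, the headline is a plain-ASCII variant without the emoji, and the CC lines follow the P0 and P1 templates in this guide.

```bash
# incident_update SEVERITY DESCRIPTION STATUS STARTED_UTC IMPACT ACTION NEXT_MINUTES
# Prints a status update for pasting into Slack or WhatsApp.
incident_update() {
  sev=$1
  case "$sev" in
    P0) cc="@everyone" ;;
    P1) cc="@tech-lead @on-call" ;;
    *)
      echo "no template for severity: $sev" >&2
      return 1
      ;;
  esac
  printf '%s INCIDENT: %s\n' "$sev" "$2"
  printf 'Status: %s\n' "$3"
  printf 'Started: %s UTC\n' "$4"
  printf 'Impact: %s\n' "$5"
  printf 'Current action: %s\n' "$6"
  printf 'Next update: in %s minutes\n' "$7"
  printf 'CC: %s\n' "$cc"
}
```

For example, `incident_update P0 "API down" INVESTIGATING 02:15 "All services 503" "Rolling back fastify-nova" 15` prints a seven-line update ending in `CC: @everyone`.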
P1 · High — Major Feature Broken / Revenue Impact
P1 = 30-minute acknowledgment. Revenue or major user flows affected. Immediately loop in tech lead.
P1-A: Checkout / payment flow broken
- Check fastify-nova logs for Razorpay API errors, signature failures, or 5xx responses from the payment gateway
- Verify `RAZORPAY_KEY_ID` and `RAZORPAY_KEY_SECRET` in Secrets Manager are correct and match the Razorpay dashboard
- Check if Razorpay has an outage: status.razorpay.com
- If a code change broke it: roll back fastify-nova to the previous task definition
- Test checkout manually on a test order after the fix
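When the logs show signature failures, recomputing the expected signature by hand helps separate a wrong `RAZORPAY_KEY_SECRET` from a code bug. The sketch below assumes the scheme Razorpay documents for standard checkout, an HMAC-SHA256 over `<order_id>|<payment_id>` keyed with the key secret. Confirm that against the Razorpay documentation and the fastify-nova code before relying on it. Both function names are hypothetical, and `openssl` must be installed.

```bash
# hmac_sha256_hex SECRET MESSAGE
# Prints the lowercase hex HMAC-SHA256 of MESSAGE keyed with SECRET.
# Note: the secret is visible in the process list while openssl runs.
hmac_sha256_hex() {
  # openssl prints "<label>= <hex>"; the hex digest is always the last field.
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# expected_payment_signature ORDER_ID PAYMENT_ID KEY_SECRET
# Assumption: the signed payload is "<order_id>|<payment_id>".
expected_payment_signature() {
  hmac_sha256_hex "$3" "$1|$2"
}
```

If the value printed for a failing payment matches the signature the client sent, the secret is correct and the fault is more likely in how the service builds or compares the payload.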
P1-B: Storefront 502/503 for all or many merchants
- Check ECS storefront service running/desired counts
- Check Caddy container logs — Caddy runs as a sidecar in the same ECS task
- Rule out DNS: the NLB Elastic IPs are static and will not have changed, so confirm the DNS records still resolve to them
- Check if fastify-nova (which storefront SSRs against) is healthy
- If a Caddy config change caused it, roll back the storefront task definition
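The running/desired comparison in the first check can be turned into a one-word verdict, which is handy when looping over several services. This is a minimal sketch with a hypothetical function name and hypothetical state labels. The inputs are the `desiredCount` and `runningCount` values returned by `aws ecs describe-services`.

```bash
# service_state DESIRED RUNNING
# Classifies an ECS service from its desired and running task counts.
service_state() {
  desired=$1
  running=$2
  if [ "$desired" -eq 0 ]; then
    echo "scaled-to-zero"   # nothing is supposed to be running
  elif [ "$running" -eq 0 ]; then
    echo "down"             # tasks crashed or never started: get logs
  elif [ "$running" -lt "$desired" ]; then
    echo "degraded"         # some tasks missing: check recent events
  else
    echo "running"          # counts match: look at Caddy or the load balancer
  fi
}

# Example wiring (not executed here):
#   set -- $(aws ecs describe-services --cluster zelly-production \
#     --services storefront --region ap-south-1 \
#     --query 'services[0].[desiredCount,runningCount]' --output text)
#   service_state "$1" "$2"
```

A `running` verdict alongside 502/503 responses points away from ECS and toward the Caddy sidecar, DNS, or the fastify-nova dependency covered in the other checks.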
P1-C: All users locked out (login/auth broken)
- Check customer-panel logs for errors
- Verify Firebase service account JSON is valid in `zelly/fastify-nova/env`
- Check if Redis (for sessions) is reachable from customer-panel
- Check if any Firebase project settings changed (Console → Settings)
P1 Communication Template
Slack / WhatsApp
```text
⚠️ P1 INCIDENT — [SHORT DESCRIPTION]
Status: INVESTIGATING / MITIGATING / RESOLVED
Started: [TIME UTC]
Impact: [e.g., "Checkout broken for all merchants", "~40% of login attempts failing"]
Current action: [What we're doing]
Next update: [in X minutes]
CC: @tech-lead @on-call
```
P2 · Medium — Single Service Degraded
P2 incidents affect a subset of users or a non-critical feature. Acknowledge within 2 hours, fix within the same business day where possible.
Common P2 scenarios
Admin panel (orion) slow or timing out
- Check orion-backend CloudWatch metrics (CPU/memory)
- Look for slow database queries in orion-backend logs
- Check ClickHouse query performance — analytics reads can be slow
- Scale up orion-backend desired count if it's memory pressure
Analytics data delayed or missing
- Check events-consumer is running:

  ```bash
  aws ecs describe-services --cluster zelly-production --services events-consumer --region ap-south-1
  ```
- Check BullMQ queue depth via Bull Studio (WireGuard VPN required)
- Check consumer logs for ClickHouse insert errors
- Verify ClickHouse EC2 is running and Docker container is up
Single merchant's storefront unreachable
- Check if the merchant's domain DNS is pointing to the NLB EIP
- Check Caddy logs for TLS challenge failures for that domain
- Check if the `/allow-cert` endpoint returns 200 for the domain
- Verify the merchant's domain is in the tenant DB
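The DNS check in this list amounts to comparing what the merchant's domain resolves to against the NLB Elastic IPs. A minimal sketch of that comparison follows. The function name `dns_points_to` is hypothetical, and the addresses in the example are documentation-range placeholders, not the real EIPs.

```bash
# dns_points_to EXPECTED_IP...
# Reads resolved addresses on stdin, one per line, and succeeds only if at
# least one address was resolved and every address is an expected IP.
dns_points_to() {
  seen=0
  while IFS= read -r ip; do
    [ -n "$ip" ] || continue   # skip blank lines
    seen=1
    match=0
    for want in "$@"; do
      if [ "$ip" = "$want" ]; then
        match=1
      fi
    done
    if [ "$match" -eq 0 ]; then
      echo "unexpected address: $ip" >&2
      return 1
    fi
  done
  [ "$seen" -eq 1 ]
}

# Example wiring (not executed here; placeholder domain and IPs):
#   dig +short A shop.example.com | dns_points_to 203.0.113.10 203.0.113.11
```

The check is deliberately strict: a CNAME line in the `dig +short` output, or a stray extra A record, fails it, which is usually worth a look when a single storefront is unreachable.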
Shopify webhook processing delayed
- Check BullMQ queue depth for `SHOPIFY_WEBHOOK`
- Check events-consumer logs for processing errors
- If queue is deeply backlogged, increase consumer concurrency temporarily
P3 · Low — Minor / Non-Production
P3s are logged as GitHub issues and triaged in the next sprint. No on-call page required.
Common P3 scenarios
| Scenario | Action |
|---|---|
| Staging environment flapping or down | Log issue, fix during business hours. Staging downtime doesn't affect customers. |
| Non-critical log noise / warning spam | Log issue, fix in next sprint |
| Seller panel UI bug (non-blocking) | Log a GitHub issue with a screenshot and reproduction steps |
| Staging CI pipeline red | Check GitHub Actions log, fix or rerun |
| Analytics numbers off by small margin | Investigate in next sprint |
| Internal docs typo or missing content | PR to internal.zelly.in |
Post-Mortem Template
Every P0 and major P1 incident requires a post-mortem written within 24–48 hours. Blameless post-mortems focus on systems and processes, not individuals.
```markdown
# Incident Post-Mortem: [Title]

**Date:** YYYY-MM-DD
**Severity:** P0 / P1
**Duration:** X hours Y minutes (HH:MM UTC → HH:MM UTC)
**Author:** [name]
**Status:** DRAFT / FINAL

## Summary
[1–3 sentences describing what broke, the user impact, and the root cause]

## Impact
- **Users affected:** [number or percentage]
- **Revenue impact:** [estimate if known]
- **Services affected:** [list]
- **Detection method:** [monitoring alert / user report / internal discovery]

## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | [first sign of issue] |
| HH:MM | [incident declared] |
| HH:MM | [diagnosis reached] |
| HH:MM | [mitigation applied] |
| HH:MM | [incident resolved] |

## Root Cause
[Technical explanation of what actually went wrong]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Something that worked during the response]

## What Went Poorly
- [Something that slowed us down]

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| [Preventive measure] | [name] | [date] |
| [Monitoring improvement] | [name] | [date] |
| [Documentation update] | [name] | [date] |

## Lessons Learned
[1–2 paragraphs on what the team learned from this incident]
```
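The Duration field is easy to get wrong when an incident crosses midnight UTC. The sketch below computes it from the first and last timeline entries. The function name `incident_duration` is hypothetical, and it assumes incidents shorter than 24 hours, treating an end time earlier than the start time as falling on the following day.

```bash
# incident_duration START END
# START and END are HH:MM in UTC. Prints "X hours Y minutes".
incident_duration() {
  s_h=${1%:*}; s_m=${1#*:}
  e_h=${2%:*}; e_m=${2#*:}
  # Strip one leading zero so values such as 08 and 09 are not read as octal.
  s_h=${s_h#0}; s_m=${s_m#0}; e_h=${e_h#0}; e_m=${e_m#0}
  start=$(( ${s_h:-0} * 60 + ${s_m:-0} ))
  end=$(( ${e_h:-0} * 60 + ${e_m:-0} ))
  # An end before the start means the incident crossed midnight.
  if [ "$end" -lt "$start" ]; then
    end=$(( end + 1440 ))
  fi
  total=$(( end - start ))
  echo "$(( total / 60 )) hours $(( total % 60 )) minutes"
}
```

For example, `incident_duration 02:15 05:40` prints `3 hours 25 minutes`, and `incident_duration 23:50 00:20` prints `0 hours 30 minutes`.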
Escalation Contacts
| Role | When to escalate | How |
|---|---|---|
| On-call engineer | Any P0/P1 alert, immediate | PagerDuty / direct message |
| Tech lead | P1 unresolved after 30 min, any P0 | Phone call |
| Founders | P0, revenue impact >1hr, security breach | Phone call |
| AWS Support | AWS infrastructure issue (RDS, ECS not responding to API) | AWS Console → Support |
| Cloudflare Support | Cloudflare network issue affecting storefront | Cloudflare Dashboard → Support |
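The internal rows of the table reduce to a few conditions, which a paging script could evaluate on a timer. The sketch below is one reading of those rows, not an existing tool. The function name `escalation_targets`, its arguments, and the output labels are hypothetical, and "unresolved after 30 min" is interpreted as 30 minutes or more. Vendor support rows are left out because they depend on the nature of the fault rather than on severity.

```bash
# escalation_targets SEVERITY [MINUTES_UNRESOLVED] [SECURITY_BREACH] [REVENUE_IMPACT_MINUTES]
#   SECURITY_BREACH is yes or no. Prints a space-separated list of roles to contact.
escalation_targets() {
  sev=$1
  minutes=${2:-0}
  breach=${3:-no}
  revenue_minutes=${4:-0}
  targets=""
  case "$sev" in
    P0|P1) targets="on-call" ;;
  esac
  # Tech lead: any P0, or a P1 that is still open after 30 minutes.
  if [ "$sev" = P0 ] || { [ "$sev" = P1 ] && [ "$minutes" -ge 30 ]; }; then
    targets="$targets tech-lead"
  fi
  # Founders: any P0, a security breach, or revenue impact beyond an hour.
  if [ "$sev" = P0 ] || [ "$breach" = yes ] || [ "$revenue_minutes" -gt 60 ]; then
    targets="$targets founders"
  fi
  echo "${targets# }"
}
```

For example, `escalation_targets P1 45` prints `on-call tech-lead`, while `escalation_targets P1 10` prints only `on-call`.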
Quick Links
- ECS Runbook: Logs, debugging stopped tasks, force redeploy, scaling commands.
- Commands Cheatsheet: AWS CLI, ECS, Redis, ClickHouse, Terraform; one-liners for fast response.
- Secrets & VPN: Access credentials, connect WireGuard, update Secrets Manager.