Incident Response

This guide defines severity levels, response procedures, and communication templates for incidents affecting the Zelly platform. Every on-call engineer should know the P0 and P1 procedures before taking a shift.

Severity Levels

| Severity | Definition | Response SLA | Escalation | Examples |
|----------|------------|--------------|------------|----------|
| P0 · Critical | Production completely down or data loss/security breach | 15 minutes to acknowledge, all-hands | Immediately wake all engineers + founders | All services 503, DB data corruption, creds leaked |
| P1 · High | Major feature broken, ≥20% users affected, or revenue impacted | 30 minutes to acknowledge | Alert on-call + tech lead | Checkout broken, payments failing, storefront 502, login broken |
| P2 · Medium | Single service degraded, <20% users affected | 2 hours to acknowledge | Alert on-call only | Admin panel slow, analytics delayed, single merchant domain cert expired |
| P3 · Low | Minor degradation, non-production, or cosmetic | Next business day | Log as GitHub issue, no page | Staging environment flapping, UI typo, non-critical log spam |

P0 · Critical — Production Down / Data Loss / Security

P0 = all-hands. Drop everything. Wake everyone if it's 3am. Time-to-resolution matters more than anything else.

P0-A: All services returning 503 / production completely down

  1. Triage (0–5 min): Confirm the scope. Hit https://api.zelly.in/health, https://app.zelly.in, and the storefront. Determine whether it's one service or all.
  2. Check ECS cluster health:
    bash
    aws ecs describe-services \
      --cluster zelly-production \
      --services fastify-nova customer-panel orion-backend storefront \
      --region ap-south-1 \
      --query 'services[*].{name:serviceName,desired:desiredCount,running:runningCount,events:events[0].message}'
  3. Interpret the counts and check ALB target health: If the running count is 0, the services have crashed; get logs immediately (see step 4). If tasks are running but requests still return 503, check the ALB target group health.
  4. Get crash logs:
    bash
    aws logs filter-log-events \
      --log-group-name /zelly/ecs/fastify-nova \
      --start-time $(( $(date +%s) * 1000 - 600000 )) \
      --region ap-south-1 \
      --query 'events[*].message' --output text | tail -100
  5. Attempt rollback: If a bad deployment caused this, roll back immediately:
    bash
    # Get the task definition the service currently runs (ends in ":<revision>")
    CURRENT=$(aws ecs describe-services --cluster zelly-production \
      --services fastify-nova --region ap-south-1 \
      --query 'services[0].taskDefinition' --output text)
    
    # Roll back to the previous revision (assumes revision N-1 exists, is still
    # ACTIVE, and predates the bad change)
    PREV="${CURRENT%:*}:$(( ${CURRENT##*:} - 1 ))"
    aws ecs update-service --cluster zelly-production \
      --service fastify-nova --task-definition "$PREV" \
      --force-new-deployment --region ap-south-1
  6. Scale check: If memory/CPU spike caused the crash, scale up first, debug second: aws ecs update-service --cluster zelly-production --service fastify-nova --desired-count 4 --region ap-south-1
  7. Communicate: Post status to team within 10 minutes of P0 declaration (see communication template below)
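
The decision in step 3 can be sketched as a small helper. This is a minimal sketch under stated assumptions, not existing Zelly tooling: `classify_counts`, `check_service`, and `target_health` are hypothetical names, and the target group ARN passed to `target_health` has to be looked up for the affected service.

```bash
#!/usr/bin/env bash
# Sketch: interpret ECS desired/running counts as described in step 3.
# All function names here are hypothetical helpers.

# Print a one-word verdict for a desired/running count pair.
classify_counts() {
  local desired="$1" running="$2"
  if [ "$desired" -eq 0 ]; then
    echo "scaled-to-zero"   # someone set desired count to 0 on purpose
  elif [ "$running" -eq 0 ]; then
    echo "crashed"          # no tasks up: go straight to the crash logs (step 4)
  elif [ "$running" -lt "$desired" ]; then
    echo "degraded"         # some tasks up: check logs and service events
  else
    echo "running"          # tasks up: if still 503, check ALB target health
  fi
}

# Fetch the counts for one service and classify them.
check_service() {
  local service="$1" counts
  counts=$(aws ecs describe-services \
    --cluster zelly-production --services "$service" --region ap-south-1 \
    --query 'services[0].[desiredCount,runningCount]' --output text) || return 1
  # --output text prints the two numbers separated by whitespace
  set -- $counts
  echo "$service: $(classify_counts "$1" "$2")"
}

# List the health of every target in one target group.
# The ARN argument is an assumption: supply the real target group ARN.
target_health() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" --region ap-south-1 \
    --query 'TargetHealthDescriptions[*].{target:Target.Id,state:TargetHealth.State,reason:TargetHealth.Reason}' \
    --output table
}
```

For example, `check_service fastify-nova` prints a single verdict line, and `target_health "$TG_ARN"` is the follow-up when the verdict is `running` but requests still fail.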

P0-B: Database corruption or data loss

Stop all writes immediately before any further action.
  1. Scale all ECS services to 0 desired count to stop writes: aws ecs update-service --cluster zelly-production --service SERVICE --desired-count 0 --region ap-south-1
  2. Take an immediate Aurora snapshot: aws rds create-db-cluster-snapshot --db-cluster-identifier zelly-aurora-production --db-cluster-snapshot-identifier emergency-$(date +%Y%m%d-%H%M) --region ap-south-1
  3. Connect via WireGuard VPN and assess the damage: mysql -h AURORA_ENDPOINT -u admin -p
  4. Check Aurora automated backups: point-in-time restore can target any second within the last 7 days
  5. Do NOT restore production in place — restore to a new cluster, verify, then cut over
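
The sequence above can be sketched as a script that prints each command unless `DRY_RUN=0` is set. This is a sketch, not a tested runbook script: the service list is taken from names that appear in this guide and may be incomplete, and the restore target name passed to `restore_to_new_cluster` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of the P0-B sequence. Commands are only printed unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
REGION=ap-south-1
CLUSTER=zelly-production
DB_CLUSTER=zelly-aurora-production
# Services named in this guide; the full list of writers is an assumption.
SERVICES="fastify-nova customer-panel orion-backend storefront events-consumer"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Step 1: scale every service to zero so nothing can write.
stop_writes() {
  local svc
  for svc in $SERVICES; do
    run aws ecs update-service --cluster "$CLUSTER" --service "$svc" \
      --desired-count 0 --region "$REGION"
  done
}

# Step 2: snapshot the cluster in its current state before touching anything.
emergency_snapshot() {
  run aws rds create-db-cluster-snapshot \
    --db-cluster-identifier "$DB_CLUSTER" \
    --db-cluster-snapshot-identifier "emergency-$(date +%Y%m%d-%H%M)" \
    --region "$REGION"
}

# Step 5: restore into a NEW cluster.
#   $1 = UTC timestamp to restore to, e.g. 2024-01-31T09:15:00Z
#   $2 = name for the new cluster (hypothetical, choose your own)
restore_to_new_cluster() {
  run aws rds restore-db-cluster-to-point-in-time \
    --source-db-cluster-identifier "$DB_CLUSTER" \
    --db-cluster-identifier "$2" \
    --restore-to-time "$1" \
    --region "$REGION"
}
```

Note that a point-in-time restore creates only the cluster; a DB instance has to be added to the restored cluster before anyone can connect to it and verify the data.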

P0-C: Security breach — credentials leaked or unauthorized access

  1. Revoke immediately: Rotate the leaked credential first, ask questions later
    bash — rotate leaked secret
    # Rotate Aurora master password
    aws rds modify-db-cluster \
      --db-cluster-identifier zelly-aurora-production \
      --master-user-password NEW_SECURE_PASSWORD \
      --apply-immediately \
      --region ap-south-1
    
    # Update the secret to match
    aws secretsmanager put-secret-value \
      --secret-id zelly/aurora/master \
      --secret-string '{"password":"NEW_SECURE_PASSWORD","username":"admin"}' \
      --region ap-south-1
  2. If AWS access keys were leaked: deactivate via IAM immediately, then rotate
  3. Check CloudTrail for unauthorized API calls: aws cloudtrail lookup-events --lookup-attributes AttributeKey=Username,AttributeValue=LEAKED_USER --region ap-south-1
  4. Check Aurora audit logs for suspicious queries
  5. After the credentials are rotated, force-redeploy all services so they pick up the new secrets
  6. File a post-mortem within 24 hours
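
Steps 2 and 5 can be sketched with the same dry-run pattern. This is a sketch under assumptions: the service list comes from names used in this guide and may be incomplete, `deactivate_key` and `redeploy_all` are hypothetical helper names, and the key ID shown in the usage line is a placeholder.

```bash
#!/usr/bin/env bash
# Sketch for steps 2 and 5 of P0-C. Commands are only printed unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
REGION=ap-south-1
CLUSTER=zelly-production
# Services named in this guide; treat the list as an assumption and extend it.
SERVICES="fastify-nova customer-panel orion-backend storefront events-consumer"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Step 2: deactivate a leaked access key. Deactivation is reversible, so it is
# safe to do first; create a replacement key and delete the old one afterwards.
#   $1 = IAM user name, $2 = access key ID
deactivate_key() {
  run aws iam update-access-key --user-name "$1" \
    --access-key-id "$2" --status Inactive
}

# Step 5: force a new deployment of every service so that new tasks start
# with the rotated secret values.
redeploy_all() {
  local svc
  for svc in $SERVICES; do
    run aws ecs update-service --cluster "$CLUSTER" --service "$svc" \
      --force-new-deployment --region "$REGION"
  done
}
```

Usage: `deactivate_key LEAKED_USER AKIAEXAMPLEKEYID` prints the command for review; rerun with `DRY_RUN=0` to execute it.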

P0 Communication Template

Slack / WhatsApp
🚨 P0 INCIDENT — [SHORT DESCRIPTION]

Status: INVESTIGATING / MITIGATING / RESOLVED
Started: [TIME UTC]
Impact: [What is broken, how many users affected]
Current action: [What we're doing right now]
Next update: [in X minutes]

CC: @everyone

P1 · High — Major Feature Broken / Revenue Impact

P1 = 30-minute acknowledgment. Revenue or major user flows affected. Immediately loop in tech lead.

P1-A: Checkout / payment flow broken

  1. Check fastify-nova logs for Razorpay API errors, signature failures, or 5xx responses from the payment gateway
  2. Verify RAZORPAY_KEY_ID and RAZORPAY_KEY_SECRET in Secrets Manager are correct and match the Razorpay dashboard
  3. Check if Razorpay has an outage: status.razorpay.com
  4. If a code change broke it: roll back fastify-nova to the previous task definition
  5. Test checkout manually on a test order after the fix
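
The rollback in step 4 can be sketched as a reusable helper that refuses to proceed when there is no usable earlier revision. This is a minimal sketch: `prev_task_def` and `rollback_service` are hypothetical names, and it assumes the revision immediately before the current one is the last known-good build.

```bash
#!/usr/bin/env bash
# Sketch: roll a service back one task definition revision, with safety checks.
REGION=ap-south-1
CLUSTER=zelly-production

# Given "family:N" or a task definition ARN ending in ":N", print the same
# identifier with revision N-1. Fails if N is not a number greater than 1.
prev_task_def() {
  local current="$1"
  local rev="${current##*:}"
  case "$rev" in
    ''|*[!0-9]*) echo "cannot parse revision from $current" >&2; return 1 ;;
  esac
  if [ "$rev" -le 1 ]; then
    echo "no earlier revision for $current" >&2
    return 1
  fi
  echo "${current%:*}:$((rev - 1))"
}

# Roll the given service back to the previous revision if it is deployable.
rollback_service() {
  local service="$1" current prev status
  current=$(aws ecs describe-services --cluster "$CLUSTER" \
    --services "$service" --region "$REGION" \
    --query 'services[0].taskDefinition' --output text) || return 1
  prev=$(prev_task_def "$current") || return 1
  # A deregistered revision is INACTIVE and a service cannot be updated to it.
  status=$(aws ecs describe-task-definition --task-definition "$prev" \
    --region "$REGION" --query 'taskDefinition.status' --output text) || return 1
  if [ "$status" != "ACTIVE" ]; then
    echo "$prev is $status; pick an older revision by hand" >&2
    return 1
  fi
  aws ecs update-service --cluster "$CLUSTER" --service "$service" \
    --task-definition "$prev" --force-new-deployment --region "$REGION"
}
```

Usage: `rollback_service fastify-nova`. The same helper applies to any other service whose latest deployment is suspected.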

P1-B: Storefront 502/503 for all or many merchants

  1. Check ECS storefront service running/desired counts
  2. Check Caddy container logs — Caddy runs as a sidecar in the same ECS task
  3. Check whether the issue is DNS. The NLB Elastic IPs are static, so an IP change is not the cause; verify that the affected domains still resolve to the EIP
  4. Check if fastify-nova (which storefront SSRs against) is healthy
  5. If a Caddy config change caused it, roll back the storefront task definition
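
A quick way to tell "nothing is listening" apart from "the proxy is up but its upstream is not" is to probe a few URLs and classify the status codes. The sketch below uses only `curl`; `classify_http` and `probe` are hypothetical helper names, and the interpretation in the comments is a rule of thumb rather than a guarantee.

```bash
#!/usr/bin/env bash
# Sketch: probe a URL and classify the HTTP status code.

# Map a status code to a coarse verdict.
classify_http() {
  case "$1" in
    000)         echo "unreachable" ;;    # curl could not connect or timed out
    502|503|504) echo "gateway-error" ;;  # a proxy answered, its upstream likely did not
    2??|3??)     echo "ok" ;;
    *)           echo "error" ;;
  esac
}

# Print "<url> <code> <verdict>" for one URL.
probe() {
  local code
  # -w '%{http_code}' prints 000 when no HTTP response was received
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1")
  echo "$1 $code $(classify_http "$code")"
}
```

Usage: `probe https://api.zelly.in/health` for the API health URL used in P0-A triage, then `probe https://MERCHANT_DOMAIN` for one or two affected storefronts (`MERCHANT_DOMAIN` is a placeholder).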

P1-C: All users locked out (login/auth broken)

  1. Check customer-panel logs for errors
  2. Verify Firebase service account JSON is valid in zelly/fastify-nova/env
  3. Check if Redis (for sessions) is reachable from customer-panel
  4. Check if any Firebase project settings changed (Console → Settings)
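
For step 2, a rough sanity check of the service account JSON can catch a truncated or wrong secret quickly. The sketch below is a crude text check, not a real JSON validator: `check_service_account` is a hypothetical helper, it assumes the raw service account JSON is piped in on stdin, and how that JSON is stored inside `zelly/fastify-nova/env` is not specified in this guide.

```bash
#!/usr/bin/env bash
# Sketch: check that service account JSON on stdin has the usual fields.
# This is a grep-based sanity check only; it does not parse JSON and it
# cannot tell whether the key is still valid on the Firebase side.
check_service_account() {
  local json missing=0 field
  json=$(cat)
  if ! printf '%s' "$json" | grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"'; then
    echo 'missing or wrong "type" (expected "service_account")' >&2
    missing=1
  fi
  # Fields a Google service account key file normally contains.
  for field in project_id private_key client_email; do
    if ! printf '%s' "$json" | grep -Eq "\"$field\"[[:space:]]*:[[:space:]]*\"[^\"]"; then
      echo "missing or empty \"$field\"" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: pipe the JSON in, for example `check_service_account < service-account.json` (the filename is a placeholder). An exit status of 0 means every expected field was found.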

P1 Communication Template

Slack / WhatsApp
⚠️ P1 INCIDENT — [SHORT DESCRIPTION]

Status: INVESTIGATING / MITIGATING / RESOLVED
Started: [TIME UTC]
Impact: [e.g., "Checkout broken for all merchants", "~40% of login attempts failing"]
Current action: [What we're doing]
Next update: [in X minutes]

CC: @tech-lead @on-call

P2 · Medium — Single Service Degraded

P2 incidents affect a subset of users or a non-critical feature. Acknowledge within 2 hours, fix within the same business day where possible.

Common P2 scenarios

Admin panel (orion) slow or timing out
  1. Check orion-backend CloudWatch metrics (CPU/memory)
  2. Look for slow database queries in orion-backend logs
  3. Check ClickHouse query performance, since analytics reads can be slow
  4. Scale up orion-backend desired count if it's memory pressure

Analytics data delayed or missing
  1. Check events-consumer is running: aws ecs describe-services --cluster zelly-production --services events-consumer --region ap-south-1
  2. Check BullMQ queue depth via Bull Studio (WireGuard VPN required)
  3. Check consumer logs for ClickHouse insert errors
  4. Verify ClickHouse EC2 is running and Docker container is up

Single merchant's storefront unreachable
  1. Check if the merchant's domain DNS is pointing to the NLB EIP
  2. Check Caddy logs for TLS challenge failures for that domain
  3. Check if the /allow-cert endpoint returns 200 for the domain
  4. Verify the merchant's domain is in the tenant DB

Shopify webhook processing delayed
  1. Check BullMQ queue depth for SHOPIFY_WEBHOOK
  2. Check events-consumer logs for processing errors
  3. If queue is deeply backlogged, increase consumer concurrency temporarily
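
For the single-merchant scenario, the DNS and /allow-cert checks can be sketched as follows. This is a sketch under assumptions: the function names are hypothetical, the EIP values and the base URL of the /allow-cert endpoint are not given in this guide and must be supplied, and the `domain` query parameter reflects how Caddy's on-demand TLS normally queries its ask endpoint.

```bash
#!/usr/bin/env bash
# Sketch: check that a merchant domain resolves to the NLB EIP and that the
# /allow-cert endpoint accepts it.

# Return 0 if any resolved address matches one of the expected EIPs.
#   $1 = space-separated list of expected EIPs, remaining args = resolved addresses
points_to_eip() {
  local expected="$1" addr eip
  shift
  for addr in "$@"; do
    for eip in $expected; do
      [ "$addr" = "$eip" ] && return 0
    done
  done
  return 1
}

# Resolve the domain's A records and compare them with the expected EIPs.
#   $1 = merchant domain, $2 = space-separated list of NLB EIPs
check_merchant_dns() {
  local domain="$1" eips="$2" addrs
  addrs=$(dig +short A "$domain")
  # Word splitting on $addrs is intentional: one argument per resolved address.
  if points_to_eip "$eips" $addrs; then
    echo "$domain resolves to an expected EIP"
  else
    echo "$domain resolves to: ${addrs:-nothing} (expected one of: $eips)"
    return 1
  fi
}

# Print the HTTP status the /allow-cert endpoint returns for a domain.
#   $1 = base URL of the endpoint (assumption: supply the real one), $2 = domain
allow_cert_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1?domain=$2"
}
```

Usage: `check_merchant_dns shop.example.com "203.0.113.10 203.0.113.11"`, where the domain and addresses are documentation placeholders. A 200 from `allow_cert_status` means certificate issuance is permitted for that domain; anything else points at the tenant DB entry in step 4.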

P3 · Low — Minor / Non-Production

P3s are logged as GitHub issues and triaged in the next sprint. No on-call page required.

Common P3 scenarios

| Scenario | Action |
|----------|--------|
| Staging environment flapping or down | Log issue, fix during business hours. Staging downtime doesn't affect customers. |
| Non-critical log noise / warning spam | Log issue, fix in next sprint |
| Seller panel UI bug (non-blocking) | Log GitHub issue, screenshot, reproduce steps |
| Staging CI pipeline red | Check GitHub Actions log, fix or rerun |
| Analytics numbers off by small margin | Investigate in next sprint |
| Internal docs typo or missing content | PR to internal.zelly.in |

Post-Mortem Template

Every P0 and major P1 incident requires a post-mortem written within 24–48 hours. Blameless post-mortems focus on systems and processes, not individuals.

markdown — Post-Mortem Template
# Incident Post-Mortem: [Title]

**Date:** YYYY-MM-DD
**Severity:** P0 / P1
**Duration:** X hours Y minutes (HH:MM UTC → HH:MM UTC)
**Author:** [name]
**Status:** DRAFT / FINAL

## Summary
[1–3 sentences describing what broke, the user impact, and the root cause]

## Impact
- **Users affected:** [number or percentage]
- **Revenue impact:** [estimate if known]
- **Services affected:** [list]
- **Detection method:** [monitoring alert / user report / internal discovery]

## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | [first sign of issue] |
| HH:MM | [incident declared] |
| HH:MM | [diagnosis reached] |
| HH:MM | [mitigation applied] |
| HH:MM | [incident resolved] |

## Root Cause
[Technical explanation of what actually went wrong]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Something that worked during the response]

## What Went Poorly
- [Something that slowed us down]

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| [Preventive measure] | [name] | [date] |
| [Monitoring improvement] | [name] | [date] |
| [Documentation update] | [name] | [date] |

## Lessons Learned
[1–2 paragraphs on what the team learned from this incident]

Escalation Contacts

| Role | When to escalate | How |
|------|------------------|-----|
| On-call engineer | Any P0/P1 alert, immediate | PagerDuty / direct message |
| Tech lead | P1 unresolved after 30 min, any P0 | Phone call |
| Founders | P0, revenue impact >1hr, security breach | Phone call |
| AWS Support | AWS infrastructure issue (RDS, ECS not responding to API) | AWS Console → Support |
| Cloudflare Support | Cloudflare network issue affecting storefront | Cloudflare Dashboard → Support |

Quick Links

ECS Runbook: Logs, debugging stopped tasks, force redeploy, scaling commands.
Error Reference: Specific error messages mapped to root causes and fixes.
Commands Cheatsheet: AWS CLI, ECS, Redis, ClickHouse, Terraform one-liners for fast response.
Secrets & VPN: Access credentials, connect WireGuard, update Secrets Manager.