Production Operations Weekly Review
May 16 to May 24, 2026 / cadence: weekly
Overall status: RED
Degraded days: 5
RED components: 2
AMBER components: 8
Actions needing disposition: 10
Daily Production Health
Latest daily health issues
- Platform p95 exceeded the 250 ms target overnight
- Platform p99 hit extreme active-hour tail latency
- docgen2-api HPA hit max replicas
Daily Findings
Daily health findings by day
Reliability Movement
Customer visible failures: -1 (14 current / 15 previous)
Observed incidents: -1 (14 current / 15 previous)
Application alerts: +128 (213 current / 85 previous)
Infrastructure alerts: 0 (0 current / 0 previous)
Pingdom events: -1 (14 current / 15 previous)
Latency Trend
p95 max: 12920 ms
p99 max: 20000 ms
Component Heatmap
Columns (days): May 20, May 21, May 22, May 23, May 24
Rows (component domains): Customer Edge, Application Runtime, Data Layer, Cache & Messaging, Batch & Scheduled Work, Kubernetes & Capacity, Observability & Alerting, External Dependencies
Daily Health Spine
Signals pulled from the daily production health PDF/JSON: customer checks, latency/errors, PHP-FPM probes, Kubernetes pressure, DB/cache pressure, and direct source links.
| Day | Pingdom | Latency / errors | PHP-FPM / Kubernetes | DB / cache | References |
|---|---|---|---|---|---|
| May 20 (RED) | events 0, peak 0 ms | p95 2021 ms, p99 7240 ms, 5xx 6.1% | probe 1683, restarts 37, HPA max 2 | RDS pressure 2, Redis 0, slow SQL 0 | |
| May 21 (RED) | events 0, peak 415 ms | p95 12920 ms, p99 20000 ms, 5xx 1.15% | probe 24, restarts 35, HPA max 3 | RDS pressure 1, Redis 0, slow SQL 0 | |
| May 22 (RED) | events 0, peak 414 ms | p95 1150 ms, p99 20000 ms, 5xx 0.02% | probe 0, restarts 4, HPA max 3 | RDS pressure 1, Redis 0, slow SQL 0 | |
| May 23 (RED) | events 0, peak 421 ms | p95 237.73 ms, p99 20000 ms, 5xx 0.03% | probe 0, restarts 2, HPA max 1 | RDS pressure 1, Redis 0, slow SQL 0 | |
| May 24 (RED) | events 0, peak 427 ms | p95 800 ms, p99 17080 ms, 5xx 0.03% | probe 0, restarts 3, HPA max 1 | RDS pressure 2, Redis 0, slow SQL 0 | |
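
As a cross-check on the latency column, the daily p95 values can be compared with the 250 ms target named in the latest daily health issues. This is a minimal sketch, not the report's own tooling: it assumes the target applies to the daily p95 figure as tabulated, and the names `daily_p95_ms`, `P95_TARGET_MS`, and `days_over_target` are hypothetical.

```python
# Daily p95 latency in ms, transcribed from the Daily Health Spine table.
daily_p95_ms = {
    "May 20": 2021.0,
    "May 21": 12920.0,
    "May 22": 1150.0,
    "May 23": 237.73,
    "May 24": 800.0,
}

# Target quoted in the latest daily health issues (assumed to apply per day).
P95_TARGET_MS = 250.0


def days_over_target(p95_by_day, target_ms):
    """Return the days whose p95 is strictly above the target, in table order."""
    return [day for day, p95 in p95_by_day.items() if p95 > target_ms]


over = days_over_target(daily_p95_ms, P95_TARGET_MS)
worst_day = max(daily_p95_ms, key=daily_p95_ms.get)

print(f"{len(over)} of {len(daily_p95_ms)} days over target: {over}")
print(f"worst p95: {worst_day} at {daily_p95_ms[worst_day]:.0f} ms")
```

Under these assumptions only May 23 stays within the target, and May 21 carries the period maximum reported in the Latency Trend.
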
Production Dependency Map
| Component | Status | Current period | Source | References | Disposition ID |
|---|---|---|---|---|---|
| Customer Edge (4 components) | | | | | |
| Pingdom public checks | GREEN | No degraded signal in collected evidence | pingdom | component-pingdom-public-checks | |
| DNS resolution | GREEN | No degraded signal in collected evidence | dns_issue_check | component-dns-resolution | |
| Ingress / Traefik | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, service_5xx_rps, service_p95_top | component-ingress-traefik | |
| Public 5xx and latency | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, latency_triage, latency_signal | component-public-5xx-and-latency | |
| Application Runtime (5 components) | | | | | |
| PHP-FPM | AMBER | probe failures 1683; probe failures 23 | web_probe_failures_5m, web_restarts_5m, latency_triage | component-php-fpm | |
| API/app pods | GREEN | No degraded signal in collected evidence | web_ready_pods, web_running_pods, web_unavailable_replicas | component-api-app-pods | |
| Worker pods | AMBER | container restarts 2; container restarts 3; container restarts 34; container restarts 36 | top_restarts_24h, top_memory_containers_24h | component-worker-pods | |
| Runtime memory and restarts | AMBER | container restarts 2; container restarts 3; container restarts 34; container restarts 36 | top_memory_containers_24h, top_restarts_24h | component-runtime-memory-and-restarts | |
| Slow request families | GREEN | No degraded signal in collected evidence | latency_triage, slow_traces | component-slow-request-families | |
| Data Layer (5 components) | | | | | |
| MySQL catalog | RED | mysql-catalog free memory below 512MiB; mysql-catalog swap observed | rds:mysql-catalog, slowquery:mysql-catalog | component-mysql-catalog | |
| MySQL catalog2 | RED | mysql-catalog2 free memory below 512MiB; mysql-catalog2 swap observed | rds:mysql-catalog2, slowquery:mysql-catalog2 | component-mysql-catalog2 | |
| MySQL master | GREEN | No degraded signal in collected evidence | rds:mysql-master, slowquery:mysql-master | component-mysql-master | |
| Postgres billing | AMBER | postgres-billing swap observed | rds:postgres-billing, slowquery:postgres-billing | component-postgres-billing | |
| Slow queries | GREEN | No degraded signal in collected evidence | slowquery | component-slow-queries | |
| Cache & Messaging (3 components) | | | | | |
| Redis / Sentinel | GREEN | No degraded signal in collected evidence | redis_issue_check, redis_evicted_keys_5m, redis_rejected_connections_5m | component-redis-sentinel | |
| RabbitMQ / queues | GREEN | No degraded signal in collected evidence | rabbitmq, queue_depth, consumer_lag | component-rabbitmq-queues | |
| Consumer lag or delayed processing | GREEN | No degraded signal in collected evidence | consumer_lag, cron_active | component-consumer-lag-or-delayed-processing | |
| Batch & Scheduled Work (3 components) | | | | | |
| Watched CronJobs | GREEN | No degraded signal in collected evidence | cron_active | component-watched-cronjobs | |
| Kubernetes Jobs | GREEN | No degraded signal in collected evidence | KubeJobFailed, job_failed | component-kubernetes-jobs | |
| Long-running scheduled tasks | GREEN | No degraded signal in collected evidence | cron_active, slow_traces | component-long-running-scheduled-tasks | |
| Kubernetes & Capacity (5 components) | | | | | |
| Readiness | GREEN | No degraded signal in collected evidence | web_ready_pods, web_unavailable_replicas | component-readiness | |
| Probe failures | AMBER | probe failures 1683; probe failures 23 | web_probe_failures_5m | component-probe-failures | |
| HPA saturation | AMBER | HPA max hit: docgen2-api; HPA max hit: docgen2-api, notifications-event-manager; HPA max hit: docgen2-api, notifications-event-manager, web; HPA max hit: docgen2-api, subscriptions-api, web | hpa_current, hpa_max | component-hpa-saturation | |
| Pod/node churn | AMBER | container restarts 2; container restarts 3; container restarts 34; container restarts 36 | top_restarts_24h | component-pod-node-churn | |
| Top memory containers | AMBER | container restarts 2; container restarts 3; container restarts 34; container restarts 36 | top_memory_containers_24h | component-top-memory-containers | |
| Observability & Alerting (5 components) | | | | | |
| Grafana / Prometheus | GREEN | No degraded signal in collected evidence | prometheus | component-grafana-prometheus | |
| Loki | GREEN | No degraded signal in collected evidence | loki, latency_triage | component-loki | |
| Tempo | GREEN | No degraded signal in collected evidence | slow_traces | component-tempo | |
| Alert rule health/noise | GREEN | No degraded signal in collected evidence | slack_alerts, alert_table | component-alert-rule-health-noise | |
| AWS alarm emails | GREEN | No degraded signal in collected evidence | aws_email_alerts, email_table | component-aws-alarm-emails | |
| External Dependencies (5 components) | | | | | |
| Email provider | GREEN | No degraded signal in collected evidence | email_provider | component-email-provider | |
| SMS provider | GREEN | No degraded signal in collected evidence | sms_provider | component-sms-provider | |
| Payment provider | GREEN | No degraded signal in collected evidence | payment_provider | component-payment-provider | |
| Identity/login provider | GREEN | No degraded signal in collected evidence | identity_provider | component-identity-login-provider | |
| Webhooks / third-party APIs | GREEN | No degraded signal in collected evidence | third_party_api | component-webhooks-third-party-apis | |
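
The headline counts (2 RED, 8 AMBER) can be reproduced from the component statuses in the dependency map. The sketch below tallies them and rolls each domain up to its worst status. The worst-of rule and the names `statuses_by_domain`, `SEVERITY`, and `worst` are assumptions for illustration; the report does not state how its overall status is derived.

```python
from collections import Counter

# Component statuses per domain, transcribed row by row from the dependency map.
statuses_by_domain = {
    "Customer Edge": ["GREEN"] * 4,
    "Application Runtime": ["AMBER", "GREEN", "AMBER", "AMBER", "GREEN"],
    "Data Layer": ["RED", "RED", "GREEN", "AMBER", "GREEN"],
    "Cache & Messaging": ["GREEN"] * 3,
    "Batch & Scheduled Work": ["GREEN"] * 3,
    "Kubernetes & Capacity": ["GREEN", "AMBER", "AMBER", "AMBER", "AMBER"],
    "Observability & Alerting": ["GREEN"] * 5,
    "External Dependencies": ["GREEN"] * 5,
}

# Assumed ordering: RED outranks AMBER, which outranks GREEN.
SEVERITY = {"GREEN": 0, "AMBER": 1, "RED": 2}


def worst(statuses):
    """Return the most severe status in an iterable of status strings."""
    return max(statuses, key=lambda s: SEVERITY[s])


counts = Counter(s for group in statuses_by_domain.values() for s in group)
domain_rollup = {domain: worst(group) for domain, group in statuses_by_domain.items()}
overall = worst(counts.keys())

print(dict(counts))
print(domain_rollup)
print("overall:", overall)
```

With this rule the Data Layer is the only RED domain, and the overall result agrees with the RED status in the summary.
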
Previous period
| Metric | Current | Previous | Delta |
|---|---|---|---|
| active_aws_alarms | 0 | 0 | 0 |
| application_alerts | 213 | 85 | +128 |
| customer_incidents_confirmed | 0 | 0 | 0 |
| customer_incidents_observed | 14 | 15 | -1 |
| customer_visible_failures | 14 | 15 | -1 |
| impacted_services | 14 | 10 | +4 |
| infrastructure_alerts | 0 | 0 | 0 |
| pingdom_downtime_minutes | 17 | 15 | +2 |
| pingdom_events | 14 | 15 | -1 |
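
The Delta column is simply current minus previous, shown with an explicit sign. A minimal sketch of that arithmetic follows, using a few rows from the table; the helper name `format_delta` is hypothetical, and the formatting rule (plain `0` for no change) is inferred from the table rather than documented.

```python
# (current, previous) pairs transcribed from the Previous period table.
metrics = {
    "application_alerts": (213, 85),
    "customer_visible_failures": (14, 15),
    "impacted_services": (14, 10),
    "infrastructure_alerts": (0, 0),
    "pingdom_downtime_minutes": (17, 15),
}


def format_delta(current, previous):
    """Signed delta as the table shows it: '+128', '-1', or a bare '0'."""
    delta = current - previous
    return f"{delta:+d}" if delta else "0"


for name, (current, previous) in metrics.items():
    print(f"{name}: {current} current / {previous} previous "
          f"({format_delta(current, previous)})")
```

Note that the count of Pingdom events fell by one while downtime minutes rose by two, so the two metrics should be read together rather than as a single improvement signal.
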
ADS Action Queue
Missing dispositions: component-php-fpm, component-worker-pods, component-runtime-memory-and-restarts, component-mysql-catalog, component-mysql-catalog2, component-postgres-billing, component-probe-failures, component-hpa-saturation, component-pod-node-churn, component-top-memory-containers
| Status | Action | Domain | ID |
|---|---|---|---|
| AMBER | PHP-FPM needs disposition | Application Runtime | component-php-fpm |
| AMBER | Worker pods needs disposition | Application Runtime | component-worker-pods |
| AMBER | Runtime memory and restarts needs disposition | Application Runtime | component-runtime-memory-and-restarts |
| RED | MySQL catalog needs disposition | Data Layer | component-mysql-catalog |
| RED | MySQL catalog2 needs disposition | Data Layer | component-mysql-catalog2 |
| AMBER | Postgres billing needs disposition | Data Layer | component-postgres-billing |
| AMBER | Probe failures needs disposition | Kubernetes & Capacity | component-probe-failures |
| AMBER | HPA saturation needs disposition | Kubernetes & Capacity | component-hpa-saturation |
| AMBER | Pod/node churn needs disposition | Kubernetes & Capacity | component-pod-node-churn |
| AMBER | Top memory containers needs disposition | Kubernetes & Capacity | component-top-memory-containers |
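
The queue above appears to list every non-GREEN component whose Disposition ID is empty. A sketch of that filter is below, under the assumption that an empty ID means no ADS, accepted-risk, or false-positive entry exists. The `Component` class and `missing_dispositions` function are hypothetical, and the disposition ID attached to `component-postgres-billing` is invented purely to show a row being filtered out; in this report that component is still missing a disposition.

```python
from dataclasses import dataclass


@dataclass
class Component:
    component_id: str
    status: str               # "GREEN", "AMBER", or "RED"
    disposition_id: str = ""  # empty when no disposition has been recorded


def missing_dispositions(components):
    """IDs of degraded components (not GREEN) that have no disposition."""
    return [
        c.component_id
        for c in components
        if c.status != "GREEN" and not c.disposition_id
    ]


components = [
    Component("component-php-fpm", "AMBER"),
    Component("component-api-app-pods", "GREEN"),
    Component("component-mysql-catalog", "RED"),
    # Hypothetical disposition ID, for illustration only.
    Component("component-postgres-billing", "AMBER", disposition_id="EXAMPLE-1"),
]

print(missing_dispositions(components))
```

Applied to the full dependency map with all Disposition ID cells empty, the same filter would return the ten IDs listed under "Missing dispositions".
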
Source Coverage
| Source | Status | Detail |
|---|---|---|
| Daily health JSON | ok | 5 daily artifact(s) found |
| Production reliability dashboard | ok | Weekly alerts/reliability artifact |
| Team weekly report | ok | Delivery/deploy evidence |
| Engineering council test report | ok | Test/smoke evidence |
| AWS posture evidence | warning | Cost/security/recommendation evidence |
| Action register | warning | ADS/accepted-risk/false-positive dispositions |
Evidence References
Source reports
- Production reliability dashboard
- Production reliability JSON
- Team weekly report
- Team weekly JSON
- Engineering council test report
- Engineering council test JSON
- Daily health JSON: 2026-05-20
- Daily health PDF: 2026-05-20
- Daily health JSON: 2026-05-21
- Daily health PDF: 2026-05-21
- Daily health JSON: 2026-05-22
- Daily health PDF: 2026-05-22
Grafana
- DB/readiness validation dashboard
- Grafana explore
- Traefik latency dashboard
- DB/readiness validation dashboard
- Traefik latency dashboard
- DB/readiness validation dashboard
- Traefik latency dashboard
- DB/readiness validation dashboard
- Traefik latency dashboard
- API family dashboards
- DB/readiness validation dashboard
- Traefik all traffic observability
Pingdom
Slack
- Slack alert: TraefikServiceHighErrorRate
- Slack alert: TraefikServiceHighLatency
- Slack alert: KubeJobFailed
- Slack alert: NodeHighNumberConntrackEntriesUsed
- Slack alert: KubeDeploymentReplicasMismatch
- Slack alert: KubePodCrashLooping
- Slack alert: KubeCPUOvercommit
- Slack alert: KubeHpaMaxedOut
- Slack alert: NodeSystemSaturation
- Slack alert: NodeHighNumberConntrackEntriesUsed
- Slack alert: NodeHighNumberConntrackEntriesUsed
- Slack alert: NodeHighNumberConntrackEntriesUsed
Reliability
- application alerts: 213
- infrastructure alerts: 0
- active AWS alarms: 0
- degraded days: 5
- red components: 2
- amber components: 8
Customer Impact
- customer visible failures: 14
- customer incidents observed: 14
- customer incidents confirmed: 0
- pingdom events: 14
- pingdom downtime minutes: 17
Delivery Health
- production bugs closed: 41
- delivery items: 31
- deployments: 9
- test runs: 9
- test pass rate: 22.22
- smoke attempts: 16
- smoke failed: 2
Cost
- total: 0
- currency: USD
- forecast: 0
Security
- security hub score: 0
- critical findings: 0
- high findings: 0
- guardduty findings: 0
- inspector critical: 0
- iam external access: 0
AWS Recommendations
- trusted advisor red: 0
- trusted advisor yellow: 0
- compute optimizer savings: 0
- cost optimization savings: 0
- well architected high risk issues: 0
- well architected medium risk issues: 0
Backup
- failed jobs: 0
- protected resources: 0