Production Operations Weekly Review
May 30 to Jun 8, 2026 / cadence: weekly
Overall status: RED
Degraded days: 7
RED components: 3
AMBER components: 11
Actions needing disposition: 14
Daily Production Health
Latest daily health issues
- API endpoint p95 crossed actionable latency policy
- mysql-catalog2 free storage was low
- postgres-billing free storage was low
Daily Findings
Daily health findings by day
Reliability Movement
- Customer visible failures: -14 (0 current / 14 previous)
- Observed incidents: -14 (0 current / 14 previous)
- Application alerts: -213 (0 current / 213 previous)
- Infrastructure alerts: 0 (0 current / 0 previous)
- Pingdom events: -14 (0 current / 14 previous)
Latency Trend
p95 max: 4155.4 ms; p99 max: 5000 ms
Component Heatmap
[Heatmap: cell values not recoverable. Columns: May 31, Jun 1, Jun 2, Jun 3, Jun 4, Jun 5, Jun 6. Rows: Customer Edge, Application Runtime, Data Layer, Cache & Messaging, Batch & Scheduled Work, Kubernetes & Capacity, Observability & Alerting, External Dependencies.]
Daily Health Spine
Signals pulled from the daily production health PDF/JSON: customer checks, latency/errors, PHP-FPM probes, Kubernetes pressure, DB/cache pressure, and direct source links.
| Day | Status | Pingdom | Latency / errors | PHP-FPM / Kubernetes | DB / cache | References |
|---|---|---|---|---|---|---|
| May 31 | AMBER | events 0; peak 469 ms | p95 284.67 ms; p99 2416.93 ms; 5xx 0.04 % | probe 0; restarts 4; HPA max 1 | RDS pressure 2; Redis 7158; slow SQL 0 | |
| Jun 1 | AMBER | events 0; peak 0 ms | p95 592.56 ms; p99 5000 ms; 5xx 0.05 % | probe 0; restarts 7; HPA max 2 | RDS pressure 2; Redis 7158; slow SQL 0 | |
| Jun 2 | RED | events 2; peak 446 ms | p95 4155.4 ms; p99 5000 ms; 5xx 0.34 % | probe 0; restarts 25; HPA max 1 | RDS pressure 2; Redis 4029; slow SQL 0 | |
| Jun 3 | RED | events 2; peak 721 ms | p95 1999.31 ms; p99 5000 ms; 5xx 10.51 % | probe 15; restarts 65; HPA max 1 | RDS pressure 2; Redis 0; slow SQL 0 | |
| Jun 4 | RED | events 0; peak 476 ms | p95 204.12 ms; p99 906.46 ms; 5xx 0.08 % | probe 4; restarts 35; HPA max 1 | RDS pressure 1; Redis 0; slow SQL 0 | |
| Jun 5 | RED | events 1; peak 459 ms | p95 513.87 ms; p99 1237.46 ms; 5xx 0.15 % | probe 0; restarts 9; HPA max 0 | RDS pressure 2; Redis 0; slow SQL 0 | |
| Jun 6 | RED | events 0; peak 470 ms | p95 193.7 ms; p99 2245.39 ms; 5xx 0.07 % | probe 0; restarts 3; HPA max 0 | RDS pressure 2; Redis 0; slow SQL 0 | |
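The weekly headline numbers can be cross-checked against the daily rows. The following is a minimal sketch, not the report generator's actual logic: it assumes that the Latency Trend maxima are simple maxima over the seven daily values, and that a "degraded day" is any day whose status is not GREEN. The names `DAILY` and `weekly_rollup` are hypothetical.

```python
# Daily values copied from the Daily Health Spine table:
# (day, status, p95 ms, p99 ms, 5xx %).
DAILY = [
    ("May 31", "AMBER", 284.67, 2416.93, 0.04),
    ("Jun 1", "AMBER", 592.56, 5000.0, 0.05),
    ("Jun 2", "RED", 4155.4, 5000.0, 0.34),
    ("Jun 3", "RED", 1999.31, 5000.0, 10.51),
    ("Jun 4", "RED", 204.12, 906.46, 0.08),
    ("Jun 5", "RED", 513.87, 1237.46, 0.15),
    ("Jun 6", "RED", 193.7, 2245.39, 0.07),
]


def weekly_rollup(rows):
    """Summarize the week from daily rows.

    Assumption: a day counts as degraded when its status is not GREEN.
    """
    worst_5xx = max(rows, key=lambda r: r[4])
    return {
        "p95_max_ms": max(r[2] for r in rows),
        "p99_max_ms": max(r[3] for r in rows),
        "worst_5xx_day": worst_5xx[0],
        "worst_5xx_pct": worst_5xx[4],
        "degraded_days": sum(1 for r in rows if r[1] != "GREEN"),
    }


summary = weekly_rollup(DAILY)
print(summary)
```

Under these assumptions the sketch reproduces the reported p95 max of 4155.4 ms, the p99 max of 5000 ms, and 7 degraded days, and it singles out Jun 3 as the day with the highest 5xx rate.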
Production Dependency Map
| Component | Status | Current period | Source | References | Disposition ID |
|---|---|---|---|---|---|
| Customer Edge (4 components) | | | | | |
| Pingdom public checks | RED | Pingdom DOWN/SLOW 1; Pingdom DOWN/SLOW 2 | pingdom | component-pingdom-public-checks | |
| DNS resolution | GREEN | No degraded signal in collected evidence | dns_issue_check | component-dns-resolution | |
| Ingress / Traefik | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, service_5xx_rps, service_p95_top | component-ingress-traefik | |
| Public 5xx and latency | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, latency_triage, latency_signal | component-public-5xx-and-latency | |
| Application Runtime (5 components) | | | | | |
| PHP-FPM | AMBER | probe failures 14; probe failures 3 | web_probe_failures_5m, web_restarts_5m, latency_triage | component-php-fpm | |
| API/app pods | GREEN | No degraded signal in collected evidence | web_ready_pods, web_running_pods, web_unavailable_replicas | component-api-app-pods | |
| Worker pods | AMBER | container restarts 25; container restarts 3; container restarts 35; container restarts 4 | top_restarts_24h, top_memory_containers_24h | component-worker-pods | |
| Runtime memory and restarts | AMBER | container restarts 25; container restarts 3; container restarts 35; container restarts 4 | top_memory_containers_24h, top_restarts_24h | component-runtime-memory-and-restarts | |
| Slow request families | GREEN | No degraded signal in collected evidence | latency_triage, slow_traces | component-slow-request-families | |
| Data Layer (5 components) | | | | | |
| MySQL catalog | AMBER | mysql-catalog swap observed | rds:mysql-catalog, slowquery:mysql-catalog | component-mysql-catalog | |
| MySQL catalog2 | AMBER | mysql-catalog2 swap observed | rds:mysql-catalog2, slowquery:mysql-catalog2 | component-mysql-catalog2 | |
| MySQL master | GREEN | No degraded signal in collected evidence | rds:mysql-master, slowquery:mysql-master | component-mysql-master | |
| Postgres billing | AMBER | postgres-billing swap observed | rds:postgres-billing, slowquery:postgres-billing | component-postgres-billing | |
| Slow queries | GREEN | No degraded signal in collected evidence | slowquery | component-slow-queries | |
| Cache & Messaging (3 components) | | | | | |
| Redis / Sentinel | AMBER | Redis/Sentinel connection evidence | redis_issue_check, redis_evicted_keys_5m, redis_rejected_connections_5m | component-redis-sentinel | |
| RabbitMQ / queues | GREEN | No degraded signal in collected evidence | rabbitmq, queue_depth, consumer_lag | component-rabbitmq-queues | |
| Consumer lag or delayed processing | GREEN | No degraded signal in collected evidence | consumer_lag, cron_active | component-consumer-lag-or-delayed-processing | |
| Batch & Scheduled Work (3 components) | | | | | |
| Watched CronJobs | GREEN | No degraded signal in collected evidence | cron_active | component-watched-cronjobs | |
| Kubernetes Jobs | GREEN | No degraded signal in collected evidence | KubeJobFailed, job_failed | component-kubernetes-jobs | |
| Long-running scheduled tasks | GREEN | No degraded signal in collected evidence | cron_active, slow_traces | component-long-running-scheduled-tasks | |
| Kubernetes & Capacity (5 components) | | | | | |
| Readiness | GREEN | No degraded signal in collected evidence | web_ready_pods, web_unavailable_replicas | component-readiness | |
| Probe failures | AMBER | probe failures 14; probe failures 3 | web_probe_failures_5m | component-probe-failures | |
| HPA saturation | AMBER | HPA max hit: docgen2-api; HPA max hit: docgen2-api, notifications-event-manager; HPA max hit: notifications-event-manager; HPA max hit: subscriptions-api | hpa_current, hpa_max | component-hpa-saturation | |
| Pod/node churn | AMBER | container restarts 25; container restarts 3; container restarts 35; container restarts 4 | top_restarts_24h | component-pod-node-churn | |
| Top memory containers | AMBER | container restarts 25; container restarts 3; container restarts 35; container restarts 4 | top_memory_containers_24h | component-top-memory-containers | |
| Observability & Alerting (5 components) | | | | | |
| Grafana / Prometheus | RED | API dashboard p95 needs route follow-up; API endpoint p95 crossed actionable latency policy; Platform p95 exceeded 1s in active hours; Platform p95 exceeded the 250ms target in active hours | prometheus | component-grafana-prometheus | |
| Loki | RED | API dashboard p95 needs route follow-up; API endpoint p95 crossed actionable latency policy; Platform p95 exceeded 1s in active hours; Platform p95 exceeded the 250ms target in active hours | loki, latency_triage | component-loki | |
| Tempo | GREEN | No degraded signal in collected evidence | slow_traces | component-tempo | |
| Alert rule health/noise | GREEN | No degraded signal in collected evidence | slack_alerts, alert_table | component-alert-rule-health-noise | |
| AWS alarm emails | GREEN | No degraded signal in collected evidence | aws_email_alerts, email_table | component-aws-alarm-emails | |
| External Dependencies (5 components) | | | | | |
| Email provider | GREEN | No degraded signal in collected evidence | email_provider | component-email-provider | |
| SMS provider | GREEN | No degraded signal in collected evidence | sms_provider | component-sms-provider | |
| Payment provider | GREEN | No degraded signal in collected evidence | payment_provider | component-payment-provider | |
| Identity/login provider | GREEN | No degraded signal in collected evidence | identity_provider | component-identity-login-provider | |
| Webhooks / third-party APIs | GREEN | No degraded signal in collected evidence | third_party_api | component-webhooks-third-party-apis | |
Previous period
| Metric | Current | Previous | Delta |
|---|---|---|---|
| active_aws_alarms | 0 | 0 | 0 |
| application_alerts | 0 | 213 | -213 |
| customer_incidents_confirmed | 0 | 0 | 0 |
| customer_incidents_observed | 0 | 14 | -14 |
| customer_visible_failures | 0 | 14 | -14 |
| impacted_services | 0 | 14 | -14 |
| infrastructure_alerts | 0 | 0 | 0 |
| pingdom_downtime_minutes | 0 | 17 | -17 |
| pingdom_events | 0 | 14 | -14 |
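The Delta column is consistent with current minus previous, so a negative delta means the metric decreased. The sketch below recomputes it from the table values; the names `PREVIOUS`, `CURRENT`, and `compute_deltas` are hypothetical. Note that Source Coverage lists the production reliability dashboard as missing while the daily table shows Pingdom events on Jun 2, Jun 3, and Jun 5, so a current value of 0 may reflect an uncollected source rather than a measured zero. The sketch cannot distinguish the two cases.

```python
# Previous-period values copied from the table above.
PREVIOUS = {
    "active_aws_alarms": 0,
    "application_alerts": 213,
    "customer_incidents_confirmed": 0,
    "customer_incidents_observed": 14,
    "customer_visible_failures": 14,
    "impacted_services": 14,
    "infrastructure_alerts": 0,
    "pingdom_downtime_minutes": 17,
    "pingdom_events": 14,
}

# Every current value in the table is 0.
CURRENT = {name: 0 for name in PREVIOUS}


def compute_deltas(current, previous):
    """Delta is current minus previous; negative means a decrease."""
    return {name: current[name] - previous[name] for name in previous}


deltas = compute_deltas(CURRENT, PREVIOUS)
for name, value in deltas.items():
    print(f"{name}: {value:+d}")
```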
ADS Action Queue
Missing dispositions: component-pingdom-public-checks, component-php-fpm, component-worker-pods, component-runtime-memory-and-restarts, component-mysql-catalog, component-mysql-catalog2, component-postgres-billing, component-redis-sentinel, component-probe-failures, component-hpa-saturation, component-pod-node-churn, component-top-memory-containers, component-grafana-prometheus, component-loki
| Status | Action | Domain | ID |
|---|---|---|---|
| RED | Pingdom public checks needs disposition | Customer Edge | component-pingdom-public-checks |
| AMBER | PHP-FPM needs disposition | Application Runtime | component-php-fpm |
| AMBER | Worker pods needs disposition | Application Runtime | component-worker-pods |
| AMBER | Runtime memory and restarts needs disposition | Application Runtime | component-runtime-memory-and-restarts |
| AMBER | MySQL catalog needs disposition | Data Layer | component-mysql-catalog |
| AMBER | MySQL catalog2 needs disposition | Data Layer | component-mysql-catalog2 |
| AMBER | Postgres billing needs disposition | Data Layer | component-postgres-billing |
| AMBER | Redis / Sentinel needs disposition | Cache & Messaging | component-redis-sentinel |
| AMBER | Probe failures needs disposition | Kubernetes & Capacity | component-probe-failures |
| AMBER | HPA saturation needs disposition | Kubernetes & Capacity | component-hpa-saturation |
| AMBER | Pod/node churn needs disposition | Kubernetes & Capacity | component-pod-node-churn |
| AMBER | Top memory containers needs disposition | Kubernetes & Capacity | component-top-memory-containers |
| RED | Grafana / Prometheus needs disposition | Observability & Alerting | component-grafana-prometheus |
| RED | Loki needs disposition | Observability & Alerting | component-loki |
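The queue above matches what a simple filter over the dependency map would produce. The sketch below assumes, without confirmation from the report, that a component needs a disposition when its status is not GREEN and its Disposition ID is empty. `COMPONENTS` and `needs_disposition` are hypothetical names, and only two GREEN rows are included for contrast.

```python
from collections import Counter

# (component ID, status, disposition ID) copied from the dependency map.
# Disposition IDs are empty throughout the report.
COMPONENTS = [
    ("component-pingdom-public-checks", "RED", ""),
    ("component-dns-resolution", "GREEN", ""),
    ("component-php-fpm", "AMBER", ""),
    ("component-api-app-pods", "GREEN", ""),
    ("component-worker-pods", "AMBER", ""),
    ("component-runtime-memory-and-restarts", "AMBER", ""),
    ("component-mysql-catalog", "AMBER", ""),
    ("component-mysql-catalog2", "AMBER", ""),
    ("component-postgres-billing", "AMBER", ""),
    ("component-redis-sentinel", "AMBER", ""),
    ("component-probe-failures", "AMBER", ""),
    ("component-hpa-saturation", "AMBER", ""),
    ("component-pod-node-churn", "AMBER", ""),
    ("component-top-memory-containers", "AMBER", ""),
    ("component-grafana-prometheus", "RED", ""),
    ("component-loki", "RED", ""),
]


def needs_disposition(components):
    """Return IDs of non-GREEN components without a disposition (assumed rule)."""
    return [
        cid for cid, status, disposition in components
        if status != "GREEN" and not disposition
    ]


queue = needs_disposition(COMPONENTS)
counts = Counter(status for _, status, _ in COMPONENTS if status != "GREEN")
print(len(queue), counts["RED"], counts["AMBER"])
```

With the statuses from this report, the filter yields 14 queue entries (3 RED and 11 AMBER), which agrees with the summary counts at the top of the review.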
Source Coverage
| Source | Status | Detail |
|---|---|---|
| Daily health JSON | ok | 7 daily artifact(s) found |
| Production reliability dashboard | missing | Weekly alerts/reliability artifact |
| Team weekly report | warning | Delivery/deploy evidence |
| Engineering council test report | warning | Test/smoke evidence |
| AWS posture evidence | warning | aws CLI not found; provide --aws-evidence-json for AWS posture. |
| Action register | warning | ADS/accepted-risk/false-positive dispositions |
Evidence References
Source reports
- Daily health JSON: 2026-05-31
- Daily health PDF: 2026-05-31
- Daily health JSON: 2026-06-01
- Daily health PDF: 2026-06-01
- Daily health JSON: 2026-06-02
- Daily health PDF: 2026-06-02
- Daily health JSON: 2026-06-03
- Daily health PDF: 2026-06-03
- Daily health JSON: 2026-06-04
- Daily health PDF: 2026-06-04
- Daily health JSON: 2026-06-05
- Daily health PDF: 2026-06-05
Grafana
- API family dashboards
- DB/readiness validation dashboard
- Grafana explore
- Traefik all traffic observability
- Traefik endpoint observability
- Traefik latency dashboard
Reliability
- application alerts: 0
- infrastructure alerts: 0
- active AWS alarms: 0
- degraded days: 7
- red components: 3
- amber components: 11
Customer Impact
- customer visible failures: 0
- customer incidents observed: 0
- customer incidents confirmed: 0
- pingdom events: 0
- pingdom downtime minutes: 0
Delivery Health
- production bugs closed: 0
- delivery items: 0
- deployments: 0
- test runs: 0
- test pass rate: 0
- smoke attempts: 0
- smoke failed: 0
Cost
- total: 0
- currency: USD
- forecast: 0
Security
- Security Hub score: 0
- critical findings: 0
- high findings: 0
- GuardDuty findings: 0
- Inspector critical: 0
- IAM external access: 0
- iam external access
- 0
AWS Recommendations
- Trusted Advisor red: 0
- Trusted Advisor yellow: 0
- Compute Optimizer savings: 0
- cost optimization savings: 0
- Well-Architected high risk issues: 0
- Well-Architected medium risk issues: 0
Backup
- failed jobs: 0
- protected resources: 0