Production Operations Weekly Review
Jun 6 to Jun 13, 2026 / cadence: weekly
Overall status: RED
Degraded days: 6
RED components: 2
AMBER components: 9
Actions needing disposition: 11
Daily Production Health
Latest daily health issues
- API endpoint p95 crossed actionable latency policy
- mysql-catalog2 free storage was critically low
- postgres-billing free storage was low
Daily Findings
Daily health findings by day
Reliability Movement
- Customer visible failures: -14 (0 current / 14 previous)
- Observed incidents: -14 (0 current / 14 previous)
- Application alerts: -213 (0 current / 213 previous)
- Infrastructure alerts: 0 (0 current / 0 previous)
- Pingdom events: -14 (0 current / 14 previous)
Latency Trend
p95 max: 277.47 ms; p99 max: 5000 ms
Component Heatmap
Days (columns): Jun 7, Jun 8, Jun 9, Jun 10, Jun 11, Jun 13
Component domains (rows): Customer Edge, Application Runtime, Data Layer, Cache & Messaging, Batch & Scheduled Work, Kubernetes & Capacity, Observability & Alerting, External Dependencies
Daily Health Spine
Signals pulled from the daily production health PDF/JSON: customer checks, latency/errors, PHP-FPM probes, Kubernetes pressure, DB/cache pressure, and direct source links.
| Day | Pingdom | Latency / errors | PHP-FPM / Kubernetes | DB / cache | References |
|---|---|---|---|---|---|
| Jun 7 (RED) | events 0; peak 443 ms | p95 135.85 ms; p99 512.76 ms; 5xx 0.07 % | probe 0; restarts 6; HPA max 0 | RDS pressure 2; Redis 0; slow SQL 0 | |
| Jun 8 (RED) | events 0; peak 650 ms | p95 186.89 ms; p99 331.5 ms; 5xx 0.4 % | probe 10; restarts 52; HPA max 2 | RDS pressure 2; Redis 0; slow SQL 0 | |
| Jun 9 (RED) | events 0; peak 0 ms | p95 277.47 ms; p99 4489.18 ms; 5xx 0.26 % | probe 79; restarts 42; HPA max 2 | RDS pressure 2; Redis 0; slow SQL 0 | |
| Jun 10 (RED) | events 0; peak 0 ms | p95 217.41 ms; p99 1428.03 ms; 5xx 0.14 % | probe 20; restarts 40; HPA max 2 | RDS pressure 2; Redis 0; slow SQL 0 | |
| Jun 11 (RED) | events 0; peak 0 ms | p95 228.47 ms; p99 5000 ms; 5xx 0.42 % | probe 33; restarts 31; HPA max 1 | RDS pressure 2; Redis 0; slow SQL 0 | |
| Jun 13 (RED) | events 0; peak 0 ms | p95 114.24 ms; p99 851.37 ms; 5xx 0.03 % | probe 0; restarts 8; HPA max 0 | RDS pressure 2; Redis 0; slow SQL 0 | |
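The weekly latency maxima reported under Latency Trend can be cross-checked against the daily rows above. The following is a minimal sketch, assuming the daily p95/p99 values are copied by hand from the table; the names `daily_latency_ms` and `weekly_peaks` are illustrative and not part of the reporting pipeline.

```python
# Daily (p95, p99) latency in ms, copied from the Daily Health Spine table.
daily_latency_ms = {
    "Jun 7": (135.85, 512.76),
    "Jun 8": (186.89, 331.5),
    "Jun 9": (277.47, 4489.18),
    "Jun 10": (217.41, 1428.03),
    "Jun 11": (228.47, 5000.0),
    "Jun 13": (114.24, 851.37),
}


def weekly_peaks(rows):
    """Return (max p95, max p99) across all days in the period."""
    p95_max = max(p95 for p95, _ in rows.values())
    p99_max = max(p99 for _, p99 in rows.values())
    return p95_max, p99_max


p95_max, p99_max = weekly_peaks(daily_latency_ms)
print(f"p95 max {p95_max} ms, p99 max {p99_max} ms")
```

The p95 peak falls on Jun 9 and the p99 peak on Jun 11, which agrees with the Latency Trend figures.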
Production Dependency Map
| Component | Status | Current period | Source | References | Disposition ID |
|---|---|---|---|---|---|
| Customer Edge (4 components) | | | | | |
| Pingdom public checks | GREEN | No degraded signal in collected evidence | pingdom | component-pingdom-public-checks | |
| DNS resolution | GREEN | No degraded signal in collected evidence | dns_issue_check | component-dns-resolution | |
| Ingress / Traefik | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, service_5xx_rps, service_p95_top | component-ingress-traefik | |
| Public 5xx and latency | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, latency_triage, latency_signal | component-public-5xx-and-latency | |
| Application Runtime (5 components) | | | | | |
| PHP-FPM | AMBER | probe failures 10; probe failures 20; probe failures 32; probe failures 79 | web_probe_failures_5m, web_restarts_5m, latency_triage | component-php-fpm | |
| API/app pods | GREEN | No degraded signal in collected evidence | web_ready_pods, web_running_pods, web_unavailable_replicas | component-api-app-pods | |
| Worker pods | AMBER | container restarts 31; container restarts 40; container restarts 42; container restarts 52 | top_restarts_24h, top_memory_containers_24h | component-worker-pods | |
| Runtime memory and restarts | AMBER | container restarts 31; container restarts 40; container restarts 42; container restarts 52 | top_memory_containers_24h, top_restarts_24h | component-runtime-memory-and-restarts | |
| Slow request families | GREEN | No degraded signal in collected evidence | latency_triage, slow_traces | component-slow-request-families | |
| Data Layer (5 components) | | | | | |
| MySQL catalog | AMBER | mysql-catalog swap observed | rds:mysql-catalog, slowquery:mysql-catalog | component-mysql-catalog | |
| MySQL catalog2 | AMBER | mysql-catalog2 swap observed | rds:mysql-catalog2, slowquery:mysql-catalog2 | component-mysql-catalog2 | |
| MySQL master | GREEN | No degraded signal in collected evidence | rds:mysql-master, slowquery:mysql-master | component-mysql-master | |
| Postgres billing | GREEN | No degraded signal in collected evidence | rds:postgres-billing, slowquery:postgres-billing | component-postgres-billing | |
| Slow queries | GREEN | No degraded signal in collected evidence | slowquery | component-slow-queries | |
| Cache & Messaging (3 components) | | | | | |
| Redis / Sentinel | GREEN | No degraded signal in collected evidence | redis_issue_check, redis_evicted_keys_5m, redis_rejected_connections_5m | component-redis-sentinel | |
| RabbitMQ / queues | GREEN | No degraded signal in collected evidence | rabbitmq, queue_depth, consumer_lag | component-rabbitmq-queues | |
| Consumer lag or delayed processing | GREEN | No degraded signal in collected evidence | consumer_lag, cron_active | component-consumer-lag-or-delayed-processing | |
| Batch & Scheduled Work (3 components) | | | | | |
| Watched CronJobs | GREEN | No degraded signal in collected evidence | cron_active | component-watched-cronjobs | |
| Kubernetes Jobs | GREEN | No degraded signal in collected evidence | KubeJobFailed, job_failed | component-kubernetes-jobs | |
| Long-running scheduled tasks | GREEN | No degraded signal in collected evidence | cron_active, slow_traces | component-long-running-scheduled-tasks | |
| Kubernetes & Capacity (5 components) | | | | | |
| Readiness | GREEN | No degraded signal in collected evidence | web_ready_pods, web_unavailable_replicas | component-readiness | |
| Probe failures | AMBER | probe failures 10; probe failures 20; probe failures 32; probe failures 79 | web_probe_failures_5m | component-probe-failures | |
| HPA saturation | AMBER | HPA max hit: docgen2-api; HPA max hit: docgen2-api, subscriptions-api | hpa_current, hpa_max | component-hpa-saturation | |
| Pod/node churn | AMBER | container restarts 31; container restarts 40; container restarts 42; container restarts 52 | top_restarts_24h | component-pod-node-churn | |
| Top memory containers | AMBER | container restarts 31; container restarts 40; container restarts 42; container restarts 52 | top_memory_containers_24h | component-top-memory-containers | |
| Observability & Alerting (5 components) | | | | | |
| Grafana / Prometheus | RED | API dashboard p95 needs route follow-up; API endpoint p95 crossed actionable latency policy; Web pod availability crossed policy; Web probes failed during the day | prometheus | component-grafana-prometheus | |
| Loki | RED | API dashboard p95 needs route follow-up; API endpoint p95 crossed actionable latency policy; Web pod availability crossed policy; Web probes failed during the day | loki, latency_triage | component-loki | |
| Tempo | GREEN | No degraded signal in collected evidence | slow_traces | component-tempo | |
| Alert rule health/noise | GREEN | No degraded signal in collected evidence | slack_alerts, alert_table | component-alert-rule-health-noise | |
| AWS alarm emails | GREEN | No degraded signal in collected evidence | aws_email_alerts, email_table | component-aws-alarm-emails | |
| External Dependencies (5 components) | | | | | |
| Email provider | GREEN | No degraded signal in collected evidence | email_provider | component-email-provider | |
| SMS provider | GREEN | No degraded signal in collected evidence | sms_provider | component-sms-provider | |
| Payment provider | GREEN | No degraded signal in collected evidence | payment_provider | component-payment-provider | |
| Identity/login provider | GREEN | No degraded signal in collected evidence | identity_provider | component-identity-login-provider | |
| Webhooks / third-party APIs | GREEN | No degraded signal in collected evidence | third_party_api | component-webhooks-third-party-apis | |
Previous period
| Metric | Current | Previous | Delta |
|---|---|---|---|
| active_aws_alarms | 0 | 0 | 0 |
| application_alerts | 0 | 213 | -213 |
| customer_incidents_confirmed | 0 | 0 | 0 |
| customer_incidents_observed | 0 | 14 | -14 |
| customer_visible_failures | 0 | 14 | -14 |
| impacted_services | 0 | 14 | -14 |
| infrastructure_alerts | 0 | 0 | 0 |
| pingdom_downtime_minutes | 0 | 17 | -17 |
| pingdom_events | 0 | 14 | -14 |
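The Delta column is the current value minus the previous value, so a negative delta means the metric dropped week over week. A minimal sketch of that arithmetic, assuming the counts are copied from the table above (the function name `period_delta` is illustrative):

```python
# Current and previous period counts, copied from the Previous period table.
current = {
    "application_alerts": 0,
    "customer_incidents_observed": 0,
    "customer_visible_failures": 0,
    "impacted_services": 0,
    "pingdom_downtime_minutes": 0,
    "pingdom_events": 0,
}
previous = {
    "application_alerts": 213,
    "customer_incidents_observed": 14,
    "customer_visible_failures": 14,
    "impacted_services": 14,
    "pingdom_downtime_minutes": 17,
    "pingdom_events": 14,
}


def period_delta(cur, prev):
    """Return current minus previous for every metric in the current period."""
    return {name: value - prev.get(name, 0) for name, value in cur.items()}


for name, delta in period_delta(current, previous).items():
    print(f"{name}: {delta:+d}")
```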
ADS Action Queue
Missing dispositions: component-php-fpm, component-worker-pods, component-runtime-memory-and-restarts, component-mysql-catalog, component-mysql-catalog2, component-probe-failures, component-hpa-saturation, component-pod-node-churn, component-top-memory-containers, component-grafana-prometheus, component-loki
| Status | Action | Domain | ID |
|---|---|---|---|
| AMBER | PHP-FPM needs disposition | Application Runtime | component-php-fpm |
| AMBER | Worker pods needs disposition | Application Runtime | component-worker-pods |
| AMBER | Runtime memory and restarts needs disposition | Application Runtime | component-runtime-memory-and-restarts |
| AMBER | MySQL catalog needs disposition | Data Layer | component-mysql-catalog |
| AMBER | MySQL catalog2 needs disposition | Data Layer | component-mysql-catalog2 |
| AMBER | Probe failures needs disposition | Kubernetes & Capacity | component-probe-failures |
| AMBER | HPA saturation needs disposition | Kubernetes & Capacity | component-hpa-saturation |
| AMBER | Pod/node churn needs disposition | Kubernetes & Capacity | component-pod-node-churn |
| AMBER | Top memory containers needs disposition | Kubernetes & Capacity | component-top-memory-containers |
| RED | Grafana / Prometheus needs disposition | Observability & Alerting | component-grafana-prometheus |
| RED | Loki needs disposition | Observability & Alerting | component-loki |
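The queue above appears to contain every non-GREEN component that has no recorded disposition. A minimal sketch of that selection, assuming a simple mapping from component ID to status and a hypothetical action register (`dispositions`) that is empty for this period; GREEN rows are abbreviated to a few examples.

```python
from collections import Counter

# Status per component ID, copied from the Production Dependency Map.
# Only a few GREEN rows are included for illustration.
component_status = {
    "component-pingdom-public-checks": "GREEN",
    "component-php-fpm": "AMBER",
    "component-worker-pods": "AMBER",
    "component-runtime-memory-and-restarts": "AMBER",
    "component-mysql-catalog": "AMBER",
    "component-mysql-catalog2": "AMBER",
    "component-mysql-master": "GREEN",
    "component-probe-failures": "AMBER",
    "component-hpa-saturation": "AMBER",
    "component-pod-node-churn": "AMBER",
    "component-top-memory-containers": "AMBER",
    "component-grafana-prometheus": "RED",
    "component-loki": "RED",
    "component-tempo": "GREEN",
}

# Hypothetical action register: component ID -> disposition text.
# Assumed empty because no disposition IDs are recorded in this report.
dispositions = {}


def missing_dispositions(statuses, register):
    """Return IDs of non-GREEN components that have no disposition."""
    return [
        cid
        for cid, status in statuses.items()
        if status != "GREEN" and cid not in register
    ]


missing = missing_dispositions(component_status, dispositions)
counts = Counter(component_status[cid] for cid in missing)
print(len(missing), dict(counts))
```

Recording a disposition for a component in the register removes it from the result, which is how the "Actions needing disposition" count would shrink in a later period.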
Source Coverage
| Source | Status | Detail |
|---|---|---|
| Daily health JSON | ok | 6 daily artifact(s) found |
| Production reliability dashboard | missing | Weekly alerts/reliability artifact |
| Team weekly report | ok | Delivery/deploy evidence |
| Engineering council test report | ok | Test/smoke evidence |
| AWS posture evidence | warning | aws CLI not found; provide --aws-evidence-json for AWS posture. |
| Action register | warning | ADS/accepted-risk/false-positive dispositions |
Evidence References
Source reports
- Team weekly report
- Team weekly JSON
- Engineering council test report
- Engineering council test JSON
- Daily health JSON: 2026-06-07
- Daily health PDF: 2026-06-07
- Daily health JSON: 2026-06-08
- Daily health PDF: 2026-06-08
- Daily health JSON: 2026-06-09
- Daily health PDF: 2026-06-09
- Daily health JSON: 2026-06-10
- Daily health PDF: 2026-06-10
Grafana
- API family dashboards
- DB/readiness validation dashboard
- Grafana explore
- Traefik all traffic observability
- Traefik endpoint observability
- Traefik latency dashboard
Reliability
- application alerts: 0
- infrastructure alerts: 0
- active aws alarms: 0
- degraded days: 6
- red components: 2
- amber components: 9
Customer Impact
- customer visible failures: 0
- customer incidents observed: 0
- customer incidents confirmed: 0
- pingdom events: 0
- pingdom downtime minutes: 0
Delivery Health
- production bugs closed: 35
- delivery items: 37
- deployments: 15
- test runs: 7
- test pass rate: 0
- smoke attempts: 20
- smoke failed: 4
Cost
- total: 0
- currency: USD
- forecast: 0
Security
- security hub score: 0
- critical findings: 0
- high findings: 0
- guardduty findings: 0
- inspector critical: 0
- iam external access: 0
AWS Recommendations
- trusted advisor red: 0
- trusted advisor yellow: 0
- compute optimizer savings: 0
- cost optimization savings: 0
- well architected high risk issues: 0
- well architected medium risk issues: 0
Backup
- failed jobs: 0
- protected resources: 0