Production Operations Weekly Review

Jun 6 to Jun 13, 2026 / cadence: weekly
Overall status: RED
Degraded days: 6
RED components: 2
AMBER components: 9
Actions needing disposition: 11

Daily Production Health

Latest daily health issues
  • API endpoint p95 crossed actionable latency policy
  • mysql-catalog2 free storage was critically low
  • postgres-billing free storage was low

Daily Findings

Daily health findings by day: Jun 7: 5, Jun 8: 8, Jun 9: 7, Jun 10: 9, Jun 11: 7, Jun 13: 5 (max 9)

Reliability Movement

Customer visible failures: -14 (0 current / 14 previous)
Observed incidents: -14 (0 current / 14 previous)
Application alerts: -213 (0 current / 213 previous)
Infrastructure alerts: 0 (0 current / 0 previous)
Pingdom events: -14 (0 current / 14 previous)

Latency Trend

p95 max: 277.47 ms
p99 max: 5000 ms

Component Heatmap

[Heatmap: component group status by day]
Columns: Jun 7, Jun 8, Jun 9, Jun 10, Jun 11, Jun 13
Rows: Customer Edge, Application Runtime, Data Layer, Cache & Messaging, Batch & Scheduled Work, Kubernetes & Capacity, Observability & Alerting, External Dependencies

Daily Health Spine

Signals pulled from the daily production health PDF/JSON: customer checks, latency/errors, PHP-FPM probes, Kubernetes pressure, DB/cache pressure, and direct source links.

Day | Status | Pingdom events | Pingdom peak | p95 | p99 | 5xx | Probe | Restarts | HPA max | RDS pressure | Redis | Slow SQL
Jun 7 | RED | 0 | 443 ms | 135.85 ms | 512.76 ms | 0.07 % | 0 | 6 | 0 | 2 | 0 | 0
Jun 8 | RED | 0 | 650 ms | 186.89 ms | 331.5 ms | 0.4 % | 10 | 52 | 2 | 2 | 0 | 0
Jun 9 | RED | 0 | 0 ms | 277.47 ms | 4489.18 ms | 0.26 % | 79 | 42 | 2 | 2 | 0 | 0
Jun 10 | RED | 0 | 0 ms | 217.41 ms | 1428.03 ms | 0.14 % | 20 | 40 | 2 | 2 | 0 | 0
Jun 11 | RED | 0 | 0 ms | 228.47 ms | 5000 ms | 0.42 % | 33 | 31 | 1 | 2 | 0 | 0
Jun 13 | RED | 0 | 0 ms | 114.24 ms | 851.37 ms | 0.03 % | 0 | 8 | 0 | 2 | 0 | 0
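
The weekly maxima in the Latency Trend section (p95 277.47 ms, p99 5000 ms) can be cross-checked against the daily rows above. The following is a minimal Python sketch of that check, not the report generator's own logic: the dictionary layout and the helper name `weekly_peak` are assumptions made for illustration, and the values are copied from the daily rows.

```python
# Daily p95/p99 latency in ms, copied from the Daily Health Spine rows.
daily_latency_ms = {
    "Jun 7": {"p95": 135.85, "p99": 512.76},
    "Jun 8": {"p95": 186.89, "p99": 331.5},
    "Jun 9": {"p95": 277.47, "p99": 4489.18},
    "Jun 10": {"p95": 217.41, "p99": 1428.03},
    "Jun 11": {"p95": 228.47, "p99": 5000.0},
    "Jun 13": {"p95": 114.24, "p99": 851.37},
}


def weekly_peak(daily, metric):
    """Return (day, value) for the highest daily value of `metric`.

    Hypothetical helper; the real report tooling is not shown in this document.
    """
    day = max(daily, key=lambda d: daily[d][metric])
    return day, daily[day][metric]


p95_peak = weekly_peak(daily_latency_ms, "p95")
p99_peak = weekly_peak(daily_latency_ms, "p99")
print("p95 peak:", p95_peak)
print("p99 peak:", p99_peak)
```

Under these inputs the p95 peak falls on Jun 9 and the p99 peak on Jun 11, which agrees with the maxima reported in the Latency Trend section.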

Production Dependency Map

Component | Status | Current period | Source | Disposition ID

Customer Edge (4 components)
Pingdom public checks | GREEN | No degraded signal in collected evidence | pingdom | component-pingdom-public-checks
DNS resolution | GREEN | No degraded signal in collected evidence | dns_issue_check | component-dns-resolution
Ingress / Traefik | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, service_5xx_rps, service_p95_top | component-ingress-traefik
Public 5xx and latency | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, latency_triage, latency_signal | component-public-5xx-and-latency

Application Runtime (5 components)
PHP-FPM | AMBER | probe failures 10; probe failures 20; probe failures 32; probe failures 79 | web_probe_failures_5m, web_restarts_5m, latency_triage | component-php-fpm
API/app pods | GREEN | No degraded signal in collected evidence | web_ready_pods, web_running_pods, web_unavailable_replicas | component-api-app-pods
Worker pods | AMBER | container restarts 31; container restarts 40; container restarts 42; container restarts 52 | top_restarts_24h, top_memory_containers_24h | component-worker-pods
Runtime memory and restarts | AMBER | container restarts 31; container restarts 40; container restarts 42; container restarts 52 | top_memory_containers_24h, top_restarts_24h | component-runtime-memory-and-restarts
Slow request families | GREEN | No degraded signal in collected evidence | latency_triage, slow_traces | component-slow-request-families

Data Layer (5 components)
MySQL catalog | AMBER | mysql-catalog swap observed | rds:mysql-catalog, slowquery:mysql-catalog | component-mysql-catalog
MySQL catalog2 | AMBER | mysql-catalog2 swap observed | rds:mysql-catalog2, slowquery:mysql-catalog2 | component-mysql-catalog2
MySQL master | GREEN | No degraded signal in collected evidence | rds:mysql-master, slowquery:mysql-master | component-mysql-master
Postgres billing | GREEN | No degraded signal in collected evidence | rds:postgres-billing, slowquery:postgres-billing | component-postgres-billing
Slow queries | GREEN | No degraded signal in collected evidence | slowquery | component-slow-queries

Cache & Messaging (3 components)
Redis / Sentinel | GREEN | No degraded signal in collected evidence | redis_issue_check, redis_evicted_keys_5m, redis_rejected_connections_5m | component-redis-sentinel
RabbitMQ / queues | GREEN | No degraded signal in collected evidence | rabbitmq, queue_depth, consumer_lag | component-rabbitmq-queues
Consumer lag or delayed processing | GREEN | No degraded signal in collected evidence | consumer_lag, cron_active | component-consumer-lag-or-delayed-processing

Batch & Scheduled Work (3 components)
Watched CronJobs | GREEN | No degraded signal in collected evidence | cron_active | component-watched-cronjobs
Kubernetes Jobs | GREEN | No degraded signal in collected evidence | KubeJobFailed, job_failed | component-kubernetes-jobs
Long-running scheduled tasks | GREEN | No degraded signal in collected evidence | cron_active, slow_traces | component-long-running-scheduled-tasks

Kubernetes & Capacity (5 components)
Readiness | GREEN | No degraded signal in collected evidence | web_ready_pods, web_unavailable_replicas | component-readiness
Probe failures | AMBER | probe failures 10; probe failures 20; probe failures 32; probe failures 79 | web_probe_failures_5m | component-probe-failures
HPA saturation | AMBER | HPA max hit: docgen2-api; HPA max hit: docgen2-api, subscriptions-api | hpa_current, hpa_max | component-hpa-saturation
Pod/node churn | AMBER | container restarts 31; container restarts 40; container restarts 42; container restarts 52 | top_restarts_24h | component-pod-node-churn
Top memory containers | AMBER | container restarts 31; container restarts 40; container restarts 42; container restarts 52 | top_memory_containers_24h | component-top-memory-containers

Observability & Alerting (5 components)
Grafana / Prometheus | RED | API dashboard p95 needs route follow-up; API endpoint p95 crossed actionable latency policy; Web pod availability crossed policy; Web probes failed during the day | prometheus | component-grafana-prometheus
Loki | RED | API dashboard p95 needs route follow-up; API endpoint p95 crossed actionable latency policy; Web pod availability crossed policy; Web probes failed during the day | loki, latency_triage | component-loki
Tempo | GREEN | No degraded signal in collected evidence | slow_traces | component-tempo
Alert rule health/noise | GREEN | No degraded signal in collected evidence | slack_alerts, alert_table | component-alert-rule-health-noise
AWS alarm emails | GREEN | No degraded signal in collected evidence | aws_email_alerts, email_table | component-aws-alarm-emails

External Dependencies (5 components)
Email provider | GREEN | No degraded signal in collected evidence | email_provider | component-email-provider
SMS provider | GREEN | No degraded signal in collected evidence | sms_provider | component-sms-provider
Payment provider | GREEN | No degraded signal in collected evidence | payment_provider | component-payment-provider
Identity/login provider | GREEN | No degraded signal in collected evidence | identity_provider | component-identity-login-provider
Webhooks / third-party APIs | GREEN | No degraded signal in collected evidence | third_party_api | component-webhooks-third-party-apis

Previous period

Metric | Current | Previous | Delta
active_aws_alarms | 0 | 0 | 0
application_alerts | 0 | 213 | -213
customer_incidents_confirmed | 0 | 0 | 0
customer_incidents_observed | 0 | 14 | -14
customer_visible_failures | 0 | 14 | -14
impacted_services | 0 | 14 | -14
infrastructure_alerts | 0 | 0 | 0
pingdom_downtime_minutes | 0 | 17 | -17
pingdom_events | 0 | 14 | -14
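
Every Delta value above is consistent with current minus previous. The following is a minimal Python sketch of that arithmetic, using four rows copied from the table; the `metrics` layout and the `deltas` name are illustrative assumptions, not the reporting tool's actual code.

```python
# (current, previous) pairs copied from the Previous period table.
metrics = {
    "application_alerts": (0, 213),
    "customer_visible_failures": (0, 14),
    "pingdom_downtime_minutes": (0, 17),
    "infrastructure_alerts": (0, 0),
}

# Assumed convention: delta = current - previous, so a negative delta
# means the current period's count is lower than the previous one.
deltas = {name: current - previous for name, (current, previous) in metrics.items()}

for name, value in deltas.items():
    print(f"{name}: {value:+d}")
```

A negative delta only says the current count is lower. Since Source Coverage lists the production reliability dashboard as missing, it may be worth confirming that the zeros in the Current column are real readings rather than gaps in collection.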

ADS Action Queue

Missing dispositions: component-php-fpm, component-worker-pods, component-runtime-memory-and-restarts, component-mysql-catalog, component-mysql-catalog2, component-probe-failures, component-hpa-saturation, component-pod-node-churn, component-top-memory-containers, component-grafana-prometheus, component-loki

Status | Action | Domain | ID
AMBER | PHP-FPM needs disposition | Application Runtime | component-php-fpm
AMBER | Worker pods needs disposition | Application Runtime | component-worker-pods
AMBER | Runtime memory and restarts needs disposition | Application Runtime | component-runtime-memory-and-restarts
AMBER | MySQL catalog needs disposition | Data Layer | component-mysql-catalog
AMBER | MySQL catalog2 needs disposition | Data Layer | component-mysql-catalog2
AMBER | Probe failures needs disposition | Kubernetes & Capacity | component-probe-failures
AMBER | HPA saturation needs disposition | Kubernetes & Capacity | component-hpa-saturation
AMBER | Pod/node churn needs disposition | Kubernetes & Capacity | component-pod-node-churn
AMBER | Top memory containers needs disposition | Kubernetes & Capacity | component-top-memory-containers
RED | Grafana / Prometheus needs disposition | Observability & Alerting | component-grafana-prometheus
RED | Loki needs disposition | Observability & Alerting | component-loki
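
The queue above lines up with the headline counts: 9 AMBER plus 2 RED components give 11 actions needing disposition. The following is a minimal Python sketch of how such a queue could be derived. It rests on assumptions: the rule that every non-GREEN component without a recorded disposition is queued, the "worst status wins" rollup, and the names `statuses`, `recorded_dispositions`, and `missing_dispositions` are all hypothetical and not confirmed by this report.

```python
from collections import Counter

# Component statuses copied from the Production Dependency Map.
# Only two GREEN components are included here, to keep the sketch short.
statuses = {
    "component-php-fpm": "AMBER",
    "component-worker-pods": "AMBER",
    "component-runtime-memory-and-restarts": "AMBER",
    "component-mysql-catalog": "AMBER",
    "component-mysql-catalog2": "AMBER",
    "component-probe-failures": "AMBER",
    "component-hpa-saturation": "AMBER",
    "component-pod-node-churn": "AMBER",
    "component-top-memory-containers": "AMBER",
    "component-grafana-prometheus": "RED",
    "component-loki": "RED",
    "component-tempo": "GREEN",
    "component-pingdom-public-checks": "GREEN",
}

# Assumption: no dispositions have been recorded yet for this period.
recorded_dispositions = set()


def missing_dispositions(statuses, recorded):
    """List non-GREEN component IDs that have no recorded disposition."""
    return [
        component_id
        for component_id, status in statuses.items()
        if status in ("RED", "AMBER") and component_id not in recorded
    ]


counts = Counter(statuses.values())
queue = missing_dispositions(statuses, recorded_dispositions)

# Assumed rollup rule: the worst component status decides the overall status.
overall = "RED" if counts["RED"] else "AMBER" if counts["AMBER"] else "GREEN"

print(counts["RED"], counts["AMBER"], len(queue), overall)
```

Recording a disposition for a component (adding its ID to `recorded_dispositions`) would remove it from the queue without changing its status.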

Source Coverage

Source | Status | Detail
Daily health JSON | ok | 6 daily artifact(s) found
Production reliability dashboard | missing | Weekly alerts/reliability artifact
Team weekly report | ok | Delivery/deploy evidence
Engineering council test report | ok | Test/smoke evidence
AWS posture evidence | warning | aws CLI not found; provide --aws-evidence-json for AWS posture.
Action register | warning | ADS/accepted-risk/false-positive dispositions

Evidence References

Reliability

application alerts: 0
infrastructure alerts: 0
active aws alarms: 0
degraded days: 6
red components: 2
amber components: 9

Customer Impact

customer visible failures: 0
customer incidents observed: 0
customer incidents confirmed: 0
pingdom events: 0
pingdom downtime minutes: 0

Delivery Health

production bugs closed: 35
delivery items: 37
deployments: 15
test runs: 7
test pass rate: 0
smoke attempts: 20
smoke failed: 4

Cost

total: 0
currency: USD
forecast: 0

Security

security hub score: 0
critical findings: 0
high findings: 0
guardduty findings: 0
inspector critical: 0
iam external access: 0

AWS Recommendations

trusted advisor red: 0
trusted advisor yellow: 0
compute optimizer savings: 0
cost optimization savings: 0
well architected high risk issues: 0
well architected medium risk issues: 0

Backup

failed jobs: 0
protected resources: 0