Production Operations Weekly Review

May 30 to Jun 8, 2026 / cadence: weekly
Overall status: RED
Degraded days: 7
RED components: 3
AMBER components: 11
Actions needing disposition: 14

Daily Production Health

Latest daily health issues
  • API endpoint p95 crossed actionable latency policy
  • mysql-catalog2 free storage was low
  • postgres-billing free storage was low

Daily Findings

Daily health findings by day (max 11):
  • May 31: 9
  • Jun 1: 10
  • Jun 2: 11
  • Jun 3: 9
  • Jun 4: 7
  • Jun 5: 6
  • Jun 6: 5

Reliability Movement

Customer visible failures: -14 (0 current / 14 previous)
Observed incidents: -14 (0 current / 14 previous)
Application alerts: -213 (0 current / 213 previous)
Infrastructure alerts: 0 (0 current / 0 previous)
Pingdom events: -14 (0 current / 14 previous)

Latency Trend

p95 max: 4155.4 ms
p99 max: 5000 ms

Component Heatmap

[Heatmap: component groups by day. Columns: May 31, Jun 1, Jun 2, Jun 3, Jun 4, Jun 5, Jun 6. Rows: Customer Edge, Application Runtime, Data Layer, Cache & Messaging, Batch & Scheduled Work, Kubernetes & Capacity, Observability & Alerting, External Dependencies.]

Daily Health Spine

Signals pulled from the daily production health PDF/JSON: customer checks, latency/errors, PHP-FPM probes, Kubernetes pressure, DB/cache pressure, and direct source links.

                Pingdom            Latency / errors             PHP-FPM / Kubernetes      DB / cache
Day     Status  Events  Peak (ms)  p95 (ms)  p99 (ms)  5xx (%)  Probe  Restarts  HPA max  RDS pressure  Redis  Slow SQL  References
May 31  AMBER   0       469        284.67    2416.93   0.04     0      4         1        2             7158   0
Jun 1   AMBER   0       0          592.56    5000      0.05     0      7         2        2             7158   0
Jun 2   RED     2       446        4155.4    5000      0.34     0      25        1        2             4029   0
Jun 3   RED     2       721        1999.31   5000      10.51    15     65        1        2             0      0
Jun 4   RED     0       476        204.12    906.46    0.08     4      35        1        1             0      0
Jun 5   RED     1       459        513.87    1237.46   0.15     0      9         0        2             0      0
Jun 6   RED     0       470        193.7     2245.39   0.07     0      3         0        2             0      0
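
The daily p95 values above can be compared against the two latency levels named later in this report ("250ms target" and "exceeded 1s"). The following is a minimal sketch of that comparison, not the policy the report generator applies: the bucket names and the functions `classify_p95` and `bucket_days` are hypothetical, and the report's own findings refer to active hours, whereas these are whole-day figures.

```python
# Daily p95 latency in ms, copied from the Daily Health Spine rows above.
P95_MS = {
    "May 31": 284.67,
    "Jun 1": 592.56,
    "Jun 2": 4155.4,
    "Jun 3": 1999.31,
    "Jun 4": 204.12,
    "Jun 5": 513.87,
    "Jun 6": 193.7,
}

TARGET_MS = 250.0  # the "250ms target" named in the Grafana / Prometheus finding
HARD_MS = 1000.0   # the "exceeded 1s" level named in the same finding


def classify_p95(p95_ms: float) -> str:
    """Hypothetical bucketing; the report's actual latency policy may differ."""
    if p95_ms > HARD_MS:
        return "over 1s"
    if p95_ms > TARGET_MS:
        return "over target"
    return "within target"


def bucket_days(p95_by_day: dict) -> dict:
    """Map each day label to its latency bucket."""
    return {day: classify_p95(value) for day, value in p95_by_day.items()}


if __name__ == "__main__":
    for day, bucket in bucket_days(P95_MS).items():
        print(f"{day}: {bucket}")
```

Under these assumed buckets, Jun 2 and Jun 3 land above 1 s, which lines up with the week's p95 maximum of 4155.4 ms falling on Jun 2.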

Production Dependency Map

Component | Status | Current period | Source | References | Disposition ID

Customer Edge (4 components)
Pingdom public checks | RED | Pingdom DOWN/SLOW 1; Pingdom DOWN/SLOW 2 | pingdom | | component-pingdom-public-checks
DNS resolution | GREEN | No degraded signal in collected evidence | dns_issue_check | | component-dns-resolution
Ingress / Traefik | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, service_5xx_rps, service_p95_top | | component-ingress-traefik
Public 5xx and latency | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, latency_triage, latency_signal | | component-public-5xx-and-latency

Application Runtime (5 components)
PHP-FPM | AMBER | probe failures 14; probe failures 3 | web_probe_failures_5m, web_restarts_5m, latency_triage | | component-php-fpm
API/app pods | GREEN | No degraded signal in collected evidence | web_ready_pods, web_running_pods, web_unavailable_replicas | | component-api-app-pods
Worker pods | AMBER | container restarts 25; container restarts 3; container restarts 35; container restarts 4 | top_restarts_24h, top_memory_containers_24h | | component-worker-pods
Runtime memory and restarts | AMBER | container restarts 25; container restarts 3; container restarts 35; container restarts 4 | top_memory_containers_24h, top_restarts_24h | | component-runtime-memory-and-restarts
Slow request families | GREEN | No degraded signal in collected evidence | latency_triage, slow_traces | | component-slow-request-families

Data Layer (5 components)
MySQL catalog | AMBER | mysql-catalog swap observed | rds:mysql-catalog, slowquery:mysql-catalog | | component-mysql-catalog
MySQL catalog2 | AMBER | mysql-catalog2 swap observed | rds:mysql-catalog2, slowquery:mysql-catalog2 | | component-mysql-catalog2
MySQL master | GREEN | No degraded signal in collected evidence | rds:mysql-master, slowquery:mysql-master | | component-mysql-master
Postgres billing | AMBER | postgres-billing swap observed | rds:postgres-billing, slowquery:postgres-billing | | component-postgres-billing
Slow queries | GREEN | No degraded signal in collected evidence | slowquery | | component-slow-queries

Cache & Messaging (3 components)
Redis / Sentinel | AMBER | Redis/Sentinel connection evidence | redis_issue_check, redis_evicted_keys_5m, redis_rejected_connections_5m | | component-redis-sentinel
RabbitMQ / queues | GREEN | No degraded signal in collected evidence | rabbitmq, queue_depth, consumer_lag | | component-rabbitmq-queues
Consumer lag or delayed processing | GREEN | No degraded signal in collected evidence | consumer_lag, cron_active | | component-consumer-lag-or-delayed-processing

Batch & Scheduled Work (3 components)
Watched CronJobs | GREEN | No degraded signal in collected evidence | cron_active | | component-watched-cronjobs
Kubernetes Jobs | GREEN | No degraded signal in collected evidence | KubeJobFailed, job_failed | | component-kubernetes-jobs
Long-running scheduled tasks | GREEN | No degraded signal in collected evidence | cron_active, slow_traces | | component-long-running-scheduled-tasks

Kubernetes & Capacity (5 components)
Readiness | GREEN | No degraded signal in collected evidence | web_ready_pods, web_unavailable_replicas | | component-readiness
Probe failures | AMBER | probe failures 14; probe failures 3 | web_probe_failures_5m | | component-probe-failures
HPA saturation | AMBER | HPA max hit: docgen2-api; HPA max hit: docgen2-api, notifications-event-manager; HPA max hit: notifications-event-manager; HPA max hit: subscriptions-api | hpa_current, hpa_max | | component-hpa-saturation
Pod/node churn | AMBER | container restarts 25; container restarts 3; container restarts 35; container restarts 4 | top_restarts_24h | | component-pod-node-churn
Top memory containers | AMBER | container restarts 25; container restarts 3; container restarts 35; container restarts 4 | top_memory_containers_24h | | component-top-memory-containers

Observability & Alerting (5 components)
Grafana / Prometheus | RED | API dashboard p95 needs route follow-up; API endpoint p95 crossed actionable latency policy; Platform p95 exceeded 1s in active hours; Platform p95 exceeded the 250ms target in active hours | prometheus | | component-grafana-prometheus
Loki | RED | API dashboard p95 needs route follow-up; API endpoint p95 crossed actionable latency policy; Platform p95 exceeded 1s in active hours; Platform p95 exceeded the 250ms target in active hours | loki, latency_triage | | component-loki
Tempo | GREEN | No degraded signal in collected evidence | slow_traces | | component-tempo
Alert rule health/noise | GREEN | No degraded signal in collected evidence | slack_alerts, alert_table | | component-alert-rule-health-noise
AWS alarm emails | GREEN | No degraded signal in collected evidence | aws_email_alerts, email_table | | component-aws-alarm-emails

External Dependencies (5 components)
Email provider | GREEN | No degraded signal in collected evidence | email_provider | | component-email-provider
SMS provider | GREEN | No degraded signal in collected evidence | sms_provider | | component-sms-provider
Payment provider | GREEN | No degraded signal in collected evidence | payment_provider | | component-payment-provider
Identity/login provider | GREEN | No degraded signal in collected evidence | identity_provider | | component-identity-login-provider
Webhooks / third-party APIs | GREEN | No degraded signal in collected evidence | third_party_api | | component-webhooks-third-party-apis

Previous period

Metric                        Current  Previous  Delta
active_aws_alarms             0        0         0
application_alerts            0        213       -213
customer_incidents_confirmed  0        0         0
customer_incidents_observed   0        14        -14
customer_visible_failures     0        14        -14
impacted_services             0        14        -14
infrastructure_alerts         0        0         0
pingdom_downtime_minutes      0        17        -17
pingdom_events                0        14        -14
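
Each delta in the table above is the current value minus the previous value. A minimal sketch of that arithmetic follows, with the figures copied from the table; the names `PERIODS` and `compute_deltas` are hypothetical and are not taken from the report generator.

```python
# (current, previous) pairs copied from the Previous period table above.
PERIODS = {
    "active_aws_alarms": (0, 0),
    "application_alerts": (0, 213),
    "customer_incidents_confirmed": (0, 0),
    "customer_incidents_observed": (0, 14),
    "customer_visible_failures": (0, 14),
    "impacted_services": (0, 14),
    "infrastructure_alerts": (0, 0),
    "pingdom_downtime_minutes": (0, 17),
    "pingdom_events": (0, 14),
}


def compute_deltas(periods: dict) -> dict:
    """Return current minus previous for every metric."""
    return {name: current - previous for name, (current, previous) in periods.items()}


if __name__ == "__main__":
    for name, delta in compute_deltas(PERIODS).items():
        print(f"{name}: {delta:+d}")
```

Whether the zeros in the current column are measured values or placeholders for a source that was not collected is not stated here; the sketch treats them as measured.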

ADS Action Queue

Missing dispositions: component-pingdom-public-checks, component-php-fpm, component-worker-pods, component-runtime-memory-and-restarts, component-mysql-catalog, component-mysql-catalog2, component-postgres-billing, component-redis-sentinel, component-probe-failures, component-hpa-saturation, component-pod-node-churn, component-top-memory-containers, component-grafana-prometheus, component-loki

Status | Action | Domain | ID
RED | Pingdom public checks needs disposition | Customer Edge | component-pingdom-public-checks
AMBER | PHP-FPM needs disposition | Application Runtime | component-php-fpm
AMBER | Worker pods needs disposition | Application Runtime | component-worker-pods
AMBER | Runtime memory and restarts needs disposition | Application Runtime | component-runtime-memory-and-restarts
AMBER | MySQL catalog needs disposition | Data Layer | component-mysql-catalog
AMBER | MySQL catalog2 needs disposition | Data Layer | component-mysql-catalog2
AMBER | Postgres billing needs disposition | Data Layer | component-postgres-billing
AMBER | Redis / Sentinel needs disposition | Cache & Messaging | component-redis-sentinel
AMBER | Probe failures needs disposition | Kubernetes & Capacity | component-probe-failures
AMBER | HPA saturation needs disposition | Kubernetes & Capacity | component-hpa-saturation
AMBER | Pod/node churn needs disposition | Kubernetes & Capacity | component-pod-node-churn
AMBER | Top memory containers needs disposition | Kubernetes & Capacity | component-top-memory-containers
RED | Grafana / Prometheus needs disposition | Observability & Alerting | component-grafana-prometheus
RED | Loki needs disposition | Observability & Alerting | component-loki
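
The queue above lists exactly the 14 components that are not GREEN. A minimal sketch of how such a queue could be derived is shown below, assuming the rule is "every non-GREEN component without a recorded disposition needs one". That rule, the function names, and the empty `recorded` set are assumptions for illustration; the report does not state how the queue is built.

```python
from collections import Counter

# Statuses copied from this report. The two GREEN entries are included only
# to show that GREEN components are filtered out.
COMPONENT_STATUS = {
    "component-pingdom-public-checks": "RED",
    "component-php-fpm": "AMBER",
    "component-worker-pods": "AMBER",
    "component-runtime-memory-and-restarts": "AMBER",
    "component-mysql-catalog": "AMBER",
    "component-mysql-catalog2": "AMBER",
    "component-postgres-billing": "AMBER",
    "component-redis-sentinel": "AMBER",
    "component-probe-failures": "AMBER",
    "component-hpa-saturation": "AMBER",
    "component-pod-node-churn": "AMBER",
    "component-top-memory-containers": "AMBER",
    "component-grafana-prometheus": "RED",
    "component-loki": "RED",
    "component-dns-resolution": "GREEN",
    "component-tempo": "GREEN",
}


def missing_dispositions(status_by_id: dict, recorded_ids: set) -> list:
    """IDs of non-GREEN components that have no recorded disposition (assumed rule)."""
    return [
        component_id
        for component_id, status in status_by_id.items()
        if status != "GREEN" and component_id not in recorded_ids
    ]


def status_counts(status_by_id: dict) -> Counter:
    """Count components per status."""
    return Counter(status_by_id.values())


if __name__ == "__main__":
    recorded = set()  # hypothetical: nothing in the action register yet
    counts = status_counts(COMPONENT_STATUS)
    print(f"RED={counts['RED']} AMBER={counts['AMBER']}")
    for component_id in missing_dispositions(COMPONENT_STATUS, recorded):
        print(component_id)
```

Recording a disposition for one ID, for example adding `component-loki` to `recorded`, would drop that entry from the list while leaving the status counts unchanged.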

Source Coverage

Source | Status | Detail
Daily health JSON | ok | 7 daily artifact(s) found
Production reliability dashboard | missing | Weekly alerts/reliability artifact
Team weekly report | warning | Delivery/deploy evidence
Engineering council test report | warning | Test/smoke evidence
AWS posture evidence | warning | aws CLI not found; provide --aws-evidence-json for AWS posture.
Action register | warning | ADS/accepted-risk/false-positive dispositions

Evidence References

Reliability

application alerts: 0
infrastructure alerts: 0
active aws alarms: 0
degraded days: 7
red components: 3
amber components: 11

Customer Impact

customer visible failures: 0
customer incidents observed: 0
customer incidents confirmed: 0
pingdom events: 0
pingdom downtime minutes: 0

Delivery Health

production bugs closed: 0
delivery items: 0
deployments: 0
test runs: 0
test pass rate: 0
smoke attempts: 0
smoke failed: 0

Cost

total: 0
currency: USD
forecast: 0

Security

security hub score: 0
critical findings: 0
high findings: 0
guardduty findings: 0
inspector critical: 0
iam external access: 0

AWS Recommendations

trusted advisor red: 0
trusted advisor yellow: 0
compute optimizer savings: 0
cost optimization savings: 0
well architected high risk issues: 0
well architected medium risk issues: 0

Backup

failed jobs: 0
protected resources: 0