Production Ops

Production Operations Weekly Review

May 16 to May 24, 2026 / cadence: weekly
Overall status: RED
Degraded days: 5
RED components: 2
AMBER components: 8
Actions needing disposition: 10

Daily Production Health

Latest daily health issues
  • Platform p95 exceeded the 250ms target overnight
  • Platform p99 hit extreme active-hour tail latency
  • docgen2-api HPA hit max replicas

Daily Findings

Daily health findings by day (max 10):
  • May 20: 10
  • May 21: 9
  • May 22: 6
  • May 23: 2
  • May 24: 3

Reliability Movement

Customer visible failures: -1 (14 current / 15 previous)
Observed incidents: -1 (14 current / 15 previous)
Application alerts: +128 (213 current / 85 previous)
Infrastructure alerts: 0 (0 current / 0 previous)
Pingdom events: -1 (14 current / 15 previous)

Latency Trend

p95 max: 12920 ms
p99 max: 20000 ms

Component Heatmap

Heatmap of component group status by day.
Columns: May 20, May 21, May 22, May 23, May 24
Rows: Customer Edge, Application Runtime, Data Layer, Cache & Messaging, Batch & Scheduled Work, Kubernetes & Capacity, Observability & Alerting, External Dependencies

Daily Health Spine

Signals pulled from the daily production health PDF/JSON: customer checks, latency/errors, PHP-FPM probes, Kubernetes pressure, DB/cache pressure, and direct source links.

Day | Pingdom | Latency / errors | PHP-FPM / Kubernetes | DB / cache | References
May 20 (RED) | events 0, peak 0 ms | p95 2021 ms, p99 7240 ms, 5xx 6.1% | probe 1683, restarts 37, HPA max 2 | RDS pressure 2, Redis 0, slow SQL 0 |
May 21 (RED) | events 0, peak 415 ms | p95 12920 ms, p99 20000 ms, 5xx 1.15% | probe 24, restarts 35, HPA max 3 | RDS pressure 1, Redis 0, slow SQL 0 |
May 22 (RED) | events 0, peak 414 ms | p95 1150 ms, p99 20000 ms, 5xx 0.02% | probe 0, restarts 4, HPA max 3 | RDS pressure 1, Redis 0, slow SQL 0 |
May 23 (RED) | events 0, peak 421 ms | p95 237.73 ms, p99 20000 ms, 5xx 0.03% | probe 0, restarts 2, HPA max 1 | RDS pressure 1, Redis 0, slow SQL 0 |
May 24 (RED) | events 0, peak 427 ms | p95 800 ms, p99 17080 ms, 5xx 0.03% | probe 0, restarts 3, HPA max 1 | RDS pressure 2, Redis 0, slow SQL 0 |
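
The p95 column can be checked against the 250 ms target named under "Latest daily health issues". The sketch below is illustrative only: the values are copied from the rows above, while the dictionary layout, the `days_over_target` helper, and the choice to apply the target to the daily p95 figure are assumptions, not the report's own tooling.

```python
# Daily values copied from the Daily Health Spine rows.
P95_TARGET_MS = 250  # target named in "Latest daily health issues"

daily_latency_ms = {
    "May 20": {"p95": 2021, "p99": 7240},
    "May 21": {"p95": 12920, "p99": 20000},
    "May 22": {"p95": 1150, "p99": 20000},
    "May 23": {"p95": 237.73, "p99": 20000},
    "May 24": {"p95": 800, "p99": 17080},
}


def days_over_target(latency, target_ms):
    """Return the days whose p95 is strictly above the target, in input order."""
    return [day for day, values in latency.items() if values["p95"] > target_ms]


over_target = days_over_target(daily_latency_ms, P95_TARGET_MS)
weekly_p95_max = max(values["p95"] for values in daily_latency_ms.values())
weekly_p99_max = max(values["p99"] for values in daily_latency_ms.values())

print("p95 over target:", ", ".join(over_target))
print("weekly max p95 / p99 (ms):", weekly_p95_max, "/", weekly_p99_max)
```

Under these assumptions, May 23 is the only day whose p95 stays under the target, and the weekly maxima agree with the Latency Trend figures.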

Production Dependency Map

Component | Status | Current period | Source | References | Disposition ID

Customer Edge (4 components)
Pingdom public checks | GREEN | No degraded signal in collected evidence | pingdom | | component-pingdom-public-checks
DNS resolution | GREEN | No degraded signal in collected evidence | dns_issue_check | | component-dns-resolution
Ingress / Traefik | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, service_5xx_rps, service_p95_top | | component-ingress-traefik
Public 5xx and latency | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, latency_triage, latency_signal | | component-public-5xx-and-latency

Application Runtime (5 components)
PHP-FPM | AMBER | probe failures 1683; probe failures 23 | web_probe_failures_5m, web_restarts_5m, latency_triage | | component-php-fpm
API/app pods | GREEN | No degraded signal in collected evidence | web_ready_pods, web_running_pods, web_unavailable_replicas | | component-api-app-pods
Worker pods | AMBER | container restarts 2; container restarts 3; container restarts 34; container restarts 36 | top_restarts_24h, top_memory_containers_24h | | component-worker-pods
Runtime memory and restarts | AMBER | container restarts 2; container restarts 3; container restarts 34; container restarts 36 | top_memory_containers_24h, top_restarts_24h | | component-runtime-memory-and-restarts
Slow request families | GREEN | No degraded signal in collected evidence | latency_triage, slow_traces | | component-slow-request-families

Data Layer (5 components)
MySQL catalog | RED | mysql-catalog free memory below 512MiB; mysql-catalog swap observed | rds:mysql-catalog, slowquery:mysql-catalog | | component-mysql-catalog
MySQL catalog2 | RED | mysql-catalog2 free memory below 512MiB; mysql-catalog2 swap observed | rds:mysql-catalog2, slowquery:mysql-catalog2 | | component-mysql-catalog2
MySQL master | GREEN | No degraded signal in collected evidence | rds:mysql-master, slowquery:mysql-master | | component-mysql-master
Postgres billing | AMBER | postgres-billing swap observed | rds:postgres-billing, slowquery:postgres-billing | | component-postgres-billing
Slow queries | GREEN | No degraded signal in collected evidence | slowquery | | component-slow-queries

Cache & Messaging (3 components)
Redis / Sentinel | GREEN | No degraded signal in collected evidence | redis_issue_check, redis_evicted_keys_5m, redis_rejected_connections_5m | | component-redis-sentinel
RabbitMQ / queues | GREEN | No degraded signal in collected evidence | rabbitmq, queue_depth, consumer_lag | | component-rabbitmq-queues
Consumer lag or delayed processing | GREEN | No degraded signal in collected evidence | consumer_lag, cron_active | | component-consumer-lag-or-delayed-processing

Batch & Scheduled Work (3 components)
Watched CronJobs | GREEN | No degraded signal in collected evidence | cron_active | | component-watched-cronjobs
Kubernetes Jobs | GREEN | No degraded signal in collected evidence | KubeJobFailed, job_failed | | component-kubernetes-jobs
Long-running scheduled tasks | GREEN | No degraded signal in collected evidence | cron_active, slow_traces | | component-long-running-scheduled-tasks

Kubernetes & Capacity (5 components)
Readiness | GREEN | No degraded signal in collected evidence | web_ready_pods, web_unavailable_replicas | | component-readiness
Probe failures | AMBER | probe failures 1683; probe failures 23 | web_probe_failures_5m | | component-probe-failures
HPA saturation | AMBER | HPA max hit: docgen2-api; HPA max hit: docgen2-api, notifications-event-manager; HPA max hit: docgen2-api, notifications-event-manager, web; HPA max hit: docgen2-api, subscriptions-api, web | hpa_current, hpa_max | | component-hpa-saturation
Pod/node churn | AMBER | container restarts 2; container restarts 3; container restarts 34; container restarts 36 | top_restarts_24h | | component-pod-node-churn
Top memory containers | AMBER | container restarts 2; container restarts 3; container restarts 34; container restarts 36 | top_memory_containers_24h | | component-top-memory-containers

Observability & Alerting (5 components)
Grafana / Prometheus | GREEN | No degraded signal in collected evidence | prometheus | | component-grafana-prometheus
Loki | GREEN | No degraded signal in collected evidence | loki, latency_triage | | component-loki
Tempo | GREEN | No degraded signal in collected evidence | slow_traces | | component-tempo
Alert rule health/noise | GREEN | No degraded signal in collected evidence | slack_alerts, alert_table | | component-alert-rule-health-noise
AWS alarm emails | GREEN | No degraded signal in collected evidence | aws_email_alerts, email_table | | component-aws-alarm-emails

External Dependencies (5 components)
Email provider | GREEN | No degraded signal in collected evidence | email_provider | | component-email-provider
SMS provider | GREEN | No degraded signal in collected evidence | sms_provider | | component-sms-provider
Payment provider | GREEN | No degraded signal in collected evidence | payment_provider | | component-payment-provider
Identity/login provider | GREEN | No degraded signal in collected evidence | identity_provider | | component-identity-login-provider
Webhooks / third-party APIs | GREEN | No degraded signal in collected evidence | third_party_api | | component-webhooks-third-party-apis

Previous period

Metric | Current | Previous | Delta
active_aws_alarms | 0 | 0 | 0
application_alerts | 213 | 85 | +128
customer_incidents_confirmed | 0 | 0 | 0
customer_incidents_observed | 14 | 15 | -1
customer_visible_failures | 14 | 15 | -1
impacted_services | 14 | 10 | +4
infrastructure_alerts | 0 | 0 | 0
pingdom_downtime_minutes | 17 | 15 | +2
pingdom_events | 14 | 15 | -1
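
The Delta column is consistent with current minus previous for every metric listed. As a minimal sketch of that arithmetic (the `signed_delta` helper and the dictionary layout are hypothetical and only illustrate the calculation; the numbers are copied from the table):

```python
# metric -> (current, previous), copied from the period comparison table.
metrics = {
    "active_aws_alarms": (0, 0),
    "application_alerts": (213, 85),
    "customer_incidents_confirmed": (0, 0),
    "customer_incidents_observed": (14, 15),
    "customer_visible_failures": (14, 15),
    "impacted_services": (14, 10),
    "infrastructure_alerts": (0, 0),
    "pingdom_downtime_minutes": (17, 15),
    "pingdom_events": (14, 15),
}


def signed_delta(current, previous):
    """Format current - previous with an explicit sign; unchanged values print as "0"."""
    delta = current - previous
    return "0" if delta == 0 else f"{delta:+d}"


deltas = {name: signed_delta(cur, prev) for name, (cur, prev) in metrics.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta}")
```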

ADS Action Queue

Missing dispositions: component-php-fpm, component-worker-pods, component-runtime-memory-and-restarts, component-mysql-catalog, component-mysql-catalog2, component-postgres-billing, component-probe-failures, component-hpa-saturation, component-pod-node-churn, component-top-memory-containers

Status | Action | Domain | ID
AMBER | PHP-FPM needs disposition | Application Runtime | component-php-fpm
AMBER | Worker pods needs disposition | Application Runtime | component-worker-pods
AMBER | Runtime memory and restarts needs disposition | Application Runtime | component-runtime-memory-and-restarts
RED | MySQL catalog needs disposition | Data Layer | component-mysql-catalog
RED | MySQL catalog2 needs disposition | Data Layer | component-mysql-catalog2
AMBER | Postgres billing needs disposition | Data Layer | component-postgres-billing
AMBER | Probe failures needs disposition | Kubernetes & Capacity | component-probe-failures
AMBER | HPA saturation needs disposition | Kubernetes & Capacity | component-hpa-saturation
AMBER | Pod/node churn needs disposition | Kubernetes & Capacity | component-pod-node-churn
AMBER | Top memory containers needs disposition | Kubernetes & Capacity | component-top-memory-containers
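
The queue above matches the set of non-GREEN components in the Production Dependency Map. One way to reproduce it is sketched below, under two stated assumptions: any RED or AMBER component without a recorded disposition enters the queue, and the overall status is the worst component status. The `action_queue` helper, the `SEVERITY` ordering, and the empty `recorded_dispositions` set are hypothetical, not the report generator's actual logic.

```python
# Assumed severity order: RED outranks AMBER, which outranks GREEN.
SEVERITY = {"GREEN": 0, "AMBER": 1, "RED": 2}

# (disposition ID, status) pairs copied from the Production Dependency Map.
# Only two GREEN components are included, for contrast.
components = [
    ("component-pingdom-public-checks", "GREEN"),
    ("component-php-fpm", "AMBER"),
    ("component-worker-pods", "AMBER"),
    ("component-runtime-memory-and-restarts", "AMBER"),
    ("component-mysql-catalog", "RED"),
    ("component-mysql-catalog2", "RED"),
    ("component-mysql-master", "GREEN"),
    ("component-postgres-billing", "AMBER"),
    ("component-probe-failures", "AMBER"),
    ("component-hpa-saturation", "AMBER"),
    ("component-pod-node-churn", "AMBER"),
    ("component-top-memory-containers", "AMBER"),
]

# Hypothetical: no dispositions have been recorded for this period.
recorded_dispositions = set()


def action_queue(items, recorded):
    """Keep non-GREEN components that have no recorded disposition."""
    return [(cid, status) for cid, status in items
            if status != "GREEN" and cid not in recorded]


queue = action_queue(components, recorded_dispositions)
red_count = sum(1 for _, status in queue if status == "RED")
amber_count = sum(1 for _, status in queue if status == "AMBER")
overall_status = max((status for _, status in components), key=SEVERITY.__getitem__)

print(f"{len(queue)} actions: {red_count} RED, {amber_count} AMBER; overall {overall_status}")
```

With these inputs the sketch yields 10 actions (2 RED, 8 AMBER) and an overall status of RED, in line with the summary figures at the top of the review.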

Source Coverage

Source | Status | Detail
Daily health JSON | ok | 5 daily artifact(s) found
Production reliability dashboard | ok | Weekly alerts/reliability artifact
Team weekly report | ok | Delivery/deploy evidence
Engineering council test report | ok | Test/smoke evidence
AWS posture evidence | warning | Cost/security/recommendation evidence
Action register | warning | ADS/accepted-risk/false-positive dispositions

Evidence References

Reliability

application alerts: 213
infrastructure alerts: 0
active aws alarms: 0
degraded days: 5
red components: 2
amber components: 8

Customer Impact

customer visible failures: 14
customer incidents observed: 14
customer incidents confirmed: 0
pingdom events: 14
pingdom downtime minutes: 17

Delivery Health

production bugs closed: 41
delivery items: 31
deployments: 9
test runs: 9
test pass rate: 22.22%
smoke attempts: 16
smoke failed: 2
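
The delivery figures can be cross-checked with simple arithmetic. The sketch below assumes the test pass rate is a percentage of the 9 test runs, rounded to two decimals; that reading is an inference from the numbers, not something the report states, and the variable names are illustrative.

```python
# Values copied from the Delivery Health figures.
test_runs = 9
reported_pass_rate = 22.22  # assumed to be a percentage
smoke_attempts = 16
smoke_failed = 2

# Assumption: pass rate = passing runs / total runs * 100, rounded to two decimals.
implied_passing_runs = round(reported_pass_rate / 100 * test_runs)
recomputed_pass_rate = round(implied_passing_runs / test_runs * 100, 2)

# Smoke failure share, computed the same way.
smoke_failure_rate = round(smoke_failed / smoke_attempts * 100, 2)

print("implied passing runs:", implied_passing_runs, "of", test_runs)
print("recomputed pass rate (%):", recomputed_pass_rate)
print("smoke failure rate (%):", smoke_failure_rate)
```

Under that assumption, the reported rate corresponds to 2 passing runs out of 9.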

Cost

total: 0
currency: USD
forecast: 0

Security

security hub score: 0
critical findings: 0
high findings: 0
guardduty findings: 0
inspector critical: 0
iam external access: 0

AWS Recommendations

trusted advisor red: 0
trusted advisor yellow: 0
compute optimizer savings: 0
cost optimization savings: 0
well architected high risk issues: 0
well architected medium risk issues: 0

Backup

failed jobs: 0
protected resources: 0