Production Operations Weekly Review

Jun 13 to Jun 20, 2026 / cadence: weekly
Overall status: RED
Degraded days: 3
RED components: 4
AMBER components: 10
Actions needing disposition: 14

Daily Production Health

Latest daily health issues
  • API endpoint p95 crossed actionable latency policy
  • Web probes failed during the day
  • Redis pod churn/failover happened during the report day

Daily Findings

Daily health findings by day (max 17)
Jun 14: 4
Jun 15: 17
Jun 16: 12

Reliability Movement

Customer visible failures: 0 (0 current / 0 previous)
Observed incidents: 0 (0 current / 0 previous)
Application alerts: -55 (0 current / 55 previous)
Infrastructure alerts: -7 (0 current / 7 previous)
Pingdom events: 0 (0 current / 0 previous)

Latency Trend

p95 max: 5000 ms
p99 max: 5000 ms

Component Heatmap

Days (columns): Jun 14, Jun 15, Jun 16
Component domains (rows): Customer Edge, Application Runtime, Data Layer, Cache & Messaging, Batch & Scheduled Work, Kubernetes & Capacity, Observability & Alerting, External Dependencies

Daily Health Spine

Signals pulled from the daily production health PDF/JSON: customer checks, latency/errors, PHP-FPM probes, Kubernetes pressure, DB/cache pressure, and direct source links.

Day | Pingdom | Latency / errors | PHP-FPM / Kubernetes | DB / cache | References
Jun 14 (RED) | events 0, peak 0 ms | p95 145.78 ms, p99 377.46 ms, 5xx 0.04 % | probe 0, restarts 8, HPA max 1 | RDS pressure 2, Redis 0, slow SQL 0 |
Jun 15 (RED) | events 0, peak 0 ms | p95 5000 ms, p99 5000 ms, 5xx 2.03 % | probe 15633, restarts 49, HPA max 2 | RDS pressure 2, Redis 6340, slow SQL 0 |
Jun 16 (RED) | events 0, peak 0 ms | p95 768.31 ms, p99 5000 ms, 5xx 14.35 % | probe 9, restarts 532, HPA max 2 | RDS pressure 2, Redis 3170, slow SQL 0 |
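The per-day signals above can be rolled up into period peaks, which is one way to cross-check the p95/p99 maxima shown under Latency Trend. The following is a minimal sketch, not the report generator's actual code: the `DailyHealth` record, its field names, and the helper functions are hypothetical, and it assumes the "probe" column counts failed probes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyHealth:
    """One row of the daily health table (field names are illustrative)."""
    day: str
    p95_ms: float
    p99_ms: float
    error_5xx_pct: float
    probe_failures: int  # assumption: the "probe" column counts failed probes
    restarts: int


# Values copied from the three daily rows above.
DAYS = [
    DailyHealth("Jun 14", 145.78, 377.46, 0.04, 0, 8),
    DailyHealth("Jun 15", 5000.0, 5000.0, 2.03, 15633, 49),
    DailyHealth("Jun 16", 768.31, 5000.0, 14.35, 9, 532),
]

NUMERIC_FIELDS = ("p95_ms", "p99_ms", "error_5xx_pct", "probe_failures", "restarts")


def weekly_peaks(days):
    """Return the highest observed value of each numeric signal."""
    return {f: max(getattr(d, f) for d in days) for f in NUMERIC_FIELDS}


def worst_day(days, field):
    """Return the label of the day with the highest value for one signal."""
    return max(days, key=lambda d: getattr(d, field)).day


peaks = weekly_peaks(DAYS)
print(peaks["p95_ms"], peaks["p99_ms"])
print(worst_day(DAYS, "error_5xx_pct"), worst_day(DAYS, "restarts"))
```

With these rows, both latency peaks come out at 5000 ms, and the highest 5xx rate and restart count both fall on Jun 16.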

Production Dependency Map

Component | Status | Current period | Source | References | Disposition ID

Customer Edge (4 components)
Pingdom public checks | GREEN | No degraded signal in collected evidence | pingdom | | component-pingdom-public-checks
DNS resolution | GREEN | No degraded signal in collected evidence | dns_issue_check | | component-dns-resolution
Ingress / Traefik | GREEN | No degraded signal in collected evidence | service_5xx_rate_pct, service_5xx_rps, service_p95_top | | component-ingress-traefik
Public 5xx and latency | AMBER | route latency candidates 8 | service_5xx_rate_pct, latency_triage, latency_signal | | component-public-5xx-and-latency

Application Runtime (5 components)
PHP-FPM | AMBER | probe failures 15633; probe failures 8 | web_probe_failures_5m, web_restarts_5m, latency_triage | | component-php-fpm
API/app pods | GREEN | No degraded signal in collected evidence | web_ready_pods, web_running_pods, web_unavailable_replicas | | component-api-app-pods
Worker pods | AMBER | container restarts 49; container restarts 532; container restarts 8 | top_restarts_24h, top_memory_containers_24h | | component-worker-pods
Runtime memory and restarts | AMBER | container restarts 49; container restarts 532; container restarts 8 | top_memory_containers_24h, top_restarts_24h | | component-runtime-memory-and-restarts
Slow request families | AMBER | route latency candidates 8 | latency_triage, slow_traces | | component-slow-request-families

Data Layer (5 components)
MySQL catalog | RED | mysql-catalog free memory below 1GiB; mysql-catalog free memory below 512MiB; mysql-catalog swap observed | rds:mysql-catalog, slowquery:mysql-catalog | | component-mysql-catalog
MySQL catalog2 | RED | mysql-catalog2 free memory below 512MiB; mysql-catalog2 swap observed | rds:mysql-catalog2, slowquery:mysql-catalog2 | | component-mysql-catalog2
MySQL master | GREEN | No degraded signal in collected evidence | rds:mysql-master, slowquery:mysql-master | | component-mysql-master
Postgres billing | GREEN | No degraded signal in collected evidence | rds:postgres-billing, slowquery:postgres-billing | | component-postgres-billing
Slow queries | GREEN | No degraded signal in collected evidence | slowquery | | component-slow-queries

Cache & Messaging (3 components)
Redis / Sentinel | AMBER | Redis/Sentinel connection evidence | redis_issue_check, redis_evicted_keys_5m, redis_rejected_connections_5m | | component-redis-sentinel
RabbitMQ / queues | GREEN | No degraded signal in collected evidence | rabbitmq, queue_depth, consumer_lag | | component-rabbitmq-queues
Consumer lag or delayed processing | GREEN | No degraded signal in collected evidence | consumer_lag, cron_active | | component-consumer-lag-or-delayed-processing

Batch & Scheduled Work (3 components)
Watched CronJobs | GREEN | No degraded signal in collected evidence | cron_active | | component-watched-cronjobs
Kubernetes Jobs | GREEN | No degraded signal in collected evidence | KubeJobFailed, job_failed | | component-kubernetes-jobs
Long-running scheduled tasks | GREEN | No degraded signal in collected evidence | cron_active, slow_traces | | component-long-running-scheduled-tasks

Kubernetes & Capacity (5 components)
Readiness | GREEN | No degraded signal in collected evidence | web_ready_pods, web_unavailable_replicas | | component-readiness
Probe failures | AMBER | probe failures 15633; probe failures 8 | web_probe_failures_5m | | component-probe-failures
HPA saturation | AMBER | HPA max hit: docgen2-api; HPA max hit: docgen2-api, subscriptions-api | hpa_current, hpa_max | | component-hpa-saturation
Pod/node churn | AMBER | container restarts 49; container restarts 532; container restarts 8 | top_restarts_24h | | component-pod-node-churn
Top memory containers | AMBER | container restarts 49; container restarts 532; container restarts 8 | top_memory_containers_24h | | component-top-memory-containers

Observability & Alerting (5 components)
Grafana / Prometheus | RED | API endpoint p95 crossed actionable latency policy; Platform p95/p99 crossed actionable latency policy; Web probes failed during the day; mysql-catalog2 free storage was critically low | prometheus | | component-grafana-prometheus
Loki | RED | API endpoint p95 crossed actionable latency policy; Platform p95/p99 crossed actionable latency policy; mysql-catalog2 free storage was critically low; postgres-billing free storage was low | loki, latency_triage | | component-loki
Tempo | GREEN | No degraded signal in collected evidence | slow_traces | | component-tempo
Alert rule health/noise | GREEN | No degraded signal in collected evidence | slack_alerts, alert_table | | component-alert-rule-health-noise
AWS alarm emails | GREEN | No degraded signal in collected evidence | aws_email_alerts, email_table | | component-aws-alarm-emails

External Dependencies (5 components)
Email provider | GREEN | No degraded signal in collected evidence | email_provider | | component-email-provider
SMS provider | GREEN | No degraded signal in collected evidence | sms_provider | | component-sms-provider
Payment provider | GREEN | No degraded signal in collected evidence | payment_provider | | component-payment-provider
Identity/login provider | GREEN | No degraded signal in collected evidence | identity_provider | | component-identity-login-provider
Webhooks / third-party APIs | GREEN | No degraded signal in collected evidence | third_party_api | | component-webhooks-third-party-apis
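The RED and AMBER component counts in the report header can be reproduced from the component statuses listed above. The sketch below tallies them and derives a per-domain and overall status. The worst-status-wins rollup is an assumption for illustration, since the report does not state how the overall status is computed, and `SEVERITY`, `COMPONENT_STATUS`, and `worst` are hypothetical names.

```python
from collections import Counter

# Assumed ordering: a higher number means a more severe status.
SEVERITY = {"GREEN": 0, "AMBER": 1, "RED": 2}

# Component statuses per domain, in the order they appear in the map above.
COMPONENT_STATUS = {
    "Customer Edge": ["GREEN", "GREEN", "GREEN", "AMBER"],
    "Application Runtime": ["AMBER", "GREEN", "AMBER", "AMBER", "AMBER"],
    "Data Layer": ["RED", "RED", "GREEN", "GREEN", "GREEN"],
    "Cache & Messaging": ["AMBER", "GREEN", "GREEN"],
    "Batch & Scheduled Work": ["GREEN", "GREEN", "GREEN"],
    "Kubernetes & Capacity": ["GREEN", "AMBER", "AMBER", "AMBER", "AMBER"],
    "Observability & Alerting": ["RED", "RED", "GREEN", "GREEN", "GREEN"],
    "External Dependencies": ["GREEN", "GREEN", "GREEN", "GREEN", "GREEN"],
}


def worst(statuses):
    """Return the most severe status in a non-empty list (assumed rollup rule)."""
    return max(statuses, key=SEVERITY.__getitem__)


# Tally every component status across all domains.
counts = Counter(s for statuses in COMPONENT_STATUS.values() for s in statuses)

# Roll each domain up to its worst component, then roll domains up to one status.
domain_status = {name: worst(statuses) for name, statuses in COMPONENT_STATUS.items()}
overall = worst(list(domain_status.values()))

print(counts["RED"], counts["AMBER"], counts["GREEN"])
print(overall)
```

Under this rule the tally gives 4 RED and 10 AMBER components, and the overall status resolves to RED, matching the header values.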

Previous period

Metric | Current | Previous | Delta
active_aws_alarms | 0 | 1 | -1
application_alerts | 0 | 55 | -55
customer_incidents_confirmed | 0 | 0 | 0
customer_incidents_observed | 0 | 0 | 0
customer_visible_failures | 0 | 0 | 0
impacted_services | 0 | 13 | -13
infrastructure_alerts | 0 | 7 | -7
pingdom_downtime_minutes | 0 | 0 | 0
pingdom_events | 0 | 0 | 0
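The Delta column is consistent with current minus previous for every metric listed. A minimal sketch of that arithmetic follows. The `METRICS` mapping and `compute_deltas` helper are hypothetical names, and the values are copied from the table above.

```python
# (current, previous) pairs copied from the comparison table above.
METRICS = {
    "active_aws_alarms": (0, 1),
    "application_alerts": (0, 55),
    "customer_incidents_confirmed": (0, 0),
    "customer_incidents_observed": (0, 0),
    "customer_visible_failures": (0, 0),
    "impacted_services": (0, 13),
    "infrastructure_alerts": (0, 7),
    "pingdom_downtime_minutes": (0, 0),
    "pingdom_events": (0, 0),
}


def compute_deltas(metrics):
    """Return current - previous for each metric; negative means a decrease."""
    return {name: current - previous for name, (current, previous) in metrics.items()}


deltas = compute_deltas(METRICS)
for name, value in deltas.items():
    print(f"{name}: {value:+d}")
```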

ADS Action Queue

Missing dispositions: component-public-5xx-and-latency, component-php-fpm, component-worker-pods, component-runtime-memory-and-restarts, component-slow-request-families, component-mysql-catalog, component-mysql-catalog2, component-redis-sentinel, component-probe-failures, component-hpa-saturation, component-pod-node-churn, component-top-memory-containers, component-grafana-prometheus, component-loki

Status | Action | Domain | ID
AMBER | Public 5xx and latency needs disposition | Customer Edge | component-public-5xx-and-latency
AMBER | PHP-FPM needs disposition | Application Runtime | component-php-fpm
AMBER | Worker pods needs disposition | Application Runtime | component-worker-pods
AMBER | Runtime memory and restarts needs disposition | Application Runtime | component-runtime-memory-and-restarts
AMBER | Slow request families needs disposition | Application Runtime | component-slow-request-families
RED | MySQL catalog needs disposition | Data Layer | component-mysql-catalog
RED | MySQL catalog2 needs disposition | Data Layer | component-mysql-catalog2
AMBER | Redis / Sentinel needs disposition | Cache & Messaging | component-redis-sentinel
AMBER | Probe failures needs disposition | Kubernetes & Capacity | component-probe-failures
AMBER | HPA saturation needs disposition | Kubernetes & Capacity | component-hpa-saturation
AMBER | Pod/node churn needs disposition | Kubernetes & Capacity | component-pod-node-churn
AMBER | Top memory containers needs disposition | Kubernetes & Capacity | component-top-memory-containers
RED | Grafana / Prometheus needs disposition | Observability & Alerting | component-grafana-prometheus
RED | Loki needs disposition | Observability & Alerting | component-loki
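Every ID in the queue above belongs to a RED or AMBER component, which suggests the missing-dispositions list is the set of non-GREEN components with no entry in the action register. The sketch below illustrates that reading on a few components. The rule itself, the register shape (a set of IDs that already have a recorded disposition), and the name `missing_dispositions` are assumptions, not the tool's documented behavior.

```python
# A small sample of component IDs and statuses taken from the dependency map.
COMPONENTS = {
    "component-pingdom-public-checks": "GREEN",
    "component-public-5xx-and-latency": "AMBER",
    "component-php-fpm": "AMBER",
    "component-mysql-catalog": "RED",
    "component-mysql-master": "GREEN",
    "component-loki": "RED",
}


def missing_dispositions(components, register):
    """List non-GREEN component IDs that have no recorded disposition.

    `register` is assumed to be a set of IDs that already carry a
    disposition (for example ADS, accepted risk, or false positive).
    """
    return [
        component_id
        for component_id, status in components.items()
        if status != "GREEN" and component_id not in register
    ]


# With an empty register, every degraded component needs a disposition.
print(missing_dispositions(COMPONENTS, register=set()))

# Recording a disposition for one ID removes it from the queue.
print(missing_dispositions(COMPONENTS, register={"component-php-fpm"}))
```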

Source Coverage

Source | Status | Detail
Daily health JSON | ok | 3 daily artifact(s) found
Production reliability dashboard | missing | Weekly alerts/reliability artifact
Team weekly report | warning | Delivery/deploy evidence
Engineering council test report | warning | Test/smoke evidence
AWS posture evidence | warning | aws CLI not found; provide --aws-evidence-json for AWS posture.
Action register | warning | ADS/accepted-risk/false-positive dispositions

Evidence References

Reliability

application alerts: 0
infrastructure alerts: 0
active aws alarms: 0
degraded days: 3
red components: 4
amber components: 10

Customer Impact

customer visible failures: 0
customer incidents observed: 0
customer incidents confirmed: 0
pingdom events: 0
pingdom downtime minutes: 0

Delivery Health

production bugs closed: 0
delivery items: 0
deployments: 0
test runs: 0
test pass rate: 0
smoke attempts: 0
smoke failed: 0

Cost

total: 0
currency: USD
forecast: 0

Security

security hub score: 0
critical findings: 0
high findings: 0
guardduty findings: 0
inspector critical: 0
iam external access: 0

AWS Recommendations

trusted advisor red: 0
trusted advisor yellow: 0
compute optimizer savings: 0
cost optimization savings: 0
well architected high risk issues: 0
well architected medium risk issues: 0

Backup

failed jobs: 0
protected resources: 0