Production Ops

Production Operations Weekly Review

May 16 to May 24, 2026 / cadence: weekly

Overall status: RED

Degraded days: 5

RED components: 2

AMBER components: 8

Actions needing disposition: 10

Daily Production Health

May 20: 10, May 21: 9, May 22: 6, May 23: 2, May 24: 3

Latest daily health issues

Platform p95 exceeded the 250ms target overnight
Platform p99 hit extreme active-hour tail latency
docgen2-api HPA hit max replicas

Daily Findings

Daily health findings by day

Reliability Movement

Customer visible failures: -1 (14 current / 15 previous)

Observed incidents: -1 (14 current / 15 previous)

Application alerts: +128 (213 current / 85 previous)

Infrastructure alerts: 0 (0 current / 0 previous)

Pingdom events: -1 (14 current / 15 previous)

Latency Trend

p95 max: 12920 ms; p99 max: 20000 ms

Component Heatmap

Heatmap of daily status by domain. Columns: May 20, May 21, May 22, May 23, May 24. Rows: Customer Edge, Application Runtime, Data Layer, Cache & Messaging, Batch & Scheduled Work, Kubernetes & Capacity, Observability & Alerting, External Dependencies.

Daily Health Spine

Signals pulled from the daily production health PDF/JSON: customer checks, latency/errors, PHP-FPM probes, Kubernetes pressure, DB/cache pressure, and direct source links.

Day	Status	Pingdom	Latency / errors	PHP-FPM / Kubernetes	DB / cache	References
May 20	RED	events 0, peak 0 ms	p95 2021 ms, p99 7240 ms, 5xx 6.1%	probe 1683, restarts 37, HPA max 2	RDS pressure 2, Redis 0, slow SQL 0	Daily PDF; Daily JSON; Traefik latency dashboard; DB/readiness validation dashboard
May 21	RED	events 0, peak 415 ms	p95 12920 ms, p99 20000 ms, 5xx 1.15%	probe 24, restarts 35, HPA max 3	RDS pressure 1, Redis 0, slow SQL 0	Daily PDF; Daily JSON; Traefik latency dashboard; DB/readiness validation dashboard
May 22	RED	events 0, peak 414 ms	p95 1150 ms, p99 20000 ms, 5xx 0.02%	probe 0, restarts 4, HPA max 3	RDS pressure 1, Redis 0, slow SQL 0	Daily PDF; Daily JSON; Traefik latency dashboard; DB/readiness validation dashboard
May 23	RED	events 0, peak 421 ms	p95 237.73 ms, p99 20000 ms, 5xx 0.03%	probe 0, restarts 2, HPA max 1	RDS pressure 1, Redis 0, slow SQL 0	Daily PDF; Daily JSON; Traefik latency dashboard; DB/readiness validation dashboard
May 24	RED	events 0, peak 427 ms	p95 800 ms, p99 17080 ms, 5xx 0.03%	probe 0, restarts 3, HPA max 1	RDS pressure 2, Redis 0, slow SQL 0	Daily PDF; Daily JSON; Traefik latency dashboard; Traefik endpoint observability
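
The p95 figures in the table above can be checked against the 250 ms target quoted under "Latest daily health issues". The following is a minimal sketch, not the report's actual pipeline: the `DailyLatency` type and both helper functions are hypothetical names, and the values are copied by hand from the table.

```python
from dataclasses import dataclass

# Target quoted in "Latest daily health issues"; treating it as a strict
# upper bound on p95 is an assumption of this sketch.
P95_TARGET_MS = 250.0


@dataclass(frozen=True)
class DailyLatency:
    day: str
    p95_ms: float
    p99_ms: float


# Values copied from the Daily Health Spine table.
SPINE = [
    DailyLatency("May 20", 2021.0, 7240.0),
    DailyLatency("May 21", 12920.0, 20000.0),
    DailyLatency("May 22", 1150.0, 20000.0),
    DailyLatency("May 23", 237.73, 20000.0),
    DailyLatency("May 24", 800.0, 17080.0),
]


def days_over_p95_target(rows, target_ms=P95_TARGET_MS):
    """Return the days whose p95 latency exceeded the target."""
    return [row.day for row in rows if row.p95_ms > target_ms]


def weekly_max(rows):
    """Return (max p95, max p99) across all rows, in milliseconds."""
    return max(r.p95_ms for r in rows), max(r.p99_ms for r in rows)


if __name__ == "__main__":
    print(days_over_p95_target(SPINE))
    print(weekly_max(SPINE))
```

Under these assumptions May 23 is the only day with p95 under the target, so its RED status presumably comes from another signal such as p99; the report does not state the daily status rule.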

Production Dependency Map

Component	Status	Current period	Source	References	Disposition ID
Customer Edge (4 components)
Pingdom public checks	GREEN	No degraded signal in collected evidence	pingdom	Pingdom export JSON Pingdom target: Adservio Ro Pingdom target: https://www.adservio.ro/api/v2/status	component-pingdom-public-checks
DNS resolution	GREEN	No degraded signal in collected evidence	dns_issue_check	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-dns-resolution
Ingress / Traefik	GREEN	No degraded signal in collected evidence	service_5xx_rate_pct, service_5xx_rps, service_p95_top	Traefik latency dashboard API family dashboards Traefik all traffic observability Traefik endpoint observability	component-ingress-traefik
Public 5xx and latency	GREEN	No degraded signal in collected evidence	service_5xx_rate_pct, latency_triage, latency_signal	Traefik latency dashboard API family dashboards Traefik all traffic observability Traefik endpoint observability	component-public-5xx-and-latency
Application Runtime (5 components)
PHP-FPM	AMBER	probe failures 1683; probe failures 23	web_probe_failures_5m, web_restarts_5m, latency_triage	Traefik latency dashboard API family dashboards Traefik all traffic observability Traefik endpoint observability	component-php-fpm
API/app pods	GREEN	No degraded signal in collected evidence	web_ready_pods, web_running_pods, web_unavailable_replicas	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-api-app-pods
Worker pods	AMBER	container restarts 2; container restarts 3; container restarts 34; container restarts 36	top_restarts_24h, top_memory_containers_24h	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-worker-pods
Runtime memory and restarts	AMBER	container restarts 2; container restarts 3; container restarts 34; container restarts 36	top_memory_containers_24h, top_restarts_24h	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-runtime-memory-and-restarts
Slow request families	GREEN	No degraded signal in collected evidence	latency_triage, slow_traces	Traefik latency dashboard API family dashboards Traefik all traffic observability Traefik endpoint observability	component-slow-request-families
Data Layer (5 components)
MySQL catalog	RED	mysql-catalog free memory below 512MiB; mysql-catalog swap observed	rds:mysql-catalog, slowquery:mysql-catalog	DB/readiness validation dashboard	component-mysql-catalog
MySQL catalog2	RED	mysql-catalog2 free memory below 512MiB; mysql-catalog2 swap observed	rds:mysql-catalog2, slowquery:mysql-catalog2	DB/readiness validation dashboard	component-mysql-catalog2
MySQL master	GREEN	No degraded signal in collected evidence	rds:mysql-master, slowquery:mysql-master	DB/readiness validation dashboard	component-mysql-master
Postgres billing	AMBER	postgres-billing swap observed	rds:postgres-billing, slowquery:postgres-billing	DB/readiness validation dashboard	component-postgres-billing
Slow queries	GREEN	No degraded signal in collected evidence	slowquery	DB/readiness validation dashboard	component-slow-queries
Cache & Messaging (3 components)
Redis / Sentinel	GREEN	No degraded signal in collected evidence	redis_issue_check, redis_evicted_keys_5m, redis_rejected_connections_5m	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-redis-sentinel
RabbitMQ / queues	GREEN	No degraded signal in collected evidence	rabbitmq, queue_depth, consumer_lag	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-rabbitmq-queues
Consumer lag or delayed processing	GREEN	No degraded signal in collected evidence	consumer_lag, cron_active	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-consumer-lag-or-delayed-processing
Batch & Scheduled Work (3 components)
Watched CronJobs	GREEN	No degraded signal in collected evidence	cron_active	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-watched-cronjobs
Kubernetes Jobs	GREEN	No degraded signal in collected evidence	KubeJobFailed, job_failed	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-kubernetes-jobs
Long-running scheduled tasks	GREEN	No degraded signal in collected evidence	cron_active, slow_traces	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-long-running-scheduled-tasks
Kubernetes & Capacity (5 components)
Readiness	GREEN	No degraded signal in collected evidence	web_ready_pods, web_unavailable_replicas	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-readiness
Probe failures	AMBER	probe failures 1683; probe failures 23	web_probe_failures_5m	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-probe-failures
HPA saturation	AMBER	HPA max hit: docgen2-api; HPA max hit: docgen2-api, notifications-event-manager; HPA max hit: docgen2-api, notifications-event-manager, web; HPA max hit: docgen2-api, subscriptions-api, web	hpa_current, hpa_max	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-hpa-saturation
Pod/node churn	AMBER	container restarts 2; container restarts 3; container restarts 34; container restarts 36	top_restarts_24h	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-pod-node-churn
Top memory containers	AMBER	container restarts 2; container restarts 3; container restarts 34; container restarts 36	top_memory_containers_24h	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-top-memory-containers
Observability & Alerting (5 components)
Grafana / Prometheus	GREEN	No degraded signal in collected evidence	prometheus	DB/readiness validation dashboard Grafana explore Traefik latency dashboard API family dashboards	component-grafana-prometheus
Loki	GREEN	No degraded signal in collected evidence	loki, latency_triage	Traefik latency dashboard API family dashboards Traefik all traffic observability Traefik endpoint observability	component-loki
Tempo	GREEN	No degraded signal in collected evidence	slow_traces	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-tempo
Alert rule health/noise	GREEN	No degraded signal in collected evidence	slack_alerts, alert_table	Slack alert: TraefikServiceHighErrorRate Slack alert: TraefikServiceHighLatency Slack alert: KubeJobFailed Slack alert: NodeHighNumberConntrackEntriesUsed	component-alert-rule-health-noise
AWS alarm emails	GREEN	No degraded signal in collected evidence	aws_email_alerts, email_table	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-aws-alarm-emails
External Dependencies (5 components)
Email provider	GREEN	No degraded signal in collected evidence	email_provider	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-email-provider
SMS provider	GREEN	No degraded signal in collected evidence	sms_provider	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-sms-provider
Payment provider	GREEN	No degraded signal in collected evidence	payment_provider	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-payment-provider
Identity/login provider	GREEN	No degraded signal in collected evidence	identity_provider	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-identity-login-provider
Webhooks / third-party APIs	GREEN	No degraded signal in collected evidence	third_party_api	Production reliability dashboard Production reliability JSON Team weekly report Team weekly JSON	component-webhooks-third-party-apis
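
The headline counts (2 RED, 8 AMBER) can be reproduced from the non-GREEN rows of this map. The sketch below assumes a worst-of rollup, which the report does not confirm; `SEVERITY`, `NON_GREEN`, and `summarize` are hypothetical names, and the IDs are copied from the Disposition ID column.

```python
from collections import Counter

# Severity ordering is an assumption: RED outranks AMBER outranks GREEN.
SEVERITY = {"GREEN": 0, "AMBER": 1, "RED": 2}

# Non-GREEN rows copied from the Production Dependency Map.
NON_GREEN = {
    "component-php-fpm": "AMBER",
    "component-worker-pods": "AMBER",
    "component-runtime-memory-and-restarts": "AMBER",
    "component-mysql-catalog": "RED",
    "component-mysql-catalog2": "RED",
    "component-postgres-billing": "AMBER",
    "component-probe-failures": "AMBER",
    "component-hpa-saturation": "AMBER",
    "component-pod-node-churn": "AMBER",
    "component-top-memory-containers": "AMBER",
}


def summarize(statuses):
    """Count components per status and return the worst status seen."""
    counts = Counter(statuses.values())
    worst = max(statuses.values(), key=SEVERITY.__getitem__, default="GREEN")
    return counts, worst


if __name__ == "__main__":
    counts, worst = summarize(NON_GREEN)
    print(counts["RED"], counts["AMBER"], worst)
```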

Previous period

Metric	Current	Previous	Delta
active_aws_alarms	0	0	0
application_alerts	213	85	+128
customer_incidents_confirmed	0	0	0
customer_incidents_observed	14	15	-1
customer_visible_failures	14	15	-1
impacted_services	14	10	+4
infrastructure_alerts	0	0	0
pingdom_downtime_minutes	17	15	+2
pingdom_events	14	15	-1
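
Each Delta value in the table above is the current value minus the previous one, shown with an explicit sign when positive. A minimal sketch of that arithmetic, with `format_delta` as a hypothetical helper name:

```python
def format_delta(current: int, previous: int) -> str:
    """Format current - previous the way the Delta column shows it."""
    delta = current - previous
    # Positive deltas carry a leading "+"; zero and negatives print as-is.
    return f"+{delta}" if delta > 0 else str(delta)


if __name__ == "__main__":
    # (current, previous) pairs copied from the table.
    metrics = {
        "application_alerts": (213, 85),
        "impacted_services": (14, 10),
        "pingdom_downtime_minutes": (17, 15),
        "pingdom_events": (14, 15),
        "infrastructure_alerts": (0, 0),
    }
    for name, (current, previous) in metrics.items():
        print(name, format_delta(current, previous))
```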

ADS Action Queue

Missing dispositions: component-php-fpm, component-worker-pods, component-runtime-memory-and-restarts, component-mysql-catalog, component-mysql-catalog2, component-postgres-billing, component-probe-failures, component-hpa-saturation, component-pod-node-churn, component-top-memory-containers

Status	Action	Domain	ID
AMBER	PHP-FPM needs disposition	Application Runtime	component-php-fpm
AMBER	Worker pods needs disposition	Application Runtime	component-worker-pods
AMBER	Runtime memory and restarts needs disposition	Application Runtime	component-runtime-memory-and-restarts
RED	MySQL catalog needs disposition	Data Layer	component-mysql-catalog
RED	MySQL catalog2 needs disposition	Data Layer	component-mysql-catalog2
AMBER	Postgres billing needs disposition	Data Layer	component-postgres-billing
AMBER	Probe failures needs disposition	Kubernetes & Capacity	component-probe-failures
AMBER	HPA saturation needs disposition	Kubernetes & Capacity	component-hpa-saturation
AMBER	Pod/node churn needs disposition	Kubernetes & Capacity	component-pod-node-churn
AMBER	Top memory containers needs disposition	Kubernetes & Capacity	component-top-memory-containers
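
The queue above lists every non-GREEN component that has no recorded disposition. One way to derive such a list is sketched below, assuming the action register can be read as a set of disposition IDs; `missing_dispositions` and the sample data are illustrative, not the report's actual implementation.

```python
def missing_dispositions(component_status, dispositioned_ids):
    """Return IDs of non-GREEN components absent from the action register.

    component_status: mapping of disposition ID -> "GREEN" / "AMBER" / "RED".
    dispositioned_ids: IDs that already have an ADS, accepted-risk, or
    false-positive disposition (assumed shape of the register).
    """
    return [
        component_id
        for component_id, status in component_status.items()
        if status != "GREEN" and component_id not in dispositioned_ids
    ]


if __name__ == "__main__":
    # Three rows taken from the dependency map; the register is assumed empty.
    sample = {
        "component-dns-resolution": "GREEN",
        "component-mysql-catalog": "RED",
        "component-php-fpm": "AMBER",
    }
    print(missing_dispositions(sample, set()))
```

Recording a disposition for an ID removes it from the result, which is how the "Actions needing disposition" count would fall over time under this model.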

Source Coverage

Source	Status	Detail	Evidence
Daily health JSON	ok	5 daily artifact(s) found	Open evidence
Production reliability dashboard	ok	Weekly alerts/reliability artifact	Open evidence
Team weekly report	ok	Delivery/deploy evidence	Open evidence
Engineering council test report	ok	Test/smoke evidence	Open evidence
AWS posture evidence	warning	Cost/security/recommendation evidence	
Action register	warning	ADS/accepted-risk/false-positive dispositions	

Evidence References

Reliability

application alerts: 213
infrastructure alerts: 0
active aws alarms: 0
degraded days: 5
red components: 2
amber components: 8

Customer Impact

customer visible failures: 14
customer incidents observed: 14
customer incidents confirmed: 0
pingdom events: 14
pingdom downtime minutes: 17

Delivery Health

production bugs closed: 41
delivery items: 31
deployments: 9
test runs: 9
test pass rate: 22.22%
smoke attempts: 16
smoke failed: 2

Cost

total: 0
currency: USD
forecast: 0

Security

security hub score: 0
critical findings: 0
high findings: 0
guardduty findings: 0
inspector critical: 0
iam external access: 0

AWS Recommendations

trusted advisor red: 0
trusted advisor yellow: 0
compute optimizer savings: 0
cost optimization savings: 0
well architected high risk issues: 0
well architected medium risk issues: 0

Backup

failed jobs: 0
protected resources: 0


