Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-05-18 15:02 for 2026-05-09 07:00 to 2026-05-16 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.


Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts). Observed in Pingdom: 15.

Impacted services: 10 (mapped from Slack and Pingdom evidence).

AWS alarms in ALARM: 0 (still alarming at window end).

Latest observed signal: 2026-05-16 05:19 (most recent cross-source activity).

Executive summary

What needs attention

Bottom line: Pingdom observed recent customer-facing glitches that are not yet confirmed by email, and critical application-level alerts are present.

Pingdom customer impact

External signal

No critical. Active: 1. Total seen: 1.

1 active item in this window.

Slack impacted services

Application signal

1 critical. Active: 8. Total seen: 8.

1 critical and 7 non-critical active items.

AWS alarms

Infrastructure signal

No critical. Active: 0. Total seen: 0.

No AWS alarm emails were captured in this window.

No active issue listed in this category.

What to do next

Next: Use Pingdom check https://www.adservio.ro/api/v2/status as churn/degradation evidence, and escalate to customer-facing incident priority only after inbox confirmation or strong cross-source correlation.
Executive recommendation
Then: Treat the critical service uni-api as the primary application investigation set. Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on uni-api, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.
Executive recommendation
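
The escalation rule in the first recommendation can be sketched as a small predicate. This is an illustration only: the function name, the source labels, and the reading of "strong cross-source correlation" as at least two distinct non-Pingdom sources are assumptions, not something the report defines.

```python
def should_escalate(inbox_confirmed, correlated_sources, min_correlated=2):
    """Decide whether a Pingdom finding warrants customer-facing incident priority.

    inbox_confirmed: True if an inbox alert confirmed the Pingdom event.
    correlated_sources: names of other sources (e.g. "slack", "aws") that
        line up with the same check. Hypothetical labels.
    min_correlated: assumed threshold for "strong" correlation.
    """
    distinct = set(correlated_sources)
    return bool(inbox_confirmed) or len(distinct) >= min_correlated


# The status check in this report is Pingdom-only and not inbox-confirmed.
print(should_escalate(inbox_confirmed=False, correlated_sources=[]))
```

Under these assumptions the https://www.adservio.ro/api/v2/status check stays at evidence level until either condition changes.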

Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Application + Infrastructure Alerts by Day

Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check	Status	Events	Downtime	Last Seen	Likely Services	Correlated Evidence
https://www.adservio.ro/api/v2/status	Recovered recently	15	15m	2026-05-15 11:49	adservio-ro-api-v2-status	Pingdom-only evidence so far
Adservio Ro	No recent customer-visible issue	0	0m	2026-05-16 00:00	adservio-ro	Pingdom-only evidence so far

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
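
A minimal sketch of the attribution rule described above, assuming the observability set contains only the three components named in that sentence. The function name and the input shape (a list of target names already extracted from the alert text) are hypothetical.

```python
# Assumption: only the components named in the report are treated as observability.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(named_targets):
    """Return the targets an alert should be attributed to.

    Observability components are dropped only when a more specific
    workload or resource is also named in the alert.
    """
    specific = [t for t in named_targets
                if t.lower() not in OBSERVABILITY_COMPONENTS]
    return specific if specific else list(named_targets)


print(attribute_targets(["grafana", "uni-api"]))  # the specific workload wins
print(attribute_targets(["grafana"]))             # nothing more specific, keep it
```

This would explain why node-level alerts such as NodeHighNumberConntrackEntriesUsed still land on grafana in the table below: no more specific target is named in them.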

Impacted Service / Resource	Highest Severity	Count	Last Seen	Status	Top Alert Types	Discussion Signal	Latest Thread Note
uni-api	Critical	3	2026-05-15 11:17	Seen today	TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (2)	Release / migration issue, General investigation	Rojan Shrestha: no errors now pod is healthy | we saw this error | Service method CareerServiceImpl.saveCareer failed after Nms: | Invalid employment t…
grafana	Warning	38	2026-05-15 15:02	Seen today	NodeHighNumberConntrackEntriesUsed (23), NodeSystemSaturation (6), KubeCPUOvercommit (5), NodeCPUHighUsage (3), AlertmanagerFailedToSendAlerts (1)	Resource limits, General investigation	Rojan Shrestha: This has been resolved
audit-worker	Warning	19	2026-05-16 05:19	Seen today	KubeDeploymentReplicasMismatch (14), KubePodCrashLooping (5)	None	No thread note
ai-api	Warning	11	2026-05-15 19:19	Seen today	TraefikServiceHighLatency (11)	Release / migration issue	Thread summary · Rojan Shrestha: Uneven load distribution on the nodes. some are running at 100 percent some at around 10 to 15 | Web deployment is bin-packing onto 2 nodes because it has neither CPU limits | container specs show limits for memory only (no cpu), and…
web-80	Warning	6	2026-05-15 10:19	Seen today	TraefikServiceHighLatency (6)	Scaling config	Thread summary · Rojan Shrestha, danny, Andrei Petrescu: 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods… | Suggesting two quick config changes on web to stop the recurring p95 latency alerts | 1. Loosen readiness probe: failureThreshold: 4, per… | understood, good point
social-api	Warning	11	2026-05-14 11:20	Recent (72h)	TraefikServiceHighLatency (11)	Scaling config	Thread summary · Rojan Shrestha, danny, Andrei Petrescu: 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods… | Suggesting two quick config changes on web to stop the recurring p95 latency alerts | 1. Loosen readiness probe: failureThreshold: 4, per… | understood, good point
subscriptions-api	Warning	1	2026-05-13 09:14	Recent (72h)	TraefikServiceHighLatency (1)	None	No thread note
docgen2-api	Warning	2	2026-05-12 13:48	Seen this week	KubeHpaMaxedOut (2)	None	No thread note

Evidence

Slack Alert Families

Alert	Severity	Count	Last Seen	Status	Threads	Top Impacted Services	Discussion Signal	Latest Thread Note
TraefikServiceHighErrorRate	Critical	1	2026-05-13 11:55	Recent (72h)	1	uni-api (1)	General investigation	Rojan Shrestha: no errors now pod is healthy | we saw this error | Service method CareerServiceImpl.saveCareer failed after Nms: | Invalid employment t…
TraefikServiceHighLatency	Warning	25	2026-05-15 19:19	Seen today	2	social-api (11), ai-api (11), web-80 (6), uni-api (2), subscriptions-api (1)	Release / migration issue, Scaling config	Thread summary · Rojan Shrestha, danny, Andrei Petrescu: 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods… | Suggesting two quick config changes on web to stop the recurring p95 latency alerts | 1. Loosen readiness probe: failureThreshold: 4, per… | understood, good point
NodeHighNumberConntrackEntriesUsed	Warning	23	2026-05-15 12:29	Seen today	1	grafana (23)	General investigation	Rojan Shrestha: This has been resolved
KubeDeploymentReplicasMismatch	Warning	14	2026-05-16 05:19	Seen today	0	audit-worker (14)	None	No thread note
NodeSystemSaturation	Warning	6	2026-05-15 12:33	Seen today	0	grafana (6)	None	No thread note
KubePodCrashLooping	Warning	5	2026-05-16 04:43	Seen today	0	audit-worker (5)	None	No thread note
KubeCPUOvercommit	Warning	5	2026-05-15 15:02	Seen today	0	grafana (5)	None	No thread note
NodeCPUHighUsage	Warning	3	2026-05-13 09:27	Recent (72h)	1	grafana (3)	Resource limits	Thread summary · Rojan Shrestha: Node running at 98 percent CPU checking | Web pods are the most resource dominant on the cluster | this is resolved for now, rollout restarted the web pods to replace them to the other nodes that was sitting idle to balance the load
KubeHpaMaxedOut	Warning	2	2026-05-12 13:48	Seen this week	0	docgen2-api (2)	None	No thread note
AlertmanagerFailedToSendAlerts	Warning	1	2026-05-12 10:35	Seen this week	0	grafana (1)	None	No thread note

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
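
The status heuristic can be approximated as an age bucket on the last-seen timestamp. The sketch below assumes ages are measured against the window end (2026-05-16 07:00) with cut-offs at 24 and 72 hours. Those thresholds are inferred from the rows in the tables above, not stated by the report, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def status_label(last_seen, window_end):
    """Bucket an alert family by how long before window_end it last fired."""
    age = window_end - last_seen
    if age <= timedelta(hours=24):   # assumed cut-off for "Seen today"
        return "Seen today"
    if age <= timedelta(hours=72):   # assumed cut-off, matches the "(72h)" label
        return "Recent (72h)"
    return "Seen this week"


window_end = datetime(2026, 5, 16, 7, 0)
print(status_label(datetime(2026, 5, 15, 19, 19), window_end))  # TraefikServiceHighLatency
print(status_label(datetime(2026, 5, 13, 11, 55), window_end))  # TraefikServiceHighErrorRate
print(status_label(datetime(2026, 5, 12, 13, 48), window_end))  # KubeHpaMaxedOut
```

Because the label depends only on recency, it says nothing about whether a fix has landed, which is why a resolved thread can sit next to a "Seen today" status.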

AWS Email Alarm Families

AWS Alarm	Emails	ALARM	OK	State Flips	First Seen	Last Seen	Latest State	Status

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
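
As a rough illustration of the flapping status, the sketch below counts state flips across the chronological ALARM/OK emails for one alarm. The flip threshold of 3, the function name, and every label other than "Flapping, latest OK" are assumptions, since the report does not define what counts as "repeatedly".

```python
def classify_alarm(states, min_flips=3):
    """Classify one alarm from its chronological list of "ALARM"/"OK" states."""
    if not states:
        return "No data"  # hypothetical label
    # A flip is any change of state between two consecutive emails.
    flips = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    if states[-1] == "ALARM":
        return "In ALARM"  # hypothetical label
    if flips >= min_flips:
        return "Flapping, latest OK"
    return "OK"  # hypothetical label


print(classify_alarm(["ALARM", "OK", "ALARM", "OK"]))
```

No AWS alarm emails were captured in this window, so the table above has no rows to classify.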

Global Discussion-Derived Signal

Thread Date	Alert	Severity	Services	Signal	Key Notes
2026-05-14 12:17	NodeHighNumberConntrackEntriesUsed	Warning	grafana	General investigation	Rojan Shrestha: This has been resolved
2026-05-14 11:20	TraefikServiceHighLatency	Warning	social-api, web-80	Scaling config	Thread summary · Rojan Shrestha, danny, Andrei Petrescu: 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods… | Suggesting two quick config changes on web to stop the recurring p95 latency alerts | 1. Loosen readiness probe: failureThreshold: 4, per… | understood, good point
2026-05-13 11:55	TraefikServiceHighErrorRate	Critical	uni-api	General investigation	Rojan Shrestha: no errors now pod is healthy | we saw this error | Service method CareerServiceImpl.saveCareer failed after Nms: | Invalid employment t…
2026-05-13 09:19	TraefikServiceHighLatency	Warning	ai-api, uni-api	Release / migration issue	Thread summary · Rojan Shrestha: Uneven load distribution on the nodes. some are running at 100 percent some at around 10 to 15 | Web deployment is bin-packing onto 2 nodes because it has neither CPU limits | container specs show limits for memory only (no cpu), and…
2026-05-13 08:52	NodeCPUHighUsage	Warning	grafana	Resource limits	Thread summary · Rojan Shrestha: Node running at 98 percent CPU checking | Web pods are the most resource dominant on the cluster | this is resolved for now, rollout restarted the web pods to replace them to the other nodes that was sitting idle to balance the load