Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-05-05 01:00 for 2026-04-25 07:00 to 2026-05-05 12:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.


Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts). Observed in Pingdom: 9

Impacted services: 17 (mapped from Slack and Pingdom evidence)

AWS alarms in ALARM: 0 (still alarming at window end)

Latest observed signal: 2026-05-05 00:00 (most recent cross-source activity)

Executive summary

What needs attention

Bottom line: Pingdom observed recent customer-facing glitches (not confirmed by email), and critical application-level alerts are present.

Pingdom customer impact

External signal

No critical; Active: 1; Total seen: 1

1 active item(s) in this window.

Slack impacted services

Application signal

5 critical; Active: 14; Total seen: 15

5 critical, 9 non-critical active item(s).

AWS alarms

Infrastructure signal

No critical; Active: 1; Total seen: 2

1 active item(s) in this window.

What to do next

Next (Executive recommendation): Use Pingdom check https://www.adservio.ro/api/v2/status as churn/degradation evidence, and escalate to customer-facing incident priority only after inbox confirmation or strong cross-source correlation.
Then (Executive recommendation): Treat critical services accommodations-api, web-80, uni-api, core-grafana-80 as the primary application investigation set. Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on those services, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.
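
The escalation rule in the first recommendation (escalate only after inbox confirmation or strong cross-source correlation) can be sketched as a small predicate. This is an illustrative sketch, not the dashboard's actual logic: the `PingdomEvidence` type, the `should_escalate` function, and the threshold of two correlated sources for "strong" correlation are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class PingdomEvidence:
    """Hypothetical summary of one Pingdom check within the report window."""
    check: str
    events: int              # down/slow events observed by Pingdom
    inbox_confirmed: int     # events also confirmed by an inbox alert
    correlated_sources: int  # other sources (Slack, AWS) pointing at the same check


def should_escalate(ev: PingdomEvidence, min_correlated_sources: int = 2) -> bool:
    """Return True when the check warrants customer-facing incident priority.

    The threshold for "strong" correlation is an assumption; the report
    does not define it.
    """
    if ev.events == 0:
        return False
    if ev.inbox_confirmed > 0:
        return True
    return ev.correlated_sources >= min_correlated_sources


# Values taken from the report: 9 Pingdom events, 0 email-confirmed,
# one correlated AWS alarm family.
status_check = PingdomEvidence(
    check="https://www.adservio.ro/api/v2/status",
    events=9,
    inbox_confirmed=0,
    correlated_sources=1,
)
escalate = should_escalate(status_check)
```

Under these assumed thresholds, the status check stays at evidence level rather than incident level, which matches the recommendation's wording.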

Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Application + Infrastructure Alerts by Day

Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check	Status	Events	Downtime	Last Seen	Likely Services	Correlated Evidence
https://www.adservio.ro/api/v2/status	Recovered recently	9	9m	2026-05-05 00:00	adservio-ro-api-v2-status	adservio-rds-mysql-catalog-disk-queue-high
Adservio Ro	No recent customer-visible issue	0	0m	2026-05-05 00:00	adservio-ro	Pingdom-only evidence so far

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
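
The exclusion rule described above can be illustrated with a short sketch. The function name `attribute_targets` and the exact-name matching are assumptions made for illustration; the report does not say how names are matched, and a service such as core-grafana-80 is treated here as a specific target because it is not an exact match for an observability component.

```python
# Observability components named in the report's attribution note.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(mentioned: list[str]) -> list[str]:
    """Pick the impacted targets from the names mentioned in an alert.

    Observability components are dropped only when a more specific
    target is also present; otherwise they are kept as the target.
    """
    specific = [
        name for name in mentioned
        if name.lower() not in OBSERVABILITY_COMPONENTS
    ]
    return specific if specific else list(mentioned)


mixed = attribute_targets(["grafana", "uni-api"])
only_observability = attribute_targets(["grafana"])
```

With this rule, an alert that mentions both grafana and uni-api is attributed to uni-api, while an alert that mentions only grafana is still attributed to grafana.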

Impacted Service / Resource	Highest Severity	Count	Last Seen	Status	Top Alert Types	Discussion Signal	Latest Thread Note
accommodations-api	Critical	10	2026-05-04 15:06	Seen today	TraefikServiceHighErrorRate (10)	None	No thread note
uni-api	Critical	7	2026-05-04 12:07	Seen today	TraefikServiceHighErrorRate (7)	General investigation, Observability storage	Rojan Shrestha: uni-api is throwing 500s because the same academic-structure save is being submitted twice second insert hits a unique-key constraint (stud…
web-80	Critical	10	2026-05-01 10:45	Seen this week	TraefikServiceHighErrorRate (8), TraefikServiceHighLatency (2)	None	No thread note
grafana	Critical	27	2026-05-04 13:12	Recent but likely noise	KubeAPIErrorBudgetBurn (4), KubeCPUOvercommit (20), NodeSystemSaturation (1), KubeMemoryOvercommit (1), KubeClientErrors (1)	General investigation, Alert tuning / noise, Resource limits	Thread summary · Rojan Shrestha: Node 10.66.121.122 is at 79% CPU / 50% memory, no pressure conditions load is 2.27 because 5 of the web pods landed on the same node plus s… \| Will clear when traffic dies down or HPA rebalances \| Stabilized load on 10.66.121.122 dropped from 2.27 to 1.68 per core (15m avg), CPU from 79% to 55%. HPA scaled down + web…
core-grafana-80	Critical	1	2026-04-25 20:57	No recent signal	TraefikServiceHighErrorRate (1)	None	No thread note
metrics-server	Warning	43	2026-05-04 18:19	Seen today	KubeAggregatedAPIDown (43)	General investigation	Rojan Shrestha: False alarm metrics-server pod is healthy (1/1 Running, 0 restarts) and kubectl top nodes works. Alert was a brief flap during a node rotat…
minicrm-sync	Warning	23	2026-05-04 19:21	Seen today	KubeJobFailed (23)	None	No thread note
colecteaza-sms-note-abs	Warning	21	2026-05-04 19:21	Seen today	KubeJobFailed (21)	None	No thread note
accommodations-sync-users (grouped 2 variants; variant mentions: 19; active variants: 2)	Warning	19	2026-05-04 19:21	Seen today	KubeJobFailed (19)	None	No thread note
social-api	Warning	10	2026-05-04 12:39	Seen today	TraefikServiceHighLatency (10)	General investigation	Rojan Shrestha: Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s) —…
docgen2-api	Warning	5	2026-05-04 12:17	Seen today	KubeHpaMaxedOut (5)	Resource limits	Rojan Shrestha: HPA already scaled back down (now 4/6). Pods are barely using CPU, so the scale-up was driven by a KEDA cron trigger, not real load.
rooms-api	Warning	26	2026-05-03 07:14	Recent (72h)	KubePodCrashLooping (13), KubeDeploymentReplicasMismatch (13)	General investigation	@U09JYAWCGLB danny Valentin Pal something is breaking on the Tuiasi cluster which i am not able to connect \| \| i would need a kubeconfig for tuiasi and…
ai-api	Warning	7	2026-05-04 10:00	Recent (72h)	TraefikServiceHighLatency (7)	None	No thread note
core-scheduled-events-worker	Warning	5	2026-05-01 10:19	Seen this week	KubePodCrashLooping (3), KubeDeploymentReplicasMismatch (2)	General investigation	@U09JYAWCGLB danny Valentin Pal something is breaking on the Tuiasi cluster which i am not able to connect \| \| i would need a kubeconfig for tuiasi and…
subscriptions-api	Warning	2	2026-04-28 09:11	No recent signal	TraefikServiceHighLatency (2)	None	No thread note

Evidence

Slack Alert Families

Alert	Severity	Count	Last Seen	Status	Threads	Top Impacted Services	Discussion Signal	Latest Thread Note
TraefikServiceHighErrorRate	Critical	26	2026-05-04 15:06	Seen today	2	accommodations-api (10), web-80 (8), uni-api (7), core-grafana-80 (1)	General investigation, Observability storage	Rojan Shrestha: uni-api is throwing 500s because the same academic-structure save is being submitted twice second insert hits a unique-key constraint (stud…
KubeAPIErrorBudgetBurn	Critical	3	2026-04-27 12:48	Likely noise / resolved	2	grafana (4)	General investigation, Alert tuning / noise	Thread summary · Rojan Shrestha, Raul Popovici: Nothing is down apiserver healthy (/readyz ok), all pods Running, HPAs scaled back down (web 9, docgen2 4). Alert is the rolling 1h SLO win… \| you can silence it in alertmanger \| Okay i will try a bit later thanks for the info
KubeAggregatedAPIDown	Warning	43	2026-05-04 18:19	Seen today	1	metrics-server (43)	General investigation	Rojan Shrestha: False alarm metrics-server pod is healthy (1/1 Running, 0 restarts) and kubectl top nodes works. Alert was a brief flap during a node rotat…
KubeJobFailed	Warning	23	2026-05-04 19:21	Seen today	0	minicrm-sync (23), colecteaza-sms-note-abs (21), accommodations-sync-users (19)	None	No thread note
KubeCPUOvercommit	Warning	20	2026-05-04 13:12	Seen today	0	grafana (20)	None	No thread note
TraefikServiceHighLatency	Warning	18	2026-05-04 12:39	Seen today	1	social-api (10), ai-api (7), subscriptions-api (2), web-80 (2)	General investigation	Rojan Shrestha: Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s) —…
KubeHpaMaxedOut	Warning	5	2026-05-04 12:17	Seen today	1	docgen2-api (5)	Resource limits	Rojan Shrestha: HPA already scaled back down (now 4/6). Pods are barely using CPU, so the scale-up was driven by a KEDA cron trigger, not real load.
KubePodCrashLooping	Warning	14	2026-05-03 07:14	Recent (72h)	1	rooms-api (13), core-scheduled-events-worker (3)	General investigation	@U09JYAWCGLB danny Valentin Pal something is breaking on the Tuiasi cluster which i am not able to connect \| \| i would need a kubeconfig for tuiasi and…
KubeDeploymentReplicasMismatch	Warning	13	2026-05-03 07:09	Recent (72h)	0	rooms-api (13), core-scheduled-events-worker (2)	None	No thread note
NodeSystemSaturation	Warning	1	2026-04-30 12:30	Seen this week	1	grafana (1)	Resource limits	Thread summary · Rojan Shrestha: Node 10.66.121.122 is at 79% CPU / 50% memory, no pressure conditions load is 2.27 because 5 of the web pods landed on the same node plus s… \| Will clear when traffic dies down or HPA rebalances \| Stabilized load on 10.66.121.122 dropped from 2.27 to 1.68 per core (15m avg), CPU from 79% to 55%. HPA scaled down + web…
KubeMemoryOvercommit	Warning	1	2026-04-29 09:26	Seen this week	0	grafana (1)	None	No thread note
KubeAPIErrorBudgetBurn	Warning	1	2026-04-27 15:37	Likely noise / resolved	2	grafana (4)	General investigation, Alert tuning / noise	Thread summary · Rojan Shrestha, Raul Popovici: Nothing is down apiserver healthy (/readyz ok), all pods Running, HPAs scaled back down (web 9, docgen2 4). Alert is the rolling 1h SLO win… \| you can silence it in alertmanger \| Okay i will try a bit later thanks for the info
KubeClientErrors	Warning	1	2026-04-27 12:09	No recent signal	1	grafana (1)	General investigation	Rojan Shrestha: aiserver is healthy (/readyz ok). The 2% error rate is from admission webhook timeouts during the HPA scaling burst (web 11 replicas, docge…

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
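
As a sketch of how such a recency heuristic might work: the thresholds below (24 hours, 72 hours, 7 days, all measured back from the window end of 2026-05-05 12:00) are inferred from the Last Seen and Status columns in the tables above and are not confirmed by the report. The function name `recency_status` is hypothetical, and the noise-related labels ("Recent but likely noise", "Likely noise / resolved") depend on discussion signals and are not covered here.

```python
from datetime import datetime, timedelta


def recency_status(last_seen: datetime, window_end: datetime) -> str:
    """Bucket an alert by how long before the window end it was last seen.

    Thresholds are assumptions inferred from the report's rows.
    """
    age = window_end - last_seen
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    if age <= timedelta(days=7):
        return "Seen this week"
    return "No recent signal"


window_end = datetime(2026, 5, 5, 12, 0)

# Last Seen values copied from the Slack tables in this report.
accommodations_api = recency_status(datetime(2026, 5, 4, 15, 6), window_end)
ai_api = recency_status(datetime(2026, 5, 4, 10, 0), window_end)
web_80 = recency_status(datetime(2026, 5, 1, 10, 45), window_end)
subscriptions_api = recency_status(datetime(2026, 4, 28, 9, 11), window_end)
```

These four calls reproduce the statuses shown for accommodations-api, ai-api, web-80, and subscriptions-api. The labels describe recency only and say nothing about whether the alert is resolved.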

AWS Email Alarm Families

AWS Alarm	Emails	ALARM	OK	State Flips	First Seen	Last Seen	Latest State	Status
adservio-rds-mysql-catalog-disk-queue-high	14	7	7	13	2026-04-27 09:13	2026-04-30 10:21	OK	Flapping, latest OK
adservio-rds-postgres-billing-cpu-high	2	1	1	1	2026-05-01 10:12	2026-05-01 10:13	OK	Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
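
A minimal sketch of how the Status column could be derived from the sequence of alarm emails follows. The function names and the flap threshold of 3 state changes are assumptions made for illustration; the report does not state how many flips qualify as flapping. The alternating ALARM/OK sequence used for the disk-queue alarm is also an assumption, chosen to be consistent with its reported 14 emails, 7 ALARM, 7 OK, and 13 flips.

```python
def count_flips(states: list[str]) -> int:
    """Count transitions between consecutive alarm states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states: list[str], flap_threshold: int = 3) -> str:
    """Summarize an alarm family from its chronological email states.

    flap_threshold is a hypothetical cutoff, not taken from the report.
    """
    if not states:
        return "No data"
    latest = states[-1]
    if count_flips(states) >= flap_threshold:
        return f"Flapping, latest {latest}"
    return f"Latest {latest}"


# Assumed sequences consistent with the counts in the table above.
disk_queue_states = ["ALARM", "OK"] * 7   # 14 emails, 13 flips
billing_cpu_states = ["ALARM", "OK"]      # 2 emails, 1 flip

disk_queue_status = alarm_status(disk_queue_states)
billing_cpu_status = alarm_status(billing_cpu_states)
```

Under these assumptions, the disk-queue alarm is labelled "Flapping, latest OK" and the billing CPU alarm "Latest OK", matching the table.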

Global Discussion-Derived Signal

Thread Date	Alert	Severity	Services	Signal	Key Notes
2026-05-04 12:39	TraefikServiceHighLatency	Warning	social-api	General investigation	Rojan Shrestha: Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s) —…
2026-05-04 11:28	TraefikServiceHighErrorRate	Critical	uni-api	Observability storage	Rojan Shrestha: uni-api is throwing 500s because the same academic-structure save is being submitted twice second insert hits a unique-key constraint (stud…
2026-05-01 06:14	KubePodCrashLooping	Warning	core-scheduled-events-worker, rooms-api	General investigation	@U09JYAWCGLB danny Valentin Pal something is breaking on the Tuiasi cluster which i am not able to connect \| \| i would need a kubeconfig for tuiasi and…
2026-04-30 12:30	NodeSystemSaturation	Warning	grafana	Resource limits	Thread summary · Rojan Shrestha: Node 10.66.121.122 is at 79% CPU / 50% memory, no pressure conditions load is 2.27 because 5 of the web pods landed on the same node plus s… \| Will clear when traffic dies down or HPA rebalances \| Stabilized load on 10.66.121.122 dropped from 2.27 to 1.68 per core (15m avg), CPU from 79% to 55%. HPA scaled down + web…
2026-04-30 11:11	TraefikServiceHighErrorRate	Critical	uni-api	General investigation	Thread summary · Rojan Shrestha, Valentin Pal: No infra issue uni-api pod is healthy (1/1 Running, 0 restarts). The 6.67% errors are MySQL duplicate-key exceptions in GroupServiceImpl.sa… \| Andrei Alexandru - can you please check? \| \| JDBC exception executing SQL [INSERT INTO studiu(...) VALUES(...,1511,1519)]
2026-04-27 12:48	KubeAPIErrorBudgetBurn	Critical	grafana	Alert tuning / noise	Thread summary · Rojan Shrestha, Raul Popovici: Nothing is down apiserver healthy (/readyz ok), all pods Running, HPAs scaled back down (web 9, docgen2 4). Alert is the rolling 1h SLO win… \| you can silence it in alertmanger \| Okay i will try a bit later thanks for the info
2026-04-27 12:09	KubeClientErrors	Warning	grafana	General investigation	Rojan Shrestha: aiserver is healthy (/readyz ok). The 2% error rate is from admission webhook timeouts during the HPA scaling burst (web 11 replicas, docge…
2026-04-27 11:56	KubeHpaMaxedOut	Warning	docgen2-api	Resource limits	Rojan Shrestha: HPA already scaled back down (now 4/6). Pods are barely using CPU, so the scale-up was driven by a KEDA cron trigger, not real load.
2026-04-27 11:53	KubeAPIErrorBudgetBurn	Critical	grafana	General investigation	Rojan Shrestha: apiserver is healthy now. The alert was triggered by a brief spike in pod status updates ~15 min ago when several web pods went NotReady at…
2026-04-27 09:59	KubeAggregatedAPIDown	Warning	metrics-server	General investigation	Rojan Shrestha: False alarm metrics-server pod is healthy (1/1 Running, 0 restarts) and kubectl top nodes works. Alert was a brief flap during a node rotat…