Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-04-25 16:19 for 2026-04-18 07:00 to 2026-04-25 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.


Email-confirmed customer incidents: 0. Pingdom down/slow events confirmed by inbox alerts. Observed in Pingdom: 0.

Impacted services: 14. Mapped from Slack and Pingdom evidence.

AWS alarms in ALARM: 0. Still alarming at window end.

Latest observed signal: 2026-04-24 17:35. Most recent cross-source activity.

Executive summary

What needs attention

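Bottom line: critical application-level alerts are present on three services (admission-api, core-grafana-80, uni-api), all from the TraefikServiceHighErrorRate family.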

Pingdom customer impact

External signal

No critical. Active: 0. Total seen: 0.

No Pingdom checks were available in this window.

No active issue listed in this category.

Slack impacted services

Application signal

3 critical. Active: 13. Total seen: 14.

3 critical, 10 non-critical active item(s).

AWS alarms

Infrastructure signal

No critical. Active: 3. Total seen: 3.

3 active item(s) in this window.

What to do next

Next: Treat the critical services admission-api, core-grafana-80, and uni-api as the primary application investigation set. Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on all three services, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten. (Executive recommendation)

Then: Reduce the operational drag on grafana by separating repeated symptom alerts from the underlying workload failure. Either eliminate the recurrent fault or retune the alert once the failure mode is understood. (Executive recommendation)

Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Application + Infrastructure Alerts by Day

Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check	Status	Events	Downtime	Last Seen	Likely Services	Correlated Evidence

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
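
The attribution rule above can be sketched as follows. This is a minimal illustration, not the dashboard's actual implementation: the function name `attribute_targets`, the exact-name matching, and the lowercase comparison are all assumptions made for the example.

```python
# Names treated as observability components (taken from the note above).
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(mentioned):
    """Return the services an alert should be attributed to.

    `mentioned` is the list of workload/resource names found in the alert text.
    Observability components are dropped only when at least one more specific
    target is also present; otherwise they are kept as the impacted target.
    Assumption: matching is by exact name, so "core-grafana-80" is NOT treated
    as an observability component.
    """
    specific = [
        name for name in mentioned
        if name.lower() not in OBSERVABILITY_COMPONENTS
    ]
    return specific if specific else list(mentioned)


if __name__ == "__main__":
    print(attribute_targets(["grafana", "forms-api"]))  # forms-api only
    print(attribute_targets(["grafana"]))               # grafana is kept
```

Under this reading, grafana still appears as its own row when no other target is named in the alert, which matches the grafana row in the table below.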

Impacted Service / Resource	Highest Severity	Count	Last Seen	Status	Top Alert Types	Discussion Signal	Latest Thread Note
admission-api	Critical	3	2026-04-21 09:33	Seen this week	TraefikServiceHighErrorRate (3)	General investigation	Thread summary · Raul Popovici, Rojan Shrestha @U09JYAWCGLB we know about this, discussing in daily with the devs | Hi Raul Popovici sure thanks
core-grafana-80	Critical	3	2026-04-21 00:41	Seen this week	TraefikServiceHighErrorRate (3)	None	No thread note
uni-api	Critical	1	2026-04-21 12:41	Seen this week	TraefikServiceHighErrorRate (1)	None	No thread note
grafana	Warning	26	2026-04-24 12:29	Seen today	KubeCPUOvercommit (24), KubeContainerWaiting (2)	None	No thread note
ai-api	Warning	4	2026-04-24 17:35	Seen today	TraefikServiceHighLatency (4)	General investigation	Valentin Pal some POSTs to ai summaries seem to take 5s or more. But we don't have traces in ai-api pod to see if it is normal (summaries could indeed t…
social-api	Warning	3	2026-04-24 15:07	Seen today	TraefikServiceHighLatency (3)	Resource limits	Valentin Pal Root cause summary | This looks like a slow successful `social-api` news detail path, not a crash, 5xx, or resource saturation issue. The…
update-recurenta (grouped 4 variants; variant mentions 28; active variants 4)	Warning	28	2026-04-22 13:48	Recent (72h)	KubeJobFailed (27), KubePodCrashLooping (1)	None	No thread note
docgen2-api	Warning	4	2026-04-22 09:47	Recent (72h)	KubeHpaMaxedOut (4)	None	No thread note
colecteaza-sms-note-abs	Warning	14	2026-04-20 11:53	Seen this week	KubeJobFailed (14)	None	No thread note
download-album	Warning	14	2026-04-20 11:53	Seen this week	KubeJobFailed (14)	None	No thread note
rezumat	Warning	14	2026-04-20 11:53	Seen this week	KubeJobFailed (14)	None	No thread note
core-getresponse-events-worker	Warning	2	2026-04-20 11:48	Seen this week	KubeDeploymentReplicasMismatch (2)	None	No thread note
admission-migration	Warning	1	2026-04-21 13:18	Seen this week	KubeJobFailed (1)	None	No thread note
forms-api (grouped 2 variants; variant mentions 2; active variants 2)	Warning	4	2026-04-20 17:27	Likely noise / resolved	KubePodCrashLooping (2), KubeDeploymentRolloutStuck (2)	General investigation, Alert tuning / noise	Thread summary · codex, Valentin Pal Investigation summary for `forms-api` | | What happened | - The `forms-api` rollout stalled and triggered `KubeDeploymentRolloutStuck`. |… | <!subteam^S0A2M3GD3CN>, @U09JYAWCGLB @U633D9JBW I made this skill that can run automatically as soon as an alert appears in Slack, so it ca… | we can improve it further based o…

Evidence

Slack Alert Families

Alert	Severity	Count	Last Seen	Status	Threads	Top Impacted Services	Discussion Signal	Latest Thread Note
TraefikServiceHighErrorRate	Critical	7	2026-04-21 12:41	Seen this week	1	core-grafana-80 (3), admission-api (3), uni-api (1)	General investigation	Thread summary · Raul Popovici, Rojan Shrestha @U09JYAWCGLB we know about this, discussing in daily with the devs | Hi Raul Popovici sure thanks
KubeCPUOvercommit	Warning	24	2026-04-24 12:29	Seen today	0	grafana (24)	None	No thread note
TraefikServiceHighLatency	Warning	7	2026-04-24 17:35	Seen today	2	ai-api (4), social-api (3)	Resource limits, General investigation	Valentin Pal some POSTs to ai summaries seem to take 5s or more. But we don't have traces in ai-api pod to see if it is normal (summaries could indeed t…
KubeJobFailed	Warning	27	2026-04-22 13:48	Recent (72h)	0	update-recurenta (27), colecteaza-sms-note-abs (14), download-album (14), rezumat (14), admission-migration (1)	None	No thread note
KubeHpaMaxedOut	Warning	4	2026-04-22 09:47	Recent (72h)	0	docgen2-api (4)	None	No thread note
KubePodCrashLooping	Warning	3	2026-04-20 17:24	Seen this week	1	forms-api (2), update-recurenta (1)	General investigation	Thread summary · Rojan Shrestha, Andrei Petrescu Warning Unhealthy 80s (x19 over 3m22s) kubelet spec.containers{forms-api-container}: Readiness probe failed: Get "": dial tcp 10.66.111.240… | one of the pod is healthy to keep the service alive . | Thats right
KubeContainerWaiting	Warning	2	2026-04-20 12:38	Seen this week	0	grafana (2)	None	No thread note
KubeDeploymentReplicasMismatch	Warning	2	2026-04-20 11:48	Seen this week	0	core-getresponse-events-worker (2)	None	No thread note
KubeDeploymentRolloutStuck	Warning	2	2026-04-20 17:27	Likely noise / resolved	1	forms-api (2)	Alert tuning / noise	Thread summary · codex, Valentin Pal Investigation summary for `forms-api` | | What happened | - The `forms-api` rollout stalled and triggered `KubeDeploymentRolloutStuck`. |… | <!subteam^S0A2M3GD3CN>, @U09JYAWCGLB @U633D9JBW I made this skill that can run automatically as soon as an alert appears in Slack, so it ca… | we can improve it further based o…

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
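
As an illustration of how such a recency heuristic might work, here is a minimal sketch. The 24-hour and 72-hour cut-offs measured back from the window end (2026-04-25 07:00) are assumptions inferred from the labels in the tables; the function name `heuristic_status` is hypothetical, and the "Likely noise / resolved" label, which appears to depend on thread discussion, is not modeled.

```python
from datetime import datetime, timedelta


def heuristic_status(last_seen, window_end):
    """Map an alert family's last-seen time to a recency label.

    Assumption: labels are based on the age of the last alert relative to the
    end of the reporting window, with 24h and 72h thresholds.
    """
    age = window_end - last_seen
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    return "Seen this week"


if __name__ == "__main__":
    window_end = datetime(2026, 4, 25, 7, 0)
    for last_seen in (
        datetime(2026, 4, 24, 12, 29),  # KubeCPUOvercommit
        datetime(2026, 4, 22, 13, 48),  # KubeJobFailed
        datetime(2026, 4, 21, 12, 41),  # TraefikServiceHighErrorRate
    ):
        print(last_seen, heuristic_status(last_seen, window_end))
```

A label computed this way only describes how recently the alert fired; it carries no information about whether the underlying fault was fixed.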

AWS Email Alarm Families

AWS Alarm	Emails	ALARM	OK	State Flips	First Seen	Last Seen	Latest State	Status
adservio-root-account-usage	4	2	2	3	2026-04-23 14:38	2026-04-23 15:32	OK	Latest OK
adservio-rds-mysql-catalog2-storage-low	2	1	1	1	2026-04-20 13:48	2026-04-20 13:52	OK	Latest OK
adservio-rds-mysql-catalog-disk-queue-high	2	1	1	1	2026-04-20 10:13	2026-04-20 10:14	OK	Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
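
The State Flips column and the status labels can be illustrated with a short sketch. The names `count_state_flips` and `alarm_status`, the "In ALARM" and "No data" labels, and the `flap_threshold` default of 4 are assumptions for this example; the report does not state the threshold it uses, only that an alarm with 3 flips is still labeled "Latest OK". The order of states in the usage example (ALARM first) is also assumed.

```python
def count_state_flips(states):
    """Count transitions between consecutive alarm states.

    `states` is the chronological list of states taken from the alarm emails,
    e.g. ["ALARM", "OK", "ALARM", "OK"].
    """
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states, flap_threshold=4):
    """Derive a status label from an alarm's state history.

    Assumption: an alarm whose latest state is OK counts as flapping once it
    has toggled at least `flap_threshold` times (hypothetical value).
    """
    if not states:
        return "No data"
    if states[-1] == "ALARM":
        return "In ALARM"
    if count_state_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


if __name__ == "__main__":
    # adservio-root-account-usage: 4 emails, 2 ALARM, 2 OK (order assumed)
    history = ["ALARM", "OK", "ALARM", "OK"]
    print(count_state_flips(history), alarm_status(history))
```

Four alternating emails give three flips, which agrees with the adservio-root-account-usage row; the two-email alarms each give one flip.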

Global Discussion-Derived Signal

Thread Date	Alert	Severity	Services	Signal	Key Notes
2026-04-24 17:35	TraefikServiceHighLatency	Warning	ai-api	General investigation	Valentin Pal some POSTs to ai summaries seem to take 5s or more. But we don't have traces in ai-api pod to see if it is normal (summaries could indeed t…
2026-04-21 10:19	TraefikServiceHighLatency	Warning	social-api	Resource limits	Valentin Pal Root cause summary | This looks like a slow successful `social-api` news detail path, not a crash, 5xx, or resource saturation issue. The…
2026-04-21 09:33	TraefikServiceHighErrorRate	Critical	admission-api	General investigation	Thread summary · Raul Popovici, Rojan Shrestha @U09JYAWCGLB we know about this, discussing in daily with the devs | Hi Raul Popovici sure thanks
2026-04-20 17:27	KubeDeploymentRolloutStuck	Warning	forms-api	Alert tuning / noise	Thread summary · codex, Valentin Pal Investigation summary for `forms-api` | | What happened | - The `forms-api` rollout stalled and triggered `KubeDeploymentRolloutStuck`. |… | <!subteam^S0A2M3GD3CN>, @U09JYAWCGLB @U633D9JBW I made this skill that can run automatically as soon as an alert appears in Slack, so it ca… | we can improve it further based o…
2026-04-20 16:46	KubePodCrashLooping	Warning	forms-api	General investigation	Thread summary · Rojan Shrestha, Andrei Petrescu Warning Unhealthy 80s (x19 over 3m22s) kubelet spec.containers{forms-api-container}: Readiness probe failed: Get "": dial tcp 10.66.111.240… | one of the pod is healthy to keep the service alive . | Thats right