Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-04-25 16:19 for 2026-04-18 07:00 to 2026-04-25 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 0)
Impacted services: 14 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 0 (still alarming at window end)
Latest observed signal: 2026-04-24 17:35 (most recent cross-source activity)

Executive summary

What needs attention

Bottom line: critical application-level alerts are present on three services (admission-api, core-grafana-80, uni-api).

Pingdom customer impact

External signal
No critical | Active: 0 | Total seen: 0

No Pingdom checks were available in this window.

No active issue listed in this category.

Slack impacted services

Application signal
3 critical | Active: 13 | Total seen: 14

3 critical, 10 non-critical active item(s).

AWS alarms

Infrastructure signal
No critical | Active: 3 | Total seen: 3

3 active item(s) in this window.

What to do next

  1. Next: Treat critical services admission-api, core-grafana-80, and uni-api as the primary application investigation set

    Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on admission-api, core-grafana-80, and uni-api, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.

    Executive recommendation
  2. Then: Reduce the operational drag on grafana by separating repeated symptom alerts from the underlying workload failure

    Either eliminate the recurrent fault or retune the alert once the failure mode is understood.

    Executive recommendation
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

[Chart: Application + Infrastructure Alerts by Day]

[Chart: Pingdom latency + downtime by source]

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
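
The attribution rule above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: the function name `pick_impacted_targets`, the constant `OBSERVABILITY_COMPONENTS`, and the exact-name matching are all assumptions made for the example.

```python
# Hypothetical sketch of the attribution rule; names and matching are assumptions.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def pick_impacted_targets(mentioned):
    """Return the targets an alert should be attributed to.

    `mentioned` is the list of workload/resource names found in the alert text.
    """
    # De-duplicate case-insensitively while keeping first-seen order.
    seen = list(dict.fromkeys(name.lower() for name in mentioned))
    # Keep only names that are not observability components.
    specific = [name for name in seen if name not in OBSERVABILITY_COMPONENTS]
    # Observability components are dropped only when a more specific target exists.
    return specific if specific else seen
```

Under these assumptions, an alert that mentions both `grafana` and `admission-api` is attributed to `admission-api` only, while an alert that mentions only `grafana` stays attributed to `grafana`.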

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note
admission-api | Critical | 3 | 2026-04-21 09:33 | Seen this week | TraefikServiceHighErrorRate (3) | General investigation | (see next line)
    Thread note: @U09JYAWCGLB we know about this, discussing in daily with the devs | Hi Raul Popovici sure thanks
core-grafana-80 | Critical | 3 | 2026-04-21 00:41 | Seen this week | TraefikServiceHighErrorRate (3) | None | No thread note
uni-api | Critical | 1 | 2026-04-21 12:41 | Seen this week | TraefikServiceHighErrorRate (1) | None | No thread note
grafana | Warning | 26 | 2026-04-24 12:29 | Seen today | KubeCPUOvercommit (24), KubeContainerWaiting (2) | None | No thread note
ai-api | Warning | 4 | 2026-04-24 17:35 | Seen today | TraefikServiceHighLatency (4) | General investigation | (see next line)
    Thread note: some POSTs to ai summaries seem to take 5s or more. But we don't have traces in ai-api pod to see if it is normal (summaries could indeed t…
social-api | Warning | 3 | 2026-04-24 15:07 | Seen today | TraefikServiceHighLatency (3) | Resource limits | (see next line)
    Thread note: *Root cause summary* | This looks like a slow successful `social-api` news detail path, not a crash, 5xx, or resource saturation issue. The…
update-recurenta (Grouped 4 variants; Variant mentions 28; Active variants 4) | Warning | 28 | 2026-04-22 13:48 | Recent (72h) | KubeJobFailed (27), KubePodCrashLooping (1) | None | No thread note
docgen2-api | Warning | 4 | 2026-04-22 09:47 | Recent (72h) | KubeHpaMaxedOut (4) | None | No thread note
colecteaza-sms-note-abs | Warning | 14 | 2026-04-20 11:53 | Seen this week | KubeJobFailed (14) | None | No thread note
download-album | Warning | 14 | 2026-04-20 11:53 | Seen this week | KubeJobFailed (14) | None | No thread note
rezumat | Warning | 14 | 2026-04-20 11:53 | Seen this week | KubeJobFailed (14) | None | No thread note
core-getresponse-events-worker | Warning | 2 | 2026-04-20 11:48 | Seen this week | KubeDeploymentReplicasMismatch (2) | None | No thread note
admission-migration | Warning | 1 | 2026-04-21 13:18 | Seen this week | KubeJobFailed (1) | None | No thread note
forms-api (Grouped 2 variants; Variant mentions 2; Active variants 2) | Warning | 4 | 2026-04-20 17:27 | Likely noise / resolved | KubePodCrashLooping (2), KubeDeploymentRolloutStuck (2) | General investigation, Alert tuning / noise | (see next line)
    Thread note: Investigation summary for `forms-api` | | *What happened* | - The `forms-api` rollout stalled and triggered `KubeDeploymentRolloutStuck`. |… | <!subteam^S0A2M3GD3CN>, @U09JYAWCGLB @U633D9JBW I made this skill that can run automatically as soon as an alert appears in Slack, so it ca… | we can improve it further based o…

Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note (next line, when present)
TraefikServiceHighErrorRate | Critical | 7 | 2026-04-21 12:41 | Seen this week | 1 | core-grafana-80 (3), admission-api (3), uni-api (1) | General investigation
    Thread note: @U09JYAWCGLB we know about this, discussing in daily with the devs | Hi Raul Popovici sure thanks
KubeCPUOvercommit | Warning | 24 | 2026-04-24 12:29 | Seen today | 0 | grafana (24) | None
TraefikServiceHighLatency | Warning | 7 | 2026-04-24 17:35 | Seen today | 2 | ai-api (4), social-api (3) | Resource limits, General investigation
    Thread note: some POSTs to ai summaries seem to take 5s or more. But we don't have traces in ai-api pod to see if it is normal (summaries could indeed t…
KubeJobFailed | Warning | 27 | 2026-04-22 13:48 | Recent (72h) | 0 | update-recurenta (27), colecteaza-sms-note-abs (14), download-album (14), rezumat (14), admission-migration (1) | None
KubeHpaMaxedOut | Warning | 4 | 2026-04-22 09:47 | Recent (72h) | 0 | docgen2-api (4) | None
KubePodCrashLooping | Warning | 3 | 2026-04-20 17:24 | Seen this week | 1 | forms-api (2), update-recurenta (1) | General investigation
    Thread note: Warning Unhealthy 80s (x19 over 3m22s) kubelet spec.containers{forms-api-container}: Readiness probe failed: Get "": dial tcp 10.66.111.240… | one of the pod is healthy to keep the service alive . | Thats right
KubeContainerWaiting | Warning | 2 | 2026-04-20 12:38 | Seen this week | 0 | grafana (2) | None
KubeDeploymentReplicasMismatch | Warning | 2 | 2026-04-20 11:48 | Seen this week | 0 | core-getresponse-events-worker (2) | None
KubeDeploymentRolloutStuck | Warning | 2 | 2026-04-20 17:27 | Likely noise / resolved | 1 | forms-api (2) | Alert tuning / noise
    Thread note: Investigation summary for `forms-api` | | *What happened* | - The `forms-api` rollout stalled and triggered `KubeDeploymentRolloutStuck`. |… | <!subteam^S0A2M3GD3CN>, @U09JYAWCGLB @U633D9JBW I made this skill that can run automatically as soon as an alert appears in Slack, so it ca… | we can improve it further based o…

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
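
One way such a recency-based status could be derived is sketched below. This is only an illustration: the dashboard's real thresholds are not stated, so the 24-hour cutoff for "Seen today", the use of the window end as the reference time, and the `flagged_resolved` input are all assumptions. Only the status labels themselves come from the tables in this report.

```python
from datetime import datetime, timedelta


def status_label(last_seen, window_end, flagged_resolved=False):
    """Map an alert family's last-seen time to a heuristic status label.

    Assumptions (not confirmed by the report): age is measured against the
    window end, and "Seen today" means within the last 24 hours.
    """
    # Hypothetical flag standing in for whatever discussion signal marks noise.
    if flagged_resolved:
        return "Likely noise / resolved"
    age = window_end - last_seen
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    return "Seen this week"


# Example with the reporting window end used in this briefing.
window_end = datetime(2026, 4, 25, 7, 0)
label = status_label(datetime(2026, 4, 22, 13, 48), window_end)
```

With these assumed thresholds, the sketch reproduces the labels shown for the rows in the tables above, but that agreement does not confirm the thresholds the dashboard actually uses.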

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-root-account-usage | 4 | 2 | 2 | 3 | 2026-04-23 14:38 | 2026-04-23 15:32 | OK | Latest OK
adservio-rds-mysql-catalog2-storage-low | 2 | 1 | 1 | 1 | 2026-04-20 13:48 | 2026-04-20 13:52 | OK | Latest OK
adservio-rds-mysql-catalog-disk-queue-high | 2 | 1 | 1 | 1 | 2026-04-20 10:13 | 2026-04-20 10:14 | OK | Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
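
The state-flip count and the flapping label can be illustrated with a small sketch. The function names and the `flap_threshold` default are hypothetical: the report does not say how many flips count as "repeatedly", and the "Latest ALARM" and "No data" labels are invented for completeness.

```python
def count_state_flips(states):
    """Count transitions between consecutive alarm states, oldest first."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states, flap_threshold=4):
    """Label an alarm from its email state history.

    `flap_threshold` is an assumed cutoff, not a value taken from the dashboard.
    """
    if not states:
        return "No data"  # hypothetical label for an empty history
    flips = count_state_flips(states)
    latest = states[-1]
    if latest == "OK" and flips >= flap_threshold:
        return "Flapping, latest OK"
    return f"Latest {latest}"
```

For example, a history of ALARM, OK, ALARM, OK yields three flips. Under the assumed threshold of four this is still labeled "Latest OK", which matches how the four-email alarm is shown in the table above.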

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal | Key Notes (next line)
2026-04-24 17:35 | TraefikServiceHighLatency | Warning | ai-api | General investigation
    Key notes: some POSTs to ai summaries seem to take 5s or more. But we don't have traces in ai-api pod to see if it is normal (summaries could indeed t…
2026-04-21 10:19 | TraefikServiceHighLatency | Warning | social-api | Resource limits
    Key notes: *Root cause summary* | This looks like a slow successful `social-api` news detail path, not a crash, 5xx, or resource saturation issue. The…
2026-04-21 09:33 | TraefikServiceHighErrorRate | Critical | admission-api | General investigation
    Key notes: @U09JYAWCGLB we know about this, discussing in daily with the devs | Hi Raul Popovici sure thanks
2026-04-20 17:27 | KubeDeploymentRolloutStuck | Warning | forms-api | Alert tuning / noise
    Key notes: Investigation summary for `forms-api` | | *What happened* | - The `forms-api` rollout stalled and triggered `KubeDeploymentRolloutStuck`. |… | <!subteam^S0A2M3GD3CN>, @U09JYAWCGLB @U633D9JBW I made this skill that can run automatically as soon as an alert appears in Slack, so it ca… | we can improve it further based o…
2026-04-20 16:46 | KubePodCrashLooping | Warning | forms-api | General investigation
    Key notes: Warning Unhealthy 80s (x19 over 3m22s) kubelet spec.containers{forms-api-container}: Readiness probe failed: Get "": dial tcp 10.66.111.240… | one of the pod is healthy to keep the service alive . | Thats right