Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-04-18 22:01 for the window 2026-04-11 07:00 to 2026-04-18 21:50, from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 0)
Impacted services: 12 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 0 (still alarming at window end)
Latest observed signal: 2026-04-18 19:53 (most recent cross-source activity)
Executive summary

What needs attention

Bottom line: the main risk is warning-level operational drag rather than a fresh critical alert.

Pingdom customer impact

External signal
No critical. Active: 0. Total seen: 0.

No Pingdom checks were available in this window.

No active issue listed in this category.

Slack impacted services

Application signal
No critical. Active: 12. Total seen: 12.

12 active item(s) in this window.

AWS alarms

Infrastructure signal
No critical. Active: 1. Total seen: 1.

1 active item(s) in this window.

What to do next

  1. Next: Reduce the operational drag on update-recurenta by separating repeated symptom alerts from the underlying workload failure.

    Either eliminate the recurrent fault or retune the alert once the failure mode is understood.

    Executive recommendation
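
One way to separate repeated symptom alerts from the underlying workload is to collapse per-run job names into a single workload key before counting. The sketch below is illustrative only: it assumes failed Job names end in a numeric suffix appended by the CronJob controller, and the sample job names are hypothetical, not taken from this report.

```python
import re
from collections import Counter

# Assumption: each failed Job name is "<workload>-<digits>", where the digits
# are a per-run suffix. Names without such a suffix are left unchanged.
_JOB_SUFFIX = re.compile(r"-\d+$")


def base_workload(job_name: str) -> str:
    """Strip a trailing numeric suffix so every run maps to one workload."""
    return _JOB_SUFFIX.sub("", job_name)


def group_job_failures(job_names: list[str]) -> Counter:
    """Count failed-job alerts per underlying workload."""
    return Counter(base_workload(name) for name in job_names)


# Hypothetical alert payloads for illustration.
alerts = [
    "update-recurenta-29083320",
    "update-recurenta-29083380",
    "rezumat-29083320",
]
summary = group_job_failures(alerts)
```

Grouping this way makes it easier to see whether a high alert count reflects many distinct failures or one workload failing on every scheduled run.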

Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Application + Infrastructure Alerts by Day

Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
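
The exclusion rule described above could be expressed roughly as follows. This is a minimal sketch, not the dashboard's actual implementation: the function name and the assumption that candidate targets have already been extracted from the alert text as a list of names are both hypothetical.

```python
# Observability components named in the rule above.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(mentioned: list[str]) -> list[str]:
    """Return the targets an alert should be attributed to.

    Observability components are dropped only when at least one more
    specific target is also present; otherwise they are kept, so an alert
    that names only Grafana is still attributed to Grafana.
    """
    specific = [t for t in mentioned if t.lower() not in OBSERVABILITY_COMPONENTS]
    return specific if specific else list(mentioned)
```

For example, an alert mentioning both grafana and ai-api would be attributed to ai-api alone, while an alert mentioning only grafana stays attributed to grafana.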

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note
update-recurenta (grouped 8 variants, 43 variant mentions, 8 active variants) | Warning | 43 | 2026-04-18 19:53 | Seen today | KubeJobFailed (43) | None | No thread note
colecteaza-sms-note-abs | Warning | 10 | 2026-04-18 19:53 | Seen today | KubeJobFailed (10) | None | No thread note
download-album | Warning | 9 | 2026-04-18 19:53 | Seen today | KubeJobFailed (9) | None | No thread note
rezumat | Warning | 9 | 2026-04-18 19:53 | Seen today | KubeJobFailed (9) | None | No thread note
grafana | Warning | 8 | 2026-04-17 13:24 | Recent (72h) | KubeCPUOvercommit (6), TargetDown (1), AlertmanagerFailedToSendAlerts (1) | None | No thread note
ai-api | Warning | 6 | 2026-04-17 18:22 | Recent (72h) | TraefikServiceHighLatency (6) | None | No thread note
docgen2-api | Warning | 5 | 2026-04-17 09:51 | Recent (72h) | KubeHpaMaxedOut (5) | None | No thread note
web (grouped 9 variants, 13 variant mentions, 9 active variants) | Warning | 4 | 2026-04-17 14:01 | Recent (72h) | KubePodNotReady (3), KubeDeploymentReplicasMismatch (1) | None | No thread note
library-api | Warning | 3 | 2026-04-16 07:25 | Recent (72h) | KubePodCrashLooping (2), KubeDeploymentReplicasMismatch (1) | None | No thread note
rooms-api | Warning | 3 | 2026-04-16 07:25 | Recent (72h) | KubePodCrashLooping (2), KubeDeploymentReplicasMismatch (1) | None | No thread note
accommodations-events (grouped 2 variants, 2 variant mentions, 2 active variants) | Warning | 2 | 2026-04-17 10:19 | Recent (72h) | KubePodCrashLooping (2) | None | No thread note
accommodations-api (grouped 2 variants, 2 variant mentions, 2 active variants) | Warning | 2 | 2026-04-16 07:25 | Recent (72h) | KubePodCrashLooping (1), KubeDeploymentReplicasMismatch (1) | None | No thread note
Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note
KubeJobFailed | Warning | 48 | 2026-04-18 19:53 | Seen today | 0 | update-recurenta (43), colecteaza-sms-note-abs (10), download-album (9), rezumat (9) | None |
TraefikServiceHighLatency | Warning | 6 | 2026-04-17 18:22 | Recent (72h) | 0 | ai-api (6) | None |
KubeCPUOvercommit | Warning | 6 | 2026-04-17 10:18 | Recent (72h) | 0 | grafana (6) | None |
KubeHpaMaxedOut | Warning | 5 | 2026-04-17 09:51 | Recent (72h) | 0 | docgen2-api (5) | None |
KubePodCrashLooping | Warning | 4 | 2026-04-17 10:19 | Recent (72h) | 0 | accommodations-events (2), library-api (2), rooms-api (2), accommodations-api (1) | None |
KubePodNotReady | Warning | 3 | 2026-04-17 14:01 | Recent (72h) | 0 | web (3) | None |
KubeDeploymentReplicasMismatch | Warning | 2 | 2026-04-17 13:28 | Recent (72h) | 0 | accommodations-api (1), library-api (1), rooms-api (1), web (1) | None |
TargetDown | Warning | 1 | 2026-04-17 13:24 | Recent (72h) | 0 | grafana (1) | None |
AlertmanagerFailedToSendAlerts | Warning | 1 | 2026-04-16 07:29 | Recent (72h) | 0 | grafana (1) | None |

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
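
A heuristic of this kind might look like the sketch below. It assumes that "Seen today" means the last occurrence falls on the same calendar day as report generation and that "Recent (72h)" means within 72 hours of it; the function name and the fallback label "Older" are hypothetical and do not appear in the report.

```python
from datetime import datetime, timedelta


def heuristic_status(
    last_seen: datetime,
    generated_at: datetime,
    recent_window: timedelta = timedelta(hours=72),
) -> str:
    """Label an alert family by recency only; this says nothing about resolution."""
    if last_seen.date() == generated_at.date():
        return "Seen today"
    if generated_at - last_seen <= recent_window:
        return "Recent (72h)"
    return "Older"  # hypothetical label for anything outside the window


# Timestamps taken from this report's header and tables.
generated = datetime(2026, 4, 18, 22, 1)
status_job = heuristic_status(datetime(2026, 4, 18, 19, 53), generated)
status_crash = heuristic_status(datetime(2026, 4, 16, 7, 25), generated)
```

Because the label depends only on the last-seen timestamp, an alert that fired once and recovered on its own is indistinguishable from one that is still failing, which is why the status should be read as a recency signal.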

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-mysql-catalog-disk-queue-high | 2 | 1 | 1 | 1 | 2026-04-15 12:21 | 2026-04-15 12:26 | OK | Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
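
As an illustration of how the columns above could be derived from a chronological list of alarm emails, consider the sketch below. The flip threshold that separates "Flapping, latest OK" from "Latest OK" is an assumption (the report only says "toggled repeatedly"), and the function name and the "In ALARM" label are hypothetical.

```python
def summarize_alarm(states: list[str], flap_threshold: int = 3) -> dict:
    """Summarize a chronological list of "ALARM"/"OK" email states.

    flap_threshold is an assumed cutoff for "toggled repeatedly".
    """
    if not states:
        raise ValueError("at least one alarm email is required")

    # A flip is any change of state between consecutive emails.
    flips = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    latest = states[-1]

    if latest == "ALARM":
        status = "In ALARM"  # hypothetical label
    elif flips >= flap_threshold:
        status = "Flapping, latest OK"
    else:
        status = "Latest OK"

    return {
        "emails": len(states),
        "alarm": states.count("ALARM"),
        "ok": states.count("OK"),
        "state_flips": flips,
        "latest_state": latest,
        "status": status,
    }


# The single alarm in this window: one ALARM email followed by one OK email.
row = summarize_alarm(["ALARM", "OK"])
```

Under these assumptions the alarm in this window, with one ALARM followed by one OK, counts a single flip and is labelled "Latest OK" rather than flapping.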

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal | Key Notes