Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-06-28 19:26 for 2026-06-21 07:00 to 2026-06-28 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 0)
Impacted services: 20 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 0 (still alarming at window end)
Latest observed signal: 2026-06-28 06:53 (most recent cross-source activity)
Executive summary

What needs attention

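Bottom line: critical application-level alerts are present in Slack on six services; Pingdom had no checks in this window, and no AWS alarm was still in ALARM at window end.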

Pingdom customer impact

External signal
No critical. Active: 0. Total seen: 0.

No Pingdom checks were available in this window.

No active issue listed in this category.

Slack impacted services

Application signal
6 critical. Active: 20. Total seen: 20.

6 critical, 14 non-critical active item(s).

AWS alarms

Infrastructure signal
No critical. Active: 3. Total seen: 3.

3 active item(s) in this window.

What to do next

  1. Next: Treat critical services web-80, accommodations-api, docgen2-api, admission-api, grafana, uni-api as the primary application investigation set

    Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on each of these six services, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.

    Executive recommendation
  2. Then: Check whether docgen2-api needs a short-term scaling adjustment or a queue and load change before the next traffic bump

    Executive recommendation
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Application + Infrastructure Alerts by Day

Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
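The attribution rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the dashboard's actual implementation: the names `attribute_targets`, `is_observability`, and `OBSERVABILITY_COMPONENTS` are hypothetical, and the sketch assumes the candidate targets have already been extracted from the alert text as a list of names.

```python
# Hypothetical sketch of the attribution rule; all names are illustrative.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def is_observability(name: str) -> bool:
    """True for Grafana, Loki, Tempo, or a suffixed instance such as "tempo-0"."""
    base = name.lower()
    return any(
        base == component or base.startswith(component + "-")
        for component in OBSERVABILITY_COMPONENTS
    )


def attribute_targets(candidates: list[str]) -> list[str]:
    """Return the targets an alert should be attributed to.

    Observability components are dropped only when at least one more
    specific (non-observability) target is named in the same alert.
    """
    specific = [c for c in candidates if not is_observability(c)]
    return specific if specific else list(candidates)
```

Under these assumptions, an alert naming both grafana and docgen2-api is attributed to docgen2-api only, while an alert naming grafana alone stays attributed to grafana, which is consistent with grafana appearing as its own row in the table below.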

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note
grafana | Critical | 33 | 2026-06-28 06:03 | Seen today | Platform5xxLowVolumeCritical100 (1), KubeContainerWaiting (20), Platform5xxLowVolumeWarning20 (2), KubeContainerMemoryHigh (7), TargetDown (1) | None | No thread note
admission-api | Critical | 3 | 2026-06-27 09:21 | Seen today | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (2) | None | No thread note
docgen2-api | Critical | 7 | 2026-06-25 10:04 | Recent (72h) | TraefikServiceHighErrorRate (3), KubeHpaMaxedOut (4) | Release / migration issue
    Latest thread note: docgen2-api 5xx spike: the routine backend deploy during peak gradebook hours hit the default 1s /health probe timeout while 3–6s PDF rende…
web-80 | Critical | 4 | 2026-06-26 22:52 | Recent (72h) | TraefikServiceHighErrorRate (4) | None | No thread note
accommodations-api | Critical | 4 | 2026-06-26 12:08 | Recent (72h) | TraefikServiceHighErrorRate (4) | None | No thread note
uni-api | Critical | 1 | 2026-06-24 15:13 | Seen this week | TraefikServiceHighErrorRate (1) | Observability storage
    Latest thread note: info{ | "ClientHost": "140.248.36.239", | "DownstreamContentSize": 11, | "DownstreamStatus": 502, | "Duration": 1436165, | "RequestHost": "…
notifications-api | Critical | 1 | 2026-06-23 14:48 | Seen this week | TraefikServiceHighErrorRate (1) | None | No thread note
download-album (Grouped 2 variants; Variant mentions 16; Active variants 2) | Warning | 16 | 2026-06-28 03:42 | Seen today | KubeJobFailed (16) | None | No thread note
social-migration | Warning | 14 | 2026-06-28 04:47 | Seen today | KubeJobNotCompleted (14) | None | No thread note
colecteaza-sms-note-abs | Warning | 11 | 2026-06-28 03:42 | Seen today | KubeJobFailed (11) | None | No thread note
rezumat | Warning | 9 | 2026-06-28 03:42 | Seen today | KubeJobFailed (9) | None | No thread note
admission-end-session (Grouped 2 variants; Variant mentions 8; Active variants 2) | Warning | 7 | 2026-06-28 03:42 | Seen today | KubeJobFailed (7) | None | No thread note
ai-api | Warning | 3 | 2026-06-26 17:31 | Recent (72h) | TraefikServiceHighLatency (3) | None | No thread note
billing-retry-payment-worker | Warning | 2 | 2026-06-26 10:17 | Recent (72h) | KubePodCrashLooping (1), KubeDeploymentReplicasMismatch (1) | None | No thread note
core-getresponse-events-worker | Warning | 1 | 2026-06-26 23:06 | Recent (72h) | KubePodCrashLooping (1) | None | No thread note
billing-cycle-worker | Warning | 1 | 2026-06-26 10:17 | Recent (72h) | KubePodCrashLooping (1) | None | No thread note
loki | Warning | 3 | 2026-06-25 00:07 | Seen this week | KubeContainerMemoryHigh (2), KubePersistentVolumeFillingUp (1) | None | No thread note
otel-collector | Warning | 1 | 2026-06-23 17:24 | Seen this week | KubePodCrashLooping (1) | None | No thread note
tempo-0 | Warning | 1 | 2026-06-23 12:26 | Seen this week | KubePodCrashLooping (1) | General investigation
    Latest thread note: Same OOM killed in tempo, we need to patch this once the release freeze is lifted
s3-migration | Warning | 1 | 2026-06-22 14:28 | Seen this week | KubeDeploymentReplicasMismatch (1) | None | No thread note
Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note
TraefikServiceHighErrorRate | Critical | 12 | 2026-06-27 09:21 | Seen today | 2 | web-80 (4), accommodations-api (4), docgen2-api (3), notifications-api (1), uni-api (1) | Release / migration issue, Observability storage
    Latest thread note: info{ | "ClientHost": "140.248.36.239", | "DownstreamContentSize": 11, | "DownstreamStatus": 502, | "Duration": 1436165, | "RequestHost": "…
Platform5xxLowVolumeCritical100 | Critical | 1 | 2026-06-26 22:56 | Recent (72h) | 0 | grafana (1) | None
KubeContainerWaiting | Warning | 20 | 2026-06-28 06:03 | Seen today | 0 | grafana (20) | None
KubeJobFailed | Warning | 17 | 2026-06-28 03:42 | Seen today | 0 | download-album (16), colecteaza-sms-note-abs (11), rezumat (9), admission-end-session (7) | None
KubeJobNotCompleted | Warning | 14 | 2026-06-28 04:47 | Seen today | 0 | social-migration (14) | None
TraefikServiceHighLatency | Warning | 5 | 2026-06-26 17:31 | Recent (72h) | 0 | ai-api (3), admission-api (2) | None
KubePodCrashLooping | Warning | 4 | 2026-06-26 23:06 | Recent (72h) | 1 | tempo-0 (1), otel-collector (1), billing-cycle-worker (1), billing-retry-payment-worker (1), core-getresponse-events-worker (1) | General investigation
    Latest thread note: Same OOM killed in tempo, we need to patch this once the release freeze is lifted
KubeHpaMaxedOut | Warning | 4 | 2026-06-25 10:04 | Recent (72h) | 0 | docgen2-api (4) | None
Platform5xxLowVolumeWarning20 | Warning | 2 | 2026-06-26 22:56 | Recent (72h) | 0 | grafana (2) | None
KubeDeploymentReplicasMismatch | Warning | 2 | 2026-06-26 10:16 | Recent (72h) | 0 | s3-migration (1), billing-retry-payment-worker (1) | None
KubeContainerMemoryHigh | Warning | 7 | 2026-06-25 00:07 | Seen this week | 0 | grafana (7), loki (2) | None
TargetDown | Warning | 1 | 2026-06-23 17:18 | Seen this week | 0 | grafana (1) | None
KubePdbNotEnoughHealthyPods | Warning | 1 | 2026-06-22 14:28 | Seen this week | 0 | grafana (1) | None
KubePersistentVolumeFillingUp | Warning | 1 | 2026-06-22 10:43 | Seen this week | 0 | grafana (1), loki (1) | None

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
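The recency buckets used in the Status column can be sketched as a simple age check. This is a hedged illustration only: the function name `recency_status` is hypothetical, and both the 24-hour and 72-hour cut-offs and the choice of the reporting window end (2026-06-28 07:00) as the reference point are assumptions that happen to agree with the rows shown in this report. The dashboard's real rule may differ.

```python
from datetime import datetime, timedelta


def recency_status(last_seen: datetime, window_end: datetime) -> str:
    """Bucket an alert family by how long before window_end it last fired.

    The thresholds are assumptions inferred from this report's rows,
    not a documented rule.
    """
    age = window_end - last_seen
    if age < timedelta(hours=24):
        return "Seen today"
    if age < timedelta(hours=72):
        return "Recent (72h)"
    return "Seen this week"
```

For example, with the window ending at 2026-06-28 07:00, a last-seen time of 2026-06-27 09:21 falls in "Seen today" and 2026-06-25 00:07 falls in "Seen this week". Neither label says anything about whether the alert has actually resolved.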

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-mysql-catalog2-memory-low | 49 | 24 | 25 | 48 | 2026-06-21 08:45 | 2026-06-24 07:34 | OK | Flapping, latest OK
adservio-rds-mysql-catalog-memory-low | 13 | 6 | 7 | 12 | 2026-06-22 12:01 | 2026-06-28 06:53 | OK | Flapping, latest OK
adservio-rds-mysql-catalog-swap-high | 7 | 3 | 4 | 6 | 2026-06-22 12:02 | 2026-06-28 05:22 | OK | Flapping, latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
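One way to derive the columns of the AWS table from a chronological list of alarm email states is sketched below. It is an assumption-laden illustration, not the report's actual logic: the function name `summarize_alarm`, the `flap_threshold` parameter and its default, and the non-flapping labels "In ALARM" and "OK" are all hypothetical.

```python
def summarize_alarm(states: list[str], flap_threshold: int = 2) -> dict:
    """Summarize a chronological list of "ALARM"/"OK" email states.

    A state flip is any consecutive pair of emails whose states differ.
    The flap_threshold default is an assumption for illustration.
    """
    if not states:
        raise ValueError("states must contain at least one email state")
    flips = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    latest = states[-1]
    if flips >= flap_threshold:
        status = f"Flapping, latest {latest}"
    elif latest == "ALARM":
        status = "In ALARM"  # hypothetical label
    else:
        status = "OK"  # hypothetical label
    return {
        "emails": len(states),
        "alarm": states.count("ALARM"),
        "ok": states.count("OK"),
        "state_flips": flips,
        "latest_state": latest,
        "status": status,
    }
```

A strictly alternating sequence of seven emails that starts and ends with OK yields 7 emails, 3 ALARM, 4 OK, and 6 state flips, which matches the shape of the adservio-rds-mysql-catalog-swap-high row. The real email sequence is not shown in this report.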

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal | Key Notes
2026-06-24 15:13 | TraefikServiceHighErrorRate | Critical | uni-api | Observability storage
    Key notes: info{ | "ClientHost": "140.248.36.239", | "DownstreamContentSize": 11, | "DownstreamStatus": 502, | "Duration": 1436165, | "RequestHost": "…
2026-06-24 15:11 | TraefikServiceHighErrorRate | Critical | docgen2-api | Release / migration issue
    Key notes: docgen2-api 5xx spike: the routine backend deploy during peak gradebook hours hit the default 1s /health probe timeout while 3–6s PDF rende…
2026-06-23 12:26 | KubePodCrashLooping | Warning | tempo-0 | General investigation
    Key notes: Same OOM killed in tempo, we need to patch this once the release freeze is lifted