Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-06-24 14:45 for 2026-06-13 07:00 to 2026-06-20 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 0)
Impacted services: 37 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 1 (still alarming at window end)
Latest observed signal: 2026-06-20 06:33 (most recent cross-source activity)
Run notes

Data source availability

Some enrichments were unavailable in this run; drilldowns below stay focused on captured evidence.

Executive summary

What needs attention

Bottom line: critical alerts are active on application-level paths, and the catalog database alarms still look active.

Pingdom customer impact

External signal
No critical | Active: 0 | Total seen: 0

Pingdom did not show a fresh customer-visible issue in this window.

No active issue listed in this category.

Slack impacted services

Application signal
5 critical | Active: 35 | Total seen: 35

5 critical, 30 non-critical active item(s).

AWS alarms

Infrastructure signal
1 critical | Active: 2 | Total seen: 12

1 critical, 1 non-critical active item(s).

What to do next

  1. Next: Treat critical services web-80, grafana, uni-api, core-grafana-80 as the primary application investigation set

    Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on web-80, grafana, uni-api, core-grafana-80, compare against the latest deploy or config change, and do not close it until both error rate and latency flatten.

    Executive recommendation
  2. Then: Reduce the operational drag on ai-api by separating repeated symptom alerts from the underlying workload failure

    Either eliminate the recurrent fault or retune the alert once the failure mode is understood.

    Executive recommendation
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Chart: Application + Infrastructure Alerts by Day

Chart: Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence
Adservio Ro | No recent customer-visible issue | 0 | 0m | | adservio-ro |
https://www.adservio.ro/api/v2/status | No recent customer-visible issue | 0 | 0m | | adservio-ro-api-v2-status |

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
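
The attribution rule above can be sketched as a small filter. This is only an illustration under stated assumptions: the function name `attribute_targets` and the set `OBSERVABILITY_COMPONENTS` are hypothetical, and the report generator's actual matching logic is not shown in this document.

```python
# Hypothetical sketch of the attribution rule described above.
# Names here are illustrative; they are not taken from the report generator.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(named_targets):
    """Return the targets an alert should be attributed to.

    Observability components are dropped only when at least one more
    specific target is also named in the alert text.
    """
    # De-duplicate while preserving the order in which targets were named.
    ordered = []
    for target in named_targets:
        if target not in ordered:
            ordered.append(target)

    specific = [t for t in ordered if t.lower() not in OBSERVABILITY_COMPONENTS]
    # If nothing more specific is present, keep the observability components.
    return specific if specific else ordered
```

For example, an alert naming both `grafana` and `web-80` would be attributed to `web-80` alone, while an alert naming only `grafana` stays attributed to `grafana`.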

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note
grafana | Critical | 69 | 2026-06-19 14:18 | Seen today | PlatformLatencyP95Critical1s (7), NodeHighNumberConntrackEntriesUsed (7), KubePersistentVolumeFillingUp (6), KubeContainerMemoryHigh (15), NodeSystemSaturation (4) | Observability storage |
    *Alert:* `Platform5xxRatioCritical3pct` (critical) | *When:* 2026-06-16 06:53 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana…
docgen2-api | Critical | 28 | 2026-06-19 09:05 | Seen today | TraefikServiceHighErrorRate (4), CPUThrottlingHigh (13), KubeHpaMaxedOut (11) | None | No thread note
core-grafana-80 | Critical | 3 | 2026-06-19 10:43 | Seen today | TraefikServiceHighErrorRate (3) | None | No thread note
web-80 | Critical | 31 | 2026-06-18 16:09 | Recent (72h) | TraefikServiceHighErrorRate (14), TraefikServiceHighLatency (17) | Observability storage |
    *Alert:* `TraefikServiceHighErrorRate` (critical) | *When:* 2026-06-16 05:55 Europe/Bucharest | | *Service:* `web-80` | *TL;DR:* `web-80` c…
uni-api (grouped 2 variants; variant mentions 13; active variants 2) | Critical | 14 | 2026-06-17 18:34 | Recent (72h) | TraefikServiceHighErrorRate (4), KubePodCrashLooping (1), KubeDeploymentReplicasMismatch (1), TraefikServiceHighLatency (8) | Release / migration issue |
    *Alert:* `TraefikServiceHighErrorRate` (critical) | *When:* 2026-06-16 16:38 Europe/Bucharest | | *Service:* `uni-api` | *TL;DR:* `uni-api`… | *Fix now:* Check the latest deployed revision for `/api/v2/uni/student-studies/students/1882/disciplines?semester=1` and rollback or hotfix…
ai-api | Warning | 17 | 2026-06-19 17:15 | Seen today | TraefikServiceHighLatency (17) | None | No thread note
download-album (grouped 3 variants; variant mentions 14; active variants 3) | Warning | 14 | 2026-06-20 05:13 | Seen today | KubeJobFailed (14) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
loki | Warning | 6 | 2026-06-19 14:18 | Seen today | KubePersistentVolumeFillingUp (6) | None | No thread note
admission-api | Warning | 3 | 2026-06-18 11:26 | Recent (72h) | TraefikServiceHighLatency (3) | None | No thread note
subscriptions-api (grouped 2 variants; variant mentions 14; active variants 2) | Warning | 21 | 2026-06-16 01:43 | Seen this week | TraefikServiceHighLatency (13), KubeDeploymentReplicasMismatch (7), KubePodCrashLooping (1) | None | No thread note
social-api | Warning | 17 | 2026-06-15 14:11 | Seen this week | TraefikServiceHighLatency (17) | None | No thread note
core-getresponse-events-worker | Warning | 13 | 2026-06-16 07:03 | Seen this week | KubeDeploymentReplicasMismatch (9), KubePodCrashLooping (4) | None | No thread note
notifications-event-manager | Warning | 12 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (6), KubeHpaMaxedOut (5), KubePodCrashLooping (1) | None | No thread note
colecteaza-sms-note-abs | Warning | 11 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (11) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
service-websocket | Warning | 11 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (8), KubePodCrashLooping (3) | None | No thread note
rezumat | Warning | 10 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (10) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
publish-results | Warning | 9 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (9) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
update-recurenta | Warning | 9 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (9) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
attendance-register-missed-attendance | Warning | 8 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (8) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
stats-active-users | Warning | 8 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (8) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
stats-school-students | Warning | 7 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (7) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
library-api | Warning | 7 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (6), KubePodCrashLooping (1) | None | No thread note
notifications-push-sender-worker | Warning | 7 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (6), KubePodCrashLooping (1) | None | No thread note
service-av | Warning | 7 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (6), KubePodCrashLooping (1) | None | No thread note
send-codes | Warning | 6 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (6) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
notifications-events-worker | Warning | 6 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (5), KubePodCrashLooping (1) | None | No thread note
subscriptions-school-stats-worker | Warning | 6 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (5), KubePodCrashLooping (1) | None | No thread note
generate-invoices | Warning | 5 | 2026-06-16 15:36 | Seen this week | KubeJobFailed (5) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
service-fetgenerator (grouped 2 variants; variant mentions 2; active variants 2) | Warning | 5 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (4), KubePodCrashLooping (1) | None | No thread note
subscriptions-assign-worker | Warning | 5 | 2026-06-16 01:43 | Seen this week | KubeDeploymentReplicasMismatch (4), KubePodCrashLooping (1) | None | No thread note
rooms-api | Warning | 5 | 2026-06-15 11:29 | Seen this week | TraefikServiceHighLatency (5) | None | No thread note
web | Warning | 3 | 2026-06-15 11:36 | Seen this week | KubeDeploymentReplicasMismatch (3) | None | No thread note
social-local-cache | Warning | 1 | 2026-06-16 15:30 | Seen this week | KubeJobNotCompleted (1) | None | No thread note
metrics-server | Warning | 1 | 2026-06-15 21:00 | Seen this week | KubeAggregatedAPIDown (1) | None | No thread note
billing-api | Warning | 1 | 2026-06-15 14:11 | Seen this week | TraefikServiceHighLatency (1) | None | No thread note
Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note
TraefikServiceHighErrorRate | Critical | 25 | 2026-06-19 10:43 | Seen today | 2 | web-80 (14), docgen2-api (4), uni-api (4), core-grafana-80 (3) | Observability storage, Release / migration issue |
    *Alert:* `TraefikServiceHighErrorRate` (critical) | *When:* 2026-06-16 16:38 Europe/Bucharest | | *Service:* `uni-api` | *TL;DR:* `uni-api`… | *Fix now:* Check the latest deployed revision for `/api/v2/uni/student-studies/students/1882/disciplines?semester=1` and rollback or hotfix…
Platform5xxRatioCritical3pct | Critical | 2 | 2026-06-18 16:18 | Recent (72h) | 1 | grafana (2) | Observability storage |
    *Alert:* `Platform5xxRatioCritical3pct` (critical) | *When:* 2026-06-16 06:53 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana…
PlatformLatencyP95Critical1s | Critical | 7 | 2026-06-15 14:01 | Seen this week | 0 | grafana (7) | None |
Platform5xxLowVolumeCritical100 | Critical | 4 | 2026-06-16 05:48 | Seen this week | 1 | grafana (4) | Observability storage |
    *Alert:* `Platform5xxLowVolumeCritical100` (critical) | *When:* 2026-06-16 05:48 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `graf…
TraefikServiceHighLatency | Warning | 26 | 2026-06-19 17:15 | Seen today | 0 | social-api (17), web-80 (17), ai-api (17), subscriptions-api (13), uni-api (8) | None |
KubeJobFailed | Warning | 21 | 2026-06-20 05:13 | Seen today | 1 | download-album (14), colecteaza-sms-note-abs (11), rezumat (10), publish-results (9), update-recurenta (9) | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
CPUThrottlingHigh | Warning | 13 | 2026-06-19 09:05 | Seen today | 0 | docgen2-api (13) | None |
NodeHighNumberConntrackEntriesUsed | Warning | 7 | 2026-06-19 09:32 | Seen today | 0 | grafana (7) | None |
KubePersistentVolumeFillingUp | Warning | 6 | 2026-06-19 14:18 | Seen today | 0 | grafana (6), loki (6) | None |
KubeHpaMaxedOut | Warning | 22 | 2026-06-18 13:21 | Recent (72h) | 0 | docgen2-api (11), grafana (6), notifications-event-manager (5) | None |
KubeContainerMemoryHigh | Warning | 15 | 2026-06-18 17:22 | Recent (72h) | 0 | grafana (15) | None |
KubeDeploymentReplicasMismatch | Warning | 14 | 2026-06-17 18:28 | Recent (72h) | 0 | core-getresponse-events-worker (9), service-websocket (8), subscriptions-api (7), library-api (6), notifications-event-manager (6) | None |
KubePodCrashLooping | Warning | 6 | 2026-06-17 18:29 | Recent (72h) | 0 | core-getresponse-events-worker (4), service-websocket (3), library-api (1), notifications-event-manager (1), notifications-events-worker (1) | None |
NodeSystemSaturation | Warning | 4 | 2026-06-19 01:24 | Recent (72h) | 0 | grafana (4) | None |
NodeDiskIOSaturation | Warning | 2 | 2026-06-19 01:40 | Recent (72h) | 0 | grafana (2) | None |
Platform5xxRatioWarning1pct | Warning | 2 | 2026-06-18 16:18 | Recent (72h) | 0 | grafana (2) | None |
PlatformLatencyP95Warning400ms | Warning | 6 | 2026-06-15 14:01 | Seen this week | 0 | grafana (6) | None |
PlatformLatencyP99Warning5s | Warning | 5 | 2026-06-15 14:07 | Seen this week | 0 | grafana (5) | None |
TargetDown | Warning | 2 | 2026-06-15 09:31 | Seen this week | 0 | grafana (2) | None |
KubeJobNotCompleted | Warning | 1 | 2026-06-16 15:30 | Seen this week | 0 | social-local-cache (1) | None |
Platform5xxLowVolumeWarning20 | Warning | 1 | 2026-06-16 04:11 | Seen this week | 0 | grafana (1) | None |
KubeAggregatedAPIDown | Warning | 1 | 2026-06-15 21:00 | Seen this week | 0 | metrics-server (1) | None |

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
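
A minimal sketch of how such a recency label could be derived from the last-seen timestamp. The 24-hour and 72-hour cutoffs measured back from the window end (2026-06-20 07:00) are assumptions that happen to be consistent with the rows in this report; the function name `recency_status` is hypothetical and the generator's real thresholds are not documented here.

```python
from datetime import datetime, timedelta


def recency_status(last_seen, window_end):
    """Map a last-seen timestamp to a heuristic recency label.

    Assumed cutoffs (not confirmed by the report): 24h and 72h before
    the end of the reporting window.
    """
    age = window_end - last_seen
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    return "Seen this week"


# Example using timestamps that appear in this report.
window_end = datetime(2026, 6, 20, 7, 0)
labels = [
    recency_status(datetime(2026, 6, 19, 14, 18), window_end),
    recency_status(datetime(2026, 6, 18, 16, 9), window_end),
    recency_status(datetime(2026, 6, 16, 1, 43), window_end),
]
```

The label depends only on when the alert last fired, which is why it cannot tell a resolved alert from one that is simply quiet.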

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-mysql-catalog2-swap-high | 3 | 2 | 1 | 2 | 2026-06-15 06:33 | 2026-06-17 06:42 | ALARM | Still alarming
adservio-rds-mysql-catalog2-memory-low | 22 | 11 | 11 | 21 | 2026-06-15 06:22 | 2026-06-20 06:33 | OK | Flapping, latest OK
adservio-root-account-usage | 4 | 2 | 2 | 3 | 2026-06-16 02:47 | 2026-06-16 15:07 | OK | Latest OK
adservio-rds-mysql-catalog2-storage-low | 4 | 2 | 2 | 3 | 2026-06-15 23:55 | 2026-06-16 06:00 | OK | Latest OK
adservio-rds-mysql-catalog3-cpu-high | 1 | 0 | 1 | 0 | 2026-06-15 20:05 | 2026-06-15 20:05 | OK | Latest OK
adservio-rds-mysql-catalog3-read-latency-high | 1 | 0 | 1 | 0 | 2026-06-15 20:05 | 2026-06-15 20:05 | OK | Latest OK
adservio-rds-mysql-catalog3-memory-low | 1 | 0 | 1 | 0 | 2026-06-15 20:05 | 2026-06-15 20:05 | OK | Latest OK
adservio-rds-mysql-catalog3-storage-low | 1 | 0 | 1 | 0 | 2026-06-15 20:05 | 2026-06-15 20:05 | OK | Latest OK
adservio-rds-mysql-catalog3-write-latency-high | 1 | 0 | 1 | 0 | 2026-06-15 20:05 | 2026-06-15 20:05 | OK | Latest OK
adservio-rds-mysql-catalog3-connections-high | 1 | 0 | 1 | 0 | 2026-06-15 20:04 | 2026-06-15 20:04 | OK | Latest OK
adservio-rds-mysql-catalog3-disk-queue-high | 1 | 0 | 1 | 0 | 2026-06-15 20:04 | 2026-06-15 20:04 | OK | Latest OK
adservio-rds-mysql-catalog3-swap-high | 1 | 0 | 1 | 0 | 2026-06-15 20:04 | 2026-06-15 20:04 | OK | Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
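
As an illustration of how the counts and status labels in this view relate, the sketch below summarizes one alarm family from its ordered sequence of email states. The function name `summarize_alarm` and the `flapping_threshold` of 5 flips are assumptions made for the example; the report does not state how many toggles qualify as flapping.

```python
def summarize_alarm(states, flapping_threshold=5):
    """Summarize an ordered list of alarm email states ("ALARM" or "OK").

    flapping_threshold is a hypothetical cutoff, not taken from the report.
    """
    if not states:
        raise ValueError("at least one alarm email is required")

    # A flip is any change of state between two consecutive emails.
    flips = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    latest = states[-1]

    if latest == "ALARM":
        status = "Still alarming"
    elif flips >= flapping_threshold:
        status = "Flapping, latest OK"
    else:
        status = "Latest OK"

    return {
        "emails": len(states),
        "alarm": states.count("ALARM"),
        "ok": states.count("OK"),
        "flips": flips,
        "latest": latest,
        "status": status,
    }
```

Under these assumptions, a sequence of 11 alternating ALARM/OK pairs gives 22 emails, 21 flips, and "Flapping, latest OK", while a single OK email gives 0 flips and "Latest OK".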

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal | Key Notes
2026-06-16 16:38 | TraefikServiceHighErrorRate | Critical | uni-api | Release / migration issue |
    *Alert:* `TraefikServiceHighErrorRate` (critical) | *When:* 2026-06-16 16:38 Europe/Bucharest | | *Service:* `uni-api` | *TL;DR:* `uni-api`… | *Fix now:* Check the latest deployed revision for `/api/v2/uni/student-studies/students/1882/disciplines?semester=1` and rollback or hotfix…
2026-06-16 11:31 | KubeJobFailed | Warning | attendance-register-missed-attendance, colecteaza-sms-note-abs, download-album, generate-invoices | General investigation |
    These are failed jobs which can be cleaned up from cluster, removed these failed jobs.
2026-06-16 06:53 | Platform5xxRatioCritical3pct | Critical | grafana | Observability storage |
    *Alert:* `Platform5xxRatioCritical3pct` (critical) | *When:* 2026-06-16 06:53 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana…
2026-06-16 05:55 | TraefikServiceHighErrorRate | Critical | web-80 | Observability storage |
    *Alert:* `TraefikServiceHighErrorRate` (critical) | *When:* 2026-06-16 05:55 Europe/Bucharest | | *Service:* `web-80` | *TL;DR:* `web-80` c…
2026-06-16 05:48 | Platform5xxLowVolumeCritical100 | Critical | grafana | Observability storage |
    *Alert:* `Platform5xxLowVolumeCritical100` (critical) | *When:* 2026-06-16 05:48 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `graf…