Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-06-08 02:19 for 2026-05-30 07:00 to 2026-06-08 02:18 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 5)
Impacted services: 16 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 0 (still alarming at window end)
Latest observed signal: 2026-06-08 00:20 (most recent cross-source activity)
Run notes

Data source availability

Some enrichments were unavailable in this run; drilldowns below stay focused on captured evidence.

Executive summary

What needs attention

Bottom line: Pingdom observed recent customer-facing glitches that no inbox alert has confirmed, and four application services carry critical alerts.

Pingdom customer impact

External signal
No critical | Active: 1 | Total seen: 1

1 active item(s) in this window.

Slack impacted services

Application signal
4 critical | Active: 15 | Total seen: 15

4 critical, 11 non-critical active item(s).

AWS alarms

Infrastructure signal
No critical | Active: 3 | Total seen: 3

3 active item(s) in this window.

What to do next

  1. Next: Use Pingdom check Adservio Ro as churn/degradation evidence, and escalate to customer-facing incident priority only after inbox confirmation or strong cross-source correlation. (Executive recommendation)
  2. Then: Treat critical services core-grafana-80, accommodations-api, uni-api, and web-80 as the primary application investigation set. Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on those four services, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten. (Executive recommendation)
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

[Chart: Application + Infrastructure Alerts by Day]

[Chart: Pingdom latency + downtime by source]

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence
Adservio Ro | Recovered recently | 5 | 23m | 2026-06-05 18:34 | unclassified | -
https://www.adservio.ro/api/v2/status | No recent customer-visible issue | 0 | 0m | - | unclassified | -

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
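
The exclusion rule above can be sketched as follows. This is a minimal illustration, not the dashboard's actual implementation: the function name `attribute_targets`, the constant `OBSERVABILITY_COMPONENTS`, and the assumption that target names have already been extracted from the alert text are all hypothetical.

```python
# Hypothetical sketch of the attribution rule described above.
# Assumption: the targets named in the alert text are already extracted
# into a list of strings; how that extraction works is not covered here.

OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(named_targets):
    """Return the targets an alert is attributed to.

    Observability components are dropped only when at least one more
    specific target is also named; otherwise they are kept.
    """
    # De-duplicate while preserving the order in which targets were named.
    seen = []
    for target in named_targets:
        if target not in seen:
            seen.append(target)

    specific = [t for t in seen if t.lower() not in OBSERVABILITY_COMPONENTS]
    return specific if specific else seen
```

Under this sketch, an alert naming both grafana and uni-api is attributed to uni-api only, while an alert naming only grafana stays attributed to grafana. A name such as core-grafana-80 is not an exact match for an observability component, so it is treated as a specific target.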

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note
uni-api | Critical | 2 | 2026-06-05 09:40 | Recent (72h) | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (1) | General investigation | @U09JYAWCGLB | uni-api on TUIASI is returning HTTP 500 from StudentServiceImpl.getCatalogCloseStatus and updateCatalogCloseStatus because the JPA query fo…
web-80 | Critical | 8 | 2026-06-03 23:38 | Seen this week | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (7) | None | No thread note
core-grafana-80 | Critical | 3 | 2026-06-03 18:04 | Seen this week | TraefikServiceHighErrorRate (3) | None | No thread note
accommodations-api | Critical | 2 | 2026-06-04 14:40 | Seen this week | TraefikServiceHighErrorRate (2) | None | No thread note
metrics-server | Warning | 27 | 2026-06-08 00:20 | Seen today | KubeAggregatedAPIDown (27) | None | No thread note
admission-api | Warning | 6 | 2026-06-07 21:40 | Seen today | TraefikServiceHighLatency (6) | None | No thread note
ai-api | Warning | 11 | 2026-06-05 17:46 | Recent (72h) | TraefikServiceHighLatency (11) | None | No thread note
grafana | Warning | 24 | 2026-06-04 20:50 | Seen this week | KubeMemoryOvercommit (7), KubeNodeEviction (4), NodeSystemSaturation (3), KubeHpaMaxedOut (2), AlertmanagerFailedToSendAlerts (1) | General investigation | Brief node-exporter target gap caused by Cluster Autoscaler scaling down worker node 10.66.126.144 the node was cordoned, drained, and remo…
social-api | Warning | 8 | 2026-06-02 14:16 | Seen this week | TraefikServiceHighLatency (8) | None | No thread note
docgen2-api | Warning | 6 | 2026-06-02 16:17 | Seen this week | KubeHpaMaxedOut (6) | None | No thread note
subscriptions-api | Warning | 2 | 2026-06-02 12:56 | Seen this week | TraefikServiceHighLatency (2) | None | No thread note
accommodations-sync-users | Warning | 1 | 2026-06-03 14:10 | Seen this week | KubeJobFailed (1) | None | No thread note
colecteaza-sms-note-abs | Warning | 1 | 2026-06-03 14:10 | Seen this week | KubeJobFailed (1) | None | No thread note
minicrm-sync | Warning | 1 | 2026-06-03 14:10 | Seen this week | KubeJobFailed (1) | None | No thread note
notifications-event-manager | Warning | 1 | 2026-06-02 12:34 | Seen this week | KubeHpaMaxedOut (1) | None | No thread note

Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note
TraefikServiceHighErrorRate | Critical | 7 | 2026-06-05 09:40 | Recent (72h) | 1 | core-grafana-80 (3), accommodations-api (2), web-80 (1), uni-api (1) | General investigation | @U09JYAWCGLB | uni-api on TUIASI is returning HTTP 500 from StudentServiceImpl.getCatalogCloseStatus and updateCatalogCloseStatus because the JPA query fo…
KubeAggregatedAPIDown | Warning | 27 | 2026-06-08 00:20 | Seen today | 0 | metrics-server (27) | None | -
TraefikServiceHighLatency | Warning | 21 | 2026-06-07 21:40 | Seen today | 0 | ai-api (11), social-api (8), web-80 (7), admission-api (6), subscriptions-api (2) | None | -
KubeHpaMaxedOut | Warning | 9 | 2026-06-03 16:34 | Seen this week | 0 | docgen2-api (6), grafana (2), notifications-event-manager (1) | None | -
KubeMemoryOvercommit | Warning | 7 | 2026-06-04 19:08 | Seen this week | 0 | grafana (7) | None | -
KubeNodeEviction | Warning | 4 | 2026-06-04 20:50 | Seen this week | 0 | grafana (4) | None | -
NodeSystemSaturation | Warning | 3 | 2026-06-02 12:28 | Seen this week | 0 | grafana (3) | None | -
KubeJobFailed | Warning | 1 | 2026-06-03 14:10 | Seen this week | 0 | accommodations-sync-users (1), colecteaza-sms-note-abs (1), minicrm-sync (1) | None | -
AlertmanagerFailedToSendAlerts | Warning | 1 | 2026-06-03 14:10 | Seen this week | 0 | grafana (1) | None | -
TargetDown | Warning | 1 | 2026-06-01 11:34 | Seen this week | 1 | grafana (1) | General investigation | Brief node-exporter target gap caused by Cluster Autoscaler scaling down worker node 10.66.126.144 the node was cordoned, drained, and remo…
KubeletServerCertificateExpiration | Warning | 6 | 2026-05-31 11:22 | No recent signal | 1 | grafana (6) | General investigation | Cert auto-rotated over the weekend

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
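
A recency heuristic of this kind can be sketched as a simple age bucketing. The thresholds below (24 hours, 72 hours, 7 days) are assumptions inferred from the rows in this report, measured against the window end of 2026-06-08 02:18; the report does not state the exact cut-offs, and the function name `recency_status` is hypothetical.

```python
from datetime import datetime, timedelta


def recency_status(last_seen, window_end):
    """Bucket an alert family by how long ago it was last seen.

    Assumed thresholds (not confirmed by the report): a rolling 24 hours,
    72 hours, and 7 days before the end of the reporting window.
    """
    age = window_end - last_seen
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    if age <= timedelta(days=7):
        return "Seen this week"
    return "No recent signal"


# Example using timestamps that appear in this report.
window_end = datetime(2026, 6, 8, 2, 18)
print(recency_status(datetime(2026, 6, 5, 9, 40), window_end))
```

Note that the label reflects only when the alert last appeared; as stated above, it says nothing about whether the underlying problem is resolved.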

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-postgres-billing-cpu-high | 6 | 3 | 3 | 5 | 2026-06-01 10:11 | 2026-06-03 11:53 | OK | Latest OK
adservio-rds-mysql-master-disk-queue-high | 2 | 1 | 1 | 1 | 2026-06-06 04:22 | 2026-06-06 04:23 | OK | Latest OK
adservio-rds-postgres-billing-disk-queue-high | 2 | 1 | 1 | 1 | 2026-06-01 10:09 | 2026-06-01 10:13 | OK | Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
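
One way to derive the State Flips count and a flapping label from an ordered sequence of alarm emails is sketched below. The function names and the `flap_threshold` parameter are hypothetical; the report does not say how many flips qualify as flapping, so the threshold is left as a required argument rather than given a default.

```python
def count_state_flips(states):
    """Count transitions between consecutive states, e.g. ALARM -> OK."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states, flap_threshold):
    """Label an alarm from its chronologically ordered email states.

    flap_threshold is an assumed parameter: the number of flips at or
    above which the alarm is called flapping.
    """
    if not states:
        return "No data"
    latest = states[-1]
    if count_state_flips(states) >= flap_threshold:
        return f"Flapping, latest {latest}"
    return f"Latest {latest}"


# Six emails alternating ALARM and OK yield five flips.
emails = ["ALARM", "OK", "ALARM", "OK", "ALARM", "OK"]
print(count_state_flips(emails))
```

Six alternating emails give five flips, which matches the Emails and State Flips columns for adservio-rds-postgres-billing-cpu-high, assuming its emails strictly alternated.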

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal | Key Notes
2026-06-05 09:40 | TraefikServiceHighErrorRate | Critical | uni-api | General investigation | @U09JYAWCGLB | uni-api on TUIASI is returning HTTP 500 from StudentServiceImpl.getCatalogCloseStatus and updateCatalogCloseStatus because the JPA query fo…
2026-06-01 11:34 | TargetDown | Warning | grafana | General investigation | Brief node-exporter target gap caused by Cluster Autoscaler scaling down worker node 10.66.126.144 the node was cordoned, drained, and remo…
2026-05-31 11:22 | KubeletServerCertificateExpiration | Warning | grafana | General investigation | Cert auto-rotated over the weekend