Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-06-15 02:04 for 2026-06-06 07:00 to 2026-06-13 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 0)
Impacted services: 13 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 1 (still alarming at window end)
Latest observed signal: 2026-06-12 23:34 (most recent cross-source activity)
Run notes

Data source availability

Some enrichments were unavailable in this run; drilldowns below stay focused on captured evidence.

Executive summary

What needs attention

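Bottom line: four application services (grafana, docgen2-api, accommodations-api, social-api) have active critical alerts, and the catalog database memory alarm (adservio-rds-mysql-catalog-memory-low) was still alarming at window end.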

Pingdom customer impact

External signal
No critical | Active: 0 | Total seen: 0

No Pingdom checks were available in this window.

No active issue listed in this category.

Slack impacted services

Application signal
4 critical | Active: 13 | Total seen: 13

4 critical and 9 non-critical active items.

AWS alarms

Infrastructure signal
1 critical | Active: 1 | Total seen: 2

1 critical and 0 non-critical active items.

What to do next

  1. Next: Treat critical services grafana, docgen2-api, accommodations-api, and social-api as the primary application investigation set

    Use PlatformLatencyP95Critical1s as the leading category, reproduce the failing paths on each of the four services, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.

    Executive recommendation
  2. Then: Reduce the operational drag on ai-api by separating repeated symptom alerts from the underlying workload failure

    Either eliminate the recurrent fault or retune the alert once the failure mode is understood.

    Executive recommendation
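
One way to start separating repeated symptom alerts from one-off alerts is to count how often each (service, alert type) pair fired in the window. The sketch below is illustrative only: the function name `repeated_alerts`, the `min_count` cutoff, and the input shape (a list of `(service, alert)` tuples) are assumptions, not part of the dashboard's pipeline.

```python
from collections import Counter


def repeated_alerts(events, min_count=5):
    """Return (service, alert, count) for pairs seen at least min_count times.

    `events` is a list of (service, alert) tuples; `min_count` is a
    hypothetical cutoff chosen for illustration.
    """
    counts = Counter(events)
    # most_common() yields pairs ordered from most to least frequent.
    return [
        (service, alert, n)
        for (service, alert), n in counts.most_common()
        if n >= min_count
    ]


# Example input built from counts reported in the Slack tables of this briefing.
events = (
    [("ai-api", "TraefikServiceHighLatency")] * 13
    + [("rooms-api", "TraefikServiceHighLatency")]
    + [("metrics-server", "KubeAggregatedAPIDown")] * 12
)
noisy = repeated_alerts(events)
```

With that input, `noisy` holds the ai-api and metrics-server pairs, which are the candidates for either a root-cause fix or alert retuning.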
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Chart: Application + Infrastructure Alerts by Day

Chart: Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence
(No rows: no Pingdom checks were available in this window.)

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
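
The exclusion rule described above can be sketched as a small filter. This is a minimal illustration, assuming each alert has already been parsed into a list of target names; `attribute_targets` and `OBSERVABILITY_COMPONENTS` are hypothetical names, not the dashboard's actual code.

```python
# Observability components named in the view description.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(named_targets):
    """Pick the targets an alert is attributed to.

    Hypothetical helper: observability components are dropped only when
    a more specific target is also named; otherwise they are kept.
    """
    # De-duplicate while preserving the order in which targets were named.
    unique = []
    for target in named_targets:
        if target not in unique:
            unique.append(target)
    specific = [t for t in unique if t.lower() not in OBSERVABILITY_COMPONENTS]
    return specific if specific else unique
```

For example, an alert naming both `grafana` and `social-api` would be attributed to `social-api` alone, while an alert naming only `grafana` stays attributed to `grafana`.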

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note
social-api | Critical | 2 | 2026-06-12 09:19 | Seen today | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (1) | None | No thread note
docgen2-api | Critical | 1 | 2026-06-12 10:21 | Seen today | TraefikServiceHighErrorRate (1) | None | No thread note
accommodations-api | Critical | 1 | 2026-06-09 08:15 | Seen this week | TraefikServiceHighErrorRate (1) | Observability storage
*Alert:* `TraefikServiceHighErrorRate` (critical) | *When:* 2026-06-09 08:15 Europe/Bucharest | | *Service:* `accommodations-api` | *TL;DR:…
grafana | Critical | 21 | 2026-06-12 23:34 | Recent but likely noise | PlatformLatencyP95Critical1s (2), PagerDutyDeploySmoketest (1), PDDebugTest (1), KubeContainerMemoryHigh (7), NodeSystemSaturation (3) | Release / migration issue, Alert tuning / noise
*Alert:* `PlatformLatencyP95Critical1s` (critical) | *When:* 2026-06-12 09:22 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana…
ai-api | Warning | 13 | 2026-06-12 19:17 | Seen today | TraefikServiceHighLatency (13) | None | No thread note
notifications-event-manager | Warning | 2 | 2026-06-12 10:55 | Seen today | KubeHpaMaxedOut (2) | None | No thread note
web | Warning | 2 | 2026-06-12 10:16 | Seen today | KubeDeploymentReplicasMismatch (2) | None | No thread note
rooms-api | Warning | 1 | 2026-06-12 09:19 | Seen today | TraefikServiceHighLatency (1) | None | No thread note
subscriptions-api | Warning | 1 | 2026-06-12 09:19 | Seen today | TraefikServiceHighLatency (1) | None | No thread note
uni-api | Warning | 1 | 2026-06-12 09:19 | Seen today | TraefikServiceHighLatency (1) | None | No thread note
web-80 | Warning | 1 | 2026-06-12 09:19 | Seen today | TraefikServiceHighLatency (1) | None | No thread note
admission-api | Warning | 2 | 2026-06-10 11:36 | Recent (72h) | TraefikServiceHighLatency (2) | None | No thread note
metrics-server | Warning | 12 | 2026-06-08 04:25 | Seen this week | KubeAggregatedAPIDown (12) | None | No thread note
Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note
TraefikServiceHighErrorRate | Critical | 3 | 2026-06-12 10:21 | Seen today | 1 | social-api (1), accommodations-api (1), docgen2-api (1) | Observability storage
*Alert:* `TraefikServiceHighErrorRate` (critical) | *When:* 2026-06-09 08:15 Europe/Bucharest | | *Service:* `accommodations-api` | *TL;DR:…
PlatformLatencyP95Critical1s | Critical | 2 | 2026-06-12 10:27 | Seen today | 1 | grafana (2) | Release / migration issue
*Alert:* `PlatformLatencyP95Critical1s` (critical) | *When:* 2026-06-12 09:22 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana…
PagerDutyDeploySmoketest | Critical | 1 | 2026-06-09 08:56 | Seen this week | 1 | grafana (1) | Release / migration issue
*Alert:* `PagerDutyDeploySmoketest` (critical) | *When:* 2026-06-09 08:56 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana` is…
PDDebugTest | Critical | 1 | 2026-06-09 09:27 | Likely noise / resolved | 1 | grafana (1) | Alert tuning / noise
*Alert:* `PDDebugTest` (critical) | *When:* 2026-06-09 09:27 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana` is the alert ho…
TraefikServiceHighLatency | Warning | 15 | 2026-06-12 19:17 | Seen today | 0 | ai-api (13), admission-api (2), rooms-api (1), social-api (1), subscriptions-api (1) | None
KubeContainerMemoryHigh | Warning | 7 | 2026-06-12 15:39 | Seen today | 0 | grafana (7) | None
NodeSystemSaturation | Warning | 3 | 2026-06-12 23:18 | Seen today | 0 | grafana (3) | None
KubeHpaMaxedOut | Warning | 2 | 2026-06-12 10:55 | Seen today | 0 | notifications-event-manager (2) | None
KubeDeploymentReplicasMismatch | Warning | 2 | 2026-06-12 10:16 | Seen today | 0 | web (2) | None
TargetDown | Warning | 2 | 2026-06-12 10:15 | Seen today | 0 | grafana (2) | None
NodeDiskIOSaturation | Warning | 1 | 2026-06-12 23:34 | Seen today | 0 | grafana (1) | None
PlatformLatencyP99Warning5s | Warning | 1 | 2026-06-12 09:28 | Seen today | 0 | grafana (1) | None
PlatformLatencyP95Warning400ms | Warning | 1 | 2026-06-12 09:22 | Seen today | 0 | grafana (1) | None
SlackHealthCheck20260610 | Warning | 1 | 2026-06-10 10:53 | Recent (72h) | 0 | grafana (1) | None
KubeAggregatedAPIDown | Warning | 12 | 2026-06-08 04:25 | Seen this week | 0 | metrics-server (12) | None
SlackPipelineTest | Warning | 1 | 2026-06-09 13:06 | Seen this week | 0 | grafana (1) | None

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
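
A recency heuristic of this kind can be sketched as follows. The report does not state its actual rule, so the 24-hour and 72-hour cutoffs, the choice of window end as the reference time, and the function name `recency_status` are all assumptions; they happen to be consistent with the rows shown in this briefing. The "likely noise" labels are not modelled here.

```python
from datetime import datetime, timedelta


def recency_status(last_seen, window_end):
    """Map an alert family's last-seen time to a recency label.

    Assumed cutoffs (not confirmed by the report): 24h and 72h before
    the end of the reporting window.
    """
    age = window_end - last_seen
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    return "Seen this week"


# Reporting window end taken from the report header.
window_end = datetime(2026, 6, 13, 7, 0)
label = recency_status(datetime(2026, 6, 10, 10, 53), window_end)
```

A label from this function only describes how recently the family fired; as the note above says, it does not show whether the underlying issue is resolved.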

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-mysql-catalog-memory-low | 1 | 1 | 0 | 0 | 2026-06-12 08:21 | 2026-06-12 08:21 | ALARM | Still alarming
adservio-root-account-usage | 6 | 3 | 3 | 5 | 2026-06-09 10:36 | 2026-06-09 14:51 | OK | Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
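
State flips can be counted by comparing each alarm email's state with the one before it. The sketch below is an assumption-laden illustration: the report shows only per-alarm totals, so the ordered state list, the function names, the "No data" label, and the `flap_threshold` parameter are hypothetical. The report's own cutoff for "Flapping" is not stated; the adservio-root-account-usage row shows "Latest OK" with 5 flips, so the real rule may differ from any threshold chosen here.

```python
def count_state_flips(states):
    """Count transitions between consecutive, differing alarm states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states, *, flap_threshold):
    """Label an alarm from its ordered email states ("ALARM" or "OK").

    `flap_threshold` is a hypothetical cutoff; the report does not
    document the value it uses.
    """
    if not states:
        return "No data"  # hypothetical label for an empty history
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_state_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


# An alternating sequence matching 6 emails, 3 ALARM, 3 OK, 5 flips.
history = ["ALARM", "OK", "ALARM", "OK", "ALARM", "OK"]
flips = count_state_flips(history)
```

Making the threshold an explicit keyword argument keeps the classification choice visible to whoever tunes it.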

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal | Key Notes
2026-06-12 09:22 | PlatformLatencyP95Critical1s | Critical | grafana | Release / migration issue
*Alert:* `PlatformLatencyP95Critical1s` (critical) | *When:* 2026-06-12 09:22 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana…
2026-06-09 09:27 | PDDebugTest | Critical | grafana | Alert tuning / noise
*Alert:* `PDDebugTest` (critical) | *When:* 2026-06-09 09:27 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana` is the alert ho…
2026-06-09 08:56 | PagerDutyDeploySmoketest | Critical | grafana | Release / migration issue
*Alert:* `PagerDutyDeploySmoketest` (critical) | *When:* 2026-06-09 08:56 Europe/Bucharest | | *Service:* `grafana` | *TL;DR:* `grafana` is…
2026-06-09 08:15 | TraefikServiceHighErrorRate | Critical | accommodations-api | Observability storage
*Alert:* `TraefikServiceHighErrorRate` (critical) | *When:* 2026-06-09 08:15 Europe/Bucharest | | *Service:* `accommodations-api` | *TL;DR:…