Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-05-09 15:35 for 2026-05-02 07:00 to 2026-05-09 10:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 5)
Impacted services: 13 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 0 (still alarming at window end)
Latest observed signal: 2026-05-09 00:00 (most recent cross-source activity)

Executive summary

What needs attention

Bottom line: Pingdom observed 5 recent customer-facing down/slow events, none confirmed by inbox email, and two application services (accommodations-api, uni-api) raised critical TraefikServiceHighErrorRate alerts this week.

Pingdom customer impact

External signal
No critical. Active: 1. Total seen: 1.

1 active item(s) in this window.

Slack impacted services

Application signal
2 critical. Active: 11. Total seen: 11.

2 critical, 9 non-critical active item(s).

AWS alarms

Infrastructure signal
No critical. Active: 3. Total seen: 3.

3 active item(s) in this window.

What to do next

  1. Next: Use Pingdom check https://www.adservio.ro/api/v2/status as churn/degradation evidence, and escalate to customer-facing incident priority only after inbox confirmation or strong cross-source correlation.

    Executive recommendation
  2. Then: Treat critical services accommodations-api and uni-api as the primary application investigation set. Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on both services, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.

    Executive recommendation
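
The escalation rule in the first recommendation can be sketched as a small triage function. This is a minimal illustration, not the dashboard's own logic: the `CheckEvidence` fields, the returned labels, and the threshold for "strong" cross-source correlation are all assumptions, since the report does not define them.

```python
from dataclasses import dataclass


@dataclass
class CheckEvidence:
    """Evidence for one Pingdom check (all field names are hypothetical)."""
    pingdom_events: int          # down/slow events observed by Pingdom
    email_confirmed: bool        # True if an inbox alert confirmed an event
    correlated_sources: int = 0  # other sources (Slack, AWS) with overlapping alerts


# Assumed threshold for "strong" correlation; the report does not specify one.
STRONG_CORRELATION_MIN_SOURCES = 2


def triage(evidence: CheckEvidence) -> str:
    """Classify a check as no-action, degradation evidence, or customer incident."""
    if evidence.pingdom_events == 0:
        return "no-action"
    if (evidence.email_confirmed
            or evidence.correlated_sources >= STRONG_CORRELATION_MIN_SOURCES):
        return "customer-incident"
    # Observed by Pingdom only: keep as churn/degradation evidence.
    return "degradation-evidence"
```

Under these assumptions, the status check in this report (5 Pingdom events, no inbox confirmation, no correlated evidence listed) would stay at "degradation-evidence".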
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

[Chart: Application + Infrastructure Alerts by Day]

[Chart: Pingdom latency + downtime by source]

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence
https://www.adservio.ro/api/v2/status | Recovered recently | 5 | 5m | 2026-05-09 00:00 | adservio-ro-api-v2-status | -
Adservio Ro | No recent customer-visible issue | 0 | 0m | 2026-05-09 00:00 | adservio-ro | -

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
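
The attribution rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the idea that alert text has already been parsed into a list of named targets are hypothetical, and only the three observability components named in the text are treated specially.

```python
# Components the report treats as observability tooling rather than workloads.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(named_targets):
    """Return the targets an alert is attributed to, preserving input order.

    Observability components are dropped only when at least one more
    specific target is also named; otherwise they are kept as-is.
    """
    specific = [t for t in named_targets
                if t.lower() not in OBSERVABILITY_COMPONENTS]
    return specific if specific else list(named_targets)
```

For example, an alert naming both `loki` and `uni-api` would be attributed to `uni-api` alone, while an alert naming only `grafana` would still be attributed to `grafana`.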

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note
accommodations-api | Critical | 8 | 2026-05-04 15:06 | Seen this week | TraefikServiceHighErrorRate (8) | None | No thread note
uni-api | Critical | 4 | 2026-05-04 12:07 | Seen this week | TraefikServiceHighErrorRate (4) | Observability storage | uni-api is throwing 500s because the same academic-structure save is being submitted twice; second insert hits a unique-key constraint (stud…
accommodations-sync-users (grouped 2 variants; variant mentions 15; active variants 2) | Warning | 15 | 2026-05-04 19:21 | Seen this week | KubeJobFailed (15) | None | No thread note
colecteaza-sms-note-abs | Warning | 15 | 2026-05-04 19:21 | Seen this week | KubeJobFailed (15) | None | No thread note
minicrm-sync | Warning | 15 | 2026-05-04 19:21 | Seen this week | KubeJobFailed (15) | None | No thread note
metrics-server | Warning | 15 | 2026-05-04 18:19 | Seen this week | KubeAggregatedAPIDown (15) | None | No thread note
rooms-api | Warning | 12 | 2026-05-03 07:14 | Seen this week | KubePodCrashLooping (6), KubeDeploymentReplicasMismatch (6) | None | No thread note
social-api | Warning | 6 | 2026-05-04 12:39 | Seen this week | TraefikServiceHighLatency (6) | General investigation | Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s)…
docgen2-api | Warning | 2 | 2026-05-04 12:17 | Seen this week | KubeHpaMaxedOut (2) | None | No thread note
grafana | Warning | 1 | 2026-05-04 13:12 | Seen this week | KubeCPUOvercommit (1) | None | No thread note
ai-api | Warning | 1 | 2026-05-04 10:00 | Seen this week | TraefikServiceHighLatency (1) | None | No thread note
Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note
TraefikServiceHighErrorRate | Critical | 12 | 2026-05-04 15:06 | Seen this week | 1 | accommodations-api (8), uni-api (4) | Observability storage | uni-api is throwing 500s because the same academic-structure save is being submitted twice; second insert hits a unique-key constraint (stud…
KubeJobFailed | Warning | 15 | 2026-05-04 19:21 | Seen this week | 0 | accommodations-sync-users (15), colecteaza-sms-note-abs (15), minicrm-sync (15) | None | -
KubeAggregatedAPIDown | Warning | 15 | 2026-05-04 18:19 | Seen this week | 0 | metrics-server (15) | None | -
TraefikServiceHighLatency | Warning | 7 | 2026-05-04 12:39 | Seen this week | 1 | social-api (6), ai-api (1) | General investigation | Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s)…
KubePodCrashLooping | Warning | 6 | 2026-05-03 07:14 | Seen this week | 0 | rooms-api (6) | None | -
KubeDeploymentReplicasMismatch | Warning | 6 | 2026-05-03 07:09 | Seen this week | 0 | rooms-api (6) | None | -
KubeHpaMaxedOut | Warning | 2 | 2026-05-04 12:17 | Seen this week | 0 | docgen2-api (2) | None | -
KubeCPUOvercommit | Warning | 1 | 2026-05-04 13:12 | Seen this week | 0 | grafana (1) | None | -

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
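
A recency-based status label of this kind might be derived as in the sketch below. The age thresholds and the "Not seen recently" fallback label are assumptions for illustration; the report does not state how the dashboard actually computes its status values.

```python
from datetime import datetime, timedelta


def heuristic_status(last_seen: datetime, window_end: datetime) -> str:
    """Map an alert family's last-seen time to a recency label.

    Thresholds are assumed: same calendar day, within 2 days, within 7 days.
    The label says how recently the alert appeared, not whether it is resolved.
    """
    age = window_end - last_seen
    if last_seen.date() == window_end.date():
        return "Seen today"
    if age <= timedelta(days=2):
        return "Recent"
    if age <= timedelta(days=7):
        return "Seen this week"
    return "Not seen recently"  # hypothetical fallback label
```

With a window end of 2026-05-09 10:00, an alert last seen on 2026-05-04 15:06 falls under "Seen this week" with these assumed thresholds.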

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-mysql-catalog-disk-queue-high | 2 | 1 | 1 | 1 | 2026-05-06 12:20 | 2026-05-06 12:30 | OK | Latest OK
adservio-rds-mysql-catalog-write-latency-high | 2 | 1 | 1 | 1 | 2026-05-06 12:23 | 2026-05-06 12:26 | OK | Latest OK
adservio-rds-mysql-catalog-storage-low | 2 | 1 | 1 | 1 | 2026-05-06 12:25 | 2026-05-06 12:26 | OK | Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
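
The distinction between "Latest OK" and "Flapping, latest OK" can be sketched by counting state flips in the ordered alarm emails. This is an illustrative sketch only: the flip threshold, the "In ALARM" and "No data" labels, and the assumption that emails carry only ALARM or OK states are not taken from the report.

```python
# Assumed minimum number of state flips before an alarm counts as flapping.
FLAPPING_MIN_FLIPS = 3


def count_state_flips(states):
    """Count transitions between consecutive, differing states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states):
    """Classify an alarm from its chronologically ordered email states.

    Assumes every entry is either "ALARM" or "OK".
    """
    if not states:
        return "No data"   # hypothetical label
    if states[-1] == "ALARM":
        return "In ALARM"  # hypothetical label
    if count_state_flips(states) >= FLAPPING_MIN_FLIPS:
        return "Flapping, latest OK"
    return "Latest OK"
```

Each alarm in the table above has two emails and one state flip, so a sequence such as `["ALARM", "OK"]` yields "Latest OK" under this threshold.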

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal | Key Notes
2026-05-04 12:39 | TraefikServiceHighLatency | Warning | social-api | General investigation | Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s)…
2026-05-04 11:28 | TraefikServiceHighErrorRate | Critical | uni-api | Observability storage | uni-api is throwing 500s because the same academic-structure save is being submitted twice; second insert hits a unique-key constraint (stud…