Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-05-05 01:00 for the window 2026-04-25 07:00 to 2026-05-05 12:00, from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 9)
Impacted services: 17 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 0 (still alarming at window end)
Latest observed signal: 2026-05-05 00:00 (most recent cross-source activity)
Executive summary

What needs attention

Bottom line: Pingdom observed recent customer-facing glitches that no inbox alert has confirmed, and critical application-level alerts are active.

Pingdom customer impact

External signal
No critical, Active: 1, Total seen: 1

1 active item(s) in this window.

Slack impacted services

Application signal
5 critical, Active: 14, Total seen: 15

5 critical, 9 non-critical active item(s).

AWS alarms

Infrastructure signal
No critical, Active: 1, Total seen: 2

1 active item(s) in this window.

What to do next

  1. Next: Use Pingdom check https://www.adservio.ro/api/v2/status as churn/degradation evidence, and escalate to customer-facing incident priority only after inbox confirmation or strong cross-source correlation. (Executive recommendation)
  2. Then: Treat critical services accommodations-api, web-80, uni-api, core-grafana-80 as the primary application investigation set. Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on those four services, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten. (Executive recommendation)
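The escalation rule in step 1 can be sketched as a small Python check. This is a minimal illustration, not the dashboard's actual logic: the `PingdomEvidence` type, the `incident_priority` function, and the `min_correlated_sources` threshold of 2 are all hypothetical, because the report does not define what counts as "strong" cross-source correlation.

```python
from dataclasses import dataclass


@dataclass
class PingdomEvidence:
    """Evidence gathered for one Pingdom check (hypothetical structure)."""
    check_url: str
    inbox_confirmed: bool     # an inbox alert confirmed the down/slow event
    correlated_sources: int   # other sources (Slack, AWS) with a matching signal


def incident_priority(evidence: PingdomEvidence, min_correlated_sources: int = 2) -> str:
    """Escalate only after inbox confirmation or strong cross-source correlation."""
    # Assumption: "strong" correlation means at least `min_correlated_sources` sources agree.
    strong_correlation = evidence.correlated_sources >= min_correlated_sources
    if evidence.inbox_confirmed or strong_correlation:
        return "customer-facing incident"
    # Otherwise the check stays as churn/degradation evidence.
    return "churn/degradation evidence"


# The status check from this report: no inbox confirmation; correlation count assumed to be 0.
status_check = PingdomEvidence(
    check_url="https://www.adservio.ro/api/v2/status",
    inbox_confirmed=False,
    correlated_sources=0,
)
print(incident_priority(status_check))
```

Keeping the threshold as a parameter makes the "strong correlation" judgment explicit and easy to revisit.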
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Application + Infrastructure Alerts by Day

Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence
https://www.adservio.ro/api/v2/status | Recovered recently | 9 | 9m | 2026-05-05 00:00 | adservio-ro-api-v2-status | -
Adservio Ro | No recent customer-visible issue | 0 | 0m | 2026-05-05 00:00 | adservio-ro | -

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
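The attribution rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `attribute_alert` and the idea that candidate target names have already been extracted from the alert text are hypothetical; only the exclusion of Grafana, Loki, and Tempo when a more specific target exists comes from the description.

```python
# Components treated as observability tooling rather than impacted workloads.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_alert(named_targets: list[str]) -> list[str]:
    """Pick the impacted targets for one alert from the names found in its text."""
    specific = [t for t in named_targets if t.lower() not in OBSERVABILITY_COMPONENTS]
    # Observability components are dropped only when something more specific is present.
    return specific if specific else list(named_targets)


print(attribute_alert(["grafana", "uni-api"]))  # ['uni-api']
print(attribute_alert(["grafana"]))             # ['grafana']
```

The second call shows why grafana can still appear as an impacted resource in the table: when no more specific target is named, the observability component is kept.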

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note
accommodations-api | Critical | 10 | 2026-05-04 15:06 | Seen today | TraefikServiceHighErrorRate (10) | None | No thread note
uni-api | Critical | 7 | 2026-05-04 12:07 | Seen today | TraefikServiceHighErrorRate (7) | General investigation, Observability storage
    Latest thread note: uni-api is throwing 500s because the same academic-structure save is being submitted twice; second insert hits a unique-key constraint (stud…
web-80 | Critical | 10 | 2026-05-01 10:45 | Seen this week | TraefikServiceHighErrorRate (8), TraefikServiceHighLatency (2) | None | No thread note
grafana | Critical | 27 | 2026-05-04 13:12 | Recent but likely noise | KubeAPIErrorBudgetBurn (4), KubeCPUOvercommit (20), NodeSystemSaturation (1), KubeMemoryOvercommit (1), KubeClientErrors (1) | General investigation, Alert tuning / noise, Resource limits
    Latest thread note: Node 10.66.121.122 is at 79% CPU / 50% memory, no pressure conditions; load is 2.27 because 5 of the web pods landed on the same node plus s… | Will clear when traffic dies down or HPA rebalances | Stabilized: load on 10.66.121.122 dropped from 2.27 to 1.68 per core (15m avg), CPU from 79% to 55%. HPA scaled down + web…
core-grafana-80 | Critical | 1 | 2026-04-25 20:57 | No recent signal | TraefikServiceHighErrorRate (1) | None | No thread note
metrics-server | Warning | 43 | 2026-05-04 18:19 | Seen today | KubeAggregatedAPIDown (43) | General investigation
    Latest thread note: False alarm: metrics-server pod is healthy (1/1 Running, 0 restarts) and kubectl top nodes works. Alert was a brief flap during a node rotat…
minicrm-sync | Warning | 23 | 2026-05-04 19:21 | Seen today | KubeJobFailed (23) | None | No thread note
colecteaza-sms-note-abs | Warning | 21 | 2026-05-04 19:21 | Seen today | KubeJobFailed (21) | None | No thread note
accommodations-sync-users (Grouped 2 variants; Variant mentions: 19; Active variants: 2) | Warning | 19 | 2026-05-04 19:21 | Seen today | KubeJobFailed (19) | None | No thread note
social-api | Warning | 10 | 2026-05-04 12:39 | Seen today | TraefikServiceHighLatency (10) | General investigation
    Latest thread note: Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s)…
docgen2-api | Warning | 5 | 2026-05-04 12:17 | Seen today | KubeHpaMaxedOut (5) | Resource limits
    Latest thread note: HPA already scaled back down (now 4/6). Pods are barely using CPU, so the scale-up was driven by a KEDA cron trigger, not real load.
rooms-api | Warning | 26 | 2026-05-03 07:14 | Recent (72h) | KubePodCrashLooping (13), KubeDeploymentReplicasMismatch (13) | General investigation
    Latest thread note: danny Valentin Pal something is breaking on the Tuiasi cluster which i am not able to connect | | i would need a kubeconfig for tuiasi and…
ai-api | Warning | 7 | 2026-05-04 10:00 | Recent (72h) | TraefikServiceHighLatency (7) | None | No thread note
core-scheduled-events-worker | Warning | 5 | 2026-05-01 10:19 | Seen this week | KubePodCrashLooping (3), KubeDeploymentReplicasMismatch (2) | General investigation
    Latest thread note: danny Valentin Pal something is breaking on the Tuiasi cluster which i am not able to connect | | i would need a kubeconfig for tuiasi and…
subscriptions-api | Warning | 2 | 2026-04-28 09:11 | No recent signal | TraefikServiceHighLatency (2) | None | No thread note
Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note
TraefikServiceHighErrorRate | Critical | 26 | 2026-05-04 15:06 | Seen today | 2 | accommodations-api (10), web-80 (8), uni-api (7), core-grafana-80 (1) | General investigation, Observability storage
    Latest thread note: uni-api is throwing 500s because the same academic-structure save is being submitted twice; second insert hits a unique-key constraint (stud…
KubeAPIErrorBudgetBurn | Critical | 3 | 2026-04-27 12:48 | Likely noise / resolved | 2 | grafana (4) | General investigation, Alert tuning / noise
    Latest thread note: Nothing is down: apiserver healthy (/readyz ok), all pods Running, HPAs scaled back down (web 9, docgen2 4). Alert is the rolling 1h SLO win… | you can silence it in alertmanager | Okay i will try a bit later thanks for the info
KubeAggregatedAPIDown | Warning | 43 | 2026-05-04 18:19 | Seen today | 1 | metrics-server (43) | General investigation
    Latest thread note: False alarm: metrics-server pod is healthy (1/1 Running, 0 restarts) and kubectl top nodes works. Alert was a brief flap during a node rotat…
KubeJobFailed | Warning | 23 | 2026-05-04 19:21 | Seen today | 0 | minicrm-sync (23), colecteaza-sms-note-abs (21), accommodations-sync-users (19) | None
KubeCPUOvercommit | Warning | 20 | 2026-05-04 13:12 | Seen today | 0 | grafana (20) | None
TraefikServiceHighLatency | Warning | 18 | 2026-05-04 12:39 | Seen today | 1 | social-api (10), ai-api (7), subscriptions-api (2), web-80 (2) | General investigation
    Latest thread note: Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s)…
KubeHpaMaxedOut | Warning | 5 | 2026-05-04 12:17 | Seen today | 1 | docgen2-api (5) | Resource limits
    Latest thread note: HPA already scaled back down (now 4/6). Pods are barely using CPU, so the scale-up was driven by a KEDA cron trigger, not real load.
KubePodCrashLooping | Warning | 14 | 2026-05-03 07:14 | Recent (72h) | 1 | rooms-api (13), core-scheduled-events-worker (3) | General investigation
    Latest thread note: danny Valentin Pal something is breaking on the Tuiasi cluster which i am not able to connect | | i would need a kubeconfig for tuiasi and…
KubeDeploymentReplicasMismatch | Warning | 13 | 2026-05-03 07:09 | Recent (72h) | 0 | rooms-api (13), core-scheduled-events-worker (2) | None
NodeSystemSaturation | Warning | 1 | 2026-04-30 12:30 | Seen this week | 1 | grafana (1) | Resource limits
    Latest thread note: Node 10.66.121.122 is at 79% CPU / 50% memory, no pressure conditions; load is 2.27 because 5 of the web pods landed on the same node plus s… | Will clear when traffic dies down or HPA rebalances | Stabilized: load on 10.66.121.122 dropped from 2.27 to 1.68 per core (15m avg), CPU from 79% to 55%. HPA scaled down + web…
KubeMemoryOvercommit | Warning | 1 | 2026-04-29 09:26 | Seen this week | 0 | grafana (1) | None
KubeAPIErrorBudgetBurn | Warning | 1 | 2026-04-27 15:37 | Likely noise / resolved | 2 | grafana (4) | General investigation, Alert tuning / noise
    Latest thread note: Nothing is down: apiserver healthy (/readyz ok), all pods Running, HPAs scaled back down (web 9, docgen2 4). Alert is the rolling 1h SLO win… | you can silence it in alertmanager | Okay i will try a bit later thanks for the info
KubeClientErrors | Warning | 1 | 2026-04-27 12:09 | No recent signal | 1 | grafana (1) | General investigation
    Latest thread note: apiserver is healthy (/readyz ok). The 2% error rate is from admission webhook timeouts during the HPA scaling burst (web 11 replicas, docge…

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
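The recency part of this heuristic can be sketched in Python as follows. The thresholds (24 hours, 72 hours, 7 days) and the choice of the window end as the reference time are assumptions inferred from the rows shown in this report, not documented behavior, and the function name `recency_status` is hypothetical. The sketch does not model the noise-related labels ("Recent but likely noise", "Likely noise / resolved"), which appear to depend on thread discussion rather than timestamps.

```python
from datetime import datetime, timedelta

# End of the reporting window stated in the report header.
WINDOW_END = datetime(2026, 5, 5, 12, 0)


def recency_status(last_seen: str, reference: datetime = WINDOW_END) -> str:
    """Map a 'Last Seen' timestamp to a recency label (assumed thresholds)."""
    age = reference - datetime.strptime(last_seen, "%Y-%m-%d %H:%M")
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    if age <= timedelta(days=7):
        return "Seen this week"
    return "No recent signal"


# Timestamps taken from rows in this report.
print(recency_status("2026-05-04 12:07"))  # uni-api
print(recency_status("2026-05-04 10:00"))  # ai-api
print(recency_status("2026-05-01 10:45"))  # web-80
print(recency_status("2026-04-28 09:11"))  # subscriptions-api
```

Under these assumptions the four calls return "Seen today", "Recent (72h)", "Seen this week", and "No recent signal", matching the statuses shown for those services.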

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-mysql-catalog-disk-queue-high | 14 | 7 | 7 | 13 | 2026-04-27 09:13 | 2026-04-30 10:21 | OK | Flapping, latest OK
adservio-rds-postgres-billing-cpu-high | 2 | 1 | 1 | 1 | 2026-05-01 10:12 | 2026-05-01 10:13 | OK | Latest OK

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
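A minimal Python sketch of how state flips and the flapping label could be derived from a sequence of alarm emails. The function names and the `flapping_threshold` of 3 flips are assumptions for illustration; the report only says the alarm "toggled repeatedly". The example sequences are reconstructed from the counts in the table: 14 emails with 13 flips and a latest state of OK can only be a strict ALARM/OK alternation.

```python
def count_state_flips(states: list[str]) -> int:
    """Count transitions between consecutive alarm states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states: list[str], flapping_threshold: int = 3) -> str:
    """Label an alarm family from its chronological email states."""
    if not states:
        raise ValueError("at least one alarm email is required")
    latest = states[-1]
    # Assumption: 'toggled repeatedly' means at least `flapping_threshold` flips.
    if count_state_flips(states) >= flapping_threshold:
        return f"Flapping, latest {latest}"
    return f"Latest {latest}"


catalog_disk_queue = ["ALARM", "OK"] * 7  # 14 emails: 7 ALARM, 7 OK
billing_cpu = ["ALARM", "OK"]             # 2 emails: 1 ALARM, 1 OK

print(count_state_flips(catalog_disk_queue), alarm_status(catalog_disk_queue))
print(count_state_flips(billing_cpu), alarm_status(billing_cpu))
```

With these inputs the sketch yields 13 flips and "Flapping, latest OK" for the catalog disk-queue alarm, and 1 flip and "Latest OK" for the billing CPU alarm, in line with the table.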

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal | Key Notes
2026-05-04 12:39 | TraefikServiceHighLatency | Warning | social-api | General investigation
    Key notes: Two outbound calls to web dominated: GET /api/v2/utilizatori/profile?cuPermisiuni=1&withRoles=1 (6.6s) and GET /api/v2/ani/scolari (6.4s)…
2026-05-04 11:28 | TraefikServiceHighErrorRate | Critical | uni-api | Observability storage
    Key notes: uni-api is throwing 500s because the same academic-structure save is being submitted twice; second insert hits a unique-key constraint (stud…
2026-05-01 06:14 | KubePodCrashLooping | Warning | core-scheduled-events-worker, rooms-api | General investigation
    Key notes: danny Valentin Pal something is breaking on the Tuiasi cluster which i am not able to connect | | i would need a kubeconfig for tuiasi and…
2026-04-30 12:30 | NodeSystemSaturation | Warning | grafana | Resource limits
    Key notes: Node 10.66.121.122 is at 79% CPU / 50% memory, no pressure conditions; load is 2.27 because 5 of the web pods landed on the same node plus s… | Will clear when traffic dies down or HPA rebalances | Stabilized: load on 10.66.121.122 dropped from 2.27 to 1.68 per core (15m avg), CPU from 79% to 55%. HPA scaled down + web…
2026-04-30 11:11 | TraefikServiceHighErrorRate | Critical | uni-api | General investigation
    Key notes: No infra issue: uni-api pod is healthy (1/1 Running, 0 restarts). The 6.67% errors are MySQL duplicate-key exceptions in GroupServiceImpl.sa… | Andrei Alexandru - can you please check? | | JDBC exception executing SQL [INSERT INTO studiu(...) VALUES(...,1511,1519)]
2026-04-27 12:48 | KubeAPIErrorBudgetBurn | Critical | grafana | Alert tuning / noise
    Key notes: Nothing is down: apiserver healthy (/readyz ok), all pods Running, HPAs scaled back down (web 9, docgen2 4). Alert is the rolling 1h SLO win… | you can silence it in alertmanager | Okay i will try a bit later thanks for the info
2026-04-27 12:09 | KubeClientErrors | Warning | grafana | General investigation
    Key notes: apiserver is healthy (/readyz ok). The 2% error rate is from admission webhook timeouts during the HPA scaling burst (web 11 replicas, docge…
2026-04-27 11:56 | KubeHpaMaxedOut | Warning | docgen2-api | Resource limits
    Key notes: HPA already scaled back down (now 4/6). Pods are barely using CPU, so the scale-up was driven by a KEDA cron trigger, not real load.
2026-04-27 11:53 | KubeAPIErrorBudgetBurn | Critical | grafana | General investigation
    Key notes: apiserver is healthy now. The alert was triggered by a brief spike in pod status updates ~15 min ago when several web pods went NotReady at…
2026-04-27 09:59 | KubeAggregatedAPIDown | Warning | metrics-server | General investigation
    Key notes: False alarm: metrics-server pod is healthy (1/1 Running, 0 restarts) and kubectl top nodes works. Alert was a brief flap during a node rotat…