Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-05-18 15:02 for 2026-05-09 07:00 to 2026-05-16 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 15)
Impacted services: 10 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 0 (still alarming at window end)
Latest observed signal: 2026-05-16 05:19 (most recent cross-source activity)

Executive summary

What needs attention

Bottom line: Pingdom observed recent customer-facing glitches (not confirmed by email), and one critical application-level alert is active on uni-api.

Pingdom customer impact

External signal
No critical | Active: 1 | Total seen: 1

1 active item(s) in this window.

Slack impacted services

Application signal
1 critical | Active: 8 | Total seen: 8

1 critical, 7 non-critical active item(s).

AWS alarms

Infrastructure signal
No critical | Active: 0 | Total seen: 0

No AWS alarm emails were captured in this window.

No active issue listed in this category.

What to do next

  1. Next: Use Pingdom check https://www.adservio.ro/api/v2/status as churn/degradation evidence, and escalate to customer-facing incident priority only after inbox confirmation or strong cross-source correlation.

    Executive recommendation
  2. Then: Treat the critical service uni-api as the primary application investigation target. Use TraefikServiceHighErrorRate as the leading category, reproduce the failing paths on uni-api, compare against the latest deploy or config change, and do not close the investigation until both error rate and latency flatten.

    Executive recommendation
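
The escalation rule in step 1 can be sketched as a small triage gate. This is an illustration only: the `PingdomFinding` type, the label strings, and the `STRONG_CORRELATION_SOURCES` threshold are assumptions, since the report does not define what counts as "strong" correlation. The sample values come from the Pingdom row for the status check, which shows 15 events, no inbox confirmation, and no correlated evidence.

```python
from dataclasses import dataclass

# Hypothetical threshold: number of other sources that must show related
# signal before correlation counts as "strong". Not defined by the report.
STRONG_CORRELATION_SOURCES = 2


@dataclass
class PingdomFinding:
    check: str
    events: int
    inbox_confirmed: bool    # an alert email confirmed the event
    correlated_sources: int  # other sources (Slack, AWS) with related signal


def priority(finding: PingdomFinding) -> str:
    """Return a triage label (hypothetical names) for one Pingdom check."""
    if finding.events == 0:
        return "no-action"
    if finding.inbox_confirmed or finding.correlated_sources >= STRONG_CORRELATION_SOURCES:
        return "customer-facing-incident"
    # Unconfirmed external signal: keep as evidence, do not escalate yet.
    return "degradation-evidence"


status_check = PingdomFinding(
    check="https://www.adservio.ro/api/v2/status",
    events=15,
    inbox_confirmed=False,
    correlated_sources=0,
)
print(priority(status_check))  # degradation-evidence
```

Keeping the threshold in one named constant makes the escalation policy easy to review and change once the team agrees on a definition.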
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

[Chart: Application + Infrastructure Alerts by Day]

[Chart: Pingdom latency + downtime by source]

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence
https://www.adservio.ro/api/v2/status | Recovered recently | 15 | 15m | 2026-05-15 11:49 | adservio-ro-api-v2-status | -
Adservio Ro | No recent customer-visible issue | 0 | 0m | 2026-05-16 00:00 | adservio-ro | -

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.
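
One way such a link could be drawn is by time proximity: flag services whose last Slack alert falls close to the Pingdom event. The sketch below assumes a one-hour window (`WINDOW` is hypothetical; the report does not state how correlation is computed) and uses last-seen times copied from this report. A match in time is only a hint for investigation, not proof of a shared cause.

```python
from datetime import datetime, timedelta

# Hypothetical correlation window; the report does not specify one.
WINDOW = timedelta(hours=1)


def correlate(pingdom_time, slack_last_seen, window=WINDOW):
    """Return services whose last Slack alert is within `window` of the Pingdom event."""
    return [
        service
        for service, seen in slack_last_seen
        if abs(seen - pingdom_time) <= window
    ]


# Last-seen time of the Pingdom status check.
pingdom_last_seen = datetime(2026, 5, 15, 11, 49)

# Last-seen times of three services from the Slack impacted-service view.
slack_last_seen = [
    ("uni-api", datetime(2026, 5, 15, 11, 17)),
    ("web-80", datetime(2026, 5, 15, 10, 19)),
    ("audit-worker", datetime(2026, 5, 16, 5, 19)),
]

print(correlate(pingdom_last_seen, slack_last_seen))  # ['uni-api']
```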

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
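
The attribution rule can be sketched as a filter over the names found in an alert. This is a minimal illustration: the function name and the input format (a list of names already extracted from the alert text) are assumptions, and the extraction step itself is out of scope.

```python
# Components treated as observability tooling, per the description above.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def impacted_targets(mentioned):
    """Drop observability components when a more specific target is also named."""
    names = [name.lower() for name in mentioned]
    specific = [n for n in names if n not in OBSERVABILITY_COMPONENTS]
    # Fall back to the observability component only when nothing else is named.
    return specific if specific else names


print(impacted_targets(["grafana", "uni-api"]))  # ['uni-api']
print(impacted_targets(["grafana"]))             # ['grafana']
```

The fallback branch is consistent with grafana still appearing as an impacted resource for node-level alerts that name no workload.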

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal
uni-api | Critical | 3 | 2026-05-15 11:17 | Seen today | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (2) | Release / migration issue, General investigation
    Latest thread note: no errors now pod is healthy | | we saw this error | | Service method CareerServiceImpl.saveCareer failed after Nms: | Invalid employment t…
grafana | Warning | 38 | 2026-05-15 15:02 | Seen today | NodeHighNumberConntrackEntriesUsed (23), NodeSystemSaturation (6), KubeCPUOvercommit (5), NodeCPUHighUsage (3), AlertmanagerFailedToSendAlerts (1) | Resource limits, General investigation
    Latest thread note: This has been resolved
audit-worker | Warning | 19 | 2026-05-16 05:19 | Seen today | KubeDeploymentReplicasMismatch (14), KubePodCrashLooping (5) | None
    Latest thread note: No thread note
ai-api | Warning | 11 | 2026-05-15 19:19 | Seen today | TraefikServiceHighLatency (11) | Release / migration issue
    Latest thread note: Uneven load distribution on the nodes . some are running at 100 percent some at around 10 to 15 | Web deployment is bin-packing onto 2 nodes because it has neither CPU limits | | container specs show limits for memory only (no cpu), and…
web-80 | Warning | 6 | 2026-05-15 10:19 | Seen today | TraefikServiceHighLatency (6) | Scaling config
    Latest thread note: 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods… | Suggesting two quick config changes on web to stop the recurring p95 latency alerts | | 1. Loosen readiness probe: failureThreshold: 4, per… | understood, good point
social-api | Warning | 11 | 2026-05-14 11:20 | Recent (72h) | TraefikServiceHighLatency (11) | Scaling config
    Latest thread note: 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods… | Suggesting two quick config changes on web to stop the recurring p95 latency alerts | | 1. Loosen readiness probe: failureThreshold: 4, per… | understood, good point
subscriptions-api | Warning | 1 | 2026-05-13 09:14 | Recent (72h) | TraefikServiceHighLatency (1) | None
    Latest thread note: No thread note
docgen2-api | Warning | 2 | 2026-05-12 13:48 | Seen this week | KubeHpaMaxedOut (2) | None
    Latest thread note: No thread note

Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal
TraefikServiceHighErrorRate | Critical | 1 | 2026-05-13 11:55 | Recent (72h) | 1 | uni-api (1) | General investigation
    Latest thread note: no errors now pod is healthy | | we saw this error | | Service method CareerServiceImpl.saveCareer failed after Nms: | Invalid employment t…
TraefikServiceHighLatency | Warning | 25 | 2026-05-15 19:19 | Seen today | 2 | social-api (11), ai-api (11), web-80 (6), uni-api (2), subscriptions-api (1) | Release / migration issue, Scaling config
    Latest thread note: 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods… | Suggesting two quick config changes on web to stop the recurring p95 latency alerts | | 1. Loosen readiness probe: failureThreshold: 4, per… | understood, good point
NodeHighNumberConntrackEntriesUsed | Warning | 23 | 2026-05-15 12:29 | Seen today | 1 | grafana (23) | General investigation
    Latest thread note: This has been resolved
KubeDeploymentReplicasMismatch | Warning | 14 | 2026-05-16 05:19 | Seen today | 0 | audit-worker (14) | None
NodeSystemSaturation | Warning | 6 | 2026-05-15 12:33 | Seen today | 0 | grafana (6) | None
KubePodCrashLooping | Warning | 5 | 2026-05-16 04:43 | Seen today | 0 | audit-worker (5) | None
KubeCPUOvercommit | Warning | 5 | 2026-05-15 15:02 | Seen today | 0 | grafana (5) | None
NodeCPUHighUsage | Warning | 3 | 2026-05-13 09:27 | Recent (72h) | 1 | grafana (3) | Resource limits
    Latest thread note: Node running at 98 percent CPU checking | | Web pods are the most resource dominant on the cluster | this is resolved for now , rollout restarted the web pods to replace them to the other nodes that was sitting idle to balance the load
KubeHpaMaxedOut | Warning | 2 | 2026-05-12 13:48 | Seen this week | 0 | docgen2-api (2) | None
AlertmanagerFailedToSendAlerts | Warning | 1 | 2026-05-12 10:35 | Seen this week | 0 | grafana (1) | None

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
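
The status labels are consistent with simple age buckets measured from the end of the reporting window (2026-05-16 07:00). The sketch below is an assumption about how they might be derived, not the dashboard's actual logic: the 24-hour cutoff and the choice of reference time are inferred from the rows in this report, and only the 72-hour cutoff appears in a label.

```python
from datetime import datetime, timedelta

# Assumed reference point: end of the reporting window.
WINDOW_END = datetime(2026, 5, 16, 7, 0)


def status_label(last_seen, now=WINDOW_END):
    """Bucket an alert family by the age of its most recent occurrence."""
    age = now - last_seen
    if age <= timedelta(hours=24):   # assumed cutoff for "Seen today"
        return "Seen today"
    if age <= timedelta(hours=72):   # cutoff named in the "Recent (72h)" label
        return "Recent (72h)"
    return "Seen this week"


print(status_label(datetime(2026, 5, 16, 5, 19)))   # Seen today
print(status_label(datetime(2026, 5, 13, 11, 55)))  # Recent (72h)
print(status_label(datetime(2026, 5, 12, 13, 48)))  # Seen this week
```

Because the label depends only on the last occurrence, it cannot distinguish an alert that has been fixed from one that is merely quiet, which is why the status is described as heuristic.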

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
(No rows: no AWS alarm emails were captured in this window.)

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
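
A minimal sketch of how a flapping status could be derived from the ordered states in an alarm's emails. The threshold `FLAP_THRESHOLD` and the labels other than "Flapping, latest OK" are hypothetical, and the sample sequences are illustrative only, since no AWS alarm emails were captured in this window.

```python
# Hypothetical: number of state flips at which an alarm counts as flapping.
FLAP_THRESHOLD = 2


def count_flips(states):
    """Count transitions between consecutive, differing states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states):
    """Summarize a chronological list of 'ALARM' / 'OK' email states."""
    if not states:
        return "No emails"  # hypothetical label
    latest = states[-1]
    if count_flips(states) >= FLAP_THRESHOLD:
        return f"Flapping, latest {latest}"
    # Hypothetical labels for the non-flapping cases.
    return "Active" if latest == "ALARM" else "Resolved"


print(alarm_status(["ALARM", "OK", "ALARM", "OK"]))  # Flapping, latest OK
print(alarm_status(["ALARM", "OK"]))                 # Resolved
print(alarm_status(["ALARM"]))                       # Active
```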

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal
2026-05-14 12:17 | NodeHighNumberConntrackEntriesUsed | Warning | grafana | General investigation
    Key notes: This has been resolved
2026-05-14 11:20 | TraefikServiceHighLatency | Warning | social-api, web-80 | Scaling config
    Key notes: 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods… | Suggesting two quick config changes on web to stop the recurring p95 latency alerts | | 1. Loosen readiness probe: failureThreshold: 4, per… | understood, good point
2026-05-13 11:55 | TraefikServiceHighErrorRate | Critical | uni-api | General investigation
    Key notes: no errors now pod is healthy | | we saw this error | | Service method CareerServiceImpl.saveCareer failed after Nms: | Invalid employment t…
2026-05-13 09:19 | TraefikServiceHighLatency | Warning | ai-api, uni-api | Release / migration issue
    Key notes: Uneven load distribution on the nodes . some are running at 100 percent some at around 10 to 15 | Web deployment is bin-packing onto 2 nodes because it has neither CPU limits | | container specs show limits for memory only (no cpu), and…
2026-05-13 08:52 | NodeCPUHighUsage | Warning | grafana | Resource limits
    Key notes: Node running at 98 percent CPU checking | | Web pods are the most resource dominant on the cluster | this is resolved for now , rollout restarted the web pods to replace them to the other nodes that was sitting idle to balance the load