Pingdom customer impact
External signal: 1 active item in this window.
Generated 2026-05-18 15:02 for 2026-05-09 07:00 to 2026-05-16 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.
Bottom line: Pingdom observed recent customer-facing glitches (not confirmed by AWS alarm emails), and a critical application-level alert is present.
1 active item(s) in this window.
1 critical, 7 non-critical active item(s).
No AWS alarm emails were captured in this window.
No active issue listed in this category.
| Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence |
|---|---|---|---|---|---|---|
| https://www.adservio.ro/api/v2/status | Recovered recently | 15 | 15m | 2026-05-15 11:49 | adservio-ro-api-v2-status | Pingdom-only evidence so far |
| Adservio Ro | No recent customer-visible issue | 0 | 0m | 2026-05-16 00:00 | adservio-ro | Pingdom-only evidence so far |
Pingdom rows show the externally visible signal first. The Correlated Evidence column ties a failing check back to services, Slack alert families, or AWS alarms when those links exist.
This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
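The attribution rule for this view can be sketched roughly as follows. This is a minimal illustration, not the report generator's actual code: the function name `attribute_targets`, the constant `OBSERVABILITY_COMPONENTS`, and the input shape (a list of names already extracted from the alert text) are all assumptions.

```python
# Hypothetical sketch of the attribution rule; all names are illustrative.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_targets(named_targets):
    """Return the targets an alert should be attributed to.

    named_targets: workload/resource names found in the alert text, in order.
    Observability components are dropped only when at least one more
    specific (non-observability) target is also named.
    """
    # Normalize and de-duplicate while preserving first-seen order.
    seen = []
    for name in named_targets:
        key = name.strip().lower()
        if key and key not in seen:
            seen.append(key)

    specific = [t for t in seen if t not in OBSERVABILITY_COMPONENTS]
    # Fall back to the observability component when nothing else is named.
    return specific if specific else seen
```

Under these assumptions, an alert naming both `grafana` and `uni-api` is attributed to `uni-api` only, while an alert naming just `grafana` stays attributed to `grafana`.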
| Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|
| uni-api | Critical | 3 | 2026-05-15 11:17 | Seen today | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (2) | Release / migration issue; General investigation | no errors now pod is healthy; we saw this error; Service method CareerServiceImpl.saveCareer failed after Nms: Invalid employment t… |
| grafana | Warning | 38 | 2026-05-15 15:02 | Seen today | NodeHighNumberConntrackEntriesUsed (23), NodeSystemSaturation (6), KubeCPUOvercommit (5), NodeCPUHighUsage (3), AlertmanagerFailedToSendAlerts (1) | Resource limits; General investigation | This has been resolved |
| audit-worker | Warning | 19 | 2026-05-16 05:19 | Seen today | KubeDeploymentReplicasMismatch (14)KubePodCrashLooping (5) | None | No thread note |
| ai-api | Warning | 11 | 2026-05-15 19:19 | Seen today | TraefikServiceHighLatency (11) | Release / migration issue | Uneven load distribution on the nodes. Some are running at 100 percent, some at around 10 to 15; Web deployment is bin-packing onto 2 nodes because it has neither CPU limits; container specs show limits for memory only (no cpu), and… |
| web-80 | Warning | 6 | 2026-05-15 10:19 | Seen today | TraefikServiceHighLatency (6) | Scaling config | 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods…; Suggesting two quick config changes on web to stop the recurring p95 latency alerts; 1. Loosen readiness probe: failureThreshold: 4, per…; understood, good point |
| social-api | Warning | 11 | 2026-05-14 11:20 | Recent (72h) | TraefikServiceHighLatency (11) | Scaling config | 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods…; Suggesting two quick config changes on web to stop the recurring p95 latency alerts; 1. Loosen readiness probe: failureThreshold: 4, per…; understood, good point |
| subscriptions-api | Warning | 1 | 2026-05-13 09:14 | Recent (72h) | TraefikServiceHighLatency (1) | None | No thread note |
| docgen2-api | Warning | 2 | 2026-05-12 13:48 | Seen this week | KubeHpaMaxedOut (2) | None | No thread note |
| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|---|
| TraefikServiceHighErrorRate | Critical | 1 | 2026-05-13 11:55 | Recent (72h) | 1 | uni-api (1) | General investigation | no errors now pod is healthy; we saw this error; Service method CareerServiceImpl.saveCareer failed after Nms: Invalid employment t… |
| TraefikServiceHighLatency | Warning | 25 | 2026-05-15 19:19 | Seen today | 2 | social-api (11), ai-api (11), web-80 (6), uni-api (2), subscriptions-api (1) | Release / migration issue; Scaling config | 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods…; Suggesting two quick config changes on web to stop the recurring p95 latency alerts; 1. Loosen readiness probe: failureThreshold: 4, per…; understood, good point |
| NodeHighNumberConntrackEntriesUsed | Warning | 23 | 2026-05-15 12:29 | Seen today | 1 | grafana (23) | General investigation | This has been resolved |
| KubeDeploymentReplicasMismatch | Warning | 14 | 2026-05-16 05:19 | Seen today | 0 | audit-worker (14) | None | |
| NodeSystemSaturation | Warning | 6 | 2026-05-15 12:33 | Seen today | 0 | grafana (6) | None | |
| KubePodCrashLooping | Warning | 5 | 2026-05-16 04:43 | Seen today | 0 | audit-worker (5) | None | |
| KubeCPUOvercommit | Warning | 5 | 2026-05-15 15:02 | Seen today | 0 | grafana (5) | None | |
| NodeCPUHighUsage | Warning | 3 | 2026-05-13 09:27 | Recent (72h) | 1 | grafana (3) | Resource limits | Node running at 98 percent CPU checking; Web pods are the most resource dominant on the cluster; this is resolved for now, rollout restarted the web pods to replace them to the other nodes that was sitting idle to balance the load |
| KubeHpaMaxedOut | Warning | 2 | 2026-05-12 13:48 | Seen this week | 0 | docgen2-api (2) | None | |
| AlertmanagerFailedToSendAlerts | Warning | 1 | 2026-05-12 10:35 | Seen this week | 0 | grafana (1) | None | |
Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
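One way such a recency label could be derived is sketched below. The 72-hour cut-off comes from the "Recent (72h)" label itself; the 24-hour cut-off for "Seen today", the use of the window end as the reference time, and the function name `recency_status` are assumptions. They are consistent with the rows in this report but are not confirmed by it.

```python
from datetime import datetime, timedelta


def recency_status(last_seen, reference):
    """Map an alert family's last-seen time to a recency label.

    Assumed thresholds: up to 24h old is "Seen today", up to 72h old is
    "Recent (72h)", anything older within the window is "Seen this week".
    The label says nothing about whether the alert is resolved.
    """
    age = reference - last_seen
    if age <= timedelta(hours=24):
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    return "Seen this week"


if __name__ == "__main__":
    window_end = datetime(2026, 5, 16, 7, 0)  # assumed reference time
    print(recency_status(datetime(2026, 5, 15, 10, 19), window_end))
```

Because the label depends only on the last time an alert family appeared, a family that fired once and then went quiet receives the same label as one that is still firing.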
| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
|---|---|---|---|---|---|---|---|---|
“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
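A flapping check of this kind could look roughly like the sketch below. Only the "Flapping, latest OK" label appears in this report; the other labels, the `flap_threshold` of 3 state flips, and the function names are hypothetical choices made for illustration.

```python
def count_state_flips(states):
    """Count transitions between consecutive alarm states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def alarm_status(states, flap_threshold=3):
    """Classify an alarm from its chronological email states.

    states: list of "ALARM" / "OK" strings, oldest first.
    flap_threshold: assumed minimum number of flips to call it flapping.
    """
    if not states:
        return "No emails"  # hypothetical label
    latest = states[-1]
    if count_state_flips(states) >= flap_threshold:
        # Repeated toggling is flagged even when the latest email is OK.
        return f"Flapping, latest {latest}"
    # Hypothetical labels for the non-flapping cases.
    return "Resolved" if latest == "OK" else "Active"
```

Checking the flip count before the latest state is what keeps a repeatedly toggling alarm from being reported as simply resolved.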
| Thread Date | Alert | Severity | Services | Signal | Key Notes |
|---|---|---|---|---|---|
| 2026-05-14 12:17 | NodeHighNumberConntrackEntriesUsed | Warning | grafana | General investigation | This has been resolved |
| 2026-05-14 11:20 | TraefikServiceHighLatency | Warning | social-api, web-80 | Scaling config | 12 min ago, 15+ web pod probe failures fired simultaneously (readiness /ready timeouts + php-fpm-healthcheck 10s timeouts). All 6+ web pods…; Suggesting two quick config changes on web to stop the recurring p95 latency alerts; 1. Loosen readiness probe: failureThreshold: 4, per…; understood, good point |
| 2026-05-13 11:55 | TraefikServiceHighErrorRate | Critical | uni-api | General investigation | no errors now pod is healthy; we saw this error; Service method CareerServiceImpl.saveCareer failed after Nms: Invalid employment t… |
| 2026-05-13 09:19 | TraefikServiceHighLatency | Warning | ai-api, uni-api | Release / migration issue | Uneven load distribution on the nodes. Some are running at 100 percent, some at around 10 to 15; Web deployment is bin-packing onto 2 nodes because it has neither CPU limits; container specs show limits for memory only (no cpu), and… |
| 2026-05-13 08:52 | NodeCPUHighUsage | Warning | grafana | Resource limits | Node running at 98 percent CPU checking; Web pods are the most resource dominant on the cluster; this is resolved for now, rollout restarted the web pods to replace them to the other nodes that was sitting idle to balance the load |