SUMMARY
On August 27, 2024, from 16:30 UTC to 22:30 UTC, Support customers on Pods 19, 20, and 27 experienced delays in webhook and trigger firing, which impacted ticket updates and communication with end users.
Timeline
August 27, 2024 08:03 PM UTC | August 27, 2024 01:03 PM PT
We are investigating reports of delayed trigger and webhook firing in Support. Next update in 30 minutes or when we have new information to share.
August 27, 2024 08:27 PM UTC | August 27, 2024 01:27 PM PT
The webhook and trigger delays are impacting Support customers on pods 19, 20, and 27. Our engineers are currently engaged and investigating. Next update in 30 minutes or when we have new information to share.
August 27, 2024 08:56 PM UTC | August 27, 2024 01:56 PM PT
Our engineers continue to investigate the webhook and trigger delays impacting Support customers on pods 19, 20, and 27. Next update in 1 hour or when we have new information to share.
August 27, 2024 09:24 PM UTC | August 27, 2024 02:24 PM PT
We are seeing improvements to webhook delays on Pod 19 and continuing to work on processing the backlog of webhooks on pods 20 and 27. Next update in 1 hour or when we have new information to share.
August 27, 2024 10:03 PM UTC | August 27, 2024 03:03 PM PT
The backlog of webhooks on pods 19 and 20 has been fully processed, and there should no longer be any delays on those pods. We are still processing the backlog of webhooks on pod 27 and will provide an update once that backlog is clear.
August 27, 2024 10:40 PM UTC | August 27, 2024 03:40 PM PT
The backlog of webhooks on pods 19, 20, and 27 has been fully processed, and there should no longer be any delays on those pods. The issue is now fully resolved.
POST-MORTEM
Root Cause Analysis
The incident was primarily caused by a sudden surge in traffic due to a mass user import by a large customer. This surge resulted in the Webhooks system hitting its throughput limit, leading to significant delays. Additionally, in Pod 27, the autoscaling mechanism failed to adequately handle the increased traffic, further exacerbating the delays.
Resolution
To resolve the issue, the Webhooks dispatcher and the Untrusted Egress Zone (UEZ) were scaled up to absorb the traffic surge, and the customer was asked to slow down their import. Once the scaling adjustments were in place, the backlog began to decrease and normal service was gradually restored across all affected pods.
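The exact scaling mechanism is internal to Zendesk. As a rough illustration only, the sketch below shows one common way a dispatcher's worker pool can be sized from queue backlog; every name and threshold in it (desired_worker_count, TARGET_BACKLOG_PER_WORKER, the worker limits) is an assumption for illustration, not Zendesk's actual implementation.

```python
# Illustrative only: sizing a webhook dispatcher's worker pool from queue
# backlog. Names and thresholds are assumptions, not Zendesk internals.
import math

TARGET_BACKLOG_PER_WORKER = 500   # assumed: deliveries one worker can drain promptly
MIN_WORKERS = 4                   # assumed floor for steady-state traffic
MAX_WORKERS = 64                  # assumed ceiling to protect the egress tier


def desired_worker_count(backlog: int) -> int:
    """Return a worker count proportional to the outstanding backlog,
    clamped between the configured floor and ceiling."""
    if backlog <= 0:
        return MIN_WORKERS
    wanted = math.ceil(backlog / TARGET_BACKLOG_PER_WORKER)
    return max(MIN_WORKERS, min(MAX_WORKERS, wanted))


if __name__ == "__main__":
    # A 20,000-delivery backlog would call for 40 workers under these numbers.
    print(desired_worker_count(20_000))
```

The ceiling matters as much as the floor: scaling the dispatcher without bounding it could simply push the overload onto the downstream egress tier, which is where Pod 27 struggled.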
Remediation Items
- Define horizontal auto-scaling policies for Webhooks services. [IN PROGRESS]
- Investigate enhancing rate-limiting logic to account for a single customer with many subdomains (see the sketch after this list). [SCHEDULED]
- Investigate and fix the secure egress tier auto-scaling issue in Pod 27. [SCHEDULED]
- Streamline the deployment and configuration change process to reduce friction during emergency resolutions. [IN PROGRESS]
- Implement subdomain-specific kill switches for Webhooks. [IN PROGRESS]
- Add monitoring alerts to flag when the Webhooks backlog or delivery latency becomes too large. [SCHEDULED]
- Publicly document Webhooks rate limits to inform customers and preemptively manage traffic. [SCHEDULED]
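For the rate-limiting item above, the following is a minimal sketch of what account-level aggregation could look like. It assumes a hypothetical subdomain-to-account lookup table and an illustrative per-account budget; none of these names, limits, or data structures reflect Zendesk's actual Webhooks implementation.

```python
# Illustrative only: counting webhook deliveries against the owning account
# rather than each subdomain, so one customer with many subdomains shares a
# single budget. All names, limits, and the lookup table are assumptions.
import time
from collections import defaultdict

ACCOUNT_LIMIT_PER_MINUTE = 6_000        # assumed aggregate budget per account
WINDOW_SECONDS = 60

# Assumed subdomain -> owning-account lookup; in practice this would come
# from an account directory.
SUBDOMAIN_TO_ACCOUNT = {
    "acme-support": "acme",
    "acme-sales": "acme",
    "acme-eu": "acme",
}

# account -> (window start time, deliveries counted in that window)
_windows = defaultdict(lambda: (0.0, 0))


def allow_delivery(subdomain: str) -> bool:
    """Admit a delivery only if the owning account is under its
    fixed-window budget; otherwise signal the caller to defer it."""
    now = time.time()
    account = SUBDOMAIN_TO_ACCOUNT.get(subdomain, subdomain)
    window_start, count = _windows[account]
    if now - window_start >= WINDOW_SECONDS:
        window_start, count = now, 0        # start a fresh window
    if count >= ACCOUNT_LIMIT_PER_MINUTE:
        _windows[account] = (window_start, count)
        return False                        # over budget: defer, do not drop
    _windows[account] = (window_start, count + 1)
    return True
```

Counting at the account level means a mass import spread across many subdomains is throttled as a whole, rather than each subdomain independently staying just under its own limit.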
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Postmortem published September 10, 2024.