On June 7, 2023, from 04:50 UTC to 09:42 UTC, Zendesk Support customers with active webhooks experienced failures with those webhooks.
07:56 UTC | 00:56 PT
We have received reports of webhooks failing for customers on Pod 25 and are investigating the issue. We will provide another update shortly.
08:20 UTC | 01:20 PT
We can confirm that the webhook failures are not limited to Pod 25 and also affect customers on Pods 17 and 27. We continue to investigate and will provide more information in 30 mins or when new information becomes available.
08:42 UTC | 01:42 PT
We identified a fix and are working on applying it to the issue affecting Support customers using webhooks on Pods 17, 25 & 27. We’ll provide more information in 30 mins or when new information becomes available.
09:18 UTC | 02:18 PT
The fix is being implemented across the affected Pods and we are waiting for it to soak in to all accounts. Webhooks that previously failed will be retried once this is done. Next update in 30 min or when we have more info.
09:44 UTC | 02:44 PT
The fix for the issue affecting Support webhooks across Pods 17, 25 & 27 has been successfully implemented. We’ll continue to monitor this until full resolution. Please let us know if you have any issues.
10:02 UTC | 03:02 PT
We’re happy to confirm all issues affecting customers using Support webhooks on Pods 17, 25 & 27 have been fully resolved. We appreciate your patience while we worked through this and apologize for any inconvenience caused.
Root Cause Analysis
This incident was caused by an issue with the webhook system’s error handling. An individual webhook payload triggered an unexpected processing error that was not handled properly. Instead of being rejected, the payload became stuck in a retry loop, which led to all subsequent webhook deliveries failing.
To fix this issue, our team deployed a code fix that resolved the error handling problem.
Remediation Items
- Reconfigure error retry mechanism [Completed]
- Refactor error handling to prioritize reliability and resiliency [In progress]
- Improve observability to better understand system health and performance [In progress]
- Deliver near real-time webhooks capability by effectively managing system capacity in the event of traffic spikes [Scheduled]
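The failure mode described in the root cause analysis, one payload stuck in a retry loop while later deliveries fail behind it, can be illustrated with a minimal sketch. This is not Zendesk's implementation; the names `Delivery`, `TransientError`, `drain`, and the `max_attempts` limit are hypothetical and chosen only for illustration. The sketch assumes that unexpected errors should be rejected rather than retried, and that retryable errors get a bounded number of attempts.

```python
from collections import deque
from dataclasses import dataclass


class TransientError(Exception):
    """Hypothetical: a failure that may succeed on a later attempt."""


@dataclass
class Delivery:
    payload: str
    attempts: int = 0


def drain(queue, send, max_attempts=3):
    """Process every queued delivery; return (delivered, rejected) payloads."""
    delivered, rejected = [], []
    while queue:
        item = queue.popleft()
        item.attempts += 1
        try:
            send(item.payload)
            delivered.append(item.payload)
        except TransientError:
            if item.attempts < max_attempts:
                # Requeue at the back so other deliveries are not blocked.
                queue.append(item)
            else:
                # Retry budget exhausted: give up on this payload.
                rejected.append(item.payload)
        except Exception:
            # Unexpected error: reject the payload instead of retrying forever.
            rejected.append(item.payload)
    return delivered, rejected


def fake_send(payload):
    # Stand-in for a real HTTP delivery; "bad" simulates the unexpected error.
    if payload == "bad":
        raise ValueError("unexpected processing error")


delivered, rejected = drain(deque(Delivery(p) for p in ["a", "bad", "b"]), fake_send)
```

In this sketch the payload `"bad"` is rejected on its first attempt, and `"a"` and `"b"` are still delivered. A production system would typically also record rejected payloads somewhere for later inspection.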
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.