SUMMARY
On May 2, 2024, from 13:35 UTC to 14:25 UTC, some customers experienced slow performance and occasional internal server errors when accessing our products across multiple Pods. The impact was most significant for customers on Pod 23, where users may have encountered delays or trouble using the Support agent interface, Sunshine Conversations Messaging, Chat, Talk, Explore, Sell, and Guide. Messaging services were also affected across all Pods.
Timeline
May 02, 2024 02:20 PM UTC | May 02, 2024 07:20 AM PT
We’re currently investigating multiple issues with Sunshine Conversations, AW Messaging and other products for customers in multiple Pods. More updates to follow.
May 02, 2024 02:35 PM UTC | May 02, 2024 07:35 AM PT
We’re actively working to address the general slowness and Internal Server Errors impacting several products due to this ongoing service incident, which affects customers in all Pods. We appreciate your patience. Next update in 30 min.
May 02, 2024 02:50 PM UTC | May 02, 2024 07:50 AM PT
We have implemented a fix for the issue, and we are observing improvements in the logs. Access to all products should now be restored. Please ensure you refresh your browser, and clear your cache and cookies if necessary. Thank you for your continued patience.
May 02, 2024 03:30 PM UTC | May 02, 2024 08:30 AM PT
We’ve been monitoring this incident and no longer see issues related to it after the fix was implemented. We are marking this as fully resolved now.
POST-MORTEM
Root Cause Analysis
The issue was caused by a fault in our service update process that left some parts of our infrastructure not ready to handle traffic, which reduced overall capacity.
Resolution
To fix this issue, our engineers rerouted internal traffic to bypass the affected infrastructure, which allowed us to restore full service by 14:25 UTC.
Remediation Items
- Improve our infrastructure update process to ensure that any issues are detected and addressed before they affect customers.
- Enhance our system's capacity to handle traffic during routine updates.
- Add new checks to monitor the health of our services more effectively.
- Work on better coordination between different components of our service infrastructure.
- Review the impact on all services to understand why they were affected and to prevent similar issues in the future.
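As an illustration of the kind of check the remediation items describe, the following is a minimal sketch of a readiness gate during an update. It is not our actual implementation. The `Backend` class, the `ready` flag, the function names, and the 75% threshold are all hypothetical and chosen only for the example. The idea is that traffic is sent only to instances that report ready, and an update pauses if too few instances are ready.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical backend instance behind a load balancer."""
    name: str
    ready: bool  # assumed result of a readiness probe, for illustration only


def routable(backends):
    """Return only the backends that have passed their readiness probe."""
    return [b for b in backends if b.ready]


def safe_to_continue_rollout(backends, min_ready_fraction=0.75):
    """Return False if too few backends are ready to serve traffic."""
    if not backends:
        return False
    ready_fraction = len(routable(backends)) / len(backends)
    return ready_fraction >= min_ready_fraction


# Example fleet: one instance has been updated but is not yet ready.
fleet = [
    Backend("node-a", ready=True),
    Backend("node-b", ready=False),
    Backend("node-c", ready=True),
    Backend("node-d", ready=True),
]

if __name__ == "__main__":
    print([b.name for b in routable(fleet)])
    print(safe_to_continue_rollout(fleet))
```

In this sketch, an instance that is not ready is excluded from routing rather than receiving traffic it cannot serve, and the update halts once the ready share of the fleet drops below the chosen threshold.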
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.