SUMMARY
On August 29, 2024, from 08:32 UTC to 09:58 UTC, our Chat customers on Pods 13, 20, and 25 experienced issues including being unable to log in, error messages, delayed chat transcripts, and Chat history not showing. Explore customers across all US Pods experienced data refresh delays to the historical datasets, and Pod 20 customers experienced impact to their Realtime data.
Timeline
August 29, 2024 09:14 AM UTC | August 29, 2024 02:14 AM PT
We are currently investigating multiple issues with Chat for customers on Pods 13, 20, and 25. The impact includes dropped messages, tickets not created, broken admin features and customers being unable to log in to chat. We will provide another update in 15 mins.
August 29, 2024 09:30 AM UTC | August 29, 2024 02:30 AM PT
We continue to investigate the Chat issues across multiple Pods at the highest priority and are working to find the root cause of this. Chat agents that are already logged in will be able to continue to chat, but the transcription will be delayed and chats will not show in the history. Any settings changed in Chat during this incident will not take effect on the affected Pods. We will provide another update in 30 mins or when we have more information to share.
August 29, 2024 09:59 AM UTC | August 29, 2024 02:59 AM PT
We continue our investigation into the root cause. We confirm that all Explore customers in the US will experience data refresh delays to the historical datasets and Pod 20 customers will experience impact to their Realtime data in Explore. We will provide another update in 30 minutes or when we have more information.
August 29, 2024 10:07 AM UTC | August 29, 2024 03:07 AM PT
We are starting to see recovery for customers in both Chat and Explore. We will continue to monitor the services until full resolution. We will update you again in 60 mins or when we know more.
August 29, 2024 10:40 AM UTC | August 29, 2024 03:40 AM PT
Both the initial impact to Chat and the extended impact to Explore have been cleared and our services have returned to normal operation. With this we are marking this service incident as resolved. We thank you for your patience while we worked to resolve this.
POST-MORTEM
Root Cause Analysis
The root cause was a connectivity issue with a third-party service that provides essential credentials for our Chat service. When our system attempted to reload its components, it couldn't obtain the necessary credentials, causing it to fail and restart continuously.
Resolution
The issue was resolved once the third-party service restored its connectivity. Our systems automatically recovered and resumed normal operations shortly thereafter.
Remediation Items
- Improve redundancy and build more resilience to third-party service disruptions.
- Enhance monitoring to detect and respond to such issues more quickly.
- Update our systems to handle temporary credential issues more gracefully.
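As an illustration of the last item only, the following is a minimal sketch of one common way to tolerate a temporary credential outage: retry with exponential backoff, then fall back to the last credentials that were fetched successfully instead of failing and restarting. It is not the actual Chat implementation. The names `CachedCredentialProvider`, `CredentialUnavailable`, and `fetch` are hypothetical, and the sketch assumes that previously issued credentials remain valid for the duration of a short outage.

```python
import time
from typing import Callable, Optional


class CredentialUnavailable(Exception):
    """Raised when the fetch fails and no earlier credentials are cached."""


class CachedCredentialProvider:
    """Hypothetical wrapper that reuses the last good credentials during an outage."""

    def __init__(
        self,
        fetch: Callable[[], str],          # assumed call to the credential service
        max_attempts: int = 3,
        base_delay: float = 0.5,           # seconds; doubled after each failure
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep
        self._last_good: Optional[str] = None

    def get(self) -> str:
        for attempt in range(self._max_attempts):
            try:
                # A successful fetch refreshes the cached value.
                self._last_good = self._fetch()
                return self._last_good
            except ConnectionError:
                # Back off before retrying, but not after the final attempt.
                if attempt < self._max_attempts - 1:
                    self._sleep(self._base_delay * (2 ** attempt))
        if self._last_good is not None:
            # Degrade gracefully: keep running on the cached credentials.
            return self._last_good
        raise CredentialUnavailable("credential service unreachable and nothing cached")
```

With a wrapper like this, a component reload during a brief outage would continue with cached credentials rather than entering a restart loop. The trade-off is that stale credentials may be used for longer than intended, so a real system would also need an upper bound on how old a cached value may be.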
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Charlotte Kobler
Post-mortem published September 06, 2024