SUMMARY
On August 30, 2022, between 16:37 and 17:29 UTC, customers using Zendesk Support, Talk, and Help Center (Guide) on Pod 19 experienced server errors and problems accessing Zendesk products. Specifically, Support and Help Center (Guide) customers could not authenticate or create new tickets, and calls were not properly routed to Talk. A related, continuing service incident can be found here.
Timeline
17:06 UTC | 10:06 PT
We are investigating reports of accessibility issues for Support and Talk users on Pod 19 resulting in errors. More info in 15 minutes or when we have more information.
17:23 UTC | 10:23 PT
Our engineers are actively investigating access issues impacting Support and Talk on Pod 19. We will provide an update in 30 minutes or as soon as we have new information.
17:46 UTC | 10:46 PT
We are happy to report the issues impacting Pod 19 Support and Talk are now fully resolved. Sorry for the inconvenience this may have caused you and your team.
POST-MORTEM
Root Cause Analysis
This incident was caused by connection errors in one of our hosting provider's availability zones, which resulted from an unexpected restart of database reader nodes. The reader nodes in each of the clusters restarted nearly simultaneously. This shifted the database connections to the writer node of each cluster, and connections kept increasing until we reached the connection limit of the relational database management system. After the two database readers restarted, a set of our internal services could not correctly connect to the new database nodes.
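As an illustration of the failure mode described above, the following minimal sketch shows one general way to keep reader failover from exhausting a writer's connection limit: reads fall back to the writer only up to a fixed budget, and further requests fail fast. It does not describe Zendesk's systems. The class `ReadRouter`, the node names, and the parameter `writer_fallback_limit` are all hypothetical.

```python
class ReadRouter:
    """Hypothetical router: sends reads to readers, with a capped writer fallback."""

    def __init__(self, readers, writer_fallback_limit):
        self.healthy_readers = set(readers)
        # Assumed budget: how many read connections may spill over to the writer.
        self.writer_fallback_limit = writer_fallback_limit
        self.writer_fallback_in_use = 0

    def mark_down(self, reader):
        """Record that a reader node is unavailable (for example, restarting)."""
        self.healthy_readers.discard(reader)

    def mark_up(self, reader):
        """Record that a reader node is available again."""
        self.healthy_readers.add(reader)

    def acquire(self):
        """Pick a target for a read, or fail fast when the budget is spent."""
        if self.healthy_readers:
            # Deterministic choice keeps the sketch simple; real routers balance load.
            return sorted(self.healthy_readers)[0]
        if self.writer_fallback_in_use < self.writer_fallback_limit:
            self.writer_fallback_in_use += 1
            return "writer"
        # Shedding load here protects the writer's own connection limit.
        raise RuntimeError("read capacity exhausted; shedding load")

    def release(self, target):
        """Return a fallback slot when a read that used the writer finishes."""
        if target == "writer" and self.writer_fallback_in_use > 0:
            self.writer_fallback_in_use -= 1
```

With both readers marked down, only `writer_fallback_limit` reads reach the writer at once; the rest are rejected immediately instead of piling up as open connections.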
Resolution
To fix this issue, we had to restart the affected services manually. The services stabilized after the restart, and the database clusters returned to a healthy state.
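For readers wondering how a service might recover from replaced database nodes without a manual restart, here is a minimal sketch of one common pattern: look up the database address again on every attempt and retry with exponential backoff. It is an assumption-laden illustration, not the fix Zendesk applied. The function `connect_with_retry` and its `resolve` and `connect` callables are hypothetical stand-ins for a real service-discovery lookup and database driver.

```python
import time


def connect_with_retry(resolve, connect, max_attempts=5, base_delay=0.5,
                       sleep=time.sleep):
    """Try to connect, re-resolving the node address before every attempt.

    resolve: hypothetical callable returning the current node address.
    connect: hypothetical callable that opens a connection to an address
             and raises ConnectionError on failure.
    """
    last_error = None
    for attempt in range(max_attempts):
        # Re-resolve each time so a replaced node is picked up
        # instead of retrying a stale, cached address.
        address = resolve()
        try:
            return connect(address)
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    raise ConnectionError(
        f"gave up after {max_attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter keeps the function easy to test; production code would typically also add jitter to the delay so that many clients do not reconnect at the same moment.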
Remediation Items
- Perform Zendesk incident retrospectives to identify ways to improve our systems' resilience to vendor outages. [DONE]
- Improve our business continuity plan for third-party vendor failures. [IN PROGRESS]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.