Service Incident - August 30, 2022 - Support, Talk & Guide | Pod 19 - Degraded Services [continued]

SUMMARY

On August 30, 2022, between 17:46 and 18:14 UTC, a few minutes after systems were considered stable following the previous incident, customers using Zendesk Support, Talk, and Help Center (Guide) on Pod 19 continued to experience server errors and problems accessing Zendesk products. Specifically, Support and Help Center (Guide) customers could not authenticate or create new tickets, and calls were not properly routed to Talk.

TIMELINE

18:18 UTC | 11:18 PT

We are investigating a continuation of the previous issues impacting Pod 19 Support and Talk. Next update in 30 minutes or when we have new information to share.

18:49 UTC | 11:49 PT

We are continuing to monitor the issues impacting Support and Talk on Pod 19 and are seeing improvements. Next update in 30 minutes or when we have new information.

19:20 UTC | 12:20 PT

We are still working to fully resolve the issues impacting Support and Talk products on Pod 19. We will provide an update when we have new information or confirmation that these issues are completely resolved.

21:37 UTC | 14:37 PT

The issues impacting Pod 19 Support and Talk have stabilized and we will continue to monitor performance. We will provide another update when we have new information.

Aug 31 12:08 UTC | 05:08 PT

We are happy to report that the issue impacting Pod 19 Support and Talk has now been resolved. We apologise for any inconvenience caused and thank you for your patience and partnership.

Aug 31 15:01 UTC | 08:01 PT

We would also like to confirm that some Help Center (Guide) functionality was briefly affected by this issue on Pod 19.

POST-MORTEM

Root Cause Analysis

This incident was caused by connection errors in one of our hosting provider's availability zones, resulting from an unexpected restart of database reader nodes. The reader nodes in each of the clusters restarted almost simultaneously. As a result, database connections shifted to the writer node of each cluster, and the number of connections kept increasing until it reached the connection limit of the relational database management system. After the two database readers had restarted, a set of our internal services could not correctly connect to the new database nodes.
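
The connection build-up described above can be illustrated with a minimal sketch. It assumes a single cluster in which every reader connection fails over to the writer at once. The function name and all numbers are hypothetical values chosen for illustration; they are not figures from this incident and do not represent Zendesk's systems.

```python
def writer_load_after_reader_restart(writer_conns, reader_conns, max_connections):
    """Estimate writer load when all reader connections fail over to the writer.

    All arguments are hypothetical illustration values, not incident data.
    Returns a tuple (accepted, rejected) of connection counts.
    """
    # Every connection that was on a reader now targets the writer.
    demanded = writer_conns + sum(reader_conns)
    # The database accepts connections only up to its configured limit.
    accepted = min(demanded, max_connections)
    return accepted, demanded - accepted


if __name__ == "__main__":
    accepted, rejected = writer_load_after_reader_restart(
        writer_conns=400, reader_conns=[350, 350], max_connections=1000
    )
    print(f"accepted={accepted} rejected={rejected}")
```

With these assumed numbers the writer reaches its limit and the remaining connection attempts are refused, which is the kind of saturation the analysis describes.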

Resolution

Resolving this issue required a manual restart of the affected services. The services stabilized after the restart, and the database clusters returned to a healthy state.

Remediation Items

Perform Zendesk incident retrospectives to identify ways to improve our systems' resilience to vendor outages. [DONE]
Improve our business continuity plan for third-party vendor failures. [IN PROGRESS]

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.