Service Incident - February 8th, 2021 - Pod 19 & 23 Outage

SUMMARY

On February 8th, 2021, during two time periods, 12:56 UTC to 13:15 UTC and 14:38 UTC to 15:52 UTC, we experienced site-wide disruptions across multiple Zendesk services (Support, Guide, and Talk) due to an AWS network outage (US-East-1d). All services were restored by 17:33 UTC.

20:07 UTC | 12:07 PT
We're happy to report the issues affecting Support, Talk and Guide for our customers on Pods 19 and 23 have been resolved. Please reach out to us if you're still experiencing issues.

18:16 UTC | 10:16 PT
We see continued stability on pods 19 & 23 for our customers. We’ll provide another update when our investigation concludes or the situation changes.

17:05 UTC | 09:05 PT
Pods 19 and 23 have remained stable. We will provide hourly updates until this issue has been completely resolved.

16:29 UTC | 08:29 PT
We are seeing improvements for our customers on Pods 19 and 23 with regard to the impact on Support, Talk, and Guide. Our team continues to investigate this issue.

15:46 UTC | 07:46 PT
Our team is working to resolve the issue that is impacting Support, Talk and Guide for customers on Pods 19 and 23. We will continue to provide updates.

15:17 UTC | 07:17 PT
We are aware of an outage for customers on Pods 19 and 23. We are continuing to investigate and will provide updates.

14:50 UTC | 06:50 PT
We are aware that the issues for customers on Pods 19 and 23, which had stabilized, may have returned. We are investigating and will provide updates.

14:14 UTC | 06:14 PT
We are happy to report we are seeing stability on Pods 19 and 23 following the issue that impacted Support, Talk and Guide. We believe this was a third-party issue, but our teams will continue to monitor. We will provide an update once we have additional information to share.

13:42 UTC | 05:42 PT
We are beginning to see some improvements on Pods 19 and 23 for Support, Talk and Guide. We will keep you updated as we find out more. Thank you for your patience, and we apologize for the inconvenience.

13:25 UTC | 05:25 PT
We are currently aware of an issue for customers on Pods 19 and 23 who are experiencing an error when attempting to access the Support interface. This is also impacting Guide and Talk. We are investigating and will provide updates.

POST-MORTEM

During these disruptions, it was determined that a number of databases were adversely affected by the network outage, cascading to response delays and errors across a number of dependent processes. Affected AWS services were EC2, EBS, and RDS.

Due to high CPU utilization across at least two of the database clusters in these pods, queries had to be terminated in order to free resources. Two other databases were forcibly restarted by AWS.

Dependent services, including Search and Views, were unable to complete requests. This exhausted the resources of our core processing pool, which cascaded to failures and errors across Support, Guide, and Talk. Customers using Views would have seen view counts disabled, which was necessary to help reduce database load.

Support functionality was severely diminished, and Talk calls were dropped during this time period. Guide functionality and services were also adversely affected.

Root Cause Analysis

A defect in AWS’ deployment system resulted in a network failure in the AZ use1-1d. This started a cascading failure scenario of MySQL instance restarts and high CPU, ES errors, Riak issues, NLB connection errors, etc., leading to capacity exhaustion until we shed load by terminating queries and turning off View Counts.
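
As an illustration of the load-shedding step described above, the following is a minimal sketch of how long-running queries might be selected for termination once CPU utilization crosses a threshold. It is not the tooling used during the incident. The RunningQuery shape, the 90% CPU threshold, and the 30-second cutoff are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningQuery:
    """One row of a process list (hypothetical shape, loosely modeled on MySQL's SHOW PROCESSLIST)."""
    query_id: int
    seconds_running: int
    is_replication: bool = False


def select_queries_to_shed(queries: List[RunningQuery],
                           cpu_utilization: float,
                           cpu_threshold: float = 0.90,  # assumed threshold
                           max_seconds: int = 30) -> List[int]:  # assumed cutoff
    """Return IDs of queries to terminate, longest-running first.

    Nothing is selected while CPU is below the threshold, and
    replication threads are never selected.
    """
    if cpu_utilization < cpu_threshold:
        return []
    candidates = [q for q in queries
                  if q.seconds_running > max_seconds and not q.is_replication]
    candidates.sort(key=lambda q: q.seconds_running, reverse=True)
    return [q.query_id for q in candidates]


queries = [
    RunningQuery(1, 5),
    RunningQuery(2, 120),
    RunningQuery(3, 45),
    RunningQuery(4, 600, is_replication=True),
]
to_kill = select_queries_to_shed(queries, cpu_utilization=0.97)
```

In this sketch only queries 2 and 3 are selected. Query 1 is under the cutoff, and query 4 is skipped because terminating a replication thread would trade one problem for another.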

Remediation Items

Work with AWS to mitigate recurrence.
Limit the number of retries for the edge proxy when Support is unresponsive.
Automated detection of load balancing bottlenecks is in place and initiates the manual process to resolve them.
Create a new monitor to detect TCP connection tracking exhaustion for earlier detection.
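
Two of the items above, limiting retries and monitoring connection tracking, can be sketched as follows. This is an illustrative example under stated assumptions, not the actual edge proxy or monitoring configuration. The function names, the retry limit of 2, and the 80% alert threshold are hypothetical. The /proc/sys/net/netfilter paths are the standard Linux locations for the conntrack counters and are only present when the nf_conntrack module is loaded.

```python
import random
import time
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_bounded_retries(request: Callable[[], T],
                              max_retries: int = 2,  # assumed limit
                              base_delay: float = 0.1,
                              sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `request`, retrying at most `max_retries` times on ConnectionError."""
    attempt = 0
    while True:
        try:
            return request()
        except ConnectionError:
            if attempt >= max_retries:
                raise  # give up rather than add load to an unresponsive backend
            # Jittered exponential backoff: wait up to base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
            attempt += 1


def conntrack_usage(proc_dir: str = "/proc/sys/net/netfilter") -> Optional[float]:
    """Return nf_conntrack_count / nf_conntrack_max, or None if unavailable."""
    try:
        count = int(Path(proc_dir, "nf_conntrack_count").read_text())
        limit = int(Path(proc_dir, "nf_conntrack_max").read_text())
    except (OSError, ValueError):
        return None
    return count / limit if limit > 0 else None


def conntrack_alert(usage: Optional[float], threshold: float = 0.8) -> bool:
    """True when the conntrack table is at or above `threshold` of capacity (assumed 80%)."""
    return usage is not None and usage >= threshold
```

Capping retries keeps a proxy from multiplying traffic against a backend that is already failing, and alerting on the conntrack ratio before it reaches 100% leaves time to act before new connections start being dropped.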

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.