17:51 UTC | 10:51 PT
We’re happy to report that the performance incident has been resolved. Please reload your Zendesk and let us know if you experience any issues.
17:16 UTC | 10:16 PT
We are seeing continued performance improvement across all affected pods. Please let us know if issues persist so we can investigate further.
16:53 UTC | 09:53 PT
We are seeing some performance improvements on affected pods. We are working to further stabilize response times.
16:03 UTC | 09:03 PT
We are experiencing performance issues on Pod 14. We are actively investigating.
On August 08, 2018, customers on Pod 14 experienced an outage when using Zendesk Support, Guide and Chat. Customers on Pod 5 experienced an outage when using Zendesk Support and Guide. This incident was caused by a change event.
Total impact duration: Two hours and nine minutes.
OVERVIEW OF EVENTS
At 14:25 UTC, a change (a software upgrade) was initiated in all US data centers. Over the next 30 minutes, hosts in each data center installed the new version. Part of the code that handles this upgrade gracefully drains connections from the proxies before restarting them.
At 15:18 UTC, HTTP 50x error rates on the proxy tier of Pod 14 began to increase.
At 15:24 UTC, our network operations team reported issues with proxy hosts in Pod 14. The hosts showed multiple symptoms related to networking.
At 15:40 UTC, HTTP 50x error rates on the proxy tier of Pod 5 also began to increase.
At 15:50 UTC, the response team began to look into network issues to pinpoint the cause of the outage. After triaging the incident, we identified the cause as a command, used during the upgrade, that loads a kernel module which tracks the state of all connections for filtering and routing purposes.
To resolve the issue, the team invoked a command on all affected hosts to increase the maximum number of connections the kernel module can track, removing the bottleneck. The team called the "all clear" for this incident at 17:45 UTC.
A custom provider written for these upgrades was used to help clients gracefully drain their connections prior to upgrading. This command had an unknown side effect of loading a kernel module that tracks the state of all connections for filtering and routing purposes. By default, the kernel module caps the number of connections it can track. The change was rolled out to our staging pods and some production pods without any impact; however, hosts with heavier network traffic establish more connections at a higher rate, which exhausted that cap very quickly, and the kernel began dropping network packets. This led to the intermittent availability issues that customers experienced.
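The report does not name the specific kernel module, but on Linux the standard connection-tracking module is nf_conntrack, whose live count and ceiling are exposed under /proc/sys. As an illustrative sketch only (the module name, paths, threshold, and new limit below are assumptions, not details from the incident), a host-level check could compare the live connection count against the configured maximum and raise the limit before the table is exhausted:

```python
#!/usr/bin/env python3
"""Sketch: check Linux connection-tracking headroom and raise the limit at
runtime. Assumes the nf_conntrack module (not named in the post-mortem) and
the standard /proc/sys paths; requires root on a host with the module loaded."""

from pathlib import Path

# sysctl paths for the connection-tracking table (assumed nf_conntrack)
COUNT = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX = Path("/proc/sys/net/netfilter/nf_conntrack_max")

WARN_RATIO = 0.80      # alert threshold: 80% of the table in use (assumption)
NEW_MAX = 1_048_576    # illustrative new ceiling, not the value actually used


def read_int(path: Path) -> int:
    return int(path.read_text().strip())


def main() -> None:
    current, maximum = read_int(COUNT), read_int(MAX)
    usage = current / maximum
    print(f"conntrack usage: {current}/{maximum} ({usage:.0%})")

    if usage >= WARN_RATIO:
        # Same effect as `sysctl -w net.netfilter.nf_conntrack_max=<value>`.
        MAX.write_text(f"{NEW_MAX}\n")
        print(f"raised nf_conntrack_max to {NEW_MAX}")


if __name__ == "__main__":
    main()
```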
This incident has a few key remediation items that we've put in place to prevent a recurrence of this issue:
- During the incident, the team increased the maximum number of connections the kernel module can track, eliminating the bottleneck.
- We have subsequently set the kernel module's maximum connection value globally for all hosts; a rough configuration sketch follows this list.
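As a rough illustration of what setting the value globally might look like in practice (the file path, setting name, and value are assumptions for illustration; the post-mortem does not specify how the change was applied), a configuration-management step could install a sysctl drop-in so the higher limit persists across reboots:

```python
#!/usr/bin/env python3
"""Sketch: persist a connection-tracking limit across reboots by writing a
sysctl drop-in and reloading it. Path, key, and value are assumptions, not
details from the incident report; requires root."""

import subprocess
from pathlib import Path

DROPIN = Path("/etc/sysctl.d/90-conntrack.conf")        # assumed file name
SETTING = "net.netfilter.nf_conntrack_max = 1048576"    # assumed key/value


def main() -> None:
    DROPIN.write_text(SETTING + "\n")
    # Apply immediately; `sysctl --system` re-reads drop-ins under /etc/sysctl.d.
    subprocess.run(["sysctl", "--system"], check=True)
    print(f"wrote {DROPIN} and reloaded sysctl settings")


if __name__ == "__main__":
    main()
```

In a real fleet this step would typically be rolled out through whatever configuration-management tooling already manages the hosts, rather than an ad hoc script.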
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.