SUMMARY
On February 1, 2021, from 11:05 UTC to 12:22 UTC, Zendesk Chat customers encountered errors that prevented access to and use of the Chat product.
Timeline
11:29 UTC | 03:29 PT
Some Zendesk chat customers may receive a “502 bad gateway” error message when attempting to access the chat dashboard. Our teams are investigating the issue.
11:45 UTC | 03:45 PT
We are continuing to investigate “502 bad gateway” errors affecting some Chat customers. More updates to follow.
12:04 UTC | 04:04 PT
Some Chat customers are still experiencing “502 bad gateway” errors when accessing the Chat dashboard. We are working on a solution.
12:43 UTC | 04:43 PT
We are seeing improvements in Chat dashboard performance, and normal service is resuming. We are continuing to monitor.
13:41 UTC | 05:41 PT
Normal service has resumed. We are continuing to monitor and to investigate the cause. We will update here when we have more information.
15:52 UTC | 07:52 PT
Chat dashboard access has now been stable for 2.5 hours. Thanks for your patience while we investigated this. Root cause identification is ongoing, and we will post our findings as soon as we have them.
16:22 UTC | 08:22 PT
We would like to confirm that the outage impacting Chat users is now resolved.
POST-MORTEM
Root Cause Analysis
This incident was primarily caused by a misconfiguration in an internal feature gating service that left the service unable to handle the full volume of production traffic. A combination of influencing factors and secondary causes magnified the impact to our customers:
- A low connection overhead parameter in our data store and cache service led to increased swap usage, further degrading performance
- An auth service experienced instability that led to increased error rates in the feature gating service
- Aggressive retry logic (without backoffs or timeouts) in some services exacerbated the request spike against the affected service
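The retry problem above is a common failure amplifier: a client that retries immediately and indefinitely turns a brief error spike into a sustained one. A minimal sketch of the usual mitigation, capped exponential backoff with jitter plus an overall deadline, is below. The function and parameter names are illustrative, not Zendesk's actual implementation.

```python
import random
import time


def call_with_backoff(request_fn, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, deadline=10.0):
    """Retry a failing call with capped exponential backoff and full jitter.

    `request_fn` is any hypothetical callable that raises on failure.
    Gives up after `max_attempts` tries or once `deadline` seconds elapse,
    re-raising the last error instead of hammering a struggling service.
    """
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            out_of_time = time.monotonic() - start >= deadline
            if attempt == max_attempts - 1 or out_of_time:
                raise
            # Sleep a random duration up to base * 2^attempt, capped at
            # max_delay, so concurrent clients don't retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

With this shape, the delay ceiling doubles on each attempt (0.1 s, 0.2 s, 0.4 s, ...), and the randomness spreads retries out so that a fleet of clients does not synchronize into repeated request spikes.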
Resolution
To fix this issue, our team made a configuration change to the feature gating service and redeployed it to production. Recovery was observed soon after.
Remediation Items
- Apply configuration changes to the feature gating service [Completed]
- Identify and restart other cache nodes with low freeable memory [Completed]
- Improve monitoring to identify similar issues earlier [In Progress]
- Increase connection overhead in the cache service [Scheduled]
- Investigate retry logic in cache services [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.