SUMMARY
On February 1, 2021, from 11:05 UTC to 12:22 UTC, Zendesk Chat customers encountered errors that prevented access to and use of the Chat product.
Timeline
11:29 UTC | 03:29 PT
Some Zendesk chat customers may receive a “502 bad gateway” error message when attempting to access the chat dashboard. Our teams are investigating the issue.
11:45 UTC | 03:45 PT
We are continuing to investigate “502 bad gateway” errors affecting some Chat customers. More updates to follow.
12:04 UTC | 04:04 PT
Some Chat customers are still experiencing “502 bad gateway” errors when accessing the Chat dashboard. We are working on a solution.
12:43 UTC | 04:43 PT
We are seeing improvements in Chat dashboard performance, and normal service is resuming. We are continuing to monitor.
13:41 UTC | 05:41 PT
Normal service has resumed. We are continuing to monitor and to investigate the cause. We will update here when we have more information.
15:52 UTC | 07:52 PT
Chat dashboard access has now been stable for 2.5 hours. Thanks for your patience while we investigated this. Root cause identification is ongoing, and we will post our findings as soon as we have them.
16:22 UTC | 08:22 PT
We would like to confirm that the outage impacting Chat users is now resolved.
POST-MORTEM
Root Cause Analysis
This incident was primarily caused by a misconfiguration in an internal feature gating service that left the service unable to handle the full volume of production traffic. A combination of influencing factors and secondary causes magnified the impact to our customers:
- A low connection overhead parameter in our data store and cache service led to increased swap usage, further degrading performance
- An auth service experienced instability that led to increased error rates in the feature gating service
- Aggressive retry logic (without backoffs or timeouts) in some services exacerbated the request spike against the affected service
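The retry problem above is a common failure amplifier: a client that retries immediately and indefinitely turns a brief error spike into a sustained one. A minimal sketch of the usual mitigation, capped exponential backoff with jitter plus an overall deadline, is below. The function and parameter names are illustrative, not Zendesk's actual implementation.

```python
import random
import time


def call_with_backoff(request_fn, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, deadline=10.0):
    """Retry a failing call with capped exponential backoff and full jitter.

    `request_fn` is any hypothetical callable that raises on failure.
    Gives up after `max_attempts` tries or once `deadline` seconds elapse,
    re-raising the last error instead of hammering a struggling service.
    """
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            out_of_time = time.monotonic() - start >= deadline
            if attempt == max_attempts - 1 or out_of_time:
                raise
            # Sleep a random duration up to base * 2^attempt, capped at
            # max_delay, so concurrent clients don't retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

With this shape, the delay ceiling doubles on each attempt (0.1 s, 0.2 s, 0.4 s, ...), and the randomness spreads retries out so that a fleet of clients does not synchronize into repeated request spikes.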
Resolution
To fix this issue, our team made a configuration change to the feature gating service and redeployed it to production. Recovery was observed soon after.
Remediation Items
- Apply configuration changes to the feature gating service [Completed]
- Identify and restart other cache nodes with low freeable memory [Completed]
- Improve monitoring to identify similar issues earlier [In Progress]
- Increase connection overhead in the cache service [Scheduled]
- Investigate retry logic in cache services [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.