On September 21, 2020, from 16:09 UTC to 16:45 UTC, Zendesk Chat customers experienced gateway errors when attempting to access and use the Zendesk Chat product.
16:37 UTC | 09:37 PT
We’ve received reports of 502 errors while accessing Chat from numerous customers. Our team is actively investigating.
16:52 UTC | 09:52 PT
We’re beginning to see improvement in users’ ability to access Chat. We’ll continue to monitor the situation and provide updates.
17:29 UTC | 10:29 PT
We’ve identified the source of the 502 errors and restored access to Chat. Users should now be able to sign in without issue, but please let us know if you continue to see errors.
Root Cause Analysis
This incident was caused by missing rate limits on a Chat API endpoint, which allowed a high volume of rapid requests from a “noisy neighbor” account to overwhelm a Chat database. The influx of requests generated many queries against the Chat database, which eventually exhausted the database connection limit and caused the connection failures our customers experienced.
To fix this issue, our engineers restarted the database services to kill the existing queries, allowing connections and queries to run freely again.
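The failure mode described above can be illustrated with a small sketch. It is not Zendesk's implementation. The `BoundedConnectionPool` class, the `simulate_burst` helper, and the limit and timeout values are all hypothetical, chosen only to show how a burst of long-running queries uses up a fixed connection limit and how an acquire timeout turns indefinite waiting into a fast failure.

```python
import threading


class BoundedConnectionPool:
    """Hypothetical pool that caps concurrent database connections."""

    def __init__(self, max_connections: int, acquire_timeout: float):
        # Each semaphore slot stands in for one database connection.
        self._slots = threading.BoundedSemaphore(max_connections)
        self._acquire_timeout = acquire_timeout

    def acquire(self) -> bool:
        # Wait at most acquire_timeout seconds; return False if no
        # connection became free instead of blocking indefinitely.
        return self._slots.acquire(timeout=self._acquire_timeout)

    def release(self) -> None:
        # Hand the connection back so another request can use it.
        self._slots.release()


def simulate_burst(pool: BoundedConnectionPool, requests: int) -> tuple[int, int]:
    """Issue `requests` queries that never finish and count the outcomes.

    Nothing is released, which mimics queries that hold their
    connections while the burst continues.
    """
    served = sum(1 for _ in range(requests) if pool.acquire())
    return served, requests - served
```

With a limit of 3 connections, `simulate_burst(BoundedConnectionPool(3, 0.01), 5)` serves three requests and rejects two. The rejected requests correspond to the connection failures customers saw, and they stay rejected until the held connections are released or, as in this incident, the stuck queries are killed.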
Remediation Items
- Reconfigure Chat database connection timeout [Completed].
- Reconfigure OAuth endpoint logic [Completed].
- Set rate limits on relevant API endpoints [Scheduled].
- Investigate alternative database connection strategy for all applications [Scheduled].
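As an illustration of the rate-limiting item above, here is a minimal sketch of a per-account token bucket, one common way to keep a single noisy account from monopolizing a shared endpoint. It is an assumption-laden example, not the limiter Zendesk deployed. The class names, the `rate` and `capacity` parameters, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Hypothetical bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so short bursts are allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token for this request
            return True
        return False


class PerAccountRateLimiter:
    """Keeps one bucket per account so accounts are limited independently."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, account_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(account_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, now)
            self._buckets[account_id] = bucket
        return bucket.allow(now)
```

A request handler would call `limiter.allow(account_id)` before touching the database and reject the request when it returns `False`, typically with an HTTP 429 response. Because each account has its own bucket, one account exhausting its tokens does not affect the others.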
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.