Summary
On February 7, 2022, from 17:15 UTC to 23:23 UTC, some customers were unable to load the Zendesk Chat Real Time Monitor dashboard. In addition, customers using the Chat RTM API would have experienced delayed or incorrect statistics.
Timeline
18:35 UTC | 10:35 PT
We are investigating reports of Chat monitor sync delays and Agent Workspace Chat routing issues. We will provide an update as soon as we have more information.
19:01 UTC | 11:01 PT
We are continuing to investigate Chat monitor sync delays and Agent Workspace Chat routing issues. We will provide an update as soon as we have more detailed information.
19:36 UTC | 11:36 PT
The Agent Workspace Chat routing issue is now resolved. Our engineers are continuing to work on the Chat monitor sync delay and we will provide further updates as we work towards resolution.
23:13 UTC | 15:13 PT
We continue to investigate Chat monitor sync delays. We will provide another update within 3 hours, or sooner if we have progress to share.
00:28 UTC | 16:28 PT
The issues causing Chat monitor sync delays have been resolved. We have confirmed that the Real Time Monitor is now presenting up-to-date data. Thank you for your patience.
Root Cause Analysis
This incident was caused by a bug that triggered a spike in connections to a key datastore. The spike exhausted the CPU on the associated shard, which left the Chat Real-Time Monitoring API unable to fetch monitoring data for our customers.
Resolution
To fix this issue, our engineering team deployed a bug fix to reduce connection spikes.
Remediation Items
- Pool application connections to the datastore [Scheduled]
- Change the strategy for reconnecting from the application to the datastore [Scheduled]
- Implement better application rate limiting [Scheduled]
- Implement circuit breaker to protect against traffic spikes [Scheduled]
- Spread RTM knowledge more widely across teams [Scheduled]
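Two of the remediation items above, the reconnection strategy and the circuit breaker, can be illustrated with a minimal Python sketch. This is not Zendesk's implementation: the names `backoff_delay`, `CircuitBreaker`, and `CircuitOpenError`, along with every threshold and timeout, are hypothetical and chosen only for illustration. The sketch assumes that reconnect attempts are spread out with jittered exponential backoff, and that calls to the datastore are rejected locally after repeated failures so a struggling shard is not hit with more connections.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff (hypothetical helper).

    Returns a random delay in [0, min(cap, base * 2**attempt)) seconds so
    that many clients reconnecting at once do not retry in lockstep.
    """
    return rng() * min(cap, base * (2 ** attempt))


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the datastore."""


class CircuitBreaker:
    """Minimal, single-threaded circuit breaker (illustrative only)."""

    def __init__(self, failure_threshold=5, reset_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast instead of adding load to the datastore.
                raise CircuitOpenError("datastore circuit is open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, a caller would wrap each datastore request in `breaker.call(...)` and sleep for `backoff_delay(attempt)` between reconnect attempts. A production version would also need locking for concurrent callers and would typically sit behind a bounded connection pool.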
For More Information
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.