SUMMARY
Between 23:10 UTC on January 18, 2022 and 19:51 UTC on January 19, 2022, Enterprise customers using the Real Time Monitor (RTM) in the Zendesk Chat dashboard were presented with a ‘Monitor is connecting’ message when accessing Chat real time metrics, and the expected information did not load.
Timeline
14:50 UTC | 06:50 PT
We have been receiving reports of a “Monitor is connecting” error affecting Enterprise customers when they try to access Chat Real Time Metrics in the Chat Dashboard. An investigation is ongoing. We appreciate your patience with this matter.
15:32 UTC | 07:32 PT
The team continues to actively investigate and look for solutions to the issues related to Chat Real Time Metrics not loading and not processing information as expected. We will post another update within one hour or as soon as more information becomes available.
16:22 UTC | 08:22 PT
We are still investigating a root cause for the issues related to Chat Real Time Metrics not loading as expected. We will post another update within an hour or as we receive more information.
17:12 UTC | 09:12 PT
We are seeing some improvement and Chat Real Time Metric traffic is beginning to return. Our team continues to monitor to ensure full recovery. We will continue to post hourly updates until resolution.
18:12 UTC | 10:12 PT
We have not seen any additional progress since our last update and Chat Real Time Metric traffic remains in a degraded state. We are actively investigating and will post again within an hour.
19:03 UTC | 11:03 PT
We have identified a potential cause for the issue impacting Chat Real Time Metric functionality and are working towards a fix. We will continue to provide additional updates as we receive new information.
19:51 UTC | 11:51 PT
We are happy to report that the issue impacting Chat Real Time Metric traffic has been resolved and Real Time Metric dashboards should load as expected at this time. Please let us know if you continue to experience any issues.
POST-MORTEM
Root Cause Analysis
This incident was caused by several factors:
- Insufficient server capacity
- Long-lived WebSocket connections, which caused the Zendesk Chat Real Time Monitoring service to become unresponsive
- Surge in requests to the chat backend
- Insufficient monitoring on WebSocket traffic
Resolution
To fix this issue, we undertook the following courses of action:
- Temporary suspension of RTM API for accounts hitting rate limits
- Restarting the servers through deployment to clear old connections
- Upscaling capacity
After these actions were completed, we continued to monitor the service before declaring the issue resolved.
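For illustration only, the sketch below shows one general way a service can keep long-lived connections from accumulating: recording when each connection was opened and periodically evicting those older than a maximum age, rather than relying on a restart to clear them. This is not a description of the Zendesk implementation. The class name ConnectionRegistry, the max_age_seconds parameter, and the connection IDs are all hypothetical.

```python
import time
from typing import Callable, Dict, List


class ConnectionRegistry:
    """Hypothetical helper that tracks when each connection was opened."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # max_age_seconds is an assumed tuning knob, not a documented setting.
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._opened_at: Dict[str, float] = {}

    def register(self, conn_id: str) -> None:
        """Record the time at which a connection was opened."""
        self._opened_at[conn_id] = self._clock()

    def evict_expired(self) -> List[str]:
        """Remove and return the IDs of connections at or past the maximum age.

        A real server would also close the underlying socket for each ID so
        that the client reconnects with a fresh connection.
        """
        now = self._clock()
        expired = [
            conn_id
            for conn_id, opened in self._opened_at.items()
            if now - opened >= self.max_age_seconds
        ]
        for conn_id in expired:
            del self._opened_at[conn_id]
        return expired

    def open_count(self) -> int:
        """Number of connections currently tracked as open."""
        return len(self._opened_at)
```

A periodic task could call evict_expired() and close each returned connection. The open_count() value is also the kind of figure that monitoring on WebSocket traffic could track against server capacity.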
Remediation Items
- Fix unstable deploy scripts [Done]
- Add server capacity [Done]
- Add Real Time Monitor API capacity [Done]
- Add monitoring for API saturation [Scheduled]
- Improve monitoring for other APIs and services [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.