SUMMARY
On January 31, 2022, from 16:26 UTC to 19:50 UTC, customers on Pods 17 and 18 using Agent Workspace experienced chat routing issues. Agents were online, but chats were not routed to them and instead timed out as missed chats.
Timeline
18:04 UTC | 10:04 PT
We are investigating reports of Chats not routing for Agent Workspace customers on Pods 17 and 18. We will provide additional information as soon as we can.
18:34 UTC | 10:34 PT
We have confirmed an issue impacting Chat routing for Agent Workspace customers on Pods 17 and 18. Our team is working to restore full functionality and we will provide another update as soon as we can.
19:27 UTC | 11:27 PT
We are seeing recovery in Pod 17 Chat routing on Agent Workspace but we are still working to restore full functionality in Pod 18. We will provide additional information as soon as we can.
20:09 UTC | 12:09 PT
We are happy to report that Pod 18 Chat routing for Agent Workspace has recovered along with Pod 17. Please let us know if you continue to experience any issues.
POST-MORTEM
Root Cause Analysis
This incident was caused by an outdated backend library version that indirectly resulted in:
- Storage partition rebalancing taking a long time
- API requests not being served while rebalancing was ongoing
The liveness probes that monitor for unresponsive services interpreted the long rebalancing as an unresponsive service and restarted the process. Each restart led to another long rebalancing, producing a loop that delayed recovery.
Resolution
To fix this issue, the liveness probes were disabled, which allowed the rebalancing to complete. We continued to monitor the service as it gradually recovered.
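The restart loop and the effect of disabling the probes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of the actual service: the function name `simulate`, the probe period, the failure threshold, and the rebalancing duration are all hypothetical, and the sketch assumes that a restart discards rebalancing progress and that every probe fails while rebalancing is ongoing.

```python
from typing import Optional


def simulate(rebalance_seconds: int,
             probe_period: int,
             failure_threshold: int,
             probes_enabled: bool,
             max_seconds: int = 3600) -> Optional[int]:
    """Return the second at which rebalancing completes,
    or None if it does not complete within max_seconds.

    All parameters are hypothetical values chosen for illustration.
    """
    progress = 0        # seconds of rebalancing done since the last (re)start
    failed_probes = 0   # consecutive failed liveness probes

    for now in range(1, max_seconds + 1):
        progress += 1
        if progress >= rebalance_seconds:
            # Rebalancing finished; the service can serve API requests again.
            return now

        if probes_enabled and now % probe_period == 0:
            # Assumption: the service cannot answer while rebalancing,
            # so every probe sent during rebalancing fails.
            failed_probes += 1
            if failed_probes >= failure_threshold:
                # Assumption: the restart throws away rebalancing progress.
                progress = 0
                failed_probes = 0

    return None


# Probes every 10 s, restart after 3 consecutive failures (i.e. after 30 s).
looping = simulate(rebalance_seconds=300, probe_period=10,
                   failure_threshold=3, probes_enabled=True)
recovered = simulate(rebalance_seconds=300, probe_period=10,
                     failure_threshold=3, probes_enabled=False)
print("probes enabled: ", looping)
print("probes disabled:", recovered)
```

With these illustrative numbers, the process is restarted every 30 seconds and a 300-second rebalance never finishes while probes are enabled. With probes disabled, the same rebalance runs to completion.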
Remediation Items
- Upgrade library version to improve response times when serving API requests [Scheduled]
- Improve the existing monitors for the affected service [Scheduled]
- Improve monitoring and alerting to shorten response times [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Post-mortem published February 9th, 2022.