Summary
On February 14, 2022, from 04:48 UTC to 05:54 UTC, Zendesk Support customers on Pod 25 may have intermittently experienced delays with incoming chats and messages in Agent Workspace.
Timeline
06:20 UTC | 22:24 PT
We received reports of delays with incoming chats and messages in Support (Agent Workspace) on Pod 25. Our logs indicate that this occurred between 04:48 UTC and 05:54 UTC. Systems are now in recovery and wait times should be back to normal. We apologize for any inconvenience caused today and appreciate your patience.
Root Cause Analysis
This incident was caused by a configuration error that resulted in excessive server node rebalancing following a routine restart. This in turn caused cascading rounds of rebalancing, as our health-check endpoints did not respond in time and memory failures occurred.
Resolution
To fix this issue, our engineering team brought forward the deployment of fixes that improve the rebalancing strategy on the nodes.
Remediation Items
- Improve node rebalancing strategy [Completed]
- Reconfigure readiness and liveness probes on Pod 25 [Completed]
- Explore additional monitoring on stopped partitions [Scheduled]
For More Information
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Post-mortem posted on February 18, 2022.