SUMMARY
On July 2nd, 2024, between 18:20 and 18:44 UTC, our Sunshine Conversations service experienced high latency due to an unexpected increase in traffic. This caused our systems to slow down and led to delays in message processing.
Timeline
July 02, 2024 10:12 PM UTC | July 02, 2024 03:12 PM PT
The issue impacting Answer Bot performance on Pod 13 is now fully resolved. Please let us know if you continue to experience issues.
July 02, 2024 09:07 PM UTC | July 02, 2024 02:07 PM PT
We are seeing improvements to Answer Bot performance on Pod 13 and will continue to monitor performance. We will provide a final update once the incident is resolved.
July 02, 2024 08:13 PM UTC | July 02, 2024 01:13 PM PT
Our engineers are investigating the issue causing the Answer Bot degradation on Pod 13. We will provide another update when we have new information to share.
July 02, 2024 07:47 PM UTC | July 02, 2024 12:47 PM PT
We are investigating reports of Answer Bot degradation on Pod 13. We will provide another update when we have more information.
POST-MORTEM
The incident was triggered by a significant increase in traffic. This caused our systems to slow down, resulting in delays and temporary service interruptions. We took immediate action to manage the increased load and restore normal operations.
Root Cause Analysis
The main cause was a sudden spike that doubled our usual traffic and saturated our database, leading to delays. Additionally, our Answer Bot service couldn't handle the increased load, causing further disruptions.
Resolution
To mitigate the issue, we scaled up our database and Answer Bot service, increasing their capacity to handle the surge. This allowed us to restore normal operations and process the backlog of messages.
Remediation Items
1. Enable Auto-Scaling: Implement automatic scaling for critical services to handle sudden traffic spikes.
2. Introduce Circuit Breakers: Prevent overloading of services by temporarily reducing traffic when necessary.
3. Improve Monitoring: Enhance our monitoring systems to detect and respond to similar issues more quickly.
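As an illustration of remediation item 2, the circuit breaker pattern can be sketched as below. This is a minimal Python sketch under stated assumptions, not the implementation used in production: the names CircuitBreaker and CircuitOpenError, the failure threshold, and the reset timeout are all hypothetical and chosen only for the example. After a set number of consecutive failures the breaker "opens" and rejects calls immediately, so an overloaded service gets time to recover. Once the cool-down has passed, a trial call is let through, and a success closes the breaker again.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # Hypothetical defaults; real values would be tuned per service.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: shed the request without calling func.
                raise CircuitOpenError("downstream service is shedding load")
            # Cool-down elapsed: fall through and allow a trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success: reset the failure count and close the breaker.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the protected service in breaker.call(...) and treat CircuitOpenError as a signal to queue or retry the message later. The sketch is not thread-safe; a shared breaker would need a lock around its state.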
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.