SUMMARY
On August 1, 2024, from 12:10 UTC to 12:33 UTC, a small number of Chat customers on Pod 23 experienced disconnected chats and were unable to start new ones.
Timeline
August 01, 2024 12:40 PM UTC | August 01, 2024 05:40 AM PT
We are aware and investigating Chat issues and errors on Pod 23. More information to come.
August 01, 2024 12:58 PM UTC | August 01, 2024 05:58 AM PT
We have restarted the server affecting the Chat service in Pod 23, and we are seeing improvements on the backend. However, you may notice a delay in chat data saving for ongoing chats during this period. Additionally, there is a temporary discrepancy between the status shown in the status switcher in the Agent Workspace and the status recorded on the server. As a result, an agent might appear ONLINE in the Agent Workspace but be OFFLINE on the server, which could affect chat ticket routing.
August 01, 2024 01:49 PM UTC | August 01, 2024 06:49 AM PT
To reiterate, there should be no data loss, but the ongoing chats for the affected accounts during this issue may have been dropped prematurely. These chats will be recovered and saved, albeit with a time delay. We also consulted with our infrastructure partner, who confirmed that they had an underlying issue on their end. Thank you for your patience as we worked to fully resolve this issue.
POST-MORTEM
Root Cause Analysis
This incident was caused by an unexpected AWS infrastructure failure that affected a single compute instance.
Resolution
To resolve this issue, we restarted the affected compute instance, which migrated the service to a different, stable AWS host.
Remediation Items
- Ensure automatic restart of the LiveChat server if it is shut down by AWS, using the service that allows us to run code without provisioning or managing servers.
- Update the alerting system for more accurate notifications.
- Shorten the timing for the Chat Backfill mechanism.
- Conduct resilience testing on the fixes in partnership with the responsible team.
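As an illustration of the first remediation item, the following is a minimal sketch of the decision logic an automatic-restart handler might use. It is not the actual implementation. The event shape (a `detail` object carrying `instance-id` and `state`), the names `should_restart` and `handle_event`, the set of states treated as "down", and the injected `restart` callable are all assumptions made for this example.

```python
from typing import Any, Callable, Dict, Set

# Assumption: these instance states mean the server is down or going down.
DOWN_STATES = {"stopping", "stopped", "shutting-down", "terminated"}


def should_restart(state: str, watched: bool) -> bool:
    """Return True when a watched instance reports a down state."""
    return watched and state in DOWN_STATES


def handle_event(
    event: Dict[str, Any],
    watched_ids: Set[str],
    restart: Callable[[str], None],
) -> bool:
    """Inspect one state-change event and request a restart if needed.

    `restart` is a hypothetical callable supplied by the caller; how it
    brings the service back (same host or a new one) is out of scope here.
    Returns True if a restart was requested.
    """
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id", "")
    state = detail.get("state", "")
    if should_restart(state, instance_id in watched_ids):
        restart(instance_id)
        return True
    return False
```

Keeping the decision in a pure function and passing the restart action in as a parameter means the logic can be exercised in resilience tests without touching real infrastructure.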
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Jessica G.
Post-mortem published August 14, 2024.