SUMMARY
On January 16, 2025, from 09:40 UTC to 10:47 UTC, some Chat customers on Pod 19 experienced issues viewing recent chats, receiving chat export emails, and creating tickets from chats.
TIMELINE
January 16, 2025 11:26 AM UTC | January 16, 2025 03:26 AM PT
We are pleased to inform you that the issues affecting our Chat service for our customers on Pod 19 have now been resolved. We sincerely appreciate your patience and understanding during this time.
January 16, 2025 11:00 AM UTC | January 16, 2025 03:00 AM PT
We have made significant progress in recovering functionality, including the ability to view recent chats, receive chat export emails, and create tickets. We will continue to monitor the situation closely and work diligently to enhance your experience. Thank you for your patience and understanding during this time.
January 16, 2025 10:39 AM UTC | January 16, 2025 02:39 AM PT
We are currently experiencing an issue with our chat services on Pod 19, which may prevent you from viewing recent chats, receiving chat export emails, and creating tickets. Our team is actively working to resolve these problems as quickly as possible. Thank you for your patience.
POST-MORTEM
Root Cause Analysis
This incident was caused by a chat service reaching its memory limits, which led to a continuous restart cycle. Each restart generated additional metadata in our in-memory database, causing memory bloat until the system eventually ran out of memory, impacting other services that shared the same database instance.
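As a rough illustration of why a restart loop like this eventually exhausts a shared database, the following back-of-the-envelope model may help. It is only a sketch, not Zendesk's tooling: it assumes a fixed database capacity, a steady baseline of legitimate data, and a constant amount of leftover metadata per restart. The function name and all figures are hypothetical.

```python
import math


def restarts_until_exhaustion(capacity_mb: float,
                              baseline_mb: float,
                              leftover_per_restart_mb: float) -> int:
    """Return the number of restarts after which leftover metadata
    fills the remaining headroom of the in-memory database.

    All parameters are illustrative assumptions, not measured values.
    """
    if leftover_per_restart_mb <= 0:
        raise ValueError("leftover_per_restart_mb must be positive")

    headroom_mb = capacity_mb - baseline_mb
    if headroom_mb <= 0:
        # The database is already full before any restart happens.
        return 0

    # Each restart adds a fixed amount of metadata that is never cleaned up,
    # so usage grows linearly with the restart count.
    return math.ceil(headroom_mb / leftover_per_restart_mb)


if __name__ == "__main__":
    # Hypothetical figures: 1000 MB capacity, 600 MB in normal use,
    # 8 MB of metadata left behind by every restart.
    print(restarts_until_exhaustion(1000, 600, 8))
```

Under these assumptions the time to exhaustion depends only on the headroom and the restart rate, which is why a tight restart loop can fill a shared instance quickly and affect every service that uses it.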
Resolution
To resolve the issue, the team removed unnecessary metadata and unacknowledged keys from the database to free up memory. The team also upgraded to larger instance types to accommodate the load and completed a successful deployment of the service.
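The cleanup step can be pictured with a small sketch. The report does not name the database or describe its key layout, so this example models the store as a plain Python dictionary and assumes a hypothetical key format of `meta:<instance_id>:<field>`, where an entry is stale once its owning service instance is no longer running. None of these names come from the incident report.

```python
from typing import Dict, Set


def prune_stale_metadata(store: Dict[str, bytes],
                         live_instances: Set[str],
                         prefix: str = "meta:") -> int:
    """Delete metadata entries owned by instances that are no longer live.

    Returns the number of value bytes freed. The key layout
    'meta:<instance_id>:<field>' is an assumption made for illustration.
    """
    freed_bytes = 0
    # Iterate over a snapshot of the keys so entries can be removed safely.
    for key in list(store):
        if not key.startswith(prefix):
            continue  # Not metadata; leave application data untouched.
        instance_id = key[len(prefix):].split(":", 1)[0]
        if instance_id not in live_instances:
            freed_bytes += len(store.pop(key))
    return freed_bytes


if __name__ == "__main__":
    store = {
        "meta:a:heartbeat": b"xxxx",  # left behind by a restarted instance
        "meta:b:heartbeat": b"yy",    # owned by a running instance
        "chat:42": b"z",              # regular application data
    }
    print(prune_stale_metadata(store, live_instances={"b"}))
    print(sorted(store))
```

Matching on an explicit prefix keeps the cleanup from touching data that belongs to other services sharing the same instance.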
Remediation Items
- Add Alerts: Implemented alerts for Out of Memory (OOM) conditions in the chat service.
- Adjust Memory Limits: Lowered the threshold for memory limits to allow for earlier intervention before reaching critical levels.
- Runbook Improvements: Enhanced documentation and runbooks for handling the chat service and database key management.
- Database Clustering: Plan to separate the database instances used by different services so that memory exhaustion caused by one service no longer affects the others.
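The alerting and threshold items above can be illustrated with a minimal sketch of a two-level memory check. The ratios, level names, and function name are assumptions chosen for illustration; the report does not state the actual thresholds or the monitoring system in use.

```python
def memory_alert_level(used_bytes: int,
                       limit_bytes: int,
                       warn_ratio: float = 0.75,
                       critical_ratio: float = 0.90) -> str:
    """Classify memory usage as 'ok', 'warning', or 'critical'.

    The default ratios are hypothetical. A warning level set well below
    the hard limit leaves time to intervene before an out-of-memory event.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    if not 0 < warn_ratio < critical_ratio <= 1:
        raise ValueError("expected 0 < warn_ratio < critical_ratio <= 1")

    usage_ratio = used_bytes / limit_bytes
    if usage_ratio >= critical_ratio:
        return "critical"
    if usage_ratio >= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for used in (700, 800, 950):
        print(used, memory_alert_level(used, 1000))
```

Lowering `warn_ratio` trades a few more early notifications for a longer window in which operators can act before the limit is reached.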
FOR MORE INFORMATION
For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.
1 comment
Bob Novak
Post-mortem published January 29, 2025