On November 15, 2021, from 10:13 UTC to 11:19 UTC, Zendesk Chat customers on Pod 17 experienced delays in chats being served, and some chats were not served at all.
11:04 UTC | 03:04 PT
Our team is investigating delays in chats being served for some of our customers on POD 17. More details to follow.
11:23 UTC | 03:23 PT
Our investigation continues into the root cause of delays in chats being served for some of our customers on POD 17. We will provide another update in 30 minutes.
11:57 UTC | 03:57 PT
We’re happy to confirm that the issue causing delays in chats being served on accounts in POD 17 has been resolved. Thank you for your patience.
Root Cause Analysis
This incident was caused by a performance regression in the Account Service, where the memory recycling mechanism displayed unusually high activity, adding latency to the Account Service API endpoints. While we had automated contingencies in place, these did not work as designed and failed to prevent the latency degradation.
To fix this issue, we rolled back the Account Service to a previous stable version and increased its capacity. Recovery was observed shortly after.
Remediation Items
- Review autoscaling of the Account Service [In progress]
- Add and fix Account Service alerts [Scheduled]
- Improve monitoring mechanisms of these alerts [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.