SUMMARY
On February 8, 2021, from 09:10 UTC to 12:41 UTC, customers experienced errors accessing Support, Explore, Talk, and Guide in Pod 18. Additionally, some Agent Workspace customers reported that they were unable to start new chats.
Timeline
12:47 UTC | 04:47 PT
We are happy to report the outage on Pod 18 has now been resolved. We have re-enabled the automatic ticket view counts. Please hard refresh your browser if you do not see improvements. Apologies for the inconvenience caused.
11:36 UTC | 03:36 PT
We are happy to report the outage on Pod 18 has now been resolved. Automatic ticket view counts remain temporarily disabled, so please continue to select views manually for accurate counts. Thank you for your patience and please let us know if you see any further issues.
10:54 UTC | 02:54 PT
Pod 18 remains stable. We have temporarily disabled automatic ticket view counts. Please manually select ticket views for up-to-date counts. We will provide an update when this feature is re-enabled. We are also investigating performance issues on Pod 17 and will provide updates.
10:29 UTC | 02:29 PT
We are beginning to see some stability for the outage on Pod 18. We will continue to provide updates.
09:57 UTC | 01:57 PT
We are continuing to investigate the outage impacting customers on Pod 18, along with new chat requests for customers on Agent Workspace. We will continue to provide updates.
09:38 UTC | 01:38 PT
Pod 18 customers may not be able to access Support, Talk & Guide. We’re also aware that new chats cannot be initiated for Agent Workspace customers. Our teams are working to understand the cause of the incident. We sincerely apologise for the inconvenience caused.
09:25 UTC | 01:25 PT
We have been alerted about a potential outage impacting Pod 18 customers. Our teams are working to identify impact. We will provide a further update in 15 minutes.
Root Cause Analysis
The incident was caused by high load on the application servers that handle View requests. A higher-than-expected request volume from account automations, coupled with automatic retries on upstream failures, overwhelmed the Views system, which was holding open connections with Support. This prevented new requests from being served promptly. The overloaded View application servers increased pressure on the compute cluster operating many of the services used by Support, Chat, Explore, Guide, and Talk.
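To illustrate why automatic retries can turn a traffic spike into an overload, the following is a minimal sketch of retry amplification. It is not taken from the Views system: the function name `effective_load`, its parameters, and the simplifying assumption that every attempt fails independently with the same probability are all hypothetical.

```python
def effective_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Expected requests per second reaching the backend, retries included.

    Hypothetical model: each attempt fails independently with probability
    `failure_rate`, and a failed attempt is retried up to `max_retries` times.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k (k = 0 is the original request) is only made when all k
    # previous attempts failed, which happens with probability failure_rate ** k.
    return base_rps * sum(failure_rate ** k for k in range(max_retries + 1))
```

Under this model, a backend that is already failing every request receives `max_retries + 1` times the original volume, so the retries add load exactly when the system is least able to absorb it.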
Resolution
At 09:40 UTC, the team disabled View counts, which helped reduce overall load, allowing us to scale the number of application servers and recover most Support capabilities.
At 09:58 UTC, services in Pod 18 were fully restored. However, the services were disrupted again for approximately 30 minutes when the View count feature was re-enabled. This was due to an error in the mechanism that throttles this feature to a specific percentage. While that did not affect any other Support features, View counts were disabled again and more application servers were added to handle the increased workload required to enable all View counts. During this time, View counts would have been intermittently unavailable.
At 12:41 UTC, the View count feature was fully restored.
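The post-mortem does not describe how the percentage throttle works or what the error in it was. As a general illustration only, one common way to enable a feature for a fixed percentage of accounts is deterministic hash bucketing, sketched below. The names `rollout_bucket` and `view_counts_enabled` are hypothetical, as is the choice of keying on an account identifier.

```python
import hashlib


def rollout_bucket(account_id: str) -> int:
    """Map an account to a stable bucket in the range 0..99 (hypothetical helper)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def view_counts_enabled(account_id: str, percent: int) -> bool:
    """Return True if the feature is on for this account at the given rollout percent."""
    # Reject out-of-range values instead of silently enabling for everyone.
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(account_id) < percent
```

Because the bucket depends only on the account identifier, raising the percentage keeps every previously enabled account enabled, which makes a progressive rollout predictable and easy to stress test step by step.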
Remediation Items
- Improve CPU utilization monitoring and alerting for the Views API servers
- Review and further optimize performance for View Counts requests
- Perform stress tests to confirm progressive View Counts enablement
- Re-evaluate Views load trends and adjust the capacity baseline
- Re-evaluate the View Counts circuit breaker strategy to react faster during such failure modes
- Implement fine-grained control over which requests can have automatic retries in the event of failure
- Adjust the auto-scaling policy for the Views API fleet to take CPU levels as a signal to scale
- Implement automatic retries for ticket creation after service stabilization
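For readers unfamiliar with the circuit breaker pattern mentioned in the remediation items, the following is a minimal generic sketch. It is not the View Counts implementation: the class name, the thresholds, and the injected `clock` parameter are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical, not the production design)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        """Closed circuit: always allow. Open circuit: allow only after the timeout."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """A successful call closes the circuit and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A lower `failure_threshold` makes the breaker react faster to an overloaded dependency, at the cost of tripping more often on brief, harmless error bursts. Tuning that trade-off is the kind of decision a strategy review would cover.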
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Post-Mortem published February 11, 2021