02:13 UTC | 18:13 PT
We're happy to confirm that the root cause of the issue impacting Pod 13 has been identified. If you're still experiencing issues, please let us know. Thank you for your patience!
00:57 UTC | 16:57 PT
Between 00:21 UTC and 00:25 UTC we experienced a brief network disruption impacting Pod 13. The issue has been resolved; however, our teams continue to investigate to identify the root cause. If you are still experiencing any issues, please don't hesitate to contact us.
POST-MORTEM

On November 21, 2019 from 00:20 UTC to 00:25 UTC, Zendesk experienced an outage impacting Support and Chat on Pod 13. Zendesk Support agents saw error screens and were unable to use the product for the duration of the incident. Chat agents maintained their existing chat sessions; however, no new Chat agent logins would have succeeded.
ROOT CAUSE ANALYSIS

This issue occurred when Kubernetes API proxy servers were cycled as part of an alert cleanup and service discovery failed. The proxies were shut down in parallel and took time to be replaced, leaving Nginx unable to communicate with many of the services running on Kubernetes in Pod 13. The incident was resolved once new Kubernetes API proxy servers were spun up.
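To illustrate the failure mode, here is a minimal sketch of service discovery over a proxy pool; the pool structure, addresses, and health flags are illustrative assumptions, not the actual Pod 13 configuration. When every proxy instance is cycled at once, the pool empties and every routed request fails until replacements come up:

```go
// Sketch of the failure mode: a front end (standing in for Nginx)
// forwards requests to whichever Kubernetes API proxy instances are
// currently healthy. If every instance is cycled at the same time,
// the pool is empty and all requests fail until replacements appear.
package main

import (
	"errors"
	"fmt"
)

// APIProxy is a hypothetical stand-in for one Kubernetes API proxy host.
type APIProxy struct {
	Addr    string
	Healthy bool
}

// pickHealthy returns the first healthy proxy, or an error when none
// are available -- the state Pod 13 was in during the incident.
func pickHealthy(pool []APIProxy) (*APIProxy, error) {
	for i := range pool {
		if pool[i].Healthy {
			return &pool[i], nil
		}
	}
	return nil, errors.New("no healthy Kubernetes API proxies: upstream unreachable")
}

func main() {
	pool := []APIProxy{
		{Addr: "10.0.0.1:6443", Healthy: true},
		{Addr: "10.0.0.2:6443", Healthy: true},
	}

	// Normal operation: at least one healthy proxy serves traffic.
	if p, err := pickHealthy(pool); err == nil {
		fmt.Println("routing via", p.Addr)
	}

	// Parallel rotation shuts every instance down at the same time...
	for i := range pool {
		pool[i].Healthy = false
	}

	// ...so service discovery has nothing to return and requests error out.
	if _, err := pickHealthy(pool); err != nil {
		fmt.Println("request failed:", err)
	}
}
```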
REMEDIATION ITEMS

To prevent this from happening again, we will:
- Improve our service discovery and instance rotation workflow to prevent parallel rotation.
- Create alerts for when the proxy host group has too few instances.
- Increase the minimum number of API proxy instances.
- Investigate automating the instance rotation workflow to prevent parallel rotation.
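As a rough illustration of the rotation and alerting items above, the following sketch rotates instances serially and flags an undersized host group. The instance names, threshold, and health-check stub are assumptions for illustration, not our production tooling:

```go
// A minimal sketch of two remediation ideas: rotate instances one at a
// time (the replacement must pass a health check before the old host is
// retired), and alert whenever the proxy host group drops below a
// minimum size.
package main

import "fmt"

const minProxyInstances = 2 // assumed floor for the host group

type Instance struct {
	ID      string
	Healthy bool
}

// countHealthy reports how many instances in the group are healthy.
func countHealthy(group []Instance) int {
	n := 0
	for _, inst := range group {
		if inst.Healthy {
			n++
		}
	}
	return n
}

// checkGroupSize is the alerting hook: fire when too few instances remain.
func checkGroupSize(group []Instance) {
	if countHealthy(group) < minProxyInstances {
		fmt.Println("ALERT: proxy host group below minimum size")
	}
}

// waitUntilHealthy stands in for a real readiness probe on the new host.
func waitUntilHealthy(inst *Instance) {
	inst.Healthy = true
}

// rotateSerially replaces instances one at a time: the replacement comes
// up and passes its health check before the old instance is retired, so
// the group never empties out the way a parallel rotation can.
func rotateSerially(group []Instance) []Instance {
	for i, old := range group {
		replacement := Instance{ID: old.ID + "-new"}
		waitUntilHealthy(&replacement) // block until the new host is ready
		group[i] = replacement         // only then retire the old instance
		checkGroupSize(group)          // alert if the group ever runs thin
	}
	return group
}

func main() {
	group := []Instance{
		{ID: "proxy-a", Healthy: true},
		{ID: "proxy-b", Healthy: true},
		{ID: "proxy-c", Healthy: true},
	}
	for _, inst := range rotateSerially(group) {
		fmt.Println(inst.ID, "healthy:", inst.Healthy)
	}
}
```

Because the group loses at most one instance at a time, this style of rotation avoids the all-proxies-down window described in the root cause analysis.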
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.