From 22:07 UTC on June 5, 2019 to 02:00 UTC on June 6, 2019, customers on Pod 12 were unable to access Zendesk Support, Guide & Talk.
22:45 UTC | 15:45 PT
We are currently investigating performance issues impacting Zendesk Support, Guide & Talk on Pod 12.
23:14 UTC | 16:14 PT
We are continuing to investigate the performance issues impacting Zendesk Support, Guide & Talk on Pod 12.
23:51 UTC | 16:51 PT
We are currently working to resolve the outage on Pod 12. We have rolled back some changes and are continuing to investigate.
00:17 UTC | 17:17 PT
Our teams are continuing to investigate and work towards resolution for the Pod 12 outage. Next update in 1 hour.
01:17 UTC | 18:17 PT
Investigation is still ongoing for the Pod 12 outage. Next update in 1 hour.
02:08 UTC | 19:08 PT
We are seeing recovery on Pod 12, and email channel tickets may take some time to be processed. Updates to follow. Thanks for your patience while we work towards resolution.
02:33 UTC | 19:33 PT
Services on Pod 12 have been fully restored. Postmortem will be posted here: https://support.zendesk.com/hc/en-us/articles/360024206974-Service-Incident-June-5th-2019
Root Cause Analysis
This incident was caused by an incorrect configuration in the deployment code. The configuration was applied to the Elastic Network Interfaces (ENIs) used by the AWS Container Network Interface (CNI) plugin and prevented containers from connecting to the network.
To fix this issue, we first rolled back the problematic configuration. We then drained the affected nodes on Pod 12 until all were cleared, ensuring that no improperly configured ENIs remained.
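As an illustration of the drain step described above, here is a minimal sketch that builds the commands for clearing a set of affected nodes. It assumes a Kubernetes cluster managed with `kubectl`, which this report does not state. The function name `drain_commands`, the node names, and the timeout value are hypothetical, and this is not a description of the procedure Zendesk actually ran.

```python
import shlex
from typing import List


def drain_commands(nodes: List[str], timeout_seconds: int = 300) -> List[List[str]]:
    """Build kubectl commands to clear a set of affected nodes.

    Assumption: the nodes belong to a Kubernetes cluster and kubectl is
    the tool used to manage it. Nothing is executed here; the function
    only returns the argument lists.
    """
    commands: List[List[str]] = []

    # Cordon every affected node first, so that pods evicted from one
    # affected node are not rescheduled onto another affected node.
    for node in nodes:
        commands.append(["kubectl", "cordon", node])

    # Then drain the nodes one at a time. DaemonSet-managed pods cannot
    # be evicted by drain, so they are explicitly ignored.
    for node in nodes:
        commands.append([
            "kubectl", "drain", node,
            "--ignore-daemonsets",
            f"--timeout={timeout_seconds}s",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for cmd in drain_commands(["node-a", "node-b"]):
        print(shlex.join(cmd))
```

Returning the commands instead of running them keeps the sketch safe to review before anything touches a cluster. An operator could pass each list to `subprocess.run` once the node list has been verified.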
Remediation Items
- Investigate new tools to identify applications and nodes that are being rate-limited
- Create a runbook that thoroughly documents how to troubleshoot this type of incident
- Update the implementation process and increase awareness of it
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.