SUMMARY
On September 9, 2020, from 11:19 UTC to 13:55 UTC, Zendesk Support, Guide, Talk, and Explore customers on Pods 17 and 18 experienced server errors, an inability to access their accounts, or degraded performance within their accounts.
TIMELINE
18:58 UTC | 11:58 PT
We are happy to report that the mitigation work on the backend for this incident affecting Pods 17 and 18 has been completed. As mentioned before, all customer impact should have been resolved as of 13:55 UTC.
17:47 UTC | 10:47 PT
Customers on Pod 18 should no longer be experiencing performance issues. We are continuing to work on the backend to ensure the remaining issues are mitigated. We’ll provide an update when we officially clear the incident.
15:23 UTC | 08:23 PT
We are seeing notable improvements for customers on Pod 18. Our teams are continuing to work on the backend to ensure the remaining issues are mitigated, so we're not completely clear, but we will provide another update when we have more information on that mitigation work.
14:28 UTC | 07:28 PT
We're continuing to work on improving the performance on Pod 18. We greatly appreciate your patience as we work hard to get things back to a normal state. We'll continue to provide updates as soon as we have more to report.
13:56 UTC | 06:56 PT
Pod 17 remains stable. Pod 18 is online in a degraded state. Our teams are working to fully restore service. We sincerely apologise for the disruption this has caused to your Zendesk service.
13:18 UTC | 06:18 PT
We continue to see stability on Pod 17. We also see improvements on Pod 18. We’ll provide an update once available. Thanks for your ongoing patience.
12:46 UTC | 05:46 PT
We’re seeing some improvements on Pod 17. We are still working towards a full resolution for Pod 18. Please bear with us as we work to fully resolve this issue.
12:18 UTC | 05:18 PT
Our teams continue to work on mitigating the outage, which is currently impacting customers on Pods 17 and 18. We truly apologise for the inconvenience.
11:48 UTC | 04:48 PT
Our teams are investigating an outage which is impacting Pods 17 and 18. We will provide further updates ASAP.
POST-MORTEM
Root Cause Analysis
This incident was caused by a scheduled change to our production environment that affected the servers providing internal DNS resolution for our services. The change applied Linux kernel updates and was expected to be routine, so engineering staff performed it during normal business hours.
The change was rolled out progressively, initially targeting the internal DNS servers used by Zendesk hosts in the EU region (Pods 17 and 18). Each of the DNS servers rebooted successfully in sequence. Shortly afterwards, our engineering team observed a critical service on these servers beginning to fail: the Linux kernel security module was blocking the service because of changes in the kernel update. Editing the service's configuration allowed it to start, and DNS resolution began functioning again.
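The failure described above, where hosts reboot cleanly but the service on them stops answering, is the kind of problem a per-host health gate in a rollout is meant to catch. The following is a minimal sketch only, not Zendesk's tooling: `apply_update` and `resolves` are hypothetical callables standing in for whatever performs the kernel update and whatever probes DNS resolution on a host.

```python
from typing import Callable, Iterable, List


class RolloutHalted(Exception):
    """Raised when a host fails its health check after the update."""


def rolling_update(
    hosts: Iterable[str],
    apply_update: Callable[[str], None],  # hypothetical: patch and reboot one host
    resolves: Callable[[str], bool],      # hypothetical: True if the host answers DNS queries
) -> List[str]:
    """Update hosts one at a time, stopping at the first unhealthy host."""
    done: List[str] = []
    for host in hosts:
        apply_update(host)
        if not resolves(host):
            # Stop before touching the remaining hosts so they keep serving.
            raise RolloutHalted(
                f"{host} failed DNS check after update; healthy so far: {done}"
            )
        done.append(host)
    return done
```

Because the service in this incident began failing shortly after the reboot rather than immediately, a real probe would likely need to observe each host for some time before the rollout moves on.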
Our efforts to restore service succeeded for Pod 17, but customers on Pod 18 continued to feel impact due to: (1) a DNS configuration dependency in a core application within our infrastructure, and (2) a capacity shortfall in our views service that prevented it from handling the thundering herd of requests generated by the internal DNS recovery.
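A thundering herd arises when many callers retry at the same moment once a dependency recovers. One common client-side mitigation, which the post-mortem does not say was in use, is exponential backoff with jitter, so that retries are spread over time instead of arriving together. A minimal sketch, with `base` and `cap` as assumed illustrative values:

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.5,   # assumed ceiling for the first retry, in seconds
    cap: float = 30.0,   # assumed upper bound on any single delay
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized delay per retry (exponential backoff, full jitter)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly between zero and the ceiling.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Each caller would sleep for the next delay before retrying; because every caller draws its own random values, the recovered service sees a spread-out trickle rather than a single spike.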
Resolution
Normal performance was restored to Pod 17 at 12:27 UTC, when internal DNS resolution was restored. The subsequent issues on Pod 18 were resolved by 13:55 UTC, at which point performance had fully recovered.
Remediation Items
- Ensure our Linux kernel security module is properly configured for all internal DNS hosts.
- Review and improve monitoring and alerting on critical services impacted.
- Decommission legacy DNS infrastructure in favor of AWS native services (Route 53).
- Configure internal DNS servers to have cross-region redundancy.
- Automate a circuit breaker to protect the Zendesk views execution service.
- During an incident, make sure all affected products and sub-products are identified and posted to the Zendesk Status Page.
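For readers unfamiliar with the circuit breaker named in the remediation list, the pattern rejects calls to a struggling service for a cool-down period after repeated failures, then lets a trial call through to test recovery. The sketch below is a generic, single-threaded illustration under assumed thresholds; it is not the design Zendesk implemented, and every name in it is hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 5,       # assumed: consecutive failures before opening
        reset_after: float = 30.0,   # assumed: cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("breaker open; call rejected")
            # Cool-down elapsed: fall through and allow a trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the protected service in `breaker.call(...)` and treat `CircuitOpenError` as a signal to degrade gracefully or retry later.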
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Post-mortem published September 17, 2020.