On July 8, 2019, at 12:03 UTC (5:03 AM PT), Guide Search stopped returning results for a number of accounts on Pods 13, 14, 18, 19, and 20.
16:58 UTC | 09:58 PT
We're happy to report that the access and search issues affecting Guide for pods 13, 14, 18, 19, and 20 are now resolved.
16:06 UTC | 09:06 PT
Our mitigation efforts are continuing for the aforementioned issues affecting Guide customers, including Search functionality and general 500 errors. Performance should now be improving. We’ll continue to keep you updated!
14:41 UTC | 07:41 PT
We are actively investigating Guide Search on Pods 13 and 14, which is not returning results after our previous mitigation attempt. Next update in an hour.
13:56 UTC | 06:56 PT
We have successfully mitigated the Search issues in Guide on Pods 13, 14, and 20. We are still monitoring all Search activity, including on Pods 18 and 19. Please let us know if you are still having issues.
13:20 UTC | 06:20 PT
We are seeing improvements to Search for Guide customers on Pod 20. We’re continuing to work toward resolution for Pods 13 and 14. More information to come!
12:58 UTC | 05:58 PT
We are currently investigating Search issues in Guide on Pods 13, 14, and 20. Please bear with us; more information to come.
Root Cause Analysis
We determined that a change to Help Center's integration with our Search module, which introduced trace information logging, led Guide Search nodes to reject tasks because their worker resources were exhausted.
While restarting the Search clusters had a positive effect during the incident and autoscaling services helped add resources, the root cause was eventually identified as the introduction of Search trace information logging. The change has been rolled back, which restored the Search service.
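The failure mode described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not our actual Search implementation: the function name simulate, its parameters, and all of the numbers are hypothetical. It models a fixed set of workers fed by a bounded queue, where any task arriving while the queue is full is rejected, and it shows how a per-task cost increase (such as extra trace logging) can push a pool that was keeping up into rejecting work.

```python
from collections import deque


def simulate(ticks, arrivals_per_tick, workers, queue_capacity, cost_per_task):
    """Return (completed, rejected) for a hypothetical bounded worker pool.

    Each worker handles one task at a time, and a task takes
    `cost_per_task` ticks. Arriving tasks wait in a bounded queue;
    a task that arrives while the queue is full is rejected.
    """
    busy = [0] * workers   # remaining ticks of work for each worker
    queued = deque()       # pending tasks, stored as their cost in ticks
    completed = 0
    rejected = 0

    for _ in range(ticks):
        # New tasks arrive and are either queued or rejected.
        for _ in range(arrivals_per_tick):
            if len(queued) < queue_capacity:
                queued.append(cost_per_task)
            else:
                rejected += 1

        # Idle workers pick up queued tasks.
        for i in range(workers):
            if busy[i] == 0 and queued:
                busy[i] = queued.popleft()

        # Every busy worker makes one tick of progress.
        for i in range(workers):
            if busy[i] > 0:
                busy[i] -= 1
                if busy[i] == 0:
                    completed += 1

    return completed, rejected


# Assumed baseline: each task costs one tick, so 4 workers keep up with 4 arrivals.
baseline = simulate(ticks=100, arrivals_per_tick=4, workers=4,
                    queue_capacity=8, cost_per_task=1)

# Assumed overhead: the same load, but each task now costs two ticks.
with_overhead = simulate(ticks=100, arrivals_per_tick=4, workers=4,
                         queue_capacity=8, cost_per_task=2)
```

With these assumed numbers the baseline run completes all 400 tasks and rejects none, while the run with doubled per-task cost fills the queue and begins rejecting tasks. In this model, adding workers only helps if throughput rises to meet the arrival rate, whereas removing the extra per-task cost restores the baseline behavior.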
-Improved monitoring to cover this edge case.
-Improved alerting for earlier awareness of this type of scenario.
-Amended capacity processes so that a recurrence of this scenario does not result in customer impact.
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.