From May 24th, 2019 13:00 UTC to June 4th 20:14 UTC, customers across all pods were intermittently unable to load Zendesk Explore dashboards and received 5xx errors.
18:27 UTC | 11:27 PT
Our team is investigating performance issues with Explore dashboards, including error messages and general slowness. More information to come.
19:01 UTC | 12:01 PT
We are seeing improved performance in Explore after increasing our capacity. We are continuing to investigate the root cause.
19:46 UTC | 12:46 PT
Our investigation into the Explore performance issues is still underway. We’re evaluating several options to remedy the issue while we investigate the root cause.
20:29 UTC | 13:29 PT
Normal performance in Explore has been restored. Please let us know if you are still experiencing problems.
Root Cause Analysis
This incident was caused by an escaped defect that introduced an outdated JVM Docker image. The outdated image referenced fewer resources than were actually available, which created a backlog of queries and resulted in latency and intermittent failures when loading Explore dashboards.
To fix this issue, we rolled back to the previous production state.
- Continue to improve our monitoring tools for increased visibility into errors in the query engine.
- Investigate configuration optimization of query engine containers.
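The resource mismatch described in the root cause analysis is the kind of condition a startup check can surface. The following is a minimal sketch in Java, not the check Zendesk uses: the class name `JvmResourceCheck`, the expected heap and CPU values, and the 90% threshold are all hypothetical. It compares what the JVM reports through the standard `Runtime` API with what the container is assumed to provide, and prints a warning when the JVM sees noticeably less.

```java
public class JvmResourceCheck {

    // Returns true when the visible amount is below minFraction of the expected amount.
    public static boolean isUndersized(long visible, long expected, double minFraction) {
        return visible < expected * minFraction;
    }

    public static void main(String[] args) {
        // Hypothetical values: what the container is assumed to be provisioned with.
        long expectedHeapBytes = 4L * 1024 * 1024 * 1024; // 4 GiB of heap
        int expectedCpus = 4;
        double minFraction = 0.9; // tolerate up to 10% less than expected

        // What this JVM actually believes it can use.
        Runtime runtime = Runtime.getRuntime();
        long visibleHeapBytes = runtime.maxMemory();
        int visibleCpus = runtime.availableProcessors();

        System.out.printf("JVM max heap: %d bytes, CPUs: %d%n", visibleHeapBytes, visibleCpus);

        if (isUndersized(visibleHeapBytes, expectedHeapBytes, minFraction)) {
            System.out.println("WARNING: JVM max heap is smaller than expected");
        }
        if (isUndersized(visibleCpus, expectedCpus, minFraction)) {
            System.out.println("WARNING: JVM sees fewer CPUs than expected");
        }
    }
}
```

Running a check like this when a container starts, and sending its output to monitoring, is one possible way to notice an image that sizes itself for fewer resources than the host provides before queries begin to back up.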
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.