SUMMARY
On March 28, 2022, from 13:00 UTC to 17:20 UTC, Explore customers on US-hosted Pods experienced slowness and may have encountered errors or been unable to load Explore when trying to view or work with dashboards and queries.
Timeline
14:43 UTC | 07:43 PT
Our teams are investigating reports of slowness, and potentially errors, when viewing or working with dashboards and queries in Explore for our customers hosted outside of the EU. More information to follow.
15:12 UTC | 08:12 PT
Our engineers have identified the root cause of the slowness and errors when viewing or working with dashboards and queries in Explore for our customers hosted outside of the EU. We will provide a further update within 30 minutes.
15:42 UTC | 08:42 PT
We are beginning to see improvement in loading Explore queries and dashboards. We are monitoring the situation and will provide additional updates as it resolves.
16:12 UTC | 09:12 PT
We are still seeing some issues loading Explore queries and dashboards for customers based outside of the EU. Our team is investigating a potential solution, and we will provide another update as soon as we can.
16:42 UTC | 09:42 PT
Our team continues to work towards a potential solution for the sluggish behavior in Explore for non-EU customers, but it is taking some time. We will provide additional updates in an hour, or as soon as we have new information to share.
17:31 UTC | 10:31 PT
We are seeing improvement in the ability to load queries and dashboards for Explore customers based outside of the EU. Please let us know if you continue to experience any issues.
19:39 UTC | 12:39 PT
We are happy to report that the ability to load queries and dashboards for Explore customers based outside of the EU has been restored.
POST-MORTEM
Root Cause Analysis
This incident was caused by insufficient system capacity: a deploy and a backfill job running simultaneously consumed all available capacity, which degraded the performance of the main API for the Explore service.
Resolution
To fix this issue, the backfill job was stopped and capacity was increased to accommodate both the infrastructure changes being rolled out and incoming requests to the service.
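As an illustration of the kind of safeguard this points toward, below is a minimal sketch of a backfill loop that yields to foreground traffic when host CPU is hot. It is written in Python against the third-party psutil library; the process_batch() helper, thresholds, and intervals are hypothetical and for illustration only, not Zendesk's actual implementation.

```
import time

import psutil  # third-party library for reading host CPU utilization

CPU_PAUSE_THRESHOLD = 80.0   # illustrative: pause the backfill above 80% CPU
CHECK_INTERVAL_SECONDS = 30  # illustrative: how long to wait before rechecking


def process_batch(batch):
    # Placeholder for the real backfill work (e.g., writing recomputed rows).
    pass


def run_backfill(batches):
    """Process backfill batches, backing off while the host is under load."""
    for batch in batches:
        # Sample CPU over one second; wait while utilization is high so the
        # main API keeps headroom during deploys or traffic spikes.
        while psutil.cpu_percent(interval=1) > CPU_PAUSE_THRESHOLD:
            time.sleep(CHECK_INTERVAL_SECONDS)
        process_batch(batch)
```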
Remediation Items
- Improve alerting and monitoring for CPU utilization. [To Do]
- Resume capacity planning for the main API for Explore (see the sizing sketch after this list). [Ongoing]
- Revisit Backfill operations execution and requirements. [To Do]
- Update instance types to match the scaled-up capacity. [To Do]
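To make the capacity-planning and instance-type items concrete, a back-of-the-envelope sizing calculation along these lines can check that peak traffic, a running backfill, and a rolling deploy all fit within provisioned headroom. All figures below are assumptions for the example, not actual Explore numbers.

```
import math

# Illustrative inputs only; not actual Explore figures.
peak_rps = 1200            # expected peak requests per second to the main API
per_instance_rps = 150     # sustainable requests per second per instance
backfill_overhead = 0.25   # extra load fraction while a backfill is running
deploy_unavailable = 0.10  # share of the fleet out of rotation mid-deploy
target_utilization = 0.70  # keep instances below 70% utilization at peak

# Instances needed so peak traffic plus backfill load stays under the target
# utilization even while a rolling deploy takes some capacity out of rotation.
effective_rps = peak_rps * (1 + backfill_overhead)
usable_per_instance = per_instance_rps * target_utilization * (1 - deploy_unavailable)
required_instances = math.ceil(effective_rps / usable_per_instance)

print(f"Provision at least {required_instances} instances")
```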
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Post-mortem published April 7, 2022.