SUMMARY
On February 15, 2022, from 05:40 UTC to 14:00 UTC, Explore customers hosted in our EU data centers experienced delays in dataset updates.
Timeline
12:49 UTC | 04:49 PT
Our team is investigating reports of datasets not updating in Explore for our customers hosted on our EU infrastructure. Further updates to follow.
13:15 UTC | 05:15 PT
Our team continues to investigate the root cause of datasets not updating in Explore for our customers hosted on our EU infrastructure. We will provide a further update in 30 minutes.
13:54 UTC | 05:54 PT
We have identified the root cause of the issue causing dataset delays for our Explore customers hosted on our EU infrastructure, and have implemented a fix. We are monitoring the behaviour and will provide further updates within the hour.
14:49 UTC | 06:49 PT
Our team continues to monitor the recovery from the issue causing dataset update delays in Explore for our customers hosted on our EU infrastructure. Processing of the backlogged data continues, and we will provide a final update once it has finished.
17:09 UTC | 09:09 PT
We are happy to report that the issue causing dataset update delays in Explore for our customers hosted on our EU infrastructure is resolved, and all backlogged data has been processed. Thank you for your patience.
POST-MORTEM
Root Cause Analysis
This incident was caused by the write pipelines becoming jammed for all Explore accounts hosted in one of the EU clusters. Because the external system’s clusters limit how many operations can run concurrently, the jammed pipelines prevented new pipelines from writing to these clusters.
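The blocking effect described above can be illustrated with a minimal sketch. It assumes the concurrency limit behaves like a fixed pool of write slots; the class name `WriteSlots`, the limit of 2, and the method names are hypothetical and do not describe how the external system actually enforces its limit.

```python
import threading


class WriteSlots:
    """Hypothetical model of a cluster that allows a fixed number of concurrent writes."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_start(self) -> bool:
        # Non-blocking: returns False when every slot is already held.
        return self._slots.acquire(blocking=False)

    def finish(self) -> None:
        # Called when a pipeline completes or is terminated.
        self._slots.release()


cluster = WriteSlots(limit=2)  # assumed limit, for illustration only

# Two pipelines start and never finish, so they keep holding their slots.
jammed = [cluster.try_start(), cluster.try_start()]

# A new pipeline cannot start because no slot is free.
new_pipeline_started = cluster.try_start()

# Terminating one jammed pipeline frees a slot, and new work can proceed.
cluster.finish()
started_after_termination = cluster.try_start()
```

In this model, a jammed pipeline is indistinguishable from a busy one as far as the slot pool is concerned, which is why new writes stay blocked until the jammed ones are terminated.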
Resolution
To fix this issue, the team terminated the jammed write pipelines on all clusters. The pending queries then began making progress immediately. After a few hours, all of them had completed and data sync resumed normally.
Remediation Items
- Redistribute accounts as evenly as possible to reduce the peak concurrent load on the databases. [Done]
- Follow up with the external system to determine why the queries were blocked in the EU. [In Progress]
- Add a timeout to the data warehouse service’s copy queries. [To Do]
- Implement alerts on jammed queries in the data warehouse service. [To Do]
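As an illustration of the timeout and alerting items above, the following is a minimal sketch of how long-running queries might be flagged. It assumes a list of running queries with start times is available; the `RunningQuery` type, the `find_jammed` helper, the one-hour threshold, and the sample records are all hypothetical and are not taken from the data warehouse service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RunningQuery:
    """Hypothetical record of a query currently running on a cluster."""
    query_id: str
    cluster: str
    started_at: datetime


def find_jammed(queries, now, timeout):
    """Return the queries that have been running longer than `timeout`."""
    return [q for q in queries if now - q.started_at > timeout]


# Made-up sample data for illustration only; not taken from the incident.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
queries = [
    RunningQuery("copy-1", "eu-a", now - timedelta(hours=6)),
    RunningQuery("copy-2", "eu-a", now - timedelta(minutes=5)),
]

# Assumed threshold: treat any copy query older than one hour as jammed.
jammed = find_jammed(queries, now, timeout=timedelta(hours=1))
for q in jammed:
    print(f"ALERT: {q.query_id} on {q.cluster} has run for {now - q.started_at}")
```

The same age check could drive either action: raising an alert, or cancelling the query once it exceeds the timeout. Which threshold is appropriate depends on how long healthy copy queries normally take.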
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Post-mortem published March 11, 2022.