SUMMARY
From November 9, 2023, 3:00 AM UTC to November 10, 2023, 10:00 AM UTC, Explore customers in the US region (multiple Pods) experienced data refresh delays for historic datasets.
Timeline
19:18 UTC | 11:18 PT (Nov 9)
Explore accounts in the US region are currently experiencing data refresh delays for historic datasets. The Explore query and reporting features are available as normal, and Realtime datasets are not affected. Investigation into the root cause is still ongoing so we do not have an ETA for when the delay will be resolved.
11:27 UTC | 03:27 PT (Nov 10)
We have implemented a fix for the data refresh delays for historic datasets in the Explore US region and customers should no longer experience any delays. We will now consider this incident as solved. Thank you for your collaboration.
POST-MORTEM
Root Cause Analysis
The incident was caused by the Account Statistics service failing to provide correct information about account data volumes: multiple accounts were reported as smaller than their actual size. This, in turn, caused the Explore ETL systems to allocate fewer compute resources than needed to process regular delta updates for these accounts, resulting in slower processing and, in some cases, failed or timed-out data pipelines.
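The failure mode described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that resources are allocated by looking up the reported account size in a table of size tiers. The tier boundaries, worker counts, and the `workers_for` function are hypothetical and are not taken from the Explore ETL systems.

```python
# Hypothetical size tiers: (upper bound on reported row count, workers allocated).
# These thresholds are illustrative assumptions, not values from the incident.
TIERS = [
    (1_000_000, 2),
    (50_000_000, 8),
    (float("inf"), 32),
]


def workers_for(reported_rows: int) -> int:
    """Pick a worker count from the size reported for an account."""
    for upper_bound, workers in TIERS:
        if reported_rows <= upper_bound:
            return workers
    raise ValueError("unreachable: the last tier is unbounded")


# An account that really holds 120M rows but is reported as holding 400K.
actual_rows = 120_000_000
under_reported_rows = 400_000

correct_allocation = workers_for(actual_rows)
faulty_allocation = workers_for(under_reported_rows)

print(f"workers needed: {correct_allocation}, workers allocated: {faulty_allocation}")
```

In this sketch the allocator itself behaves correctly. The under-allocation comes entirely from the wrong input, which is why the delay showed up downstream in the pipelines rather than in the statistics service.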
Resolution
To fix this issue, the Account Statistics service was temporarily switched to a fallback data source with correct account data. This restored correct resource allocation in the Explore ETL systems. Explore engineers also manually boosted resources to unblock accounts with long data processing delays.
Remediation Items
- Create additional alerts for spikes in account size classification changes.
- Investigate replacing pipeline resource allocation.
- Add a validation phase to data pipelines to detect missing or anomalous source data.
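As a rough sketch of what the first and third remediation items could look like, the example below compares two snapshots of reported account sizes. It flags a spike when too many accounts change size class at once, and it flags individual readings that are missing or have shrunk sharply. The size classes, the 5% alert threshold, the 50% drop limit, and all function names are assumptions made for illustration, not details of the actual remediation.

```python
from typing import Dict


def classify(rows: int) -> str:
    """Map a row count to a hypothetical size class."""
    if rows <= 1_000_000:
        return "small"
    if rows <= 50_000_000:
        return "medium"
    return "large"


def reclassified_fraction(previous: Dict[str, int], current: Dict[str, int]) -> float:
    """Fraction of accounts present in both snapshots whose size class changed."""
    shared = previous.keys() & current.keys()
    if not shared:
        return 0.0
    changed = sum(
        1 for account in shared
        if classify(previous[account]) != classify(current[account])
    )
    return changed / len(shared)


def should_alert(previous: Dict[str, int], current: Dict[str, int],
                 threshold: float = 0.05) -> bool:
    """Alert when more than `threshold` of accounts change class between snapshots."""
    return reclassified_fraction(previous, current) > threshold


def is_suspicious(previous_rows: int, current_rows: int, max_drop: float = 0.5) -> bool:
    """Flag a reading that is missing or that shrank by more than `max_drop`."""
    if current_rows <= 0:
        return True
    if previous_rows <= 0:
        return False  # no baseline to compare against
    return current_rows < previous_rows * (1 - max_drop)


# Illustrative snapshots: two large accounts are suddenly reported as small.
previous = {"acct_a": 120_000_000, "acct_b": 60_000_000,
            "acct_c": 500_000, "acct_d": 2_000_000}
current = {"acct_a": 400_000, "acct_b": 900_000,
           "acct_c": 510_000, "acct_d": 2_100_000}

print("alert on classification spike:", should_alert(previous, current))
print("acct_a reading suspicious:", is_suspicious(previous["acct_a"], current["acct_a"]))
```

Checking the aggregate rate of class changes catches a systemic fault in the source data, while the per-account check could let a pipeline refuse an implausible reading before it allocates resources.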
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Jessica G.
Postmortem published November 28, 2023.