SUMMARY
From June 27, 2021 at 06:22 UTC to June 29, 2021 at 10:08 UTC, some Explore customers hosted in the EU saw data that was not accurately refreshed when accessing queries and dashboards covering that period.
Timeline
11:19 UTC | 04:19 PT
We’re happy to confirm that all known issues related to the data refresh in Explore for some customers in Europe have been resolved. Thank you for your patience while we worked through this incident.
10:14 UTC | 03:14 PT
The team has deployed the fix for the Explore data refresh issue in Europe-hosted accounts, and we already see improvements in the data load for some accounts. We appreciate your patience with this matter. We will provide more updates within the next hour or so.
09:59 UTC | 02:59 PT
Our engineering team is currently testing a fix for the Explore incident. Once implemented, it will take a few hours for the data to be fully refreshed in the accounts. We will provide an update as soon as the fix has been deployed.
09:27 UTC | 02:27 PT
We have confirmed reports of a service disruption affecting data refresh in Zendesk Explore for customer accounts hosted in Europe. We are working toward resolution.
09:02 UTC | 02:02 PT
We are currently investigating reports from some customer accounts hosted in Europe about missing data in Zendesk Explore when filtering by the last few days. More info to come.
POST-MORTEM
Root Cause Analysis
This incident occurred because our data processing and analysis tool was unable to process the initial Explore data refresh requests. The tool did not have enough space to dequeue the failing data pipeline jobs, which were also being queued back as low-priority jobs.
Resolution
To fix this issue, the Engineering team applied a hotfix that gave retried jobs a higher priority than normal jobs. The retried jobs were then dequeued normally and processed correctly, leading to a full data refresh and reload for the affected accounts.
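As a rough illustration of the priority change described above, the following Python sketch shows a queue in which retried jobs are dequeued ahead of normal jobs. It is not the actual Explore pipeline code: the JobQueue class, the priority constants, and the job names are all hypothetical and exist only to show the idea.

```python
import heapq
import itertools

# Hypothetical priority levels; the lower number is dequeued first.
RETRY_PRIORITY = 0
NORMAL_PRIORITY = 1


class JobQueue:
    """Toy priority queue (an assumption, not the real pipeline scheduler)."""

    def __init__(self):
        self._heap = []
        # Tie-breaker that keeps first-in, first-out order within one priority.
        self._counter = itertools.count()

    def enqueue(self, job_id, priority=NORMAL_PRIORITY):
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))

    def requeue_failed(self, job_id):
        # Retried jobs outrank normal jobs, mirroring the hotfix.
        self.enqueue(job_id, priority=RETRY_PRIORITY)

    def dequeue(self):
        _, _, job_id = heapq.heappop(self._heap)
        return job_id

    def __len__(self):
        return len(self._heap)


def drain_order():
    """Queue two normal jobs and one retried job, then return the dequeue order."""
    queue = JobQueue()
    queue.enqueue("refresh-account-a")
    queue.enqueue("refresh-account-b")
    queue.requeue_failed("refresh-account-c")  # failed earlier, retried now
    return [queue.dequeue() for _ in range(len(queue))]
```

In this sketch the retried job for account c is dequeued first even though it was queued last, so a backlog of normal jobs cannot keep retries waiting indefinitely.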
Remediation Items
- Create an alert on the number of accounts not getting a pipeline started.
- Compute the SLO correctly for the monitoring and analytics tool metric.
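The first remediation item could be approached along the lines of the Python sketch below, which counts accounts whose pipeline has not started recently and decides whether to raise an alert. The function names, the 24-hour window, and the threshold are assumptions made for illustration; they do not describe the monitoring that was actually put in place.

```python
from datetime import datetime, timedelta, timezone


def accounts_without_recent_pipeline(last_started, now, max_age):
    """Return the sorted account ids whose last pipeline start is missing
    or older than max_age.

    last_started maps an account id to a datetime, or to None if no
    pipeline has ever started for that account (hypothetical data shape).
    """
    stale = []
    for account_id, started_at in last_started.items():
        if started_at is None or now - started_at > max_age:
            stale.append(account_id)
    return sorted(stale)


def should_alert(stale_accounts, threshold):
    """Alert once the number of stale accounts reaches the threshold."""
    return len(stale_accounts) >= threshold


# Example with made-up accounts and an assumed 24-hour freshness window.
now = datetime(2021, 6, 29, 10, 0, tzinfo=timezone.utc)
last_started = {
    "acct-1": now - timedelta(hours=1),
    "acct-2": now - timedelta(hours=30),
    "acct-3": None,
}
stale = accounts_without_recent_pipeline(last_started, now, timedelta(hours=24))
alert = should_alert(stale, threshold=2)
```

Alerting on the count of affected accounts, rather than on individual job failures, is meant to surface the case where jobs fail quietly and are never picked up again.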
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.