SUMMARY
During this incident, some Explore customers across multiple pods and regions experienced sync delays and outdated information in the Tickets dataset.
Timeline
10:38 UTC | 02:38 PT
Following the rollback, all reprocessed sync jobs have completed and Explore data syncs have returned to normal. As such, this incident is resolved. Thank you for your patience during our investigation.
00:18 UTC | 16:18 PT
We continue to manually reprocess sync jobs for impacted customers and will provide another update when we have substantive information to share.
20:58 UTC | 12:58 PT
We have completed the rollback of the Explore update that caused the delays in Tickets dataset updates. We are now manually reprocessing sync jobs and will provide another update when we have new information to share.
20:11 UTC | 12:11 PT
We have identified a recently released update as a potential root cause of the sync delays and errors seen today in the Tickets dataset for some Explore customers. We are working to roll back that update and are monitoring the results. We will provide further updates as soon as we have new findings to share.
19:06 UTC | 11:06 PT
We are still investigating the issues causing sync delays and errors for the Tickets dataset in Explore across multiple pods and regions. Our teams have made some progress, but delays are still being seen for some accounts. We will continue to post new information as we find it.
18:10 UTC | 10:10 PT
Our team continues to investigate the issue causing sync delays and errors in the Tickets dataset for some Explore customers across multiple pods and regions. We will provide further updates as the investigation progresses.
17:40 UTC | 09:40 PT
We have confirmed an issue that is causing sync delays and errors in the Tickets dataset in Explore across multiple pods and regions. Our team is investigating and we will post additional information as we learn more.
17:29 UTC | 09:29 PT
We are investigating reports of Explore sync delays in the Tickets dataset across multiple pods and regions. We will provide further updates shortly.
POST-MORTEM
Root Cause Analysis
Background: We have a system (Explore ETL) that regularly collects data for our customers. This system handles the actual process of data collection; the collected data is stored and then processed further. An issue occurred with data collection for an account with an unusually large amount of data per ticket. This volume of data led to memory saturation, causing errors and slowing down data processing.
Additionally, a recent server maintenance upgrade changed how memory garbage collection was executed, leading to an increase in memory consumption that exacerbated the issue.
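To make the failure mode concrete, the Java sketch below shows an ETL-style worker that buffers every ticket payload in memory before processing a batch. All class names, sizes, and structure are hypothetical assumptions, not Zendesk's actual Explore ETL code; the sketch only illustrates how peak memory grows with per-ticket payload size, and how a garbage collector that frees memory less eagerly raises that peak further.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; names and sizes are illustrative and do not
// reflect Zendesk's actual Explore ETL implementation.
public class TicketSyncWorker {

    // A real worker would read the payload from an API or message queue;
    // here we just allocate a buffer of the requested size.
    static byte[] fetchTicketPayload(long ticketId, int payloadBytes) {
        return new byte[payloadBytes];
    }

    public static void main(String[] args) {
        int payloadBytes = 128 * 1024; // assume 128 KB per ticket (incident payloads were far larger)
        int batchSize = 2_000;         // assume 2,000 tickets per sync batch
        List<byte[]> batch = new ArrayList<>();

        // Buffering the whole batch before flushing means peak heap usage is
        // roughly payloadBytes * batchSize. If payloads are unexpectedly large,
        // or a runtime upgrade makes garbage collection reclaim memory less
        // eagerly, the worker can hit memory saturation and slow down or fail.
        for (long id = 0; id < batchSize; id++) {
            batch.add(fetchTicketPayload(id, payloadBytes));
        }
        long approxBytes = (long) payloadBytes * batch.size();
        System.out.println("Buffered ~" + (approxBytes / (1024 * 1024)) + " MB in memory");
    }
}
```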
Resolution
Upon identifying the issues, we attempted to manage the overload by prioritizing tasks and restarting the servers. After further investigation and assistance from other teams, we identified the server upgrade as a contributing cause and rolled back to the previous version. Task processing then returned to normal.
Remediation Items
To prevent similar incidents in the future, we are taking the following measures:
1. Restrict the size of the payload for ticket data (a hedged sketch of such a guard follows this list).
2. Reevaluate the server maintenance upgrade in light of the increased memory consumption under the new garbage collection behavior.
3. Enhance our testing environment to better mimic production load and accurately test for scenarios like this.
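As a rough illustration of the first item, the sketch below enforces a per-ticket payload size cap before a payload is buffered for a sync batch. The 5 MB limit, class name, and deferral behavior are assumptions made for illustration, not Zendesk's actual design.

```java
// Hypothetical payload size guard; the cap and the deferral behavior are
// illustrative assumptions, not Zendesk's actual design.
public class PayloadSizeGuard {

    // Assumed cap of 5 MB per ticket payload.
    private static final int MAX_PAYLOAD_BYTES = 5 * 1024 * 1024;

    /** Returns true if the payload is small enough to buffer in the sync batch. */
    public static boolean withinLimit(byte[] payload) {
        return payload != null && payload.length <= MAX_PAYLOAD_BYTES;
    }

    public static void main(String[] args) {
        byte[] oversized = new byte[6 * 1024 * 1024];
        if (!withinLimit(oversized)) {
            // In practice an oversized ticket might be split into chunks or routed
            // to a memory-isolated job rather than simply skipped.
            System.out.println("Payload exceeds cap; deferring to chunked processing");
        }
    }
}
```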
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
1 comment
Jeremy R.
Post-Mortem published December 21, 2023