SUMMARY
From February 6, 2025 at 18:00 UTC to February 7, 2025 at 10:25 UTC, some US Explore customers experienced delays in Explore dashboard data.
TIMELINE
February 07, 2025 11:12 AM UTC | February 07, 2025 03:12 AM PT
We are pleased to inform you that the issue with the Explore dashboard has been resolved as of 10:25 UTC. Thank you for your patience and understanding!
February 07, 2025 10:54 AM UTC | February 07, 2025 02:54 AM PT
We have been experiencing delays on the Explore dashboard since yesterday at 20:00 UTC. Our engineering team has identified the issue and applied a fix. We are actively monitoring the situation to ensure a smooth experience. Thank you for your patience!
POST-MORTEM
Root Cause Analysis
This incident was caused by insufficient capacity in a processing cluster, triggered by a large data export using the new Data Exporter service. The export query took an excessively long time to execute, which led to multiple retries and resulted in three concurrent executions of the same problematic query. These queries continued running even after the service that initiated them was stopped, contributing to a spike in CPU usage on the cluster.
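The failure mode described above, where a retry starts while the earlier attempt is still running, can be illustrated with a small sketch. It is not the Data Exporter's actual code. The names run_with_deadline, QueryTimedOut, and query_fn are hypothetical, and the sketch assumes the query function can be stopped cooperatively through a cancel event. A timed-out attempt is cancelled and waited on before the next attempt starts, so retries never stack up as concurrent executions.

```python
import threading


class QueryTimedOut(Exception):
    """Raised when every attempt exceeded its time limit (hypothetical name)."""


def run_with_deadline(query_fn, timeout_s, max_attempts=3):
    """Run query_fn(cancel_event) with a per-attempt time limit.

    Assumption: query_fn checks cancel_event and returns promptly once it
    is set. A timed-out attempt is cancelled and joined before the next
    attempt starts, so at most one execution is in flight at any time.
    """
    for _ in range(max_attempts):
        cancel = threading.Event()
        result = {}

        def worker(cancel=cancel, result=result):
            try:
                result["value"] = query_fn(cancel)
            except Exception as exc:  # surface query errors to the caller
                result["error"] = exc

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        thread.join(timeout_s)

        if not thread.is_alive():
            if "error" in result:
                raise result["error"]
            return result["value"]

        cancel.set()   # ask the running attempt to stop
        thread.join()  # wait until it has actually stopped before retrying

    raise QueryTimedOut(f"no attempt finished within {timeout_s}s")
```

A client-side guard like this would not have been enough on its own here, since the report notes that the queries kept running after the initiating service was stopped. A limit enforced by the database itself, such as a server-side statement timeout where the engine offers one, covers that case.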
Resolution
To resolve the issue, the team manually restarted the processing cluster. The restart terminated the stuck queries and restored the cluster's capacity to process other queries.
Remediation Items
- Implement Query Time Limits: Establish time limits on export queries to prevent excessively long executions from impacting system performance.
- Improve Monitoring: Enhance monitoring so that alerts for high CPU usage are more prominent and more sensitive, enabling quicker responses to potential issues.
- Review and Optimize Queries: Review all queries associated with the Data Exporter to identify and optimize those using JOIN clauses with OR conditions, which are problematic for performance.
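As an illustration of the last item, the sketch below shows one common way to rewrite a JOIN whose ON clause contains an OR: split it into two equality joins combined with UNION. The tickets and users tables and their columns are hypothetical and are not the Data Exporter's schema. SQLite is used only because it ships with Python. In many engines an OR in a join condition prevents a plain equality join strategy, while each branch of the UNION can use one. The example checks only that the two forms return the same rows and does not measure speed.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tickets (
    id INTEGER PRIMARY KEY,
    requester_id INTEGER,
    assignee_id INTEGER
);
INSERT INTO users VALUES (1, 'ana'), (2, 'ben'), (3, 'cy');
INSERT INTO tickets VALUES (10, 1, 2), (11, 2, 2), (12, 3, 1);
""")

# Original shape: one join with an OR in the ON clause.
JOIN_WITH_OR = """
SELECT t.id AS ticket_id, u.id AS user_id, u.name
FROM tickets t
JOIN users u ON u.id = t.requester_id OR u.id = t.assignee_id
"""

# Rewrite: one equality join per branch. UNION removes the duplicate row
# produced when requester and assignee are the same user (ticket 11).
UNION_REWRITE = """
SELECT t.id AS ticket_id, u.id AS user_id, u.name
FROM tickets t JOIN users u ON u.id = t.requester_id
UNION
SELECT t.id AS ticket_id, u.id AS user_id, u.name
FROM tickets t JOIN users u ON u.id = t.assignee_id
"""


def rows(sql):
    """Return the query result as a sorted list for order-independent comparison."""
    return sorted(conn.execute(sql).fetchall())


or_rows = rows(JOIN_WITH_OR)
union_rows = rows(UNION_REWRITE)
print(or_rows == union_rows)
```

The rewrite is safe here because both primary keys are selected, so every row identifies a distinct ticket and user pair. If the select list could contain genuine duplicates, UNION would collapse them and the two forms would no longer be equivalent.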
FOR MORE INFORMATION
For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.
Bob Novak
Postmortem published February 26, 2025