SUMMARY
On October 14, 2024, from 13:49 UTC to 15:40 UTC, customers using Explore in the AMER region experienced "download failed" errors when attempting to export or schedule dashboards and reports.
TIMELINE
October 14, 2024 04:17 PM UTC | October 14, 2024 09:17 AM PT
We are happy to report that we have resolved the issue affecting Explore customers in the Americas, causing "download failed" errors when attempting to export or schedule dashboards and reports. Thank you for your patience during our investigation.
October 14, 2024 04:01 PM UTC | October 14, 2024 09:01 AM PT
We have found a root cause for the issue affecting US Explore customers causing "download failed" errors when attempting to download or schedule dashboards or reports; however, there is a backlog of requests that need to be processed and some delays may be experienced. We will monitor to ensure full resolution. Please let us know if you continue to experience any issues.
October 14, 2024 03:40 PM UTC | October 14, 2024 08:40 AM PT
We have confirmed an issue affecting US Explore customers causing "download failed" errors when attempting to download or schedule dashboards or reports. Our team is investigating and we will post further updates in the next 30 minutes.
October 14, 2024 03:26 PM UTC | October 14, 2024 08:26 AM PT
We are receiving reports of "download failed" errors for US Explore customers when attempting to download or schedule dashboards or reports. We will post additional information shortly.
POST-MORTEM
Root Cause Analysis
This incident was caused by the inadvertent deletion of a secret that services need in order to authenticate within Explore. The deletion occurred during a cleanup of Explore resources, when the secret was mistakenly assumed to be unneeded because it was also available in a new version of the service.
Resolution
To fix this issue, the missing secret was recreated, allowing the service to start successfully again. This involved manual intervention to reapply the secret definitions through the codebase, ensuring that all necessary components were functioning as intended.
Remediation Items
- Increase the required number of reviewers to two on the relevant repository to enhance oversight on changes.
- Document the process for validating whether a secret on our previous version is still in use by other services.
- Develop a documented process for validating risky infrastructure changes using the staging environment and end-to-end tests.
- Establish guidelines for rolling out risky infrastructure changes to production, including appropriate soaking time.
- Investigate and address memory issues related to the Explore services to prevent future occurrences of similar incidents.
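As an illustration of the second remediation item, a pre-deletion check could look something like the sketch below. It is not a description of Zendesk's tooling, which the post-mortem does not detail. It assumes an inventory that maps each service to the secret names it references; the inventory, the service names, the secret names, and both helper functions are hypothetical.

```python
from typing import Dict, List, Set


def find_secret_consumers(
    secret_name: str, service_secret_refs: Dict[str, Set[str]]
) -> List[str]:
    """Return the sorted names of services that still reference secret_name."""
    return sorted(
        service
        for service, refs in service_secret_refs.items()
        if secret_name in refs
    )


def safe_to_delete(secret_name: str, service_secret_refs: Dict[str, Set[str]]) -> bool:
    """A secret is a deletion candidate only if no service references it."""
    return not find_secret_consumers(secret_name, service_secret_refs)


# Hypothetical inventory: service name -> secret names it references.
# In practice this would be built from deployment manifests or config.
inventory = {
    "export-worker-v1": {"explore-auth-token", "db-credentials"},
    "export-worker-v2": {"explore-auth-token-v2", "db-credentials"},
    "scheduler": {"explore-auth-token"},
}

consumers = find_secret_consumers("explore-auth-token", inventory)
if consumers:
    print(f"Do not delete: still referenced by {', '.join(consumers)}")
else:
    print("No references found; deletion can proceed to review")
```

A check like this only catches references that appear in the inventory, so it would complement, not replace, validation in staging and a second reviewer.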
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
1 comment
Bob Novak
Post-mortem published October 30, 2024.