Service Incident - August 8, 2024 - Explore | Pods 17, 18, 28, 29 - Errors when creating new reports

SUMMARY

From August 7, 2024, 17:00 UTC to August 8, 2024, 16:45 UTC, some customers on Pods 17, 18, 28, and 29 experienced errors when trying to create new reports in Explore. This affected their ability to generate insights and access critical data reports.

Timeline

August 08, 2024 03:35 PM UTC | August 08, 2024 08:35 AM PT
We are investigating reports that Explore customers with a large number of datasets are unable to create new reports and datasets. As a workaround, you can create a new report by cloning an existing report and editing the clone. However, you will still be unable to create new datasets. Next update in 30 minutes or when we have new information.

August 08, 2024 04:00 PM UTC | August 08, 2024 09:00 AM PT
Our engineers continue to investigate an issue impacting the ability to create new Explore reports and datasets. We have narrowed the scope of impact to pods 17, 18, 28, 29, and 31. Next update in one hour or when we have new information to share.

August 08, 2024 04:54 PM UTC | August 08, 2024 09:54 AM PT
Our engineers remain focused on resolving the issue affecting the creation of new Explore reports and datasets. We will provide the next update in 2 hours or when we have new information to share.

August 08, 2024 05:21 PM UTC | August 08, 2024 10:21 AM PT
Our engineers have rolled out a fix and we have confirmed that you can now create new reports and datasets. The issue is now fully resolved. Please let us know if you continue to experience issues.

POST-MORTEM

Root Cause Analysis

This incident was caused by a performance degradation following an upgrade to our database infrastructure provided by our partner. The upgrade removed query caching that our system previously relied upon, significantly slowing down certain queries crucial to the Explore feature.

Resolution

To fix this issue, we analyzed the problematic queries and implemented indexing strategies to support them. This restored query performance to its expected levels and resolved the errors customers were experiencing.
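The effect of adding an index to a slow query can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in SQLite as a stand-in (not the database behind Explore), and the `datasets` table, its columns, and the index name `idx_datasets_account` are hypothetical. It compares the query plan before and after the index is created.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); keep the readable detail.
    return [row[3] for row in rows]


# Hypothetical table standing in for the data a report query might read.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (id INTEGER PRIMARY KEY, account_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO datasets (account_id, name) VALUES (?, ?)",
    [(i % 100, f"dataset-{i}") for i in range(10_000)],
)

query = "SELECT id, name FROM datasets WHERE account_id = ?"

# Without an index, the filter on account_id has to scan the whole table.
plan_before = plan_for(conn, query, (42,))

# With an index on the filtered column, the planner can search the index.
conn.execute("CREATE INDEX idx_datasets_account ON datasets (account_id)")
plan_after = plan_for(conn, query, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing plans this way is one approach to confirming that a query no longer depends on a cache to perform acceptably. The exact plan wording differs between database engines and versions.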

Remediation Items

Improve the monitoring system with specific alerts centered around SQL query latencies to catch performance issues early.
Communicate the deprecation of the query cache to other teams so they are aware of potential impacts and can take preemptive measures.
Investigate optimizing queries by replacing NOT IN statements with boolean values to align with the new database version.
Investigate implementing caching solutions such as ElastiCache or ProxySQL for repeated query results to avoid similar issues in the future.
Ensure all changes, improvements, and processes are thoroughly documented in Confluence to share knowledge and prepare for similar future events.
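
The first remediation item, alerting on SQL query latencies, could look roughly like the sketch below. It is a hypothetical illustration, not the monitoring Zendesk runs: the `LatencyAlert` type, the `check_latency` function, the query name, and the 500 ms threshold are all assumptions chosen for the example. It raises an alert when the 95th-percentile latency of recent samples exceeds a threshold.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LatencyAlert:
    query_name: str
    p95_ms: float
    threshold_ms: float


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence (pct is 1 to 100)."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_latency(
    query_name: str, samples_ms: Sequence[float], threshold_ms: float
) -> Optional[LatencyAlert]:
    """Return an alert if p95 latency exceeds the threshold, else None."""
    if not samples_ms:
        return None  # No data in this window, so nothing to evaluate.
    p95 = percentile(samples_ms, 95)
    if p95 > threshold_ms:
        return LatencyAlert(query_name, p95, threshold_ms)
    return None


# Example windows of latency samples in milliseconds (made-up values).
healthy_window = [10.0] * 100
degraded_window = [10.0] * 90 + [900.0] * 10

healthy_result = check_latency("create_report", healthy_window, 500.0)
degraded_result = check_latency("create_report", degraded_window, 500.0)
print(healthy_result)
print(degraded_result)
```

A percentile is used here rather than an average because a slowdown that affects only some queries can be hidden by a mean while still causing errors for the customers who hit it.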

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.