SUMMARY
On March 6, 2024, from 13:33 UTC to 14:15 UTC, customers were unable to load Explore, receiving 502 Bad Gateway errors.
Timeline
15:02 UTC | 07:02 PT
We have received confirmation that customers are no longer seeing server errors when accessing Explore, and our backend is no longer reporting errors either, so we are considering this incident resolved. Thank you for your patience while we worked through this disruption.
14:24 UTC | 06:24 PT
We are seeing page loads improve and have received confirmation that Explore is now accessible and loads correctly after a page refresh. Please reload Explore and let us know if you still experience any issues. We appreciate your patience and help.
14:15 UTC | 06:15 PT
Explore customers on Pods 17, 18, 28, and 29 should be the only ones still affected at this point. We continue working on restoring access. We will post another update in 30 minutes, or sooner if we have further details.
14:04 UTC | 06:04 PT
We are currently investigating reports of Explore not loading for customers across multiple Pods.
POST-MORTEM
Root Cause Analysis
On March 6, 2024, users trying to access Explore encountered errors due to a background process initiated to update the system. This process caused temporary "locking" issues in our database, resulting in errors for our users. The issue started at 13:22 UTC and was resolved by 14:07 UTC.
Our Engineering team was working on a new feature designed to provide users with usage statistics. To make this feature more efficient, a new process was introduced that updated a table in our database each time a dashboard was accessed, avoiding repeated calculations. The problem arose when this process started backfilling historical records for existing dashboards.
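As an illustration of that write pattern, the sketch below shows what a per-access update to a pre-aggregated table might look like in a Rails application. The DashboardView and Dashboard model names and the view_count/last_viewed_at columns are assumptions for illustration, not Zendesk's actual schema.

```ruby
# Illustrative sketch only: model, table, and column names are assumptions.
class DashboardView < ApplicationRecord
  # Assumed columns: dashboard_id (integer), view_count (integer),
  # last_viewed_at (datetime).
end

class DashboardsController < ApplicationController
  def show
    @dashboard = Dashboard.find(params[:id])

    # Keep a pre-aggregated usage row up to date so statistics can be read
    # directly instead of being recomputed from raw events on every request.
    view = DashboardView.find_or_create_by!(dashboard_id: @dashboard.id)
    view.increment!(:view_count, touch: :last_viewed_at) # atomic counter bump, also bumps the timestamp
  end
end
```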
The incident was caused primarily by the process initiated to populate these historical records: it held prolonged locks on our database, leading to query timeouts and errors.
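To make the locking failure mode concrete, here is a hedged sketch, not Zendesk's actual code, of how a backfill written as one large transaction holds row locks until it commits, and how committing in small batches keeps each lock short-lived. The historical_view_count source column is hypothetical.

```ruby
# Problematic pattern (illustrative): one long transaction keeps locks on every
# touched row until the entire backfill commits, blocking concurrent readers
# and writers long enough to cause timeouts.
DashboardView.transaction do
  Dashboard.find_each do |dashboard|
    view = DashboardView.find_or_create_by!(dashboard_id: dashboard.id)
    view.update!(view_count: dashboard.historical_view_count) # hypothetical source column
  end
end

# Lock-friendlier pattern: commit in small batches so each transaction, and
# therefore each set of row locks, is short-lived.
Dashboard.in_batches(of: 500) do |batch|
  DashboardView.transaction do
    batch.each do |dashboard|
      view = DashboardView.find_or_create_by!(dashboard_id: dashboard.id)
      view.update!(view_count: dashboard.historical_view_count)
    end
  end
end
```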
Resolution
Once the stalled queries were cleared and the Rails application was restarted, Explore resumed normal operation.
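As a rough illustration of the cleanup step, the snippet below shows one way stalled queries could be identified and terminated from a Rails console, assuming a MySQL-compatible datastore; the actual database, thresholds, and procedure used are not stated in this post-mortem.

```ruby
# Hedged sketch only: assumes a MySQL-compatible backend. The 60-second
# threshold and the decision to kill are illustrative.
stalled = ActiveRecord::Base.connection.exec_query(<<~SQL)
  SELECT id AS query_id, time AS runtime_seconds, info AS query_text
  FROM information_schema.processlist
  WHERE command <> 'Sleep' AND time > 60
SQL

stalled.each do |row|
  Rails.logger.warn("Killing stalled query #{row['query_id']}: #{row['query_text']}")
  ActiveRecord::Base.connection.execute("KILL #{row['query_id']}")
end
```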
Remediation Items
- Review backfill process
- Update backfill process playbook
- Process dashboard_views records asynchronously (see the sketch below)
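A minimal sketch of the asynchronous approach named in the last remediation item, using ActiveJob; the job and model names are assumptions. The idea is that the dashboard request only enqueues work, and the database write happens outside the request path in a short, independent transaction.

```ruby
# Illustrative sketch: RecordDashboardViewJob and DashboardView are assumed names.
class RecordDashboardViewJob < ApplicationJob
  queue_as :low_priority

  def perform(dashboard_id)
    # Each job performs a small, short-lived write, so no long-held locks.
    view = DashboardView.find_or_create_by!(dashboard_id: dashboard_id)
    view.increment!(:view_count)
  end
end

# In the controller, enqueue instead of writing inline:
#   RecordDashboardViewJob.perform_later(dashboard.id)
```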
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.