SUMMARY
On March 6, 2024, from 13:33 UTC to 14:15 UTC, customers were unable to load Explore, receiving 502 Bad Gateway errors.
Timeline
15:02 UTC | 07:02 PT
We have received confirmation that customers are no longer seeing server errors when accessing Explore, and our backend is no longer reporting errors either, so we are considering this incident resolved. Thank you for your patience while we worked through this disruption.
14:24 UTC | 06:24 PT
We are seeing page loads improve and have received confirmation that Explore is now accessible and loads correctly after a page refresh. Please reload Explore and let us know if you still experience any issues. We appreciate your patience and help.
14:15 UTC | 06:15 PT
Explore customers on Pods 17, 18, 28, and 29 should be the only ones still affected at this point. We continue working on restoring access. We will post another update in 30 minutes, or sooner if we have further details.
14:04 UTC | 06:04 PT
We are currently investigating reports of Explore not loading for customers across multiple Pods.
POST-MORTEM
Root Cause Analysis
On March 6, 2024, users trying to access Explore encountered errors due to a background process initiated to update the system. This process caused temporary "locking" issues in our database, resulting in errors for our users. The issue started at 13:22 UTC and was resolved by 14:07 UTC.
Our Engineering team was working on a new feature designed to provide users with usage statistics. To make this feature more efficient, a new process was introduced that updated a table in our database each time a dashboard was accessed, avoiding repeated calculations. The problem arose when this process started backfilling historical records for existing dashboards.
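As an illustration of that write pattern, the sketch below shows what a per-access update to a pre-aggregated table might look like in a Rails application. The DashboardView and Dashboard model names and the view_count/last_viewed_at columns are assumptions for illustration, not Zendesk's actual schema.

```ruby
# Illustrative sketch only: model, table, and column names are assumptions.
class DashboardView < ApplicationRecord
  # Assumed columns: dashboard_id (integer), view_count (integer),
  # last_viewed_at (datetime).
end

class DashboardsController < ApplicationController
  def show
    @dashboard = Dashboard.find(params[:id])

    # Keep a pre-aggregated usage row up to date so statistics can be read
    # directly instead of being recomputed from raw events on every request.
    view = DashboardView.find_or_create_by!(dashboard_id: @dashboard.id)
    view.increment!(:view_count, touch: :last_viewed_at) # atomic counter bump, also bumps the timestamp
  end
end
```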
The incident was caused primarily by the process initiated to populate these historical records: it held prolonged locks on our database, leading to query timeouts and errors.
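To make the locking failure mode concrete, here is a hedged sketch, not Zendesk's actual code, of how a backfill written as one large transaction holds row locks until it commits, and how committing in small batches keeps each lock short-lived. The historical_view_count source column is hypothetical.

```ruby
# Problematic pattern (illustrative): one long transaction keeps locks on every
# touched row until the entire backfill commits, blocking concurrent readers
# and writers long enough to cause timeouts.
DashboardView.transaction do
  Dashboard.find_each do |dashboard|
    view = DashboardView.find_or_create_by!(dashboard_id: dashboard.id)
    view.update!(view_count: dashboard.historical_view_count) # hypothetical source column
  end
end

# Lock-friendlier pattern: commit in small batches so each transaction, and
# therefore each set of row locks, is short-lived.
Dashboard.in_batches(of: 500) do |batch|
  DashboardView.transaction do
    batch.each do |dashboard|
      view = DashboardView.find_or_create_by!(dashboard_id: dashboard.id)
      view.update!(view_count: dashboard.historical_view_count)
    end
  end
end
```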
Resolution
Once the stalled queries were cleared and the Rails application was restarted, Explore resumed normal operation.
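As a rough illustration of the cleanup step, the snippet below shows one way stalled queries could be identified and terminated from a Rails console, assuming a MySQL-compatible datastore; the actual database, thresholds, and procedure used are not stated in this post-mortem.

```ruby
# Hedged sketch only: assumes a MySQL-compatible backend. The 60-second
# threshold and the decision to kill are illustrative.
stalled = ActiveRecord::Base.connection.exec_query(<<~SQL)
  SELECT id AS query_id, time AS runtime_seconds, info AS query_text
  FROM information_schema.processlist
  WHERE command <> 'Sleep' AND time > 60
SQL

stalled.each do |row|
  Rails.logger.warn("Killing stalled query #{row['query_id']}: #{row['query_text']}")
  ActiveRecord::Base.connection.execute("KILL #{row['query_id']}")
end
```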
Remediation Items
- Review backfill process
- Update backfill process playbook
- Process dashboard_views records asynchronously (see the sketch below)
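A minimal sketch of the asynchronous approach named in the last remediation item, using ActiveJob; the job and model names are assumptions. The idea is that the dashboard request only enqueues work, and the database write happens outside the request path in a short, independent transaction.

```ruby
# Illustrative sketch: RecordDashboardViewJob and DashboardView are assumed names.
class RecordDashboardViewJob < ApplicationJob
  queue_as :low_priority

  def perform(dashboard_id)
    # Each job performs a small, short-lived write, so no long-held locks.
    view = DashboardView.find_or_create_by!(dashboard_id: dashboard_id)
    view.increment!(:view_count)
  end
end

# In the controller, enqueue instead of writing inline:
#   RecordDashboardViewJob.perform_later(dashboard.id)
```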
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.