SUMMARY
On February 1, 2022, from 14:55 UTC to 15:25 UTC, some Explore customers in both US- and EU-hosted data centres saw a “Network error” message when accessing Dashboards. For some customers this affected all Dashboards; for others, only the pre-canned ones.
Timeline
15:59 UTC | 07:59 PT
We received reports from Explore customers experiencing “Network Error” messages when accessing their dashboards. Our team has addressed the issue in US-hosted data centres and is now working on the EU-based ones. Please refresh your browser and try to access your data again.
17:11 UTC | 09:11 PT
We are seeing improvement in Explore dashboard performance in both EU- and US-based data centres. We are monitoring to ensure full recovery and will update you when the issue is fully resolved.
20:20 UTC | 12:20 PT
We are happy to report that the issue causing network errors when accessing Explore dashboards has been resolved. Thank you for your patience during our investigation.
POST-MORTEM
Root Cause Analysis
This incident was caused by two unanticipated concurrent events: a deploy to the service that computes analytical queries, and our cloud computing platform provider attempting to remove old tasks from the queue.
As a result, the platform was trying both to create new tasks for the deploy and to create new tasks to replace the ones it had removed. This became so resource-intensive that it could not create them all, leaving no tasks running during the 30-minute window.
Resolution
During the incident, we lacked the server resources needed to run the service that computes analytical queries. Once the infrastructure stabilized, we were able to obtain resources to start the service and resume responding to customers’ analytical queries.
Remediation Items
- Remove Rails as a mandatory dependency of the health check for the service that computes analytical queries [Done]
- Change the deploy strategy of that service to 120% [Done]
- Create a monitor that alerts when the service does not have enough tasks running [To Do]
- Review the deploy strategy for Engine services [Backlog]
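The monitor described in the third item could be sketched roughly as follows. This is a minimal illustration in Python, not the monitor we are building: the `TaskSample` type, the `should_alert` function, the 50% threshold, and the five-minute window are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskSample:
    """One observation of the service's task counts (hypothetical shape)."""
    timestamp: float  # seconds since epoch
    running: int      # tasks currently running
    desired: int      # tasks the scheduler has been asked to keep running


def should_alert(
    samples: Sequence[TaskSample],
    min_ratio: float = 0.5,        # assumed threshold, not from the post-mortem
    window_seconds: float = 300.0,  # assumed evaluation window
) -> bool:
    """Return True if every recent sample shows too few running tasks.

    "Recent" means within window_seconds of the newest sample. Samples with
    desired == 0 are ignored, since no tasks are expected in that state.
    """
    if not samples:
        return False
    latest = max(s.timestamp for s in samples)
    recent = [
        s for s in samples
        if latest - s.timestamp <= window_seconds and s.desired > 0
    ]
    if not recent:
        return False
    # Alert only when the shortfall persists across the whole window,
    # so a single brief dip during a normal deploy does not page anyone.
    return all(s.running < min_ratio * s.desired for s in recent)
```

Requiring the shortfall to hold for the whole window is one possible design choice; a 30-minute period with no running tasks, as in this incident, would trip such a check within the first window.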
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Post-Mortem published February 9, 2022.