SUMMARY
On March 23, 2022, from 09:08 UTC to 21:51 UTC, Zendesk Explore customers in the U.S. experienced degraded performance and errors when sharing or scheduling a dashboard.
Timeline
21:00 UTC | 14:00 PT
We have received reports of issues loading the dashboard share modal in Explore. We will provide additional updates as soon as we can.
21:15 UTC | 14:15 PT
We have confirmed issues loading the dashboard share modal for US-based Explore customers. Our team is investigating and we will provide additional information as it becomes available.
21:45 UTC | 14:45 PT
We are happy to report that we are seeing improvement in dashboard sharing functionality for US-based Explore customers. Please let us know if you continue to see any issues.
21:58 UTC | 14:58 PT
Explore dashboard share functionality has been fully restored for all US-based Explore customers. Thanks for your patience during our investigation.
Root Cause Analysis
This incident was caused by a bug in the Explore software that led an internal database user to exceed its maximum number of connections to our database. Resolution took longer than it otherwise would have because the first rollback attempt failed. The rollback started new containers, and each new container ran a health check that required a connection to the database. Because the connection limit had already been reached, those health checks failed and blocked the initial rollback.
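The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class and function names (`DatabaseUser`, `health_check`, `roll_back`) and the limit values are hypothetical and are not taken from the Explore codebase or deployment tooling. It shows why a rollback whose health checks need a database connection cannot succeed while the connection limit is exhausted, and why a failover to a fresh database unblocks it.

```python
class ConnectionLimitExceeded(Exception):
    """Raised when a database user has no connections left."""


class DatabaseUser:
    """Hypothetical model of a database user with a hard connection cap."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self) -> None:
        # Refuse the connection once the user's cap has been reached.
        if self.open_connections >= self.max_connections:
            raise ConnectionLimitExceeded("too many connections for user")
        self.open_connections += 1

    def disconnect(self) -> None:
        self.open_connections = max(0, self.open_connections - 1)


def health_check(db_user: DatabaseUser) -> bool:
    """Health check that needs its own database connection to pass."""
    try:
        db_user.connect()
    except ConnectionLimitExceeded:
        return False
    db_user.disconnect()
    return True


def roll_back(db_user: DatabaseUser, new_containers: int) -> bool:
    """Assumed rule: a rollback succeeds only if every new container is healthy."""
    return all(health_check(db_user) for _ in range(new_containers))


# The bug has already consumed every connection (illustrative limit of 3).
exhausted = DatabaseUser(max_connections=3)
for _ in range(3):
    exhausted.connect()
rollback_on_exhausted_db = roll_back(exhausted, new_containers=2)

# After a failover, the new database starts with no open connections.
fresh = DatabaseUser(max_connections=3)
rollback_after_failover = roll_back(fresh, new_containers=2)
```

In this model the first rollback fails and the second succeeds, which mirrors the sequence in this report: the rollback could only complete once the failover to a new database had taken place.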
Resolution
To fix this issue, our team completed a failover to a new database and rolled back the offending change in all regions.
Remediation Items
- Create additional monitors and alerts for database connections and CPU usage [Scheduled]
- Create additional smoke tests for the share dashboard workflow [Scheduled]
- Investigate database user connection limits [Scheduled]
- Investigate deploy strategies in the scenario where database connection limits are exhausted [Scheduled]
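As an illustration of the first remediation item, a connection monitor could compare each database user's open connections against its limit and alert before the limit is reached. The sketch below is an assumption-laden example, not the monitor Zendesk scheduled: `ConnectionSample`, `connection_alerts`, the user names, and the 80% threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConnectionSample:
    """One hypothetical reading of a database user's connection usage."""
    user: str
    open_connections: int
    max_connections: int


def connection_alerts(samples: List[ConnectionSample],
                      warn_ratio: float = 0.8) -> List[str]:
    """Return one alert message per user at or above warn_ratio of its limit."""
    alerts = []
    for sample in samples:
        if sample.max_connections <= 0:
            continue  # skip users with no usable limit configured
        usage = sample.open_connections / sample.max_connections
        if usage >= warn_ratio:
            alerts.append(
                f"{sample.user}: {sample.open_connections}/"
                f"{sample.max_connections} connections in use"
            )
    return alerts


# Illustrative readings; the user names and numbers are made up.
readings = [
    ConnectionSample("explore_app", 95, 100),
    ConnectionSample("reporting", 10, 100),
]
alerts = connection_alerts(readings)
```

Alerting on a ratio rather than an absolute count keeps the same rule usable if the connection limits are changed as a result of the third remediation item.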
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.