Summary
On December 3, 2023, from 09:31 until 10:46 UTC, some Support customers on Pod 28 experienced an account-related issue that prevented access and caused server errors while using the product. This issue is now resolved.
We would like to express our sincere apologies for the lack of communication during and after this issue. We are investigating remediations for the internal process breakdown that prevented our communication team from being brought in to update our customers while this issue was present. We are also preparing a post-mortem, which will be posted here once our teams have held a retrospective for this service issue.
Thank you for your patience during this issue and apologies again for our communication error.
POST-MORTEM
Root Cause Analysis
For a window of 15 minutes, our in-database caching facility was unreachable for one of our crucial backend services, as a result of an escaped defect in a code deployment that did not handle such a timeout scenario. This led to a buildup of the message queue backlog, resulting in 503 errors being presented to customers.
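To illustrate the class of defect described above, here is a minimal sketch and not our production code. It assumes a hypothetical cache client whose `get` method raises `TimeoutError` when the cache is unreachable; the names `Cache`, `read_through`, and `load_from_source` are illustrative only. The idea is that a cache timeout is treated as a cache miss, so the request falls back to the source of truth instead of failing.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    """Hypothetical cache client interface (an assumption for this sketch)."""

    def get(self, key: str) -> Optional[str]: ...


def read_through(
    cache: Cache, key: str, load_from_source: Callable[[str], str]
) -> str:
    """Return the cached value, falling back to the source on a miss or timeout."""
    try:
        value = cache.get(key)
    except TimeoutError:
        # Assumed handling: treat an unreachable cache as a miss
        # rather than letting the timeout propagate to the caller.
        value = None
    if value is None:
        value = load_from_source(key)
    return value
```

Whether falling back to the source is appropriate depends on how much load the source can absorb while the cache is down; in some designs, failing fast is the safer choice.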
Resolution
Once the database cache became available again, processing of the service backlog promptly resumed, and we began observing signs of recovery.
Remediation Items
- Adding a specific test case against the caching timeout logic [Scheduled]
- Adding a monitor alert to notify when the caching facility becomes unavailable [Scheduled]
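As a rough sketch of what the monitor alert in the second item could look like (an assumption for illustration, not a description of our actual monitoring stack), the class below notifies once after a configurable number of consecutive failed probes and once more on recovery. The names `CacheAvailabilityMonitor`, `probe`, `notify`, and `threshold` are all hypothetical.

```python
from typing import Callable


class CacheAvailabilityMonitor:
    """Notify when a cache probe fails `threshold` times in a row (illustrative)."""

    def __init__(
        self,
        probe: Callable[[], bool],
        notify: Callable[[str], None],
        threshold: int = 3,
    ) -> None:
        self.probe = probe          # returns True when the cache responds
        self.notify = notify        # delivers the alert message
        self.threshold = threshold  # consecutive failures before alerting
        self.consecutive_failures = 0
        self.alerting = False

    def check(self) -> None:
        """Run one probe; intended to be called on a fixed schedule."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises (for example, on timeout) counts as a failure.
            healthy = False

        if healthy:
            if self.alerting:
                self.notify("cache recovered")
            self.consecutive_failures = 0
            self.alerting = False
            return

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerting:
            # Alert only once per outage to avoid repeated notifications.
            self.alerting = True
            self.notify("cache unavailable")
```

Requiring several consecutive failures before alerting is one way to avoid paging on a single transient blip; the right threshold and probe interval depend on how quickly the backlog grows when the cache is down.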
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Eugene Khoo
Post-mortem published December 19, 2023.