SUMMARY
On February 23, 2024, from 08:00 UTC to 17:26 UTC, Support customers across all Pods experienced an issue where no responses were received via the Side Conversation feature.
Timeline
13:12 UTC | 05:12 PT
We’re currently investigating reports of message update issues with Side Conversations across multiple Pods in Support. Investigation is ongoing. Thanks for your patience.
13:34 UTC | 05:34 PT
We have now deployed an older working version of the platform and started seeing normal processing of emails. We continue monitoring. Another update in 30 min or when we have more information to share.
14:11 UTC | 06:11 PT
We’ve successfully reinstated a previous stable version of our platform, and inbound Side Conversation emails are now functioning as expected. The majority of the backlog has been addressed, and we are now operating at full capacity. We continue working on the remaining recovery tasks to confirm whether any older messages that may not have been processed need to be restored. We will continue to monitor the situation closely and provide another update in 1h, or sooner should there be any significant developments to report. Thank you for your continued understanding.
15:17 UTC | 07:17 PT
We’re maintaining platform stability with no new developments to report at this time. We appreciate your patience as we continue recovery efforts. Updates will now be provided every 4 hours, or as soon as new information becomes available.
POST-MORTEM
Root Cause Analysis
This incident was caused by an escaped defect. New code deployed to the inbound email service included changes to how credential tokens were read from rotating tokens. The running code referenced an expired token, causing emails to be backlogged.
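As a purely illustrative sketch of this failure class, and not the actual inbound email service code, the Python below shows one way a reader of rotating tokens can avoid holding on to an expired value: it checks the expiry on every read and re-fetches when the cached token is at or near expiry. The names `Token`, `RotatingTokenReader`, `fetch`, and `skew` are all hypothetical and chosen only for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Token:
    value: str
    expires_at: float  # Unix timestamp, in seconds


class RotatingTokenReader:
    """Hypothetical helper: hands out a token that has not yet expired."""

    def __init__(
        self,
        fetch: Callable[[], Token],
        clock: Callable[[], float] = time.time,
        skew: float = 30.0,
    ) -> None:
        self._fetch = fetch    # assumed callable that reads the latest rotated token
        self._clock = clock    # injectable so the behaviour can be tested
        self._skew = skew      # refresh this many seconds before expiry
        self._cached: Optional[Token] = None

    def current(self) -> Token:
        now = self._clock()
        # Re-read the source when nothing is cached or the cached token
        # is expired (or about to expire within the skew window).
        if self._cached is None or self._cached.expires_at - self._skew <= now:
            self._cached = self._fetch()
            # Fail loudly rather than send requests with a token that is already dead.
            if self._cached.expires_at <= now:
                raise RuntimeError("token source returned an expired token")
        return self._cached
```

A caller would ask for `reader.current()` immediately before each authenticated request instead of storing the token at startup, so a rotation is picked up on the next read rather than only after a redeploy.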
Resolution
To fix this issue, the deployed code was rolled back to the previous stable version. Operations were then restored, and the email backlog was cleared.
Remediation Items
- Update email service configuration to minimise transient error handling to avoid reprocessing work [Scheduled]
- Update runbook to page the team more quickly in the event of a recurrence of this scenario [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Eugene Khoo
Post-mortem published March 1, 2024.