SUMMARY
On March 08, 2023 from 07:10 UTC to 11:02 UTC, Support customers on Pods 17, 18, 28 and 29 experienced delays in the creation of tickets from inbound emails.
Timeline
09:25 UTC | 01:25 PT
Our teams are investigating reports of email tickets being delayed or not being created for some Support customers on Pods 17, 28 and 29. We will provide more information as it becomes available.
09:45 UTC | 01:45 PT
We continue to investigate reports of email tickets being delayed or not being created for customers on Pods 17, 18, 28 and 29. This will also cause the forwarding status check for integrated emails to fail. We will provide more information soon.
10:09 UTC | 02:09 PT
We can confirm that emails are eventually being delivered; however, they are subject to long delays for customers on EU pods. We will share another update in 30 mins or when more information becomes available.
10:39 UTC | 02:39 PT
Inbound emails for Support customers on EU pods are being processed in the backend but may take some time before creating tickets. Outbound emails should not be affected. We continue to work on resolving this and will provide another update in 30 mins or when we have more details.
11:09 UTC | 03:09 PT
We are seeing full processing of inbound emails for customers on Pods 18, 28 and 29 and approximately 2/3 for Pod 17. We will continue to monitor the situation and will update you in 30 minutes or when more information becomes available.
11:41 UTC | 03:41 PT
We are happy to report the full processing of inbound emails in Support, with the service fully restored for Pods 17, 18, 28 & 29. Forwarding status for integrated emails will need to be rechecked manually. Thank you for your patience while we worked through this.
POST-MORTEM
Root Cause Analysis
The incident was caused by an outage of our primary observability tool. While incoming emails were being processed, attempts to write to logs failed with timeout errors caused by the outage, specifically in the EU region. These timeouts caused intermittent mail processing errors and retries, which built up a backlog and delayed email processing.
All customer emails that were impacted during the incident were eventually retried and processed.
Resolution
This issue was resolved once stability was restored to our primary observability tool.
Remediation Items
- Improve error handling in the code that connects to our primary observability platform, so that connection timeouts are handled gracefully [TO DO]
- Investigate ways to provide additional observability in case of future outages of our primary observability platform [TO DO]
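As an illustration of the first remediation item, here is a minimal sketch of logging that fails gracefully, so that a timeout in the logging path does not interrupt mail processing. This is not the actual implementation. The names `safe_emit`, `process_email` and `send_log` are hypothetical, and the ticket structure is an assumption made for the example.

```python
import sys
from typing import Callable, TextIO


def safe_emit(send: Callable[[str], None], message: str,
              fallback: TextIO = sys.stderr) -> bool:
    """Try to send a log line remotely; never let a logging failure propagate.

    `send` stands in for a (hypothetical) client call to the observability
    platform. Returns True if the remote write succeeded, False otherwise.
    """
    try:
        send(message)
        return True
    except OSError as exc:
        # TimeoutError and ConnectionError are subclasses of OSError.
        # Fall back to local output so the log line is not lost entirely.
        print(f"[log-fallback] {message} (remote logging failed: {exc!r})",
              file=fallback)
        return False


def process_email(raw: str, send_log: Callable[[str], None]) -> dict:
    """Hypothetical mail-processing step: build a ticket, then log it."""
    subject = raw.splitlines()[0] if raw else ""
    ticket = {"subject": subject, "body": raw}
    # Logging is best-effort: a failure here must not fail or retry the email.
    safe_emit(send_log, f"created ticket: {subject}")
    return ticket
```

With this shape, a logging call that raises a timeout results in a local fallback message, and `process_email` still returns the ticket instead of erroring and being retried.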
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Post-mortem published 29 March, 2023