SUMMARY
On November 16, 2023, from 18:02 UTC to 20:00 UTC, some Support customers in Pods 13, 17, 19, 23, 28, and 29 experienced delays or a complete stoppage in receiving inbound emails. For emails sent through Google-managed services, the delay between an email being sent and the ticket being created in Zendesk ranged from 15 to 60 minutes.
TIMELINE
18:53 UTC | 10:53 PT
We are investigating reports of inbound emails not processing for customers on Pods 28 and 29. We will provide additional information shortly.
18:57 UTC | 10:57 PT
We have confirmed an issue that is causing delays in inbound email processing for customers on Pods 13, 19, 23, 28, and 29. Our team is investigating and we will provide further updates as soon as they're available.
19:33 UTC | 11:33 PT
Our team continues to investigate the issue causing inbound email processing delays on Pods 13, 17, 19, 23, 28, and 29. We are working diligently to mitigate impact, and will be sure to share new information as soon as we can.
19:54 UTC | 11:54 PT
We are beginning to see improvement in the issue causing delays in inbound email processing on Pods 13, 17, 19, 23, 28, and 29. Our team will continue to monitor to ensure full recovery.
21:14 UTC | 13:14 PT
We have resolved the issue causing inbound email delays for customers on Pods 13, 17, 19, 23, 28, and 29, and inbound emails are processing as expected at this time. Thank you for your patience during our investigation.
POST-MORTEM
Root Cause Analysis
This incident was caused by connectivity problems between the Mail Fetcher service and Gmail, which disrupted inbound mail processing in Support. The liveness probe interpreted Gmail's 302 Moved responses as failures, which signaled to the container orchestrator that the Pods were unhealthy. The orchestrator then replaced the Pods and halted mail processing in the associated containers, causing inbound mail delays or stoppages.
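The failure mode described above can be illustrated with a minimal sketch. It assumes the probe classifies an HTTP status code returned by the upstream mail provider. The function names and status thresholds below are hypothetical and chosen for illustration only; they are not the actual Mail Fetcher implementation.

```python
from enum import Enum


class ProbeResult(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


def strict_probe(upstream_status: int) -> ProbeResult:
    """Hypothetical strict check: anything other than 200 counts as a failure.

    With this rule, a 302 redirect from the upstream marks the container
    unhealthy, which would prompt an orchestrator to replace it.
    """
    if upstream_status == 200:
        return ProbeResult.HEALTHY
    return ProbeResult.UNHEALTHY


def tolerant_probe(upstream_status: int) -> ProbeResult:
    """Hypothetical tolerant check: 2xx and 3xx both count as healthy.

    A redirect shows the upstream answered, so it is not treated as a sign
    that the local container is broken.
    """
    if 200 <= upstream_status < 400:
        return ProbeResult.HEALTHY
    return ProbeResult.UNHEALTHY


if __name__ == "__main__":
    for status in (200, 302, 503):
        print(status, strict_probe(status).value, tolerant_probe(status).value)
```

In this sketch only the handling of redirects differs between the two checks: the strict check fails on a 302, while the tolerant one fails only on 4xx and 5xx responses. Whether a probe should depend on an external service at all is a separate design question that this example does not settle.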
Resolution
Inbound mail traffic was restored once Gmail stopped blocking those health checks, which allowed the Support inbound email service to finish creating its Pods and resume processing mail. Shortly afterward, the inbound mail queues caught up and traffic returned to normal.
Remediation Items
- Improve the existing tooling used to implement email health checks.
- Create additional alerts.
- Apply corrective code changes to specific applications.
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Jessica G.
Postmortem published November 23, 2023.