SUMMARY
Between 08:09 UTC on November 7, 2023, and 02:20 UTC on November 21, 2023, Zendesk customers across all pods experienced delays in the delivery of outbound email to end users with Gmail addresses or other Google-managed email domains.
Timeline
19:53 UTC | 11:53 PT (Nov 8)
We have confirmed an issue causing some delays in outbound email reception for end users with Gmail addresses, or Google-managed email domains. It will be difficult to diagnose which emails are impacted; however, we are working with Google to mitigate this impact and resolve the issue as soon as possible.
04:23 UTC | 20:23 PT (Nov 10)
We continue to work multiple paths (including with our partners at Google) to diagnose the delayed delivery of some outbound emails to Google-managed email domains for some customers. While the overwhelming majority of emails are being delivered to these domains, we acknowledge the impact that the small proportion of delayed emails can have on some functions and are working hard to move this issue closer to resolution. Thanks for your patience while we continue to work through this issue.
20:02 UTC | 12:03 PT (Nov 13)
Our teams are still working diligently to address the delays in outbound email delivery to some Google-managed email domains. We will provide additional information as our investigation progresses, and we thank you for your continued patience in the meantime.
03:18 UTC | 19:18 PT (Nov 15)
As part of our ongoing investigations, we’ve taken measures to confirm that no undesirable traffic is being generated from our outbound Zendesk IP addresses. We are also actively collaborating with our email service provider to examine and lift any rate limits that may have been imposed. We understand the impact delayed email delivery has on your operations and are looking at all avenues to get this resolved as promptly as possible. Thank you for your patience.
13:10 UTC | 05:10 PT (Nov 16)
We continue to work closely with our email service provider and our focus remains on addressing the rate limits impacting our service. Recognising the significant implications that delayed email delivery can have on your operations, we keep exploring all possible solutions to rectify this issue. We appreciate your understanding and thank you for your patience as we work towards a resolution.
13:24 UTC | 05:24 PT (Nov 17)
No significant changes since our last update. We’re still in active discussions with our email partners to address the email delivery rate limits and get emails processed and delivered as expected as soon as possible. We anticipate receiving more detailed information during AMER hours later today. Your patience is appreciated as we work towards a resolution.
20:10 UTC | 12:10 PT (Nov 17)
We have made some progress working with our providers and have discovered a solution which we believe should address the delays in processing outbound emails. Since its release, we have seen some significant recovery and improvement, and we are hopeful that this progress will persist through to full resolution. There is still a backlogged queue which we are observing drain consistently, so some slight delays may still be experienced until the full backlog has been processed. Our teams are closely monitoring the situation, and we will provide additional information as we continue our work to resolve this incident.
23:10 UTC | 15:10 PT (Nov 17)
As we track and observe further improvement in the successful outbound email processing rate, we are growing in confidence that the fix Google released earlier today could resolve the issue faced in recent days. Our teams will extend their monitoring throughout the weekend and, early next week, pursue additional measures to expedite processing of the existing backlog of outbound emails. We plan to follow up with our results on Monday. Thank you for your continued patience as we work to resolve this incident.
22:32 UTC | 14:32 PT (Nov 20)
After monitoring through the weekend, we are still seeing improved outbound email processing success rate to Google-managed email domains, and further confirmation that the fix Google released is restoring expected behavior. Our teams are completing work to implement additional measures to accelerate the draining of any backlogged emails, and ensure that all outbound messages are received. We thank you again for your patience during this investigation, and we will follow up with additional updates as we move into the week and look to resolve this incident.
14:48 UTC | 06:48 PT (Nov 21)
We would like to inform you that the Gmail backlog has been fully processed. Google is no longer rate-limiting us, and all outbound emails are being delivered as expected. We will continue monitoring until full resolution. Thank you for your understanding during this period.
16:35 UTC | 08:34 PT (Nov 21)
Since addressing the remainder of the Gmail backlog, our teams have continued to observe successful delivery of outbound emails to Google-managed domains, and as such we are now marking this incident as resolved. Thank you for your patience as we completed our investigation.
POST-MORTEM
Root Cause Analysis
In the past, a subset of our IP addresses was listed on a blocklist for a very short period of time. As a result, Google applied a rate limit to email traffic originating from all of our global IPs, rather than only from the subset of IP addresses listed on the blocklist.
Resolution
We worked with Google, and their team applied a fix that alleviated the rate limiting that had previously been applied.
Remediation Items
- Add an internal rate limit to Google domains to more accurately control traffic going to Gmail.
- Create a feedback loop that intelligently analyzes and retries customer traffic that is marked as spam.
- Refine the release process to mitigate the impact of spam.
- Create new alerts that are more sensitive to error codes.
- Improve Zendesk’s spam detection capabilities.
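As an illustration of the first remediation item, an internal rate limit on traffic to Google domains could be built on a token bucket. The following is a minimal sketch under stated assumptions, not a description of Zendesk's actual implementation: the class names, the `THROTTLED_DOMAINS` set, and the rate and capacity values are all hypothetical. Matching on the recipient's domain name is also a simplification, since Google-managed custom domains would in practice need to be identified another way (for example, by their MX records).

```python
import time

# Hypothetical set of recipient domains to throttle (illustrative only).
THROTTLED_DOMAINS = {"gmail.com", "googlemail.com"}


class TokenBucket:
    """Permits bursts of up to `capacity` sends, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = capacity        # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; return False when the caller should wait."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def recipient_domain(address):
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


class OutboundThrottle:
    """Applies one shared token bucket to all throttled recipient domains."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._bucket = TokenBucket(rate, capacity, clock)

    def may_send(self, recipient):
        # Traffic to other domains is not limited by this sketch.
        if recipient_domain(recipient) not in THROTTLED_DOMAINS:
            return True
        return self._bucket.try_acquire()
```

A sender would call `may_send(recipient)` before each delivery attempt and requeue the message for later when it returns `False`, so that bursts are smoothed out internally rather than being deferred by the receiving provider.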
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Billy Macken
Postmortem published December 5, 2023.