Service Incident - November 9, 2023 - Support | Pod 28 - Outbound email delays and failures

SUMMARY

On November 9, 2023, from 14:00 UTC to 16:34 UTC, a subset of Support customers in Pod 28 experienced delays of approximately 2 hours in sending ticket and system-generated email notifications.

Timeline

15:24 UTC | 07:24 PT
We are currently aware of outbound email delays and failures for customers on Pod 28. All emails will be retried once we have a fix in place. Investigation is underway and we will update you shortly with more information.

15:41 UTC | 07:41 PT
We have found the root cause of the issue and are working on a potential fix. We will provide another update in 30 minutes or as soon as we have more information.

16:08 UTC | 08:08 PT
Our team continues to actively work on a fix for the email issue for customers on Pod 28 and we will continue to provide updates as we have them.

16:28 UTC | 08:28 PT
We are beginning to see some improvement in the issue causing outbound email delays and failures for customers on Pod 28. Our team will continue to monitor the situation to ensure full recovery.

17:56 UTC | 09:56 PT
We are still recovering from the issue causing delays and failures in outbound email delivery on Pod 28. We are working through a backlog, so some delays are still expected. Our team is still monitoring the situation to ensure services are fully restored.

18:26 UTC | 10:26 PT
We have fully recovered from the issue causing delays and failures in outbound email delivery for customers on Pod 28. Thank you for your patience during our investigation.

POST-MORTEM

Root Cause Analysis

This incident was caused by a sudden increase in email processing jobs from one specific account. At that point, the system did not have enough memory and CPU resources to handle the load, which delayed outbound email.

Resolution

To fix this issue, we increased the system's memory and CPU resources and reduced the number of emails that could be processed at the same time. This allowed the system to handle the increased load and process the queued emails.
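
As an illustration of the second part of the resolution (capping how many emails are processed at the same time), here is a minimal Python sketch. It is not the actual email pipeline: the names drain_queue, fake_send, and PeakTracker, and the use of a thread pool, are assumptions made for this example only. It shows one common way to bound concurrency so that a burst of queued jobs cannot consume more workers than the system can support.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class PeakTracker:
    """Records the highest number of jobs that ran at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._active -= 1
        return False  # do not swallow exceptions


def drain_queue(jobs, send, max_concurrent):
    """Call send(job) for each queued job, at most max_concurrent at a time.

    Results come back in the same order as the jobs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(send, jobs))


def fake_send(job, tracker, delay=0.01):
    """Hypothetical stand-in for delivering one email."""
    with tracker:
        time.sleep(delay)
        return f"sent:{job}"
```

Lowering max_concurrent trades throughput for a smaller memory and CPU footprint: the backlog drains more slowly, but each worker has enough resources to finish its job.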

Remediation Items

Adjusted the system's resources to ensure it can handle the load.
Create additional pageable alerts for monitoring.
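
The report does not say what the additional pageable alerts will measure. As a hedged sketch only, a check of this kind might page on-call staff when the email queue grows too deep or its oldest job waits too long. QueueSnapshot, should_page, and both threshold values below are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSnapshot:
    """Hypothetical point-in-time view of the outbound email queue."""
    depth: int                 # number of jobs waiting
    oldest_job_age_s: float    # seconds the oldest job has waited


def should_page(snapshot, max_depth=10_000, max_age_s=900.0):
    """Return True when the backlog is too deep or the oldest job is too old.

    Both thresholds are illustrative assumptions, not values from the report.
    """
    return snapshot.depth > max_depth or snapshot.oldest_job_age_s > max_age_s
```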

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.