SUMMARY
On July 13, 2021 from 12:05 UTC to 19:55 UTC, there were issues impacting Zendesk Chat and Email (Support) delivery. Customers using Zendesk Chat experienced one of several issues:
- Chats were not loading
- The Chat real-time monitoring dashboard showed no data
- Chat transcripts were delayed
Other customers experienced delays with email delivery.
Timeline
12:31 UTC | 05:31 PT
Our teams are investigating an incident that is impacting Zendesk Chat and Messaging and causing email delivery delays for Pod 18 customers. We will provide additional information shortly.
12:53 UTC | 05:53 PT
We have confirmed a service disruption affecting Zendesk Chat, Messaging & Email Deliverability due to a provider failure. This incident is affecting multiple pods. Our team is actively working on solving this issue.
13:10 UTC | 06:10 PT
The team is working with our provider to fix the partial outage of Chat, Messaging & Agent Workspace. The Email Deliverability issue for Pod 18 customers is now stable. We will keep you informed as we progress.
13:44 UTC | 06:44 PT
We are seeing an improvement on the partial outage of Chat, Messaging & Agent Workspace affecting multiple Pods and our team is still actively working with our provider to solve the root cause of this issue. We are closely monitoring the progress and will keep you informed.
14:17 UTC | 07:17 PT
We continue to work on stabilising the issues affecting Messaging, Chat & Agent Workspace. We have achieved a partial recovery: some customers are seeing improvement, and others have confirmed full recovery. Our team is still actively working with our provider to fully resolve the issue.
15:03 UTC | 08:03 PT
Although we are seeing improvements for most accounts regarding the outage affecting Messaging, Chat & Agent Workspace, the team is still working with our provider on full recovery and fix. Next update within 1 hour or when we have more details.
16:04 UTC | 09:04 PT
Our team is still working with our provider on full recovery and fix for the outage affecting Messaging, Chat & Agent Workspace. Next update within 1 hour or when we have more details.
17:00 UTC | 10:00 PT
We are seeing recovery in Messaging, Chat, & Agent Workspace. Some customers may see some latency when retrieving Chat History as backfills complete and behavior returns to normal.
20:58 UTC | 13:58 PT
Messaging, Chat, & Agent Workspace functionality has been restored. Chat transcripts that were delayed as a result of the incident have been processed and should appear in Chat History without any issues at this time.
POST-MORTEM
Root Cause Analysis
This incident was caused by connectivity issues to an Availability Zone in our data centre in the EU-Central-1 region, where our chat infrastructure is provisioned.
Resolution
To resolve this issue, our data centre partner undertook procedures to repair our data instances, whilst Zendesk restarted the related services upon reconnecting to our data centre. Chat routing and email delivery recovered thereafter. Zendesk also increased the capacity of the chat history data cluster to reduce chat history search latency. Chat RTM dashboards gradually recovered and showed up-to-date data.
Remediation Items
- Upgrade the capacity of the Chat History search data cluster [Completed]
- Improve resilience and speed up recovery by expanding the Fault Domains for Chat services [Scheduled]
- Review Runbook on how to more efficiently bring Chat History services back online to process traffic backlog [Scheduled]
- Perform more regular chaos testing for Fault Domain failures [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.