SUMMARY
On January 23rd, between 17:00 and 17:56 UTC, Sunshine Conversations customers in the North America region experienced degraded performance when processing inbound messages. During this time, performance was degraded for all customers using Agent Workspace or the Sunshine Conversations APIs, and for customers trying to initiate chat connections with Zendesk customer support (as well as for Zendesk agents responding to those customers via chat). Webhooks were also degraded and delayed during this time.
Inbound messages to the Sunshine Conversations API and social channels are enqueued in a three-node cluster. At approximately 17:00 UTC, a network partition separated node 1 from the other two nodes (2 and 3) of the queueing cluster. In this configuration, node 1 closed all of its connections and restarted. After the restart, the queues were rebalanced, which blocked inbound requests until the rebalancing completed. However, because node 1 was slow to come back online, it was forcibly rebooted at 17:28 UTC.
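One way to picture why node 1, and not nodes 2 and 3, shut its connections is a majority check. The sketch below is illustrative only: it assumes a policy in which a node that can no longer see a strict majority of the cluster pauses itself. The report does not name the queueing software or its partition-handling policy, and the function name is hypothetical.

```python
def is_in_minority(cluster_size: int, reachable_peers: int) -> bool:
    """Return True if this node sees no strict majority of the cluster.

    Assumed policy (not confirmed by the report): a node on the
    minority side of a partition stops serving clients.
    """
    visible = reachable_peers + 1  # count this node itself
    return 2 * visible <= cluster_size


CLUSTER_SIZE = 3

# Node 1 was cut off from both peers, so it sees only itself.
node1_paused = is_in_minority(CLUSTER_SIZE, reachable_peers=0)

# Nodes 2 and 3 could still reach each other, so each sees two nodes.
node2_paused = is_in_minority(CLUSTER_SIZE, reachable_peers=1)
```

Under this assumed policy, only the isolated node pauses, while the two nodes that still form a majority keep running.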
18:19 UTC | 10:19 PT
From 17:00 to 17:51 UTC we experienced a Sunshine Conversations and social messaging add-on degradation, resulting in delayed sending and receiving of messages. This impacted our customers’ ability to submit chat requests to support.zendesk.com and our agents’ ability to respond. Service is now restored and the issue is now resolved.
POST-MORTEM
Root Cause Analysis
The exact cause of the network partition is unknown and an investigation is ongoing. The degraded behavior of the offline node, including the need for a forced restart and the blocking of client requests during rebalancing, is also being investigated.
Resolution
Node 1 was rebooted, which restored service.
Remediation Items
- Enqueue messages in a secondary queueing system when the primary cluster is unavailable or returning errors
- Monitor connections and channels to the queueing cluster, and alert when they exceed predefined thresholds
- Document when and how to adjust the internal setting that switches from messaging to email when messaging is degraded in the instance Zendesk uses for customer support
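The first two remediation items can be sketched roughly as follows. This is a minimal illustration and not the actual implementation: the class and function names, the in-memory queues, and the threshold values are all hypothetical stand-ins for whatever queueing system and monitoring stack are in use.

```python
from collections import deque


class QueueUnavailable(Exception):
    """Raised by a (hypothetical) queue client that cannot accept a message."""


class InMemoryQueue:
    """Stand-in for a real queue client, used only for illustration."""

    def __init__(self, available=True):
        self.available = available
        self.messages = deque()

    def enqueue(self, message):
        if not self.available:
            raise QueueUnavailable("queue is unavailable")
        self.messages.append(message)


def enqueue_with_fallback(message, primary, secondary):
    """Try the primary queue first; use the secondary if it errors."""
    try:
        primary.enqueue(message)
        return "primary"
    except QueueUnavailable:
        secondary.enqueue(message)
        return "secondary"


def exceeded_thresholds(metrics, thresholds):
    """Return the sorted names of metrics that are above their threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )


# Simulate the incident: the primary cluster is rejecting messages.
primary = InMemoryQueue(available=False)
secondary = InMemoryQueue()
used = enqueue_with_fallback({"id": 1, "text": "hello"}, primary, secondary)

# Hypothetical threshold values; real limits depend on the deployment.
alerts = exceeded_thresholds(
    {"connections": 1200, "channels": 50},
    {"connections": 1000, "channels": 500},
)
```

A real fallback would also need a way to drain the secondary queue back into the primary once it recovers, and to preserve message ordering where that matters; neither is shown here.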
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.