SUMMARY
From November 17, 2022 at 15:24 UTC to November 18, 2022 at 22:30 UTC, Sunshine Conversations customers experienced delays in connecting to websockets and in receiving messages over those connections. This resulted in slow initial loading of the Sunshine Conversations Messenger Widget and slow message delivery from the widget. The issue also affected any customer using a Sunshine Conversations SDK or Zendesk Messaging SDK integration.
Timeline
19:01 UTC | 11:01 PT
We are aware of an issue causing delays in Sunshine Conversations Messenger Widget initial load and responsiveness, and our team is already investigating. We will provide additional information as soon as it's available.
20:16 UTC | 12:16 PT
We are still investigating delays in Sunshine Conversations Messenger Widget initial load and responsiveness, including the Zendesk ZBot widget to contact Zendesk. We will temporarily redirect requests to our team to the Web Form, and provide additional updates as we learn more.
22:22 UTC | 14:22 PT
We are seeing improvements with the Sunshine Conversations Messenger Widget initial load time and responsiveness and will continue to monitor performance. We've also re-enabled the Zendesk ZBot widget on support.zendesk.com.
Nov 18 - 11:30 UTC | 03:30 PT
We appreciate your patience while we continue monitoring Sunshine Conversations Messenger Widget initial load time and responsiveness issues. We will provide more details as we have them.
Nov 18 - 22:51 UTC | 14:51 PT
Our team continues to monitor the issue affecting Sunshine Conversations Messenger Widget responsiveness and is seeing additional improvements. We will continue to monitor through the weekend and provide an update on Monday.
Nov 21 - 21:48 UTC | 13:48 PT
We're happy to report that the issue affecting Sunshine Conversations Messenger Widget responsiveness is now resolved. Thank you for your patience as we worked to resolve the issue.
POST-MORTEM
Root Cause Analysis
The issue was primarily caused by an overloaded Redis cluster on Pod 23. Our alerting and monitoring did not detect the high CPU usage because they relied on an incorrect metric for Redis cluster utilization.
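This report does not say which metric was monitored or how it was wrong. Purely as an illustration of one common pitfall, the sketch below assumes per-node CPU utilization is derived from the cumulative `used_cpu_sys` and `used_cpu_user` counters that Redis reports in its `INFO` output, and that alerting is driven by the busiest node rather than a cluster-wide average. The class and function names and the 0.8 threshold are hypothetical, not Zendesk's actual monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSample:
    """One reading of a Redis node's cumulative CPU counters, in seconds."""
    timestamp: float      # wall-clock time of the reading, in seconds
    used_cpu_sys: float   # cumulative system CPU seconds (INFO: used_cpu_sys)
    used_cpu_user: float  # cumulative user CPU seconds (INFO: used_cpu_user)


def cpu_utilization(prev: CpuSample, curr: CpuSample) -> float:
    """Return CPU seconds consumed per wall-clock second between two samples.

    A value near 1.0 means the node kept roughly one full core busy.
    """
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    busy = (curr.used_cpu_sys + curr.used_cpu_user) - (
        prev.used_cpu_sys + prev.used_cpu_user
    )
    return busy / elapsed


def hot_nodes(utilization_by_node: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return the names of nodes at or above the threshold, sorted by name.

    Checking each node separately avoids a cluster-wide average that can
    look healthy while a single node is saturated.
    """
    return sorted(
        name for name, value in utilization_by_node.items() if value >= threshold
    )
```

In this sketch, three nodes at 0.9, 0.2, and 0.85 average 0.65, which would stay below a 0.8 alert threshold even though two nodes are close to saturation; `hot_nodes` flags both.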
Resolution
Once the issue was identified, additional Redis capacity was added and performance stabilized. After further monitoring over the weekend, upgrades were applied to Redis on Pod 23 for improved reliability.
Remediation Items
- Add additional Redis capacity.
- Upgrade Redis clusters.
- Fall back to long polling when a client disconnects from websockets too often.
- Update and create additional Redis monitoring.
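The long-polling fallback listed above could be sketched roughly as follows. This is a minimal illustration, not the SDK's actual logic: it assumes "too often" means a fixed number of websocket disconnects inside a sliding time window. The class name, the default of 3 disconnects in 60 seconds, and the transport labels are all hypothetical.

```python
from collections import deque


class TransportSelector:
    """Choose a transport based on recent websocket disconnects.

    Timestamps are passed in explicitly (for example from time.monotonic())
    so the logic stays deterministic and easy to test.
    """

    def __init__(self, max_disconnects: int = 3, window_seconds: float = 60.0) -> None:
        self.max_disconnects = max_disconnects
        self.window_seconds = window_seconds
        self._disconnects = deque()  # disconnect timestamps, oldest first

    def _prune(self, now: float) -> None:
        # Drop disconnects that have aged out of the sliding window.
        cutoff = now - self.window_seconds
        while self._disconnects and self._disconnects[0] <= cutoff:
            self._disconnects.popleft()

    def record_disconnect(self, now: float) -> None:
        """Register one websocket disconnect observed at time `now`."""
        self._disconnects.append(now)
        self._prune(now)

    def transport(self, now: float) -> str:
        """Return "long-polling" while disconnects are frequent, else "websocket"."""
        self._prune(now)
        if len(self._disconnects) >= self.max_disconnects:
            return "long-polling"
        return "websocket"
```

Because old disconnects expire from the window, a client in this sketch returns to websockets on its own once the connection has been stable for long enough.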
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.
Post-Mortem published November 29, 2022