SUMMARY
On November 21, 2024 from 21:02 UTC to 21:56 UTC, some customers using Sunshine Conversations hosted on Pod 17 experienced slowness and performance issues.
TIMELINE
November 24, 2024 10:23 PM UTC | November 24, 2024 02:23 PM PT
We are pleased to announce that the latency issues impacting Sunshine Conversations for some of our customers on Pod 17 have now been resolved. Thank you very much for your patience!
November 24, 2024 10:09 PM UTC | November 24, 2024 02:09 PM PT
We believe we have identified the root cause of the performance issues impacting SunCo for our customers on Pod 17. We are now seeing improvements and will continue to monitor the behaviour.
November 24, 2024 09:53 PM UTC | November 24, 2024 01:53 PM PT
We continue to investigate performance issues on Pod 17. These may cause slowness in Sunshine Conversations. We will provide further updates soon.
November 24, 2024 09:36 PM UTC | November 24, 2024 01:36 PM PT
We are investigating potential performance issues impacting some of our customers hosted on Pod 17. We will post an update with further details soon.
POST-MORTEM
Root Cause Analysis
This incident was caused by an unexpected surge in traffic on Pod 17, which more than doubled in the preceding week and almost tripled on the day of the incident. The Unity SDK used by a customer was making excessive requests to the SunCo API to retrieve unread message counts, which increased load on the system. The resource auto-scaler was already at its configured maximum, so no further resources could be added to handle the extra traffic. The resulting overload slowed response times and ultimately tripped health checks, which initiated restarts and compounded the issue.
Resolution
To resolve the performance issues, we increased the maximum number of replicas for the SunCo API on Pod 17. This adjustment allowed the system to better handle the increased traffic and restored normal response times for all customers.
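The effect of the replica ceiling can be illustrated with a minimal sketch. It assumes a proportional scaling rule of the kind many auto-scalers use; the function name, the load figures, and the replica limits are all hypothetical and are not taken from the SunCo configuration.

```python
import math


def desired_replicas(current_replicas, observed_load, target_load,
                     min_replicas, max_replicas):
    """Proportional scaling rule, clamped to the configured bounds.

    Hypothetical helper: scales the replica count by the ratio of observed
    per-replica load to target per-replica load, then clamps the result.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_replicas * observed_load / target_load)
    return max(min_replicas, min(raw, max_replicas))


# Illustrative numbers only: per-replica load at three times the target
# while the service already runs at its ceiling of 10 replicas.
capped = desired_replicas(10, 300.0, 100.0, min_replicas=2, max_replicas=10)

# Same load after the ceiling is raised (40 is an assumed value).
raised = desired_replicas(10, 300.0, 100.0, min_replicas=2, max_replicas=40)

print(capped, raised)
```

With the lower ceiling the rule asks for more replicas than it is allowed to create, so the replica count stays flat while load keeps rising. Raising the ceiling lets the same rule scale out.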
Remediation Items
- Investigate the Unity SDK to understand why it is sending an excessive number of requests to SunCo and implement optimizations.
- Document backend interaction patterns in embeddables to clarify usage and identify potential inefficiencies.
- Evaluate the implementation of a caching strategy for SDK APIs in SunCo to reduce the number of requests made.
- Add monitoring to detect abnormal traffic growth over specified periods to proactively address potential overloads.
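As a rough illustration of the last item, the sketch below flags a period whose request count has grown past a chosen multiple of a recent baseline. The function names, the 2.0 threshold, and the sample counts are assumptions made for the example, not values from our monitoring stack.

```python
def growth_ratio(baseline_counts, current_count):
    """Ratio of the current period's request count to the baseline mean."""
    if not baseline_counts:
        raise ValueError("baseline_counts must not be empty")
    baseline = sum(baseline_counts) / len(baseline_counts)
    if baseline == 0:
        # No baseline traffic: any traffic at all counts as unbounded growth.
        return float("inf") if current_count > 0 else 1.0
    return current_count / baseline


def is_abnormal_growth(baseline_counts, current_count, threshold=2.0):
    """True when traffic has grown to at least `threshold` times the baseline."""
    return growth_ratio(baseline_counts, current_count) >= threshold


# Hypothetical daily request counts for the four preceding days.
previous_days = [1000, 1000, 1000, 1000]

alert_on_spike = is_abnormal_growth(previous_days, 2900)   # near-tripled traffic
alert_on_normal = is_abnormal_growth(previous_days, 1200)  # ordinary variation

print(alert_on_spike, alert_on_normal)
```

A check like this would need to be evaluated over more than one window (for example daily and weekly) to catch both sudden spikes and slower growth such as the doubling seen in the week before the incident.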
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Bob Novak
Postmortem published on Dec 4, 2024