SUMMARY
On October 1, 2020, from 08:08 to 12:58 UTC, Talk customers on Pod 17 experienced elevated latency, intermittent dropped calls, and degraded dashboard functionality.
TIMELINE
13:20 UTC | 06:20 PT
Talk performance on Pod 17 is now stable. Thanks for your patience.
12:23 UTC | 05:23 PT
We continue to work on the issue with dropped calls on Pod 17. We have seen some improvement but are still working on a full resolution.
11:00 UTC | 04:00 PT
We are seeing fewer dropped calls on Pod 17, but are continuing to investigate the cause of this issue. We'll follow up when an update is available.
10:21 UTC | 03:21 PT
We are continuing our efforts to resolve issues with Talk on Pod 17. More to follow.
09:45 UTC | 02:45 PT
We are seeing improvements in Talk performance on Pod 17. We continue to monitor and will post another update soon.
09:14 UTC | 02:14 PT
We’ve received reports of calls dropping on Pod 17 and are investigating. More to follow.
ROOT CAUSE ANALYSIS
This incident occurred when a backend dependency on the shared cluster consumed resources beyond expected capacity, leaving insufficient resources for Talk.
RESOLUTION
To resolve the issue, resources were scaled up for the Talk application while the backend dependency that had been consuming excessive resources returned to a stable state.
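For illustration only: the report does not name the orchestration platform, service names, or values involved, but on a Kubernetes-style shared cluster, a scale-up of this kind might look like raising the replica count and resource requests for the Talk service so a noisy neighbor cannot starve it. Everything below is a hypothetical sketch.

```yaml
# Hypothetical sketch only: platform, names, and values are assumed,
# not taken from the incident report.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: talk            # assumed service name
spec:
  replicas: 12          # raised from a lower baseline to absorb the load
  selector:
    matchLabels:
      app: talk
  template:
    metadata:
      labels:
        app: talk
    spec:
      containers:
        - name: talk
          image: registry.example.com/talk:stable   # placeholder image
          resources:
            requests:
              cpu: "2"        # reserve capacity so another tenant on the
              memory: 4Gi     # shared cluster cannot starve the Talk service
            limits:
              cpu: "4"
              memory: 8Gi
```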
REMEDIATION ITEMS
- Investigate methods to make Talk Dashboard more resilient to degradation.
- Create additional latency alerts for shared clusters (a hypothetical example follows below).
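As a sketch of what the second remediation item could look like, here is a Prometheus-style alerting rule for call-setup latency on a shared cluster. The report does not describe Zendesk's monitoring stack, so the metric name, threshold, and labels below are all assumptions.

```yaml
# Hypothetical alerting rule: metric names, thresholds, and labels are
# illustrative; the report does not describe the actual monitoring stack.
groups:
  - name: talk-shared-cluster-latency
    rules:
      - alert: TalkCallSetupLatencyHigh
        # Fires when 95th-percentile call-setup latency exceeds 2 seconds
        # for 5 minutes on any pod served by the shared cluster.
        expr: |
          histogram_quantile(0.95,
            sum by (pod, le) (rate(talk_call_setup_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "High call-setup latency on {{ $labels.pod }}"
          description: "p95 call-setup latency has been above 2s for 5 minutes."
```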
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.