Service Incident - December 10th, 2021 - Dropped calls and latency in Talk on all Pods

Summary

On December 10, 2021, from approximately 20:15 UTC until 21:50 UTC, Talk customers on all Pods may have experienced dropped calls, increased latency, and other errors.

Timeline

21:03 UTC | 13:03 PT

We are investigating reports of Talk dropped calls, increased latency, and timeout errors. We will provide an update as soon as we have more information to share.

21:24 UTC | 13:24 PT

From 22:15 UTC to 23:00 UTC, Talk experienced a degradation and partial outage resulting in dropped calls, increased latency, the inability to make outbound calls, and calls being rerouted. We are seeing recovery and will provide an update once fully resolved.

22:58 UTC | 14:58 PT

We’re happy to report that issues with Talk have been fully resolved. Please note, the corrected time of impact was 20:15 UTC to 21:00 UTC.

Root Cause Analysis

This incident was caused by a failure in our service provider’s new rate-limiting service. When the rate-limiting servers were replaced, a surge in traffic to our provider’s API Gateway resulted in backpressure and high memory consumption, leading to timeouts for network requests and to the symptoms experienced by our customers.
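
This report does not describe the provider's internals, so the following is only a toy model of the mechanism above, not the provider's implementation. It assumes a gateway that completes a fixed number of requests per tick and an optional rate limiter in front of it; the function name, parameters, and traffic numbers are all hypothetical. It shows how a surge that is not rate-limited turns into a growing backlog, which is what backpressure and rising memory use look like in practice.

```python
def simulate_backlog(arrivals, capacity, limit=None):
    """Return the gateway queue depth after each tick.

    arrivals: requests arriving in each tick
    capacity: requests the gateway can complete per tick
    limit:    max requests admitted per tick by a rate limiter,
              or None when no limiter is in place
    """
    backlog = 0
    depths = []
    for arriving in arrivals:
        # A limiter rejects anything above its per-tick allowance.
        admitted = arriving if limit is None else min(arriving, limit)
        # Work that cannot be completed this tick stays queued in memory.
        backlog = max(0, backlog + admitted - capacity)
        depths.append(backlog)
    return depths


# Hypothetical load: steady traffic, a three-tick surge, then steady again.
traffic = [80, 80, 300, 300, 300, 80, 80]

print(simulate_backlog(traffic, capacity=100))             # no limiter
print(simulate_backlog(traffic, capacity=100, limit=100))  # limiter in place
print(simulate_backlog(traffic, capacity=300))             # more capacity
```

In this sketch the backlog keeps growing for the length of the surge when nothing limits admission, and drains only slowly afterwards. Either restoring the limiter or raising capacity keeps the queue empty.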

Resolution

To fix this issue, our service provider increased the API Gateway capacity.

Remediation Items

Our service provider is working through a number of steps to ensure this issue doesn’t resurface. These items include:

Using a test harness to ensure the gateway layer is more resilient and fault-tolerant
Memory tuning
Increased logging and monitoring for the impacted service

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.