Summary
On April 7, 2021, from 08:45 UTC to 09:08 UTC, agents using Zendesk Chat through Agent Workspace in Support were disconnected from all new and existing chat sessions.
Timeline
15:44 UTC | 08:44 PT
The issue with call connectivity for Zendesk Talk is now resolved. Please let us know if you’re still experiencing these issues with Talk. Thank you for your patience.
14:42 UTC | 07:42 PT
The issue with call connectivity for Zendesk Talk is stable for now as we haven’t received new reports from customers. We are gathering more information with our provider and will keep you informed as we learn more.
13:59 UTC | 06:59 PT
We continue to investigate the probable cause of the call connection issues within Zendesk Talk. We are seeing call quality improving and continue to monitor.
13:37 UTC | 06:37 PT
Between 13:00 and 13:16 UTC, we observed issues with calls disconnecting in Zendesk Talk. Our engineering teams are monitoring the situation. We will provide further updates as soon as they are available.
POST-MORTEM
Root Cause Analysis
This incident was caused by a recent deployment of Agent Workspace chat functionality. The deployment included an application upgrade that changed the DNS resolution mechanism. The upgraded library was unable to handle truncated DNS responses. Truncation occurs only when the service resolves domain names that map to a large number of host IPs, as they do in production, so the problem did not surface in the staging setup. This resulted in the failed connections observed by chat agents in Agent Workspace.
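To illustrate the truncation behavior described above, here is a minimal TypeScript sketch. It is not the code involved in this incident. All function and constant names are hypothetical, and only the DNS header layout and the classic 512-byte UDP limit come from RFC 1035. The sketch shows how a client can detect the truncation (TC) flag in a response and decide to retry over TCP, and roughly how many A records fit in a classic UDP response before truncation becomes likely. The size estimate assumes no EDNS0, an ASCII hostname without a trailing dot, and name compression in every answer record. The resolver setup in the affected service may have differed.

```typescript
// Illustrative sketch only: every name below is hypothetical and is not taken
// from the affected application or its DNS library.

const DNS_HEADER_BYTES = 12;          // fixed DNS header size (RFC 1035)
const TC_FLAG = 0x0200;               // truncation bit in the 16-bit flags word
const CLASSIC_UDP_LIMIT = 512;        // UDP message limit without EDNS0 (RFC 1035)
const COMPRESSED_A_RECORD_BYTES = 16; // 2 name pointer + 2 type + 2 class + 4 TTL + 2 length + 4 address

type NextStep = "use-answers" | "retry-over-tcp";

// Returns true when the response header has the TC (truncated) flag set.
function isTruncated(message: Uint8Array): boolean {
  if (message.byteLength < DNS_HEADER_BYTES) {
    throw new RangeError("DNS message is shorter than the 12-byte header");
  }
  const view = new DataView(message.buffer, message.byteOffset, message.byteLength);
  const flags = view.getUint16(2); // bytes 2-3, big-endian by default
  return (flags & TC_FLAG) !== 0;
}

// A client that handles truncation retries the same query over TCP
// instead of using the partial answer set or failing outright.
function nextStep(message: Uint8Array): NextStep {
  return isTruncated(message) ? "retry-over-tcp" : "use-answers";
}

// Rough upper bound on A records in one classic UDP response.
// Assumes an ASCII hostname without a trailing dot.
function maxARecordsOverClassicUdp(hostname: string): number {
  const questionBytes = hostname.length + 2 + 4; // encoded name + QTYPE + QCLASS
  const available = CLASSIC_UDP_LIMIT - DNS_HEADER_BYTES - questionBytes;
  return Math.max(0, Math.floor(available / COMPRESSED_A_RECORD_BYTES));
}

// Example headers: flags 0x8380 has QR, TC, RD, and RA set; 0x8180 omits TC.
const truncatedHeader = new Uint8Array([0x12, 0x34, 0x83, 0x80, 0, 1, 0, 0, 0, 0, 0, 0]);
const completeHeader = new Uint8Array([0x12, 0x34, 0x81, 0x80, 0, 1, 0, 1, 0, 0, 0, 0]);
```

Under these assumptions, a name that resolves to only a few hosts stays well below the limit, which is consistent with the issue not appearing in staging, while a name backed by several dozen hosts can exceed it and come back with the TC flag set.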
Resolution
To fix this issue, our engineers rolled back to the last known working version of the application. Recovery was observed soon after rollback.
Remediation Items
- Additional diagnostics on underlying API endpoint to monitor future deploys [Done]
- Review wider usage of the culprit Node.js image and migrate away from it [Done]
- Additional soak stage for websocket connection management [Done]
- Implement alternative Node.js base image in application [Scheduled]
- Improve alerting around request rates [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.