SUMMARY
On April 7, 2021, from 08:45 UTC to 09:08 UTC, agents using Zendesk Chat through Agent Workspace in Support on all Pods were disconnected from all new and existing chat sessions.
Timeline
09:23 UTC | 02:23 PT
Some Agent Workspace customers may have experienced connection issues with Zendesk Chat. Our engineering team has deployed a fix. If you continue to experience this issue, please refresh your browser.
09:56 UTC | 02:56 PT
The issue regarding Chat connectivity for Agents within Agent Workspace for some customers has been resolved and a summary will follow. We strive to provide the best experience for our customers and appreciate your patience with this.
POST-MORTEM
Root Cause Analysis
This incident was caused by a recent deployment to the Agent Workspace chat functionality. The deployment included an application upgrade that changed the Domain Name System (DNS) resolution mechanism. The upgraded library was unable to handle truncated DNS responses. Truncation occurs only when the service resolves domain names that map to a large number of host IPs, which is the case in production but not in staging, so the problem did not surface on the staging setup. As a result, chat agents in Agent Workspace experienced failed connections.
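For readers unfamiliar with DNS truncation, the following is a minimal, generic sketch of how a resolver can detect a truncated response and retry over TCP, as described in RFC 1035. It is an illustration only and not the code or library involved in this incident. The function names `is_truncated` and `resolve_with_fallback`, and the injected `send_udp` / `send_tcp` callables, are hypothetical.

```python
import struct
from typing import Callable

# TC (truncation) bit in the 16-bit flags field of the DNS header
# (RFC 1035, section 4.1.1).
TC_FLAG = 0x0200


def is_truncated(response: bytes) -> bool:
    """Return True if the DNS response header has the TC bit set."""
    if len(response) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    # The header starts with a 16-bit message ID followed by 16 bits of flags.
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def resolve_with_fallback(
    query: bytes,
    send_udp: Callable[[bytes], bytes],  # hypothetical transport: sends over UDP
    send_tcp: Callable[[bytes], bytes],  # hypothetical transport: sends over TCP
) -> bytes:
    """Try UDP first; if the answer is truncated, repeat the query over TCP."""
    response = send_udp(query)
    if is_truncated(response):
        # A large record set may not fit in a single UDP response, so the
        # server sets TC and the client is expected to retry over TCP.
        response = send_tcp(query)
    return response
```

A resolver that skips the TCP retry may end up with an incomplete or unusable answer for names with many records, which is the kind of condition that only appears when the record set is large.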
Resolution
To fix this issue, our engineers rolled back to the last known working version of the application. Recovery was observed soon after rollback.
Remediation Items
- Additional diagnostics on the underlying API endpoint to monitor future deploys [Scheduled]
- Review wider usage of the culprit JavaScript platform image and migrate away from it [Scheduled]
- Additional soak stage for WebSocket connection management [Scheduled]
- Implement an alternative JavaScript platform base image in the application [Scheduled]
- Improve alerting around request rates [Scheduled]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.