SUMMARY
On June 8, 2020, from 14:40 UTC to 19:20 UTC, Zendesk Chat customers experienced an issue with Chat Monitor metrics not being updated.
TIMELINE
17:44 UTC | 10:44 PT
We are investigating issues with the Chat Monitor dashboard not reliably updating values such as queue size and wait time across many accounts. We will update as soon as we have more info.
18:28 UTC | 11:28 PT
Our Engineers continue to investigate the cause of the Chat Monitor dashboard discrepancies. We will provide our next update in one hour or as we have any discoveries to share.
19:28 UTC | 12:28 PT
Our Engineers have found the likely source of the issue and have taken the appropriate remediation steps. We will continue to monitor as the systems recover. We will provide our next update when we are fully recovered or have additional observations to share.
19:46 UTC | 12:46 PT
We’re happy to report that the remediation taken by our Engineers has worked as expected, and the metrics streaming into the Chat Monitor dashboard are back to normal as of 19:30 UTC. Please reach out should the issues persist.
POST-MORTEM
Root Cause Analysis
This incident was caused by an internal error in a data processing job in our Chat Real-time Monitoring Service. A secondary factor that prolonged the impact was that our on-call teams did not have broad visibility into our logging system.
Resolution
To fix this issue, our engineers performed a soft restart of the job service, after which the data processing job resumed without errors. Chat Monitor metrics became current within minutes.
Remediation Items
- Update runbooks for Chat Monitor maintenance [Completed]
- Update logs access for data processing jobs [Completed]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
1 Comment
Post-mortem published on December 16, 2020. We greatly apologise for the lengthy delay in providing the post-mortem details above; improvements are in the pipeline to minimise such delays in future.