On July 18, 2023, from 14:17 to 16:57 UTC, customers using Zendesk Explore, Support, and Admin Center across multiple Pods experienced outages and degraded performance at different times within this window.
Between 14:17 UTC and 16:20 UTC, the Explore product was unavailable for customers on all Pods, and a blank screen would have been shown. After 16:20 UTC the product was available, but some customers saw stale data (from the start of the incident). Data processing was fully caught up and displayed correctly for all customers by 19:43 UTC.
Within the Support product, Batch and Bulk jobs for Tickets and Users were delayed in Pod 13 between 14:17 UTC and 16:48 UTC. Additionally, some API endpoints for these types of jobs may have returned 429 errors on Pod 13 during this time.
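For API clients, transient 429 responses like these are best absorbed with retries. The sketch below is not part of Zendesk's tooling; it is a minimal, generic example of honoring a `Retry-After` header when present and falling back to exponential backoff with jitter otherwise. The `send` callable is a hypothetical stand-in for whatever HTTP call your client makes.

```python
import random
import time

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying a rate-limited request.

    Honors the server-supplied Retry-After value when present;
    otherwise uses exponential backoff with full jitter.
    """
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(send, max_attempts=5):
    """Invoke send() -> (status, headers, body), retrying on HTTP 429.

    Raises RuntimeError if every attempt is rate limited.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        time.sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError("gave up after repeated 429 responses")
```

A client wired this way rides out a rate-limiting window like the one in this incident instead of surfacing errors to end users, at the cost of added latency while the service recovers.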
Support also saw delays in email processing during this time, in Pods 28 and 29 only. In the affected Pods, ticket creation from inbound email was delayed between 14:53 UTC and 17:46 UTC.
Within the Support product, SLA badges on tickets were also impacted on Pods 13, 23, 28, and 29 between 14:17 and 16:27 UTC. Users would have seen stuck badges, where time continued to count even after they performed actions that should have fulfilled, paused, or activated them. During the incident, SLA badges were not displayed as expected on some tickets; those badges resurfaced after the incident was resolved. Automations using SLAs might not have fired correctly for the duration of the incident, and Views that listed tickets based on SLAs may have been inaccurate. For tickets created during the incident, a backfill job has since updated and corrected the inaccurate SLAs.
Admin Center was impacted on Pods 13, 23, 28, and 29. Between 14:24 and 15:00 UTC, customers would have been unable to create or update accounts and products. For customers on Pod 28 only, the impact extended to 16:24 UTC.
15:43 UTC | 08:43 PT
We have confirmed an issue across multiple products causing Explore syncing delays and access issues, latency in Support, delays in SLAs applying after ticket updates, and bulk job processing delays. Our team is engaged and we will provide further updates soon.
15:49 UTC | 08:49 PT
In addition to the above, we have confirmed the issue is causing disconnections from calls in Talk as well as impact to ticket merging functionality. We will provide additional information as we learn more.
16:09 UTC | 09:09 PT
Our team continues to investigate an issue causing delays across Support and ticketing, Explore and dashboards, SLA application, bulk job processing, webhooks firing, and Talk disconnections. We will provide further updates as we gather new information.
16:27 UTC | 09:27 PT
We are still working to resolve the issue causing delays across Support, Explore, SLAs, and bulk jobs. In the meantime, we have found that Talk and webhooks are not impacted by this issue. We will continue to provide updates as the investigation progresses.
16:59 UTC | 09:59 PT
We are beginning to see recovery across several products affected by the latency and delays seen today. Some lingering effects remain and our team is actively monitoring the situation until full recovery.
17:50 UTC | 10:50 PT
We are happy to report that the issue causing latency on Support, Explore, SLAs, and bulk jobs has been resolved. Explore data may be outdated until your next scheduled sync, and SLAs may appear inaccurate as events catch up. Please let us know if you experience continued issues.
Root Cause Analysis
This incident was caused by a change to an internal configuration service that degraded the application's performance when it was placed under heavy load from software deployments.
To fix the issue, our team rolled the service back to a known-good version. Multiple restoration attempts were hampered by the root issue itself, which extended the impact. The final attempt succeeded after network traffic to the affected service was limited, allowing it to restart gracefully.
Please note: the backfill/restoration of data run to repair broken SLAs on Open tickets had a side effect of completely removing SLA data on Closed tickets, which resulted in ‘Null’ SLA data in Explore.
Remediation Items
- Improve testing program with additional load tests [Completed]
- Investigate Explore resiliency and dependency on internal service [Scheduled]
- Investigate additional capacity mechanisms [Scheduled]
- Explore additional non-preventative measures to reduce impact of similar issues [In Progress]
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.