Service Incident - June 10th, 2021 - Pod 18, Chat and Sell issues

Summary

Between June 10, 20:20 UTC and June 11, 03:08 UTC, we experienced a significant incident impacting Chat, Sell and multiple services on Pod 18.

The incident had multiple phases of impact. During the first 5 minutes, all Pod 18 services were widely impacted, and Sell customers may have experienced errors for the first hour.

Chat remained stable until 22:20 UTC, when a critical subsystem became impaired, resulting in a one-and-a-half-hour outage of Chat Standalone and Agent Workspace globally.

During the incident, we decided to disable access to archived Tickets and View counts for all customers in Pod 18, to avoid risking the stability of the rest of the system until the underlying systems they depend on were fully restored.

Timeline

04:16 UTC | 21:16 PT
We’re happy to report that all Chat and Pod 18 services have been restored. We expect full recovery of Agent Workspace chat transcripts by 06:00 UTC today. Thanks for your patience & understanding through today's issues.

03:10 UTC | 20:10 PT
We are seeing progress in bringing ticket archiving back online for Support customers on Pod 18. Next update within the next hour. Thank you for your extended patience.

02:05 UTC | 19:05 PT
We are working to bring ticket archiving back online for Support customers on Pod 18. We are also recovering Agent Workspace chat transcripts on Pod 18 (no ETA available but we will keep you updated). All other services are stable. Next update in 1 hour.

01:36 UTC | 18:36 PT
Agent Workspace is stable and chat backfills are now complete. We are working to restore service on View counts and archived tickets for Pod 18. Thanks again for your patience as we continue to work through this.

01:10 UTC | 18:10 PT
Agent Workspace has recovered in Support on Pod 18, while Support View counts are steadily recovering. Ticket archiving and exports are currently delayed. We are working to backfill chat tickets that were missed during this incident. Thank you for your patience.

00:41 UTC | 17:41 PT
We're happy to report that the Chat issues across all Pods have now been resolved. Pod 18 services remain stable; however, Support View counts, Agent Workspace, and archived ticket access remain degraded. Thank you again for your continued patience.

00:13 UTC | 17:13 PT
We are observing recovery of Chat, with agents and visitors resuming chat sessions. Pod 18 services are stable, while Support View counts and archived ticket access remain degraded on that Pod. Stay tuned for further updates soon.

00:00 UTC | 17:00 PT
The main Chat problem has been resolved while we work on remaining services across all Pods. We are also seeing improvements on all products on Pod 18 and are continuing to monitor. We greatly appreciate your ongoing patience and will continue to provide updates as we have more information.

22:18 UTC | 15:18 PT
We are seeing improvements with Chat and Agent Workspace accounts outside of Pod 18. We continue remediation efforts and will continue to provide updates as we have more information.

21:39 UTC | 14:39 PT
We continue to investigate an outage affecting multiple Pods and products. We are seeing partial recovery for some services and will continue to send updates as we have more information.

21:05 UTC | 14:05 PT
We continue to investigate an outage now affecting multiple Pods and products. We are seeing partial recovery for some services and will continue to send updates as we have more information.

20:45 UTC | 13:45 PT
We are investigating an outage on Pod 18. We will provide updates as we have more information.

Root Cause Analysis

The incident was caused by a climate control system failure in one of AWS’s data centers in Frankfurt. As the ambient temperature at the location rose, server and network hardware powered off, which impacted the Zendesk services hosted on the affected servers. The incident was prolonged because fire prevention systems activated: the fire department had to clear the room, and oxygen had to be restored, before personnel could respond.

Resolution

Affected Zendesk services were progressively migrated away from the impaired data center, leading to the eventual recovery of each service. On the AWS side, once cooling was restored to the data center and the servers and network equipment were powered back on, affected instances recovered quickly.

Remediation Items

Follow up with AWS for their root cause analysis (RCA) and the actions they will take to prevent recurrence [DONE]
After AWS gives the all clear, rebalance workload back into the affected data center [DONE]
Perform Incident Retrospective and document all the failure modes and remediation actions for data center failure [DONE]
Improve data center resilience and speed up recovery by completing all the remediation actions identified for the affected services [IN PROGRESS]
- Multiple retrospectives have been held by Zendesk teams covering all affected services. We have documented and scheduled work across all services to improve monitoring, alerting, failover and other features so that our services are more resilient to these types of issues in the future.

For More Information

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.