Service Incident - October 20th, 2025 | Multiple pods - AWS US East issues impacting several Zendesk services (Explore, Omnichannel, ...)

SUMMARY

On October 20, 2025, between 06:49 UTC and 23:41 UTC, we received 1,308 reports from customers experiencing issues across multiple Zendesk products. These disruptions were caused by failing application integrations during a significant AWS US East outage.

Timeline

October 20, 2025 07:59 AM UTC | October 20, 2025 12:59 AM PDT

We are aware of an issue affecting multiple Zendesk services. Our engineering team is working to resolve it. We will provide an update in 30 minutes. Thank you for your patience.

October 20, 2025 08:32 AM UTC | October 20, 2025 01:32 AM PDT

We sincerely apologize for the ongoing disruption. Our engineers are actively troubleshooting this incident, and we will provide an update as soon as we have significant information to share. Thank you for your understanding and patience.

October 20, 2025 09:49 AM UTC | October 20, 2025 02:49 AM PDT

Our engineers have identified an issue originating from our upstream provider that is impacting multiple Zendesk products, including Chat, Voice, Analytics, SunCo, Sunshine Platforms, Contact Center, and Support. We are seeing improvements, but customers may experience a period of performance degradation. We appreciate your patience and will provide updates as they become available.

October 20, 2025 11:08 AM UTC | October 20, 2025 04:08 AM PDT

We have observed a partial recovery in our Zendesk products following the issue caused by our upstream provider. Our engineering team continues to work diligently to restore full service across all affected areas. We apologize for any inconvenience this may cause and appreciate your patience. Updates will be provided as they become available.

October 20, 2025 02:28 PM UTC | October 20, 2025 07:28 AM PDT

We have observed significant recovery across most Zendesk products; however, AMER and APAC Explore customers may continue to experience stale data in both live and historical Analytics reports. Additionally, there are ongoing issues with call sessions and data access linked to an upstream provider problem. Our engineering team is working closely with the provider to accelerate remediation and is proactively taking steps to fully restore all services ahead of peak usage periods. We apologize for any disruption this may cause and sincerely appreciate your continued patience. Further updates will be provided as they become available.

October 20, 2025 03:20 PM UTC | October 20, 2025 08:20 AM PDT

We are actively addressing an outage with our cloud provider affecting multiple Zendesk products and pods, primarily in pods 19 and 23. Additional impacts include Explore in AMER and APAC, Talk across all pods, AI Agents, Sunshine Conversations, and some degradation in Omnichannel Routing and Chat. We apologize for any earlier missed notifications and will provide updates within the hour or as soon as new information arises.

October 20, 2025 04:30 PM UTC | October 20, 2025 09:30 AM PDT

We continue to work with our cloud provider on the issues impacting multiple Zendesk products. We apologize that we do not have a substantive or positive update regarding full recovery, but we want to keep you informed of the latest status. Thank you for your patience and understanding while we work through this severe service interruption. We will send updates as they become available.

October 20, 2025 10:05 PM UTC | October 20, 2025 03:05 PM PDT

Our partner cloud provider has indicated that they are seeing significant improvement, and our monitoring and logging are showing nearly full recovery in Zendesk products. While we are approaching resolution from a stability perspective, a sizable backlog of activity from the window of impact is still being processed. Explore data and Talk call recordings will gradually backfill over the next few hours, and we will follow up when we have confirmed full resolution. Thank you for your continued patience during our investigation.

October 20, 2025 11:35 PM UTC | October 20, 2025 04:35 PM PDT

All Zendesk services have been restored and are stable. Explore data will continue to update over the next several hours as we process the backlog created during the incident. No customer action is required. Explore reports remain available as normal, though data freshness may lag until the backlog is cleared. Thank you for your patience as we worked through this issue.

Root Cause Analysis

This incident was caused by a significant outage in AWS US East (us-east-1), which led to failures in resolving network addresses and shortages in system capacity, disrupting Zendesk’s core infrastructure services. Additionally, resource imbalances in certain pods arose due to limitations within AWS availability zones.

Resolution

To resolve the issue, the engineering team coordinated efforts with AWS support and implemented various fixes including resource scaling, manual clearances, and restarting key data processes. Throughout the response, customers were kept informed, and full recovery of all core services has been confirmed.

Remediation Items

Add timeouts to database calls to prevent delays and ensure failed calls don’t hang the system.
Develop fallback methods for fetching app versions and assets to handle database outages gracefully.
Investigate job failures caused by missing data and improve validation to avoid such errors; ensure related metrics are monitored and alerts are active.
Improve the ability to easily scale processing pipelines up or down to catch up on delayed work.
Implement features to allow the system to degrade gracefully rather than showing errors or blank pages during incidents.
Add extra capacity buffers to clusters and schedule maintenance outside peak traffic times.
Explore temporarily reducing resources used by non-critical services to prioritize essential applications.
Create a checklist for handling capacity failures to prevent unexpected pod shutdowns or scale-downs.
Set minimum size limits for managed node groups to maintain sufficient resources.
Investigate backup and failover options to improve service reliability.
Complete relocating accounts to reduce exposure to regional failures.
Look into reducing unnecessary API calls to minimize user impact during platform failures.
Limit event ingestion to only those visible in the interface to reduce database load during incidents.
Review impact scope to understand why customers outside affected regions experienced issues.
Confirm dependencies on third-party services and their failover capabilities.
Update on-call guides with relevant backup and alert procedures.
Ensure on-call guides are accessible during all incidents.
Improve deployment monitoring tools and freeze policies to prevent faulty releases.
Engage with cloud providers to improve alert accuracy and reduce noise in monitoring.
Increase memory allocation for critical proxies to improve stability.
Separate no-data alerts from job processing systems to prevent false alarms.
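
As an illustration of the first two remediation items (timeouts on database calls and a fallback for fetching app versions), the following is a minimal sketch only, not Zendesk's actual implementation. The function fetch_app_version, the query_fn parameter, and the in-memory cache are all hypothetical names introduced for this example. It bounds how long a caller waits on a database query and, when the query fails or times out, serves the last known-good value flagged as stale instead of hanging or raising.

```python
import concurrent.futures

# Hypothetical in-memory cache of the last known-good version per app ID.
_last_known_versions = {}

# Shared worker pool so a slow query does not block the calling thread.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def fetch_app_version(app_id, query_fn, timeout_s=0.5):
    """Return (version, is_fresh) for app_id.

    query_fn is a hypothetical callable standing in for the real database
    query. If it raises or takes longer than timeout_s, the cached value
    is returned with is_fresh=False. With no cached value, the error
    propagates to the caller.
    """
    future = _pool.submit(query_fn, app_id)
    try:
        version = future.result(timeout=timeout_s)
    except Exception:
        # Timeout or query error: degrade to stale data if we have any.
        future.cancel()  # Only has an effect if the query has not started.
        if app_id in _last_known_versions:
            return _last_known_versions[app_id], False
        raise
    # Successful query: refresh the cache for future fallbacks.
    _last_known_versions[app_id] = version
    return version, True
```

A caller can use the is_fresh flag to decide whether to show a "data may be out of date" notice rather than an error page. Note that this pattern only limits how long the caller waits; it does not cancel a query that is already running, so a driver-level or server-side statement timeout would still be needed to release the underlying connection.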

FOR MORE INFORMATION

For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.