SUMMARY
From 23:39 UTC on December 11, 2024 to 06:30 UTC on December 12, 2024, customers using Zendesk AI features such as Advanced AI, Talk, AI Agents, and other generative AI features experienced disruptions in functionality due to a service provider outage.
TIMELINE
December 12, 2024 04:05 AM UTC | December 11, 2024 08:05 PM PT
We are observing recovery of all AI features and continue to monitor our systems for full recovery. We look forward to providing a final update when systems are fully stable.
December 12, 2024 01:53 AM UTC | December 11, 2024 05:53 PM PT
Our team has been working with our service provider on an issue impacting Zendesk AI features. The impact may be visible through Advanced AI, Talk, AI Agents, and other generative AI features. Initial attempts did not resolve the problem, and our teams continue to treat this issue as the highest priority. We will share updates as they become available.
POST-MORTEM
Root Cause Analysis
The root cause of the incident was a new configuration for a telemetry service that unexpectedly generated a massive load on a service provider’s API across large clusters. This excessive load overwhelmed and disrupted DNS-based service discovery, leading to failed requests to our provider’s services.
Resolution
The incident was resolved once the service provider identified the issue and implemented corrective measures to alleviate the load on the API. Zendesk maintained communication with our service provider throughout the incident to ensure a coordinated response.
Remediation Items
- Support Level Agreement with LLM service teams: Work with internal customers to understand their performance and availability expectations, which will help in proposing fallback strategies and adjusting monitoring thresholds.
- Fallback Strategies for Generative AI Features: Develop fallback strategies for GenAI features, which will involve adding features to proxy systems and collaborating with feature owners to determine the best strategies for their respective cases.
- Premium Support from our service provider: Negotiate additional support from the service provider to ensure faster resolution and assistance during incidents.
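As an illustration of the fallback item above, the following is a minimal sketch of how a proxy layer might route around a failing provider. It is not Zendesk's implementation: the class name `FallbackProxy`, its parameters, and the threshold and cooldown values are all hypothetical, and the fallback itself (a canned response, a cached answer, or a non-AI code path) would be chosen by each feature owner.

```python
import time
from typing import Callable, Optional


class FallbackProxy:
    """Hypothetical proxy: call the provider, and switch to a fallback
    for a cooldown period after repeated consecutive failures."""

    def __init__(
        self,
        primary: Callable[[str], str],
        fallback: Callable[[str], str],
        failure_threshold: int = 3,       # assumed value, tune per feature
        cooldown_seconds: float = 30.0,   # assumed value, tune per feature
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.open_until: Optional[float] = None

    def call(self, prompt: str) -> str:
        now = self.clock()
        # While the cooldown is active, skip the provider entirely.
        if self.open_until is not None and now < self.open_until:
            return self.fallback(prompt)
        try:
            result = self.primary(prompt)
        except Exception:
            # Count the failure and start a cooldown once the threshold is hit.
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.open_until = now + self.cooldown_seconds
            return self.fallback(prompt)
        # A successful call resets the failure state.
        self.consecutive_failures = 0
        self.open_until = None
        return result
```

Skipping the provider during the cooldown keeps a degraded feature responsive and avoids adding load to a provider that is already struggling.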
Preventive Measures
To prevent similar incidents in the future, the following actions will be taken:
- Enhance monitoring and alerting systems to better detect abnormal loads on the API.
- Establish clearer communication channels and support agreements with our service provider to ensure rapid response during incidents.
- Implement fallback strategies for critical AI features to maintain service availability even during provider outages.
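The monitoring measure above could take many forms. One simple possibility, sketched here under assumptions, is a sliding-window counter that flags when the number of API requests in a recent window exceeds a threshold. The class name `LoadMonitor`, the window length, and the threshold are hypothetical and are not taken from the incident report.

```python
from collections import deque
from typing import Deque


class LoadMonitor:
    """Hypothetical sliding-window counter for detecting abnormal API load."""

    def __init__(self, window_seconds: float = 60.0, max_requests: int = 1000) -> None:
        self.window_seconds = window_seconds  # assumed window length
        self.max_requests = max_requests      # assumed alert threshold
        self.timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one request at `timestamp` (seconds, non-decreasing).

        Returns True when the request count inside the window
        exceeds the threshold, which is the point to raise an alert.
        """
        self.timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop requests that have aged out of the window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_requests
```

A fixed threshold is the simplest choice; a production system would more likely compare the current rate against a learned baseline.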
FOR MORE INFORMATION
For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.
Bob Novak
Post-mortem published December 20, 2024