SUMMARY
From 20:27 to 21:06 UTC, some Zendesk users experienced increased latency, errors, or timeouts. The issue was caused by an incident our CDN provider experienced.
Timeline
21:30 UTC | 13:30 PT
From 20:27 to 21:06 UTC, some Zendesk users experienced increased latency, errors, or timeouts. The issue was caused by an incident our CDN provider experienced. The issue is now resolved and services have resumed normal operation.
POST-MORTEM
Root Cause Analysis
The impact occurred during a software version rollback of the CDN service. The rollback caused an influx of internal calls to the Addressing API, which flooded the internal API endpoint and resulted in an API timeout. The timeout triggered a bug that caused a cron job to set an empty key value in our CDN provider’s key-value configuration distribution system.
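The failure mode described above, a scheduled job overwriting good configuration with an empty value after an upstream timeout, can be illustrated with a minimal sketch. This is not the CDN provider's code. The names `refresh_key`, `AddressingTimeout`, and the dictionary standing in for the key-value system are all hypothetical, and the sketch assumes the job can simply keep the last known good value when the upstream call fails.

```python
from typing import Callable, Dict


class AddressingTimeout(Exception):
    """Hypothetical error raised when the upstream API call times out."""


def refresh_key(store: Dict[str, str], key: str, fetch: Callable[[], str]) -> bool:
    """Update store[key] from fetch(), keeping the previous value on failure.

    Returns True if the store was updated, False if the old value was kept.
    """
    try:
        value = fetch()
    except AddressingTimeout:
        # Timeout: leave the store untouched so the last known good value stays live.
        return False
    if not value:
        # An empty response is treated as a failure, never as valid configuration.
        return False
    store[key] = value
    return True
```

With this shape, a timeout during a flood of requests leaves the existing value in place instead of replacing it with an empty one.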
Resolution
Once the issue was identified, the correct key value was set and service was restored.
Remediation Items
- Improve the edge DDoS mitigation system to ensure empty key values cannot be applied.
- Increase the capacity of the database so it can handle the flood of requests from the internal API.
- Add internal alerting to the DDoS system.
- Make additional alerting and logging efficiency improvements.
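The first remediation item, rejecting empty key values before they can be applied, might look something like the following on the write side. This is only a sketch under stated assumptions: `GuardedConfigStore` and `EmptyValueError` are hypothetical names, and an in-memory dictionary stands in for the real configuration distribution system, whose interface is not described in this report.

```python
class EmptyValueError(ValueError):
    """Hypothetical error raised when a caller tries to write an empty value."""


class GuardedConfigStore:
    """In-memory stand-in for a key-value configuration store (assumption)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        # Reject None, empty, and whitespace-only values at the write boundary,
        # so no caller (cron job or otherwise) can apply an empty configuration.
        if value is None or not value.strip():
            raise EmptyValueError(f"refusing to set empty value for key {key!r}")
        self._data[key] = value

    def get(self, key: str) -> str:
        return self._data[key]
```

Validating at the store boundary complements any checks in the job itself, because it holds even if a caller has a bug.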
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Post-Mortem published December 30, 2021