SUMMARY
On March 6, 2023, from 16:19 UTC to 16:34 UTC, Support customers experienced 5xx errors, and Talk and Chat customers experienced disconnections. Some customers were unable to open tickets from views.
Timeline
17:50 UTC | 09:50 PT
Between 16:19 and 16:34 UTC on March 6, 2023, Pod 13 customers experienced server errors and degraded functionality across all products. Our engineering team has already released a fix for the issue and the errors are no longer present.
We apologize for any interruption this may have caused.
POST-MORTEM
Root Cause Analysis
The Network team deployed a change to the service mesh intended to help achieve full availability zone affinity in production. The change caused an immediate surge in HTTP 403 responses from an internal service, and the team decided to roll back quickly to prevent an incident. The rollback was performed out of sequence, which caused all traffic to the internal proxy NLB to be dropped. Several products were degraded for about 15 minutes, until the rollback was fully completed.
Resolution
Once the issue was identified, completing the rollback resolved it.
Remediation Items
- Update deploy standards for these types of changes
- Create a playbook for the proxy NLB bypass rollback procedure
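
As an illustration only, the idea behind a sequenced rollback procedure can be sketched in code. This report does not describe the actual deployment tooling or the correct rollback order, so everything below is an assumption: the `Step` and `OrderedRollout` names are hypothetical, and reverting steps in the reverse of the order in which they were applied is one common convention, not necessarily the sequence the playbook will specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One reversible change in a rollout (hypothetical structure)."""
    name: str
    apply: Callable[[], None]
    revert: Callable[[], None]


class OrderedRollout:
    """Applies steps in order and reverts them strictly in reverse order."""

    def __init__(self, steps: List[Step]) -> None:
        self._steps = list(steps)
        self._applied: List[Step] = []

    def deploy(self) -> None:
        for step in self._steps:
            step.apply()
            # Record only steps that were applied successfully.
            self._applied.append(step)

    def rollback(self) -> List[str]:
        """Revert applied steps, last applied first; return names in revert order."""
        reverted: List[str] = []
        while self._applied:
            step = self._applied[-1]
            step.revert()
            # Drop the step only after its revert succeeded, so a failed
            # revert can be retried without skipping ahead.
            self._applied.pop()
            reverted.append(step.name)
        return reverted
```

Because the rollout object tracks which steps were applied, the person performing the rollback does not choose the order by hand, which is the kind of mistake a playbook is meant to prevent.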
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.