Service Incident - December 3rd, 2021 - Pod 19 Performance issues

SUMMARY

On December 3rd, 2021, from 01:25 UTC to 05:24 UTC, Support customers on Pod 19 experienced system slowness, inaccessible pages such as Views, or the error ‘Your request experienced a server error’.

TIMELINE

06:50 UTC | 22:50 PT

We are pleased to inform you that the issues causing slowness and inaccessibility for customers using Zendesk Support on Pod 19 have been resolved. We apologise for any inconvenience this may have caused your operations. Thank you for your understanding and continued partnership.

06:00 UTC | 22:00 PT

We have observed improvement in the performance of Zendesk Support on Pod 19 in the last 60 minutes. We continue to monitor and will provide a final update at full resolution. If you are experiencing any remaining issues in Support on Pod 19, please reach out to us.

04:38 UTC | 20:38 PT

We are still investigating the issues causing system slowness and inaccessibility for customers using Zendesk Support on Pod 19. We will provide another update when more information is available, and thank you for your patience in the meantime.

03:36 UTC | 19:36 PT

We are aware of reports of system slowness and inaccessibility for customers using Zendesk Support on Pod 19. Our teams are still investigating the root cause. We appreciate your patience in the meantime, and thank you for your understanding. We will update you in 60 minutes or as soon as more information is available.

03:05 UTC | 19:05 PT

We are aware of issues causing system slowness and possible inaccessibility for some customers on Pod 19. We are currently investigating and will share more information in 30 minutes.

POST-MORTEM

Root Cause Analysis

This incident was caused by a large data export initiated by one of our customers, who fetched multiple terabytes of their attachments via API calls that ran through some of our dedicated network servers.
Under the increased load, these servers began throttling their own performance to protect themselves, which in turn slowed request responses on Pod 19. This resulted in performance and/or availability issues for our customers on the Pod.

Resolution

To fix this issue, our engineers increased the available CPU nodes for processing. Once the tasks were balanced across all available nodes, performance returned to normal.

Remediation Items

During the incident, our engineers permanently increased the available CPU nodes to prevent a recurrence of this type of issue.
Created additional alerts and specific monitoring for similar behaviour.
Implement an automated rate limiter for high usage endpoints for the impacted servers.
Investigate automatic performance scaling for the impacted servers.
Review download attachment architecture.
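
For illustration only, the automated rate limiter listed among the remediation items could follow a token-bucket approach, in which each client may burst up to a fixed number of requests and then earns further requests at a steady rate. The sketch below is not our production implementation; the class name TokenBucket, its parameters and the example limits are hypothetical and chosen purely to show the idea.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical per-client limiter: bursts up to `capacity` requests,
    then refills at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate            # tokens added per second (assumed value)
        self.capacity = capacity    # maximum burst size (assumed value)
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if a request costing `cost` tokens may proceed."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Example limits only: 2 requests per second with a burst of 3.
    bucket = TokenBucket(rate=2.0, capacity=3.0)
    for i in range(5):
        print(i, "allowed" if bucket.allow() else "rejected")
```

In a design like this, a request for which allow() returns False would typically be rejected with an HTTP 429 (Too Many Requests) response, so that a single heavy export slows down rather than degrading the servers it shares with other customers.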

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.