SUMMARY
On August 27, 2024, from 16:30 UTC to 22:30 UTC, Support customers on Pods 19, 20, and 27 experienced delays in webhook and trigger firing, which impacted ticket updates and communication with end users.
Timeline
August 27, 2024 08:03 PM UTC | August 27, 2024 01:03 PM PT
We are investigating reports of delayed trigger and webhook firing in Support. Next update in 30 minutes or when we have new information to share.
August 27, 2024 08:27 PM UTC | August 27, 2024 01:27 PM PT
The webhook and trigger delays are impacting Support customers on pods 19, 20, and 27. Our engineers are currently engaged and investigating. Next update in 30 minutes or when we have new information to share.
August 27, 2024 08:56 PM UTC | August 27, 2024 01:56 PM PT
Our engineers continue to investigate the webhook and trigger delays impacting Support customers on pods 19, 20, and 27. Next update in 1 hour or when we have new information to share.
August 27, 2024 09:24 PM UTC | August 27, 2024 02:24 PM PT
We are seeing improvements to webhook delays on Pod 19 and continuing to work on processing the backlog of webhooks on pods 20 and 27. Next update in 1 hour or when we have new information to share.
August 27, 2024 10:03 PM UTC | August 27, 2024 03:03 PM PT
The backlog of webhooks on pods 19 and 20 has been fully processed, and there should no longer be any delays on those pods. We are still processing the backlog of webhooks on pod 27 and will provide an update once that backlog is clear.
August 27, 2024 10:40 PM UTC | August 27, 2024 03:40 PM PT
The backlog of webhooks on pods 19, 20, and 27 has been fully processed, and there should no longer be any delays on those pods. The issue is now fully resolved.
POST-MORTEM
Root Cause Analysis
The incident was primarily caused by a sudden surge in traffic due to a mass user import by a large customer. This surge resulted in the Webhooks system hitting its throughput limit, leading to significant delays. Additionally, in Pod 27, the autoscaling mechanism failed to adequately handle the increased traffic, further exacerbating the delays.
Resolution
To resolve the issue, the Webhooks dispatcher and the Untrusted Egress Zone (UEZ) were scaled up to absorb the traffic surge, and the customer was asked to slow down their import. Once the scaling adjustments were in place, the backlog began to decrease and normal service was gradually restored across all affected pods.
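The exact scaling mechanism is internal to Zendesk. As a rough illustration only, the sketch below shows one common way a dispatcher's worker pool can be sized from queue backlog; every name and threshold in it (desired_worker_count, TARGET_BACKLOG_PER_WORKER, the worker limits) is an assumption for illustration, not Zendesk's actual implementation.

```python
# Illustrative only: sizing a webhook dispatcher's worker pool from queue
# backlog. Names and thresholds are assumptions, not Zendesk internals.
import math

TARGET_BACKLOG_PER_WORKER = 500   # assumed: deliveries one worker can drain promptly
MIN_WORKERS = 4                   # assumed floor for steady-state traffic
MAX_WORKERS = 64                  # assumed ceiling to protect the egress tier


def desired_worker_count(backlog: int) -> int:
    """Return a worker count proportional to the outstanding backlog,
    clamped between the configured floor and ceiling."""
    if backlog <= 0:
        return MIN_WORKERS
    wanted = math.ceil(backlog / TARGET_BACKLOG_PER_WORKER)
    return max(MIN_WORKERS, min(MAX_WORKERS, wanted))


if __name__ == "__main__":
    # A 20,000-delivery backlog would call for 40 workers under these numbers.
    print(desired_worker_count(20_000))
```

The ceiling matters as much as the floor: scaling the dispatcher without bounding it could simply push the overload onto the downstream egress tier, which is where Pod 27 struggled.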
Remediation Items
- Define horizontal auto-scaling policies for Webhooks services. [IN PROGRESS]
- Investigate enhancing rate-limiting logic to account for a single customer with many subdomains (see the sketch after this list). [SCHEDULED]
- Investigate and fix the secure egress tier auto-scaling issue in Pod 27. [SCHEDULED]
- Streamline the deployment and configuration change process to reduce friction during emergency resolutions. [IN PROGRESS]
- Implement subdomain-specific kill switches for Webhooks. [IN PROGRESS]
- Add monitoring alerts to flag when the Webhooks backlog or delivery latency becomes too large. [SCHEDULED]
- Publicly document Webhooks rate limits to inform customers and preemptively manage traffic. [SCHEDULED]
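For the rate-limiting item above, the following is a minimal sketch of what account-level aggregation could look like. It assumes a hypothetical subdomain-to-account lookup table and an illustrative per-account budget; none of these names, limits, or data structures reflect Zendesk's actual Webhooks implementation.

```python
# Illustrative only: counting webhook deliveries against the owning account
# rather than each subdomain, so one customer with many subdomains shares a
# single budget. All names, limits, and the lookup table are assumptions.
import time
from collections import defaultdict

ACCOUNT_LIMIT_PER_MINUTE = 6_000        # assumed aggregate budget per account
WINDOW_SECONDS = 60

# Assumed subdomain -> owning-account lookup; in practice this would come
# from an account directory.
SUBDOMAIN_TO_ACCOUNT = {
    "acme-support": "acme",
    "acme-sales": "acme",
    "acme-eu": "acme",
}

# account -> (window start time, deliveries counted in that window)
_windows = defaultdict(lambda: (0.0, 0))


def allow_delivery(subdomain: str) -> bool:
    """Admit a delivery only if the owning account is under its
    fixed-window budget; otherwise signal the caller to defer it."""
    now = time.time()
    account = SUBDOMAIN_TO_ACCOUNT.get(subdomain, subdomain)
    window_start, count = _windows[account]
    if now - window_start >= WINDOW_SECONDS:
        window_start, count = now, 0        # start a fresh window
    if count >= ACCOUNT_LIMIT_PER_MINUTE:
        _windows[account] = (window_start, count)
        return False                        # over budget: defer, do not drop
    _windows[account] = (window_start, count + 1)
    return True
```

Counting at the account level means a mass import spread across many subdomains is throttled as a whole, rather than each subdomain independently staying just under its own limit.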
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Postmortem published September 10, 2024.