Summary
From February 14, 2023 20:03 UTC until February 15, 2023 05:41 UTC, a small subset of Support customers using the Webhooks feature may have experienced unexpected timeouts and retries for some webhook requests. This may have resulted in request failures and duplicate ticket events during the impact period.
Timeline (UTC)
- 2023-02-14 19:00 - Configuration change rolls out to canary environment
- 2023-02-14 19:30 - Configuration change commences rolling out to production environment
- 2023-02-14 23:00 - Configuration change rollout completed
- 2023-02-14 23:43 - First customer report of the issue
- 2023-02-15 04:22 - Issue escalated to engineering team
- 2023-02-15 04:33 - Change identified and rolled back
- 2023-02-15 05:05 - Observed recovery across multiple Pods
- 2023-02-15 05:13 - Issue fully resolved
- 2023-02-15 06:16 - Retroactive public communication posted
- “Between February 14, 2023 at 20:03 UTC and February 15, 2023 at 04:41 UTC, some Support customers with active webhooks may have encountered unexpected timeouts and retries for webhook requests. This may have resulted in errors and duplicate ticket events for those customers during the impact period. This issue is now resolved. We sincerely apologize for any inconvenience this has caused.”
Root Cause Analysis
This incident was caused by a configuration change to internal service routing that set the request timeout to a value too short for a subset of webhook traffic. Requests that exceeded the new timeout failed and were retried, which produced the errors and duplicate ticket events described above. Identification and resolution took longer than they should have because we lacked monitoring across our environments (including staging), so the downstream impact went unnoticed by our internal teams. Visibility was further reduced because few customers reported the issue, which delayed identification of the wider problem.
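The failure mode in the root cause above can be illustrated with a minimal sketch. It is not Zendesk's implementation: the `Receiver` class, the `deliver` function, the retry count, and all timing values are hypothetical, chosen only to show how a timeout shorter than an endpoint's processing time turns one event into several deliveries.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical webhook endpoint that needs `processing_seconds` to respond."""
    processing_seconds: float
    received: list = field(default_factory=list)

    def handle(self, event_id: str) -> float:
        # The endpoint records the event even if the sender stops waiting.
        self.received.append(event_id)
        return self.processing_seconds


def deliver(receiver: Receiver, event_id: str,
            timeout_seconds: float, max_attempts: int = 3) -> int:
    """Send one event and return the number of attempts made (assumed retry policy)."""
    for attempt in range(1, max_attempts + 1):
        elapsed = receiver.handle(event_id)
        if elapsed <= timeout_seconds:
            return attempt  # response arrived within the timeout
        # Otherwise the sender sees a timeout and retries the same event.
    return max_attempts


# Illustrative values only: an endpoint that needs 8 s, behind a 5 s timeout.
slow = Receiver(processing_seconds=8.0)
attempts = deliver(slow, "ticket-123", timeout_seconds=5.0)
duplicates = len(slow.received) - 1

# The same endpoint behind a longer (assumed) timeout of 10 s.
ok = Receiver(processing_seconds=8.0)
attempts_ok = deliver(ok, "ticket-123", timeout_seconds=10.0)
```

In this sketch the slow endpoint processes the same event on every attempt, so each timed-out request adds a duplicate on the receiving side. Receivers that deduplicate on an event identifier would be less affected by retries of this kind.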
Resolution
To fix this issue, we completed a full rollback of the configuration change.
Remediation Items
- Additional monitoring and alerts across all pre-production and production environments to ensure we detect future issues before the impact is felt by our customers [Completed]
- Restored the internal service timeout to its default value [Completed]
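As a rough illustration of the kind of monitoring the first remediation item describes, the sketch below tracks the share of timed-out webhook requests over a sliding window and flags when it crosses a threshold. The `TimeoutRateMonitor` class, the window size, and the threshold are assumptions made for this example and do not describe the alerting that was actually deployed.

```python
from collections import deque


class TimeoutRateMonitor:
    """Hypothetical sliding-window check on the share of timed-out requests."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # Keep only the most recent `window_size` outcomes.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, timed_out: bool) -> None:
        self.outcomes.append(timed_out)

    def rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        return self.rate() > self.threshold


# Illustrative values only: 3 timeouts in the last 10 requests, 20% threshold.
monitor = TimeoutRateMonitor(window_size=10, threshold=0.2)
for timed_out in [False] * 7 + [True] * 3:
    monitor.record(timed_out)
alert = monitor.should_alert()
```

Running the same check in canary and staging environments, not only in production, is what would let a too-short timeout surface before a change reaches customers.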
For More Information
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.