Summary
From November 2, 2023, 11:54 UTC to November 6, 2023, 19:00 UTC, a subset of Guide customers using host-mapped subdomains, across all Pods, may have been unable to load their Help Centers.
Root Cause Analysis
The root cause of the incident was an outage at our primary CDN vendor, which affected the internal service responsible for updating the domain routing configuration. The vendor outage occurred over two days, November 2 and 3, 2023.
A contributing factor was that our domain routing configuration service did not check the status of custom hostnames before marking them as properly configured. This led to errors when the custom hostnames were not validated in a timely manner due to our primary CDN vendor’s outage.
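The missing check described above can be illustrated with a minimal sketch. It assumes a hypothetical `fetch_status` callable that asks the CDN vendor for a hostname's validation state; `HostnameStatus`, `RoutingRecord`, and `mark_if_validated` are illustrative names and do not reflect the actual service or the vendor's API.

```python
from dataclasses import dataclass
from enum import Enum


class HostnameStatus(Enum):
    # Hypothetical validation states; a real vendor may report others.
    PENDING = "pending"
    ACTIVE = "active"
    FAILED = "failed"


@dataclass
class RoutingRecord:
    hostname: str
    configured: bool = False


def mark_if_validated(record, fetch_status):
    """Mark a record as configured only if the hostname is reported active.

    `fetch_status` is an assumed callable: hostname -> HostnameStatus.
    """
    try:
        status = fetch_status(record.hostname)
    except Exception:
        # Vendor API unavailable: leave the record unchanged so that a
        # later retry can pick it up, rather than assuming success.
        return False
    record.configured = status is HostnameStatus.ACTIVE
    return record.configured
```

With a gate of this kind, a hostname that is still pending validation, or whose status cannot be fetched at all, is never treated as ready to receive traffic through the CDN.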
At the beginning of this incident, the CDN vendor's API started malfunctioning, which prevented our domain routing configuration from updating TLS certificates.
Existing host-mapped subdomains were unaffected unless their TLS certificates had expired, and new subdomains were still functional as they were not routed through the impacted CDN.
On November 3, the CDN vendor remediated the API issues, but there were errors validating custom hostnames (the unique part of a domain name). This resulted in new domains facing errors as their custom hostnames were not validated promptly.
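Because existing subdomains were only at risk once their TLS certificates expired, one way to estimate exposure during a renewal outage is to list certificates whose expiry falls within the outage window. The sketch below is illustrative only: `at_risk_hostnames`, its parameters, and the two-day default window are assumptions, not part of the actual tooling.

```python
from datetime import datetime, timedelta, timezone


def at_risk_hostnames(cert_expiry, now, renewal_outage=timedelta(days=2)):
    """Return hostnames whose certificate expires before renewals resume.

    cert_expiry: dict mapping hostname -> timezone-aware expiry datetime.
    renewal_outage: assumed time during which renewals cannot be issued.
    """
    horizon = now + renewal_outage
    return sorted(
        host for host, not_after in cert_expiry.items() if not_after <= horizon
    )


# Example with made-up hostnames and dates.
now = datetime(2023, 11, 2, 12, 0, tzinfo=timezone.utc)
certs = {
    "help.example.com": datetime(2023, 11, 3, 0, 0, tzinfo=timezone.utc),
    "support.example.org": datetime(2024, 1, 15, 0, 0, tzinfo=timezone.utc),
}
exposed = at_risk_hostnames(certs, now)
```

In this example only `help.example.com` falls inside the window, so it is the only hostname that would need a bypass or manual renewal.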
Resolution
Resolving the issue involved several steps:
1. The domain routing configuration was temporarily disabled to prevent further issues.
2. The domains potentially impacted by the issue were preemptively added to a CDN-bypass list to prevent them from being affected.
3. A fix was deployed on the Zendesk proxies to validate custom hostnames.
4. The domain routing configuration was re-enabled and began processing the backlog of requests.
The errors were fully resolved by November 6, 2023, with our domain routing configuration resuming normal operation.
Remediations
- Implement additional checks in the domain routing configuration service to ensure host-mapped subdomains are correctly configured.
- Review and update internal documentation to include detailed steps to handle similar errors.
- Improve communication with customers during such incidents to keep them informed about the situation and expected resolution time.