Summary
On July 23, 2024, from 10:58 UTC to 14:57 UTC, customers on Pod 29 were unable to access Zendesk products (including Admin Center) through the Product Tray. About 1% of customer requests returned 503 errors when accessing authenticated features within Guide, Talk, Chat, Explore, and Support, and customers could not open Zendesk’s Product Tray to switch between products. Errors appeared both in the Product Tray and on the main browser page.
Timeline
July 23, 2024 11:48 AM UTC | July 23, 2024 04:48 AM PT
We’re aware of and working to mitigate issues for customers on Pod 29 who are unable to load the Admin Center and are getting a “The page isn’t working” error. Other products also appear to be unavailable from the Product Tray, with a “Can’t load Zendesk products. Try again” error. Next update in 30 minutes, or when we have more to share.
July 23, 2024 12:19 PM UTC | July 23, 2024 05:19 AM PT
We continue to work towards resolving the access issues affecting multiple products for customers on Pod 29. Your patience is greatly appreciated.
July 23, 2024 01:22 PM UTC | July 23, 2024 06:22 AM PT
We are continuing to work on possible solutions for the access issues affecting multiple products for customers on Pod 29. Thank you for your patience with us during this time.
July 23, 2024 01:46 PM UTC | July 23, 2024 06:46 AM PT
We have implemented a potential fix and we’re noticing a decrease in errors, along with some improvements when loading test accounts on Pod 29. We kindly ask you to clear your cache and cookies, and then try to load Zendesk again.
July 23, 2024 02:06 PM UTC | July 23, 2024 07:06 AM PT
Although we’ve been receiving some positive confirmations that things are working, we continue to monitor for possible new spikes in errors. We appreciate your patience while we wait to mark this issue as fully resolved.
July 23, 2024 03:19 PM UTC | July 23, 2024 08:19 AM PT
We have identified the root cause of the issue and have rolled back the change to prevent further problems. After additional monitoring, we have confirmed no further errors and are marking this incident as fully resolved.
POST-MORTEM
Root Cause Analysis
This incident was caused by the rollout of the new manage team members permission, a release that allows agents in custom roles to be granted a standalone permission to view and manage other team members and their role assignments (see the announcement). The rollout led to a large increase in requests to the underlying internal permissions service, saturating its database cluster: the cluster reached its maximum network bandwidth, causing a networking failure between the cluster and the service’s app servers.
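To illustrate the mechanism, the sketch below (not Zendesk’s actual code; the client, permission names, and request handler are hypothetical) shows how one extra permission check on every authenticated request scales traffic to a shared permissions service, and with it the load on that service’s database cluster, in proportion to request volume.

```python
# Illustrative sketch only (not Zendesk's actual code). All names are
# hypothetical. It shows how an extra permission check on every
# authenticated request grows traffic to a shared permissions service
# (and, behind it, that service's database cluster) with request volume.

class PermissionsClient:
    """Stand-in for a client of an internal permissions service."""

    def __init__(self) -> None:
        self.calls = 0  # count of simulated network round-trips

    def is_allowed(self, user_id: str, permission: str) -> bool:
        self.calls += 1  # each call would reach the service and its database
        return True      # always allow in this sketch


def handle_request(user_id: str, client: PermissionsClient) -> str:
    # Checks already performed on each authenticated request (hypothetical names).
    checks = ["access_product", "open_product_tray"]
    # The new standalone permission adds one more lookup per request.
    checks.append("manage_team_members")
    if all(client.is_allowed(user_id, p) for p in checks):
        return "200 OK"
    return "403 Forbidden"


if __name__ == "__main__":
    client = PermissionsClient()
    for i in range(10_000):  # simulate a burst of authenticated requests
        handle_request(f"user-{i}", client)
    print(f"permission lookups issued: {client.calls}")  # 3 per request
```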
Resolution
Our team initially increased the capacity of the permissions service’s database instance to provide short-term recovery. Once the root cause was identified, our engineers rolled back the permissions feature code change.
Remediation Items
- Reduce network traffic from permissions checks [In Progress] (see the illustrative sketch after this list)
- Additional monitors and alerts to detect traffic increases [Scheduled]
- Investigate right-sizing permissions service database capacity [Scheduled]
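As a hedged sketch of the first remediation item (a general technique, not a description of Zendesk’s implementation), the example below caches permission results in-process for a short TTL, so repeated checks for the same user and permission within that window are answered locally rather than each becoming a network call to the permissions service and its database cluster.

```python
# Hypothetical sketch: a short-lived in-process cache in front of a
# permissions client, so repeated checks for the same (user, permission)
# pair within the TTL do not each generate network traffic to the
# permissions service and its database cluster.

import time


class CachingPermissionsClient:
    def __init__(self, backend, ttl_seconds: float = 30.0) -> None:
        self._backend = backend   # real client that makes network calls
        self._ttl = ttl_seconds
        self._cache: dict[tuple[str, str], tuple[float, bool]] = {}

    def is_allowed(self, user_id: str, permission: str) -> bool:
        key = (user_id, permission)
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # cache hit: no call to the backend service
        allowed = self._backend.is_allowed(user_id, permission)
        self._cache[key] = (now, allowed)
        return allowed


if __name__ == "__main__":
    class CountingBackend:
        """Stand-in backend that counts simulated network calls."""
        def __init__(self) -> None:
            self.calls = 0

        def is_allowed(self, user_id: str, permission: str) -> bool:
            self.calls += 1
            return True

    backend = CountingBackend()
    client = CachingPermissionsClient(backend)
    for _ in range(10_000):
        client.is_allowed("user-1", "manage_team_members")
    print(f"backend calls: {backend.calls}")  # 1 instead of 10,000
```

The trade-off of this approach is that a permission change can take up to the TTL to be reflected, so a short TTL or explicit invalidation keeps that window small.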
For more information
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.