SUMMARY
From December 1, 2024 at 04:00 UTC to December 3, 2024 at 20:00 UTC, Sell customers in multiple pods experienced issues with features including data visibility in Smart Lists, lead conversion with deal creation, and outbound calls, the last of which failed intermittently. Once functionality was restored, a backlog of requests had to be processed, which took until December 18, 2024 at 16:22 UTC to complete.
TIMELINE
December 18, 2024 04:22 PM UTC | December 18, 2024 08:22 AM PT
Thank you for your patience while we reprocessed Sell data that was missed or affected during the window of impact. At this time all data should be correct. Please reach out if you continue to see any issues.
December 13, 2024 11:26 PM UTC | December 13, 2024 03:26 PM PT
Our engineering team has made significant progress to backfill and reprocess Sell data that was missed or affected during the window of impact; however, a small subset of requests requiring more manual involvement to backfill still remains. We are spending additional time and effort to ensure that all data reaches the appropriate location, and will continue our work next week to confirm full recovery. Thank you for your continued patience in the meantime.
December 09, 2024 10:16 PM UTC | December 09, 2024 02:16 PM PT
Our team continues to work to backfill the Sell data affected during the window of impact; however, given the volume and our level of care and diligence in ensuring the correct data is included accurately, this will take some additional time to complete. We will be sure to provide further updates as the backfill progresses.
December 06, 2024 02:06 PM UTC | December 06, 2024 06:06 AM PT
We would like to provide an update regarding the incident impacting our Sell customers on December 3, 2024. Our team continues to work through the data backlog that accumulated during the incident. We will continue to provide updates as soon as possible.
December 04, 2024 10:27 AM UTC | December 04, 2024 02:27 AM PT
Our team is actively exploring the most effective approach to the backlog of actions resulting from yesterday's incident affecting Sell. We will share additional updates as soon as they become available.
December 03, 2024 11:44 PM UTC | December 03, 2024 03:44 PM PT
Our engineering team has stabilized Sell functionality, and new requests are being processed as expected at this time. We are working through our options to process requests that may have timed out during the window of impact and will provide further information when this investigation continues tomorrow.
December 03, 2024 09:47 PM UTC | December 03, 2024 01:47 PM PT
Our team continues to work to reduce the backlog and restore expected Sell functionality. We are working to increase capacity to speed up recovery, but some latency and delays are still expected. We will provide further updates when we have new information to share.
December 03, 2024 05:09 PM UTC | December 03, 2024 09:09 AM PT
We are beginning to see some improvement in the issues affecting Sell; however, there is a significant backlog we are working to address, and some latency may still be experienced. We will continue to monitor the situation to ensure full recovery.
December 03, 2024 03:35 PM UTC | December 03, 2024 07:35 AM PT
Our team continues to work on the issues currently impacting Sell. These can manifest as issues with data visibility in Smart Lists, lead conversion with deal creation, and intermittent outbound call failures. We will provide any further updates as they are available.
December 03, 2024 02:01 PM UTC | December 03, 2024 06:01 AM PT
We want to keep you informed regarding the ongoing issue affecting certain features, including data visibility in Smart Lists, lead conversion with deal creation, and intermittent outbound call failures. While we don’t have new developments to share at this time, please know that our team is working diligently to resolve the matter as quickly as possible.
December 03, 2024 12:14 PM UTC | December 03, 2024 04:14 AM PT
Our team is actively addressing the service degradation affecting specific features. Currently, data visibility in Smart Lists, lead conversion with deal creation, and outbound calls are impacted, with the latter experiencing intermittent failures. Most core services remain operational, and some issues can often be resolved by reloading or retrying.
December 03, 2024 11:23 AM UTC | December 03, 2024 03:23 AM PT
Our team is actively addressing the service degradation impacting specific features, including data visibility in Smart Lists and lead conversion with deal creation. Most core services remain operational, and issues with some functionalities can often be resolved by reloading or retrying.
December 03, 2024 10:53 AM UTC | December 03, 2024 02:53 AM PT
We are currently investigating an issue where stale data may be appearing in our systems. Additionally, attempts to update data during this time may result in errors. Our team is working diligently to resolve these issues as a priority.
POST-MORTEM
Root Cause Analysis
This incident was caused by a sudden increase in request volume that led to high memory usage across Sell infrastructure. This resulted in alerts due to excessive load, and caused multiple queues to fill up to their maximum capacity. The system responsible for managing these request flows was restarting frequently and could not keep up with the demand, leading to a growing backlog and preventing new requests from processing.
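The failure mode described above, where requests arrive faster than they can be processed and fixed-capacity queues fill up, can be illustrated with a small simulation. This is a minimal sketch only: the function name, rates, and capacity are hypothetical values chosen for illustration and do not reflect actual Sell infrastructure or traffic.

```python
def simulate_bounded_queue(arrivals_per_tick, processed_per_tick, capacity, ticks):
    """Simulate a fixed-capacity queue (all parameters are hypothetical).

    Returns a tuple (depth_history, rejected), where depth_history holds the
    queue depth at the end of each tick and rejected counts the requests that
    could not be enqueued because the queue was full.
    """
    depth = 0
    rejected = 0
    depth_history = []
    for _ in range(ticks):
        # Accept only as many new requests as there is free space for.
        accepted = min(arrivals_per_tick, capacity - depth)
        rejected += arrivals_per_tick - accepted
        depth += accepted
        # Consumers drain up to their per-tick throughput.
        depth -= min(processed_per_tick, depth)
        depth_history.append(depth)
    return depth_history, rejected


# Arrivals exceed throughput: the queue reaches capacity and starts rejecting.
overloaded_history, overloaded_rejected = simulate_bounded_queue(5, 3, 10, 6)

# Throughput exceeds arrivals: the queue stays empty and nothing is rejected.
healthy_history, healthy_rejected = simulate_bounded_queue(3, 5, 10, 4)
```

In the overloaded case the backlog grows every tick until the queue is full, after which new requests are turned away even though the consumers are working at full speed. In this sketch, recovery requires either more processing capacity or fewer arrivals.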
Resolution
To address the issue, we first attempted to scale up additional infrastructure, but this also quickly filled up to capacity. We then set up a new cluster with additional resources to effectively manage live traffic. This allowed us to stabilize operations and restore normal functionality while we worked on clearing the backlog of requests in the old infrastructure.
Remediation Items
- Remove Outdated Notification Queues: We decided to eliminate outdated notification queues that were not needed for customer communication. This reduces the number of requests processed by the relevant infrastructure.
- Enhance Message Processing Tools: Improvements were made to existing tools to increase efficiency in handling messages, again providing more capacity to process requests.
- Establish Additional Alerts: New monitoring alerts were created to keep track of system performance and give early warning of high memory usage.
- Set Connection Limits: We implemented limits on the number of connections to specific applications to prevent overload and ensure smoother traffic management.
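As an illustration of the connection-limit item above, one common way to cap concurrent connections is a bounded semaphore that refuses new connections once every slot is in use. This is a minimal sketch under stated assumptions, not the mechanism used in Sell: the `ConnectionLimiter` class and the limit of 2 are hypothetical.

```python
import threading


class ConnectionLimiter:
    """Caps the number of simultaneously open connections (hypothetical helper)."""

    def __init__(self, max_connections):
        # BoundedSemaphore raises ValueError if released more times than acquired,
        # which guards against accounting bugs.
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        """Return True if a connection slot was reserved, False if the limit is reached."""
        return self._slots.acquire(blocking=False)

    def release(self):
        """Free a previously reserved connection slot."""
        self._slots.release()


limiter = ConnectionLimiter(max_connections=2)  # limit chosen for illustration only
first = limiter.try_acquire()
second = limiter.try_acquire()
third = limiter.try_acquire()      # limit reached, so this caller is turned away
limiter.release()                  # one connection closes
fourth = limiter.try_acquire()     # a slot is available again
```

Rejecting callers immediately, rather than letting them wait, keeps an overloaded application from accumulating idle connections. The caller can then retry later or surface an error.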
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Bob Novak
Post-mortem published January 7, 2025