SUMMARY
On January 13, 2025, from 11:07 UTC to 12:07 UTC, Messaging Triggers did not execute for customers on Pod 17.
TIMELINE
January 13, 2025 12:24 PM UTC | January 13, 2025 04:24 AM PT
The recent Messaging issue has been fully resolved, and our services are back to full operability! Thank you for your patience during this time. Our team will continue to monitor our systems closely to ensure everything runs smoothly. We appreciate your support and are here for any questions or feedback you may have!
January 13, 2025 11:51 AM UTC | January 13, 2025 03:51 AM PT
We are investigating issues with Messaging Triggers not executing for our customers on Pod 17.
POST-MORTEM
Root Cause Analysis
This incident was caused by consumers for the Messaging ticket log events service terminating prematurely while the service itself was still running. As a result, incoming events were not processed, and the evaluation and execution of Messaging Triggers on Pod 17 halted completely.
Resolution
To resolve this issue, we identified and corrected a configuration typo that set the maximum number of records processed in a single batch to 500 instead of the intended 250. Because a smaller batch takes less time to process, reducing the max records value to 250 is intended to make consumer terminations caused by timeouts less likely.
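To illustrate why the batch size matters, here is a rough sketch. It assumes a consumer is terminated when a single batch takes longer than a processing timeout. The per-record time (100 ms) and the timeout (30 s) are hypothetical values chosen for illustration; the report does not state the real figures, and `batch_fits_timeout` is an illustrative helper, not part of the service.

```python
def batch_fits_timeout(max_records: int, per_record_ms: int, timeout_ms: int) -> bool:
    """Return True if a full batch can be processed before the timeout expires."""
    worst_case_ms = max_records * per_record_ms
    return worst_case_ms <= timeout_ms


# Hypothetical numbers, not taken from the incident report.
PER_RECORD_MS = 100
TIMEOUT_MS = 30_000

for max_records in (500, 250):
    worst_case_ms = max_records * PER_RECORD_MS
    fits = batch_fits_timeout(max_records, PER_RECORD_MS, TIMEOUT_MS)
    print(f"max_records={max_records}: worst case {worst_case_ms} ms, fits={fits}")
```

Under these assumed numbers, a full batch of 500 records would need 50,000 ms and exceed the timeout, while a batch of 250 would need 25,000 ms and finish in time.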
Remediation Items
- Implement a health check to detect premature terminations of consumers.
- Create a monitor to track the number of running consumers.
- Establish a monitor for stopped partitions on the Messaging ticket log events consumer.
- Add a consumer lag status widget to the Messaging Trigger Service dashboard.
- Create a new metric to measure the time taken to process a batch of messages from the Messaging ticket log events topic.
These remediations are designed to enhance monitoring and prevent similar incidents in the future, ensuring the stability and reliability of the Messaging Trigger Service.
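As a minimal sketch of how two of the items above might look in code, the example below combines a running-consumer health check with a batch processing time metric. The class `ConsumerMonitor`, its fields, and the expected consumer count of 4 are all hypothetical; the report does not describe the actual implementation or tooling.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class ConsumerMonitor:
    """Hypothetical monitor; names and thresholds are illustrative only."""

    expected_consumers: int
    batch_durations: List[float] = field(default_factory=list)

    def is_healthy(self, running_consumers: int) -> bool:
        # Unhealthy when fewer consumers are running than expected,
        # e.g. because some terminated while the service stayed up.
        return running_consumers >= self.expected_consumers

    def timed_batch(self, process: Callable[[Sequence], None], batch: Sequence) -> float:
        # Measure how long one batch takes, using a monotonic clock.
        start = time.monotonic()
        process(batch)
        elapsed = time.monotonic() - start
        self.batch_durations.append(elapsed)
        return elapsed


# Usage with placeholder values (assumptions, not real service data).
monitor = ConsumerMonitor(expected_consumers=4)
monitor.timed_batch(lambda batch: None, ["event-1", "event-2"])
print(monitor.is_healthy(running_consumers=3))  # False: one consumer is missing
```

In practice the recorded durations would be exported to whatever metrics system backs the dashboard, and the health check would be wired to an alert; both steps are left out here because the report does not name those systems.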
FOR MORE INFORMATION
For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.
Bob Novak
Post-mortem published January 29, 2025