SUMMARY
On February 13, 2025, from 15:50 UTC to 19:48 UTC, Guide customers on Pod 13 experienced issues publishing article updates.
TIMELINE
February 13, 2025 07:56 PM UTC | February 13, 2025 11:56 AM PT
We are happy to report that the issue preventing new Guide article updates from being published on Pod 13 has been resolved, and Guide article updates are now correctly reflected in the Help Center. Thank you for your patience during our investigation.
February 13, 2025 07:46 PM UTC | February 13, 2025 11:46 AM PT
We have confirmed an issue preventing new Guide article updates from being published on Pod 13. Our team is investigating, and we will provide new information as soon as it's available.
February 13, 2025 07:31 PM UTC | February 13, 2025 11:31 AM PT
We are receiving reports of issues publishing changes to Guide articles on Pod 13. We will provide further updates shortly.
POST-MORTEM
Root Cause Analysis
This incident was caused by a consumer that stopped processing updates without crashing. Because the process never exited, it was not automatically restarted. As a result, article updates were not propagated to the Help Center, and users could not see the latest article content.
Resolution
To fix this issue, the team redeployed the Guide Article Service to Pod 13. The redeployment restarted the consumer, which resumed processing and propagated the pending updates to the Help Center.
Remediation Items
- Exit Consumer Process on Non-Retriable Errors: Make the consumer process exit when it encounters a non-retriable error, so that it is restarted automatically and downtime is minimized.
- Enhance Monitoring: Review and improve existing monitoring systems to ensure that similar consumer issues are detected and addressed proactively.
- Alerting Improvements: Refine alerting criteria to ensure timely notifications of consumer failures, enabling quicker response times.
- Documentation of Consumer Behavior: Create comprehensive documentation outlining expected consumer behavior during various failure scenarios to aid in troubleshooting.
- Regular Health Checks: Establish regular health checks for consumers to detect issues early and prevent impacts on service delivery.
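The first and last remediation items can be illustrated with a minimal Python sketch. It is not the Guide Article Service's actual code. The names RetriableError, NonRetriableError, Heartbeat, and run_consumer are hypothetical, and the sketch assumes that whatever supervises the process restarts it after a non-zero exit. The loop returns a non-zero status on a non-retriable error (or once retries are exhausted) instead of idling, and it records a heartbeat after each processed message so that a health check can detect a consumer that is alive but no longer making progress.

```python
import time


class RetriableError(Exception):
    """Hypothetical transient failure (for example a timeout); safe to retry."""


class NonRetriableError(Exception):
    """Hypothetical permanent failure; retrying cannot succeed."""


class Heartbeat:
    """Records the last time the consumer finished processing a message."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.last_beat = clock()

    def beat(self):
        self.last_beat = self._clock()

    def is_stale(self, max_age_seconds):
        # True when no message has completed within max_age_seconds.
        return self._clock() - self.last_beat > max_age_seconds


def run_consumer(fetch, handle, heartbeat, max_retries=3, backoff_seconds=0.0):
    """Process messages until fetch() returns None.

    Returns 0 on a clean shutdown and 1 when the consumer must give up,
    so the caller can pass the result straight to sys.exit().
    """
    while True:
        message = fetch()
        if message is None:
            return 0
        for attempt in range(1, max_retries + 1):
            try:
                handle(message)
                break
            except NonRetriableError:
                # Do not keep the process alive in a stuck state: exit
                # so the supervisor can restart it.
                return 1
            except RetriableError:
                if attempt == max_retries:
                    # Design choice in this sketch: exhausted retries
                    # are treated like a non-retriable error.
                    return 1
                time.sleep(backoff_seconds * attempt)
        heartbeat.beat()
```

In a real entry point, the return value would typically be passed to sys.exit(), and a health-check endpoint or monitor could call heartbeat.is_stale() with a threshold suited to the expected message rate. A consumer that legitimately sits idle would also need to beat on each poll to avoid false alarms.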
FOR MORE INFORMATION
For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.
Bob Novak
Postmortem published Feb 25, 2025