SUMMARY
From 2024-02-09 20:32 UTC to 2024-02-09 22:29 UTC, Support customers on Pod 13 experienced an issue that caused some tickets not to show SLA badges.
POST-MORTEM
Root Cause Analysis
During this incident, one of the sixteen Kubernetes pods (kpods) in Pod 13 restarted unexpectedly and malfunctioned. The error message pointed to a problem with the 'connection string authority', which disrupted connectivity to the 'redis' host, a critical dependency of our Metric Event Service (MES). This disruption interfered with the processing of ticket events and, most notably, caused Service Level Agreement (SLA) events to be missing or delayed. We suspect that the kpod was inadvertently restarted by a deploy or configuration change.
When the issue occurred, our immediate goal was to restore the main service, which required a quick reset of the affected kpod. This left no time to capture diagnostic details from the malfunctioning kpod at that point. Later, however, we reproduced the error in a safe testing environment by deliberately introducing a flaw, which helped us better understand the problem.
Resolution
Once the issue was identified, the kpod was redeployed, which resolved the problem. Missing SLA events were then backfilled.
Please note: the backfill/restoration of data that was run to repair broken SLAs on Open tickets had the side effect of completely removing SLA data on Closed tickets. As a result, SLA data for those tickets appears as ‘Null’ in Explore.
Remediation Items
- Explore better ways to organize and pass environment variables to ensure readiness whenever system units restart
- Improve the turnaround time to fix broken Service Level Agreements (SLAs) by updating our "funfiller"
- Review monitoring and alerts
- Reinvestigate the method for passing environment variables to ensure their availability whenever system units restart
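
As an illustration of the environment-variable items above, here is a minimal sketch of a startup check that refuses to run with a missing or malformed Redis connection string. It is not the actual MES implementation: the variable name MES_REDIS_URL, the ConfigError class, and both function names are hypothetical, and the check assumes the connection string is a URL of the form redis://host:port/db.

```python
import os
from urllib.parse import urlsplit


class ConfigError(RuntimeError):
    """Raised when required startup configuration is missing or malformed."""


def validate_redis_url(url):
    """Return the URL unchanged if its authority (host[:port]) looks usable."""
    parts = urlsplit(url)
    if parts.scheme not in ("redis", "rediss"):
        raise ConfigError(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ConfigError("connection string has no host in its authority")
    try:
        parts.port  # accessing .port raises ValueError for an invalid port
    except ValueError as exc:
        raise ConfigError(f"invalid port in authority: {exc}") from exc
    return url


def load_redis_url(env=None, name="MES_REDIS_URL"):
    """Read the (hypothetical) variable and validate it before the service starts."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise ConfigError(f"{name} is not set")
    return validate_redis_url(value)
```

Calling a check like load_redis_url() before a restarted unit reports itself ready would make a bad configuration fail fast at startup rather than surface later as missing or delayed events.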
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us via ZBot Messaging within the Widget.