Service Incident - December 16, 2024 - Tymeshift and Workforce Management Access Issues

SUMMARY
On December 16, 2024, from 01:16 UTC to 04:44 UTC, some Tymeshift and Workforce Management customers experienced errors and access issues.

TIMELINE

December 16, 2024 05:25 AM UTC | December 15, 2024 09:25 PM PT
We are happy to report that the Tymeshift and Workforce Management access issue is now resolved. Thanks for your patience while we worked through this issue today.

December 16, 2024 04:51 AM UTC | December 15, 2024 08:51 PM PT
We have identified the potential cause of the issue impacting Tymeshift and Workforce Management and deployed a fix. We are currently monitoring our systems for recovery. If you have a ticket with our support team, please reply to it to report any improvements you are seeing.

December 16, 2024 03:44 AM UTC | December 15, 2024 07:44 PM PT
We continue to investigate the access errors affecting Tymeshift and Workforce Management across multiple pods. We will provide the next update when we have new information to share. Thanks for your patience while we work through this issue.

December 16, 2024 03:01 AM UTC | December 15, 2024 07:01 PM PT
We have received reports of errors and access issues in Tymeshift and Workforce Management. Our team is looking into this issue at the highest priority. More information to come soon.

POST-MORTEM

Root Cause Analysis

The root cause of the incident was an internal service that failed to properly close or deallocate its prepared statements. In specific cases, which are still under investigation, prepared statements accumulated until the database reached its limit and stopped responding.
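The failure mode described above can be illustrated with a minimal sketch. The report does not name the database, driver, or service code involved, so everything here is hypothetical: `FakeConnection` is an in-memory stand-in for a session that caps prepared statements, and `prepared` is an illustrative helper that ties deallocation to a scope. It is not the affected service's code.

```python
from contextlib import contextmanager


class StatementLimitError(RuntimeError):
    """Raised when the stand-in server refuses to prepare another statement."""


class FakeConnection:
    """Hypothetical in-memory stand-in for a session with a statement cap."""

    def __init__(self, max_statements):
        self.max_statements = max_statements
        self.open_statements = set()
        self._next_id = 0

    def prepare(self, sql):
        # Refuse new statements once the cap is reached.
        if len(self.open_statements) >= self.max_statements:
            raise StatementLimitError("prepared statement limit reached")
        self._next_id += 1
        self.open_statements.add(self._next_id)
        return self._next_id

    def deallocate(self, statement_id):
        self.open_statements.discard(statement_id)


@contextmanager
def prepared(conn, sql):
    """Prepare a statement and always deallocate it, even if the body raises."""
    statement_id = conn.prepare(sql)
    try:
        yield statement_id
    finally:
        conn.deallocate(statement_id)


def leaky_lookup(conn):
    # Leak: the statement is prepared but never deallocated.
    conn.prepare("SELECT 1")


def scoped_lookup(conn):
    # Scoped: the statement is released when the with-block exits.
    with prepared(conn, "SELECT 1"):
        pass
```

With a cap of three statements, `scoped_lookup` can run any number of times, while the fourth call to `leaky_lookup` raises `StatementLimitError`. Tying deallocation to a scope is one common way to avoid this kind of accumulation, though whether it matches the eventual permanent fix is not stated in this report.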

Resolution

To resolve the incident, the team implemented a temporary workaround: daily scheduled redeployments of the affected service, intended to keep the issue from recurring until a permanent fix could be deployed. This allowed the system to regain functionality while a thorough investigation into the root cause was conducted.

Remediation Items

Investigate Prepared Statements: Conduct a detailed investigation to determine why prepared statements were not being closed or deallocated properly and implement a fix.
Implement Monitoring and Alerts: Develop and implement monitors and alerts to detect when the number of prepared statements approaches the limit.
Review Error Monitor Thresholds: Review and adjust the thresholds for error monitoring to ensure timely detection of similar issues in the future.
Prevent Recurrence: Schedule daily redeployments of the service until a permanent fix is implemented.
Increase Resource Allocation: Increase the CPU and memory allocation for the US1 Tymeapp Tymeshift production instance to handle higher loads.
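
As a rough illustration of the monitoring item above, the following sketch classifies prepared-statement usage against a limit. The function name `statement_alert_level` and the 70% and 90% thresholds are assumptions chosen for the example; the report does not specify thresholds, tooling, or how the statement count is collected.

```python
def statement_alert_level(open_statements, limit,
                          warn_ratio=0.7, critical_ratio=0.9):
    """Classify prepared-statement usage against the configured limit.

    The ratios are illustrative defaults, not values from the incident report.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = open_statements / limit
    if usage >= critical_ratio:
        return "critical"
    if usage >= warn_ratio:
        return "warning"
    return "ok"
```

A monitor could call this periodically with the current statement count and page the team on `"critical"`, leaving time to redeploy the service before the database stops responding.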

Preventive Measures

To prevent similar incidents in the future, we will:

Enhance code reviews to ensure proper management of prepared statements.
Implement monitoring that detects resource exhaustion, such as growth in the number of prepared statements, and alerts the team before it leads to a service outage.
Conduct regular audits of database performance and resource utilization.

FOR MORE INFORMATION

For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.