Service Incident - June 27, 2025 - Pod 19 | SLA Targets are not working

Summary

From June 27, 2025, at 20:17 UTC to June 28, 2025, at 17:20 UTC, customers on Pod 19 experienced issues with their ticket SLAs. When an SLA policy was assigned to a ticket, or when an SLA target was met, the system did not update the target badge. As a result, the ticket kept measuring time without correctly reflecting whether the SLA target had been met.

Timeline

June 27, 2025 10:52 PM UTC | June 27, 2025 03:52 PM PT

We are receiving reports of SLAs not applying correctly for customers on Pod 19. We will provide further updates shortly.

June 27, 2025 11:03 PM UTC | June 27, 2025 04:03 PM PT

We have confirmed an issue on Pod 19 causing SLA targets and badges to not update when an SLA policy is applied, or a target is met following a ticket update. We are investigating and will provide additional information in the next 30 minutes.

June 27, 2025 11:22 PM UTC | June 27, 2025 04:22 PM PT

Our team continues to investigate an issue for Pod 19 customers causing SLA targets and badges to not update properly when ticket updates are submitted. We will provide further information when we have a substantive update to share.

June 28, 2025 01:28 AM UTC | June 27, 2025 06:28 PM PT

Our engineers continue to investigate SLA issues on Pod 19. We will keep you informed of any progress.

June 28, 2025 03:25 AM UTC | June 27, 2025 08:25 PM PT

We have identified and fixed the issue causing SLA targets and badges to not update on Pod 19. All updates have now been processed, and SLAs should appear properly at this time. Thank you for your patience.

Root Cause Analysis

This incident was caused by a corrupted message sent into the system that handles ticket events, which stopped event processing. A user object was incorrectly encoded in the event’s description field, which led to a failure in processing SLAs for ticket events.
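
The failure mode described above can be illustrated with a minimal sketch in Python. It assumes events are JSON-encoded and that the description field is expected to hold plain text. The event shape, the field names, and the `decode_ticket_event` helper are hypothetical and are not taken from the actual ticket event system.

```python
import json


class MalformedEvent(ValueError):
    """Raised when a ticket event does not match the expected shape."""


def decode_ticket_event(raw: bytes) -> dict:
    """Decode a ticket event and check that its description is plain text.

    Hypothetical event shape: {"ticket_id": int, "description": str}.
    """
    try:
        event = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise MalformedEvent(f"event is not valid JSON: {exc}") from exc

    if not isinstance(event, dict):
        raise MalformedEvent("event must be a JSON object")

    description = event.get("description")
    if not isinstance(description, str):
        # A user object (or anything else) in place of text is rejected here,
        # before it reaches SLA processing.
        raise MalformedEvent(
            f"description must be a string, got {type(description).__name__}"
        )
    return event
```

Raising a dedicated exception type lets the caller tell a malformed event apart from other failures and decide how to handle it, rather than letting an unexpected value surface deep inside SLA processing.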

Resolution

To fix this issue, we adjusted the partition offset to skip the corrupted messages. We then restarted the consumers to resume normal event processing.
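
The resolution steps can be sketched with a small in-memory model. This is an illustration only: the `Partition` class, `consume`, and `skip_to` are hypothetical stand-ins, and the sketch assumes a consumer that reads messages in order and commits its offset only after a message is handled successfully. It does not use the API of any real messaging system.

```python
import json
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical in-memory stand-in for one partition of an event log."""
    messages: list
    committed_offset: int = 0  # next offset the consumer will read


def consume(partition: Partition, handled: list) -> None:
    """Process messages in order, committing only after each success."""
    while partition.committed_offset < len(partition.messages):
        raw = partition.messages[partition.committed_offset]
        handled.append(json.loads(raw))  # raises on a corrupted message
        partition.committed_offset += 1


def skip_to(partition: Partition, new_offset: int) -> None:
    """Manually move the committed offset past the problematic messages."""
    if new_offset < partition.committed_offset:
        raise ValueError("refusing to move the offset backwards")
    partition.committed_offset = min(new_offset, len(partition.messages))


partition = Partition(
    messages=['{"ticket_id": 1}', '{"ticket_id": ', '{"ticket_id": 3}']
)
handled = []
try:
    consume(partition, handled)
except json.JSONDecodeError:
    pass  # the consumer is stuck at offset 1 and would fail again on restart

skip_to(partition, 2)        # adjust the offset to skip the corrupted message
consume(partition, handled)  # "restart" the consumer; processing resumes
```

In this sketch the skipped message is never processed, so anything it carried is lost unless it is kept elsewhere for later inspection.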

Remediation Items

Create a plan for the messaging system to correctly handle errors that occur when reading data.
Improve the resilience of existing implementation tools against corrupted messages.
Create additional monitoring alerts to improve detection of unhealthy service states.
Establish proper connection limits on specific applications to avoid cascading failures.
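
As a rough illustration of the first three items, the sketch below shows one common pattern: parking unreadable messages in a dead-letter list instead of stalling, plus a simple lag check that could drive an alert. It is not the planned implementation. The function names, the dead-letter record layout, and the threshold parameter are all assumptions made for this example.

```python
import json
from typing import Callable


def consume_with_dead_letter(
    messages: list,
    start_offset: int,
    handle: Callable[[dict], None],
    dead_letters: list,
) -> int:
    """Process messages from start_offset; park bad ones instead of stalling.

    Returns the next offset to read. All names here are hypothetical.
    """
    offset = start_offset
    while offset < len(messages):
        raw = messages[offset]
        try:
            handle(json.loads(raw))
        except (ValueError, TypeError, KeyError) as exc:
            # Record enough context to inspect or replay the message later.
            dead_letters.append(
                {"offset": offset, "raw": raw, "error": repr(exc)}
            )
        offset += 1  # advance either way so one bad message cannot block the rest
    return offset


def lag_alert(latest_offset: int, committed_offset: int, threshold: int) -> bool:
    """Return True when the consumer is further behind than the threshold."""
    return (latest_offset - committed_offset) > threshold
```

Catching errors raised by the handler as well as by decoding keeps the consumer moving, but it can also hide transient failures that would succeed on retry. A real design would likely distinguish the two cases.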

FOR MORE INFORMATION

For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.