SUMMARY
On February 15, 2022, from 14:24 UTC to 14:49 UTC, customers using Social Messaging via the Channel Framework Connector experienced issues with inbound and outbound message delivery.
Delivery of messages sent by end users was degraded and likely delayed, but we do not expect any of them to have failed to reach Zendesk. Delivery of messages sent by agents was also degraded, and in some cases those messages were not delivered at all. Impacted agents were notified of the failed delivery by an alert in the corner of their screen and by an error in the events tab of the ticket.
Timeline
14:59 UTC | 06:59 PT
We are investigating reports of some outbound messages failing for our Sunshine Conversations customers via the Channel Framework Connector. More updates to follow.
15:18 UTC | 07:18 PT
Our customers using Social Messaging between 14:24 UTC and 14:49 UTC may have seen outbound messages failing.
POST-MORTEM
Root Cause Analysis
This incident was triggered by a deployment to the zendesk-channel-framework-connector (CFC), which caused elevated CPU usage on the application Ingress Controller (IC). The CFC is an internal service that powers messaging for the Social Messaging app on the Zendesk marketplace.
The elevated CPU usage left the IC replicas unresponsive. Because they could not respond to our automated health checks, the replicas were deemed unhealthy and restarted, which put them into a crash loop.
Resolution
To fix this issue, the team scaled the IC replicas from 2 to 5, which reduced the pressure on the individual replicas and allowed the application to recover.
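To illustrate why adding replicas relieved the pressure, here is a minimal Python sketch of the arithmetic. It assumes traffic is balanced evenly across IC replicas, which the report does not state, and the function name per_replica_load is hypothetical. The total load is normalized to 1.0 and is not a measured value from the incident.

```python
def per_replica_load(total_load: float, replicas: int) -> float:
    """Return the share of load each replica carries, assuming even balancing."""
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    return total_load / replicas


# Normalized total load (assumption: 1.0 stands for all traffic reaching the IC).
before = per_replica_load(1.0, 2)  # two replicas, as before the fix
after = per_replica_load(1.0, 5)   # five replicas, as after the fix

# Fraction by which each replica's share of the load drops.
reduction = 1 - after / before
print(f"per-replica share: {before:.2f} -> {after:.2f} ({reduction:.0%} lower)")
```

Under that even-balancing assumption, going from 2 to 5 replicas cuts each replica's share of the traffic from one half to one fifth.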
Remediation Items
- Increase IC resources [Done]
- Enable autoscaling on the IC deployment [To Do]
- Create additional tests and alerts to prevent prolonged downtime [To Do]
- Create and update internal documentation for this specific scenario [To Do]
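As a rough illustration of the autoscaling item above, the following Python sketch shows how a CPU-based autoscaler could pick a replica count. It assumes the IC runs on Kubernetes and would use a Horizontal Pod Autoscaler, which this report does not confirm. The formula is the one documented for the Horizontal Pod Autoscaler, with its tolerance band and stabilization window left out. The function name, utilization figures, and replica bounds are all hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replica count a CPU-based autoscaler would request.

    Uses ceil(current_replicas * current / target), then clamps the
    result to the configured [min_replicas, max_replicas] range.
    """
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical numbers, not measurements from the incident:
# 2 replicas averaging 90% CPU against a 60% target.
print(desired_replicas(2, 90, 60, min_replicas=2, max_replicas=5))

# A severe spike is capped at max_replicas.
print(desired_replicas(2, 400, 60, min_replicas=2, max_replicas=5))
```

With bounds like these, the scale-up that the team performed manually during the incident could in principle happen automatically once CPU usage crosses the target.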
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.