Service Incident - February 15th, 2022 - Social Messaging | Multiple Pods - Channel Framework Connector issues

SUMMARY

On February 15, 2022, from 14:24 UTC to 14:49 UTC, customers using Social Messaging via the Channel Framework Connector experienced issues with inbound and outbound message delivery.
Messages sent by end users were degraded and likely delayed, but we do not expect any of them to have failed to reach Zendesk. Messages sent by agents were also degraded and, in some cases, were not delivered at all. However, the impacted agents were notified of the failed delivery by an alert in the corner of their screen and an error in the events tab of the ticket.

Timeline

14:59 UTC | 06:59 PT

We are investigating reports of some outbound messages failing for our Sunshine Conversations customers via the Channel Framework Connector. More updates to follow.

15:18 UTC | 07:18 PT

Our customers using Social Messaging between 14:24 UTC and 14:49 UTC may have seen outbound messages failing.

POST-MORTEM

Root Cause Analysis

This incident was caused by a deployment to the zendesk-channel-framework-connector (CFC) that raised CPU usage on the application's Ingress Controller (IC). The CFC is an internal service that powers messaging for the Social Messaging app in the Zendesk Marketplace.

This deployment caused the IC replicas to become unresponsive and enter a crash loop. Because they could not reply to our automated health check system, the IC replicas were deemed unhealthy and restarted.
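
The restart behavior described above can be illustrated with a minimal sketch. It assumes a health check that restarts a replica after a fixed number of consecutive slow or missing responses. The timeout, the failure threshold, and the function name are hypothetical and are not taken from the actual health check configuration.

```python
from typing import List, Optional


def simulate_health_checks(
    response_times_ms: List[Optional[int]],
    timeout_ms: int = 1000,        # hypothetical timeout
    failure_threshold: int = 3,    # hypothetical consecutive-failure limit
) -> List[int]:
    """Return the indices of the checks at which a restart would be triggered.

    A response time of None models a replica that did not reply at all.
    """
    restarts = []
    consecutive_failures = 0
    for index, response_time in enumerate(response_times_ms):
        if response_time is None or response_time > timeout_ms:
            consecutive_failures += 1
        else:
            consecutive_failures = 0
        if consecutive_failures >= failure_threshold:
            restarts.append(index)
            # The counter starts over for the restarted replica.
            consecutive_failures = 0
    return restarts


# A replica that stays overloaded after each restart keeps failing its checks,
# so it is restarted again and again.
overloaded = [200, 1500, None, 1500, 1500, None, 1500]
print(simulate_health_checks(overloaded))
```

In this model, a restart alone does not help while the CPU pressure persists, which is why the replicas cycled instead of recovering.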

Resolution

To fix this issue, the team scaled the IC replicas from 2 to 5, which reduced the pressure on the individual replicas and allowed the application to recover.
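
As a rough illustration of why adding replicas relieved the pressure, the sketch below assumes traffic is split evenly across replicas. The report does not describe how load is balanced, so the even split and the function name are assumptions.

```python
def per_replica_share(replicas: int) -> float:
    """Fraction of total traffic handled by each replica, assuming an even split."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 / replicas


before = per_replica_share(2)   # share per replica before scaling
after = per_replica_share(5)    # share per replica after scaling
reduction = 1 - after / before  # relative drop in load per replica

print(f"before: {before:.0%}, after: {after:.0%}, reduction: {reduction:.0%}")
```

Under that assumption, each replica goes from handling half of the traffic to a fifth of it, a 60% reduction in load per replica.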

Remediation Items

Increase IC resources [Done]
Enable autoscaling on the IC deployment [To Do]
Create additional tests and alerts to prevent prolonged downtime [To Do]
Create and update internal documentation for this specific scenario [To Do]
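
The autoscaling item above could follow a proportional rule of the kind used by common autoscalers: scale the replica count by the ratio of observed CPU utilization to a target, then clamp the result. This is a minimal sketch under that assumption. The report does not say which autoscaler, target, or limits will be used, so every number and name below is hypothetical.

```python
def desired_replicas(
    current_replicas: int,
    cpu_percent: int,          # observed average CPU utilization, in percent
    target_percent: int = 40,  # hypothetical target utilization
    min_replicas: int = 2,     # hypothetical lower bound
    max_replicas: int = 10,    # hypothetical upper bound
) -> int:
    """Replica count from ceil(current * observed / target), clamped to bounds."""
    if current_replicas < 1 or target_percent <= 0 or cpu_percent < 0:
        raise ValueError("invalid input")
    # Integer ceiling division avoids floating-point rounding surprises.
    raw = (current_replicas * cpu_percent + target_percent - 1) // target_percent
    return max(min_replicas, min(max_replicas, raw))


# Two replicas running at 95% CPU against a 40% target.
print(desired_replicas(2, 95))
```

With these illustrative numbers, the rule asks for 5 replicas, the same count the team reached manually. The bounds keep the replica count within a fixed range during both traffic spikes and quiet periods.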

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.