SUMMARY
Between 02:50 and 06:47 UTC on August 4, 2021, some WhatsApp customers included in the critical WhatsApp Business API Clients maintenance (August 3, 2021 at 23:00 UTC to August 4, 2021 at 11:00 UTC, 04:00 PDT) experienced inbound and outbound WhatsApp messaging delays longer than the 5 minutes initially indicated in the maintenance description.
06:50 UTC | 23:50 PT
We are happy to report that the issues causing delays with WhatsApp messages in relation to the scheduled maintenance of the WhatsApp Business API have been resolved. We apologize for the inconvenience. Please let us know if you continue to experience any issues.
05:32 UTC | 22:32 PT
We are seeing recovery on the WhatsApp platform as our engineers continue to scale backend processes. We expect recovery to be gradual, reaching full recovery in approximately 2 hours. More updates to follow.
04:03 UTC | 21:03 PT
Some customers on the Zendesk WhatsApp platform may experience delays with both inbound and outbound messages beyond the 5 minutes initially published in our scheduled maintenance article. We will provide updates as more information becomes available.
POST-MORTEM
Root Cause Analysis
During maintenance of our infrastructure, we had to restart many coreapp and web containers in sequence. Towards the end of the migration, deployments on two specific database clusters started to throw "too many connections" errors. Pods stuck in the CrashLoopBackOff state left these deployments unable to initialize and deliver WhatsApp messages. The large number of WhatsApp container initializations opened too many database connections, and third-party software behavior possibly created a lock on SHOW SCHEMAS queries.
Resolution
To restore service promptly, we scaled down all affected deployments and slowly rolled them back out to avoid putting heavy pressure on the database.
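As an illustration only, the gradual rollout described above can be thought of as scaling up in batches sized to a database connection budget. The sketch below is not the tooling used during the incident; the function name plan_rollout, its parameters, and the numbers in the usage note are hypothetical.

```python
from typing import List


def plan_rollout(target_replicas: int, conns_per_pod: int, conn_budget: int) -> List[int]:
    """Return the cumulative replica count to scale to at each step.

    Hypothetical helper: each step starts at most as many new pods as
    fit within conn_budget, the number of database connections we are
    willing to have opened at the same time by initializing pods.
    """
    if conns_per_pod <= 0 or conn_budget < conns_per_pod:
        raise ValueError("the connection budget must fit at least one pod")

    batch = conn_budget // conns_per_pod  # pods allowed to start per step
    steps: List[int] = []
    current = 0
    while current < target_replicas:
        current = min(current + batch, target_replicas)
        steps.append(current)
    return steps
```

For example, with an assumed 5 connections per pod and a budget of 20 simultaneous new connections, plan_rollout(10, 5, 20) yields the steps 4, 8, 10. An operator would scale the deployment to each value in turn and wait for the new pods to become ready before moving to the next step, so that initializing pods never exceed the budget at once.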
Remediation Items
- Improve WhatsApp platform monitoring
- Review WhatsApp platform deployment process
- Add ProxySQL
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.
Post-Mortem published August 10, 2021