SUMMARY
From December 3, 2024 at 21:09 UTC to December 7, 2024 at 03:36 UTC, some customers using the mobile SDK experienced 400 errors when creating tickets. A change assigned newly created OAuth tokens a default expiration time of 8 hours. This change inadvertently broke the legacy mobile SDKs, which cannot retrieve new tokens once their existing tokens become invalid, so ticket creation requests from affected apps failed. The issue was resolved by reverting the change.
Timeline
December 6, 2024 6:20 PM UTC | December 6, 2024 10:20 AM PT
We are happy to report that the issue causing some customers to experience 400 errors when creating tickets via the SDK has been resolved. We apologize for any disruption this may have caused, and thank you for your patience during our investigation.
December 6, 2024 12:06 PM UTC | December 6, 2024 04:06 AM PT
Our team continues to work on the behaviour causing 400 errors on ticket submissions via the API through our Mobile SDK. For now, end users who encounter this error can restart the app, after which tickets will be created as normal.
December 6, 2024 09:45 AM UTC | December 6, 2024 01:45 AM PT
We are aware that some of our customers may experience 400 errors while attempting to create tickets through our Mobile SDK. If you face this error, please restart the app to fix the issue.
POST-MORTEM
Root Cause Analysis
This incident arose from an oversight in assessing how authentication tokens were utilized across different products before rolling out a change in their expiration time. The legacy SDKs by design cannot obtain new OAuth tokens when existing tokens expire, but this aspect was not fully taken into account during the planning and integration stages. Enhanced collaboration and a more thorough evaluation of token usage could have helped avoid this disruption.
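The failure mode can be illustrated with a minimal Python sketch. Every name in it (`TokenStore`, `LegacyClient`, `RefreshingClient`, `EIGHT_HOURS`) is hypothetical and is not taken from the actual SDKs or authentication service. It only contrasts a client that caches its first token indefinitely with one that requests a new token after a rejection.

```python
from dataclasses import dataclass
from typing import Optional

EIGHT_HOURS = 8 * 60 * 60  # default expiration from the summary, in seconds


@dataclass
class Token:
    value: str
    expires_at: Optional[float]  # None means the token never expires


class TokenStore:
    """Hypothetical stand-in for the authentication service."""

    def __init__(self, ttl: Optional[float]) -> None:
        self.ttl = ttl
        self._issued = 0

    def issue(self, now: float) -> Token:
        self._issued += 1
        expires_at = None if self.ttl is None else now + self.ttl
        return Token(f"token-{self._issued}", expires_at)

    def create_ticket(self, token: Token, now: float) -> int:
        # Returns an HTTP-like status code; expired tokens are rejected.
        if token.expires_at is not None and now >= token.expires_at:
            return 400
        return 201


class LegacyClient:
    """Caches its first token and has no way to obtain another one."""

    def __init__(self, store: TokenStore, now: float) -> None:
        self.store = store
        self.token = store.issue(now)

    def create_ticket(self, now: float) -> int:
        return self.store.create_ticket(self.token, now)


class RefreshingClient(LegacyClient):
    """Requests one new token and retries when the cached one is rejected."""

    def create_ticket(self, now: float) -> int:
        status = self.store.create_ticket(self.token, now)
        if status == 400:
            self.token = self.store.issue(now)
            status = self.store.create_ticket(self.token, now)
        return status
```

With a store configured with `ttl=EIGHT_HOURS`, the legacy client in this sketch starts returning 400 once its only token passes the 8-hour mark, while the refreshing client recovers. With `ttl=None`, which models the reverted behaviour, the legacy client keeps working.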
Resolution
To resolve the issue, the Authentication team first disabled the backfill process that added expiration times to existing tokens. Subsequently, they deployed a pull request that reverted the expiration settings for new tokens and initiated a backfill to remove expiration from existing tokens. This action restored functionality for the majority of affected customers.
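The reverse backfill described above could look roughly like the following sketch. The row shape, the `remove_expiration` name, and the batch size are assumptions for illustration only; this report does not describe the actual schema or tooling.

```python
from typing import List, Optional, TypedDict


class TokenRow(TypedDict):
    id: int
    expires_at: Optional[float]  # None means no expiration


def remove_expiration(rows: List[TokenRow], batch_size: int = 1000) -> int:
    """Clear expires_at in batches; returns how many rows were changed."""
    changed = 0
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            # Rows that already have no expiration are left untouched,
            # so re-running the backfill is harmless.
            if row["expires_at"] is not None:
                row["expires_at"] = None
                changed += 1
    return changed
```

Because rows without an expiration are skipped, a second run of this sketch changes nothing, which matters if a backfill has to be stopped and restarted.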
Remediation Items
- Establish a clear communication protocol between teams to ensure that known defects are properly documented and reviewed before implementing significant changes.
- Improve existing implementation tools to better manage the authentication flow and reduce technical debt associated with legacy SDKs.
- Create additional alerts and monitoring systems to detect similar issues in the future, particularly focusing on OAuth token failures.
- Introduce connection limits on specific applications to prevent excessive token generation and mitigate database size inflation.
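As one possible shape for the alerting item above, the following Python sketch tracks the failure rate of token-authenticated requests over a sliding time window. The class name, thresholds, and window size are hypothetical and do not describe the monitoring that was actually put in place.

```python
from collections import deque
from typing import Deque, Tuple


class TokenFailureMonitor:
    """Sliding-window failure-rate check for token-authenticated requests."""

    def __init__(self, window_seconds: float, threshold: float,
                 min_requests: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold        # e.g. 0.2 means alert at 20% failures
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, failed: bool) -> None:
        self._events.append((timestamp, failed))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def failure_rate(self) -> float:
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_alert(self) -> bool:
        return (len(self._events) >= self.min_requests
                and self.failure_rate() >= self.threshold)
```

The `min_requests` guard is a design choice in this sketch: without it, a single failed request in a quiet period would register as a 100% failure rate.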
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.
Bob Novak
Post-mortem published December 17, 2024