
Attila Takacs

Joined Apr 14, 2021 · Last activity Feb 05, 2025

Following: 0 · Followers: 2 · Total activity: 72 · Votes: 1 · Subscriptions: 63

ACTIVITY OVERVIEW

Latest activity by Attila Takacs

Attila Takacs created an article, Service notifications

SUMMARY

February 05, 2025 11:48 AM UTC | February 05, 2025 03:48 AM PT
We are pleased to inform you that the login issues with our guide service have been identified and resolved by our engineering team. You should now be able to access the service without any problems. Thank you for your patience and understanding during this incident. If you continue to experience any issues, please reach out to our support team.

February 05, 2025 11:29 AM UTC | February 05, 2025 03:29 AM PT
We are experiencing service degradation with our guide service login from mobile browsers. If you encounter errors while accessing your account, please try logging in from a desktop. Our team is actively working to resolve the issue.

POST-MORTEM

TBD

FOR MORE INFORMATION

For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.

Edited Feb 05, 2025 · Attila Takacs · 0 Followers · 1 Vote · 0 Comments


Attila Takacs created an article, Service notifications

SUMMARY

On February 1st, 2025 from 00:13 UTC to 00:59 UTC, customers on POD 26 experienced issues with accessing archived tickets. During this time, multiple database reader nodes were unable to open a table due to a defect in the database system. This resulted in failed queries for archived tickets.

TIMELINE

February 01, 2025 01:13 AM UTC | January 31, 2025 05:13 PM PT
We are happy to report that the issue causing errors impacting a group of our Support customers on POD 26 has now been resolved. Thank you for your patience during our investigation.

February 01, 2025 12:57 AM UTC | January 31, 2025 04:57 PM PT
Our engineers believe they have identified the root cause of the errors impacting a group of our Support customers on POD 26 and are working to address the issue.

February 01, 2025 12:57 AM UTC | January 31, 2025 04:57 PM PT
We are investigating potential errors for our Support customers hosted on POD 26.


POST-MORTEM

Root Cause Analysis

This incident was caused by a defect in the database system that prevented cluster reader nodes from accessing an archived tickets table. The defect was confirmed by our vendor technical support and was specific to the database version installed at the time.


Resolution

To resolve this issue, our engineers halted the deployment to other shards and allowed the ongoing schema modifications to complete on the impacted shards. At that point, the database table was accessible again. Going forward, the team plans to upgrade to a new version of our database system, which includes a patch for the identified defect.


Remediation Items

  1. Upgrade to the patched version or later before resuming schema changes.
  2. Split column additions and index drops into separate actions to minimize risk during deployments.
  3. Update the run-book to require that large migrations reach only one cluster initially before expanding to others (see the sketch after this list).
  4. Implement a regular review process (at least annually) of database system patches and establish an upgrade cadence.
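
To illustrate remediation item 3, here is a minimal sketch of a canary-first rollout gate. The cluster names, the migration runner, and the health check are hypothetical stand-ins, not Zendesk's actual tooling.

```python
# Hypothetical sketch of a canary-first schema migration rollout (remediation item 3).
# Cluster names, the migration runner, and the health check are illustrative only.
import time

CANARY_CLUSTER = "cluster-01"                 # single cluster that receives the change first
REMAINING_CLUSTERS = ["cluster-02", "cluster-03"]
SOAK_SECONDS = 15 * 60                        # how long the canary must stay healthy


def run_migration(cluster: str) -> None:
    """Placeholder for the real migration runner (e.g. an online schema-change tool)."""
    print(f"applying migration on {cluster}")


def cluster_healthy(cluster: str) -> bool:
    """Placeholder health check: inspect error rates / reader-node status for the cluster."""
    return True


def rollout() -> None:
    # Step 1: migrate the canary cluster only.
    run_migration(CANARY_CLUSTER)

    # Step 2: let the change soak and verify reader nodes stay healthy.
    time.sleep(SOAK_SECONDS)
    if not cluster_healthy(CANARY_CLUSTER):
        raise RuntimeError("canary unhealthy; halting rollout to remaining clusters")

    # Step 3: only then expand to the remaining clusters, one at a time.
    for cluster in REMAINING_CLUSTERS:
        run_migration(cluster)
        if not cluster_healthy(cluster):
            raise RuntimeError(f"{cluster} unhealthy; halting rollout")


if __name__ == "__main__":
    rollout()
```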

FOR MORE INFORMATION

For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.

Edited Feb 06, 2025 · Attila Takacs · 0 Followers · 2 Votes · 1 Comment


Attila Takacs created an article, Service notifications

SUMMARY
On January 13th, 2025 from 11:07 UTC to 12:07 UTC, customers on Pod 17 experienced issues with Messaging Triggers not executing.

TIMELINE

January 13, 2025 12:24 PM UTC | January 13, 2025 04:24 AM PT
The recent Messaging issue has been fully resolved, and our services are back to full operability! Thank you for your patience during this time. Our team will continue to monitor our systems closely to ensure everything runs smoothly. We appreciate your support and are here for any questions or feedback you may have!

January 13, 2025 11:51 AM UTC | January 13, 2025 03:51 AM PT
We are investigating issues with Messaging Triggers not executing for our customers on Pod 17.


POST-MORTEM

Root Cause Analysis

This incident was caused by premature terminations of consumers for the Messaging ticket log events service, which occurred while the service was still running. As a result, the consumers were unable to process incoming events, leading to a complete halt in the evaluation and execution of Messaging Triggers on Pod 17.

Resolution

To resolve this issue, we identified the configuration error that set the maximum number of records to be processed in a single batch to 500 instead of the intended 250. By correcting this misconfiguration and reducing the max records value, we aimed to decrease the likelihood of consumer terminations due to timeout issues.
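
For illustration only, the sketch below shows how a batch-size setting interacts with the poll-interval timeout in a Kafka-style consumer, using the open-source kafka-python client as a stand-in. The topic and group names are hypothetical and nothing here reflects Zendesk's actual implementation.

```python
# Illustrative only: the incident describes a max-records-per-batch setting of 500
# instead of the intended 250. This sketch uses the open-source kafka-python client
# as a stand-in; the topic name and group id are hypothetical, not Zendesk's.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "messaging-ticket-log-events",          # hypothetical topic name
    group_id="messaging-trigger-service",   # hypothetical consumer group
    bootstrap_servers=["localhost:9092"],
    max_poll_records=250,         # smaller batches finish well within the poll interval
    max_poll_interval_ms=300000,  # a consumer is evicted if a batch takes longer than this
)

for message in consumer:
    # Evaluate Messaging Triggers for each ticket log event (placeholder).
    print(message.topic, message.partition, message.offset)
```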

Remediation Items

  1. Implement a health check to detect premature terminations of consumers.
  2. Create a monitor to track the number of running consumers.
  3. Establish a monitor for stopped partitions of the Messaging ticket log events consumer.
  4. Add a consumer lag status widget to the Messaging Trigger Service dashboard.
  5. Create a new metric to measure the time taken to process a batch of messages from the messaging ticket log events topic.

These remediations are designed to enhance monitoring and prevent similar incidents in the future, ensuring the stability and reliability of the Messaging Trigger Service.
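
As a rough illustration of remediation items 2 and 4, a minimal lag-check sketch is shown below, again using kafka-python as a stand-in with hypothetical topic and group names; production monitoring would live in the metrics and alerting stack rather than a standalone script.

```python
# Hypothetical consumer-lag check (remediation items 2 and 4). Names are illustrative.
from kafka import KafkaConsumer, TopicPartition

TOPIC = "messaging-ticket-log-events"   # hypothetical topic name
GROUP = "messaging-trigger-service"     # hypothetical consumer group

consumer = KafkaConsumer(group_id=GROUP, bootstrap_servers=["localhost:9092"])
partitions = [TopicPartition(TOPIC, p) for p in (consumer.partitions_for_topic(TOPIC) or [])]
end_offsets = consumer.end_offsets(partitions)   # latest offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0      # last offset the group committed
    total_lag += end_offsets[tp] - committed

print(f"total consumer lag for {GROUP}: {total_lag}")
# An alert would fire when total_lag stays above a threshold, or when the number of
# active consumers drops below the expected count.
```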


FOR MORE INFORMATION

For current system status information about Zendesk and specific impacts to your account, visit our system status page. You can follow this article to be notified when our post-mortem report is published. If you have additional questions about this incident, contact Zendesk customer support.

Edited Jan 29, 2025 · Attila Takacs · 0 Followers · 2 Votes · 1 Comment


Attila Takacs created an article, Service notifications

SUMMARY

On December 9, 2024 from 08:08 UTC to 13:03 UTC, customers using the multi-brand functionality experienced errors when trying to create tickets or update ticket titles, even though the changes were in fact saved.

TIMELINE

December 09, 2024 10:21 PM UTC | December 09, 2024 02:21 PM PT

We are happy to inform you that the recent change has been successfully reverted, and the issue with multi-brand ticket creation has now been resolved.
If you are still experiencing any issues, please try performing a hard refresh (Ctrl + F5) or clear your browser cache and cookies, as this can help resolve any lingering problems.
Feel free to reach out if you continue to encounter difficulties. Thank you for your patience!

December 09, 2024 12:58 PM UTC | 04:58 AM PT
We're currently experiencing issues with multi-brand ticket creation. Our engineering team is aware of the problem and is actively working on resolving it as quickly as possible.
We will provide updates as soon as more information becomes available. Thank you for your patience and understanding!

 

POST-MORTEM

Root Cause Analysis

The root cause of the incident was a flaw in the ticket initialization logic. New logic introduced to read the user ID from the user profile page and set it as the requester ID inadvertently left the requester object undefined. When the requester ID was undefined, the system attempted to retrieve it based on the ticket key, leading to errors whenever any field change occurred in the ticket UI.
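
As an illustration of the flaw described above, the sketch below shows the kind of defensive requester initialization that would have avoided the undefined value; the types and function names are hypothetical, not Zendesk's code.

```python
# Hypothetical sketch: ticket initialization should never leave the requester unset.
# Names (Ticket, resolve_requester_id) are illustrative only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    key: str
    requester_id: Optional[int] = None


def resolve_requester_id(ticket: Ticket, current_user_id: int) -> int:
    """Return a usable requester id, falling back to the current user."""
    if ticket.requester_id is not None:
        return ticket.requester_id
    # The incident occurred because this fallback path was broken: the new logic
    # left the requester undefined instead of filling it in before saving.
    return current_user_id


ticket = Ticket(key="brand-2/123")
assert resolve_requester_id(ticket, current_user_id=42) == 42
```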


Resolution

To resolve the issue, the problematic update was reverted.


Remediation Items

  1. Fix flawed code: Ensure that the system correctly sets requester data when creating tickets to prevent blank entries.
  2. Add automatic testing: Create a test to check that the ticket save process correctly handles requester information.
  3. Confirm Manual Testing: Require deployers to manually test changes on “canary” PODs and confirm that everything works before deployment.
  4. Improve Monitoring: Set up monitoring to alert on browser errors, such as “something went wrong,” to quickly identify issues.

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.

Edited Dec 17, 2024 · Attila Takacs · 0 Followers · 5 Votes · 2 Comments


Attila Takacs created an article, Service notifications

SUMMARY

From 21:09 UTC on December 3, 2024 to 03:36 UTC on December 7, 2024, some customers using the mobile SDK experienced 400 errors when creating tickets. Due to a change, newly created OAuth tokens were assigned a default expiration time of 8 hours. This change inadvertently broke the legacy mobile SDKs, which were unable to retrieve new tokens once their existing tokens became invalid, leading to a frustrating user experience. The issue was resolved by reverting the change.


TIMELINE

December 6, 2024 6:20 PM UTC | December 6, 2024 10:20 AM PT

We are happy to report that the issue causing some customers to experience 400 errors when creating tickets via the SDK has been resolved. We apologize for any disruption this may have caused, and thank you for your patience during our investigation.

December 6, 2024 12:06 PM UTC | December 6, 2024 04:06 AM PT
Our team continues to work to address the behaviour causing 400 errors on ticket submissions via the API through our Mobile SDK. For now, if end users encounter this error, they can restart the app and tickets will be created as normal.


December 6, 2024 09:45 AM UTC | December 6, 2024 01:45 AM PT

We are aware that some of our customers may experience 400 errors while attempting to create tickets through our Mobile SDK. If you face this error, please restart the app to fix the issue.


POST-MORTEM

Root Cause Analysis

This incident arose from an oversight in assessing how authentication tokens were utilized across different products before rolling out a change in their expiration time. The legacy SDKs by design cannot obtain new OAuth tokens when existing tokens expire, but this aspect was not fully taken into account during the planning and integration stages. Enhanced collaboration and a more thorough evaluation of token usage could have helped avoid this disruption.
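
To make the failure mode concrete, the sketch below shows the refresh-on-expiry behaviour that the legacy SDKs lack: a client that can fetch a fresh token recovers from the failed request, while a legacy SDK stops at the first error. The endpoints and payloads are hypothetical.

```python
# Illustrative only: a client that can refresh its OAuth token recovers from an
# expired-token error; the legacy SDKs could not. Endpoints are hypothetical.
import requests

TOKEN_URL = "https://example.zendesk.com/oauth/tokens"   # hypothetical
API_URL = "https://example.zendesk.com/api/v2/requests"  # hypothetical


def fetch_token() -> str:
    resp = requests.post(TOKEN_URL, json={"grant_type": "client_credentials"})
    resp.raise_for_status()
    return resp.json()["access_token"]


def create_ticket(payload: dict, token: str) -> requests.Response:
    return requests.post(API_URL, json=payload, headers={"Authorization": f"Bearer {token}"})


token = fetch_token()
resp = create_ticket({"request": {"subject": "Hello"}}, token)
if resp.status_code in (400, 401):
    # Legacy SDKs stopped here; a refresh-capable client retries with a new token.
    token = fetch_token()
    resp = create_ticket({"request": {"subject": "Hello"}}, token)
print(resp.status_code)
```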


Resolution

To resolve the issue, the Authentication team first disabled the backfill process that added expiration times to existing tokens. Subsequently, they deployed a pull request that reverted the expiration settings for new tokens and initiated a backfill to remove expiration from existing tokens. This action restored functionality for the majority of affected customers.


Remediation Items

  1. Establish a clear communication protocol between teams to ensure that known defects are properly documented and reviewed before implementing significant changes.
  2. Improve existing implementation tools to better manage the authentication flow and reduce technical debt associated with legacy SDKs.
  3. Create additional alerts and monitoring systems to detect similar issues in the future, particularly focusing on OAuth token failures.
  4. Introduce connection limits on specific applications to prevent excessive token generation and mitigate database size inflation.

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.

Edited Dec 20, 2024 · Attila Takacs · 0 Followers · 3 Votes · 1 Comment


Attila Takacs created an article, Service notifications

SUMMARY

On November 21, 2024 from 21:02 UTC to 21:56 UTC, some customers using Sunshine Conversations hosted on Pod 17 experienced slowness and performance issues.

 

TIMELINE

November 24, 2024 10:23 PM UTC | November 24, 2024 02:23 PM PT
We are pleased to announce that the latency issues impacting Sunshine Conversations for some of our customers on POD 17 have now been resolved. Thank you very much for your patience!

November 24, 2024 10:09 PM UTC | November 24, 2024 02:09 PM PT
We believe we have identified the root cause of the performance issues impacting SunCo for our customers on Pod 17. We are now seeing improvements and will continue to monitor the behaviour.

November 24, 2024 09:53 PM UTC | November 24, 2024 01:53 PM PT
We continue to investigate performance issues on Pod 17, which may cause slowness in Sunshine Conversations. We will provide further updates as soon as possible.

November 24, 2024 09:36 PM UTC | November 24, 2024 01:36 PM PT
We are investigating potential performance issues impacting some of our customers hosted on Pod 17. We will post an update with further details soon.


POST-MORTEM

Root Cause Analysis

This incident was caused by an unexpected surge in traffic on Pod 17, which more than doubled in the preceding week and almost tripled on the day of the incident. The Unity SDK utilized by a customer was making excessive requests to the SunCo API to retrieve unread message counts, leading to increased load on the system. The resource auto-scaler was already at maximum capacity, preventing the addition of more resources to handle the increased traffic. Consequently, this overload resulted in slower response times and ultimately triggered health checks that initiated restarts, compounding the issue.
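
For illustration, the sketch below shows the kind of client-side caching the remediation items below propose for unread-message counts, so repeated polls do not all reach the SunCo API; the function names and cache layout are hypothetical, not the Unity SDK's actual code.

```python
# Hypothetical sketch: cache the unread-message count briefly instead of hitting
# the SunCo API on every poll. Names and the TTL value are illustrative only.
import time

_CACHE: dict[str, tuple[float, int]] = {}  # conversation id -> (fetched_at, count)
TTL_SECONDS = 30


def fetch_unread_count_from_api(conversation_id: str) -> int:
    """Placeholder for the real SunCo API call."""
    return 0


def unread_count(conversation_id: str) -> int:
    now = time.monotonic()
    cached = _CACHE.get(conversation_id)
    if cached and now - cached[0] < TTL_SECONDS:
        return cached[1]              # serve from cache, no API request
    count = fetch_unread_count_from_api(conversation_id)
    _CACHE[conversation_id] = (now, count)
    return count
```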

Resolution

To resolve the performance issues, we increased the maximum number of replicas for the SunCo API on Pod 17. This adjustment allowed the system to better handle the increased traffic and restored normal response times for all customers.

Remediation Items

  1. Investigate the Unity SDK to understand why it is sending an excessive number of requests to SunCo and implement optimizations.
  2. Document backend interaction patterns in embeddables to clarify usage and identify potential inefficiencies.
  3. Evaluate the implementation of a caching strategy for SDK APIs in SunCo to reduce the number of requests made.
  4. Add monitoring to detect abnormal traffic growth over specified periods to proactively address potential overloads.

 

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.

Edited Dec 04, 2024 · Attila Takacs · 0 Followers · 2 Votes · 1 Comment


Attila Takacs created an article, Service notifications

SUMMARY

Between 23:30 UTC on November 12, 2024, and 11:26 UTC on November 15, 2024, Support customers using SLAs in Pods 25 and 30 experienced delayed SLA calculations, and the SLA badges on their tickets were not appearing as expected after applicable ticket updates.

 

TIMELINE

November 15, 2024 01:00 PM UTC | November 15, 2024 05:00 AM PT
We are pleased to report that the issues impacting Metrics SLA performance on Pod 25 and 30 have now been resolved. Thank you for your patience.

November 15, 2024 12:16 PM UTC | November 15, 2024 04:16 AM PT
We are now seeing improvements to the issue impacting Metrics SLA performance on Pod 25 and 30. We continue to monitor and will provide further updates as soon as we have them.

 

POST-MORTEM

Root Cause Analysis

This incident was caused by a misconfigured secret for the metric event service. This meant that when Zendesk deployed an update with additional validation, the service failed to initialize for Asia-Pacific deployments, leading to processing delays.

Resolution

To fix this issue, a "default" value was added for the affected secret on November 15, 2024. This allowed the metric event service to initialize properly and resume normal operations. Zendesk also identified and set a default value for a secret used by the Talk transcription service to mitigate any future risks.
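
As a rough illustration of the fix, the sketch below shows startup-time secret loading with an explicit default and a loud warning, so a missing locality-specific value does not block initialization; the variable name is hypothetical.

```python
# Hypothetical sketch of startup-time secret validation with a safe default,
# mirroring the fix described above. The variable name is illustrative only.
import os
import sys

DEFAULT_VALUE = "default"


def load_secret(name: str) -> str:
    value = os.environ.get(name)
    if value:
        return value
    # Fall back to a documented default instead of failing to initialize,
    # and log loudly so the missing locality-specific value gets fixed.
    print(f"warning: secret {name} unset, using default", file=sys.stderr)
    return DEFAULT_VALUE


if __name__ == "__main__":
    secret = load_secret("METRIC_EVENT_SERVICE_SECRET")  # hypothetical name
    print("service can initialize; secret loaded")
```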

Remediation Items

  1. Conduct a thorough audit of all secrets to ensure that values are set for all localities, especially in Asia-Pacific regions.
  2. Improve existing implementation tools to prevent similar misconfigurations in the future.
  3. Create additional alerts to notify relevant teams of initialization failures and issues.
  4. Investigate the tracking of failure metrics to ensure that such incidents trigger alerts for timely resolutions.

By implementing these remediations, we aim to enhance the resilience of our services and prevent similar incidents in the future.

 

FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.

Edited Dec 04, 2024 · Attila Takacs · 0 Followers · 2 Votes · 1 Comment


Attila Takacs created an article, Service notifications

SUMMARY

On November 14, 2024 from 14:15 UTC to 23:20 UTC, customers using Explore experienced network errors accessing reports and dashboards. 

 

TIMELINE

November 14, 2024 06:11 PM UTC | November 14, 2024 10:11 AM PT
We are happy to report that the issue causing "Network Error" messages when loading Explore dashboards or reports has been solved. Thank you for your patience during our investigation.

November 14, 2024 03:18 PM UTC | November 14, 2024 07:18 AM PT
Some Explore users may see a "Network Error" message when loading dashboards or reports. If this happens, clearing your browser’s cookies and cache should resolve the issue. We apologise for the inconvenience.

POST-MORTEM

Root Cause Analysis

This incident was caused by a network change that increased the number of headers sent to the Explore service beyond its configured limit. This led to the service failing silently, resulting in the dashboards not loading for customers.
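
For illustration, the sketch below shows the kind of explicit header-count guard the remediation items below call for, so an over-limit request is rejected and alerted on rather than failing silently; the limit value is an assumption, not the actual Explore configuration.

```python
# Hypothetical sketch: reject or alert on requests whose header count exceeds the
# configured limit instead of failing silently. The limit is illustrative only.
MAX_HEADER_COUNT = 100  # assumed configured limit, not the actual Explore value


def check_headers(headers: dict[str, str]) -> None:
    if len(headers) > MAX_HEADER_COUNT:
        # Emit a metric / alert here; an explicit 4xx is far easier to diagnose
        # than a silent failure that surfaces to customers as "Network Error".
        raise ValueError(
            f"request carries {len(headers)} headers, limit is {MAX_HEADER_COUNT}"
        )


check_headers({"Host": "example.zendesk.com", "Accept": "application/json"})
```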

Resolution

To resolve this issue, the network team was paged to investigate the changes. Upon identifying the problem, they reverted the changes that added excessive headers, which restored normal functionality for the Explore Dashboards.

Remediation Items

  • Improve existing implementation tools: Review and enhance the tools used for managing header limits in API requests to prevent similar incidents.
  • Create additional alerts: Set up alerts for monitoring header counts and other critical metrics to detect issues before they impact customers.
  • Add connection limits on specific applications: Implement connection limits on APIs to ensure they do not exceed operational thresholds, reducing the risk of future incidents.


FOR MORE INFORMATION

For current system status information about your Zendesk, check out our system status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, contact Zendesk customer support.

Edited Dec 03, 2024 · Attila Takacs · 1 Follower · 5 Votes · 1 Comment