Erin McKeown
Joined Apr 14, 2021 · Last activity Oct 21, 2021
Latest activity by Erin McKeown
Erin McKeown created an article,
This is part 3 of the Overview of incident management at Zendesk. This guide contains the following parts:
- Part 1: How Zendesk service issues become service incidents
- Part 2: How Zendesk manages service incidents
- Part 3: Monitoring a public Zendesk service incident (this article)
- Part 4: Post resolution incident analysis and reporting
In this article, part 3, you'll learn how to stay up to date through key communication channels.
While Zendesk works to resolve an incident, customers can get information about it through several channels:
- Check the system status page. Real-time information about public Zendesk service incidents can be found by checking the System Status Page using your account subdomain.
- Enable email notifications for service incidents. To help you monitor your account status, admins and agents can choose to receive emails when a service incident affects your account by subscribing to incident email notifications. You can subscribe directly from the Zendesk status page or from your Support account. The Subscribing to status notifications for your account article provides more detailed information about how customers can track and monitor Zendesk service incidents.
- Connect to the Zendesk component status API. Customers who use Zendesk APIs in their customer service solutions can automate access to the status of those components by connecting to the Zendesk component status API.
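As a rough illustration of the kind of automation the last item describes, the sketch below filters a status payload for components that are not fully operational. The payload shape, the field names (`components`, `name`, `status`) and the status strings are assumptions invented for this example; they are not taken from the Zendesk component status API, so check the API reference for the real endpoint and schema before building on it.

```python
from typing import Any

# Hypothetical value meaning "no known issue"; the real API may use other values.
HEALTHY_STATUS = "operational"


def degraded_components(payload: dict[str, Any]) -> list[str]:
    """Return the names of components whose status is not HEALTHY_STATUS.

    Assumes a payload shaped like {"components": [{"name": ..., "status": ...}]},
    which is an illustrative shape and not the documented response format.
    """
    components = payload.get("components", [])
    return [c["name"] for c in components if c.get("status") != HEALTHY_STATUS]


# Example payload, invented for this sketch.
sample = {
    "components": [
        {"name": "Support", "status": "operational"},
        {"name": "Talk", "status": "degraded"},
    ]
}
print(degraded_components(sample))
```

Keeping the filtering separate from the HTTP call makes the logic easy to test against saved responses once the real schema is known.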
Here is an example of what the System Status Page contains:
Zendesk System Status Page
Learn more
This completes Part 3, Monitoring a public service incident, of the Overview of incident management at Zendesk.
If you'd like to learn more, you can move on to the next part of this guide: Part 4: Post resolution incident analysis and reporting.
Edited Nov 01, 2024 · Erin McKeown
Erin McKeown created an article,
This is part 2 of the Overview of incident management at Zendesk. This guide contains the following parts:
- Part 1: How Zendesk service issues become service incidents
- Part 2: How Zendesk manages service incidents (this article)
- Part 3: Monitoring a public Zendesk service incident
- Part 4: Post-resolution incident analysis and reporting
In this article, part 2, you'll get an understanding of how the Zendesk teams respond to service incidents within our products based on severity levels. Zendesk takes a comprehensive approach in understanding an incident--from its root cause to the total impact to affected customers--and communicates the appropriate level of detail.
This article contains the following sections:
- Incident severity
- Incident response team structure
- Communication timelines for incidents
- Low severity incident process
Incident severity
One of the key decisions made when a service incident is created is assigning the incident’s severity. The severity of an incident determines how and which Zendesk teams address the issue and how it is communicated to customers who are affected.
Zendesk uses a system that groups service incidents into 5 severity levels based on the characteristics of the incident:
Zendesk Severity Rating System
Different escalation paths and teams are engaged to investigate, communicate and remediate the incident based on severity level. This ensures the right level of rigor is given to each incident. The diagram below describes the key activities that happen during an incident and after it is cleared, based on its severity level:
Process by Severity Level
While high severity incidents go through rigorous analysis and remediation activities, every incident, regardless of severity level, goes through a real-time response and investigation process. That process produces:
- Updates to the Zendesk status page when the incident is public
- Root cause analysis and incident remediations
- Zendesk (internal) incident report
Zendesk service availability incident example
Here is an example of how Incident Severity is set by Zendesk and how Zendesk teams respond internally:
Incident Discovery and Response Example
In the example, you see the following workflow:
- The Zendesk Network Operations Center (ZNOC) identified an issue when system alerts showed service nodes in Pod 17 could not be reached by requests. The Zendesk Network Engineering team verified the access issues were affecting customer services directly and quickly realized the Support, Guide and Talk services for multiple customers were not operating as expected. A new Zendesk service incident was created.
- This incident was known to affect two customers when it was initially created, but because of the nature of the outage, more customers were experiencing the issue and began to raise their issues with Zendesk Customer Support. The incident was assigned a severity rating of 1 by the Engineering team - a high priority incident that requires immediate attention.
- The Incident Response on-call team was paged immediately. Within minutes of incident creation, an Incident Manager gathered information and assembled additional engineering resources to troubleshoot and fix the service incident.
Incident response team structure
Zendesk has a dedicated global Incident Response team to ensure that every incident is shepherded through the service incident management process and escalated to the appropriate levels of Zendesk leadership, as warranted.
Incident Management Roles and Responsibilities
This team structure enables Zendesk to conduct a thorough analysis of the incident with technical resources and communicate in real time to customers through Zendesk Customer Support.
Communication timelines for incidents
Zendesk is invested in making sure incidents are properly communicated and resolved in appropriate timeframes for the customer. We have established internal timelines for the distribution of incident details. The timeline is based on the severity level of the incident and the service incident management stages.
| Stage | Response Timelines |
|---|---|
| Public Announcement | Within 15 minutes of the incident being declared |
| Incident Updates | Every 30 minutes until service is restored or as new information becomes available |
| Event Analysis (for Severity 0 and 1 incidents) | Within 48 hours of incident resolution |
| Root Cause Analysis | Within 72 hours of incident resolution |
| Public Incident Retrospective | Within 96+ hours of incident resolution |
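To make the arithmetic behind these timelines concrete, here is a minimal sketch that turns the targets above into clock times. The function and constant names are made up for this example, and counting updates from the first public post is an assumption. The public retrospective is left out because its "96+ hours" target is open-ended.

```python
from datetime import datetime, timedelta

# Targets copied from the communication timeline table.
PUBLIC_ANNOUNCEMENT = timedelta(minutes=15)
UPDATE_INTERVAL = timedelta(minutes=30)
EVENT_ANALYSIS = timedelta(hours=48)       # Severity 0 and 1 incidents only
ROOT_CAUSE_ANALYSIS = timedelta(hours=72)


def during_incident_targets(declared_at: datetime, updates: int = 3) -> dict:
    """Target times for the first public post and the next few updates."""
    first_post = declared_at + PUBLIC_ANNOUNCEMENT
    return {
        "public_announcement": first_post,
        # Assumption: each update is due 30 minutes after the previous post.
        "updates": [first_post + UPDATE_INTERVAL * (i + 1) for i in range(updates)],
    }


def post_resolution_targets(resolved_at: datetime, severity: int) -> dict:
    """Target times for the analysis stages that follow resolution."""
    targets = {"root_cause_analysis": resolved_at + ROOT_CAUSE_ANALYSIS}
    if severity in (0, 1):
        targets["event_analysis"] = resolved_at + EVENT_ANALYSIS
    return targets


declared = datetime(2024, 1, 1, 10, 0)
print(during_incident_targets(declared)["public_announcement"])
print(post_resolution_targets(datetime(2024, 1, 1, 11, 0), severity=1))
```

These are targets rather than guarantees; the cadence of updates can be shortened or lengthened depending on how much new information is available.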
Incidents that have a Severity rating of 0 - 2 are considered high severity incidents. When a high severity incident occurs, the global on-call Incident Response Team is available 24/7 to respond to these incidents. The team consists of the following roles:
Incident Response Team Roles
Global Incident Response Team Locations
As the on-call team is paged, the incident diagnosis starts within minutes of the incident being declared. A Slack channel and Zoom call are created to enable response team communication in real-time. As the Incident Response team triages and scopes the incident, on-call engineering teams are paged based on what products and services are affected.
A public post on the Zendesk status page is made within 15 minutes of incident creation to keep customers informed about the known incident. Updates are posted every 30 minutes thereafter until resolution as new information becomes available. Depending on the issue and how much new information is identified, this cadence may be reduced or lengthened as needed. Customers can monitor active service incidents on the Zendesk Status page - that process is described in part 3 of this guide.
In addition to our global on-call incident response teams, Zendesk has established processes for leadership notification and escalation. If a high severity incident fits certain criteria, we enable the next level of response, which is Crisis Management.
Zendesk service availability incident example continued
In continuation of using the service availability incident as an example, this is how the incident response progressed through the Incident Management process at Zendesk:
Service Availability Incident Response Timeline
As you can see in the example, once the incident was created in the Zendesk Incident Portal, a series of automated actions were taken:
- The Incident Response on-call team was paged to respond to the incident
- An incident Slack channel was automatically created and the Incident Response on-call team was added to the dedicated Slack channel
- A Zoom call was automatically started and posted to the Slack channel for all responders to join
- An Event Summary Document was automatically created to document the incident and share progress internally to Zendesk
On the Zoom call, the Incident Manager validated the initial severity and confirmed the scope and impact of the issue.
It was quickly determined multiple container nodes in Pod 17 were not accessible and could not be used by dependent services including Support, Guide and Talk products. One node type in particular had no available replicas in other pods. This would eventually cause these products to become unresponsive for multiple customers.
The ZNOC paged the appropriate Network engineering team to the Zoom call to begin investigating how to solve the immediate problem of restoring service and API access to customers. Edge engineering SMEs were also paged and joined the call. Within 5 minutes, a fix was identified and deployed so the affected nodes were again accessible to API calls and services.
Zendesk Customer Support created a problem ticket to track the customer reports. This ticket was added to the incident Slack channel to quickly allow for new reports to be added as they came in.
While the investigation was continuing, the Incident Escalation Manager created and published the first public update to the Zendesk Status Page 12 minutes after incident creation.
First Service Availability Incident Post on Zendesk System Status Page
While the teams investigated the incident, customer reports that came in were linked to the main problem ticket associated with the incident. This allowed the Incident Response team to send updates to all impacted customers when they made public notifications.
The Network engineering team determined a change to how certificates were generated and used was responsible for the incident and took the following actions to restore service to affected customers:
- Deactivated unreachable nodes
- Created new service nodes with properly referenced certificates
- Verified that new service nodes were accessible for services and through API calls
- Monitored inbound traffic to see that inbound requests were now being handled appropriately
As the incident progressed, two more public updates were made: one 14 minutes after incident creation and another 63 minutes after incident creation. The public communication history, along with published incident retrospective information, can be found on the Service Notifications page for the incident.
As shown in the example, high severity incidents go through a rigorous process where root causes are determined and remediation items are created for Product Engineering teams to fix the underlying problem that caused the incident. This analysis happens during our incident retrospective and is discussed in more detail in part 4 of this guide, Post resolution incident analysis and reporting.
Low severity incident process
Lower severity service incidents (levels 3-4) are less critical because they affect a smaller number of customers and do not prevent those customers from using business-critical functions of Zendesk products. These incidents are addressed according to the guidelines above, but are not posted to public channels.
Severity 3 incidents are handled in much the same way as severity 0-2 incidents. Expected response times are extended because of the reduced business impact. Even though the on-call team is not paged, these incidents are handled through specific Zendesk incident Slack channels associated with the supporting product engineering team(s), and the teams tend to respond as quickly as they do to higher severity incidents. Most severity 3 incidents do not use public communication channels. Instead, Zendesk Customer Support teams reach out to customers using proactive notifications if specific action is required from a subset of customers.
Severity 4 incidents do not directly affect customer use of Zendesk services, but have the potential to do so if not addressed. These incidents are created as proactive responses to potential issues. Product engineering teams engage the same way as they do with the severity 3 process.
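The split between high and low severity handling described in this article can be summarized in a few lines of code. This is a simplified sketch: the `ResponsePlan` type and its field names are invented for illustration, and it treats severity 4 like severity 3 and assumes low severity incidents are never posted publicly, although the article only says that most are not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    high_severity: bool
    pages_on_call: bool
    public_status_post: bool
    coordination: str


def response_plan(severity: int) -> ResponsePlan:
    """Map a severity rating (0-4) to the response described in this article."""
    if severity not in range(5):
        raise ValueError("severity must be an integer from 0 to 4")
    if severity <= 2:
        # Severity 0-2: on-call Incident Response team paged, public status post.
        return ResponsePlan(True, True, True,
                            "dedicated incident Slack channel and Zoom call")
    # Severity 3-4: no on-call page; handled by product engineering teams.
    return ResponsePlan(False, False, False,
                        "product engineering incident Slack channels")


for level in range(5):
    print(level, response_plan(level))
```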
Learn more
This completes Part 2, How Zendesk manages service incidents, of the Overview of incident management at Zendesk.
If you'd like to learn more, you can move on to the next part of this guide: Part 3: Monitoring a public Zendesk service incident.
Edited Nov 01, 2024 · Erin McKeown
Erin McKeown created an article,
This is part 4 of the Overview of incident management at Zendesk. This guide contains the following parts:
- Part 1: How Zendesk service issues become service incidents
- Part 2: How Zendesk manages service incidents
- Part 3: Monitoring a public Zendesk service incident
- Part 4: Post resolution incident analysis and reporting (this article)
In this article, part 4, you'll learn how the incident response team conducts a retrospective that includes root cause analysis and remediation of service incidents and then assigns remediation items to the engineering team(s) that have ownership.
By conducting these activities, Zendesk Customer Support can share incident details and next steps with affected customers.
This article contains the following sections:
- Conducting a service incident retrospective
- Assigning remediation items
- Closing out a service incident
Conducting a service incident retrospective
Zendesk conducts a reflective exercise with all team members involved with the incident to examine and document the causes of the incident, the incident’s impact to customers and actions taken to mitigate or resolve it. The team reviews the identified root cause(s) and the follow-up actions that will prevent the incident from recurring. This is known as a service incident internal retrospective. Incident retrospectives are shared publicly only for high severity incidents.
To ensure transparency and inclusion for all Zendesk teams, a Zendesk internal retrospective calendar is available so they can attend the internal retrospective meeting and get more information regarding service incidents and root causes. Outcomes of incidents are shared with all engineering teams and significant incident outcomes are highlighted and reviewed in the Zendesk weekly engineering meeting.
There are four main activities performed in a service incident retrospective:
- Review the incident details contained in the Incident Document to anchor and orient the participants to the incident
- Review and validate the Root Cause Analysis findings contained in the Incident Document
- Identify and categorize any remediation work needed for Zendesk engineering teams to fully address the root causes that led to the service incident. All remediation items are agreed to by consensus of the retrospective attendees
- Assign remediation work to the appropriate engineering teams with clear and appropriate SLAs defined.
High severity incident analysis
Once a high severity incident is resolved, the Incident Manager schedules a retrospective meeting that includes:
- All team members who participated in the incident response
- Engineers from teams whose products or services were affected by the incident
- Teams who have ownership or a vested interest, such as:
- Zendesk Customer Support
- Product teams
- Leaders who own affected products, services and areas of support
Every effort is made to hold the incident retrospective meeting within 72 hours of incident resolution, understanding that the timing of the meeting will depend on the complexity of the root cause and availability of team members across geographical regions.
After scheduling the incident retrospective, the engineering owner documents the root cause analysis and creates the Incident Document based on the following categories:
- Incident Overview
- Customer Impact
- Technical Description
- Root Cause and Service Information
- Incident Details and Timings
- Remediations
The Incident Document guides the incident retrospective and captures any remediation work that is identified to fully resolve the underlying issues that caused the incident.
There is an additional analysis phase conducted for severity 0-3 incidents known as Root Cause Analysis. This analysis gives the Engineering team a chance to understand and document the incident and define the work needed to fully fix the issues. This information is captured in the Incident Document.
Zendesk Incident Root Cause Analysis Process
Low severity incident analysis
Low severity incidents go through a leaner root cause and reporting phase than high severity incidents. While a formal incident retrospective meeting is not completed (unless requested by the Product Engineering owner) for low severity incidents, an Incident Document is created by the Product Engineering owner.
Root causes are identified, classified and shared with Engineering teams, and remediation items are added to the Product Engineering team backlog with SLAs. As with higher severity incidents, Zendesk seeks to learn and improve our engineering processes as a result of thoroughly investigating low severity incidents.
Since Severity 3 incidents have a minor impact on customers, the issue status and identified remediations are shared with affected customers who reached out about the incident via Zendesk Customer Support through a Zendesk ticket.
Severity 4 incidents by definition do not have direct customer impact. Post incident analysis is not communicated to customers, but the root causes are identified and remediations are addressed internally using the processes and procedures described above.
Assigning remediation items
In order to ensure remediation items are completed, the Product Engineering team reviews the validated remediation items in the retrospective and performs the following actions:
- Classify remediations as Preventive or General:
- Preventive items are ones that would actively prevent a recurrence of the incident
- General items are not solely preventive on their own but would resolve a core part of the incident’s circumstances
- Prioritize the remediations to set the response SLAs. This exercise goes through the following activities:
- Identify the engineering teams responsible for working the remediation item
- Link the remediation item to the identified root cause that it addresses
- Add the remediation item to the work backlog of the responsible engineering team
- Add the remediation item to the engineering SLA report to track SLA achievement
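The classification and tracking steps above can be modeled with a small data structure. This is a hedged sketch only: the class names, fields and the example item are invented for illustration and do not describe Zendesk's internal tooling, and no SLA durations are included because the priority chart is not reproduced in this text.

```python
from dataclasses import dataclass, field
from enum import Enum


class RemediationType(Enum):
    PREVENTIVE = "preventive"  # would actively prevent a recurrence
    GENERAL = "general"        # resolves a core part of the circumstances


@dataclass
class RemediationItem:
    title: str
    kind: RemediationType
    owning_team: str   # engineering team responsible for the work
    root_cause: str    # root cause this item is linked to


@dataclass
class RemediationTracker:
    backlogs: dict[str, list[RemediationItem]] = field(default_factory=dict)
    sla_report: list[RemediationItem] = field(default_factory=list)

    def assign(self, item: RemediationItem) -> None:
        """Add the item to its team's backlog and to the SLA report."""
        self.backlogs.setdefault(item.owning_team, []).append(item)
        self.sla_report.append(item)


# Hypothetical item, loosely based on the certificate example in this guide.
tracker = RemediationTracker()
tracker.assign(RemediationItem(
    title="Validate certificate references before creating service nodes",
    kind=RemediationType.PREVENTIVE,
    owning_team="Network Engineering",
    root_cause="Change to how certificates were generated and used",
))
print(len(tracker.sla_report), list(tracker.backlogs))
```

Linking each item to a root cause and an owning team at creation time means the SLA report can later be grouped by either one.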
Below is a chart that Product Engineering teams use to determine when a remediation is prioritized and planned for their work effort.
Zendesk Remediation Item Priority SLA
The Zendesk Customer Support team attending the retrospective creates the customer-facing descriptions of the incident, root causes, and any remediations identified. These are posted to the Help Center article associated with the incident.
Service Availability Incident Example Continued
Here’s how an incident retrospective was conducted for this incident.
Four business days after the incident occurred, the Incident Response team and Engineers gathered to review the incident, collaborate on the root causes, and create or update the remediation items. All remediation items are agreed to by consensus of the meeting attendees.
Each person involved in the incident played a role in the incident retrospective:
The details reviewed and discussed in the meeting included:
| Area | Example |
|---|---|
| Timeline | |
| Root Causes | |
| Influencing Factors | |
| Remediations | |
For there to be a thorough analysis to generate concrete actions for the Engineering team, all team members provided input to recount the incident and develop remediation tasks. Once all questions or issues were addressed by the Incident Response team, the incident retrospective was considered complete.
The Zendesk Customer Support lead responsible for the public facing incident retrospective was asked at the end of the internal retrospective meeting if she had any questions or needed any additional information from the team to create the public documentation. She had no further questions and added the retrospective information below to the public service incident article in the Service Notifications section in our help center.
Public Retrospective Information for the Service Availability VM Incident
Three important outcomes of this incident retrospective that have improved Zendesk products and services were:
- The root causes of the incident were identified and will be considered by all Zendesk product teams in future development
- The remediations were identified and assigned to engineering teams with SLAs
- The public retrospective was published by Zendesk Customer Support to the Help Center and was sent to affected customers who submitted tickets
Closing out a service incident
As a best practice, Zendesk closes any open tickets with customers to make sure everything is properly documented and communicated for the incident.
All completed service incidents are summarized in a weekly service incident digest report which is shared widely across Zendesk. Incident descriptions, customer impact and important remediations are included in this report and are also in a bi-weekly Operations Review report that is shared with Zendesk’s Executive team.
After retrospective information is published to the Help Center and open tickets are updated with the results from the retrospective, the analysis and reporting phase for the service incident is considered complete. Zendesk Customer Support links those tickets to the service incident and they are marked as closed.
Edited Nov 01, 2024 · Erin McKeown
Erin McKeown created an article,
Zendesk’s Global Business Resilience Program’s mission is to ensure Zendesk has the ability to rapidly adapt and respond to business disruptions and safeguard people and assets while maintaining continuous business operations.
Zendesk achieves resiliency through four principal areas of focus: Business Continuity, Disaster Recovery, Incident Management and Crisis Management. Zendesk maintains our readiness by proactively assessing operational risks, establishing contingency plans, and administering incident response and crisis management training.
Zendesk assesses and mitigates potential business disruptions through our Business Resilience (BR) Program. Under this program, all critical business functions and locations are required to maintain and exercise alternate operation strategies. The Resilience Program team validates that each business unit’s resiliency strategies are effective and meet the policy established by the program. For critical business operations, we conduct internal/external audits of business continuity plans and moderate annual exercises to ensure the plans efficiently mitigate realistic disruptions and meet compliance certifications and memberships.
Zendesk maintains a risk framework that accounts for the evaluation of our facilities, technology, applications, data, processes and overall organization to ensure our risk mitigation strategy operates at multiple levels with broad coverage.
Within the Business Resilience program we maintain governance through our support model, Business Resilience Steering Committee and Incident Management Council.
In the event of a business disruption, we have plans designed to allow us to continue operations of critical functions. We accomplish this in part by:
- Using redundant processing capacity at other locations.
- Designing our technology and systems to support the recovery processes for critical business functions.
- Using business and technology teams that are responsible for activating and managing the recovery process.
- Exercising our recovery procedures and testing those procedures on a regular basis.
When it comes to disaster recovery, as part of our strategy, Zendesk leverages rigorous business impact and risk analysis to identify applications/services that are critical to each of our products. Amazon Web Services (AWS) is an Advanced Technology Partner of Zendesk. By building within the AWS environment, we benefit from all the partnership has to offer. Our applications/services are hosted in separate Availability Zones (AZ) using industry-standard practices to copy data across multiple AZs in real time.
In addition to all the business continuity and disaster recovery efforts for all Zendesk customers, some may prefer an additional level of redundancy and recoverability, which can be found through our Enhanced Disaster Recovery.
Edited Oct 28, 2021 · Erin McKeown
Erin McKeown created an article,
This is part 1 of the Overview of incident management at Zendesk. This guide contains the following parts:
- Part 1: How Zendesk service issues become service incidents (this article)
- Part 2: How Zendesk manages service incidents
- Part 3: Monitoring a public Zendesk service incident
- Part 4: Post resolution incident analysis and reporting
In this article, part 1, you'll get an understanding of the service incident life cycle at Zendesk, starting from when an incident is detected or reported to the ways Zendesk teams communicate and escalate the incident internally to how incident remediation works.
Before a service incident is created at Zendesk, our Engineering team may receive an alert, or tickets might be raised to the Zendesk Customer Support team that indicate something unusual is happening.
Service incident creation workflow
These issues generally come from two sources:
1. Zendesk Network Operations Center (ZNOC) receives an alert, which is then reviewed and validated by the Zendesk Product Engineering team for affected products (e.g., Support, Guide, Chat, Talk)
The ZNOC team has monitoring tools and processes to alert the Product Engineering team when there are issues with Zendesk products such as a product responding more slowly than expected, increased error rates, or if a service’s volume is changing at a greater rate than expected. Usually these alerts are the first indication that something is not working properly. The majority of service incidents are first discovered through the monitoring that Zendesk has built into our systems.
2. Customer Reports, routed through Zendesk Customer Support
Customer reports are another first indication that something isn't working properly. When customers notice Zendesk service issues, they report the issue using Zendesk's own internal support system or by chatting with Zendesk Customer Support.
To make sure customer reported issues are handled appropriately, the Zendesk Customer Support team evaluates the reported issue and determines if it is part of an existing service incident, if a new service incident should be created, or if the issue reported should follow our standard troubleshooting path.
- If there is an existing incident, the customer ticket will be linked to that incident and the customer will be informed of the current status.
- If it's a new report, a new service incident will be created based on the severity scale (more on that in Part 2) and regular status updates will be sent to all linked tickets.
- If the issue is not caused by a potential Zendesk service incident, Zendesk Customer Support will help the customer troubleshoot the issue and resolve or escalate as appropriate.
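The three outcomes above amount to a small decision procedure, sketched here for clarity. The enum and function names are invented for this example, and the two boolean inputs stand in for judgments that Zendesk Customer Support makes when evaluating a report.

```python
from enum import Enum


class TriageOutcome(Enum):
    LINK_TO_EXISTING_INCIDENT = "link the ticket to the incident and share its current status"
    CREATE_NEW_INCIDENT = "create a new service incident and send regular status updates"
    STANDARD_TROUBLESHOOTING = "troubleshoot with the customer, then resolve or escalate"


def triage(matches_existing_incident: bool,
           looks_like_service_incident: bool) -> TriageOutcome:
    """Pick the path for a customer-reported issue."""
    if matches_existing_incident:
        return TriageOutcome.LINK_TO_EXISTING_INCIDENT
    if looks_like_service_incident:
        return TriageOutcome.CREATE_NEW_INCIDENT
    return TriageOutcome.STANDARD_TROUBLESHOOTING


print(triage(matches_existing_incident=False, looks_like_service_incident=True).name)
```

Checking for an existing incident first keeps duplicate incidents from being created when many customers report the same outage.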
Learn more
This completes Part 1, How Zendesk service issues become service incidents, of the Overview of incident management at Zendesk.
If you'd like to learn more, you can move on to the next part of this guide: Part 2: How Zendesk manages service incidents.
Edited Nov 01, 2024 · Erin McKeown
Erin McKeown created an article,
Zendesk provides business critical functions for our customers. When the service of these products is interrupted—causing disruptions known as service incidents—Zendesk takes action and initiates an investigation and remediation process.
This process includes detection, reporting, analysis, and mitigation of incidents, as well as documentation and remediation steps to ensure that we learn from them. Zendesk seeks to restore the full function of services quickly and thoroughly to provide a trusted and reliable experience for customers.
The service incident management process has four main goals:
- Restore normal operations of Zendesk services as quickly as possible
- Provide meaningful information to customers during an incident to mitigate impact where possible and provide updates on remediation status
- Perform detailed root cause analysis and identify permanent fixes once service is restored and share this analysis with customers to maintain trust in Zendesk services
- Share lessons learned across engineering teams and track incident causes and remediations
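This guide describes the Zendesk incident management process in the following parts:
- Part 1: How Zendesk service issues become service incidents
- Part 2: How Zendesk manages service incidents
- Part 3: Monitoring a public Zendesk service incident
- Part 4: Post resolution incident analysis and reporting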
Edited Nov 01, 2024 · Erin McKeown