Erin McKeown

Joined Apr 14, 2021

Last activity Oct 21, 2021

Latest activity by Erin McKeown

Erin McKeown created an article, October 21, 2021 11:49

Article: Zendesk programs and services

Incident Management part 3: Monitoring a public Zendesk service incident

This is part 3 of the Overview of incident management at Zendesk. This guide contains the following parts:

Part 1: How Zendesk service issues become service incidents
Part 2: How Zendesk manages service incidents
Part 3: Monitoring a public Zendesk service incident (this article)
Part 4: Post resolution incident analysis and reporting

In this article, part 3, you'll learn how to stay up to date through key communication channels.

While Zendesk works to resolve an incident, customers can follow its progress through several channels:

Check the system status page. Real-time information about public Zendesk service incidents can be found by checking the System Status Page using your account subdomain.
Enable email notifications for service incidents. To help you monitor your account status, admins and agents can choose to receive emails when a service incident affects your account by subscribing to incident email notifications. You can subscribe directly from the Zendesk status page or from your Support account. The Subscribing to status notifications for your account article provides more detailed information about how customers can track and monitor Zendesk service incidents.
Connect to the Zendesk component status API. Customers who use Zendesk APIs in their customer service solutions can automate status checks for those components by connecting to the Zendesk component status API.
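As a rough sketch of what that kind of automation might look like, the Python below picks out the components that are not healthy from a status payload. The payload shape (a `components` list with `name` and `status` fields), the `operational` value, and the function name are assumptions made for illustration. They are not taken from the Zendesk component status API, so check the API reference for the real endpoint and schema before relying on anything like this.

```python
import json

# Hypothetical payload shape, assumed for illustration only.
# The real component status API may return different fields and values.
SAMPLE_PAYLOAD = json.dumps({
    "components": [
        {"name": "Support", "status": "operational"},
        {"name": "Guide", "status": "degraded"},
        {"name": "Talk", "status": "operational"},
    ]
})


def degraded_components(payload_text, healthy_status="operational"):
    """Return the names of components whose status is not the healthy value."""
    payload = json.loads(payload_text)
    return [
        component["name"]
        for component in payload.get("components", [])
        if component.get("status") != healthy_status
    ]


if __name__ == "__main__":
    # In a real integration the payload text would come from an HTTP request.
    print(degraded_components(SAMPLE_PAYLOAD))
```

Keeping the parsing separate from the HTTP call makes the check easy to test against saved responses.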

Here is an example of what the System Status Page contains:

Zendesk System Status Page

Learn more

This completes Part 3, Monitoring a public service incident, of the Overview of incident management at Zendesk.

If you'd like to learn more, you can move on to the next part of this guide: Part 4: Post resolution incident analysis and reporting.

Edited Nov 01, 2024 · Erin McKeown


Erin McKeown created an article, October 16, 2021 05:29

Article: Zendesk programs and services

Incident Management part 2: How Zendesk manages service incidents

This is part 2 of the Overview of incident management at Zendesk. This guide contains the following parts:

Part 1: How Zendesk service issues become service incidents
Part 2: How Zendesk manages service incidents (this article)
Part 3: Monitoring a public Zendesk service incident
Part 4: Post-resolution incident analysis and reporting

In this article, part 2, you'll get an understanding of how the Zendesk teams respond to service incidents within our products based on severity levels. Zendesk takes a comprehensive approach in understanding an incident--from its root cause to the total impact to affected customers--and communicates the appropriate level of detail.

This article contains the following sections:

Incident severity
Incident response team structure
Communication timelines for incidents
Low severity incident process

Incident severity

One of the key decisions made when a service incident is created is assigning the incident’s severity. The severity of an incident determines how and which Zendesk teams address the issue and how it is communicated to customers who are affected.

Zendesk uses a system that groups service incidents into 5 severity levels based on the characteristics of the incident:

Zendesk Severity Rating System

Different escalation paths and teams are engaged to investigate, communicate, and remediate the incident based on its severity level, which ensures that each incident receives the right level of rigor. The diagram below describes the key activities that happen while an incident is active and after it is cleared, based on its severity level:

Process by Severity Level

While high severity incidents go through rigorous analysis and remediation activities, every incident, regardless of severity level, goes through a real-time response and investigation process that produces:

Updates to the Zendesk status page when the incident is public
Root cause analysis and incident remediations
Zendesk (internal) incident report

Zendesk service availability incident example

Here is an example of how Incident Severity is set by Zendesk and how Zendesk teams respond internally:

Incident Discovery and Response Example

In the example, you see the following workflow:

The Zendesk Network Operations Center (ZNOC) identified an issue when system alerts showed service nodes in Pod 17 could not be reached by requests. The Zendesk Network Engineering team verified the access issues were affecting customer services directly and quickly realized the Support, Guide and Talk services for multiple customers were not operating as expected. A new Zendesk service incident was created.
This incident was known to affect two customers when it was initially created, but because of the nature of the outage, more customers were experiencing the issue and began to raise their issues with Zendesk Customer Support. The incident was assigned a severity rating of 1 by the Engineering team - a high priority incident that requires immediate attention.
The Incident Response on-call team was paged immediately. Within minutes of incident creation, an Incident Manager gathered information and assembled additional engineering resources to troubleshoot and fix the service incident.

Incident response team structure

Zendesk has a dedicated global Incident Response team to ensure that every incident is shepherded through the service incident management process and escalated to the appropriate levels of Zendesk leadership, as warranted.

Incident Management Roles and Responsibilities

This team structure enables Zendesk to conduct a thorough analysis of the incident with technical resources and communicate in real time to customers through Zendesk Customer Support.

Communication timelines for incidents

Zendesk is invested in making sure incidents are properly communicated and resolved in appropriate timeframes for the customer. We have established internal timelines for the distribution of incident details. The timeline is based on the severity level of the incident and the service incident management stages.

Stage	Response Timelines
Public Announcement	Within 15 minutes of the incident being declared
Incident Updates	Every 30 minutes until service is restored or as new information becomes available
Event Analysis (for Severity 0 and 1 incidents)	Within 48 hours of incident resolution
Root Cause Analysis	Within 72 hours of incident resolution
Public Incident Retrospective	Within 96+ hours of incident resolution
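To make the timelines concrete, here is a minimal sketch that turns the table into target times for a single incident. The function name, the dictionary keys, and the sample dates are hypothetical. The public retrospective row is left out because "96+ hours" does not define a single deadline.

```python
from datetime import datetime, timedelta, timezone

# Targets copied from the table above.
PUBLIC_ANNOUNCEMENT = timedelta(minutes=15)   # after the incident is declared
EVENT_ANALYSIS = timedelta(hours=48)          # after resolution, Severity 0 and 1 only
ROOT_CAUSE_ANALYSIS = timedelta(hours=72)     # after resolution


def communication_targets(declared_at, resolved_at, severity):
    """Return the latest target time for each stage (hypothetical helper)."""
    targets = {
        "public_announcement_by": declared_at + PUBLIC_ANNOUNCEMENT,
        "root_cause_analysis_by": resolved_at + ROOT_CAUSE_ANALYSIS,
    }
    if severity in (0, 1):
        targets["event_analysis_by"] = resolved_at + EVENT_ANALYSIS
    return targets


if __name__ == "__main__":
    # Sample times are invented for the example.
    declared = datetime(2021, 1, 1, 20, 0, tzinfo=timezone.utc)
    resolved = datetime(2021, 1, 1, 22, 0, tzinfo=timezone.utc)
    for stage, deadline in communication_targets(declared, resolved, 1).items():
        print(stage, deadline.isoformat())
```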

Incidents that have a Severity rating of 0 - 2 are considered high severity incidents. When a high severity incident occurs, the global on-call Incident Response Team is available 24/7 to respond to these incidents. The team consists of the following roles:

Incident Response Team Roles

Global Incident Response Team Locations

As the on-call team is paged, the incident diagnosis starts within minutes of the incident being declared. A Slack channel and Zoom call are created to enable response team communication in real-time. As the Incident Response team triages and scopes the incident, on-call engineering teams are paged based on what products and services are affected.

A public post on the Zendesk status page is made within 15 minutes of incident creation to keep customers informed about the known incident. Updates are then posted every 30 minutes until resolution, or sooner as new information becomes available. Depending on the issue and how much new information is identified, the interval between updates may be shortened or lengthened as needed. Customers can monitor active service incidents on the Zendesk status page; that process is described in part 3 of this guide.

In addition to our global on-call incident response teams, Zendesk has established processes for leadership notification and escalation. If a high severity incident fits certain criteria, we enable the next level of response, which is Crisis Management.

Zendesk service availability incident example continued

In continuation of using the service availability incident as an example, this is how the incident response progressed through the Incident Management process at Zendesk:

Service Availability Incident Response Timeline

As you can see in the example, once the incident was created in the Zendesk Incident Portal, a series of automated actions were taken:

The Incident Response on-call team was paged to respond to the incident

An incident Slack channel was automatically created and the Incident Response on-call team was added to the dedicated Slack channel

A Zoom call was automatically started and posted to the Slack channel for all responders to join

An Event Summary Document was automatically created to document the incident and share progress internally to Zendesk

On the Zoom call, the Incident Manager validated the initial severity and confirmed the scope and impact of the issue.

It was quickly determined multiple container nodes in Pod 17 were not accessible and could not be used by dependent services including Support, Guide and Talk products. One node type in particular had no available replicas in other pods. This would eventually cause these products to become unresponsive for multiple customers.

The ZNOC paged the appropriate Network engineering team to the Zoom call to begin investigating how to solve the immediate problem of restoring service and API access to customers. Edge engineering SMEs were also paged and joined the call. Within 5 minutes, a fix was identified and deployed so the affected nodes were again accessible to API calls and services.

Zendesk Customer Support created a problem ticket to track the customer reports. This ticket was added to the incident Slack channel to quickly allow for new reports to be added as they came in.

While the investigation was continuing, the Incident Escalation Manager created and published the first public update to the Zendesk Status Page 12 minutes after incident creation.

First Service Availability Incident Post on Zendesk System Status Page

While the teams investigated the incident, customer reports that came in were linked to the main problem ticket associated with the incident. This allowed the Incident Response team to send updates to all impacted customers when they made public notifications.

The Network engineering team determined a change to how certificates were generated and used was responsible for the incident and took the following actions to restore service to affected customers:

Deactivated unreachable nodes

Created new service nodes with properly referenced certificates

Verified that new service nodes were accessible for services and through API calls

Monitored inbound traffic to see that inbound requests were now being handled appropriately

As the incident progressed, two more public updates were made: one 14 minutes after incident creation and another 63 minutes after incident creation. The public communication history, along with published incident retrospective information, can be found on the Service Notifications page for the incident.

As shown in the example, high severity incidents go through a rigorous process where root causes are determined and remediation items are created for Product Engineering teams to fix the underlying problem that caused the incident. This analysis happens during our incident retrospective and is discussed in more detail in Part 4 of this guide, Post resolution incident analysis and reporting.

Low severity incident process

Lower severity service incidents (level 3-4) are less critical because they affect a smaller number of customers and do not prevent those customers from using business critical functions of Zendesk products. These incidents are addressed according to the guidelines above, but are not posted to public channels.

Severity 3 incidents are handled in much the same way as severity 0-2 incidents, but expected response times are extended because of the reduced business impact. Although the on-call team is not paged, these incidents are handled through specific Zendesk incident Slack channels associated with the supporting product engineering team(s), and the teams tend to respond as quickly as they do for higher severity incidents. Most severity 3 incidents do not use public communication channels. Instead, Zendesk Customer Support teams reach out to customers using proactive notifications if specific action is required from a subset of customers.

Severity 4 incidents do not directly affect customer use of Zendesk services, but have the potential to do so if not addressed. These incidents are created as proactive responses to potential issues. Product engineering teams engage the same way as they do with the severity 3 process.
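The differences between severity levels described in this article can be summarized in a small lookup. This is a simplified sketch: the class and field names are hypothetical, and the booleans flatten details (for example, severity 3 incidents are mostly, not always, kept off public channels).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityHandling:
    """Simplified summary of how a severity level is handled (hypothetical)."""
    high_severity: bool
    on_call_paged: bool
    posted_publicly: bool
    direct_customer_impact: bool


def handling_for(severity: int) -> SeverityHandling:
    """Map a severity level (0-4) to the handling described in this article."""
    if severity not in (0, 1, 2, 3, 4):
        raise ValueError(f"unknown severity level: {severity}")
    if severity <= 2:
        # High severity: the on-call Incident Response team is paged and
        # the status page is updated.
        return SeverityHandling(
            high_severity=True, on_call_paged=True,
            posted_publicly=True, direct_customer_impact=True,
        )
    if severity == 3:
        # Handled through product engineering Slack channels, without paging.
        return SeverityHandling(
            high_severity=False, on_call_paged=False,
            posted_publicly=False, direct_customer_impact=True,
        )
    # Severity 4: a proactive response to a potential issue.
    return SeverityHandling(
        high_severity=False, on_call_paged=False,
        posted_publicly=False, direct_customer_impact=False,
    )
```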

Learn more

This completes Part 2, How Zendesk manages service incidents, of the Overview of incident management at Zendesk.

If you'd like to learn more, you can move on to the next part of this guide: Part 3: Monitoring a public Zendesk service incident.

Edited Nov 01, 2024 · Erin McKeown


Erin McKeown created an article, October 16, 2021 03:57

Article: Zendesk programs and services

Incident Management part 4: Post resolution incident analysis and reporting

This is part 4 of the overview of incident management at Zendesk. This guide contains the following parts:

Part 1: How Zendesk service issues become service incidents
Part 2: How Zendesk manages service incidents
Part 3: Monitoring a public Zendesk service incident
Part 4: Post resolution incident analysis and reporting (this article)

In this article, part 4, you'll learn how the incident response team conducts a retrospective that includes root cause analysis and remediation of service incidents and then assigns remediation items to the engineering team(s) that have ownership.

By conducting these activities, Zendesk Customer Support can share incident details and next steps with affected customers.

This article contains the following sections:

Conducting a service incident retrospective
Assigning remediation Items
Closing out a service incident

Conducting a service incident retrospective

Zendesk conducts a reflective exercise with all team members involved with the incident to examine and document the causes of the incident, the incident’s impact to customers and actions taken to mitigate or resolve it. The team reviews the identified root cause(s), and follow-up actions that will prevent the incident from recurring. This is known as a service incident internal retrospective. Incident retrospectives are shared publicly only for high severity incidents.

To ensure transparency and inclusion for all Zendesk teams, a Zendesk internal retrospective calendar is available so they can attend the internal retrospective meeting and get more information regarding service incidents and root causes. Outcomes of incidents are shared with all engineering teams and significant incident outcomes are highlighted and reviewed in the Zendesk weekly engineering meeting.

There are four main activities performed in a service incident retrospective:

Review the incident details contained in the Incident Document to anchor and orient the participants to the incident
Review and validate the Root Cause Analysis findings contained in the Incident Document
Identify and categorize any remediation work needed for Zendesk engineering teams to fully address the root causes that led to the service incident. All remediation items are agreed to by consensus of the retrospective attendees
Assign remediation work to the appropriate engineering teams with clear and appropriate SLAs defined.

High severity incident analysis

Once a high severity incident is resolved, the Incident Manager schedules a retrospective meeting that includes:

All team members who participated in the incident response
Engineers from teams whose products or services were affected by the incident
Teams who have ownership or a vested interest, such as:

Zendesk Customer Support
Product teams
Leaders who own affected products, services and areas of support

Every effort is made to hold the incident retrospective meeting within 72 hours of incident resolution, understanding that the timing of the meeting will depend on the complexity of the root cause and availability of team members across geographical regions.

After scheduling the incident retrospective, the engineering owner documents the root cause analysis and creates the Incident Document based on the following categories:

Incident Overview
Customer Impact
Technical Description
Root Cause and Service Information
Incident Details and Timings
Remediations

The Incident Document guides the incident retrospective and captures any remediation work that is identified to fully resolve the underlying issues that caused the incident.

There is an additional analysis phase conducted for severity 0-3 incidents known as Root Cause Analysis. This analysis gives the Engineering team a chance to understand and document the incident and define the work needed to fully fix the issues. This information is captured in the Incident Document.

Zendesk Incident Root Cause Analysis Process

Low severity incident analysis

Low severity incidents go through a leaner root cause and reporting phase than high severity incidents. While a formal incident retrospective meeting is not completed (unless requested by the Product Engineering owner) for low severity incidents, an Incident Document is created by the Product Engineering owner.

Root causes are identified, classified and shared with Engineering teams, and remediation items are added to the Product Engineering team backlog with SLAs. As with higher severity incidents, Zendesk seeks to learn and improve our engineering processes as a result of thoroughly investigating low severity incidents.

Since Severity 3 incidents have a minor impact on customers, the issue status and identified remediations are shared with affected customers who reached out about the incident via Zendesk Customer Support through a Zendesk ticket.

Severity 4 incidents by definition do not have direct customer impact. Post incident analysis is not communicated to customers, but the root causes are identified and remediations are addressed internally using the processes and procedures described above.

Assigning remediation Items

In order to ensure remediation items are completed, the Product Engineering team reviews the validated remediation items in the retrospective and performs the following actions:

Classify remediations as Preventive or General:

Preventive items are ones that would actively prevent a recurrence of the incident
General items are not solely preventive on their own but would resolve a core part of the incident’s circumstances
Prioritize the remediations to set the response SLAs. This exercise goes through the following activities:

Identify the engineering teams responsible for working the remediation item
Link the remediation item to the identified root cause that it addresses
Add the remediation item to the work backlog of the responsible engineering team
Add the remediation item to the engineering SLA report to track SLA achievement
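One way to picture these steps is as a small record plus an assignment helper. The sketch below is illustrative only: the class, field, and function names are hypothetical and do not describe Zendesk's internal tooling. SLA durations are left out because the priority chart is not reproduced in this text.

```python
from dataclasses import dataclass
from enum import Enum


class RemediationKind(Enum):
    PREVENTIVE = "Preventive"   # would actively prevent a recurrence
    GENERAL = "General"         # resolves a core part of the circumstances


@dataclass
class RemediationItem:
    description: str
    kind: RemediationKind
    root_cause: str          # the identified root cause this item addresses
    responsible_team: str    # the engineering team that will work the item


def assign(item, team_backlogs, sla_report):
    """Add the item to its team's backlog and to the SLA tracking report."""
    team_backlogs.setdefault(item.responsible_team, []).append(item)
    sla_report.append(item)


if __name__ == "__main__":
    backlogs, report = {}, []
    # Placeholder values, invented for the example.
    item = RemediationItem(
        description="example remediation",
        kind=RemediationKind.PREVENTIVE,
        root_cause="example root cause",
        responsible_team="example-team",
    )
    assign(item, backlogs, report)
    print(len(backlogs["example-team"]), len(report))
```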

Below is a chart that Product Engineering teams use to determine when a remediation is prioritized and planned for their work effort.

Zendesk Remediation Item Priority SLA

The Zendesk Customer Support team attending the retrospective creates the customer-facing descriptions of incident, root causes, and any remediations identified. This is posted to the Help Center article associated with the incident.

Service Availability Incident Example Continued

Here’s how an incident retrospective was conducted for this incident.

Four business days after the incident occurred, the Incident Response team and Engineers gathered to review the incident, collaborate on the root causes, and create or update the remediation items. All remediation items are agreed to by consensus of the meeting attendees.

Each person involved in the incident played a role in the incident retrospective.

The details reviewed and discussed in the meeting included:

Area	Example
Timeline	20:02 UTC - New container versions deployed to host services with updated certificates; 20:08 UTC - Container connectivity warnings start to appear; 20:37 UTC - First evidence of services not being able to connect to the new containers, thus causing service delay/interruption; 20:57 UTC - Zendesk internal service stops processing requests, causing timeout errors in Support, Guide & Talk applications hosted on pod 17; 21:02 UTC - Cluster autoscaler starts to create new containers for services that cannot be reached; 21:07 UTC - Full provisioning of service containers that will work with existing service configurations complete; 21:49 UTC - Cleanup of unreachable containers complete; 22:07 UTC - Incident is fully resolved
Root Causes	After the security certificate service changed, containers were not all rebuilt to pick up the changes encoded in the script. Containers that were not redeployed did not reference the correct security certificate provider and were not trusted by other Zendesk services and containers
Influencing Factors	We did not update the deployment scripts to properly reference the new security certificate provider when creating new containers; Deployed the new containers too quickly and widely to be able to adjust after failures started occurring; No automated rollback capability
Remediations	Change how security certificate compliance is evaluated when new containers are built and deployed; Add a different, more robust method for verifying certificates before launching new instances; Document the deployment strategy for horizontally scaled infrastructure; Enable automatic rollback of deployments if any alerts occur; Research how platform engineering can rebuild their infrastructure components more frequently; Discover how critical infrastructure can be made more distributed and fault tolerant
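The timeline in the table lends itself to a quick elapsed-time calculation, which is the kind of arithmetic a retrospective might use to discuss detection and recovery times. The sketch below uses the timestamps from the table with shortened event labels and assumes all events fall on the same UTC day. The helper name is hypothetical.

```python
from datetime import datetime

# Timestamps (UTC) taken from the timeline above; labels are shortened.
TIMELINE = [
    ("20:02", "New container versions deployed"),
    ("20:08", "Container connectivity warnings start"),
    ("20:37", "First evidence of services unable to connect"),
    ("20:57", "Internal service stops processing requests"),
    ("21:02", "Cluster autoscaler starts creating new containers"),
    ("21:07", "Working service containers fully provisioned"),
    ("21:49", "Cleanup of unreachable containers complete"),
    ("22:07", "Incident fully resolved"),
]


def minutes_between(start, end):
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    deployed_at = TIMELINE[0][0]
    for timestamp, label in TIMELINE:
        elapsed = minutes_between(deployed_at, timestamp)
        print(f"+{elapsed:>3} min  {label}")
```

On these figures, 125 minutes passed between the deployment and full resolution, and the first evidence of service impact appeared 35 minutes after the deployment.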

To produce a thorough analysis with concrete actions for the Engineering team, all team members provided input to recount the incident and develop remediation tasks. Once all questions or issues were addressed by the Incident Response team, the incident retrospective was considered complete.

The Zendesk Customer Support lead responsible for the public facing incident retrospective was asked at the end of the internal retrospective meeting if she had any questions or needed any additional information from the team to create the public documentation. She had no further questions and added the retrospective information below to the public service incident article in the Service Notifications section in our help center.

Public Retrospective Information for the Service Availability VM Incident

Three important outcomes of this incident retrospective that have improved Zendesk products and services were:

The root causes of the incident were identified and will be considered by all Zendesk product teams in future development

The remediations were identified and assigned to engineering teams with SLAs

The public retrospective was published by Zendesk Customer Support to the Help Center and was sent to affected customers who submitted tickets

Closing out a service incident

As a best practice, Zendesk closes any open tickets with customers to make sure everything is properly documented and communicated for the incident.

All completed service incidents are summarized in a weekly service incident digest report which is shared widely across Zendesk. Incident descriptions, customer impact and important remediations are included in this report and are also in a bi-weekly Operations Review report that is shared with Zendesk’s Executive team.

After retrospective information is published to the Help Center and open tickets are updated with the results from the retrospective, the analysis and reporting phase for the service incident is considered complete. Zendesk Customer Support links those tickets to the service incident and they are marked as closed.

Edited Nov 01, 2024 · Erin McKeown


Erin McKeown created an article, October 16, 2021 03:46

Article: Zendesk policies and agreements

Business Resilience

The mission of Zendesk’s Global Business Resilience Program is to ensure Zendesk can rapidly adapt and respond to business disruptions and safeguard people and assets while maintaining continuous business operations.

Zendesk achieves resiliency through four principal areas of focus: Business Continuity, Disaster Recovery, Incident Management and Crisis Management. Zendesk maintains our readiness by proactively assessing operational risks, establishing contingency plans, and administering incident response and crisis management training.

Zendesk assesses and mitigates potential business disruptions through our Business Resilience (BR) Program. Under this program, all critical business functions and locations are required to maintain and exercise alternate operation strategies. The Resilience Program team validates that each business unit’s resiliency strategies are effective and meet the policy established by the program. For critical business operations, we conduct internal/external audits of business continuity plans and moderate annual exercises to ensure the plans efficiently mitigate realistic disruptions and meet compliance certifications and memberships.

Zendesk maintains a risk framework that accounts for the evaluation of our facilities, technology, applications, data, processes and overall organization to ensure our risk mitigation strategy operates at multiple levels with broad coverage.

Within the Business Resilience program we maintain governance through our support model, Business Resilience Steering Committee and Incident Management Council.

In the event of a business disruption, we have plans designed to allow us to continue operations of critical functions. We accomplish this in part by:

Using redundant processing capacity at other locations.
Designing our technology and systems to support the recovery processes for critical business functions.
Using business and technology teams that are responsible for activating and managing the recovery process.
Exercising our recovery procedures and testing those procedures on a regular basis.

When it comes to disaster recovery, as part of our strategy, Zendesk leverages rigorous business impact and risk analysis to identify applications/services that are critical to each of our products. Amazon Web Services (AWS) is an Advanced Technology Partner of Zendesk. By building within the AWS environment, we benefit from all the partnership has to offer. Our applications/services are hosted in separate Availability Zones (AZ) using industry-standard practices to copy data across multiple AZs in real time.

In addition to all the business continuity and disaster recovery efforts for all Zendesk customers, some may prefer an additional level of redundancy and recoverability, which can be found through our Enhanced Disaster Recovery.

Edited Oct 28, 2021 · Erin McKeown


Erin McKeown created an article, October 16, 2021 03:32

Article: Zendesk programs and services

Incident Management part 1: How Zendesk service issues become service incidents

This is part 1 of the Overview of incident management at Zendesk. This guide contains the following parts:

Part 1: How Zendesk service issues become service incidents (this article)
Part 2: How Zendesk manages service incidents
Part 3: Monitoring a public Zendesk service incident
Part 4: Post resolution incident analysis and reporting

In this article, part 1, you'll get an understanding of the service incident life cycle at Zendesk, starting from when an incident is detected or reported to the ways Zendesk teams communicate and escalate the incident internally to how incident remediation works.

Before a service incident is created at Zendesk, our Engineering team may receive an alert, or tickets indicating that something unusual is happening might be raised with the Zendesk Customer Support team.

Service incident creation workflow

These issues generally come from two sources:

1. Zendesk Network Operations Center (ZNOC) receives an alert, which is then reviewed and validated by the Zendesk Product Engineering team for affected products (e.g., Support, Guide, Chat, Talk)

The ZNOC team has monitoring tools and processes to alert the Product Engineering team when there are issues with Zendesk products such as a product responding more slowly than expected, increased error rates, or if a service’s volume is changing at a greater rate than expected. Usually these alerts are the first indication that something is not working properly. The majority of service incidents are first discovered through the monitoring that Zendesk has built into our systems.

2. Customer Reports, routed through Zendesk Customer Support

Customer reports are another first indication that something isn't working properly. When customers notice Zendesk service issues, they report the issue by using Zendesk's own internal support system or by chatting with Zendesk Customer Support.

To make sure customer reported issues are handled appropriately, the Zendesk Customer Support team evaluates the reported issue and determines if it is part of an existing service incident, if a new service incident should be created, or if the issue reported should follow our standard troubleshooting path.

If there is an existing incident, the customer ticket will be linked to that incident and the customer will be informed of the current status.
If it's a new report, a new service incident will be created based on the severity scale (more on that in Part 2), and regular status updates will be sent to all linked tickets.
If the issue is not caused by a potential Zendesk service incident, Zendesk Customer Support will help the customer troubleshoot the issue and resolve or escalate as appropriate.
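The three outcomes above amount to a short decision sequence, sketched below. The enum, the function, and its two boolean inputs are hypothetical names chosen for illustration. In practice the evaluation is a judgment made by the Zendesk Customer Support team rather than two flags.

```python
from enum import Enum


class TriageOutcome(Enum):
    LINK_TO_EXISTING_INCIDENT = "link ticket to existing incident"
    CREATE_NEW_INCIDENT = "create new service incident"
    STANDARD_TROUBLESHOOTING = "follow standard troubleshooting path"


def triage(matches_existing_incident, looks_like_service_incident):
    """Pick the path for a customer-reported issue (simplified sketch)."""
    if matches_existing_incident:
        # The ticket is linked and the customer is told the current status.
        return TriageOutcome.LINK_TO_EXISTING_INCIDENT
    if looks_like_service_incident:
        # A new incident is created and rated on the severity scale.
        return TriageOutcome.CREATE_NEW_INCIDENT
    # Otherwise Customer Support troubleshoots and resolves or escalates.
    return TriageOutcome.STANDARD_TROUBLESHOOTING
```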

Learn more

This completes Part 1, How Zendesk service issues become service incidents, of the Overview of incident management at Zendesk.

If you'd like to learn more, you can move on to the next part of this guide: Part 2: How Zendesk manages service incidents.

Edited Nov 01, 2024 · Erin McKeown


Erin McKeown created an article, October 16, 2021 03:32

Article: Zendesk programs and services

Overview of Incident Management at Zendesk

Zendesk provides business critical functions for our customers. When the service of these products is interrupted—causing disruptions known as service incidents—Zendesk takes action and initiates an investigation and remediation process.

This process includes detection, reporting, analysis, and mitigation of incidents, as well as documentation and remediation steps to ensure that we learn from them. Zendesk seeks to restore the full function of services quickly and thoroughly to provide a trusted and reliable experience for customers.

The service incident management process has four main goals:

Restore normal operations of Zendesk services as quickly as possible
Provide meaningful information to customers during an incident to mitigate impact where possible and provide updates on remediation status
Perform detailed root cause analysis and identify permanent fixes once service is restored and share this analysis with customers to maintain trust in Zendesk services
Share lessons learned across engineering teams and track incident causes and remediations

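This guide describes the Zendesk incident management process in the following parts:

Part 1: How Zendesk service issues become service incidents
Part 2: How Zendesk manages service incidents
Part 3: Monitoring a public Zendesk service incident
Part 4: Post resolution incident analysis and reporting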

Edited Nov 01, 2024 · Erin McKeown
