Zendesk on Zendesk: Lifecycle of a Problem (and Incident) ticket

Zendesk on Zendesk is a day-long discussion about a specific topic and how Zendesk Support uses Zendesk. Each session is hosted by a member of our Support team.

This session is about how Zendesk Support tracks and resolves Problem and Incident tickets. It covers:

  • Processes for verifying, testing, and triaging Problem tickets
  • Using Problem and Incident ticket types and key ticket fields to track issues
  • Communicating details and updates to customers affected by the problem

This session is hosted by Joseph May, a Tier 3 Support Engineer in our San Francisco office.

Introduction

Last month my colleague Anna Rozentale discussed how we rolled out Zendesk Voice. This month I'll cover a broader topic, one that every support team faces: how do we take a number of inbound reports that stem from the same root cause, collect them so we can attack that root cause, and then solve them all at once?

Since every organization has limited resources, the ability to collate a number of reports into a unified issue allows us to marshal the talents and time of our Engineers and Operations team so that they are fighting one fire, rather than a hundred.

Zendesk provides this capability through the Problem and Incident ticket types. Here at Zendesk, we use this workflow ourselves to manage concerns ranging from outages, security incidents, and bugs to explaining new features and collecting tickets about an upcoming public event or seminar.

To use our own application to manage our own troubleshooting process effectively, we use Problem and Incident tickets whenever we have multiple tickets on a single subject that share either a time-based or project-based blocker. This gives us an easy and elegant way to view the issues and respond to our customers.

Today, we'd like to share the story of a real-world challenge that we faced one day.


Triage

At Zendesk, we use a dedicated triage agent during normal business hours to monitor, evaluate, and route inbound written communication. The triage role is essentially the first responder, and the inbound request pool is like the waiting room at the Emergency Room (but hopefully with a better experience).

During global business hours, a new request will typically be viewed, assessed for severity, and routed to the appropriate team within minutes.

On April 10th, we received an inbound request for help:

Within minutes, the team began investigating. After internal discussion, the issue was escalated to our second tier of support within thirty minutes.


Testing and replication

One thing that customer advocates often struggle with is the fine balancing act between discovering an issue with a product and being able to replicate it in a sanitized environment or in a test account. More simply: how do we determine that "it's not you, it's me"?

While we actively monitor and have alarm mechanisms in place, we rely heavily on customer reports in addition to our automated systems due to the sheer scope of operating a cloud application.

Since there are limitless hardware, software, and network environments, customer reports are critical in the early diagnosis of an issue and in separating a local issue from a larger Zendesk issue.

Once a ticket is received from the triage agent, our team actively begins to troubleshoot the issue and, whenever possible, attempts to replicate it in an agent environment to rule out the possibility of misconfiguration.

In our report on April 10th, our agents moved steadily to replicate the issue. After a quick discussion, they determined that the issue affected a number of instances, and they saw similar results:

It quickly became apparent that we needed to get our Operations team involved. An internal Red Alert fired and created a Problem ticket for our Operations team.

Once an issue is confirmed here at Zendesk, an agent will create a formal Problem ticket and escalate to Engineering if it is a software issue or to Operations for an infrastructure issue.
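
The routing rule above can be sketched in a few lines of Python. This is only an illustration, not Zendesk's internal tooling: the `issue_kind` labels, the `escalate_to` key, the function name, and the sample subject and steps are all hypothetical, and the result is a plain dictionary rather than an API payload.

```python
# Hypothetical mapping from the kind of confirmed issue to the team that
# receives the Problem ticket, following the rule described above.
ESCALATION_TEAMS = {
    "software": "Engineering",
    "infrastructure": "Operations",
}


def build_problem_ticket(subject, issue_kind, replication_steps):
    """Return a plain dict describing a Problem ticket and where it goes."""
    if issue_kind not in ESCALATION_TEAMS:
        raise ValueError(f"unknown issue kind: {issue_kind!r}")
    if not replication_steps:
        # Assumption: an escalation without replication steps is rejected.
        raise ValueError("at least one replication step is required")
    return {
        "type": "problem",
        "subject": subject,
        "escalate_to": ESCALATION_TEAMS[issue_kind],
        "replication_steps": list(replication_steps),
    }


# Made-up example values, for illustration only.
ticket = build_problem_ticket(
    "Example: pages fail to load",
    "infrastructure",
    ["Sign in as an agent", "Open any ticket", "Observe that the page times out"],
)
print(ticket["escalate_to"])
```

Rejecting an unknown issue kind instead of guessing a default keeps a mislabeled report from landing silently with the wrong team.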

Since Zendesk is a global team with a number of agents, we use a macro to gather information in a standardized format and ensure it's escalated in a standardized way.

We do this for two primary reasons:

  • To ensure that key information is not accidentally omitted.
  • To carefully translate the issue from Human to Engineer by providing clear and concise troubleshooting and replication steps for the team.

Here's the escalation to our developers:


Incident linkage and mitigation

While our Support team was hard at work diagnosing, testing, and ultimately creating a Problem ticket, two key parties were working diligently as liaisons between our Operations, Development, and Support teams: an incident lead and a communications lead.

  • The incident lead assists with troubleshooting, gathering additional technical data, handling direct testing requests from the Operations and Development teams, and monitoring open incidents.
  • The communications lead handles Twitter channels for @Zendesk and @ZendeskOPS.

While the team was diagnosing the incoming tickets, our communications lead put up a public forum post. From there, we were able to direct incoming traffic, add relevant updates, and afterward publish a full post-mortem report on the root causes of the problem, with specific steps that would be taken to prevent similar incidents in the future.

With a Problem ticket in place, our support team was then able to link inbound tickets to the Problem as Incidents. Collecting all incoming reports of the issue is not only valuable for communication during a crisis, but also provides excellent reporting metrics and analysis afterward.
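
In data terms, linking comes down to two ticket fields. The sketch below assumes the `type` and `problem_id` fields of the Zendesk Tickets API (check the current API reference before relying on them). The helper names and sample tickets are made up, and no network call is made.

```python
from collections import Counter


def link_incident_payload(problem_id):
    """Body for a ticket update that links a ticket to a Problem as an Incident."""
    return {"ticket": {"type": "incident", "problem_id": problem_id}}


def incidents_per_problem(tickets):
    """Count linked Incidents for each Problem ID (useful for reporting)."""
    return Counter(
        t["problem_id"]
        for t in tickets
        if t.get("type") == "incident" and t.get("problem_id") is not None
    )


# Made-up ticket records, for illustration only.
tickets = [
    {"id": 101, "type": "problem"},
    {"id": 102, "type": "incident", "problem_id": 101},
    {"id": 103, "type": "incident", "problem_id": 101},
    {"id": 104, "type": "question"},
]
counts = incidents_per_problem(tickets)
print(counts[101])  # 2
```

Counting Incidents per Problem like this is the simplest version of the after-the-fact reporting mentioned above: it shows at a glance how many customers a given issue touched.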

If an incoming report appears to be related to an existing Problem but needs further investigation, we take advantage of two key ticket statuses: Pending and On-hold.

Pending status is typically used when you're waiting for a customer reply, and that's exactly how we use it for Incidents linked to a Problem. If we have reached out to the customer for additional clarification, setting the ticket to Pending gives us a visual cue that distinguishes it from an inbound report that is not actionable.

We use the On-hold status when the customer has been updated, but we're waiting for a fix from our Development or Operations team.

The beauty of these two statuses is that if the customer replies, the ticket will automatically reopen so the assigned agent (and incident lead) can review it and take care of the customer.
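
The status rules described above can be written down as a small decision function. This is a sketch of the workflow as described in this article, not of Zendesk's internals: the function names are hypothetical, and the status strings (in particular `"hold"` for On-hold) are assumed to match the values used by the Zendesk API.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    PENDING = "pending"
    ON_HOLD = "hold"  # assumed API value for the On-hold status


def status_after_agent_update(needs_customer_reply, waiting_on_fix):
    """Pick the status an agent sets on a linked Incident."""
    if needs_customer_reply:
        # We asked the customer for clarification and are waiting on them.
        return Status.PENDING
    if waiting_on_fix:
        # The customer has been updated; we are waiting on Dev or Ops.
        return Status.ON_HOLD
    return Status.OPEN


def status_after_customer_reply(current):
    """A customer reply reopens a Pending or On-hold ticket."""
    if current in (Status.PENDING, Status.ON_HOLD):
        return Status.OPEN
    return current


print(status_after_agent_update(needs_customer_reply=False, waiting_on_fix=True))
print(status_after_customer_reply(Status.PENDING))
```

Checking for an outstanding customer question first means a ticket that is waiting on both the customer and a fix shows as Pending. That ordering is a design choice in this sketch, not something the article prescribes.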


Operations and Engineering escalation

After an issue has been diagnosed, linked up with other Incidents, and escalated, the support role shifts to managing customer reports and standing ready to assist the engineering team.

Within ten minutes of escalating the issue to our Operations team, the root cause of the issue was discovered to be a faulty circuit that had recently been installed in our datacenter. After installation, it failed, preventing traffic from flowing to the desired destinations.

Our team acted quickly to eliminate that circuit as a valid pathway, which restored normal traffic patterns.


Messaging and communication

Once our Operations team restored traffic to normal behavior, an All Clear was sounded via social media channels. Our support team returned to the Problem ticket and began to craft an outgoing message for all linked Incidents.

At Zendesk, we try to ensure a few key areas are addressed during the closure of a Problem ticket:

  • Relay to the customer that the problem has been resolved
  • Share as much information as possible about the root causes of the issue
  • Provide a link to the forum post and encourage them to Subscribe to it, so they will be notified via email of any updates
  • Encourage them to reach out if they experience any issues or have further concerns

Once our Operations team has done a full post-mortem, we update our forum post describing the root causes with as much public information as we can share:


Final thoughts

Thank you for joining us on this journey through the lifecycle of a Problem Ticket! We hope that some of this information might assist you with your own workflows.

We'd love to hear your thoughts, feedback, and more importantly, how you handle problems and Problem Tickets in your own Zendesk. 

18 comments

  • -1
    Ryan Hester

    If we can't make this event, will it be recorded for later viewing? 

  • 2

    Hi Ryan,

    This is actually an article and online discussion, so there won't be anything to record. :) You can check back on this page for the full article and comments anytime.

    Nora 

  • 0

    We don't currently use this feature so it would be great to see how y'all use it! We do link tickets to JIRA though.

  • 0

    Looking forward to hearing your ideas

  • 1

    Hey folks, the discussion is now live. I'd love to hear how your organization utilizes problem tickets.

  • 2

    Joseph, thanks for taking the time to post. It all seems like a robust process (as one would expect), but do you find yourself unlinking tickets on a regular basis?

    Last time I tried problem/incident, my agents could not reliably ascertain connections. Does your engineering team provide internal updates that help confirm the connection?

  • 3

    We use Problem and Incident ticket types and appreciate the options.  We don't have On-Hold activated but will consider it.  Wish the On-Hold feature would stop the clock so the open ticket wouldn't indicate such a long period of resolution and throw off stats.

  • 1

    Hi Colin-

    That's a great question. Overall I would say that reliability is high, as every linked incident has an owner who linked it in the first place. When I was very new, I attached an Incident ticket to a Problem erroneously, and it led to an extended support engagement. I believe this is a great opportunity to address workflow, should this become the case too often. I rarely see this happen these days.

    As far as internal updates, often our engineers will look at incidents as well (especially if replication isn't feasible). If it isn't the same/attached erroneously they reach out to the agent for further review.

  • 1

    Hi Janelle- you can use the ([Text Field] Duration in minutes) fact for creating custom resolution metrics. I would recommend looking at this article about Building Custom Metrics for the Events Model in Insights.

    Another option would be to create a custom metric that subtracts On Hold Time from Full Resolution Time.
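
The subtraction behind that second option is simple arithmetic. The following is a hedged sketch in Python rather than an Insights metric definition: the function name and the sample durations are made up for illustration.

```python
def resolution_minutes_excluding_hold(full_resolution_minutes, on_hold_minutes):
    """Full resolution time minus the time the ticket spent On-hold."""
    if full_resolution_minutes < 0 or on_hold_minutes < 0:
        raise ValueError("durations must be non-negative")
    if on_hold_minutes > full_resolution_minutes:
        raise ValueError("on-hold time cannot exceed full resolution time")
    return full_resolution_minutes - on_hold_minutes


# Made-up numbers: a ticket open for 3 days (4320 minutes) that spent
# 2 of those days (2880 minutes) On-hold.
print(resolution_minutes_excluding_hold(4320, 2880))  # 1440
```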

  • 1

    @Joseph, thank you!

  • 1

    Thanks! @Joseph great article.

  • 1

    @Joseph, at what point in team size did this process make sense to implement?  Or, has it been there since the beginning of Zendesk on Zendesk? 

  • 1

    Hi Makenzie-

    Great question, and one I don't know the exact details on - I would say day 1, if I had to venture a guess.

    What I did do was go back to look at the oldest recorded problem ticket we have - the team was a lot smaller back then, but the Problem/Incident workflow was around very early on.

    Was there anything specific in regards to your own Zendesk and how you use it that you would care to share, or have a question about?

  • 2

    Hi Joseph!

    I recently experienced something that would have been very well handled by the Problem/Incident workflow.  I spent a fair amount of time manually posting replies across multiple incidents regarding the same problem.  I can certainly see the benefit of tracking Problems in this way!

    Thanks for the great ideas/info!

    Makenzie

  • 0

    @Makenzie - in my own Zendesk implementation, I'll use Problem/Incident workflows for something like a cancellation of a number of clients.

    I see a number of students weekly - if I have a gig, and need to reschedule, I make a Problem ticket out of it, and Incident each Student, so I can be sure that none fall through the cracks.  Since adopting this workflow, it's saved me a lot of headaches, and allows me to better take care of my people :)

  • 1

    Thanks @dr. j!  Great application of the Problem/Incident workflow!

     

  • 1

    This is cool! 

    We're a bit new to ZD, so I'm trying to get my use-cases correct. 

    Am I right in thinking that Problem/Incident are best used for emergencies like outages, or situations where the message you use to solve the Problem will also apply to the Incidents? 

    One of our products doesn't have a lot of outages, but it does have consistent pain-point bugs. My initial idea with Problem/Incident was to link every report of a bug or improvement request to a Problem, link that problem with the associated JIRA, and we're good to go. When the JIRA's solved, it will update the Problem, and we can communicate with the subset of customers who were most interested in that issue's resolution. 

    Do any of you use Problem/Incident that way? Can you personalize the solve message enough with Placeholders? (https://support.zendesk.com/hc/en-us/articles/203662116-Using-placeholders)

    We're really low-volume, high-touch support--bunch of universities--so Problem/Incident would mostly help me with reporting. 

  • 0

    Hi there Erik! - thanks for reaching out, and welcome to the family!

    Since many of our beloved customers are in the software industry, the natural assumption is that a "Problem" means something like an outage, but I also like to use Problem tickets as a way to aggregate similar issues.

    To address your JIRA question: yes, we do almost the very same thing. Personally, I feel that a JIRA issue should only be linked to a Problem ticket (one which our agents are reviewing), not a customer ticket (an Incident), so the communication can be streamlined and consistent.

    I use Zendesk in my own small business, and like you, I don't have "infrastructure outages", but I do have occasions when I have to communicate with several customers (individually), and later as a whole, on a related subject.

    One example: if I'm headed out of town and need to reschedule a bunch of students, I create a Problem ticket for myself, then Incidents for each student until they're taken care of. Then I solve the Problem ticket, either with:

    • a nice public note - See you soon!
    • no public comment (since they've been addressed individually)

    Pro tip: also use placeholders. The ticket.requester.first_name placeholder in particular is one that I use a lot.

    If you really want to get sneaky - also take a look at using user, or org fields - this might help get you started.
