Zendesk on Zendesk is a discussion about a specific topic and how Zendesk Support uses Zendesk. Each session is shared by a member of our Support team.
This session is about how Zendesk Support tracks and resolves Problem and Incident tickets. It covers:
- Processes for verifying, testing, and triaging Problem tickets
- Using Problem and Incident ticket types and key ticket fields to track issues
- Communicating details and updates to customers affected by the problem
This month I’ll be covering a broader topic, a challenge every support team faces: how can we take a number of inbound reports stemming from the same root cause and collect them, so that we can attack the root cause and then solve them all simultaneously?
Since every organization has limited resources, the ability to collate a number of reports into a unified issue allows us to marshal the talents and time of our Engineering and Operations teams so that they are fighting one fire, rather than a hundred.
Zendesk provides this capability through the Problem and Incident ticket types. Here at Zendesk, we use this workflow ourselves to manage concerns ranging from outages, security incidents, and bugs to explaining new features and collecting tickets about an upcoming public event or seminar.
To use our own application to manage our own troubleshooting process effectively, we use Problem and Incident tickets whenever we have multiple tickets on a single subject that share either a time- or project-based blocker. This gives us an easy and elegant way to view the issues and respond to our customers.
Today, we'd like to share the story of a real-world challenge that we faced one day.
At Zendesk, we use a dedicated triage agent during normal business hours to monitor, evaluate, and route inbound written communication. The triage role is essentially the first responder, and the inbound request pool is like the waiting room at the Emergency Room (but hopefully with a better experience).
During global business hours, a new request will typically be viewed, assessed for severity, and routed to the appropriate team within minutes.
On April 10th, we received an inbound request for help:
Within minutes, the team began investigating. After internal discussion, the issue was escalated to our second tier of support within thirty minutes.
Testing and replication
One thing that customer advocates often struggle with is the fine balancing act between discovering an issue with a product and being able to replicate it in a sanitized environment or in a test account. More simply: how do we determine that "it's not you, it's me"?
While we actively monitor and have alarm mechanisms in place, we rely heavily on customer reports in addition to our automated systems due to the sheer scope of operating a cloud application.
Since there are limitless hardware, software, and network environments, customer reports are critical in the early diagnosis of an issue and in separating a local issue from a larger Zendesk issue.
Once a ticket is received from the triage agent, our team actively begins to troubleshoot the issue and, whenever possible, attempts to replicate it in an agent environment to rule out the possibility of misconfiguration.
With the report on April 10th, our agents moved steadily to replicate the issue. After a brief discussion and some collaboration, they were able to discern that the issue affected a number of instances and saw similar results:
It was quickly apparent that we needed to get our Operations team involved. An internal Red Alert fired and created a Problem ticket for our Operations team.
Once an issue is confirmed here at Zendesk, an agent will create a formal Problem ticket and escalate to Engineering if it is a software issue or to Operations for an infrastructure issue.
Since Zendesk is a global team with a number of agents, we use a macro to gather information in a standardized format and ensure it's escalated in a standardized way.
We do this for two primary reasons:
- To ensure that key information is not accidentally omitted.
- To carefully translate the issue from Human to Engineer by providing clear and concise troubleshooting and replication steps for the team.
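The idea behind a standardized escalation can be sketched in a few lines of Python. This is only an illustration of the two goals above, not the macro we use: the `Escalation` class and all of its field names are hypothetical, chosen to show a fixed format and a check for omitted information.

```python
from dataclasses import dataclass, fields


@dataclass
class Escalation:
    """Hypothetical escalation template; the field names are illustrative only."""
    summary: str
    affected_accounts: str
    steps_to_replicate: str
    expected_result: str
    actual_result: str


def missing_fields(escalation: Escalation) -> list[str]:
    """Return the names of fields left blank, so nothing is omitted by accident."""
    return [f.name for f in fields(escalation)
            if not getattr(escalation, f.name).strip()]


def render(escalation: Escalation) -> str:
    """Render the escalation as plain text, always in the same field order."""
    blank = missing_fields(escalation)
    if blank:
        raise ValueError("missing required fields: " + ", ".join(blank))
    lines = []
    for f in fields(escalation):
        label = f.name.replace("_", " ").capitalize()
        lines.append(f"{label}: {getattr(escalation, f.name).strip()}")
    return "\n".join(lines)
```

Refusing to render an incomplete escalation mirrors the first goal, and the fixed field order mirrors the second: the receiving team always finds the replication steps in the same place.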
Here's the escalation to our developers:
Incident linkage and mitigation
While our Support Team was hard at work diagnosing, testing, and ultimately creating a Problem Ticket, two key parties were working diligently as liaisons between our Operations, Development, and Support teams: an incident lead and a communications lead.
- The incident lead assists with troubleshooting, gathering additional technical data, handling direct testing requests from the Operations and Development teams, and monitoring open incidents.
- The communications lead handles Twitter channels for @Zendesk and @ZendeskOPS.
While the team was diagnosing the incoming tickets, our communications lead put up a public forum post. From there, we were able to direct incoming traffic, add relevant updates, and afterward, post a full post-mortem report on the root causes of the problem, with specific steps that would be taken in the future to prevent similar incidents.
With a Problem ticket in place, our support team was then able to link inbound tickets to the Problem as Incidents. Collecting all incoming reports of the issue is not only valuable for communication in a time of crisis, but also provides excellent reporting metrics and analysis afterwards.
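For teams that link tickets programmatically rather than in the agent interface, the linkage can be sketched as follows. This is a minimal example that only builds the request bodies and sends nothing. It assumes the `type` and `problem_id` ticket properties and the `PUT /api/v2/tickets/{id}` update endpoint of the Zendesk Tickets API; the function names are hypothetical, and authentication and error handling are left out.

```python
import json


def incident_link_payload(problem_id: int) -> dict:
    """Build the body of a ticket update that links a ticket to a Problem.

    Assumes the Tickets API properties `type` and `problem_id`.
    """
    if problem_id <= 0:
        raise ValueError("problem_id must be a positive ticket ID")
    return {"ticket": {"type": "incident", "problem_id": problem_id}}


def link_requests(problem_id: int, ticket_ids: list[int]) -> list[tuple[str, str]]:
    """Return (path, JSON body) pairs, one per inbound ticket to link.

    The Problem ticket itself is skipped if it appears in `ticket_ids`.
    Sending the requests is deliberately outside the scope of this sketch.
    """
    body = json.dumps(incident_link_payload(problem_id))
    return [(f"/api/v2/tickets/{tid}", body)
            for tid in ticket_ids if tid != problem_id]
```

Keeping payload construction separate from the network call makes it easy to review exactly what would change on each ticket before anything is sent.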
Should an incoming report appear to be related to an existing Problem but need further investigation, we take advantage of two key ticket statuses: Pending and On-hold.
Pending status is typically used when you're waiting for a customer reply, and that's exactly how we use it for Incidents linked to a Problem. If we have reached out to the customer for additional clarification, setting the ticket to Pending gives us a visual cue marking its distinction from an inbound report that is not actionable.
We use the On-hold status when the customer has been updated, but we're waiting for a fix from our Development or Operations team.
The beauty of these two statuses is that if the customer replies, the ticket will automatically reopen so that the assigned agent (and incident lead) can review it and take care of the customer.
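The status rules described above can be summarized as a small model. This is a sketch of the workflow as described in this article, not of Zendesk internals; the `Status` enum and both function names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    OPEN = "Open"
    PENDING = "Pending"    # we are waiting for a customer reply
    ON_HOLD = "On-hold"    # customer is updated; we are waiting for a fix


def status_for_incident(needs_customer_info: bool) -> Status:
    """Pick the status for an Incident linked to a Problem."""
    return Status.PENDING if needs_customer_info else Status.ON_HOLD


def on_customer_reply(status: Status) -> Status:
    """A customer reply reopens a Pending or On-hold ticket for review."""
    if status in (Status.PENDING, Status.ON_HOLD):
        return Status.OPEN
    return status
```

Under this model, any linked Incident that returns to Open signals that a customer has replied and that the assigned agent should take another look.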
Operations and Engineering escalation
After an issue has been diagnosed, linked up with other Incidents, and escalated, the support role shifts to managing customer reports and standing ready to assist the engineering team.
Within ten minutes of escalating the issue to our Operations team, the root cause of the issue was discovered to be a faulty circuit that had recently been installed in our datacenter. After installation, it failed, preventing traffic from flowing to the desired destinations.
Our team acted quickly to eliminate that circuit as a valid pathway, which resulted in a resumption of normal traffic patterns.
Messaging and communication
Once our Operations team restored traffic to normal behavior, an All Clear was sounded via social media channels. Our support team returned to the Problem ticket and began to craft an outgoing message for all linked Incidents.
At Zendesk, we try to ensure a few key areas are addressed during the closure of a Problem ticket:
- Relay to the customer that the problem has been resolved
- Share as much information as possible about the root causes of the issue
- Provide a link to the forum post and encourage them to Subscribe to it, so they will be notified via email of any updates
- Encourage them to reach out if they experience any issues or have further concerns
Once our Operations team has done a full post-mortem, we update our forum post describing the root causes with as much public information as we can share:
Thank you for joining us on this journey through the lifecycle of a Problem Ticket! We hope that some of this information might assist you with your own workflows.
We'd love to hear your thoughts, feedback, and more importantly, how you handle problems and Problem Tickets in your own Zendesk.