What is the problem?
We use PagerDuty with trigger words (and a few other things) to catch issues that we know will seriously disrupt customers' workflows - so far, this has worked decently well. However, some issues aren't major, so we don't have trigger words for them (everything would set off PagerDuty otherwise!). That's OK, unless we get a whole bunch of them at once, which would indicate a larger platform-level problem that might not be caught by our error logging tools. In that case our only indicator is a sudden increase in ticket volume (for us it's something like 10+ tickets within 5 minutes, but higher- or lower-volume shops will have different thresholds).
Why is it a problem?
We have a contractual obligation to respond to system-wide defects. We recently discovered that they can show up this way, as a burst of individually minor tickets, so we want to track it, but currently can't.
How do you solve the problem today?
We don't, really: we're running on hope and a prayer that our PagerDuty triggers and engineering logging alerts are enough.
How would you ideally solve the problem?
My first thought was an automation that tracks ticket volume (i.e. "x number of newly created tickets within y period of time") and that we could set to notify PagerDuty when we need to look at the queue ASAP.
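To make the "x tickets within y time" rule concrete, here is a rough sketch of the kind of check I have in mind, written as a sliding-window counter in Python. Everything in it is an assumption for illustration only: the `TicketSpikeDetector` class, the `on_spike` callback, and the `cooldown` setting are hypothetical names, not an existing product feature, and it assumes tickets are fed in order of creation time.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Callable, Deque, Optional


class TicketSpikeDetector:
    """Hypothetical helper: calls `on_spike` when `threshold` tickets arrive within `window`."""

    def __init__(self, threshold: int, window: timedelta,
                 on_spike: Callable[[int, datetime], None],
                 cooldown: Optional[timedelta] = None) -> None:
        self.threshold = threshold
        self.window = window
        self.on_spike = on_spike
        # Assumption: suppress repeat alerts for one window length by default.
        self.cooldown = cooldown if cooldown is not None else window
        self._created: Deque[datetime] = deque()
        self._last_alert: Optional[datetime] = None

    def record(self, created_at: datetime) -> bool:
        """Record one new ticket; return True if this ticket triggered an alert."""
        self._created.append(created_at)
        # Drop tickets that have fallen out of the window.
        cutoff = created_at - self.window
        while self._created and self._created[0] <= cutoff:
            self._created.popleft()
        if len(self._created) < self.threshold:
            return False
        # Still inside the cooldown from the previous alert: stay quiet.
        if self._last_alert is not None and created_at - self._last_alert < self.cooldown:
            return False
        self._last_alert = created_at
        self.on_spike(len(self._created), created_at)
        return True


if __name__ == "__main__":
    # Our numbers: 10+ tickets within 5 minutes.
    detector = TicketSpikeDetector(
        threshold=10,
        window=timedelta(minutes=5),
        on_spike=lambda count, at: print(f"{count} tickets in 5 minutes as of {at}"),
    )
    start = datetime(2024, 1, 1, 9, 0, 0)
    for i in range(12):
        detector.record(start + timedelta(seconds=20 * i))
```

In practice the `on_spike` callback is where the PagerDuty notification would go (for example through its Events API), and the threshold and window would need to be configurable per account since every shop's normal volume is different.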
How big is the problem (business impact, frequency of impact, who is impacted)?
So far it's only happened a handful of times, but I absolutely see it happening much more frequently as our business grows, potentially a few times per quarter (which is pretty bad for customer experience if we don't catch it). Our agents and leads are impacted since they don't necessarily see the whole picture in terms of volume, and our customers can't work effectively (we make public safety software, so this is quite a large issue).