Improved notifications and recovery for target failures
According to "Notifying external targets," an external target is automatically deactivated after 21 consecutive failed attempts. For high-volume targets, even a very short disruption can reach this threshold. (Imagine a target that is notified whenever a ticket is solved, for instance.) If something's down for even a minute, targets can wind up disabled.
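To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes every notification sent during the outage fails exactly once and that notifications arrive at a steady rate; the rates are made-up examples, not measurements from any real account.

```python
# Consecutive failures before a target is deactivated, per the article quoted above.
FAILURE_THRESHOLD = 21


def seconds_until_disabled(notifications_per_minute: float) -> float:
    """Seconds of continuous downtime needed to reach the failure threshold.

    Assumes a steady notification rate and one failed attempt per notification.
    """
    return FAILURE_THRESHOLD * 60 / notifications_per_minute


# Hypothetical notification rates, purely for illustration.
for rate in (1, 10, 60):
    print(f"{rate}/min: disabled after {seconds_until_disabled(rate):.0f}s of downtime")
```

Under those assumptions, a target that fires once a second is disabled by an outage shorter than half a minute.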
When a target is automatically disabled, an email is sent to all admins, and that's that. (This is an improvement over only notifying the account owner -- thanks.) For the high-volume targets most likely to be disabled by this process, the time it takes to receive this email and act on it can result in a lot of lost data.
Here's the message:
>The target '[some target]' has been temporarily disabled due to too many failures. You can re-enable the target from Settings > Extensions > Targets to continue sending messages to the target, but please check the possible reason of these failures by testing the target first.
Please consider these changes to make this situation easier to deal with. I've loosely ordered them by estimated complexity.
- Include a link directly to the Settings > Extensions > Targets page, instead of making me navigate there myself.
- Make it easier to integrate with incident response platforms by allowing me to specify an email address for these "target disabled" notifications.
- Include a log of the 21 failed attempts that triggered the deactivation.
- Keep a log of all attempts to notify the target, even beyond the initial 21, so they can be backfilled later.
- Attempt reactivating the target at increasing intervals. For long-running targets, the problem is usually temporary downtime, not any issue with the target's configuration.
- Allow me to increase the threshold from 21 attempts for particularly high-volume targets, or automatically scale this number based on how frequently the target is used.
- Retry failed attempts after the target is activated.
These are all important improvements to me, but if I could have only one -- let me export failed calls, including ones that are attempted when the target has been disabled, so I can backfill them later. Otherwise, give me greater control over the notifications, so I can fight the diffusion of responsibility that sets in with "all admins" notifications.
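To make the request a bit more concrete, here's a minimal sketch of the behavior I have in mind: keep every failed or skipped call in a backlog, probe the target at increasing intervals, and replay the backlog once it responds again. I have no idea how targets are implemented internally, so every name here (`Target`, `FlakyEndpoint`, `replay_backlog`, `backoff_delays`) is hypothetical, and the backoff numbers are arbitrary.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Callable, Iterator, List


class FlakyEndpoint:
    """Stand-in for a real external service (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[str] = []

    def __call__(self, payload: str) -> bool:
        if self.up:
            self.received.append(payload)
        return self.up  # True means the call succeeded


@dataclass
class Target:
    send: Callable[[str], bool]  # returns True on success
    threshold: int = 21          # configurable, instead of a fixed 21
    active: bool = True
    consecutive_failures: int = 0
    backlog: List[str] = field(default_factory=list)

    def notify(self, payload: str) -> None:
        if not self.active:
            # Keep calls made while the target is disabled so they can be backfilled.
            self.backlog.append(payload)
            return
        if self.send(payload):
            self.consecutive_failures = 0
            return
        # Failed calls are logged too, not just counted.
        self.backlog.append(payload)
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.active = False

    def replay_backlog(self) -> bool:
        """Resend logged calls in order; reactivate only if all of them succeed."""
        while self.backlog:
            if not self.send(self.backlog[0]):
                return False  # still down; keep the rest for the next probe
            self.backlog.pop(0)
        self.active = True
        self.consecutive_failures = 0
        return True


def backoff_delays(base: float = 60, factor: float = 2, cap: float = 3600) -> Iterator[float]:
    """Yield increasing wait times (in seconds) between reactivation probes."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


endpoint = FlakyEndpoint()
endpoint.up = False                        # simulate an outage
target = Target(send=endpoint, threshold=3)
for i in range(5):
    target.notify(f"ticket-{i}")
print(target.active, len(target.backlog))  # target is disabled, all calls are kept
print(list(islice(backoff_delays(), 4)))   # wait times before each probe
endpoint.up = True                         # service recovers
print(target.replay_backlog(), endpoint.received)
```

Even without automatic replay, just exposing that backlog as an export would cover the one thing I need most.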
YES!!! How are more people not blowing up this thread??!?!
We have records of failed attempts where the attempt was sent to a Redis-based queue, which is pretty much the lightest thing our server has to do.
At the timestamp of the "failed attempt," we have NO record of any downtime, errors, or anything close to maxed-out CPU or memory.
WE NEED LOGS!!!
I would love all the same things Collin has requested, but I would start the list at #3, and put the link and the response platforms at the bottom of the list.
We need, need, need logs so we know why something failed, and logs while it is disabled so we can back-fill manually.
Just got this error but there are no failures saved under my Channels > API > Target Failures section. :(
Completely agree with everything Colin said above! This would be a vital improvement to the current setup. We are currently trying to figure out how to integrate alert-based monitoring for this, but we are having trouble finding the best approach, because emails are only being sent to the agents and can often get missed. That means it can be hours before the issue is resolved, and there is no option to retry any previously failed attempts, short of manually going through all tickets since the outage!
I agree, we should be able to turn off the automatic deactivation of extensions.
We need a way for our systems to know the trigger was disabled so we can react accordingly. Much needed! We are having to work around this just so we can improve the availability of our system.
This is beyond frustrating. Targets are disabled and we can't even figure out why. And no response is given to remedy the situation.
+1, me from 3 years ago. Again had a big issue caused by how Zendesk handles target failures.
+1 also experiencing disabled targets, but no evidence of our system receiving any requests.