Sometimes we have processes that stop executing because of database connectivity

Sometimes we have processes that stop executing because of database connectivity issues, engine issues, etc. The alerts show:

Problem: An error occurred in executing an Activity Class.
Recommended Action: Examine the activity class to correct the error and then resume.

The process dashboard shows the process as active with 'There is a problem with this process.' and the orange triangle with the ! mark. I can monitor the process, and most of the time there is a red box and slash around the node in question. Sometimes, because the node has been processed more than once, there is only a normal blue box. I can cancel the spawned node (the one with 'Problem with task') and restart it just fine (after we bounced the server, etc.).

1) Is there a way to see all processes that are hung? The report 'Processes with Problems' doesn't show any.
2) Is there a way to mass restart all processes that are hung? Currently, opening hundreds of alerts, moni...

OriginalPostID-104760

  • To find broken processes, we use a report for all models and filter on 0 active tasks (num_active_tasks()=0). This does a pretty good job, but doesn't catch the scenarios where you have 2 tasks active at once (and one could be broken), and it also picks up processes that are holding intentionally, etc.

    For mass restarts, I began setting 'Pause on Error' to false() on all DB nodes. The error flag is recorded, and if the connection was not successful the process sends an alert to the developers group (with the error message) and waits at a Receive Message node. Once the DB is back online, I run a process model (DB Restarts) that sends a message to all processes to release the hold. This works fantastically: this week we had 100 broken processes due to a connection failure, and it took me less than 1 minute to browse to the restart model, fire it off, and have all processes resume.
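
The two ideas in this reply (filtering on zero active tasks, and holding failed processes until a single release message arrives) can be modeled outside the platform. The following is only a rough Python sketch of the pattern, not Appian code: `ProcessInstance`, `find_suspects`, `run_db_step`, and `release_all` are hypothetical names invented for illustration, and it assumes that a released process simply retries its database step.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessInstance:
    """Toy stand-in for a running process (hypothetical, not an Appian object)."""
    process_id: int
    active_tasks: int = 1
    held_for_db: bool = False  # True while waiting for the release message
    log: list = field(default_factory=list)


def find_suspects(processes):
    """Mimic the report filter: flag processes that have zero active tasks."""
    return [p for p in processes if p.active_tasks == 0]


def run_db_step(proc, db_call, alert):
    """Run the DB step without pausing on error; on failure, alert and hold."""
    try:
        db_call(proc)
        proc.log.append("db ok")
    except ConnectionError as exc:
        # Record the error, notify the developers, and wait for a release.
        alert(proc.process_id, str(exc))
        proc.held_for_db = True


def release_all(processes, db_call, alert):
    """Stand-in for the 'DB Restarts' model: release every held process.

    Assumption: each released process retries its DB step immediately.
    Returns the number of processes that were released.
    """
    released = 0
    for proc in processes:
        if proc.held_for_db:
            proc.held_for_db = False
            run_db_step(proc, db_call, alert)
            released += 1
    return released
```

In this sketch a process that fails again after release simply goes back on hold and raises a new alert, so the release step can be run repeatedly until the database is reachable. The filter has the same blind spot described above: a process with two active tasks, one of them broken, is not returned by `find_suspects`.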