  OpenIDM / OPENIDM-8381

Recovery of scheduled jobs following cluster node failure does not work

    Details

      Description

      When a cluster node fails to update its heartbeat within the configured interval (30 seconds by default), the ClusterManagementService calls the registered ClusterEventListener instances so that they can respond to the failure.

      The RepoJobStore is a ClusterEventListener. Its handleEvent method clears all triggers acquired by the failed node and returns them to the waiting list, which should, in theory, make them available for execution again (see the sketch below).
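
      A minimal sketch of that recovery path. The calls to removeAcquiredTrigger and addWaitingTrigger are the ones named in this report; the event type, its accessors, and the getAcquiredTriggers helper are assumptions for illustration, not the actual OpenIDM API:

      // Hypothetical sketch of RepoJobStore#handleEvent for the failed-node case.
      public boolean handleEvent(ClusterEvent event) {
          if (event.getType() == ClusterEventType.INSTANCE_FAILED) {       // assumed constant
              String failedNodeId = event.getInstanceId();                 // assumed accessor
              for (Trigger trigger : getAcquiredTriggers(failedNodeId)) {  // hypothetical helper
                  removeAcquiredTrigger(trigger, failedNodeId); // release the failed node's claim
                  addWaitingTrigger(trigger);                   // put the trigger back in the waiting list
              }
          }
          return true;
      }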

      In practice, this does not seem to work. removeAcquiredTrigger and addWaitingTrigger are indeed invoked successfully, but the 'recovered' trigger is never returned from acquireNextTrigger, and is thus never fired by Quartz.

      The fact that RepoJobStore#handleEvent does not modify the TriggerWrapper associated with the trigger to set the acquired flag to false and the nodeId to null looked relevant, but making this change (sketched below) did not allow the recovered trigger to be subsequently executed. In fact, the trigger is never returned from getWaitingTriggers, which is the list of Trigger instances processed by acquireNextTrigger.
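
      The attempted change, sketched with hypothetical accessor and persistence names on TriggerWrapper (the acquired flag and nodeId are described above; the setter, lookup, and update calls are assumptions):

      // Hypothetical sketch of the experiment: clear the failed node's ownership markers
      // on the persisted TriggerWrapper while re-adding the trigger to the waiting list.
      TriggerWrapper tw = getTriggerWrapper(trigger.getName(), trigger.getGroup()); // assumed lookup
      tw.setAcquired(false);     // assumed setter: no longer acquired by any node
      tw.setNodeId(null);        // assumed setter: detach from the failed node
      updateTriggerInRepo(tw);   // assumed persistence call
      // Even after this change, the trigger never reappears in getWaitingTriggers().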

      Note that the recovered trigger is still returned by

      http://localhost:8080/openidm/scheduler/trigger/?_queryFilter=true

      but it is never returned by

      http://localhost:8080/openidm/repo/scheduler/waitingTriggers?_queryFilter=true

      which is ultimately the list processed by acquireNextTrigger (see the sketch below).
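
      For context, a simplified sketch (not the actual OpenIDM source) of the acquire loop those endpoints relate to; getWaitingTriggers and acquireNextTrigger are named in this report, while the other helper and field names are assumptions. The point is that acquireNextTrigger only ever considers triggers currently in the waiting list, so a recovered trigger that never re-enters that list can never fire:

      // Simplified, hypothetical shape of RepoJobStore#acquireNextTrigger.
      public Trigger acquireNextTrigger(long noLaterThan) {
          for (Trigger trigger : getWaitingTriggers()) {          // the waitingTriggers list queried above
              Date nextFireTime = trigger.getNextFireTime();
              if (nextFireTime != null && nextFireTime.getTime() <= noLaterThan) {
                  removeWaitingTrigger(trigger);                  // assumed: take it out of the waiting list
                  addAcquiredTrigger(trigger, instanceId);        // assumed: record this node's ownership
                  return trigger;                                 // Quartz will now fire this trigger
              }
          }
          return null;                                            // nothing eligible to fire yet
      }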

      A few hours of debugging and examining debug logs left me none the wiser. The recovered trigger never appears in the debug logging at the end of acquireNextTrigger, which seems to rule out the possibility that it was executed at some point after its recovery.

      I believe that trigger recovery has never worked. The failure was probably hidden by the fact that all triggers were cron triggers, which would simply fire again on another cluster node the next time their schedule indicated, masking the failed recovery.

      Simple, or fire-once, triggers were implemented for clustered recon, and these triggers do not fire again following recovery (see the example below).
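
      To illustrate why cron triggers mask the problem while fire-once triggers expose it, here is a standalone Quartz example (written against the Quartz 2.x builder API for clarity, which may differ from the Quartz version embedded in OpenIDM; the identities and cron expression are made up): the cron trigger always has a further fire time, so a surviving node will still run it, whereas the simple trigger with repeat count 0 has no later fire time once its single shot is lost.

      import static org.quartz.CronScheduleBuilder.cronSchedule;
      import static org.quartz.SimpleScheduleBuilder.simpleSchedule;
      import static org.quartz.TriggerBuilder.newTrigger;

      import org.quartz.Trigger;

      public class TriggerKindsExample {
          public static void main(String[] args) {
              // A recurring cron trigger: if one firing is lost with the failed node, the next
              // computed fire time still causes it to run on a surviving node.
              Trigger recurring = newTrigger()
                      .withIdentity("reconCron", "recon")               // made-up identity
                      .withSchedule(cronSchedule("0 0/15 * * * ?"))     // example: every 15 minutes
                      .build();

              // A fire-once trigger, as used for clustered recon: repeat count 0 means there is
              // no later fire time, so a lost firing only runs again if recovery actually works.
              Trigger fireOnce = newTrigger()
                      .withIdentity("reconOnce", "recon")               // made-up identity
                      .startNow()
                      .withSchedule(simpleSchedule().withRepeatCount(0))
                      .build();

              System.out.println("recurring trigger: " + recurring.getKey());
              System.out.println("fire-once trigger: " + fireOnce.getKey());
          }
      }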

       

       


              People

              • Assignee: cgdrake (Chris Drake)
              • Reporter: dhogan (Dirk Hogan)
              • QA Assignee: Vojtěch Oczka
              • Votes: 0
              • Watchers: 3
