Problem statement: I see a lot of unknown tasks since we moved to a High Availability (master-worker) setup.
Why this happens:
Whenever a task is executed on a worker, a task recovery file is periodically written to the work directory to keep track of what the task looks like and which parts of it have already been executed. If a worker crashes or is terminated, it can be restarted; it will then find these .task files and recover the deployment tasks as usual.
Note: For a successful recovery to happen, the work directory and the location where the .task files are kept must be persistent across worker restarts.
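To check that the recovery files really do survive a worker restart, it can help to list them before and after the restart. The following is only a minimal sketch: the function name `find_task_files` and the default directory `work` are hypothetical, and it assumes nothing about the files beyond the `.task` extension mentioned above. Point it at the actual work directory of your worker.

```python
from datetime import datetime, timezone
from pathlib import Path


def find_task_files(work_dir):
    """Return (path, last-modified UTC time) for every *.task file under work_dir."""
    root = Path(work_dir)
    found = []
    # Search recursively, in case recovery files sit in subdirectories.
    for path in sorted(root.rglob("*.task")):
        if path.is_file():
            mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            found.append((path, mtime))
    return found


if __name__ == "__main__":
    import sys

    # Hypothetical default; pass the worker's real work directory as the first argument.
    work_dir = sys.argv[1] if len(sys.argv) > 1 else "work"
    for path, mtime in find_task_files(work_dir):
        print(f"{mtime.isoformat()}  {path}")
```

If the list is empty after a restart while tasks were in flight before it, the work directory is probably not on persistent storage.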
When a task is not properly archived (i.e. it has been partially executed but is not yet finished or canceled) and the worker that it was running on is unavailable (e.g. due to a crash, termination, or network issues), the Task Monitor will show the task as being in an UNKNOWN state. Once the worker becomes available again (network issues resolved, or worker restarted and task recovered), the task will again be shown in its proper state.
If a task is not recoverable, it can be removed from the Task Monitor screen. If a task is removed in this way and the worker becomes available anyway, the task will be a ghost task on the worker, since Deploy no longer knows about it. You can tell all workers to re-register any ghost tasks at once by using the CLI. Point the CLI to one of the masters, and issue the
Here are a few options to deal with this situation:
Restore the task
To restore the unknown tasks and return a list of Task IDs to the Deploy CLI, execute this method from the Deploy CLI:
Deploy fetches the tasks from all the workers and restores the task information to the active repository (database). A task on a worker is treated as unknown, and restored, when it exists in the worker's local task repository but its information is missing from the database.
Cleaning via Force Cancel
When you use the force cancel option to cancel a task, the task data is removed from the database. If the workdir on one of the workers still contains the task, Deploy displays the task as unknown, so you need to manually clean the work directory as well.
Cleaning via DB
Use this option only as a last resort, if you want to wipe out the unknown tasks:
- If the options above don’t work, check the XLD_PENDING_TASKS table in the DB and see if you can find the unknown tasks there. Delete them.
Note: Please take a backup of the DB first, just to be safe.
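Before deleting anything, it is worth looking at what the table actually contains. The sketch below is an assumption-laden illustration, not an official procedure: `list_pending_tasks` is a hypothetical helper that works with any Python DB-API connection and only reads from XLD_PENDING_TASKS. It does not assume any column names, since the table layout is not described here, and it deliberately leaves the delete step to you once you have identified the rows and taken a backup.

```python
def list_pending_tasks(conn):
    """Return (column_names, rows) for every row in XLD_PENDING_TASKS. Read-only."""
    cur = conn.cursor()
    try:
        # SELECT * avoids guessing column names; the table name comes from the text above.
        cur.execute("SELECT * FROM XLD_PENDING_TASKS")
        columns = [col[0] for col in cur.description]
        rows = cur.fetchall()
    finally:
        cur.close()
    return columns, rows
```

Pass in a connection created with the driver for your Deploy database, print the returned columns and rows, and compare them with the unknown tasks shown in the Task Monitor before removing any of them.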
My personal recommendation is to have an archive policy in place so that completed tasks/deployments are archived and don’t stay in a dangling state. The Engineering Team is already aware of all the conditions leading to the Unknown task state and is working on improvements in the upcoming version to avoid this situation.