StarPU Handbook
39. Fault Tolerance

39.1 Introduction

Tasks may fail, for instance because of a hardware error, and even complete nodes may fail. For now, StarPU only provides some support for handling task failures.

39.2 Retrying tasks

If a task implementation notices that it failed to compute properly, it can call starpu_task_failed() to notify StarPU of the failure.

tests/fault-tolerance/retry.c is an example of coping with such failures. The principle is that, when submitting the task, one sets its prologue callback to starpu_task_ft_prologue(). That prologue turns the task into a meta-task, which manages the repeated submission of try-tasks that perform the computation, until one of them succeeds. One can create a try-task for the meta-task by calling starpu_task_ft_create_retry().

By default, try-tasks are simply retried until one of them succeeds (i.e. the task implementation does not call starpu_task_failed()). One can change this behavior by passing a check_failsafe function as the prologue parameter; it is called at the end of each try-task attempt. It can look at starpu_task_get_current()->failed to determine whether the try-task succeeded or failed. If it succeeded, the function can call starpu_task_ft_success() on the meta-task to notify success. If it failed, the function can call starpu_task_ft_create_retry() to create another try-task, and submit it with starpu_task_submit_nodeps().
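The control flow described above can be modelled in plain C, without StarPU, to make the roles of the meta-task, the try-tasks and the check function concrete. This is only a sketch: every name in it (the model_ prefix, the verdict enum, the bounded policy) is hypothetical and is not part of the StarPU API, and the "give up" outcome is an illustrative addition that the text above does not describe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Plain-C model of the meta-task / try-task protocol.
 * Every name below is hypothetical; none of it is StarPU API. */

struct model_try_task {
    bool failed;        /* set by the kernel when it detects a bad result */
    unsigned attempt;   /* 1 for the first try-task, 2 for the second, ... */
};

enum model_verdict { MODEL_SUCCESS, MODEL_RETRY, MODEL_GIVE_UP };

/* Called at the end of each try-task to decide what happens next. */
typedef enum model_verdict (*model_check_fn)(const struct model_try_task *t);

/* A kernel does the work and sets t->failed if something went wrong. */
typedef void (*model_kernel_fn)(struct model_try_task *t, void *arg);

/* Default policy: retry until a try-task does not report failure. */
static enum model_verdict model_check_default(const struct model_try_task *t)
{
    return t->failed ? MODEL_RETRY : MODEL_SUCCESS;
}

/* Custom policy (illustrative): like the default, but give up after 3 attempts. */
static enum model_verdict model_check_bounded(const struct model_try_task *t)
{
    if (!t->failed)
        return MODEL_SUCCESS;
    return t->attempt < 3 ? MODEL_RETRY : MODEL_GIVE_UP;
}

/* Meta-task: runs try-tasks until the check function stops asking for retries.
 * Returns the number of attempts made; *ok tells whether one of them succeeded. */
static unsigned model_run_meta(model_kernel_fn kernel, void *arg,
                               model_check_fn check, bool *ok)
{
    assert(kernel != NULL && ok != NULL);
    if (check == NULL)
        check = model_check_default;
    for (unsigned attempt = 1;; attempt++) {
        struct model_try_task t = { false, attempt };
        kernel(&t, arg);
        enum model_verdict v = check(&t);
        if (v != MODEL_RETRY) {
            *ok = (v == MODEL_SUCCESS);
            return attempt;
        }
    }
}

/* Example kernel: reports failure for the first *arg calls, then succeeds. */
static void model_flaky_kernel(struct model_try_task *t, void *arg)
{
    unsigned *failures_left = arg;
    if (*failures_left > 0) {
        (*failures_left)--;
        t->failed = true;
    }
}
```

Passing NULL as the check function selects the default policy, mirroring the way the default behavior applies when no check function is given as prologue parameter. In StarPU itself the retry is not a loop: the check function creates and submits a new try-task, and the runtime schedules it like any other task.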

This can only work, however, if the task input is not modified, since a retry has to start from the same input as the failed attempt. It is thus not supported for tasks with data access mode STARPU_RW.
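The restriction can be illustrated with a small plain-C experiment that does not use StarPU. The two functions below are hypothetical stand-ins for a task implementation, and the fail_at parameter is an artificial way to simulate a failure partway through the computation. The sketch shows that retrying an in-place update applies part of it twice, whereas retrying an out-of-place computation is harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Plain-C illustration; these functions are hypothetical, not StarPU API. */

/* In-place increment (read-write access to buf). When fail_at < n, the
 * function stops at index fail_at and reports failure, leaving the elements
 * before fail_at already updated. Pass fail_at == n for a run without failure. */
static bool model_increment_inplace(double *buf, size_t n, size_t fail_at)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i == fail_at)
            return false;
        buf[i] += 1.0;
    }
    return true;
}

/* Out-of-place increment (read-only input, write-only output). A failed
 * attempt leaves garbage in out only, which the next attempt overwrites. */
static bool model_increment_copy(const double *in, double *out, size_t n,
                                 size_t fail_at)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i == fail_at)
            return false;
        out[i] = in[i] + 1.0;
    }
    return true;
}
```

With buf = {10, 20}, a first in-place attempt that fails at index 1 followed by a clean retry yields {12, 21}: the first element was incremented twice. The same sequence with the out-of-place version yields {11, 21}, because the input was never touched.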