StarPU Internal Handbook
TODO
Execution entities:
Work entities:
Data entities:
A worker is a CPU thread created by StarPU. Its role is to manage one computing unit. This computing unit can be a local CPU core, in which case the worker thread manages the actual CPU core to which it is assigned; or it can be a computing device such as a GPU or an accelerator (or even a remote computing node when StarPU is running in distributed master-slave mode). When a worker manages a computing device, the CPU core to which the worker's thread is assigned is, by default, exclusively dedicated to device management work and does not participate in computation.
Scheduling operations related state
While a worker is conducting a scheduling operation, e.g. the worker is in the process of selecting a new task to execute, the flag state_sched_op_pending is set to !0; otherwise it is set to 0.
While state_sched_op_pending is !0, the following operations on that worker are restricted in the stated way:
- observing the worker's state flags from another thread: only allowed while state_relax_refcnt > 0;
- querying whether the worker is blocked on a parallel task (starpu_worker_is_blocked_in_parallel()): only allowed while state_relax_refcnt > 0.
Entering and leaving the state_sched_op_pending state is done through calls to _starpu_worker_enter_sched_op() and _starpu_worker_leave_sched_op() respectively (see these functions in use in the functions _starpu_get_worker_task() and _starpu_get_multi_worker_task()). These calls ensure that any pending conflicting operation deferred while the worker was in the state_sched_op_pending state is performed in an orderly manner.
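As a hedged illustration (not the actual StarPU code), a worker-side scheduling operation bracketed by these calls could look as follows; the helper name pick_task_sketch() and the elided task-selection body are placeholders:

    /* Hedged sketch: a scheduling operation bracketed by the enter/leave
     * calls described above.  The real logic lives in
     * _starpu_get_worker_task() and _starpu_get_multi_worker_task(). */
    static struct starpu_task *pick_task_sketch(struct _starpu_worker *worker)
    {
        struct starpu_task *task;

        /* Sets state_sched_op_pending to !0 for this worker. */
        _starpu_worker_enter_sched_op(worker);

        task = NULL; /* ... actual task selection would happen here ... */

        /* Clears state_sched_op_pending and performs any conflicting
         * operation deferred during the scheduling operation. */
        _starpu_worker_leave_sched_op(worker);

        return task;
    }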
Scheduling contexts related states
Flag state_changing_ctx_notice is set to !0 when a thread is about to add the worker to a scheduling context or to remove it from one, and is currently waiting for a safe window to do so, until the targeted worker is no longer in a scheduling operation or parallel task operation. While set to !0, this flag also prevents the targeted worker from attempting a fresh scheduling operation or parallel task operation, to avoid starvation conditions. However, a scheduling operation that was already in progress before the notice is allowed to complete.
Flag state_changing_ctx_waiting is set to !0 when a scheduling context worker addition or removal involving the targeted worker is about to occur while that worker is performing a scheduling operation; it tells the targeted worker that the initiator thread is waiting for the scheduling operation to complete and should be woken up upon its completion.
Relaxed synchronization related states
Any StarPU worker may participate in scheduling operations, and in this process may be forced to observe state information from other workers. A StarPU worker thread may therefore be observed by any thread, even by other StarPU workers. Since workers may observe each other in any order, it is not possible to rely exclusively on the sched_mutex of each worker to protect the observation of worker state flags by other workers: worker A observing worker B would involve locking the workers in the (A, B) sequence, while worker B observing worker A would involve locking them in the (B, A) sequence, leading to lock inversion deadlocks.
In consequence, no thread must hold more than one worker's sched_mutex at any time. Instead, workers implement a relaxed locking scheme based on the state_relax_refcnt counter, itself protected by the worker's sched_mutex. When state_relax_refcnt > 0, the targeted worker's state flags may be observed; otherwise, the thread attempting the observation must repeatedly wait on the targeted worker's sched_cond condition until state_relax_refcnt > 0.
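The following hedged sketch (not the actual StarPU implementation) shows how an observer thread could wait for the relaxed state before reading a flag of another worker, assuming the sched_mutex, sched_cond, state_relax_refcnt and state_blocked_in_parallel fields of struct _starpu_worker described in this chapter:

    /* Hedged sketch: observing another worker's state flags under the
     * relaxed locking scheme. */
    static int observe_worker_flag_sketch(struct _starpu_worker *worker)
    {
        int blocked;

        starpu_pthread_mutex_lock(&worker->sched_mutex);

        /* Wait until the targeted worker enters the relaxed state. */
        while (worker->state_relax_refcnt == 0)
            starpu_pthread_cond_wait(&worker->sched_cond, &worker->sched_mutex);

        /* The state flags may now be observed. */
        blocked = worker->state_blocked_in_parallel;

        starpu_pthread_mutex_unlock(&worker->sched_mutex);
        return blocked;
    }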
The relaxed mode, while on, can actually be seen as a transactional consistency model, where concurrent accesses are authorized and potential conflicts are resolved after the fact. When the relaxed mode is off, the consistency model becomes a mutual exclusion model, where the sched_mutex of the worker must be held in order to access or change the worker state.
Parallel tasks related states
When a worker is scheduled to participate in the execution of a parallel task, it must wait for the whole team of workers participating in the execution of this task to be ready. While the worker waits for its teammates, it is not available to run other tasks or perform other operations. Such a waiting operation can therefore not start while conflicting operations, such as scheduling operations or scheduling context resizing involving the worker, are ongoing. Conversely, these operations and others may query whether the worker is blocked on a parallel task entry with starpu_worker_is_blocked_in_parallel().
The starpu_worker_is_blocked_in_parallel() function is allowed to proceed while, and only while, state_relax_refcnt > 0. Due to the relaxed worker locking scheme, the state_blocked_in_parallel flag of the targeted worker may change after it has been observed by an observer thread. In consequence, the flag state_blocked_in_parallel_observed of the targeted worker is set to 1 by the observer immediately after the observation, to "taint" the targeted worker. The targeted worker will clear the state_blocked_in_parallel_observed taint and defer the processing of parallel task related requests until a full scheduling operation shot completes without the state_blocked_in_parallel_observed flag being tainted again. The purpose of this tainting flag is to prevent parallel task operations from being started immediately after the observation of a transient scheduling state.
A worker's management of parallel tasks is governed by the following set of state flags and counters (a usage sketch follows the list):
- state_blocked_in_parallel: set to !0 while the worker is currently blocked on a parallel task;
- state_blocked_in_parallel_observed: set to !0 to taint the worker when a thread has observed the state_blocked_in_parallel flag of this worker while its state_relax_refcnt state counter was > 0. Any pending request to add the worker to or remove it from a parallel task team will be deferred until a whole scheduling operation shot completes without the worker being tainted again.
- state_block_in_parallel_req: set to !0 when a thread is waiting on a request for the worker to be added to a parallel task team. Must be protected by the worker's sched_mutex.
- state_block_in_parallel_ack: set to !0 by the worker when acknowledging a request for being added to a parallel task team. Must be protected by the worker's sched_mutex.
- state_unblock_in_parallel_req: set to !0 when a thread is waiting on a request for the worker to be removed from a parallel task team. Must be protected by the worker's sched_mutex.
- state_unblock_in_parallel_ack: set to !0 by the worker when acknowledging a request for being removed from a parallel task team. Must be protected by the worker's sched_mutex.
- block_in_parallel_ref_count: counts the number of consecutive pending requests to enter parallel task teams. Only the first of a train of requests for entering parallel task teams triggers the transition of the state_block_in_parallel_req flag from 0 to 1. Only the last of a train of requests to leave a parallel task team triggers the transition of the state_unblock_in_parallel_req flag from 0 to 1. Must be protected by the worker's sched_mutex.
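The following hedged sketch (a simplification, not the actual StarPU code) illustrates how an initiator thread might use these flags and the counter to request that a worker be added to a parallel task team; the helper name is illustrative:

    /* Hedged sketch: requesting that a worker join a parallel task team. */
    static void request_block_in_parallel_sketch(struct _starpu_worker *worker)
    {
        starpu_pthread_mutex_lock(&worker->sched_mutex);

        /* Only the first request of a train triggers the 0 -> 1 transition
         * of state_block_in_parallel_req. */
        if (worker->block_in_parallel_ref_count++ == 0)
        {
            worker->state_block_in_parallel_req = 1;

            /* Wake the worker up so that it notices the request, then wait
             * for it to acknowledge having blocked on the parallel task. */
            starpu_pthread_cond_broadcast(&worker->sched_cond);
            while (!worker->state_block_in_parallel_ack)
                starpu_pthread_cond_wait(&worker->sched_cond, &worker->sched_mutex);
        }

        starpu_pthread_mutex_unlock(&worker->sched_mutex);
    }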
Entry point
All the operations of a worker are handled in an iterative fashion, either by the application code on a thread launched by the application, or automatically by StarPU on a device-dependent CPU thread launched by StarPU. Whether a worker's operation cycle is managed automatically or not is controlled per session by the not_launched_drivers field of the starpu_conf structure, and is decided in the _starpu_launch_drivers() function.
When managed automatically, cycles of operations for a worker are handled by the corresponding driver-specific _starpu_<DRV>_worker() function, where <DRV> is a driver name such as cpu (_starpu_cpu_worker) or cuda (_starpu_cuda_worker). Otherwise, the application must supply a thread which repeatedly calls starpu_driver_run_once() for the corresponding worker.
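As a hedged example of the non-automatic mode, an application could drive one CPU worker itself with the public starpu_driver_* entry points. Field names such as id.cpu_id and n_not_launched_drivers are assumptions to verify against the installed StarPU headers, and error checking is omitted:

    #include <starpu.h>

    static struct starpu_driver drv =
    {
        .type = STARPU_CPU_WORKER,
        .id.cpu_id = 0,          /* assumption: exact union member name may vary */
    };

    static volatile int done;    /* set by the application when it wants to stop */

    /* Application-supplied thread performing the worker's cycles of operations. */
    static void *driver_thread(void *arg)
    {
        (void)arg;
        starpu_driver_init(&drv);
        while (!done)
            starpu_driver_run_once(&drv); /* one cycle of operations */
        starpu_driver_deinit(&drv);
        return NULL;
    }

    int main(void)
    {
        struct starpu_conf conf;
        starpu_conf_init(&conf);
        conf.not_launched_drivers = &drv;    /* StarPU must not launch this driver itself */
        conf.n_not_launched_drivers = 1;     /* assumption: field name to check */
        starpu_init(&conf);
        /* ... launch driver_thread() on an application thread, submit tasks,
         * then set done, join the thread and call starpu_shutdown() ... */
        return 0;
    }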
In both cases, control is then transferred to _starpu_cpu_driver_run_once() (or the corresponding driver-specific function). The cycle of operations typically includes, at least, the following operations, detailed in the subsections below: task scheduling, parallel task team build-up, task input processing, data transfer processing, and task execution.
When the worker cycles are handled by StarPU automatically, the iterative operation processing ends when the running field of _starpu_config becomes false. This field should not be read directly; instead, it should be read through the _starpu_machine_is_running() function.
Task scheduling
If the worker does not yet have a queued task, it calls _starpu_get_worker_task() to try and obtain a task. This may involve scheduling operations such as stealing a queued but not yet executed task from another worker. The operation may not necessarily succeed if no tasks are ready and/or suitable to run on the worker's computing unit.
Parallel task team build-up
If the worker has a task ready to run and the corresponding job has a size > 1, then the task is a parallel job and the worker must synchronize with the other workers participating in the parallel execution of the job, in order to assign a unique rank to each worker. The synchronization is done through the job's sync_mutex mutex.
Task input processing
Before the task can be executed, its input data must be made available on a memory node reachable by the worker's computing unit. To do so, the worker calls _starpu_fetch_task_input().
Data transfer processing
The worker makes pending data transfers (involving the memory node(s) that it is driving) progress with a call to __starpu_datawizard_progress().
Task execution
Once the worker has a pending task assigned and the input data for that task are available on the memory node reachable by the worker's computing unit, the worker calls _starpu_cpu_driver_execute_task() (or the corresponding driver-specific function) to proceed to the execution of the task.
Scheduling contexts
A scheduling context is a logical set of workers governed by an instance of a scheduling policy. Tasks submitted to a given scheduling context are confined to the computing units governed by the workers belonging to this scheduling context at the time they get scheduled.
A scheduling context is identified by an unsigned integer identifier between 0 and STARPU_NMAX_SCHED_CTXS - 1. The identifier value STARPU_NMAX_SCHED_CTXS is reserved to indicate an unallocated, invalid or deleted scheduling context.
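For illustration, a trivial hedged helper (not a StarPU API) follows directly from this convention:

    /* Hedged sketch: a context identifier is usable only if it is strictly
     * below STARPU_NMAX_SCHED_CTXS; that value itself denotes an
     * unallocated, invalid or deleted context. */
    static int sched_ctx_id_is_valid_sketch(unsigned sched_ctx_id)
    {
        return sched_ctx_id < STARPU_NMAX_SCHED_CTXS;
    }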
Accesses to the scheduling context structure are governed by a multiple-readers/single-writer lock (the rwlock field). Changes to the structure contents, additions or removals of workers, and statistics updates must all be done with proper exclusive write access.
A worker can be assigned to one or more scheduling contexts. It exclusively receives tasks submitted to the scheduling context(s) it is assigned to at the time such tasks are scheduled. A worker may add itself to or remove itself from a scheduling context.
Locking and synchronization rules between workers and scheduling contexts
A thread currently holding a worker sched_mutex must not attempt to acquire a scheduling context rwlock, neither for writing nor for reading. Such an attempt constitutes a lock inversion and may result in a deadlock.
A worker currently in a scheduling operation must enter the relaxed state before attempting to acquire a scheduling context rwlock, either for reading or for writing.
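A hedged sketch of this rule follows, assuming that the public starpu_worker_relax_on()/starpu_worker_relax_off() helpers respectively enter and leave the relaxed state for the calling worker; the context lock helpers are illustrative placeholders:

    /* Hedged sketch: a worker inside a scheduling operation enters the
     * relaxed state before taking a scheduling context rwlock. */
    static void read_ctx_from_sched_op_sketch(struct _starpu_sched_ctx *ctx)
    {
        starpu_worker_relax_on();   /* let other threads observe this worker */
        ctx_rwlock_rdlock(ctx);     /* illustrative: take the context rwlock for reading */

        /* ... observe scheduling context state ... */

        ctx_rwlock_unlock(ctx);     /* illustrative */
        starpu_worker_relax_off();  /* leave the relaxed state */
    }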
When the set of workers assigned to a scheduling context is about to be modified, all the workers in the union of the workers belonging to the scheduling context before the change and the workers expected to belong to it after the change must be notified using the notify_workers_about_changing_ctx_pending() function prior to the update. After the update, all the workers in that same union must be notified of the update completion with a call to notify_workers_about_changing_ctx_done().
The function notify_workers_about_changing_ctx_pending() places every worker passed in argument in a state compatible with changing the scheduling context assignment of that worker, possibly blocking until that worker leaves incompatible states such as a pending scheduling operation. If the caller of notify_workers_about_changing_ctx_pending() is itself a worker included in the set of workers passed in argument, it does not notify itself, under the assumption that it is already calling notify_workers_about_changing_ctx_pending() from a state compatible with a scheduling context assignment update. Once a worker has been notified about a pending scheduling context change, it cannot proceed with incompatible operations, such as a scheduling operation, until it receives a notification that the context update operation is complete.
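Putting these rules together, a hedged sketch of a worker-set change could look as follows; the helpers union_of() and apply_worker_set_change(), the exact prototypes of the notification functions, and the assumption that the rwlock field is a starpu_pthread_rwlock_t are all illustrative:

    /* Hedged sketch: changing the set of workers of a scheduling context
     * using the notification protocol described above. */
    static void change_ctx_workers_sketch(struct _starpu_sched_ctx *ctx,
                                          const int *before, unsigned n_before,
                                          const int *after, unsigned n_after)
    {
        /* Workers to notify: the union of the old and new worker sets. */
        int affected[STARPU_NMAXWORKERS];
        unsigned n_affected = union_of(before, n_before, after, n_after, affected);

        /* Bring every affected worker into a state compatible with the
         * change, waiting for pending scheduling operations to complete. */
        notify_workers_about_changing_ctx_pending(n_affected, affected);

        /* Perform the actual update with exclusive write access. */
        starpu_pthread_rwlock_wrlock(&ctx->rwlock);
        apply_worker_set_change(ctx, after, n_after);
        starpu_pthread_rwlock_unlock(&ctx->rwlock);

        /* Let the affected workers resume normal operation. */
        notify_workers_about_changing_ctx_done(n_affected, affected);
    }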
Drivers
Each driver defines a set of routines depending on some specific hardware. These routines include hardware discovery/initialization, task execution, device memory management and data transfers.
While most hardware-dependent routines are in source files located in the src/drivers subdirectory of the StarPU tree, some can be found elsewhere in the tree, such as src/datawizard/malloc.c for memory allocation routines or the subdirectories of src/datawizard/interfaces/ for data transfer routines.
The driver ABI defined in the _starpu_driver_ops structure includes the following operations:
- .init: initialize a driver instance for the calling worker, managing a hardware computing unit compatible with this driver;
- .run_once: perform a single driver progress cycle for the calling worker (see Operations);
- .deinit: deinitialize the driver instance for the calling worker;
- .run: execute the following sequence automatically (sketched below): call .init, repeatedly call .run_once until the _starpu_machine_is_running() function returns false, then call .deinit.

The source code common to all drivers is shared in src/drivers/driver_common/driver_common.[ch]. These files include services such as grabbing a new task to execute on a worker, managing statistics accounting on job startup and completion, and updating the worker status.
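For illustration, the generic .run sequence listed above can be pictured as follows (a hedged sketch; the actual prototypes of the _starpu_driver_ops operations may differ):

    /* Hedged sketch of the .run sequence: init, cycle until StarPU stops
     * running, then deinit. */
    static int driver_run_sketch(const struct _starpu_driver_ops *ops,
                                 struct _starpu_worker *worker)
    {
        int ret = ops->init(worker);
        if (ret)
            return ret;

        while (_starpu_machine_is_running())
            ops->run_once(worker);

        return ops->deinit(worker);
    }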
A subset of the drivers corresponds to drivers managing computing units in master/slave mode, that is, drivers involving a local master instance managing one or more remote slave instances on the targeted device(s). This includes devices such as discrete manycore accelerators (e.g. Intel's Knights Corner boards), or pseudo devices such as a cluster of CPU nodes driven through StarPU's MPI master/slave mode. A driver instance on the master side is named the source, while a driver instance on the slave side is named the sink.
A significant part of the work realized on the source and sink sides is identical among all master/slave drivers, due to the similarities in the software pattern. Therefore, many routines are shared among all these drivers in the src/drivers/mp_common subdirectory. In particular, a set of default commands to be used between sources and sinks is defined, assuming the availability of some communication channel between them (see enum _starpu_mp_command).
TODO
TODO
TODO