The registerDoFuture()
function makes the
%dopar%
operator of the
foreach package to process foreach iterations via any of
the future backends supported by the future package, which
includes various parallel and distributed backends.
In other words, if a computational backend is supported via
the Future API, it'll be automatically available for all functions
and packages making using the foreach framework.
Neither the developer nor the end user has to change any code.
registerDoFuture()
registerDoFuture()
returns, invisibly, the previously registered
foreach %dopar%
backend.
To use futures with the foreach package and its
%dopar%
operator, use
doFuture::registerDoFuture()
to register doFuture to be
used as a %dopar%
adapter. After this, %dopar%
will
parallelize with whatever future backend is set by
future::plan()
.
The built-in future backends are always available, e.g.
sequential (sequential processing),
multicore (forked processes),
multisession (background R sessions),
and cluster (background R sessions on
local and remote machines).
For example, plan(multisession)
will make %dopar%
parallelize via R processes running in the background on the
local machine, and
plan(cluster, workers = c("n1", "n2", "n2", "n3"))
will
parallelize via R processes running on external machines.
Additional backends are provided by other future-compliant
packages. For example, the future.batchtools package
provides support for high-performance compute (HPC) cluster
schedulers such as SGE, Slurm, and TORQUE / PBS.
As an illustration, plan(batchtools_slurm)
will parallelize
by submitting the foreach iterations as tasks to the Slurm
scheduler, which in turn will distribute the tasks to one
or more compute nodes.
Unless running locally in the global environment (= at the R prompt),
the foreach package requires you do specify what global variables
and packages need to be available and attached in order for the
"foreach" expression to be evaluated properly. It is not uncommon to
get errors on one or missing variables when moving from running a
res <- foreach() %dopar% { ... }
statement on the local machine
to, say, another machine on the same network. The solution to the
problem is to explicitly export those variables by specifying them in
the .export
argument to foreach::foreach()
,
e.g. foreach(..., .export = c("mu", "sigma"))
. Likewise, if the
expression needs specific packages to be attached, they can be listed
in argument .packages
of foreach()
.
When using registerDoFuture()
, the above becomes less
critical, because by default the Future API identifies all globals and
all packages automatically (via static code inspection). This is done
exactly the same way regardless of future backend.
This automatic identification of globals and packages is illustrated
by the below example, which does not specify
.export = c("my_stat")
. This works because the future framework
detects that function my_stat()
is needed and makes sure it is
exported. If you would use, say, cl <- parallel::makeCluster(2)
and doParallel::registerDoParallel(cl)
, you would get a run-time
error on Error in { : task 1 failed - \"could not find function "my_stat" ...
.
Having said this, note that, in order for your "foreach" code to work
everywhere and with other types of foreach adapters as well, you may
want to make sure that you always specify arguments .export
and .packages
.
Whether load balancing ("chunking") should take place or not can be
controlled by specifying either argument
.options.future = list(scheduling = <ratio>)
or
.options.future = list(chunk.size = <count>)
to foreach()
.
The value chunk.size
specifies the average number of elements
processed per future ("chunks").
If +Inf
, then all elements are processed in a single future (one worker).
If NULL
, then argument future.scheduling
is used.
The value scheduling
specifies the average number of futures
("chunks") that each worker processes.
If 0.0
, then a single future is used to process all iterations;
none of the other workers are not used.
If 1.0
or TRUE
, then one future per worker is used.
If 2.0
, then each worker will process two futures (if there are
enough iterations).
If +Inf
or FALSE
, then one future per iteration is used.
The default value is scheduling = 1.0
.
The name of foreach()
argument .options.future
follows the naming
conventions of the doMC, doSNOW, and doParallel packages,
This argument should not be mistaken for the R
options of the future package.
For backward-compatibility reasons with existing foreach code, one may
also use arguments .options.multicore = list(preschedule = <logical>)
and
.options.snow = list(preschedule = <logical>)
when using doFuture.
.options.multicore = list(preschedule = TRUE)
is equivalent to
.options.future = list(scheduling = 1.0)
and
.options.multicore = list(preschedule = FALSE)
is equivalent to
.options.future = list(scheduling = +Inf)
.
and analogously for .options.snow
.
Argument .options.future
takes precedence over argument
.option.multicore
which takes precedence over argument .option.snow
,
when it comes to chunking.
The doFuture adapter registered by registerDoFuture()
does not itself
provide a framework for generating proper random numbers in parallel.
This is a deliberate design choice based on how the foreach ecosystem is
set up and to align it with other foreach adapters, e.g. doParallel.
To generate statistically sound parallel RNG, it is recommended to use
the doRNG package, where the %dorng%
operator is used in place of %dopar%
.
For example,
This works because doRNG is designed to work with any type of foreach
%dopar%
adapter including the one provided by doFuture.
If you forget to use %dorng%
instead of %dopar%
when the foreach
iteration generates random numbers, doFuture will detect the
mistake and produce an informative warning.
Please refrain from modifying the foreach backend inside your package or
functions, i.e. do not call any registerNnn()
in your code. Instead,
leave the control on what backend to use to the end user. This idea is
part of the core philosophy of the foreach framework.
However, if you think it necessary to register the doFuture backend in a function, please make sure to undo your changes when exiting the function. This can be done using:
oldDoPar <- registerDoFuture()
on.exit(with(oldDoPar, foreach::setDoPar(fun=fun, data=data, info=info)), add = TRUE)
[...]
This is important, because the end-user might have already registered a foreach backend elsewhere for other purposes and will most likely not known that calling your function will break their setup. Remember, your package and its functions might be used in a greater context where multiple packages and functions are involved and those might also rely on the foreach framework, so it is important to avoid stepping on others' toes.
# \donttest{
library(iterators) # iter()
registerDoFuture() # (a) tell %dopar% to use the future framework
plan(multisession) # (b) parallelize futures on the local machine
## Example 1
A <- matrix(rnorm(100^2), nrow = 100)
B <- t(A)
y1 <- apply(B, MARGIN = 2L, FUN = function(b) {
A %*% b
})
y2 <- foreach(b = iter(B, by = "col"), .combine = cbind) %dopar% {
A %*% b
}
stopifnot(all.equal(y2, y1))
## Example 2 - Chunking (4 elements per future [= worker])
y3 <- foreach(b = iter(B, by = "col"), .combine = cbind,
.options.future = list(chunk.size = 10)) %dopar% {
A %*% b
}
stopifnot(all.equal(y3, y1))
## Example 3 - Simulation with parallel RNG
library(doRNG)
#> Loading required package: rngtools
my_stat <- function(x) {
median(x)
}
my_experiment <- function(n, mu = 0.0, sigma = 1.0) {
## Important: use %dorng% whenever random numbers
## are involved in parallel evaluation
foreach(i = 1:n) %dorng% {
x <- rnorm(i, mean = mu, sd = sigma)
list(mu = mean(x), sigma = sd(x), own = my_stat(x))
}
}
## Reproducible results when using the same RNG seed
set.seed(0xBEEF)
y1 <- my_experiment(n = 3)
set.seed(0xBEEF)
y2 <- my_experiment(n = 3)
stopifnot(identical(y2, y1))
## But only then
y3 <- my_experiment(n = 3)
str(y3)
#> List of 3
#> $ :List of 3
#> ..$ mu : num 1.95
#> ..$ sigma: num NA
#> ..$ own : num 1.95
#> $ :List of 3
#> ..$ mu : num 0.201
#> ..$ sigma: num 1.45
#> ..$ own : num 0.201
#> $ :List of 3
#> ..$ mu : num -0.417
#> ..$ sigma: num 0.998
#> ..$ own : num 0.118
#> - attr(*, "rng")=List of 3
#> ..$ : int [1:7] 10407 -821273412 -52578226 1415511586 721384351 -665928286 1316562960
#> ..$ : int [1:7] 10407 -1924040949 -1632234809 -437763632 -1464377300 676654412 2080370711
#> ..$ : int [1:7] 10407 -1371615447 -889757879 1692656974 -1723952224 1378814990 1816467252
#> - attr(*, "doRNG_version")= chr "1.7.4"
stopifnot(!identical(y3, y1))
# }
# \dontshow{
## R CMD check: make sure any open connections are closed afterward
if (!inherits(plan(), "sequential")) plan(sequential)
# }