Often, simulation levels will be simple, such as a vector of sample sizes:

However, there are many instances in which more complex objects are needed. For these cases, instead of a vector of numbers or character strings, a named list of lists can be used. The toy example below illustrates this.

```
sim <- new_sim()
sim %<>% set_levels(
n = c(10,100),
distribution = list(
"Beta 1" = list(type="Beta", params=c(0.3, 0.7)),
"Beta 2" = list(type="Beta", params=c(1.5, 0.4)),
"Normal" = list(type="Normal", params=c(3.0, 0.2))
)
)
create_data <- function(n, type, params) {
if (type=="Beta") {
return(rbeta(n, shape1=params[1], shape2=params[2]))
} else if (type=="Normal") {
return(rnorm(n, mean=params[1], sd=params[2]))
}
}
sim %<>% set_script(function() {
x <- create_data(L$n, L$distribution$type, L$distribution$params)
return(list("y"=mean(x)))
})
sim %<>% run()
#> Done. No errors or warnings detected.
```

Note that the list names (`"Beta 1"`

,
`"Beta 2"`

, and `"Normal"`

) become the entries in
the `sim$results`

dataframe, as well as the dataframe
returned by `summarize()`

.

```
sim %>% summarize(list(stat="mean", x="y"))
#> level_id n distribution n_reps mean_y
#> 1 1 10 Beta 1 1 0.3293261
#> 2 2 100 Beta 1 1 0.2975617
#> 3 3 10 Beta 2 1 0.8442316
#> 4 4 100 Beta 2 1 0.7989588
#> 5 5 10 Normal 1 2.9720820
#> 6 6 100 Normal 1 2.9747477
```

Sometimes, it may be the case that you want to run simulations only
for a subset of level combinations. In these cases, use the
`.keep`

option of `set_levels()`

. First, set the
levels normally. Second, view the `sim$levels_grid`

dataframe
to examine the level combinations and the associated
`level_id`

values. Third, call `set_levels()`

again with the `.keep`

option to specify which level
combinations to keep. This is demonstrated below:

```
sim <- new_sim()
sim %<>% set_levels(alpha=c(2,3,4), beta=c(50,60))
print(sim$levels_grid)
#> level_id alpha beta
#> 1 1 2 50
#> 2 2 3 50
#> 3 3 4 50
#> 4 4 2 60
#> 5 5 3 60
#> 6 6 4 60
sim %<>% set_levels(.keep=c(1,2,6))
print(sim$levels_grid)
#> level_id alpha beta
#> 1 1 2 50
#> 2 2 3 50
#> 6 6 4 60
```

In most situations, the results of simulations are numeric. However,
we may want to return more complex data, such as matrices, lists, or
model objects. To do this, we add a key-value pair to the list returned
by the simulation script with the special key `".complex"`

and a list (containing the complex data) as the value. This is
illustrated in the toy example below. Here, the simulation script
estimates the parameters of a linear regression and returns these as
numeric, but also returns the estimated covariance matrix and the entire
model object.

```
sim <- new_sim()
sim %<>% set_levels(n=c(10, 100, 1000))
create_data <- function(n) {
x <- runif(n)
y <- 3 + 2*x + rnorm(n)
return(data.frame("x"=x, "y"=y))
}
sim %<>% set_config(num_sim=2)
sim %<>% set_script(function() {
dat <- create_data(L$n)
model <- lm(y~x, data=dat)
return(list(
"beta0_hat" = model$coefficients[[1]],
"beta1_hat" = model$coefficients[[2]],
".complex" = list(
"model" = model,
"cov_mtx" = vcov(model)
)
))
})
sim %<>% run()
#> Done. No errors or warnings detected.
```

After running this simulation, the numeric results can be accessed
directly via `sim$results`

or using the
`summarize()`

function, as usual:

```
head(sim$results)
#> sim_uid level_id rep_id n runtime beta0_hat beta1_hat
#> 1 1 1 1 10 0.0023009777 3.313459 1.64224090
#> 2 4 1 2 10 0.0006840229 3.682214 0.04035482
#> 3 2 2 1 100 0.0026309490 3.083269 2.08769380
#> 4 5 2 2 100 0.0006608963 2.781492 2.11971793
#> 5 3 3 1 1000 0.0009598732 3.105255 1.86686623
#> 6 6 3 2 1000 0.0008368492 2.964213 2.09937006
```

To examine the complex return data, we can use the function
`get_complex()`

, as illustrated below:

```
c5 <- get_complex(sim, sim_uid=5)
print(summary(c5$model))
#>
#> Call:
#> lm(formula = y ~ x, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.71501 -0.57515 -0.07813 0.74569 2.87825
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.7815 0.1926 14.442 < 2e-16 ***
#> x 2.1197 0.3234 6.554 2.63e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9914 on 98 degrees of freedom
#> Multiple R-squared: 0.3047, Adjusted R-squared: 0.2976
#> F-statistic: 42.95 on 1 and 98 DF, p-value: 2.628e-09
print(c5$cov_mtx)
#> (Intercept) x
#> (Intercept) 0.03709408 -0.05340809
#> x -0.05340809 0.10461387
```

`batch`

functionThe `batch()`

function is useful for sharing data or
objects between simulation replicates. Essentially, it allows simulation
replicates to be divided into “batches”; all replicates in a given batch
will then share a certain set of objects. A common use case for this is
a simulation that involves using multiple methods to analyze a shared
dataset, and repeating this process over a number of dataset replicates.
This may be of interest if, for example, it is computationally expensive
to generate a simulated dataset.

To illustrate the use of `batch()`

using this example, we
first consider the following simulation:

```
sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
if (type=="est_mean") { return(mean(dat)) }
if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3)
sim %<>% set_script(function() {
dat <- create_data(n=100)
mu_hat <- est_mean(dat=dat, type=L$est)
return(list(
"mu_hat" = round(mu_hat,2),
"dat_1" = round(dat[1],2)
))
})
sim %<>% run()
#> Done. No errors or warnings detected.
```

From the `"dat_1"`

column of the results object (equal to
the first element of the `dat`

vector created in the
simulation script), we see that a unique dataset was created for each
simulation replicate:

```
sim$results[order(sim$results$rep_id),c(1:7)!=5]
#> sim_uid level_id rep_id est mu_hat dat_1
#> 1 1 1 1 est_mean 3.17 3.75
#> 4 2 2 1 est_median 3.13 2.50
#> 2 3 1 2 est_mean 3.06 4.36
#> 5 5 2 2 est_median 2.88 2.31
#> 3 4 1 3 est_mean 3.00 2.27
#> 6 6 2 3 est_median 2.97 2.63
```

Suppose that instead, we wish to analyze each simulated dataset using
multiple methods (in this case corresponding to `"est_mean"`

and `"est_median"`

), and repeat this procedure a total of
three times. We can do this using the `batch()`

function, as
follows:

```
sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
if (type=="est_mean") { return(mean(dat)) }
if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3, batch_levels=NULL)
sim %<>% set_script(function() {
batch({
dat <- create_data(n=100)
})
mu_hat <- est_mean(dat=dat, type=L$est)
return(list(
"mu_hat" = round(mu_hat,2),
"dat_1" = round(dat[1],2)
))
})
sim %<>% run()
#> Done. No errors or warnings detected.
```

In the code above, we changed two things. First, we added
`batch_levels=NULL`

to the `set_config()`

call;
this will be explained below. Second, we wrapped the code line
`dat <- create_data(n=100)`

inside the
`batch()`

function. Whatever code goes inside the
`batch()`

function will produce the same output for all
simulations in a batch.

```
sim$results[order(sim$results$rep_id),c(1:7)!=5]
#> sim_uid level_id rep_id est mu_hat dat_1
#> 1 1 1 1 est_mean 3.05 3.99
#> 4 2 2 1 est_median 2.98 3.99
#> 2 3 1 2 est_mean 3.09 1.72
#> 5 5 2 2 est_median 3.14 1.72
#> 3 4 1 3 est_mean 2.98 2.66
#> 6 6 2 3 est_median 2.90 2.66
```

In this case, from the `"dat_1"`

column of the results
object, we see that one dataset was created and shared by the batch
corresponding to `sim_uids`

1 and 2 (likewise for
`sim_uids`

{3,5} and {4,6}).

However, the situation is often more complicated. Suppose we have a
simulation with multiple levels, some that correspond to creating data
and some that correspond to analyzing the data. Here, the
`batch_levels`

argument of `set_config()`

plays a
role. Specifically, this argument should be a character vector equal to
the names of the simulation levels that are referenced (via the special
variable `L`

) from within a `batch()`

block. In
the example below, the levels `n`

and `mu`

are
used within the `batch()`

call, while the level
`est`

is not.

```
sim <- new_sim()
create_data <- function(n, mu) { rnorm(n=n, mean=mu) }
est_mean <- function(dat, type) {
if (type=="est_mean") { return(mean(dat)) }
if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(n=c(10,100), mu=c(3,5), est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=2, batch_levels=c("n", "mu"), return_batch_id=T)
sim %<>% set_script(function() {
batch({
dat <- create_data(n=L$n, mu=L$mu)
})
mu_hat <- est_mean(dat=dat, type=L$est)
return(list(
"mu_hat" = round(mu_hat,2),
"dat_1" = round(dat[1],2)
))
})
sim %<>% run()
#> Done. No errors or warnings detected.
sim$results[order(sim$results$batch_id),c(1:10)!=8]
#> sim_uid level_id rep_id batch_id n mu est mu_hat dat_1
#> 1 1 1 1 1 10 3 est_mean 3.11 4.88
#> 9 5 5 1 1 10 3 est_median 2.93 4.88
#> 2 9 1 2 2 10 3 est_mean 2.79 1.65
#> 10 13 5 2 2 10 3 est_median 2.35 1.65
#> 3 2 2 1 3 100 3 est_mean 3.02 3.83
#> 11 6 6 1 3 100 3 est_median 3.00 3.83
#> 4 10 2 2 4 100 3 est_mean 3.10 2.81
#> 12 14 6 2 4 100 3 est_median 3.06 2.81
#> 5 3 3 1 5 10 5 est_mean 4.98 5.83
#> 13 7 7 1 5 10 5 est_median 5.03 5.83
#> 6 11 3 2 6 10 5 est_mean 5.36 7.78
#> 14 15 7 2 6 10 5 est_median 5.41 7.78
#> 7 4 4 1 7 100 5 est_mean 5.08 4.56
#> 15 8 8 1 7 100 5 est_median 4.99 4.56
#> 8 12 4 2 8 100 5 est_mean 5.30 5.69
#> 16 16 8 2 8 100 5 est_median 5.26 5.69
```

The batches were created such that each batch contained two
replicates, one for each level value of `est`

. For expository
purposes, we also specified the `return_batch_id=T`

option in
`set_config()`

so that the results object would return the
`batch_id`

. This is not necessary in practice. The
`batch_id`

variable defines the batches; all simulations that
share the same `batch_id`

are in a single batch. The
`return_batch_id=T`

option can be useful to ensure correct
usage of the `batch()`

function.

We note the following about the `batch()`

function:

- The code within the
`batch()`

code block must*only*create objects; this code should not change or delete existing objects, as these changes will be ignored. - In the majority of cases, the
`batch()`

function will be called just once, at the beginning of the simulation script. However, it can be used anywhere in the script and can be called multiple times. The`batch()`

function should never be used outside of the simulation script. - Although we have illustrated the use of the
`batch()`

function to create a dataset to share between multiple simulation replicates, it can be used for much more, such as taking a sample from an existing dataset or computing shared nuisance function estimators. - If the simulation is being run in parallel (either locally or on a
CCS),
`n_cores`

cannot exceed the number of batches, since all simulations within a batch must run on the same core. - If the simulation script uses the
`batch()`

function, the simulation cannot be updated using the`update_sim()`

or`update_sim_on_cluster()`

functions, with the exception of updates that only entail removing simulation replicates.

In statistical research, it is desirable to be able to reproduce the
exact results of a simulation study. Since R code often involves
stochastic (random) functions like `rnorm()`

or
`sample()`

that return different values when called multiple
times, reproducibility is not guaranteed. In a simple R script, calling
the `set.seed()`

function at the beginning of the script
ensures that the stochastic functions that follow will produce the same
results whenever the script is run. However, a more nuanced strategy is
needed when running simulations. When running 100 replicates of the same
simulation, we do not want each replicate to return identical results;
rather, we would like for each replicate to be different from one
another, but for *the entire set of replicates* to be the same
when the entire simulation is run twice in a row.
**SimEngine** manages this process, even when simulations
are being run in parallel locally or on a cluster computing system. In
**SimEngine**, a single “global seed” is used to generate a
different seed for each simulation replicate. The
`set_config()`

function is used to set or change this global
seed:

If a seed is not set using `set_config()`

,
**SimEngine** will set a random seed automatically so that
the results can be replicated if desired. To view this seed, we use the
`vars()`

function:

In the simulation coding workflow, errors are inevitable. Some errors
may affect all simulation replicates, while other errors may only affect
a subset of replicates. By default, when a simulation is run,
**SimEngine** will not stop if an error occurs; instead,
errors are logged and stored in a dataframe along with information about
the simulation replicates that resulted in those errors. Examining this
dataframe by typing `print(sim$errors)`

can sometimes help to
quickly pinpoint the issue. This is demonstrated below:

```
sim <- new_sim()
sim %<>% set_config(num_sim=2)
sim %<>% set_levels(
Sigma = list(
s1 = list(mtx=matrix(c(3,1,1,2), nrow=2)),
s3 = list(mtx=matrix(c(4,3,3,9), nrow=2)),
s2 = list(mtx=matrix(c(1,2,2,1), nrow=2)),
s4 = list(mtx=matrix(c(8,2,2,6), nrow=2))
)
)
sim %<>% set_script(function() {
x <- MASS::mvrnorm(n=1, mu=c(0,0), Sigma=L$Sigma$mtx)
return(list(x1=x[1], x2=x[2]))
})
sim %<>% run()
#> Done. Errors detected in 25% of simulation replicates. Warnings detected in 0% of simulation replicates.
print(sim$errors)
#> sim_uid level_id rep_id Sigma runtime message
#> 1 5 3 1 s2 0.0002849102 'Sigma' is not positive definite
#> 2 6 3 2 s2 0.0002381802 'Sigma' is not positive definite
#> call
#> 1 MASS::mvrnorm(n = 1, mu = c(0, 0), Sigma = L$Sigma$mtx)
#> 2 MASS::mvrnorm(n = 1, mu = c(0, 0), Sigma = L$Sigma$mtx)
```

From the output above, we see that the code fails for the simulation
replicates that use the level with `Sigma="s2"`

because it
uses an invalid (not positive definite) covariance matrix. Similarly, if
a simulation involves replicates that throw warnings, all warnings are
logged and stored in the dataframe `sim$warnings`

.

The workflow demonstrated above can be useful to pinpoint errors, but
it has two main drawbacks. First, it is undesirable to run a
time-consuming simulation involving hundreds or thousands of replicates,
only to find at the end that every replicate failed because of a typo.
It may therefore useful to stop an entire simulation after a single
error has occurred. Second, it can sometimes be difficult to determine
exactly what caused an error without making use of more advanced
debugging tools. For both of these situations,
**SimEngine** includes the following configuration
option:

Setting `stop_at_error=TRUE`

will stop the simulation when
it encounters any error. Furthermore, the error will be thrown by R in
the usual way, so if the simulation is being run in in RStudio, the
built-in debugging tools (such as “Show Traceback” and “Rerun with
debug”) can be used to find and fix the bug. Placing a call to
`browser()`

at the top of the simulation script can also be
useful for debugging.

Statistical simulations are often based on the principle of Monte
Carlo approximation; specifically, pseudo-random sampling is used to
evaluate the performance of a statistical procedure under a particular
data-generating process. The performance of the procedure can be viewed
as a statistical parameter and, due to the fact that only a finite
number of simulation replicates can be performed, there is uncertainty
in any estimate of performance. This uncertainty is often referred to as
*Monte Carlo error*. We can quantify Monte Carlo error using, for
example, the standard error of the performance estimator.

Measuring and reporting Monte Carlo error is a vital component of a
simulation study. **SimEngine** includes an option in the
`summarize()`

function to automatically estimate the Monte
Carlo standard error for any inferential summary statistic, e.g.,
estimator bias or confidence interval coverage. The standard error
estimates are based on the formulas provided in Morris et al. 2019. If
the option `mc_se`

is set to `TRUE`

, estimates of
Monte Carlo standard error will be included in the summary data frame,
along with associated 95% confidence intervals based on a normal
approximation.

```
sim <- new_sim()
create_data <- function(n) { rpois(n, lambda=5) }
est_mean <- function(dat) {
return(mean(dat))
}
sim %<>% set_levels(n=c(10,100,1000))
sim %<>% set_config(num_sim=5)
sim %<>% set_script(function() {
dat <- create_data(L$n)
lambda_hat <- est_mean(dat=dat)
return (list("lambda_hat"=lambda_hat))
})
sim %<>% run()
#> Done. No errors or warnings detected.
sim %>% summarize(
list(stat="mse", name="lambda_mse", estimate="lambda_hat", truth=5),
mc_se = TRUE
)
#> level_id n n_reps lambda_mse lambda_mse_mc_se lambda_mse_mc_ci_l
#> 1 1 10 5 0.602000 0.4280700877 -0.237017372
#> 2 2 100 5 0.052200 0.0234607545 0.006216921
#> 3 3 1000 5 0.002099 0.0005030319 0.001113057
#> lambda_mse_mc_ci_u
#> 1 1.441017372
#> 2 0.098183079
#> 3 0.003084943
```