It is useful to be able to simulate data with a specified structure. The `faux`

package provides some functions to make this process easier. See the package website for more details.

You can install the released version of faux from CRAN with:

And the development version from GitHub with:

This function creates a dataset with a specific between- and/or within-subjects design. see vignette

For example, the following creates a 2w*2b design with 100 observations in each cell. The between-subject factor is `pet`

with two levels (`cat`

and `dog`

). The within-subject factor is `time`

with two levels (`day`

and `night`

). The mean for the `cat_day`

cell is 10, the mean for the `cat_night`

cell is 20, the mean for the `dog_day`

cell is 15, and the mean for the `dog_night`

cell is 25. All cells have a SD of 5 and all within-subject cells are correlated `r = 0.5`

. The resulting data has exactly these values (set `empirical = FALSE`

to sample from a population with these values). Set `plot = TRUE`

to show a plot of means and SDs.

```
between <- list(pet = c(cat = "Cat Owners",
dog = "Dog Owners"))
within <- list(time = c("morning", "noon", "evening", "night"))
mu <- data.frame(
cat = c(10, 12, 14, 16),
dog = c(10, 15, 20, 25),
row.names = within$time
)
df <- sim_design(within, between,
n = 100, mu = mu, sd = 5, r = .5,
empirical = TRUE, plot = TRUE)
```

pet | n | var | morning | noon | evening | night | mean | sd |
---|---|---|---|---|---|---|---|---|

cat | 100 | evening | 0.5 | 0.5 | 1.0 | 0.5 | 14 | 5 |

cat | 100 | morning | 1.0 | 0.5 | 0.5 | 0.5 | 10 | 5 |

cat | 100 | night | 0.5 | 0.5 | 0.5 | 1.0 | 16 | 5 |

cat | 100 | noon | 0.5 | 1.0 | 0.5 | 0.5 | 12 | 5 |

dog | 100 | evening | 0.5 | 0.5 | 1.0 | 0.5 | 20 | 5 |

dog | 100 | morning | 1.0 | 0.5 | 0.5 | 0.5 | 10 | 5 |

dog | 100 | night | 0.5 | 0.5 | 0.5 | 1.0 | 25 | 5 |

dog | 100 | noon | 0.5 | 1.0 | 0.5 | 0.5 | 15 | 5 |

You can plot the data from `sim_design()`

and swap the factor visualisations. see vignette

```
p1 <- plot_design(df)
p2 <- plot_design(df, "pet", "time")
cowplot::plot_grid(p1, p2, nrow = 2, align = "v")
```

This function produces a data table with the same distributions and correlations as an existing data table. It only returns numeric columns and simulates all numeric variables from a continuous normal distribution (for now). see vignette

For example, the following code creates a new sample from the built-in dataset `iris`

with 50 observations of each species.

This function produces a data table for a basic cross-classified design with random intercepts for subjects and items.

For example, the following code produces the data for 100 subjects responding to 50 items where the response has an overall mean (`grand_i`

) of 10. Subjects vary in their average response with an SD of 1, items vary in their average response with an SD of 2, and the residual error term has an SD of 3.

```
dat <- sim_mixed_cc(
sub_n = 100, # subject sample size
item_n = 50, # item sample size
grand_i = 10, # overall mean of the score
sub_sd = 1, # SD of subject random intercepts
item_sd = 2, # SD of item random intercepts
error_sd = 3 # SD of residual error
)
```

You can then see how changing these numbers affects the random effects in an intercept-only mixed effects model.

```
lme4::lmer(y ~ 1 + (1 | sub_id) + (1 | item_id), data = dat) %>%
broom.mixed::tidy() %>%
knitr::kable(digits = 2)
#> Registered S3 method overwritten by 'broom.mixed':
#> method from
#> tidy.gamlss broom
```

effect | group | term | estimate | std.error | statistic |
---|---|---|---|---|---|

fixed | NA | (Intercept) | 9.79 | 0.28 | 34.83 |

ran_pars | sub_id | sd__(Intercept) | 0.99 | NA | NA |

ran_pars | item_id | sd__(Intercept) | 1.84 | NA | NA |

ran_pars | Residual | sd__Observation | 2.95 | NA | NA |

This function uses `lme4::lmer()`

to get subject, item and error SDs from an existing dataset and simulates a new dataset with the specified number of subjects and items with distributions drawn from the example data.

```
new_dat <- sim_mixed_df(fr4,
sub_n = 100,
item_n = 50,
dv = "rating",
sub_id = "rater_id",
item_id = "face_id")
```

This function makes multiple normally distributed vectors with specified parameters and relationships. see vignette

For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.

```
dat <- rnorm_multi(
n = 100,
mu = c(0, 20, 20),
sd = c(1, 5, 5),
r = c(0.5, 0.5, 0.25),
varnames = c("A", "B", "C"),
empirical = FALSE
)
#> [32mThe number of variables (vars) was guessed from the input to be 3[39m
```

n | var | A | B | C | mean | sd |
---|---|---|---|---|---|---|

100 | A | 1.00 | 0.53 | 0.55 | 0.02 | 1.08 |

100 | B | 0.53 | 1.00 | 0.29 | 20.18 | 5.15 |

100 | C | 0.55 | 0.29 | 1.00 | 20.18 | 5.25 |

This function creates a vector that has a specified correlation with an existing vector.

```
# create a pre-existing vector x
x <- rnorm(100, 0, 1)
# create a vector y with exactly mean=0, sd=1, and r(x,y)=0.5
y <- rnorm_pre(x, mu = 0, sd = 1, r = 0.5, empirical = TRUE)
list(
mean = mean(y),
sd = sd(y),
r = cor(x,y)
) %>% str()
#> List of 3
#> $ mean: num -1.24e-17
#> $ sd : num 1
#> $ r : num 0.5
```

If `empirical = FALSE`

(the default), this resulting vector is sampled from a population with the specified parameters (but won’t have *exactly* those properties).

Sometimes you want to mess up a dataset for teaching (thanks for the idea, Emily!). The `messy()`

function will replace `prop`

proportion of the data in the specified columns with the value of `replace`

(defaults to `NA`

).

```
# replace 10% of Species with NA
iris2 <- messy(iris, 0.1, "Species")
# replace 10% of petal.Width adn Sepal.Width with NA
iris3 <- messy(iris, 0.1, "Petal.Width", "Sepal.Width")
# replace 50% of columns 1-2 with NA
iris4 <- messy(iris, 0.5, 1:2)
# replace 50% of Species with "NOPE"
iris5 <- messy(iris, 0.5, "Species", replace = "NOPE")
```

If you want to check your simulated stats or just describe an existing dataset, use `get_params()`

.

n | var | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | mean | sd |
---|---|---|---|---|---|---|---|

150 | Petal.Length | 0.87 | -0.43 | 1.00 | 0.96 | 3.76 | 1.77 |

150 | Petal.Width | 0.82 | -0.37 | 0.96 | 1.00 | 1.20 | 0.76 |

150 | Sepal.Length | 1.00 | -0.12 | 0.87 | 0.82 | 5.84 | 0.83 |

150 | Sepal.Width | -0.12 | 1.00 | -0.43 | -0.37 | 3.06 | 0.44 |

You can also group your data and change the digits to round.

Species | n | var | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | mean | sd |
---|---|---|---|---|---|---|---|---|

setosa | 50 | Petal.Length | 0.27 | 0.18 | 1.00 | 0.33 | 1.46 | 0.17 |

setosa | 50 | Petal.Width | 0.28 | 0.23 | 0.33 | 1.00 | 0.25 | 0.11 |

setosa | 50 | Sepal.Length | 1.00 | 0.74 | 0.27 | 0.28 | 5.01 | 0.35 |

setosa | 50 | Sepal.Width | 0.74 | 1.00 | 0.18 | 0.23 | 3.43 | 0.38 |

versicolor | 50 | Petal.Length | 0.75 | 0.56 | 1.00 | 0.79 | 4.26 | 0.47 |

versicolor | 50 | Petal.Width | 0.55 | 0.66 | 0.79 | 1.00 | 1.33 | 0.20 |

versicolor | 50 | Sepal.Length | 1.00 | 0.53 | 0.75 | 0.55 | 5.94 | 0.52 |

versicolor | 50 | Sepal.Width | 0.53 | 1.00 | 0.56 | 0.66 | 2.77 | 0.31 |

virginica | 50 | Petal.Length | 0.86 | 0.40 | 1.00 | 0.32 | 5.55 | 0.55 |

virginica | 50 | Petal.Width | 0.28 | 0.54 | 0.32 | 1.00 | 2.03 | 0.27 |

virginica | 50 | Sepal.Length | 1.00 | 0.46 | 0.86 | 0.28 | 6.59 | 0.64 |

virginica | 50 | Sepal.Width | 0.46 | 1.00 | 0.40 | 0.54 | 2.97 | 0.32 |

It is useful for IDs for random effects (e.g., subjects or stimuli) to be character strings (so you don’t accidentally include them as fixed effects) with the same length s(o you can sort them in order like S01, S02,…, S10 rather than S1, S10, S2, …) This function returns a list of IDs that have the same string length and a specified prefix.

```
make_id(n = 10, prefix = "ITEM_")
#> [1] "ITEM_01" "ITEM_02" "ITEM_03" "ITEM_04" "ITEM_05" "ITEM_06" "ITEM_07"
#> [8] "ITEM_08" "ITEM_09" "ITEM_10"
```

You can also manually set the number of digits and set `n`

to a range of integers.

```
make_id(n = 10:20, digits = 3)
#> [1] "S010" "S011" "S012" "S013" "S014" "S015" "S016" "S017" "S018" "S019"
#> [11] "S020"
```

Convert a data table made with faux from long to wide.

```
between <- list("pet" = c("cat", "dog"))
within <- list("time" = c("day", "night"))
df_long <- sim_design(within, between, long = TRUE, plot = FALSE)
df_wide <- long2wide(df_long)
```

id | pet | day | night | |
---|---|---|---|---|

1.S001.1 | S001 | cat | 0.5230799 | 0.5415126 |

1.S002.1 | S002 | cat | -1.2129722 | -0.6874280 |

1.S003.1 | S003 | cat | -0.5516822 | -0.7844739 |

1.S004.1 | S004 | cat | -0.6098251 | -0.6234040 |

1.S005.1 | S005 | cat | 0.2853363 | 2.3027620 |

1.S006.1 | S006 | cat | -0.2490935 | -0.6530662 |

If you have a data table not made by faux, you need to specify the within-subject columns, the between-subject columns, the DV column, and the ID column.

```
# make a long data table
df_long <- expand.grid(
sub_id = 1:10,
A = c("A1", "A2"),
B = c("B1", "B2")
)
df_long$C <- rep(c("C1", "C2"), 20)
df_long$score <- rnorm(40)
# convert it to wide
df_wide <- long2wide(df_long, within = c("A", "B"),
between = "C", dv = "score", id = "sub_id")
```

sub_id | C | A1_B1 | A2_B1 | A1_B2 | A2_B2 |
---|---|---|---|---|---|

1 | C1 | -0.3782026 | 0.0048040 | -0.5674500 | 0.6154273 |

2 | C2 | 0.5338898 | -1.6407051 | 1.7815958 | -2.1325543 |

3 | C1 | 1.2792787 | 1.4960844 | 0.6207239 | 1.7913821 |

4 | C2 | 0.1885384 | -0.0884940 | -0.7397000 | 0.0295121 |

5 | C1 | -0.3090509 | -0.3942107 | 0.9916406 | -0.2951152 |

6 | C2 | 1.4168226 | 0.8341336 | 0.0634940 | -1.1571559 |

You can convert a data table made by faux from wide to long easily.

```
between <- list("pet" = c("cat", "dog"))
within <- list("time" = c("day", "night"))
df_wide <- sim_design(within, between, long = FALSE, plot = FALSE)
df_long <- wide2long(df_wide)
```

id | pet | time | y | |
---|---|---|---|---|

S001.1 | S001 | cat | day | -0.9012830 |

S002.1 | S002 | cat | day | -0.3205931 |

S003.1 | S003 | cat | day | -0.7733501 |

S004.1 | S004 | cat | day | -1.1861210 |

S005.1 | S005 | cat | day | -1.0848848 |

S006.1 | S006 | cat | day | -0.4888620 |

If you have a data table not made by faux, you need to specify the within-subject factors and columns, and specify the names of the ID and DV columns to create.

If column names are combinations of factor levels (e.g., A1_B1, A1_B2, A2_B1, A2_B2), then you can specify the regex pattern to separate them with the argument `sep`

(which defaults to `_`

).

```
long_iris <- wide2long(
iris,
within_factors = c("feature", "dimension"),
within_cols = 1:4,
dv = "value",
id = "flower_id",
sep = "\\."
)
```

flower_id | Species | feature | dimension | value | |
---|---|---|---|---|---|

S001.1 | S001 | setosa | Sepal | Length | 5.1 |

S002.1 | S002 | setosa | Sepal | Length | 4.9 |

S003.1 | S003 | setosa | Sepal | Length | 4.7 |

S004.1 | S004 | setosa | Sepal | Length | 4.6 |

S005.1 | S005 | setosa | Sepal | Length | 5.0 |

S006.1 | S006 | setosa | Sepal | Length | 5.4 |

If you have a data table in long format, you can recover the design from it by specifying the dv and id columns (assuming all other columns are within- or between-subject factors).

Then you can use `json_design()`

to save the design to a file or view it in JSON format (condensed or pretty).

{"within":{"feature":{"Sepal":"Sepal","Petal":"Petal"},"dimension":{"Length":"Length","Width":"Width"}},"between":{"Species":{"setosa":"setosa","versicolor":"versicolor","virginica":"virginica"}},"dv":{"value":"value"},"id":{"flower_id":"flower_id"},"n":{"setosa":50,"versicolor":50,"virginica":50},"mu":{"setosa":{"Sepal_Length":5.006,"Sepal_Width":3.428,"Petal_Length":1.462,"Petal_Width":0.246},"versicolor":{"Sepal_Length":5.936,"Sepal_Width":2.77,"Petal_Length":4.26,"Petal_Width":1.326},"virginica":{"Sepal_Length":6.588,"Sepal_Width":2.974,"Petal_Length":5.552,"Petal_Width":2.026}},"sd":{"setosa":{"Sepal_Length":0.35248969,"Sepal_Width":0.37906437,"Petal_Length":0.173664,"Petal_Width":0.10538559},"versicolor":{"Sepal_Length":0.51617115,"Sepal_Width":0.31379832,"Petal_Length":0.46991098,"Petal_Width":0.19775268},"virginica":{"Sepal_Length":0.63587959,"Sepal_Width":0.32249664,"Petal_Length":0.5518947,"Petal_Width":0.27465006}},"r":{"setosa":[[1,0.74254669,0.26717576,0.27809835],[0.74254669,1,0.17769997,0.23275201],[0.26717576,0.17769997,1,0.33163004],[0.27809835,0.23275201,0.33163004,1]],"versicolor":[[1,0.52591072,0.75404896,0.54646107],[0.52591072,1,0.56052209,0.66399872],[0.75404896,0.56052209,1,0.78666809],[0.54646107,0.66399872,0.78666809,1]],"virginica":[[1,0.45722782,0.86422473,0.28110771],[0.45722782,1,0.40104458,0.53772803],[0.86422473,0.40104458,1,0.32210822],[0.28110771,0.53772803,0.32210822,1]]}}

{ "within": { "feature": { "Sepal": "Sepal", "Petal": "Petal" }, "dimension": { "Length": "Length", "Width": "Width" } }, "between": { "Species": { "setosa": "setosa", "versicolor": "versicolor", "virginica": "virginica" } }, "dv": { "value": "value" }, "id": { "flower_id": "flower_id" }, "n": { "setosa": 50, "versicolor": 50, "virginica": 50 }, "mu": { "setosa": { "Sepal_Length": 5.006, "Sepal_Width": 3.428, "Petal_Length": 1.462, "Petal_Width": 0.246 }, "versicolor": { "Sepal_Length": 5.936, "Sepal_Width": 2.77, "Petal_Length": 4.26, "Petal_Width": 1.326 }, "virginica": { "Sepal_Length": 6.588, "Sepal_Width": 2.974, "Petal_Length": 5.552, "Petal_Width": 2.026 } }, "sd": { "setosa": { "Sepal_Length": 0.35248969, "Sepal_Width": 0.37906437, "Petal_Length": 0.173664, "Petal_Width": 0.10538559 }, "versicolor": { "Sepal_Length": 0.51617115, "Sepal_Width": 0.31379832, "Petal_Length": 0.46991098, "Petal_Width": 0.19775268 }, "virginica": { "Sepal_Length": 0.63587959, "Sepal_Width": 0.32249664, "Petal_Length": 0.5518947, "Petal_Width": 0.27465006 } }, "r": { "setosa": [ [1, 0.74254669, 0.26717576, 0.27809835], [0.74254669, 1, 0.17769997, 0.23275201], [0.26717576, 0.17769997, 1, 0.33163004], [0.27809835, 0.23275201, 0.33163004, 1] ], "versicolor": [ [1, 0.52591072, 0.75404896, 0.54646107], [0.52591072, 1, 0.56052209, 0.66399872], [0.75404896, 0.56052209, 1, 0.78666809], [0.54646107, 0.66399872, 0.78666809, 1] ], "virginica": [ [1, 0.45722782, 0.86422473, 0.28110771], [0.45722782, 1, 0.40104458, 0.53772803], [0.86422473, 0.40104458, 1, 0.32210822], [0.28110771, 0.53772803, 0.32210822, 1] ] } }

Not all correlation matrices are possible. For example, if variables A and B are correlated with r = 1.0, then the correlation between A and C can only be exactly equal to the correlation between B and C.

The function `pos_def_limits()`

lets you know what the possible range of values is for the missing value in a correlation matrix with one missing value. The correlation values are entered just from the top right triangle of the matrix, with a single `NA`

for the missing value.

x |
---|

-0.42 |

x |
---|

0.74 |

For example, if r_{AB} = 0.8 and r_{AC} = 0.2, then -0.42 <= r_{BC} <= 0.74.

If you enter a correlation matrix that contains impossible combinations, your limits will be `NA`

.

x |
---|

NA |

x |
---|

NA |

If you have a full matrix and want to know if it is positive definite, you can use the following code:

```
matrix(c(1, .3, -.9, .2,
.3, 1, .4, .5,
-.9, .4, 1, .3,
.2, .5, .3, 1), 4) %>%
is_pos_def()
#> [1] FALSE
```

Please note that the [34m’faux’[39m project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.