Simulate Correlated Variables

Lisa DeBruine

2020-09-21

library(ggplot2)
library(dplyr)
library(tidyr)
library(faux)

The rnorm_multi() function makes multiple normally distributed vectors with specified parameters and relationships.

Quick example

For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.


dat <- rnorm_multi(n = 100, 
                  mu = c(0, 20, 20),
                  sd = c(1, 5, 5),
                  r = c(0.5, 0.5, 0.25), 
                  varnames = c("A", "B", "C"),
                  empirical = FALSE)
#> The number of variables (vars) was guessed from the input to be 3
Sample stats
n    var  A     B     C     mean   sd
100  A    1.00  0.49  0.51  -0.04  1.04
100  B    0.49  1.00  0.19  19.95  4.91
100  C    0.51  0.19  1.00  19.64  4.61
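
One way to read the specification above is as a covariance matrix: each pairwise covariance is the correlation multiplied by the two SDs. The base R sketch below builds that matrix from the values in the quick example. The object names (sds, R, Sigma) are illustrative, and this is not necessarily how rnorm_multi() works internally.

```r
# Population parameters from the quick example (names are illustrative)
sds <- c(A = 1, B = 5, C = 5)

# Correlation matrix: A-B = 0.5, A-C = 0.5, B-C = 0.25
R <- matrix(c(1,   0.5, 0.5,
              0.5, 1,   0.25,
              0.5, 0.25, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(names(sds), names(sds)))

# Covariance = correlation scaled by the SDs of both variables
Sigma <- diag(sds) %*% R %*% diag(sds)
dimnames(Sigma) <- dimnames(R)

Sigma["A", "B"]  # 0.5 * 1 * 5 = 2.5
Sigma["B", "C"]  # 0.25 * 5 * 5 = 6.25
```

The diagonal of Sigma holds the variances (1, 25, 25), which are the squared SDs.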

Specify correlations

You can specify the correlations in one of four ways:

One Number

If you want all the pairs to have the same correlation, just specify a single number.

bvn <- rnorm_multi(100, 5, 0, 1, .3, varnames = letters[1:5])
Sample stats from a single rho
n    var  a     b     c     d     e     mean  sd
100  a    1.00  0.18  0.29  0.33  0.31  0.04  1.03
100  b    0.18  1.00  0.18  0.33  0.30  0.13  1.06
100  c    0.29  0.18  1.00  0.14  0.20  0.07  0.99
100  d    0.33  0.33  0.14  1.00  0.28  0.15  1.06
100  e    0.31  0.30  0.20  0.28  1.00  0.03  1.03
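
A single number stands for a correlation matrix in which every off-diagonal cell has that value. The base R sketch below builds that matrix for the example above and checks that it is positive definite, which a correlation matrix needs to be before data can be simulated from it. When vars variables share one correlation, that requires r > -1/(vars - 1). The names vars, rho, cmat, and lower_bound are illustrative.

```r
vars <- 5
rho <- 0.3

# Every pair shares the same correlation; the diagonal is 1
cmat <- matrix(rho, nrow = vars, ncol = vars,
               dimnames = list(letters[1:vars], letters[1:vars]))
diag(cmat) <- 1

# All eigenvalues must be positive for a usable correlation matrix
ev <- eigen(cmat, symmetric = TRUE, only.values = TRUE)$values
all(ev > 0)

# Smallest shared correlation that still works for this many variables
lower_bound <- -1 / (vars - 1)
```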

Matrix

If you already have a correlation matrix, such as the output of cor(), you can pass it directly as the r argument.

cmat <- cor(iris[,1:4])
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = colnames(cmat))
Sample stats from a correlation matrix
n    var           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  mean  sd
100  Petal.Length  0.87          -0.58        1.00          0.96         0.04  1.03
100  Petal.Width   0.82          -0.52        0.96          1.00         0.05  1.04
100  Sepal.Length  1.00          -0.24        0.87          0.82         0.09  0.98
100  Sepal.Width   -0.24         1.00         -0.58         -0.52        0.07  1.08
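
Not every symmetric matrix with 1s on the diagonal is a possible correlation matrix, so it can help to check a hand-built matrix before using it. The helper below, is_valid_cormat(), is a hypothetical base R sketch written for this illustration and is not part of faux. It tests for a square, symmetric matrix with a unit diagonal and positive eigenvalues.

```r
# Hypothetical helper: TRUE if m looks like a usable correlation matrix
is_valid_cormat <- function(m, tol = 1e-8) {
  is.matrix(m) &&
    nrow(m) == ncol(m) &&
    isSymmetric(unname(m), tol = tol) &&
    all(abs(diag(m) - 1) < tol) &&
    all(eigen(m, symmetric = TRUE, only.values = TRUE)$values > tol)
}

# The output of cor() passes the check
good <- cor(iris[, 1:4])
is_valid_cormat(good)

# These three pairwise correlations cannot all hold at once
bad <- matrix(c( 1,   0.9, -0.9,
                 0.9, 1,    0.9,
                -0.9, 0.9,  1),
              nrow = 3, byrow = TRUE)
is_valid_cormat(bad)
```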

Vector (vars*vars)

You can specify your correlation matrix by hand as a vector of length vars*vars, listing every cell of the matrix, including the 1s on the diagonal.

cmat <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)
bvn <- rnorm_multi(100, 3, 0, 1, cmat, 
                  varnames = c("first", "second", "third"))
Sample stats from a vars*vars vector
n    var     first  second  third  mean   sd
100  first   1.00   0.31    0.48   0.05   1.02
100  second  0.31   1.00    0.01   -0.14  0.86
100  third   0.48   0.01    1.00   0.02   1.12
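
To see the matrix that such a vector describes, it can be reshaped in base R. This is a minimal sketch with illustrative names (cors, vars, nm, cmat_full), assuming the vector is read one matrix row at a time; because a correlation matrix is symmetric, reading it by column gives the same result.

```r
cors <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)

# The vector length must be a perfect square: vars * vars
vars <- sqrt(length(cors))
stopifnot(vars == round(vars))

nm <- c("first", "second", "third")
cmat_full <- matrix(cors, nrow = vars, byrow = TRUE,
                    dimnames = list(nm, nm))

# A typo in one cell would usually show up as a failed symmetry check
isSymmetric(cmat_full)
```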

Vector (vars*(vars-1)/2)

You can specify your correlation matrix by hand as a vector of length vars*(vars-1)/2, listing only the upper-right triangle row by row and skipping the diagonal and the duplicate values in the lower left.

rho1_2 <- .3
rho1_3 <- .5
rho1_4 <- .5
rho2_3 <- .2
rho2_4 <- 0
rho3_4 <- -.3
cmat <- c(rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4)
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = letters[1:4])
Sample stats from a (vars*(vars-1)/2) vector
n    var  a     b      c      d      mean   sd
100  a    1.00  0.29   0.61   0.41   -0.10  1.06
100  b    0.29  1.00   0.23   -0.03  0.09   1.14
100  c    0.61  0.23   1.00   -0.28  0.08   1.17
100  d    0.41  -0.03  -0.28  1.00   -0.12  0.97
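
The ordering of the short vector matters, so here is a base R sketch of how six values in the order 1-2, 1-3, 1-4, 2-3, 2-4, 3-4 expand into a full 4 x 4 matrix. The names tri, vars, and m are illustrative. The sketch relies on the fact that R fills the lower triangle column by column, which visits the cells in the same order as reading the upper triangle row by row.

```r
# Upper-right triangle, row by row: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4
tri <- c(.3, .5, .5, .2, 0, -.3)
vars <- 4
stopifnot(length(tri) == vars * (vars - 1) / 2)

m <- matrix(0, nrow = vars, ncol = vars,
            dimnames = list(letters[1:vars], letters[1:vars]))

# Column-wise fill of the lower triangle: (2,1), (3,1), (4,1), (3,2), (4,2), (4,3)
m[lower.tri(m)] <- tri

# Mirror into the upper triangle, then set the diagonal
m <- m + t(m)
diag(m) <- 1
m
```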

empirical

If you want your samples to have the exact correlations, means, and SDs you entered, set empirical to TRUE.

bvn <- rnorm_multi(100, 5, 0, 1, .3, 
                  varnames = letters[1:5], 
                  empirical = TRUE)
Sample stats with empirical = TRUE
n    var  a    b    c    d    e    mean  sd
100  a    1.0  0.3  0.3  0.3  0.3  0     1
100  b    0.3  1.0  0.3  0.3  0.3  0     1
100  c    0.3  0.3  1.0  0.3  0.3  0     1
100  d    0.3  0.3  0.3  1.0  0.3  0     1
100  e    0.3  0.3  0.3  0.3  1.0  0     1
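
The difference between the two settings can also be seen with MASS::mvrnorm(), which has its own empirical argument. The sketch below is an illustration of the concept rather than a description of how faux implements it. With empirical = FALSE the parameters describe the population, so sample statistics vary around them; with empirical = TRUE they describe the sample itself. The names R, x_pop, and x_emp are illustrative.

```r
set.seed(1)

# Five variables that all correlate 0.3 with each other
R <- matrix(0.3, nrow = 5, ncol = 5)
diag(R) <- 1

# Parameters describe the population: sample values differ somewhat
x_pop <- MASS::mvrnorm(100, mu = rep(0, 5), Sigma = R, empirical = FALSE)

# Parameters describe the sample: values are reproduced exactly
x_emp <- MASS::mvrnorm(100, mu = rep(0, 5), Sigma = R, empirical = TRUE)

round(cor(x_pop), 2)
round(cor(x_emp), 2)
```

Exact sample statistics are convenient for demonstrations, but they remove the sampling variability that a power simulation usually needs.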

Pre-existing variable

Use rnorm_pre() to create a vector with a specified correlation to a pre-existing variable. The following code creates two vectors, sl.5.v1 and sl.5.v2, each drawn with a mean of 10, SD of 2 and a correlation of r = 0.5 to the Sepal.Length column in the built-in dataset iris.

sl <- iris$Sepal.Length

sl.5.v1 <- rnorm_pre(sl, mu = 10, sd = 2, r = 0.5)
sl.5.v2 <- rnorm_pre(sl, mu = 10, sd = 2, r = 0.5)
rnorm_pre
n    var      sl    sl.5.v1  sl.5.v2  mean   sd
150  sl       1.00  0.45     0.49     5.84   0.83
150  sl.5.v1  0.45  1.00     0.17     10.12  2.27
150  sl.5.v2  0.49  0.17     1.00     10.05  2.05

Set empirical = TRUE to return a vector with the exact specified parameters.

sl.5.v1 <- rnorm_pre(sl, mu = 10, sd = 2, r = 0.5, empirical = TRUE)
sl.5.v2 <- rnorm_pre(sl, mu = 10, sd = 2, r = 0.5, empirical = TRUE)
rnorm_pre with empirical = TRUE
n    var      sl   sl.5.v1  sl.5.v2  mean   sd
150  sl       1.0  0.5      0.5      5.84   0.83
150  sl.5.v1  0.5  1.0      0.3      10.00  2.00
150  sl.5.v2  0.5  0.3      1.0      10.00  2.00
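
For intuition about how a vector can hit an exact sample correlation with an existing variable, here is a base R sketch. It is one standard construction and is not claimed to be the method rnorm_pre() uses. It removes from random noise whatever is linearly related to x, standardizes both parts, and mixes them with weights r and sqrt(1 - r^2). All object names are illustrative.

```r
set.seed(8675309)

x <- iris$Sepal.Length
r <- 0.5

# Random noise, with its linear relationship to x removed
noise <- rnorm(length(x))
e <- resid(lm(noise ~ x))

# Standardize both parts to mean 0 and SD 1 in this sample
zx <- as.vector(scale(x))
ze <- as.vector(scale(e))

# The weights give the mix an SD of 1 and a correlation of r with x
z <- r * zx + sqrt(1 - r^2) * ze

# Rescale to the target mean and SD
y <- 10 + 2 * z

c(r = cor(x, y), mean = mean(y), sd = sd(y))
```

Running the sketch again with a different seed gives a different y with the same correlation to x, which is why two such vectors need not correlate with each other at any particular value.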