Imagine we have an experiment in which **10 individuals** completed a task with **100 trials**. For each of the 1000 total trials, we measured two things, **V1** and **V2**, and our research aims at **investigating the link between these two variables**.

We will generate data using the `simulate_simpson()`

function from the `correlation`

package.

```
library(correlation)
<- simulate_simpson(n = 100, groups = 10) data
```

Now let’s visualize the two variables:

```
library(ggplot2)
ggplot(data, aes(x = V1, y = V2)) +
geom_point() +
geom_smooth(colour = "black", method = "lm", se = FALSE) +
theme_classic()
```

That seems pretty straightforward! It seems like there is a **negative correlation** between V1 and V2. Let’s test this.

`correlation(data)`

```
## # Correlation table (pearson-method)
##
## Parameter1 | Parameter2 | r | 95% CI | t(998) | p
## ---------------------------------------------------------------------
## V1 | V2 | -0.84 | [-0.86, -0.82] | -48.77 | < .001***
##
## p-value adjustment method: Holm (1979)
## Observations: 1000
```

Indeed, there is **strong, negative and significant correlation** between V1 and V2. Great, can we go ahead and **publish these results in PNAS**?

Hold on sunshine! Ever heard of something called the **Simpson’s Paradox**?

Let’s colour our datapoints by group (by individuals):

```
library(ggplot2)
ggplot(data, aes(x = V1, y = V2)) +
geom_point(aes(colour = Group)) +
geom_smooth(aes(colour = Group), method = "lm", se = FALSE) +
geom_smooth(colour = "black", method = "lm", se = FALSE) +
theme_classic()
```

Mmh, interesting. It seems like, for each subject, the relationship is different. The negative general trend seems to be created by **differences between the groups** and could be spurious!

**Multilevel (as in multi-group) correlations allow us to account for differences between groups**. It is based on a partialization of the group, entered as a random effect in a mixed linear regression.

You can compute them with the **correlations** package by setting the `multilevel`

argument to `TRUE`

.

`correlation(data, multilevel = TRUE)`

```
## Parameter1 | Parameter2 | r | CI | t(998) | p
## ------------------------------------------------------------------
## V1 | V2 | 0.50 | [0.45, 0.55] | 18.24 | < .001***
##
## Observations: 1000
```

**Dayum!** We were too hasty in our conclusions! Taking the group into account seems to be super important.

Note: In this simple case where only two variables are of interest, it would be of course best to directly proceed using a mixed regression model instead of correlations. That being said, the latter can be useful for exploratory analysis, when multiple variables are of interest, or in combination with a network or structural approach.