In the `uwIntroStats`

package, we have set out to make regression and analysis easier by:

- allowing the user to specify any type of regression from one function
- allowing the user to specify multiple-partial F-tests
- displaying output in a more intuitive fashion than base R

This function is `regress()`

. The basic arguments to this function, which unlock all of its potential, are

`fnctl`

- the functional`formula`

- the formula for the linear model`data`

- the data to use for the model`id`

- the identification variable in data with repeated measurements

We use the concept of a *functional* to handle our first goal. A functional takes a function as its argument and returns a number - hence the mean is a functional, because it takes a distribution as its argument and returns a single number. The allowed functionals to `regress()`

are

Functional | Type of Regression | Previous command (package) |
---|---|---|

`"mean"` |
Linear Regression | `lm()` (`stats` - base R) |

`"geometric mean"` |
Linear Regression on logarithmically transformed Y | `lm()` , with Y log transformed (`stats` - base R) |

`"odds"` |
Logistic Regression | `glm(family = binomial)` (`stats` - base R) |

`"rate"` |
Poisson Regression | `glm(family = poisson)` (`stats` - base R) |

The *formula* to `regress()`

is the same as a formula given to `lm()`

or any of the other regression commands from base R, `survival`

, or `geepack`

, but with one small addition. To address our second goal of allowing the user to specify multiple-partial F-tests, we have added a special function - `U()`

- which can be added to the formula. The `U()`

function is documented more fully in “User Specified Multiple-Partial F-tests in Regression”.

The *data* argument is exactly the same as that in `lm()`

or any of the other regression commands.

Last, *id* allows the user to fit a generalized estimating equations (GEE) model while using the same syntax as any of the functionals to `regress()`

. The GEE framework is a useful way to model correlated data, which often comes in the form of repeated measurements.

As a first example, we run a linear regression analysis of atrophy on age and male, from the `mri`

data. This dataset is included in the `uwIntroStats`

package, and its documentation can be found on Scott Emerson's website here.

```
## Preparing our R session
library(uwIntroStats)
```

```
##
## Attaching package: 'uwIntroStats'
```

```
## The following object is masked from 'package:base':
##
## tabulate
```

```
data(mri)
```

```
regress("mean", atrophy ~ age + male, data = mri)
```

```
##
## Call:
## regress(fnctl = "mean", formula = atrophy ~ age + male, data = mri)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.765 -8.582 -0.356 7.344 52.100
##
## Coefficients:
## Estimate Naive SE Robust SE 95%L 95%H
## [1] Intercept -17.83 6.081 6.557 -30.70 -4.959
## [2] age 0.6819 0.08129 0.08769 0.5097 0.8540
## [3] male 5.964 0.8857 0.8845 4.227 7.700
## F stat df Pr(>F)
## [1] Intercept 7.40 1 0.0067
## [2] age 60.47 1 < 0.00005
## [3] male 45.46 1 < 0.00005
##
## Residual standard error: 12 on 732 degrees of freedom
## Multiple R-squared: 0.14, Adjusted R-squared: 0.1376
## F-statistic: 52.18 on 2 and 732 DF, p-value: < 2.2e-16
```

This call automatically prints the coefficients table. First, notice that by default robust standard error estimates (calculated using the `sandwich`

package) are returned, in addition to the naive estimates. The robust estimates are also used to perform the inference - thus the confidence intervals, statistics, and p-values use these estimates of the standard error.

If we did not use robust standard error estimates, then in the case of linear regression we would be assuming that all groups have the same variance. Then any inference we make could be on the fact that the variances are different, rather than just on the means - which is usually what we want in linear regression.

F-statistics are also displayed by default. This allows us to display multiple-partial F-tests within the coefficients table, and is more in line with teaching philosophy at the University of Washington.

All of the usual inferential statements apply to our output. Thus we would say:

We analyzed the association between cerebral atrophy, age, and sex by running a linear regression of cerebral atrophy modeled as a continuous variable on age - modeled as a continuous variable - and sex - modeled as a binary variable. We allowed for unequal variances among groups by computing robust standard error estimates using the Huber-White procedure. We calculated 95% confidence intervals and p-values using these same robust standard error estimates.

Based on this linear regression analysis, we estimate that for each one year increase in age, the mean cerebral atrophy score increases by 0.682 units. Based on a 95% confidence interval, the data suggest that this estimate is not unreasonable if the true coefficient for age was in the interval from 0.509 to 0.854. We also estimate that males have an average cerebral atrophy score of 5.96 units higher than females. Based on a 95% confidence interval, the data suggest that this estimate is not unreasonable if the true coefficient for sex was in the interval from 4.227 to 7.700.

However, in this case we did not adjust for multiple comparisons in calculating the individual p-values. If we wanted to make inferential claims using these, we would have to use a correction. We could also use a multiple-partial F-test to test both coefficients simultaneously.

In normal linear regression, we are comparing the mean of the response variable across groups defined by the predictors. However, if we were to log transform the response, we would be comparing the geometric mean across groups. In `regress()`

, we simply have to use the `"geometric mean"`

functional.

Using the same function, and the same syntax, we can also run generalized linear regression. For example, if we wanted to examine the odds of having diabetes between males and females, we would run a logistic regression.

```
regress("odds", diabetes ~ male, data = mri)
```

```
##
## Call:
## regress(fnctl = "odds", formula = diabetes ~ male, data = mri)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5593 -0.5593 -0.3823 -0.3823 2.3034
##
## Coefficients:
##
## Raw Model:
## Estimate Naive SE Robust SE F stat df
## [1] Intercept -2.580 0.2034 0.2037 160.39 1
## [2] male 0.8037 0.2519 0.2522 10.15 1
## Pr(>F)
## [1] Intercept < 0.00005
## [2] male 0.0015
##
## Transformed Model:
## e(Est) e(95%L) e(95%H) F stat df
## [1] Intercept 0.07580 0.05082 0.1131 160.39 1
## [2] male 2.234 1.361 3.665 10.15 1
## Pr(>F)
## [1] Intercept < 0.00005
## [2] male 0.0015
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 501.59 on 734 degrees of freedom
## Residual deviance: 490.82 on 733 degrees of freedom
## AIC: 494.82
##
## Number of Fisher Scoring iterations: 5
```

In all of the generalized linear regression output (and in the output from regression on the geometric mean), by default we see two tables. The `Raw Model`

table displays coefficients and standard errors for the model after we have transformed the response variable, but we have not transformed back. Recall that in most generalized linear regression cases, we need to back-transform our results to get them in the original units. This is due to using a link function to model the regression. If you want to suppress printing this table, set `suppress = TRUE`

.

The `Transformed Model`

table does the back-transform for you. It exponentiates all of the coefficients and the confidence intervals so that they are in the original units.

Proportional hazards regression is no longer supported.

A common way to account for correlated data - for example longitudinal data, where we have multiple measurements on the same subjects over time - is to use the Generalized Estimating Equations (GEE) framework. We chose to use this framework in our `regress()`

function because it gives flexibility and uses common conventions. To make use of this functionality, as we mentioned above, you simply need to specify the `id`

argument in `regress()`

.

For an example, we turn to the `salary`

data, again hosted at Scott Emerson's website. More information can be found in the documentation, but these data essentially deal with university level faculty. In 1995, a survey was run asking the faculty to recall their salaries from the past. Each faculty member had records dating either to their starting year at the university or 1976, whichever was more recent. Thus we have repeated measurements on these individuals, and the variable of interest (salary) is highly correlated across measurements.

One interesting question to ask of these data involves discrimination based on sex. We will not model these data in the most sophisticated way (we leave that to the enterprising reader), but we can regress salary on sex and year, and involve their interaction.

```
salary <- read.table("http://www.emersonstatistics.com/datasets/salary.txt", header = TRUE, stringsAsFactors = FALSE)
## create an indicator variable
salary$female <- ifelse(salary$sex == "F", 1, 0)
```

Now that the data is read in, we can run the regression. First we run a regular linear regression which does not properly account for the correlated data, and then we run the GEE based regression.

```
## linear regression
regress("mean", salary ~ female*year, data = salary)
```

```
##
## Call:
## regress(fnctl = "mean", formula = salary ~ female * year, data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3727.9 -907.4 -230.0 719.2 7605.5
##
## Coefficients:
## Estimate Naive SE Robust SE 95%L 95%H
## [1] Intercept -17417 176.5 167.6 -17745 -17088
## [2] female 4102 420.7 299.5 3515 4689
## [3] year 255.5 2.021 2.005 251.6 259.5
## [4] female:year -57.88 4.757 3.541 -64.82 -50.94
## F stat df Pr(>F)
## [1] Intercept 10802.86 1 < 0.00005
## [2] female 187.59 1 < 0.00005
## [3] year 16235.87 1 < 0.00005
## [4] female:year 267.10 1 < 0.00005
##
## Residual standard error: 1423 on 19788 degrees of freedom
## Multiple R-squared: 0.487, Adjusted R-squared: 0.4869
## F-statistic: 7166 on 3 and 19788 DF, p-value: < 2.2e-16
```

From this call to regress, we would think that we had 19792 unique data points, which is quite a lot of faculty to have at a university. If we compare to the results from the GEE based regression, we will see how far off we are.

```
## GEE
regress("mean", salary ~ female*year, id = id, data = salary)
```

```
##
## Call:
## regress(fnctl = "mean", formula = salary ~ female * year, data = salary,
## id = id)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3727.9 -907.4 -230.0 719.2 7605.5
##
## Coefficients:
## Estimate Std Err 95%L 95%H Wald
## [1] Intercept -17417 283.6 -17973 -16861 3771.66
## [2] female 4102 528.7 3065 5138 60.20
## [3] year 255.5 3.536 248.6 262.5 5222.05
## [4] female:year -57.88 6.370 -70.36 -45.39 82.56
## df Pr(>|W|)
## [1] Intercept 1 < 0.00005
## [2] female 1 < 0.00005
## [3] year 1 < 0.00005
## [4] female:year 1 < 0.00005
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 2024610 77997.57
##
## Correlation: Structure = independence
##
## Number of Clusters: 1597
##
## Maximum Cluster Size: 20
```

Now `regress()`

tells us that we only have 1597 faculty members, which is much more reasonable. Also, it tells us that the “maximum cluster size” - i.e. the largest number of observations on any one id - is 20. This makes sense due to the sampling scheme, where some faculty have been at the university for 20 years by the time the survey was taken.

While our estimates of the coefficients are the same between the two calls to `regress()`

, it is our standard error estimates and inference which are drastically different. When we overestimate how many unique samples we have (in the naive linear regression), our standard error is much smaller than it should be, and thus we have much larger F statistics and smaller p-values. While in this case our inference would be the same, since the p-values are on the same order of magnitude in both cases, failing to account for the correlation in these data is still a mistake.

There are three special functions in `uwIntroStats`

which allow us to re-parameterize variables:

`dummy`

- create dummy variables`lspline`

- create linear splines`polynomial`

- create a polynomial

Each of these three functions is used to great effect in `regress()`

. Also, each will give a multiple-partial F-test of the entire variable. This allows you to determine if the variable should be included in the model, rather than having only the coefficient estimates.

For example, we can model race as dummy variables to examine the differences in the odds of having diabetes between races. This allows us to better make comparisons, because modeling as dummy variables essentially creates indicator variables against a reference group.

```
regress("odds", diabetes ~ dummy(race), data = mri)
```

```
##
## Call:
## regress(fnctl = "odds", formula = diabetes ~ dummy(race), data = mri)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6165 -0.4539 -0.4539 -0.4539 2.3459
##
## Coefficients:
##
## Raw Model:
## Estimate Naive SE Robust SE F stat df
## [1] Intercept -2.221 0.1407 0.1411 247.78 1
## dummy(race) 2.11 3
## [2] race.2 0.6568 0.2949 0.2957 4.93 1
## [3] race.3 -0.4648 0.6131 0.6147 0.57 1
## [4] race.4 0.6113 0.7873 0.7894 0.60 1
## Pr(>F)
## [1] Intercept < 0.00005
## dummy(race) 0.0977
## [2] race.2 0.0267
## [3] race.3 0.4498
## [4] race.4 0.4390
##
## Transformed Model:
## e(Est) e(95%L) e(95%H) F stat df
## [1] Intercept 0.1085 0.08227 0.1432 247.78 1
## dummy(race) 2.11 3
## [2] race.2 1.929 1.079 3.446 4.93 1
## [3] race.3 0.6282 0.1879 2.100 0.57 1
## [4] race.4 1.843 0.3912 8.681 0.60 1
## Pr(>F)
## [1] Intercept < 0.00005
## dummy(race) 0.0977
## [2] race.2 0.0267
## [3] race.3 0.4498
## [4] race.4 0.4390
##
## Dummy terms calculated from race, reference = 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 501.59 on 734 degrees of freedom
## Residual deviance: 495.55 on 731 degrees of freedom
## AIC: 503.55
##
## Number of Fisher Scoring iterations: 5
```

First, notice that below the table `regress()`

tells us how the dummy variables were calculated. In this case the reference was 1, corresponding to `white`

in this data set. Next we see the multiple-partial F-test, which is on its own line for `dummy(race)`

. The coefficient estimates are nested beneath this line to indicate that these coefficients all come from the same variable (`race`

) but we have modeled them as three variables.

As we mentioned above, the *formula* in `regress()`

allows you to specify multiple-partial F-tests. This comes in handy if you want to test a subset of variables all at once in your regression.

For example, say that we are interested in the relationship between atrophy, age, sex, and race. However, we also want to include the sex-age interaction and the sex-race interaction. We also want to model race as dummy variables. Last, we want to determine if all of the variables involved in the sex-age interaction (`male`

, `age`

, and the interaction) should be in the model, and similar for the sex-race interaction. We use the `U()`

function to accomplish this goal.

```
regress("mean", atrophy ~ U(ma = ~male*age) + U(mr = ~male*dummy(race)), data = mri)
```

```
##
## Call:
## regress(fnctl = "mean", formula = atrophy ~ U(ma = ~male * age) +
## U(mr = ~male * dummy(race)), data = mri)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.682 -8.236 -0.236 7.158 52.546
##
## Coefficients:
## Estimate Naive SE Robust SE 95%L
## [1] Intercept -8.448 8.881 9.005 -26.13
## ma
## [2] male -11.40 12.19 12.88 -36.69
## [3] age 0.5558 0.1193 0.1208 0.3187
## [4] male:age 0.2414 0.1632 0.1730 -0.09826
## mr
## male -11.40 12.19 12.88 -36.69
## dummy(race)
## [5] race.2 -1.003 1.796 1.825 -4.585
## [6] race.3 1.713 2.463 2.067 -2.345
## [7] race.4 2.070 6.040 8.518 -14.65
## male:dummy(race)
## [8] male:race.2 -2.202 2.560 2.488 -7.086
## [9] male:race.3 -2.883 3.664 3.688 -10.12
## [10] male:race.4 -7.558 7.415 9.366 -25.95
## 95%H F stat df Pr(>F)
## [1] Intercept 9.231 0.88 1 0.3485
## ma 34.35 3 < 0.00005
## [2] male 13.89 0.78 1 0.3764
## [3] age 0.7929 21.18 1 < 0.00005
## [4] male:age 0.5811 1.95 1 0.1634
## mr 1.04 7 0.4038
## male 13.89 0.78 1 0.3764
## dummy(race) 0.40 3 0.7539
## [5] race.2 2.579 0.30 1 0.5826
## [6] race.3 5.770 0.69 1 0.4076
## [7] race.4 18.79 0.06 1 0.8081
## male:dummy(race) 0.61 3 0.6084
## [8] male:race.2 2.682 0.78 1 0.3764
## [9] male:race.3 4.357 0.61 1 0.4346
## [10] male:race.4 10.83 0.65 1 0.4200
##
## Dummy terms calculated from race, reference = 1
##
## Residual standard error: 12 on 725 degrees of freedom
## Multiple R-squared: 0.1488, Adjusted R-squared: 0.1382
## F-statistic: 12.15 on 9 and 725 DF, p-value: < 2.2e-16
```

We have also made use of functionality in `U()`

which allows us to name the groups for the multiple-partial F-tests. Thus we can see that the F-statistic for simultaneously testing whether `male`

, `age`

, and the interaction `male:age`

are equal to zero is 34.35, with the correct degrees of freedom for the test (3) and a small p-value. We would conclude (without adjusting for multiple comparisons) that (some of) these variables should be in the model. On the other hand, we see that the F-statistic for simultaneously testing whether `male`

, `race`

, and the interaction are equal to zero is 1.04. We would conclude (again without adjusting for multiple comparisons) that we cannot reject the null hypothesis that these are all equal to zero.

We also get individual coefficient estimates for each, where we have again nested the estimates within their corresponding groups defined by calls to `U()`

. Note that we have repeated the line for `male`

since it appears in both groups. Other than that, the coefficients table is the same as it would be if we had run the formula `atrophy ~ male*age + male*dummy(race)`

.