gRim
package## Loading required package: gRain
## Loading required package: gRbase
The gRim
package is an R package for gRaphical interaction
models (hence the name). gRim
implements 1) graphical
log–linear models for discrete data, that is for contingency tables
and 2) Gaussian graphical models for continuous data (multivariate
normal data) and 3) mixed homogeneous interaction models for mixed
data (data consisiting of both discrete and continuous variables).
The main functions for creating models of the various types are:
dmod()
function creates a hierarchical
log–linear model.cmod()
function creates a Gaussian
graphical model.mmod()
function creates a mixed
interaction model.The arguments to the model functions are:
args(dmod)
## function (formula, data, marginal = NULL, interactions = NULL,
## fit = TRUE, details = 0, ...)
## NULL
args(cmod)
## function (formula, data, marginal = NULL, fit = TRUE, maximal_only = FALSE,
## details = 0)
## NULL
args(mmod)
## function (formula, data, marginal = NULL, fit = TRUE, details = 0)
## NULL
The model objects created by these functions are of the respective
classes dModel
, cModel
and mModel
and they are also of the class
iModel
. We focus the presentation on models for discrete data, but
most of the topics we discuss apply to all types of models.
The reinis
data from \grbase\ is a \(2^6\) contingency table.
data(reinis)
str(reinis)
## 'table' num [1:2, 1:2, 1:2, 1:2, 1:2, 1:2] 44 40 112 67 129 145 12 23 35 12 ...
## - attr(*, "dimnames")=List of 6
## ..$ smoke : chr [1:2] "y" "n"
## ..$ mental : chr [1:2] "y" "n"
## ..$ phys : chr [1:2] "y" "n"
## ..$ systol : chr [1:2] "y" "n"
## ..$ protein: chr [1:2] "y" "n"
## ..$ family : chr [1:2] "y" "n"
Models are specified as generating classes. A generating class can be a list or a right–hand–sided formula. In addition, various model specification shortcuts are available.
The following two specifications of a log–linear model are equivalent:
data(reinis)
dm1 <- dmod(list(c("smoke", "systol"), c("smoke", "mental", "phys")), data=reinis)
dm1 <- dmod(~smoke:systol + smoke:mental:phys, data=reinis)
dm1
## Model: A dModel with 4 variables
## -2logL : 9391.38 mdim : 9 aic : 9409.38
## ideviance : 730.47 idf : 5 bic : 9459.05
## deviance : 3.80 df : 6
The output reads as follows: -2logL
is minus twice the maximized
log–likelihood and mdim
is the number of parameters in the model
(no adjustments have been made for sparsity of data).
The ideviance
and idf
gives the deviance and degrees of
freedom between the model and the independence model for the same
variables and deviance
and df
is the deviance and degrees of
freedom between the model and the saturated model for the same
variables.
Notice that the generating class does not appear directly but can be
retrieved using formula()
and terms()
:
formula(dm1)
## ~smoke * systol + smoke * mental * phys
terms(dm1)
## [[1]]
## [1] "smoke" "systol"
##
## [[2]]
## [1] "smoke" "mental" "phys"
Below we illustrate various other ways of specifying log–linear models.
\begin{itemize}
A saturated model can be specified using ~.^.
whereas
~.^2
specifies the model with all–two–factor
interactions. Using ~.^1
specifies the independence model.
If we want, say, at most two–factor interactions in the
model we can use the interactions
argument.
Attention can be restricted to a subset of the variables
using the marginal
argument.
Variable names can be abbreviated.
\end{itemize}
The following models illustrate these abbreviations:
dm2 <- dmod(~.^2, margin=c("smo","men","phy","sys"),
data=reinis)
formula(dm2)
## ~smoke * mental + smoke * phys + smoke * systol + mental * phys +
## mental * systol + phys * systol
dm3 <- dmod(list(c("smoke", "systol"), c("smoke", "mental", "phys")),
data=reinis, interactions=2)
formula(dm3)
## ~smoke * systol + smoke * mental + smoke * phys + mental * phys
plot(dm1)
For Gaussian models there are at most second order interactions. Hence we may specify the saturated model in different ways:
data(carcass)
cm1 <- cmod(~Fat11:Fat12:Fat13, data=carcass)
cm1 <- cmod(~Fat11:Fat12 + Fat12:Fat13 + Fat11:Fat13, data=carcass)
cm1
## Model: A cModel with 3 variables
## -2logL : 2432.47 mdim : 6 aic : 2444.47
## ideviance : 886.10 idf : 3 bic : 2467.51
## deviance : -0.00 df : 0
\footnote{Harmonize cmod() output with that of dmod()}
plot(cm1)
data(milkcomp1)
mm1 <- mmod(~.^., data=milkcomp1)
## 'as(<ddiMatrix>, "dgCMatrix")' is deprecated.
## Use 'as(as(., "generalMatrix"), "CsparseMatrix")' instead.
## See help("Deprecated") and help("Matrix-deprecated").
mm1
## Model: A mModel with 5 variables
## -2logL : 475.92 mdim : 44 aic : 563.92
## ideviance : 101.57 idf : 31 bic : 652.24
## deviance : 0.00 df : 0
plot(mm1) ## FIXME: should use different colours for disc and cont variables.
update()
The update()
function enables \dmodo\ objects to be modified by the addition
or deletion of interaction terms or edges, using the arguments aterm()
, dterm()
,
aedge()
or dedge()
. Some examples follow:
### Set a marginal saturated model:
ms <- dmod(~.^., marginal=c("phys","mental","systol","family"), data=reinis)
formula(ms)
## ~phys * mental * systol * family
### Delete one edge:
ms1 <- update(ms, list(dedge=~phys:mental))
formula(ms1)
## ~phys * systol * family + mental * systol * family
### Delete two edges:
ms2<- update(ms, list(dedge=~phys:mental+systol:family))
formula(ms2)
## ~phys * systol + phys * family + mental * systol + mental * family
### Delete all edges in a set:
ms3 <- update(ms, list(dedge=~phys:mental:systol))
formula(ms3)
## ~systol * family + mental * family + phys * family
### Delete an interaction term
ms4 <- update(ms, list(dterm=~phys:mental:systol) )
formula(ms4)
## ~phys * mental * family + phys * systol * family + mental * systol *
## family
### Set a marginal independence model:
m0 <- dmod(~.^1, marginal=c("phys","mental","systol","family"), data=reinis)
formula(m0)
## ~phys + mental + systol + family
### Add three interaction terms:
ms5 <- update(m0, list(aterm=~phys:mental+phys:systol+mental:systol) )
formula(ms5)
## ~mental * systol + phys * systol + phys * mental + family
### Add two edges:
ms6 <- update(m0, list(aedge=~phys:mental+systol:family))
formula(ms6)
## ~mental * phys + family * systol
A brief explanation of these operations may be helpful. To obtain a hierarchical model when we delete a term from a model, we must delete any higher-order relatives to the term. Similarly, when we add an interaction term we must also add all lower-order relatives that were not already present. Deletion of an edge is equivalent to deleting the corresponding two-factor term. Let \(m-e\) be the result of deleting edge \(e\) from a model \(m\). Then the result of adding \(e\) is defined as the maximal model \(m^*\) for which \(m^*-e=m\).
ciTest()
Tests of general conditional independence hypotheses of the form $u
\perp v | W$ can be performed using the ciTest()
.
function.
cit <- ciTest(reinis, set=c("systol", "smoke", "family", "phys"))
cit
## Testing systol _|_ smoke | family phys
## Statistic (DEV): 13.045 df: 4 p-value: 0.0111 method: CHISQ
## Slice information:
## statistic p.value df family phys
## 1 4.734420 0.029565 1 y y
## 2 0.003456 0.953121 1 n y
## 3 7.314160 0.006841 1 y n
## 4 0.993337 0.318928 1 n n
The general syntax of the set
argument is of the form \((u,v,W)\)
where \(u\) and \(v\) are variables and \(W\) is a set of variables.
The set
argument can also be given as a right–hand sided formula.
In model terms, the test performed by \comic{ciTest()} corresponds to the test for removing the edge \(\{ u, v \}\) from the saturated model with variables \(\{u, v\} \cup W\). If we (conceptually) form a factor \(S\) by crossing the factors in \(W\), we see that the test can be formulated as a test of the conditional independence \(u \perp v | S\) in a three way table. The deviance decomposes into independent contributions from each stratum:
\begin{eqnarray*} \nonumber D & =& 2 \sum_{ijs} n_{ijs}\log \frac{n_{ijs}}{\hat m_{ijs}} \ &= & \sum_s 2 \sum_{ij} n_{ijs}\log \frac{n_{ijs}}{\hat m_{ijs}}= \sum_s D_s \end{eqnarray*}
where the contribution \(D_s\) from the \(s\)th slice is the deviance for the independence model of \(u\) and \(v\) in that slice. For example,
cit$slice
## statistic p.value df family phys
## 1 4.734420192 0.029564812 1 y y
## 2 0.003456089 0.953120558 1 n y
## 3 7.314159809 0.006841338 1 y n
## 4 0.993336675 0.318928227 1 n n
The \(s\)th slice is a \(|u|\times|v|\) table ${n_{ijs}}{i=1\dots |u|, j=1 \dots |v|}$. The number of degrees of freedom corresponding to the test for independence in this slice is \begin{displaymath} df_s=(#{i: n{i\cdot s}>0}-1)(#{j: n_{\cdot js}>0}-1) \end{displaymath} where \(n_{i\cdot s}\) and \(n_{\cdot js}\) are the marginal totals.
So the correct number of degrees of freedom for the test in the
present example is \(3\), as calculated by ciTtest()
and
testdelete()
.
An alternative to the asymptotic \(\chi^2\) test is to determine the
reference distribution using Monte Carlo methods. The marginal totals
are sufficient statistics under the null hypothesis, and in a
conditional test the test statistic is evaluated in the conditional
distribution given the sufficient statistics. Hence one can generate
all possible tables with those given margins, calculate the desired
test statistic for each of these tables and then see how extreme the
observed test statistic is relative to those of the calculated
tables. A Monte Carlo approximation to this procedure is to randomly
generate large number of tables with the given margins, evaluate the
statistic for each simulated table and then see how extreme the
observed test statistic is in this distribution. This is called a
Monte Carlo exact test
and it provides a \comi{Monte Carlo
\(p\)–value}:
ciTest(reinis, set=c("systol","smoke","family","phys"), method='MC')
## Testing systol _|_ smoke | family phys
## Statistic (DEV): 13.045 df: NA p-value: 0.0125 method: MC
## Slice information:
## statistic n.extreme p.value df family phys
## 1 4.734420 72 0.0360 NA y y
## 2 0.003456 1730 0.8650 NA n y
## 3 7.314160 11 0.0055 NA y n
## 4 0.993337 573 0.2865 NA n n
This section describes some fundamental methods for inference in \grim. As basis for the description consider the following model shown in Fig. @ref(fig:fundamentalfig1):
dm5 <- dmod(~ment:phys:systol + ment:systol:family + phys:systol:smoke,
data=reinis)
Let \(\cal M_0\) be a model and let \(e=\{u,v\}\) be an edge in \(\cal M_0\).
The candidate model formed by deleting \(e\) from \(\cal M_0\) is \(\cal M_1\).
The testdelete()
function can be used to test for deletion of
an edge from a model:
testdelete(dm5, ~smoke:systol)
## dev: 11.698 df: 2 p.value: 0.00288 AIC(k=2.0): 7.7 edge: smoke:systol
## host: phys systol smoke
## Notice: Test performed in saturated marginal model
testdelete(dm5, ~family:systol)
## dev: 1.085 df: 2 p.value: 0.58135 AIC(k=2.0): -2.9 edge: family:systol
## host: mental systol family
## Notice: Test performed in saturated marginal model
In the first case the \(p\)–value suggests that the edge can not be deleted. In the second case the \(p\)–value suggests that the edge can be deleted. The reported AIC value is the difference in AIC between the candidate model and the original model. A negative value of AIC suggest that the candidate model is to be preferred.
Next, let \(\cal M_0\) be a model and let \(e=\{u,v\}\) be an edge not in
\(\cal M_0\). The candidate model formed by adding \(e\) to \(\cal M_0\) is
denoted \(\cal M_1\).
The testadd()
function can be used to test for deletion of
an edge from a model:
testadd(dm5, ~smoke:mental)
## dev: 7.797 df: 4 p.value: 0.09930 AIC(k=2.0): -0.2 edge: smoke:mental
## host: mental phys smoke systol
## Notice: Test performed in saturated marginal model
The \(p\)–value suggests that no significant improvedment of the model is obtained by adding the edge. The reported AIC value is the difference in AIC between the candidate model and the original model. A negative value of AIC would have suggested that the candidate model is to be preferred.
\footnote{A function for testing addition / deletion of more general terms is needed.}
The getInEdges()
function will return a list of all the edges
in the dependency graph \(\cal G\) defined by the model. If we set
type='decomposable'
then the edges returned are as follows: An
edge \(e=\{u,v\}\) is returned if \(\cal G\) minus the edge \(e\) is
decomposable. In connection with model selection this is convenient
because it is thereby possibly to restrict the search to decomposable
models.
ed.in <- getInEdges(ugList(terms(dm5)), type="decomposable")
The getOutEdges()
function will return a list of all the edges
which are not in the dependency graph \(\cal G\) defined by the model. If we set
type='decomposable'
then the edges returned are as follows: An
edge \(e=\{u,v\}\) is returned if \(\cal G\) plus the edge \(e\) is
decomposable. In connection with model selection this is convenient
because it is thereby possibly to restrict the search to decomposable
models.
ed.out <- getOutEdges(ugList(terms(dm5)), type="decomposable")
args(testInEdges)
## function (object, edgeMAT = NULL, criterion = "aic", k = 2, alpha = NULL,
## headlong = FALSE, details = 1, ...)
## NULL
args(testOutEdges)
## function (object, edgeMAT = NULL, criterion = "aic", k = 2, alpha = NULL,
## headlong = FALSE, details = 1, ...)
## NULL
The functions labelInEdges()} and \code{labelOutEdges()
will
test for deletion of edges and addition of edges. The default is to
use AIC for evaluating each edge. It is possible to specify the penalty parameter for AIC to being other
values than 2 and it is possible to base the evaluation on
significance tests instead of AIC. Setting headlong=TRUE
causes
the function to exit once an improvement is found.
For example:
testInEdges(dm5, getInEdges(ugList(terms(dm5)), type="decomposable"),
k=log(sum(reinis)))
## statistic df p.value aic V1 V2 action
## 1 686.702943 2 0.000000e+00 671.666814 mental phys -
## 2 4.692559 2 9.572463e-02 -10.343569 mental family +
## 3 28.146937 2 7.726277e-07 13.110809 phys smoke -
## 4 1.084805 2 5.813498e-01 -13.951323 systol family +
## 5 11.698229 2 2.882450e-03 -3.337899 systol smoke +
Two functions are currently available for model selection:
backward()
and forward()
. These functions employ the
functions in Section @ref(sec:labeledges))
For example, we start with the saturated model and do a backward search.
dm.sat <- dmod(~.^., data=reinis)
dm.back <- backward(dm.sat)
## change.AIC -19.7744 Edge deleted: mental,systol
## change.AIC -8.8511 Edge deleted: phys,systol
## change.AIC -4.6363 Edge deleted: mental,protein
## change.AIC -1.6324 Edge deleted: family,systol
## change.AIC -3.4233 Edge deleted: protein,family
## change.AIC -0.9819 Edge deleted: phys,family
## change.AIC -1.3419 Edge deleted: smoke,family
plot(dm.back)
cm.sat <- cmod(~.^., data=carcassall[,1:15])
cm.back <- backward(cm.sat, k=log(nrow(carcass)), type="unrestricted")
## change.AIC -5.8280 Edge deleted: Fat02,Fat13
## change.AIC -5.8179 Edge deleted: lengthp,Fat11
## change.AIC -5.7950 Edge deleted: lengthc,Fat11
## change.AIC -5.7684 Edge deleted: Meat12,LeanMeat
## change.AIC -5.7500 Edge deleted: lengthc,Meat13
## change.AIC -5.7435 Edge deleted: lengthp,Fat13
## change.AIC -5.7221 Edge deleted: lengthc,Fat02
## change.AIC -5.7094 Edge deleted: Fat03,Fat11
## change.AIC -5.7308 Edge deleted: Fat02,Meat13
## change.AIC -5.7134 Edge deleted: Fat02,Fat11
## change.AIC -5.6630 Edge deleted: Fat12,Fat02
## change.AIC -5.6446 Edge deleted: lengthp,Fat16
## change.AIC -5.6437 Edge deleted: Fat16,Fat03
## change.AIC -5.6443 Edge deleted: Fat14,Fat02
## change.AIC -5.6313 Edge deleted: Meat11,lengthp
## change.AIC -5.6227 Edge deleted: lengthf,Fat13
## change.AIC -5.5647 Edge deleted: Meat11,Fat16
## change.AIC -5.4573 Edge deleted: Fat12,Fat16
## change.AIC -5.4209 Edge deleted: Meat11,Fat03
## change.AIC -5.3358 Edge deleted: Meat12,lengthp
## change.AIC -5.4023 Edge deleted: LeanMeat,lengthc
## change.AIC -5.4080 Edge deleted: lengthf,Fat16
## change.AIC -5.5571 Edge deleted: lengthc,Fat16
## change.AIC -5.1738 Edge deleted: Fat12,lengthp
## change.AIC -5.1657 Edge deleted: lengthf,Meat11
## change.AIC -5.1537 Edge deleted: Fat13,Fat03
## change.AIC -5.0790 Edge deleted: Fat11,Fat14
## change.AIC -4.9814 Edge deleted: weight,Fat11
## change.AIC -5.0393 Edge deleted: weight,Fat14
## change.AIC -4.9686 Edge deleted: Meat13,lengthp
## change.AIC -4.8794 Edge deleted: Meat11,lengthc
## change.AIC -4.6744 Edge deleted: Meat12,Fat13
## change.AIC -4.8765 Edge deleted: Fat14,Meat12
## change.AIC -4.9395 Edge deleted: Fat14,Fat03
## change.AIC -4.0518 Edge deleted: Meat11,Fat12
## change.AIC -4.8607 Edge deleted: Meat11,Fat11
## change.AIC -4.0678 Edge deleted: lengthc,Fat13
## change.AIC -4.2270 Edge deleted: lengthc,Fat14
## change.AIC -5.0534 Edge deleted: lengthf,Fat14
## change.AIC -3.9497 Edge deleted: LeanMeat,Fat13
## change.AIC -3.7324 Edge deleted: lengthf,Fat12
## change.AIC -3.9473 Edge deleted: lengthf,Meat13
## change.AIC -4.6293 Edge deleted: weight,Meat12
## change.AIC -4.3618 Edge deleted: lengthf,Fat03
## change.AIC -3.6575 Edge deleted: Fat02,Meat12
## change.AIC -4.5237 Edge deleted: Fat02,Meat11
## change.AIC -2.4812 Edge deleted: lengthp,Fat03
## change.AIC -4.3531 Edge deleted: lengthp,Fat02
## change.AIC -2.9806 Edge deleted: lengthf,lengthp
## change.AIC -2.3228 Edge deleted: weight,Fat03
## change.AIC -2.7543 Edge deleted: weight,Meat13
## change.AIC -1.9527 Edge deleted: LeanMeat,Meat11
## change.AIC -2.1409 Edge deleted: Fat13,Fat16
## change.AIC -2.8012 Edge deleted: Fat12,Fat14
## change.AIC -1.6710 Edge deleted: Fat03,LeanMeat
## change.AIC -1.3052 Edge deleted: weight,lengthp
## change.AIC -0.8295 Edge deleted: Fat02,Fat16
plot(cm.back)
Default is to search among decomposable models if the initial model is
decomposable. Default is also to label all edges (with AIC values);
however setting search='headlong'
will cause the labelling to
stop once an improvement has been found.
Forward search works similarly; for example we start from the independence model:
dm.i <- dmod(~.^1, data=reinis)
dm.forw <- forward(dm.i)
## change.AIC 683.9717 Edge added: mental,phys
## change.AIC 25.4810 Edge added: phys,smoke
## change.AIC 15.9293 Edge added: mental,protein
## change.AIC 10.8092 Edge added: protein,systol
## change.AIC 2.7316 Edge added: mental,family
## change.AIC 1.9876 Edge added: mental,smoke
## change.AIC 16.4004 Edge added: smoke,protein
## change.AIC 12.5417 Edge added: smoke,systol
plot(dm.forw)
The stepwise()
function will perform a stepwise model
selection. Start from the saturated model:
dm.s2<-stepwise(dm.sat, details=1)
## STEPWISE:
## criterion: aic ( k = 2 )
## direction: backward
## type : decomposable
## search : all
## steps : 1000
## change.AIC -19.7744 Edge deleted: mental,systol
## change.AIC -8.8511 Edge deleted: phys,systol
## change.AIC -4.6363 Edge deleted: mental,protein
## change.AIC -1.6324 Edge deleted: family,systol
## change.AIC -3.4233 Edge deleted: protein,family
## change.AIC -0.9819 Edge deleted: phys,family
## change.AIC -1.3419 Edge deleted: smoke,family
The default selection criterion is AIC (as opposed to significance test); the default penalty parameter in AIC is \(2\) (which gives genuine AIC). The default search direction is backward (as opposed to forward). Default is to restrict the search to decomposable models if the starting model is decomposable; as opposed to unrestricted search. Default is not to do headlong search which means that all edges are tested and the best edge is chosen to delete. Headlong on the other hand means that once a deletable edge is encountered, then this edge is deleted.
Likewise, we may do a forward search starting from the independence model:
dm.i2<-stepwise(dm.i, direction="forward", details=1)
## STEPWISE:
## criterion: aic ( k = 2 )
## direction: forward
## type : decomposable
## search : all
## steps : 1000
## change.AIC 683.9717 Edge added: mental,phys
## change.AIC 25.4810 Edge added: phys,smoke
## change.AIC 15.9293 Edge added: mental,protein
## change.AIC 10.8092 Edge added: protein,systol
## change.AIC 2.7316 Edge added: mental,family
## change.AIC 1.9876 Edge added: mental,smoke
## change.AIC 16.4004 Edge added: smoke,protein
## change.AIC 12.5417 Edge added: smoke,systol
\begin{figure}[h] \centering \includegraphics{figures/GRIM-stepwise01} \caption{Models for the \reinis\ data obtained by backward (left) and forward (right) stepwise model selection.} {#fig:stepwise01} \end{figure}
Stepwise model selection is in practice only feasible for moderately sized problems.
The stepwise model selection can be controlled by fixing specific edges. For example we can specify edges which are not to be considered in a bacward selection:
fix <- list(c("smoke","phys","systol"), c("systol","protein"))
fix <- do.call(rbind, unlist(lapply(fix, names2pairs),recursive=FALSE))
fix
## [,1] [,2]
## [1,] "phys" "smoke"
## [2,] "smoke" "systol"
## [3,] "phys" "systol"
## [4,] "protein" "systol"
dm.s3 <- backward(dm.sat, fixin=fix, details=1)
## change.AIC -19.7744 Edge deleted: mental,systol
## change.AIC -4.6982 Edge deleted: systol,family
## change.AIC -6.8301 Edge deleted: protein,family
## change.AIC -1.2294 Edge deleted: mental,protein
## change.AIC -0.9819 Edge deleted: phys,family
## change.AIC -1.3419 Edge deleted: smoke,family
There is an important detail here: The matrix fix
specifies a
set of edges. Submitting these in a call to \comic{backward} does
not mean that these edges are forced to be in the model. It means that
those edges in fixin
which are in the model will not be removed.
Likewise in forward selection:
dm.i3 <- forward(dm.i, fixout=fix, details=1)
## change.AIC 683.9717 Edge added: mental,phys
## change.AIC 15.9293 Edge added: mental,protein
## change.AIC 15.4003 Edge added: smoke,protein
## change.AIC 8.6638 Edge added: smoke,mental
## change.AIC 2.7316 Edge added: mental,family
## change.AIC 1.1727 Edge added: phys,protein
Edges in fix
will not be added to the model but if they are in
the starting model already, they will remain in the final model.
\begin{figure}[h] \centering \includegraphics{figures/GRIM-stepwise02} \caption{Models for the \reinis\ data obtained by backward (left) and forward (right) stepwise model selection when certain edges are restricted in the selection procedure. } {#fig:stepwise02} \end{figure}
data(mildew)
dm1 <- dmod(~.^., data=mildew)
dm1
## Model: A dModel with 6 variables
## -2logL : 365.75 mdim : 63 aic : 491.75
## ideviance : 209.72 idf : 57 bic : 633.41
## deviance : 0.00 df : 0
## Notice: Table is sparse
## Asymptotic chi2 distribution may be questionable.
## Degrees of freedom can not be trusted.
## Model dimension adjusted for sparsity : 21
dm2 <- stepwise(dm1)
dm2
## Model: A dModel with 6 variables
## -2logL : 383.01 mdim : 11 aic : 405.01
## ideviance : 192.46 idf : 5 bic : 429.74
## deviance : 17.26 df : 52
## Notice: Table is sparse
## Asymptotic chi2 distribution may be questionable.
## Degrees of freedom can not be trusted.
## Model dimension adjusted for sparsity : 10
plot(dm2)
{#sec:dimloglin}
The dim_loglin()
is a general function for finding the dimension
of a log–linear model. It works on the generating class of a model
being represented as a list. For a decomposable model
it is possible to calculate
and adjusted dimension which accounts for sparsity of data with dim_loglin_decomp()
:
ff <- ~la10:locc:mp58:c365+mp58:c365:p53a:a367
mm <- dmod(ff, data=mildew)
plot(mm)
dim_loglin(terms(mm), mildew)
## [1] 27
dim_loglin_decomp(terms(mm), mildew)
## [1] 19
The IPS algorithm for hierarchical log–linear models is inefficient in the sense that it requires the entire table to be fitted. For example, if there are \(81\) variables each with \(10\) levels then a table with \(10^{81}\) will need to be created. (Incidently, \(10^{81}\) is one of the figures reported as the number of atoms in the universe. It is a large number!).
Consider a hierarchical log–linear model with generating class $\cal A = {a_1, \dots, a_M}$ over a set of variables \(\Delta\). The Iterative Proportional Scaling (IPS) algorithm (as described e.g.\ in @lauritzen:96, p.\ 83) as a commonly used method for fitting such models. The updating steps are of the form
\begin{equation} p(i) \leftarrow p(i)\frac{n(i_{a_k})/n}{p(i_{a_k})} \mbox{ for } k=1,\dots,M. \end{equation}
The IPS algorithm is implemented in the loglin()
function.
A more efficient IPS algorithm is described by
@jirousek:preucil:95, and this is implemented in the
effloglin()
function. The implementation of
effloglin()
is made entirely in R
and therefore the word
efficient should be understood in terms of space
requirement (for small problems, loglin()
is much faster than
effloglin()
).
The algorithm goes as follows: It is assumed that \(\cal A\) is minimally specified, i.e.\ that no element in \(\cal A\) is contained in another element in \(\cal A\). Form the dependency graph \(\cal G(\cal A)\) induced by \(\cal A\). Let \(\cal G'\) denoted a triangulation of \(\cal G(\cal A)\) and let \(\cal C=\{C_1,\dots,C_N\}\) denote the cliques of \(\cal G'\). Each \(a\in \cal A\) is then contained in exactly one clique $C\in \cal C$. Let \(\cal A_C=\{a\in \cal A:a\subset C\}\) so that $\cal A_{C_1}, \dots, \cal A_{C_N}$ is a disjoint partitioning of \(\cal A\).
Any probability \(p\) satisfying the constraints of \(\cal A\) will also factorize according to \(\cal G'\) so that
\begin{align} p(i) = \prod_{C\in \cal C} \psi_C(i_C) (#eq:effloglin1) \end{align}
Using e.g.\ the computation architecture of @lauritzen:spiegelhalter:88 the clique marginals
\begin{align} p_{C}(i_{C}), \quad C \in \cal C (#eq:effloglin2) \end{align}
can be obtained from (@ref(eq:effloglin1)). In practice calculation of
(@ref(eq:effloglin2)) is done using the gRrain
package.
For \(C\in \cal C\) and an \(a \in \cal A_C\) update \(\psi_C\) in
(@ref(eq:effloglin1)) as
\begin{align} \psi_C(i_C) \leftarrow \psi_C(i_C) \frac{n(i_a)/n}{p_a(i_a)} \end{align}
where \(p_a\) is obtained by summing over variables in $C\setminus a$ in \(p_C\) from (@ref(eq:effloglin2)). Then find the new clique marginals in (@ref(eq:effloglin2)), move on to the next \(a\) in \(\cal A_{C}\) and so on.
As an example, consider 4–cycle model for reinis data:
data(reinis)
ff <- ~smoke:mental+mental:phys+phys:systol+systol:smoke
dmod(ff, data=reinis)
## Model: A dModel with 4 variables
## -2logL : 9414.99 mdim : 8 aic : 9430.99
## ideviance : 706.87 idf : 4 bic : 9475.13
## deviance : 27.40 df : 7
This model can be fitted with loglin()
as
glist <- rhsFormula2list(ff)
glist
## [[1]]
## [1] "smoke" "mental"
##
## [[2]]
## [1] "mental" "phys"
##
## [[3]]
## [1] "phys" "systol"
##
## [[4]]
## [1] "systol" "smoke"
fv1 <- loglin(reinis, glist, print=FALSE)
fv1[1:3]
## $lrt
## [1] 1233.054
##
## $pearson
## [1] 1198.018
##
## $df
## [1] 55
An alternative is effloglin()
which uses the algorithm above on a
triangulated graph:
fv2 <- effloglin(reinis, glist, print=FALSE)
## 103
## loop over margins. logL: 28.763350644279
## 103
## loop over margins. logL: 371.795118911387
## 103
## loop over margins. logL: 391.272776485601
## 103
## loop over margins. logL: 396.832642464605
## 103
## loop over margins. logL: 396.832741133973
## 103
## loop over margins. logL: 396.832742073716
## 103
## loop over margins. logL: 396.842881530033
## 103
## loop over margins. logL: 396.842900258297
## 103
fv2[c('logL','nparm','df')]
## $logL
## [1] 396.8429
##
## $nparm
## [1] 8
##
## $df
## [1] 7
The real virtue of effloglin()
lies in that it is possible to
submit data as a list of sufficient marginals:
stab <- lapply(glist, function(gg) tableMargin(reinis, gg))
fv3 <- effloglin(stab, glist, print=FALSE)
## 103
## loop over margins. logL: 28.763350644279
## 103
## loop over margins. logL: 371.795118911387
## 103
## loop over margins. logL: 391.272776485601
## 103
## loop over margins. logL: 396.832642464605
## 103
## loop over margins. logL: 396.832741133973
## 103
## loop over margins. logL: 396.832742073716
## 103
## loop over margins. logL: 396.842881530033
## 103
## loop over margins. logL: 396.842900258297
## 103
A sanity check:
m1 <- loglin(reinis, glist, print=F, fit=T)
f1 <- m1$fit
m3 <- effloglin(stab, glist, print=F, fit=T)
## 103
## loop over margins. logL: 28.763350644279
## 103
## loop over margins. logL: 371.795118911387
## 103
## loop over margins. logL: 391.272776485601
## 103
## loop over margins. logL: 396.832642464605
## 103
## loop over margins. logL: 396.832741133973
## 103
## loop over margins. logL: 396.832742073716
## 103
## loop over margins. logL: 396.842881530033
## 103
## loop over margins. logL: 396.842900258297
## 103
f3 <- m3$fit
max(abs(f1 %a-% f3))
## [1] 184.2839
Consider the saturated and the independence models for the
carcass
data:
data(carcass)
cm1 <- cmod(~.^., carcass)
cm2 <- cmod(~.^1, data=carcass)
testdelete()
Let \(\cal M_0\) be a model and let \(e=\{u,v\}\) be an edge in \(\cal M_0\).
The candidate model formed by deleting \(e\) from \(\cal M_0\) is \(\cal M_1\).
The testdelete()
function can be used to test for deletion of
an edge from a model:
testdelete(cm1, ~Meat11:Fat11)
## dev: 3.898 df: 1 p.value: 0.04834 AIC(k=2.0): 1.9 edge: Meat11:Fat11
## host: Meat12 Fat11 LeanMeat Meat13 Fat13 Fat12 Meat11
## Notice: Test performed in saturated marginal model
testdelete(cm1, ~Meat12:Fat13)
## dev: 0.139 df: 1 p.value: 0.70889 AIC(k=2.0): -1.9 edge: Meat12:Fat13
## host: Meat12 Fat11 LeanMeat Meat13 Fat13 Fat12 Meat11
## Notice: Test performed in saturated marginal model
In the first case the \(p\)–value suggests that the edge can not be deleted. In the second case the \(p\)–value suggests that the edge can be deleted. The reported AIC value is the difference in AIC between the candidate model and the original model. A negative value of AIC suggest that the candidate model is to be preferred.
testadd()
Next, let \(\cal M_0\) be a model and let \(e=\{u,v\}\) be an edge not in
\(\cal M_0\). The candidate model formed by adding \(e\) to \(\cal M_0\) is
denoted \(\cal M_1\).
The testadd()
function can be used to test for deletion of
an edge from a model:
testadd(cm2, ~Meat11:Fat11)
## dev: 0.507 df: 1 p.value: 0.47625 AIC(k=2.0): -1.5 edge: Meat11:Fat11
## host: Meat11 Fat11
## Notice: Test performed in saturated marginal model
testadd(cm2, ~Meat12:Fat13)
## dev: 4.168 df: 1 p.value: 0.04120 AIC(k=2.0): 2.2 edge: Meat12:Fat13
## host: Meat12 Fat13
## Notice: Test performed in saturated marginal model
In the first case the \(p\)–value suggests that no significant improvedment of the model is obtained by adding the edge. In the second case a significant improvement is optained by adding the edge. The reported AIC value is the difference in AIC between the candidate model and the original model. A negative value of AIC suggest that the candidate model is to be preferred.
getInEdges()
Consider the following model for the \carcass\ data:
data(carcass)
cm1 <- cmod(~LeanMeat:Meat12:Fat12+LeanMeat:Fat11:Fat12+Fat11:Fat12:Fat13, data=carcass)
plot(cm1)
The edges in the model are
getInEdges(cm1)
## [,1] [,2]
## [1,] "LeanMeat" "Meat12"
## [2,] "LeanMeat" "Fat12"
## [3,] "LeanMeat" "Fat11"
## [4,] "Meat12" "Fat12"
## [5,] "Fat12" "Fat11"
## [6,] "Fat12" "Fat13"
## [7,] "Fat11" "Fat13"
In connection with model selection it is sometimes convenient to get only the edges which are contained in only one clique:
getInEdges(cm1, type="decomposable")
## [,1] [,2]
## [1,] "LeanMeat" "Meat12"
## [2,] "LeanMeat" "Fat11"
## [3,] "Meat12" "Fat12"
## [4,] "Fat12" "Fat13"
## [5,] "Fat11" "Fat13"
\footnote{getInEdges/getOutEdges: type=’‘decomposable’’ is a silly value for the argument}
\footnote{getInEdges/getOutEdges: Should be possible to have edges as a matrix instead. Perhaps even as default.}
getOutEdges()
The edges not in the model are
getOutEdges(cm1)
## [,1] [,2]
## [1,] "LeanMeat" "Fat13"
## [2,] "Meat12" "Fat11"
## [3,] "Meat12" "Fat13"
In connection with model selection it is sometimes convenient to get only the edges which when added will be in only one clique of the new model:
getOutEdges(cm1, type="decomposable")
## [,1] [,2]
## [1,] "LeanMeat" "Fat13"
## [2,] "Meat12" "Fat11"
data(carcass)
cm1 <- cmod(~LeanMeat:Meat12:Fat12+LeanMeat:Fat11:Fat12+Fat11:Fat12:Fat13+Fat12:Meat11:Meat13, data=carcass[1:20,])
plot(cm1)
evalInEdges()
in.ed <- getInEdges(cm1)
z <- testInEdges(cm1, edgeList=in.ed)
z
## statistic df p.value aic V1 V2 action
## 1 2.3180596 1 1.278795e-01 0.3180596 Fat11 Fat12 -
## 2 2.7529164 1 9.707721e-02 0.7529164 Fat11 Fat13 -
## 3 7.1618067 1 7.447215e-03 5.1618067 Fat11 LeanMeat -
## 4 1.8377898 1 1.752102e-01 -0.1622102 Meat11 Fat12 +
## 5 33.2906683 1 7.936256e-09 31.2906683 Meat11 Meat13 -
## 6 1.6517259 1 1.987242e-01 -0.3482741 Fat12 Meat12 +
## 7 5.1268178 1 2.355888e-02 3.1268178 Fat12 Fat13 -
## 8 1.5850908 1 2.080289e-01 -0.4149092 Fat12 Meat13 +
## 9 0.6102019 1 4.347118e-01 -1.3897981 Fat12 LeanMeat +
## 10 3.5761623 1 5.861442e-02 1.5761623 Meat12 LeanMeat -
Hence there are four edges which lead to a decrease in AIC. If we set
headlong=T
then the function exist as soon as one decrease in
AIC is found:
z <-testInEdges(cm1, edgeList=in.ed, headlong=T)
z
## statistic df p.value aic V1 V2 action
## 1 3.5761623 1 0.058614415 1.576162 Meat12 LeanMeat -
## 2 7.1618067 1 0.007447215 5.161807 Fat11 LeanMeat -
## 3 0.6102019 1 0.434711798 -1.389798 Fat12 LeanMeat +
evalOutEdges()
out.ed <- getOutEdges(cm1)
z <- testOutEdges(cm1, edgeList=out.ed)
Hence there are four edges which lead to a decrease in AIC. If we set
headlong=T
then the function exist as soon as one decrease in
AIC is found:
z <- testOutEdges(cm1, edgeList=out.ed, headlong=T)
z
## statistic df p.value aic V1 V2 action
## 1 61.63593 1 4.107825e-15 59.63593 Meat12 Meat13 +
It is worth looking at the information in the model object:
dm3 <- dmod(list(c("smoke", "systol"), c("smoke", "mental", "phys")),
data=reinis)
names(dm3)
## [1] "modelinfo" "varNames" "datainfo" "fitinfo" "isFitted" "call"
\begin{itemize}
str(terms(dm3))
## List of 2
## $ : chr [1:2] "smoke" "systol"
## $ : chr [1:3] "smoke" "mental" "phys"
str(dm3$glistNUM)
## NULL
The numeric representation of the generators refers back to
dm3$varNames
## [1] "smoke" "systol" "mental" "phys"
Notice the model object does not contain a graph object. Graph objects are generated on the fly when needed.
str(dm3[c("varNames","conNames","conLevels")])
## List of 3
## $ varNames: chr [1:4] "smoke" "systol" "mental" "phys"
## $ NA : NULL
## $ NA : NULL
isFitted
is a logical for whether the model is fitted;
data} is the data (as a table) and \code{fitinfo
consists of
fitted values, logL, df etc.
\end{itemize}A summary()
of a model:
summary(dm1) ## FIXME
## Length Class Mode
## modelinfo 4 -none- list
## varNames 6 -none- character
## datainfo 1 -none- list
## fitinfo 10 -none- list
## isFitted 1 -none- logical
## call 3 -none- call
str(fitted(dm1))
## 'table' num [1:2, 1:2, 1:2, 1:2, 1:2, 1:2] 0 0 0 0 3 0 1 0 0 1 ...
## - attr(*, "dimnames")=List of 6
## ..$ la10: chr [1:2] "1" "2"
## ..$ locc: chr [1:2] "1" "2"
## ..$ mp58: chr [1:2] "1" "2"
## ..$ c365: chr [1:2] "1" "2"
## ..$ p53a: chr [1:2] "1" "2"
## ..$ a367: chr [1:2] "1" "2"
str(dm1$data)
## List of 1
## $ data: 'table' num [1:2, 1:2, 1:2, 1:2, 1:2, 1:2] 0 0 0 0 3 0 1 0 0 1 ...
## ..- attr(*, "dimnames")=List of 6
## .. ..$ la10: chr [1:2] "1" "2"
## .. ..$ locc: chr [1:2] "1" "2"
## .. ..$ mp58: chr [1:2] "1" "2"
## .. ..$ c365: chr [1:2] "1" "2"
## .. ..$ p53a: chr [1:2] "1" "2"
## .. ..$ a367: chr [1:2] "1" "2"
Hence we can make a simple diagnostic plot of Pearson residuals as FIXME
\begin{figure}[h] \centering \includegraphics[]{figures/GRIM-pearson-1} \caption{A marginal model for a slice of the \reinis\ data.} {#fig:pearson-1} \end{figure}