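where, by definition of the pseudo data,
$$z_i^{[k]} = g'(\mu_i^{[k]})\,(y_i - \mu_i^{[k]}) + \eta_i^{[k]},$$
and the diagonal weight matrix, $W^{[k]}$, has elements
$$w_i^{[k]} = \frac{1}{V(\mu_i^{[k]})\, g'(\mu_i^{[k]})^2}$$
(Wood, 2006).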
The following procedure is then iterated until convergence:
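1. Using the current $\mu^{[k]}$ and $\eta^{[k]}$, obtain the pseudo data $z^{[k]}$ and the iterative weights $\sqrt{W^{[k]}}$.
2. Minimize the sum of squares $\|\sqrt{W^{[k]}}\,(z^{[k]} - X\beta)\|^2$ with respect to $\beta$ in order to obtain $\hat{\beta}^{[k+1]}$, and hence $\hat{\eta}^{[k+1]} = X\hat{\beta}^{[k+1]}$ and $\mu^{[k+1]}$.
3. Set $k$ to $k+1$ and repeat until $\hat{\beta}$ converges.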
It is common practice to use as initial values $\mu_i^{[0]} = y_i$ and $\eta_i^{[0]} = g(\mu_i^{[0]})$, or a small adjustment to $\mu_i^{[0]}$ if $g(\mu_i^{[0]})$ does not exist.
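The iteration can be sketched in a few lines of Python for the logistic case. This is a minimal illustration rather than the implementation used in this work: the function name `irls_logistic`, the tolerance, and the starting value $(y_i + 0.5)/2$ (one possible "small adjustment" that keeps the logit finite for binary $y_i$) are assumptions made for the example.

```python
import numpy as np

def irls_logistic(X, y, max_iter=25, tol=1e-8):
    """Fit a logistic regression by iteratively reweighted least squares.

    X is the n x p design matrix (include a column of ones for an intercept),
    y is a vector of 0/1 outcomes. Returns (beta_hat, mu_hat).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Initial values: mu = y is not usable for binary data because logit(0)
    # and logit(1) do not exist, so y is nudged towards 0.5 (an assumption).
    mu = (y + 0.5) / 2.0
    eta = np.log(mu / (1.0 - mu))
    beta = np.zeros(X.shape[1])

    for _ in range(max_iter):
        # Step 1: pseudo data and weights. For the logit link
        # g'(mu) = 1 / (mu (1 - mu)) and V(mu) = mu (1 - mu),
        # so w = 1 / (V g'^2) = mu (1 - mu).
        g_prime = 1.0 / (mu * (1.0 - mu))
        z = g_prime * (y - mu) + eta
        w = mu * (1.0 - mu)

        # Step 2: minimise ||sqrt(W) (z - X beta)||^2 by least squares.
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
        eta = X @ beta_new
        mu = 1.0 / (1.0 + np.exp(-eta))
        # Guard against fitted probabilities of exactly 0 or 1.
        mu = np.clip(mu, 1e-10, 1.0 - 1e-10)

        # Step 3: move to the next iteration unless beta has converged.
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break

    return beta, mu
```

At convergence the score equations $X^T(y - \hat{\mu}) = 0$ should hold approximately, which gives a simple check on the fit.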
3.3.3 Diagnostics
Model diagnostics can be divided into two types: (1) checking for outliers and influential
observations, and (2) checking the assumptions of the model.
Residual plots are very useful for checking the adequacy of the model. For Generalized
Linear Models (GLMs), the Pearson and deviance residuals (Faraway, 2006) usually
provide good plots because they are comparable to the standardized residuals
used for linear models. In our case, however, the outcome variable is binary, which
means that these plots are of limited use.
Influential observations and outliers can nevertheless be examined, as can
multi-collinearity amongst the independent variables.
According to Faraway (2006), for the linear model, $\hat{y} = Hy$, where $H$ is the hat matrix that
projects the observed data onto the fitted values. The diagonal elements of $H$ are the
leverages $h_i$ and represent the potential of the point to influence the fit of the model. For
GLMs (and thus logistic regression) leverages are different. The IRWLS algorithm used to
fit the GLM makes use of weights, $w$. These weights affect the leverage. With weights $w$ and
the matrix $W = \mathrm{diag}(w)$, the hat matrix is
$$H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}.$$
The diagonals of $H$ are the leverages $h_i$. A large leverage value $h_i$ indicates that the fit
may be sensitive to the response at case $i$. Leverage measures the potential to affect the fit
of the model.
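As an illustration, the leverages can be computed directly from the design matrix and the final IRWLS weights. The sketch below assumes the weights `w` are already available (for a logistic regression, $w_i = \hat{\mu}_i(1 - \hat{\mu}_i)$); the function name `glm_leverages` is hypothetical.

```python
import numpy as np

def glm_leverages(X, w):
    """Return the diagonal of H = W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}."""
    X = np.asarray(X, dtype=float)
    sw = np.sqrt(np.asarray(w, dtype=float))
    Xw = X * sw[:, None]                      # W^{1/2} X, shape (n, p)
    # Solve (X^T W X) A = (W^{1/2} X)^T instead of forming an explicit inverse.
    A = np.linalg.solve(Xw.T @ Xw, Xw.T)      # shape (p, n)
    # h_i = sum_j Xw[i, j] * A[j, i], i.e. the diagonal of Xw @ A.
    return np.einsum("ij,ji->i", Xw, A)
```

Because $H$ is a projection matrix, the leverages lie between 0 and 1 and sum to the number of parameters $p$, which is a convenient sanity check.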
Measures of influence assess the effect of each case on the fit of the model (Faraway,
2006). Influential points can be examined by looking at the Cook’s distance statistic:
$$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})^T (X^T W X)(\hat{\beta}_{(i)} - \hat{\beta})}{p\,\hat{\phi}},$$
where $\hat{\beta}_{(i)}$ is the estimate obtained with case $i$ omitted, $p$ is the number of
parameters, and the dispersion parameter $\phi$ is equal to 1 when the distribution is binomial (Equation
3.11). The leverage and Cook's distance statistics are checked by considering
their half-normal plots. Faraway (2006) explains that for a GLM we do not expect the
residuals to be normally distributed and, therefore, it is better to use half-normal plots to
identify outliers. Here the sorted values are compared to the quantiles of the half-normal
distribution:
$$\Phi^{-1}\!\left(\frac{n + i}{2n + 1}\right), \quad i = 1, \dots, n.$$
We then look for outliers, which may be identified as points off the trend.
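The plotting positions for a half-normal plot are simple to compute. The following sketch uses only the Python standard library; the helper names `half_normal_quantiles` and `half_normal_pairs` are illustrative, and the pairs it returns would be passed to whatever plotting tool is in use.

```python
from statistics import NormalDist

def half_normal_quantiles(n):
    """Return Phi^{-1}((n + i) / (2n + 1)) for i = 1, ..., n."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [std_normal.inv_cdf((n + i) / (2 * n + 1)) for i in range(1, n + 1)]

def half_normal_pairs(values):
    """Pair each half-normal quantile with the matching sorted absolute value."""
    ordered = sorted(abs(v) for v in values)
    return list(zip(half_normal_quantiles(len(ordered)), ordered))
```

Points whose sorted value lies well above the trend formed by the other pairs are the candidates for closer inspection.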
If some predictors are linear combinations of others, then $X^T X$ is singular. When this
happens there are serious problems with the estimation of the parameters. Collinearity
amongst the predictor variables can be detected in various ways:
1. Looking at the correlation matrix of the predictors may reveal large pairwise correlations.
2. Looking at the variance inflation factors.
The variance inflation factors are calculated as follows: when an independent variable,
$x_i$, is regressed against all the other independent variables and the multiple coefficient of
determination is $R_i^2$, the quantity
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
is called the variance inflation factor for the parameter $\beta_i$ (Mendenhall and Sincich, 2003).
These variance inflation factors are calculated for each numerical independent variable.
Mendenhall and Sincich (2003) state that any value greater than 10 would mean that there is a
collinearity problem.
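The calculation can be sketched as follows, assuming the numerical predictors are held in the columns of an array that does not already contain an intercept column. The function name `variance_inflation_factors` is hypothetical, and each auxiliary regression here includes an intercept.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return 1 / (1 - R^2) for each column of X regressed on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        total = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - (resid @ resid) / total
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs
```

Uncorrelated predictors give values of 1, while a predictor that is almost a linear combination of the others gives a value far above the threshold of 10.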