Wednesday, March 20, 2013

Veterinary Epidemiologic Research: GLM – Evaluating Logistic Regression Models (part 3)

from: http://denishaine.wordpress.com/2013/03/19/veterinary-epidemiologic-research-glm-evaluating-logistic-regression-models-part-3/

Third part on logistic regression (first here, second here).
There are two steps in assessing the fit of the model: the first is to determine whether the model fits, using summary measures of goodness of fit or by assessing the predictive ability of the model; the second is to determine whether there are any observations that do not fit the model or that have an undue influence on it.
Covariate pattern
A covariate pattern is a unique combination of values of predictor variables.

mod3 <- glm(casecont ~ dcpct + dneo + dclox + dneo*dclox,
+ family = binomial("logit"), data = nocardia)
summary(mod3)
Call:
glm(formula = casecont ~ dcpct + dneo + dclox + dneo * dclox,
family = binomial("logit"), data = nocardia)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9191 -0.7682 0.1874 0.5876 2.6755
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.776896 0.993251 -3.803 0.000143 ***
dcpct 0.022618 0.007723 2.928 0.003406 **
dneoYes 3.184002 0.837199 3.803 0.000143 ***
dcloxYes 0.445705 1.026026 0.434 0.663999
dneoYes:dcloxYes -2.551997 1.205075 -2.118 0.034200 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 149.72 on 107 degrees of freedom
Residual deviance: 103.42 on 103 degrees of freedom
AIC: 113.42
Number of Fisher Scoring iterations: 5
library(epiR)
Package epiR 0.9-45 is loaded
Type help(epi.about) for summary information
mod3.mf <- model.frame(mod3)
(mod3.cp <- epi.cp(mod3.mf[-1]))
$cov.pattern
id n dcpct dneo dclox
1 1 7 0 No No
2 2 38 100 Yes No
3 3 1 25 No No
4 4 1 1 No No
5 5 11 100 No Yes
8 6 1 25 Yes Yes
10 7 1 14 Yes No
12 8 4 75 Yes No
13 9 1 90 Yes Yes
14 10 1 30 No No
15 11 3 5 Yes No
17 12 9 100 Yes Yes
22 13 2 20 Yes No
23 14 8 100 No No
25 15 2 50 Yes Yes
26 16 1 7 No No
27 17 4 50 Yes No
28 18 1 50 No No
31 19 1 30 Yes No
34 20 1 99 No No
35 21 1 99 Yes Yes
40 22 1 80 Yes Yes
48 23 1 3 Yes No
59 24 1 1 Yes No
77 25 1 10 No No
84 26 1 83 No Yes
85 27 1 95 Yes No
88 28 1 99 Yes No
89 29 1 25 Yes No
105 30 1 40 Yes No
$id
[1] 1 2 3 4 5 1 1 6 5 7 5 8 9 10 11 11 12 1 12 1 5 13 14 2 15
[26] 16 17 18 1 2 19 2 14 20 21 12 14 5 8 22 14 5 5 5 1 14 14 23 2 12
[51] 14 12 11 5 15 2 8 2 24 2 2 8 2 17 2 2 2 2 12 12 12 2 2 2 5
[76] 2 25 2 17 2 2 2 2 26 27 13 17 28 29 14 2 12 2 2 2 2 2 2 2 2
[101] 2 2 2 2 30 2 2 5
 
There are 30 covariate patterns in the dataset. The pattern dcpct=100, dneo=Yes, dclox=No appears 38 times.
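As a quick base R cross-check (a sketch, not part of the original post), the covariate patterns can also be counted directly:

nrow(unique(nocardia[, c("dcpct", "dneo", "dclox")])) # 30 patterns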
 
Pearson and deviance residuals
Residuals represent the difference between the data and the model. The Pearson residuals are comparable to standardized residuals used for linear regression models. Deviance residuals represent the contribution of each observation to the overall deviance.
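For covariate pattern j with n_j observations, y_j observed cases and predicted probability \hat{\pi}_j , the Pearson residual is r_j = \frac{y_j - n_j \hat{\pi}_j}{\sqrt{n_j \hat{\pi}_j (1 - \hat{\pi}_j)}} ; the deviance residual is the signed square root of that pattern's contribution to the deviance.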
 
residuals(mod3) # deviance residuals
residuals(mod3, "pearson") # pearson residuals
 
Goodness-of-fit test
All goodness-of-fit tests are based on the premise that the data will be divided into subsets and within each subset the predicted number of outcomes will be computed and compared to the observed number of outcomes. The Pearson \chi^2 and the deviance \chi^2 are based on dividing the data up into the natural covariate patterns. The Hosmer-Lemeshow test is based on a more arbitrary division of the data.
The Pearson \chi^2 is similar to the residual sum of squares used in linear models. It will be close in size to the deviance, but the model is fit to minimize the deviance, not the Pearson \chi^2 . It is thus possible, even if unlikely, that the Pearson \chi^2 could increase as a predictor is added to the model.
 
sum(residuals(mod3, type = "pearson")^2)
[1] 123.9656
deviance(mod3)
[1] 103.4168
1 - pchisq(deviance(mod3), df.residual(mod3))
[1] 0.4699251
 
The p-value is large, indicating no evidence of lack of fit. However, when using the deviance statistic to assess the goodness-of-fit for a nonsaturated logistic model, the \chi^2 approximation for the likelihood ratio test is questionable: when the number of covariate patterns is almost as large as N, the deviance cannot be assumed to have a \chi^2 distribution.
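For comparison (a small addition to the original post), the same tail probability can be computed from the Pearson statistic, with the same caveat:

1 - pchisq(sum(residuals(mod3, type = "pearson")^2), df.residual(mod3))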
Now the Hosmer-Lemeshow test, usually dividing the data into 10 groups:
 
hosmerlem <- function (y, yhat, g = 10) {
+ cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)),
+ include.lowest = TRUE)
+ obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
+ expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
+ chisq <- sum((obs - expect)^2 / expect)
+ P <- 1 - pchisq(chisq, g - 2)
+ c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
+ }
hosmerlem(y = nocardia$casecont, yhat = fitted(mod3))
Error in cut.default(yhat, breaks = quantile(yhat, probs = seq(0, 1, 1/g)), (from #2) :
'breaks' are not unique

The model has many ties in its predicted probabilities (too few covariate values?), resulting in an error when running the Hosmer-Lemeshow test. Using fewer cut-points (g = 5 or 7) does not solve the problem. This is a typical example of when not to use this test. A better goodness-of-fit test than the Hosmer-Lemeshow and Pearson / deviance \chi^2 tests is the le Cessie – van Houwelingen – Copas – Hosmer unweighted sum of squares test for global goodness of fit (also here), implemented in the rms package (but you have to fit your model with the lrm function of this package):

library(rms)
mod3b <- lrm(casecont ~ dcpct + dneo + dclox + dneo*dclox, nocardia,
+ method = "lrm.fit", model = TRUE, x = TRUE, y = TRUE,
+ linear.predictors = TRUE, se.fit = FALSE)
residuals(mod3b, type = "gof")
Sum of squared errors Expected value|H0 SD
16.4288056            16.8235055        0.2775694
Z            P
-1.4219860   0.1550303
 
The p-value is 0.16, so there is no evidence that the model is incorrect. Even better than these tests would be to check the linearity of the predictors.
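One way to probe linearity of the continuous predictor dcpct is to refit it with a restricted cubic spline and test the nonlinear terms; this is a sketch, not part of the original post (rcs comes with the rms package loaded above):

mod3c <- lrm(casecont ~ rcs(dcpct, 4) + dneo + dclox + dneo*dclox,
+ data = nocardia)
anova(mod3c) # the 'Nonlinear' row tests departure from linearity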
Overdispersion
Sometimes we can get a deviance that is much larger than expected if the model were correct. It can be due to the presence of outliers, sparse data, or clustering of data. The approach to dealing with overdispersion is to add a dispersion parameter \sigma^2 , which can be estimated with \hat{\sigma}^2 = \frac{\chi^2}{n - p} (n = number of observations, p = number of model parameters). A half-normal plot of the residuals can help check for outliers:
 
library(faraway)
halfnorm(residuals(mod1))
Half-normal plot of the residuals
The dispersion parameter of model 1 can be found as:
(sigma2 <- sum(residuals(mod1, type = "pearson")^2) / 104)
[1] 1.128778
drop1(mod1, scale = sigma2, test = "F")
Single term deletions
Model:
casecont ~ dcpct + dneo + dclox
scale: 1.128778
Df Deviance AIC F value Pr(>F)
<none> 107.99 115.99
dcpct 1 119.34 124.05 10.9350 0.001296 **
dneo 1 125.86 129.82 17.2166 6.834e-05 ***
dclox 1 114.73 119.96 6.4931 0.012291 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In drop1.glm(mod1, scale = sigma2, test = "F") :
F test assumes 'quasibinomial' family
The dispersion parameter is not very different from one (no overdispersion). If dispersion were present, you could take it into account in the F-tests for the predictors by passing it as the scale argument to drop1.
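A hedged alternative, mirroring model 1's formula, is to let glm estimate the dispersion directly via a quasibinomial refit (same coefficients, standard errors inflated by the dispersion):

mod1q <- glm(casecont ~ dcpct + dneo + dclox,
+ family = quasibinomial("logit"), data = nocardia)
summary(mod1q)$dispersion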
Predictive ability of the model
A ROC curve can be drawn:
predicted <- predict(mod3)
library(ROCR)
library(ggplot2)
prob <- prediction(predicted, nocardia$casecont,
+ label.ordering = c('Control', 'Case'))
tprfpr <- performance(prob, "tpr", "fpr")
tpr <- unlist(slot(tprfpr, "y.values"))
fpr <- unlist(slot(tprfpr, "x.values"))
roc <- data.frame(tpr, fpr)
ggplot(roc) + geom_line(aes(x = fpr, y = tpr)) +
+ geom_abline(intercept = 0, slope = 1, colour = "gray") +
+ ylab("Sensitivity") +
+ xlab("1 - Specificity")
ROC curve
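As a one-number summary of the curve (a small addition, not in the original post), the AUC can be pulled from the same ROCR objects:

performance(prob, "auc")@y.values[[1]]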
Identifying important observations
As with linear regression, large positive or negative standardized residuals identify points that are not well fit by the model. A plot of the Pearson residuals as a function of the logit for model 1 is drawn below, with bubble size proportional to the size of the covariate pattern. The plot should form a horizontal band with observations between -3 and +3. Covariate patterns 25 and 26 are problematic.
nocardia$casecont.num <- as.numeric(nocardia$casecont) - 1
mod1 <- glm(casecont.num ~ dcpct + dneo + dclox, family = binomial("logit"),
+ data = nocardia) # "logit" can be omitted as it is the default
mod1.mf <- model.frame(mod1)
mod1.cp <- epi.cp(mod1.mf[-1])
nocardia.cp <- as.data.frame(cbind(cpid = mod1.cp$id,
+ nocardia[ , c(1, 9:11, 13)],
+ fit = fitted(mod1)))
### Residuals and delta betas based on covariate pattern:
mod1.obs <- as.vector(by(as.numeric(nocardia.cp$casecont.num),
+ as.factor(nocardia.cp$cpid), FUN = sum))
mod1.fit <- as.vector(by(nocardia.cp$fit, as.factor(nocardia.cp$cpid),
+ FUN = min))
mod1.res <- epi.cpresids(obs = mod1.obs, fit = mod1.fit,
+ covpattern = mod1.cp)
mod1.lodds <- as.vector(by(predict(mod1), as.factor(nocardia.cp$cpid),
+ FUN = min))
plot(mod1.lodds, mod1.res$spearson,
+ type = "n", ylab = "Pearson Residuals", xlab = "Logit")
text(mod1.lodds, mod1.res$spearson, labels = mod1.res$cpid, cex = 0.8)
symbols(mod1.lodds, mod1.res$pearson, circles = mod1.res$n, add = TRUE)
Bubble plot of standardized residuals
The hat matrix is used to calculate leverage values and other diagnostic parameters. Leverage measures the potential impact of an observation on the fitted model; points with high leverage deserve a closer look. Covariate patterns 2, 14, 12 and 5 have the largest leverage values.
mod1.res[sort.list(mod1.res$leverage, decreasing = TRUE), ]
cpid leverage
2 0.74708052
14 0.54693851
12 0.54017700
5 0.42682684
11 0.21749664
1 0.19129427
...
Delta-betas provide an overall estimate of the effect of the j^{th} covariate pattern on the regression coefficients, analogous to Cook's distance in linear regression. Covariate pattern 2 has the largest delta-beta (and represents 38 observations).
mod1.res[sort.list(mod1.res$sdeltabeta, decreasing = TRUE), ]
cpid sdeltabeta
2 7.890878470
14 3.983840529
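For comparison (a sketch, not in the original post), base R offers per-observation rather than per-covariate-pattern versions of these diagnostics:

head(sort(cooks.distance(mod1), decreasing = TRUE)) # analogous to the delta-betas
head(sort(hatvalues(mod1), decreasing = TRUE)) # leverage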



Thursday, March 14, 2013

Random Forest Variable Importance

from: http://alandgraf.blogspot.hk/2012/07/random-forest-variable-importance.html

Random forests™ are great. They are one of the best "black-box" supervised learning methods. If you have lots of data and lots of predictor variables, you could do worse than random forests. They can deal with messy, real data. If there are lots of extraneous predictors, they have no problem. They automatically do a good job of finding interactions as well. There are no assumptions that the response has a linear (or even smooth) relationship with the predictors.

As cited in the Wikipedia article, they do lack some interpretability. But what they lack in interpretation, they more than make up for in prediction power, which I believe is much more important than interpretation in most cases. Even though you cannot easily tell how one variable affects the prediction, you can easily create a partial dependence plot which shows how the response will change as you change the predictor. You can also do this for two variables at once to see the interaction of the two.
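For instance, the randomForest package ships a partialPlot() function; a minimal sketch, assuming the rf1 fit and df data frame from the simulation further down:

library(randomForest)
partialPlot(rf1, pred.data = df, x.var = "x1") # marginal effect of x1
partialPlot(rf1, pred.data = df, x.var = "x2")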

Also helping in the interpretation is that they can output a list of predictor variables that they believe to be important in predicting the outcome. If nothing else, you can subset the data to only include the most "important" variables, and use that with another model. The randomForest package in R has two measures of importance. One is "total decrease in node impurities from splitting on the variable, averaged over all trees." I do not know much about this one, and will not talk about it further. The other is based on a permutation test. The idea is that if the variable is not important (the null hypothesis), then rearranging the values of that variable will not degrade prediction accuracy. Random forests use out-of-bag (OOB) samples to measure prediction accuracy.
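A conceptual sketch of that permutation idea for a regression forest (an illustration only, not the package's exact OOB computation; perm_importance is a made-up helper name):

perm_importance <- function(fit, data, yname, xname, nrep = 10) {
  # baseline mean squared error on the supplied data
  base_mse <- mean((predict(fit, data) - data[[yname]])^2)
  # average increase in MSE after permuting one predictor
  mean(replicate(nrep, {
    d <- data
    d[[xname]] <- sample(d[[xname]])
    mean((predict(fit, d) - d[[yname]])^2)
  })) - base_mse
}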

In my experience, it does a pretty good job of finding the most important predictors, but it has issues with correlated predictors. For example, I was working on a problem where I was predicting the price that electricity trades. One feature that I knew would be very important was the amount of electricity being used at that same time. But I thought there might also be a relationship between price and the electricity being used a few hours before and after. When I ran the random forest with these variables, the electricity used 1 hour after was found to be more important than the electricity used at the same time. When including the 1 hour after electricity use instead of the current hour electricity use, the cross validation (CV) error increased. Using both did not significantly change the CV error compared to using just current hour. Because the electricity used at the current hour and the hour after are very correlated, it had trouble telling which one was more important. In truth, given the electricity use at the current hour, the electricity use at the hour after did not improve the predictive accuracy.

Why does the importance measure give more weight to correlated predictors? Strobl et al. give some intuition about what is happening and propose a solution. Basically, the permutation test is ill-posed: it tests that the variable is independent of the response as well as of all other predictors. Since the correlated predictors are obviously not independent of each other, we get high importance scores. They propose a permutation test where you condition on the correlated predictors. This is a little tricky when the correlated predictors are continuous, but you can read the paper for more details.

Another way to think of it is that, since each split only considers a subset of the possible variables, a variable that is correlated with an "important" variable may be considered when the "important" variable is not available. This would cause the correlated variable to be selected for the split. The correlated variable does hold some predictive value, but only because of the truly important variable, so it is understandable why this procedure would consider it important.

I ran a simulation experiment (similar to the one in the paper) to demonstrate the issue. I simulated 5 predictor variables. Only the first one is related to the response, but the second one has a correlation of about 0.7 with the first. Luckily, Strobl et al. included their version of importance in the party package in R. I compare variable importance from the randomForest package with the importance from the party package, with and without taking correlated predictors into account.


# simulate the data: only x1 drives y; x2 is correlated with x1 (r ~ 0.7)
x1 <- rnorm(1000)
x2 <- rnorm(1000, x1, 1)
y  <- 2 * x1 + rnorm(1000, 0, .5)
df <- data.frame(y, x1, x2, x3 = rnorm(1000), x4 = rnorm(1000), x5 = rnorm(1000))

# run the randomForest implementation
library(randomForest)
rf1 <- randomForest(y~., data=df, mtry=2, ntree=50, importance=TRUE)
importance(rf1,type=1)

# run the party implementation
library(party)
cf1 <- cforest(y~.,data=df,control=cforest_unbiased(mtry=2,ntree=50))
varimp(cf1)
varimp(cf1,conditional=TRUE)


For randomForest, the ratio of the importance of the first and second variables is 4.53. For party, without accounting for correlation, it is 7.35. Accounting for correlation, it is 369.5. Higher ratios are better because they mean the importance of the first variable is more prominent. party's implementation is clearly doing the job.
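The ratios quoted above can be reproduced along these lines (exact values vary from run to run since no seed is set):

imp <- importance(rf1, type = 1)
imp["x1", ] / imp["x2", ] # randomForest
vi <- varimp(cf1)
vi["x1"] / vi["x2"] # party, unconditional
vic <- varimp(cf1, conditional = TRUE)
vic["x1"] / vic["x2"] # party, conditional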

There is a downside. It takes much longer to calculate importance with correlated predictors than without. For the party package in this example, it took 0.39 seconds to run without and 204.34 seconds with. I could not even run the correlated importance on the electricity price example. There might be a research opportunity to get a quicker approximation.

Possibly up next: confidence limits for random forest predictions.

Thursday, March 7, 2013

Exporting plain, lattice, or ggplot graphics

This article was first published on G-Forge » R


A blend between a basic scatterplot, lattice scatterplot and a ggplot
In a recent post I compared the Cairo package with the base package for exporting graphs. Matt Neilson was kind enough to share in a comment that the Cairo library is now included in R by default, although you need to specify the type="cairo" option to invoke it. In this post I examine how the ggplot and lattice packages behave when exporting.

basic plot

The nice thing about R's built-in export functions is that point size scales automatically with the resolution. You don't need to multiply the "pointsize" parameter any more; you can just go with a higher resolution and everything will look just as nice. This is very convenient, as you can adjust both the text and the plotted points by changing a single export parameter. Below is an example of a plot using pointsize 12.
# Create the data
data <- rnorm(100, sd=15)+1:100
# Create a simple scatterplot
# with long labels to enhance 
# size comparison
my_sc_plot <- function(data){
  par(cex.lab=1.5, cex.main=2)
  plot(data, 
       main="A simple scatterplot", 
       xlab="A random variable plotted", 
       ylab="Some rnorm value",
       col="steelblue")
  x <- 1:100
  abline(lm(data~x), lwd=2)
}
png(filename="plain_PNG_pt12.png", 
    type="cairo",
    units="in", 
    width=5, 
    height=4, 
    pointsize=12, 
    res=96)
my_sc_plot(data)
dev.off()
A basic plot with the export point size set to 12
Below is the same with slightly increased point size:
png(filename="plain_PNG_pt14.png", 
    type="cairo",
    units="in", 
    width=5, 
    height=4, 
    pointsize=14, 
    res=96)
my_sc_plot(data)
dev.off()
The same image as the previous one, but with the point size 2 points bigger (14 pt)

ggplot2

This behavior is different for the lattice and ggplot2 packages. In both, the text is unchanged by the point size, although it is good to know that ggplot's margins do change with the point size:
library(ggplot2)
my_gg_plot <- function(data){
  ggplot(data.frame(y=data, x=1:100), aes(y=y, x)) +
    geom_point(aes(x, y), colour="steelblue", size=3) +
    stat_smooth(method="lm", se=FALSE, color="black", lwd=1) + 
    ggtitle("A simple scatterplot") + theme_bw() +
    theme(title = element_text(size=20),
          axis.title.x = element_text(size=18),
          axis.title.y = element_text(size=18),
          axis.text.x  = element_text(size=14),
          axis.text.y  = element_text(size=14))+
    xlab("A random variable plotted") +
    ylab("Some rnorm value")
}
 
png(filename="GGplot_PNG_pt12.png", 
    type="cairo",
    units="in", 
    width=5, 
    height=4, 
    pointsize=12, 
    res=96)
my_gg_plot(data)
dev.off()
A ggplot with the point size set to 12
Now notice the margins if we compare with a plot with the point size × 3:
png(filename="GGplot_PNG_pt36.png", 
    type="cairo",
    units="in", 
    width=5, 
    height=4, 
    pointsize=12*3, 
    res=96)
my_gg_plot(data)
dev.off()
A ggplot with the point size set to 36
My conclusion is that there seems to be an impact on the margins but not on the text.
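If the goal is to scale all the text in a ggplot, the idiomatic lever is the theme's base_size rather than the device pointsize; a minimal sketch (my_gg_plot_scaled is a made-up name):

my_gg_plot_scaled <- function(data, base = 18){
  ggplot(data.frame(y = data, x = 1:100), aes(x, y)) +
    geom_point(colour = "steelblue", size = 3) +
    theme_bw(base_size = base) # scales title, labels and tick text together
}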

lattice

The lattice package in turn seems to be unaffected by any change to the point size:
library(lattice)
my_lt_plot <- function(data){
  xyplot(y~x, cex=1, 
         panel = function(x, y) {
           panel.xyplot(x, y)
           panel.abline(lm(y ~ x), lwd=2)
         },
         lwd=2,
         data=data.frame(y=data, x=1:100), 
         main=list(label="A simple scatterplot", cex=2), 
         xlab=list(label="A random variable plotted", cex=1.8), 
         ylab=list(label="Some rnorm value", cex=1.8),
         scales=list(cex=2))
}
 
png(filename="Lattice_PNG_pt12.png", 
    type="cairo",
    units="in", 
    width=5, 
    height=4, 
    pointsize=12, 
    res=96)
my_lt_plot(data)
dev.off()
A lattice plot with the point size set to 12

And to compare with 3 times the point size:
png(filename="Lattice_PNG_pt36.png", 
    type="cairo",
    units="in", 
    width=5, 
    height=4, 
    pointsize=12*3, 
    res=96)
my_lt_plot(data)
dev.off()
A lattice plot with the point size set to 36
There seems to be no real difference.
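If you do want lattice text to scale globally, it listens to its own parameter system rather than the device pointsize; a hedged sketch (the explicit cex values in my_lt_plot multiply on top of this base setting):

trellis.par.set(fontsize = list(text = 18, points = 10))
my_lt_plot(data) # redrawn with the larger base font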

panels

As a last exercise I checked what happens to the plots when using panels/side-by-side plots (hoping to have the topic covered by now). Let's start with the basic plot:
png(filename="Plain_PNG_2x2.png", 
    type="cairo",
    units="in", 
    width=5*2, 
    height=4*2, 
    pointsize=12, 
    res=96)
par(mfrow=c(2,2))
my_sc_plot(data)
my_sc_plot(data)
my_sc_plot(data)
my_sc_plot(data)
dev.off()
Plain plot: there is a minor difference and the text seems a little smaller in the 2×2 plot
Both the ggplot and the lattice performed better:
png(filename="GGplot_PNG_2x2.png", 
    type="cairo",
    units="in", 
    width=5*2, 
    height=4*2, 
    pointsize=12, 
    res=96)
p1 <- my_gg_plot(data)
library(grid) # pushViewport() and viewport() come from the grid package
pushViewport(viewport(layout = grid.layout(2, 2)))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
print(p1, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(p1, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))
dev.off()
The ggplot as a 2×2 plot
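A common alternative for arranging ggplots, assuming the gridExtra package is installed, avoids the manual viewport bookkeeping:

library(gridExtra)
grid.arrange(p1, p1, p1, p1, ncol = 2)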
png(filename="Lattice_PNG_2x2.png", 
    type="cairo",
    units="in", 
    width=5*2, 
    height=4*2, 
    pointsize=12, 
    res=96)
p1 <- my_lt_plot(data)
print(p1, position=c(0, .5, .5, 1), more=TRUE)
print(p1, position=c(.5, .5, 1, 1), more=TRUE)
print(p1, position=c(0, 0, .5, .5), more=TRUE)
print(p1, position=c(.5, 0, 1, .5))
dev.off()
The lattice plot as a 2×2 plot

summary

In summary, both ggplot and lattice seem to ignore the pointsize argument, although ggplot's margins are affected. The panel plots work as expected, although regular plots may need a little tweaking.