YOU&ME: January 2013

Tuesday, January 22, 2013

Armadillo eigenvalues

source: http://gallery.rcpp.org/articles/armadillo-eigenvalues/

Today a (slightly confused) question on StackOverflow wondered how to access R’s facilities for eigenvalues calculations from C code.

For this, we need to step back and consider how this is done. In fact, R farms the calculation out to the BLAS. On could possibly access R’s functions—but would then have to wrestle with the data input/output issues which make Rcpp shine in comparison. Also, Rcpp gets us access to Armadillo (via the RcppArmadillo) package and Armadillo’s main focus are exactly the linear algebra calculations and decompositions.

And with facilities that were added to Rcpp in the 0.10.* release series, this effectively becomes a one-liner of code! (Nitpickers will note that there are also one include statement, two attributes declarations and the function name itself.)

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::vec getEigenValues(arma::mat M) {
    return arma::eig_sym(M);
}

We can illustrate this easily via a random sample matrix.

set.seed(42)
X <- matrix(rnorm(4*4), 4, 4)
Z <- X %*% t(X)

getEigenValues(Z)

        [,1]
[1,]  0.3319
[2,]  1.6856
[3,]  2.4099
[4,] 14.2100

In comparison, R gets the same results (in reverse order) and also returns the eigenvectors.

eigen(Z)

$values
[1] 14.2100  2.4099  1.6856  0.3319

$vectors
         [,1]     [,2]    [,3]     [,4]
[1,]  0.69988 -0.55799  0.4458 -0.00627
[2,] -0.06833 -0.08433  0.0157  0.99397
[3,]  0.44100 -0.15334 -0.8838  0.03127
[4,]  0.55769  0.81118  0.1413  0.10493

Armadillo has other eigenvector computations too, see its documentation.

Using Eigen for eigenvalues

source: http://gallery.rcpp.org/articles/eigen-eigenvalues/

A previous post showed how to compute eigenvalues using the Armadillo library via RcppArmadillo.

Here, we do the same using Eigen and the RcppEigen package.

#include <RcppEigen.h>

// [[Rcpp::depends(RcppEigen)]]

using Eigen::Map;                // 'maps' rather than copies 
using Eigen::MatrixXd;                  // variable size matrix, double precision
using Eigen::VectorXd;                  // variable size vector, double precision
using Eigen::SelfAdjointEigenSolver;    // one of the eigenvalue solvers

// [[Rcpp::export]]
VectorXd getEigenValues(Map<MatrixXd> M) {
    SelfAdjointEigenSolver<MatrixXd> es(M);
    return es.eigenvalues();
}

We can illustrate this easily via a random sample matrix.

set.seed(42)
X <- matrix(rnorm(4*4), 4, 4)
Z <- X %*% t(X)

getEigenValues(Z)

[1]  0.3319  1.6856  2.4099 14.2100

In comparison, R gets the same results (in reverse order) and also returns the eigenvectors.

eigen(Z)

$values
[1] 14.2100  2.4099  1.6856  0.3319

$vectors
         [,1]     [,2]    [,3]     [,4]
[1,]  0.69988 -0.55799  0.4458 -0.00627
[2,] -0.06833 -0.08433  0.0157  0.99397
[3,]  0.44100 -0.15334 -0.8838  0.03127
[4,]  0.55769  0.81118  0.1413  0.10493

Eigen has other a lot of other decompositions, see its documentation for more details.

Using the GSL to compute eigenvalues

source: http://www.r-bloggers.com/using-the-gsl-to-compute-eigenvalues/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

Two posts showed how to compute eigenvalues using Armadillo andusing Eigen. As we also looked at using the
GNU GSL, this post will show how to conpute eigenvalues using GSL.

As mentioned in the previous GSL post, we instantiate C language pointers suitable for GSL (here the matrix M). Those must be freed manually, as shown before the return statement.

// [[Rcpp::depends(RcppGSL)]]

#include <RcppGSL.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_eigen.h>

// [[Rcpp::export]]
Rcpp::NumericVector getEigenValues(Rcpp::NumericMatrix sM) {

    RcppGSL::matrix<double> M(sM);  // create gsl data structures from SEXP
    int k = M.ncol();
    Rcpp::NumericVector N(k);   // to store results 

    gsl_vector *eigval = gsl_vector_alloc(k);
    gsl_eigen_symm_workspace *w = gsl_eigen_symm_alloc(k);
    gsl_eigen_symm (M, eigval, w);
    gsl_eigen_symm_free (w);

    for (int j = 0; j < k; j++) {
        N[j] = gsl_vector_get(eigval, j);
    }
    M.free();                          // important: GSL wrappers use C structure
    return N;    // return vector  
}

We can illustrate this easily via a random sample matrix.

set.seed(42)
X <- matrix(rnorm(4*4), 4, 4)
Z <- X %*% t(X)

getEigenValues(Z)

[1] 14.2100  2.4099  1.6856  0.3319

In comparison, R gets the same results (in reverse order) and also returns the eigenvectors.

eigen(Z)

$values
[1] 14.2100  2.4099  1.6856  0.3319

$vectors
         [,1]     [,2]    [,3]     [,4]
[1,]  0.69988 -0.55799  0.4458 -0.00627
[2,] -0.06833 -0.08433  0.0157  0.99397
[3,]  0.44100 -0.15334 -0.8838  0.03127
[4,]  0.55769  0.81118  0.1413  0.10493

Safely Loading Packages in R

source: http://www.r-bloggers.com/safely-loading-packages-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

Using R snippets written by other developers can be unendingly maddening. There are a variety of reasons for this, most of which boil down to a simple issue: most code is written such that a system must be configured in precisely the same way as the code’s author’s machine. Anyone who’s ever seen a line like this:
read.xls("C:/Users/MCaine/code/R/projecteuler/someotherdirectory/data.xls")
knows what I am talking about. To use this without modification, you must:

Use Windows.
Have exactly the directory structure specified by the address (which is highly unlikely, unless you were the one who wrote it).
Have the gdata package installed and included in the project (which is both unlikely, and difficult to know without already being a regular user of the package).

You can see how it would already be easier to just change the address to whatever works on your machine. For this problem, I’m afraid I have no easy solution. My preferred approach is to provide URLs, so that the directory structure doesn’t depend on the user’s machine, but this obviously provides its own host of problems. However, I can take aim at this third issue.

Package management in R is a silly thing, both because it is so easy and it is so easy to screw up. Most people who write R write it like analysts: they write exactly enough code the get the desired output on their machine, and leave it at that. When such R Code is made public, it tends to be very difficult to use it in actual replication. But there are some simple ways around that.

getPackage <- function(pkg){
  if(!require(pkg, character.only=TRUE)){
    install.packages(pkg)
    library(pkg, character.only=TRUE)
  }
  return(TRUE)
}

Consider this gist. Instead of calling library(package), which fails if the library is not installed on the user’s machine, getPackage(package) invokes that safer require function, which doesn’t fail when the local machine doesn’t have the requisite package. Instead, it returns false, which triggers the R script to download the package from the user’s default CRAN mirror and then bring it into the user’s working session. If anything goes wrong, then it throws an error, but it won’t do that for anything silly like the user not having a package that is a mere two commands away.

One note: because of the esoteric manner in which R treats package names, you must pass this function a string and not a package name. If you’re not on top of your type-know-how, this means that getPackage(plyr) will fail. You should instead writegetPackage("plyr").

Now, I’m sure there’s a very good reason hidden deep in R’s core that this is a bad way to do things, but it has saved me time and headaches. R is recognized as a difficult language both within the programming world (for its strange inconsistencies and its non-grown-up hacker culture), and outside of it (because programming is hard). I wish that more R functions were written in the defensive way to decrease the cognitive barriers to using R for the latter group.

Friday, January 4, 2013

Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

soure: http://www.r-bloggers.com/my-intro-to-multiple-classification-with-random-forests-conditional-inference-trees-and-linear-discriminant-analysis/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.
The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books. In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!! The following table describes the numerical breakdown of books and chapters from each author:

	# Books	# Chapters/Scenes
Austen	5	317
London	6	296
Milton	2	55
Shakespeare	12	173

Overall, the dataset contains 841 rows and 71 columns. 69 of those columns are the counted stop words (wow!), 1 is for what’s called the “BookID”, and the last is for the Author. I hope that the word counts are the number of times each word shows up per 100 words, or something that makes the counts comparable between authors/books.
The first thing I did after loading up the data into R was to create a training and testing set:

> authorship$randu = runif(841, 0,1)
> authorship.train = authorship[authorship$randu < .4,]
> authorship.test = authorship[authorship$randu >= .4,]

Then I set out to try to predict authorship in the testing data set using a Random Forests model, a Conditional Inference Tree model, and a Linear Discriminant Analysis model.
Random Forests Model
Here’s the code I used to train the Random Forests model (after finding out that the word “one” seemed to not be too important for the classification):

authorship.model.rf = randomForest(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have + her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train, ntree=5000, mtry=15, importance=TRUE)

It seemed to me that the mtry argument shouldn’t be too low, as we are trying to discriminate between authors! Following is a graph showing the Mean Decrease in Accuracy for each of the words in the Random Forests Model:

As you can see, some of the most important words for classification in this model were “was”, “be”, “to”, “the”, “her” and “had. At this point, I can’t help but think of the ever famous “to be, or not to be” line, and can’t help but wonder if these are the sort of words that would show up more often in Shakespeare texts. I don’t have the original texts to re-analyze, so I can only rely on what I have in this dataset to try to answer that question. Doing a simple average of how often the word “be” shows up per chapter per author, I see that it shows up an average of 12.9 times per scene in Shakespeare, and an average of 20.2 times per chapter in Austen! Shakespeare doesn’t win that game!!
Anyway, the Random Forests model does pretty well in classifying authors in the test set, as you can see in the counts and proportions tables below:

> authorship.test$pred.author.rf = predict(authorship.model.rf, authorship.test, type="response")
> table(authorship.test$Author, authorship.test$pred.author.rf)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.rf),1)

              Austen London Milton Shakespeare
  Austen         182      4      0           0
  London           1    179      0           1
  Milton           0      1     33           2
  Shakespeare      1      2      0         102

                   Austen      London      Milton Shakespeare
  Austen      0.978494624 0.021505376 0.000000000 0.000000000
  London      0.005524862 0.988950276 0.000000000 0.005524862
  Milton      0.000000000 0.027777778 0.916666667 0.055555556
  Shakespeare 0.009523810 0.019047619 0.000000000 0.971428571

If you look on the diagonal, you can see that the model performs pretty well across authors. It seems to do the worst with Milton (although still pretty darned well), but I think that should be expected due to the low number of books and chapters from him.
Conditional Inference Tree
Here’s the code I used to train the Conditional Inference Tree model:

> authorship.model.ctree = ctree(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have +  her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train)

Following is the plot that shows the significant splits done by the Conditional Inference Tree:

As is painfully obvious at first glance, there are so many end nodes that seeing the different end nodes in this graph is out of the question. Still useful however are the ovals indicating what words formed the significant splits. Similar to the Random Forests model, we see “be” and “was” showing up as the most significant words in discriminating between authors. Other words it splits on don’t seem to be as high up on the list in the Random Forests model, but the end goal is prediction, right?
Here is how the Conditional Inference Tree model did in predicting authorship in the test set:

> authorship.test$pred.author.ctree = predict(authorship.model.ctree, authorship.test, type="response")
> table(authorship.test$Author, authorship.test$pred.author.ctree)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.ctree),1)

              Austen London Milton Shakespeare
  Austen         173      8      1           4
  London          18    148      3          12
  Milton           0      6     27           3
  Shakespeare      6     10      5          84

                   Austen      London      Milton Shakespeare
  Austen      0.930107527 0.043010753 0.005376344 0.021505376
  London      0.099447514 0.817679558 0.016574586 0.066298343
  Milton      0.000000000 0.166666667 0.750000000 0.083333333
  Shakespeare 0.057142857 0.095238095 0.047619048 0.800000000

Overall it looks like the Conditional Inference Tree model is doing a worse job predicting authorship compared with the Random Forests model (again, looking at the diagonal). Again we see the Milton records popping up as having the lowest hit rate for classification, but I think it’s interesting/sad that only 80% of Shakespeare records were correctly classified. Sorry old man, it looks like this model thinks you’re a little bit like Jack London, and somewhat like Jane Austen and John Milton.
Linear Discriminant Analysis
Finally we come to the real star of this particular show. Here’s the code I used to train the model:

> authorship.model.lda = lda(Author ~ a + all + also + an + any + are + as + at + be + been + but + by + can + do + down + even + every + for. + from + had + has + have +  her + his + if. + in. + into + is + it + its + may + more + must + my + no + not + now + of + on + one + only + or + our + should + so + some + such + than + that + the + their + then + there + things + this + to + up + upon + was + were + what + when + which + who + will + with + would + your, data=authorship.train)

There’s a lot of summary information that the lda function spits out by default, so I won’t post it here, but I thought the matrix scatterplot of the records plotted along the 3 linear discriminants looked pretty great, so here it is:

From this plot you can see that putting all the words in the linear discriminant model seems to have led to pretty good discrimination between authors. However, it’s in the prediction that you see this model shine:

> authorship.test$pred.author.lda = predict(authorship.model.lda, authorship.test, type="response")
> authorship.test$pred.author.lda = predict(authorship.model.lda, authorship.test)$class
> table(authorship.test$Author, authorship.test$pred.author.lda)
> prop.table(table(authorship.test$Author, authorship.test$pred.author.lda),1)

              Austen London Milton Shakespeare
  Austen         185      1      0           0
  London           1    180      0           0
  Milton           0      0     36           0
  Shakespeare      0      0      0         105

                   Austen      London      Milton Shakespeare
  Austen      0.994623656 0.005376344 0.000000000 0.000000000
  London      0.005524862 0.994475138 0.000000000 0.000000000
  Milton      0.000000000 0.000000000 1.000000000 0.000000000
  Shakespeare 0.000000000 0.000000000 0.000000000 1.000000000

As you can see, it performed flawlessly with Milton and Shakespeare, and almost flawlessly with Austen and London.
Looking at an explanation of LDA from dtreg I’m thinking that LDA is performing best here because the discriminant functions created are more sensitive to the boundaries between authors (defined here by stop word counts) than the binary splits made on the predictor variables in the decision tree models. Does this hold for all cases of classification where the predictor variables are numeric, or does it break down if the normality of the predictors is grossly violated? Feel free to give me a more experienced/learned opinion in the comments!

How to Add an Extra Vertical Axis to R Plots

SOURCE: http://www.r-bloggers.com/how-to-add-an-extra-vertical-axis-to-r-plots/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

Especially when analyzing time series, we often need plots with two vertical axes. Researchers often expect the two series to \"move together,\" but with different locations and scales. To show that the series move together, you should give each series its own scale. One vertical scale should appear on the left side of the plot and the other on the right. In this post, I\'ll show you how to add the extra vertical axis on the right side of the plot.

Generating Sample Data

Let\'s start by quickly generating two fake time series. This can be done with the code below.

 set.seed(3571)
n <- 100
a1 <- rnorm(n)
a2 <- rnorm(n)

x <- NULL
y <- NULL

x[1] <- a1[1]
y[1] <- a2[1]

for (t in 2:n) {
x[t] <- -.2*x[t-1] + .5*y[t-1] + a1[t]
y[t] <- y[t-1] + a2[t]
}

The Problem

Now let\'s plot the data in the \"standard way\": plot one series using the high-level plot() function and then add the other series using the low-level lines() function.

plot(ts(x))
lines(ts(y), col = \"red\")

This code gives the plot below.

You can see one big problem--the second series goes off the scale! (We could fix this particular problem by plotting the series in the opposite order, but that does not work in general.)
A second problem is that the scales of the series are different. Both series are heading to infinity, but the x series is going there more slowly. The x series also has less variability than the y series. Both of these differences suggest we plot the series on different vertical scales. Indeed, our hypothesis usually suggests that the series should \"move together,\" not that particular values of x should correspond to the same values of y.

The Solution

You can easily add a separate vertical scale using the par(new=TRUE) command between two calls to the high-level plot() function. The code below does just this.

plot(ts(x))
par(new = T)
plot(ts(y), col = \"red\")

This code produces the figure below.

We can see from this plot that the two series indeed move together. Also notice two details of this plot. First, the vertical axes (one is plotted over the other on the left side) have different scales. Second, the horizontal axes are on the same scale (you can tell because R is plotting the exact same axis twice, making the axis fuzzy).
Now we just need to clean up the plot a little, moving one axis to the right side of the figure, plotting the horizontal axis only once, and cleaning up the labeling a little. Note that R defaults for axes and axis notation are not the best, but I deal with that in a separate post. For this exercise, I use the R defaults when they make sense, even if they are not the most aesthetically pleasing.
I make the following changes.

Adjust the mar option in the graphical parameters to give me a little more room for constructing the vertical axis on the right side.
Add the xlab=\"x\" argument to the first plot() function call. This improves the label of the vertical axis on the left.
Add the axes = F, xlab = NA, and ylab = NA arguments to the second plot() function call. This removes all the axes and axis notation from this call.
Add the vertical axis on the right with a simple call to the axis() function and setting side = 4.
Add a label to the vertical axis on the right with the mtext() command. Use the options side = 4 and line = 3 to position the text \"y\" correctly.

The new code is given below.

par(mar = c(5,5,2,5))
plot(ts(x), ylab = \"x\")
par(new = T)
plot(ts(y), col = \"red\", axes = F, xlab = NA, ylab = NA)
axis(side = 4)
mtext(side = 4, line = 3, \"y\")

This code produces the following plot, which could be cleaned up a lot, but has the basic feature we\'re looking for: a separate vertical axis for each series.

Summary

In this post, I\'ve shown how to add a separate vertical axis for each series in a plot. This can be accomplished by setting the parameter new=\"TRUE\" after plotting the first series. I\'ve focused on solving the main problem at the expense of nicer axes and axis notation. I\'ve written more about how to improve on R\'s axis defaults here and here.

I encourage you to share this with others and contribute to the conversation at How to Add an Extra Vertical Axis to R Plots, which first appeared at carlislerainey.com.For more of my thoughts and ideas, subscribe to my blog (via RSS or Email) and follow me on Twitter. You also might like to browse my archive and read my papers on Strategic Mobilization and Testing Hypotheses of No Meaningful Effect.

Clustering with selected Principal Components

SOURCE: http://www.r-bloggers.com/clustering-with-selected-principal-components/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

In the Visualizing Principal Components post, I looked at the Principal Components of the companies in the Dow Jones Industrial Average index over 2012. Today, I want to show how we can use Principal Components to create Clusters (i.e. form groups of similar companies based on their distance from each other)

Let’s start by loading the historical prices for the the companies in the Dow Jones Industrial Average index that we saved in the Visualizing Principal Components post.

###############################################################################

# Load Systematic Investor Toolbox (SIT)

# http://systematicinvestor.wordpress.com/systematic-investor-toolbox/

###############################################################################

setInternet2(TRUE)

con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))

    source(con)

close(con)

    #*****************************************************************

    # Load historical data

    #****************************************************************** 

    load.packages('quantmod')   

    # load data saved in the bt.pca.test() function

    load(file='bt.pca.test.Rdata')

    #*****************************************************************

    # Principal component analysis (PCA), for interesting discussion

    # http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html

    #****************************************************************** 

    prices = data$prices    

    ret = prices / mlag(prices) - 1

    p = princomp(na.omit(ret))

    loadings = p$loadings[]

    x = loadings[,1]

    y = loadings[,2]

    z = loadings[,3]

To create Clusters, I will use the hierarchical cluster analysis, hclust function, in stats package. The first argument in the hclust function is the distance (dissimilarity) matrix. To compute distance matrix, let’s take the first 2 principal components and compute the Euclidean distance between each company:

#*****************************************************************

# Create clusters

#******************************************************************         

# create and plot clusters based on the first and second principal components

hc = hclust(dist(cbind(x,y)), method = 'ward')

plot(hc, axes=F,xlab='', ylab='',sub ='', main='Comp 1/2')

rect.hclust(hc, k=3, border='red')

Similarly we can use the first three principal components:

# create and plot clusters based on the first, second, and third principal components

hc = hclust(dist(cbind(x,y,z)), method = 'ward')

plot(hc, axes=F,xlab='', ylab='',sub ='', main='Comp 1/2/3')

rect.hclust(hc, k=3, border='red')

Another option is to use the Correlation matrix as a proxy for a distance matrix:

# create and plot clusters based on the correlation among companies

hc = hclust(as.dist(1-cor(na.omit(ret))), method = 'ward')

plot(hc, axes=F,xlab='', ylab='',sub ='', main='Correlation')

rect.hclust(hc, k=3, border='red')

Please note that Clusters will be quite different, depending on the distance matrix you use.
To view the complete source code for this example, please have a look at the bt.clustering.test() function in bt.test.r at github.

Binary Classification – A Comparison of “Titanic” Proportions Between Logistic Regression, Random Forests, and Conditional Trees

SOURCE: http://www.r-bloggers.com/binary-classification-a-comparison-of-titanic-proportions-between-logistic-regression-random-forests-and-conditional-trees/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

Now that I’m on my winter break, I’ve been taking a little bit of time to read up on some modelling techniques that I’ve never used before. Two such techniques are Random Forests and Conditional Trees. Since both can be used for classification, I decided to see how they compare against a simple binomial logistic regression (something I’ve worked with a lot) for binary classification.

The dataset I used contains records of the survival of Titanic Passengers and such information as sex, age, fare each person paid, number of parents/children aboard, number of siblings or spouses aboard, passenger class and other fields (The titanic dataset can be retrieved from a page on Vanderbilt’s website replete with lots of datasets; look for “titanic3″).

I took one part of the dataset to train my models, and another part to test them. The factors that I focused on were passenger class, sex, age, and number of siblings/spouses aboard.

First, let’s look at the GLM:

titanic.survival.train = glm(survived ~ pclass + sex + pclass:sex + age + sibsp,
family = binomial(logit), data = titanic.train)

As you can see, I worked in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women compared to the men (i.e. Higher class women were much more likely to survive than lower class women, whereas first class Men were more likely to survive than 2nd or 3rd class men, but not by the same margin as amongst the women). Following is the model summary output, if you’re interested:

> summary(titantic.survival.train)

Call:
glm(formula = survived ~ pclass + sex + pclass:sex + age + sibsp, 
    family = binomial(logit), data = titanic.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9342  -0.6851  -0.5481   0.5633   2.3164  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     6.556807   0.878331   7.465 8.33e-14 ***
pclass         -1.928538   0.278324  -6.929 4.24e-12 ***
sexmale        -4.905710   0.785142  -6.248 4.15e-10 ***
age            -0.036462   0.009627  -3.787 0.000152 ***
sibsp          -0.264190   0.106684  -2.476 0.013272 *  
pclass:sexmale  1.124111   0.299638   3.752 0.000176 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 858.22  on 649  degrees of freedom
Residual deviance: 618.29  on 644  degrees of freedom
AIC: 630.29

Number of Fisher Scoring iterations: 5

So, after I used my model to predict survival probabilities on the testing portion of the dataset, I checked to see how many records showed a probability of over .5 (or 50%), and then how many of those records were actual survivors. For the GLM, 146/164 (89%) of those records scored at 50% or higher were actual survivors. Not bad!

Now let’s move on to Random Forests:

> titanic.survival.train.rf

Call:
 randomForest(formula = as.factor(survived) ~ pclass + sex + age +      sibsp, data = titanic.train, ntree = 5000, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 5000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 22.62%
Confusion matrix:
    0   1 class.error
0 370  38  0.09313725
1 109 133  0.45041322

> importance(titanic.survival.train.rf)
               0          1 MeanDecreaseAccuracy MeanDecreaseGini
pclass  67.26795 125.166721            126.40379         34.69266
sex    160.52060 221.803515            224.89038         62.82490
age     70.35831  50.568619             92.67281         53.41834
sibsp   60.84056   3.343251             52.82503         14.01936

It seems to me that the output indicates that the Random Forests model is better at creating true negatives than true positives, with regards to survival of the passengers, but when I asked for the predicted survival categories in the testing portion of my dataset, it appeared to do a pretty decent job predicting who would survive and who wouldn’t:

For the Random Forests model, 155/184 (84%) of those records predicted to survive actually did survive! Again, not bad.

Finally, lets move on to the Conditional Tree model:

> titanic.survival.train.ctree

  Conditional inference tree with 7 terminal nodes

Response:  as.factor(survived) 
Inputs:  pclass, sex, age, sibsp 
Number of observations:  650 

1) sex == {female}; criterion = 1, statistic = 141.436
  2) pclass <= 2; criterion = 1, statistic = 55.831
    3) pclass <= 1; criterion = 0.988, statistic = 8.817       
     4)*  weights = 74      3) pclass > 1
      5)*  weights = 47 
  2) pclass > 2
    6)*  weights = 105 
1) sex == {male}
  7) pclass <= 1; criterion = 1, statistic = 15.095     
   8)*  weights = 88    
  7) pclass > 1
    9) age <= 12; criterion = 1, statistic = 14.851
      10) sibsp <= 1; criterion = 0.998, statistic = 12.062               11)*  weights = 18        
      10) sibsp > 1
       12)*  weights = 12 
    9) age > 12
      13)*  weights = 306

I really happen to like the graph output by plotting the conditional tree model. I find it pretty easy to understand. As you can see, the model started the split of the data according to sex, which it found to be most significant, then pclass, age, and then siblings/spouses. That being said, let’s look at how the prediction went for comparison purposes:

134/142 (94%) of records predicted to survive actually did survive! Great!! Now let’s bring all those prediction stats into one table for a final look:

                     glm randomForests ctree
Actually Survived    146           155   134
Predicted to Survive 164           184   142
% Survived           89%           84%   94%

So, in terms of finding the highest raw number of folks who actually did survive, the Random Forests model won, but in terms of having the highest true positive rate, the Conditional Tree model takes the cake! Neat exercise!

Notes:

(1) There were numerous records without an age value. I estimated ages using a regression model taken from those who did have ages. Following are the coefficients for that model:
44.59238 + (-5.98582*pclass) + (.15971*fare) + (-.14141*pclass:fare)

(2) Although I used GLM scores of over .5 to classify records as survived, I could have played with that to get different results. I really just wanted to have some fun, and not try all possibilities here:)

(3) I hope you enjoyed this post! If you have anything to teach me about Random Forests or Conditional Trees, please leave a comment and be nice (I’m trying to expand my analysis skills here!!).

Basics of Histograms

SOURCE: http://www.r-bloggers.com/basics-of-histograms/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

Histograms are used very often in public health to show the distributions of your independent and dependent variables. Although the basic command for histograms (hist()) in R is simple, getting your histogram to look exactly like you want takes getting to know a few options of the plot. Here I present ways to customize your histogram for your needs.

First, I want to point out that ggplot2 is a package in R that does some amazing graphics, including histograms. I will do a post on ggplot2 in the coming year. However, the hist() function in base R is really easy and fast, and does the job for most of your histogram-ing needs. However, if you want to do complicated histograms, I would recommend reading up on ggplot2.

Okay so for our purposes today, instead of importing data, I'll create some normally distributed data myself. In R, you can generate normal data this way using the rnorm() function:

BMI<-rnorm(n=1000, m=24.2, sd=2.2)

So now we have some BMI data, and the basic histogram plot that comes out of R looks like this:

hist(BMI)

Which is actually pretty nice. There are a number of things that R does by default in creating this histogram, and I think it's useful to print out that information to understand the parameters of this histogram. You can do this by saving the histogram as an object and then printing it like this:

histinfo<-hist(BMI)
histinfo

And you get the output below:

This is helpful because you can see how R has decided to break up your data by default. It shows the breaks, which are the cutoff points for the bins. It shows the counts, intensity/density for each bin (same thing but two different names for R version compatibility), the midpoints of each bin, and then the name of the variable, whether the bins are equidistant, and the class of the object. You can of course take any one of these outputs by itself, i.e. histinfo$counts would give you just the vector of counts.

Now we can manipulate this information to customize our plot.

1. Number of bins

R chooses how to bin your data for you by default using an algorithm, but if you want coarser or finer groups, there are a number of ways to do this. We can see that right now from the output above that the breaks go from 17 to 32 by 1. You can use the breaks() option to change this in a number of ways. An easy way is just to give it one number that gives the number of cells for the histogram:

hist(BMI, breaks=20, main="Breaks=20")
hist(BMI, breaks=5, main="Breaks=5")

The bins don't correspond to exactly the number you put in, because of the way R runs its algorithm to break up the data but it gives you generally what you want. If you want more control over exactly the breakpoints between bins, you can be more precise with the breaks() option and give it a vector of breakpoints, like this:

hist(BMI, breaks=c(17,20,23,26,29,32), main="Breaks is vector of breakpoints")

This dictates exactly the start and end point of each bin. Of course, you could give the breaks vector as a sequence like this to cut down on the messiness of the code:

hist(BMI, breaks=seq(17,32,by=3), main="Breaks is vector of breakpoints")

Note that when giving breakpoints, the default for R is that the histogram cells are right-closed (left open) intervals of the form (a,b]. You can change this with the right=FALSE option, which would change the intervals to be of the form [a,b). This is important if you have a lot of points exactly at the breakpoint.

2. Frequency vs Density

Often, we are more interested in density than frequency, since frequency is relative to your sample size. Instead of counting the number of datapoints per bin, R can give the probability densities using the freq=FALSE option:

hist(BMI, freq=FALSE, main="Density plot")

Notice the y-axis now. If the breaks are equidistant, with difference between breaks=1, then the height of each rectangle is proportional to the number of points falling into the cell, and thus the sum of the probability densities adds up to 1. Here I specify plot=FALSE because I just want the histogram output, not the plot, and show how the sum of all of the densities is 1:

However, if you choose to make bins that are not all separated by 1 (like breaks=c(17,25,26, 32) or something like that), then the plot still has an area of 1, but the area of the rectangles is the fraction of data points falling into the cells. The densities are calculated like this as counts/(n*diff(breaks). Thus, this adds up to 1 if add up the areas of the rectangles, i.e. you multiply each density by the difference in the breaks like this:

3. Plot aesthetics

Finally, we can make the histogram better looking by adjusting the x-axis, y-axis, axis labels, title, and color like this:

hist(BMI, freq=FALSE, xlab="Body Mass Index", main="Distribution of Body Mass Index", col="lightgreen", xlim=c(15,35), ylim=c(0, .20))

Here along with our frequency option, I changed the x-axis label, changed the main title, made the color light green, and provided limits for both the x-axis and y-axis. Note that defining the look of you axis using xlim() will not have any impact on the bins - this option is only for the aesthetics of the plot itself.

Finally, I can add a nice normal distribution curve to this plot using the curve() function, in which I specify a normal density function with mean and standard deviation that is equal to the mean and standard deviation of my data, and I add this to my previous plot with a dark blue color and a line width of 2. You can play around with these options to get the kind of line you want:

curve(dnorm(x, mean=mean(BMI), sd=sd(BMI)), add=TRUE, col="darkblue", lwd=2)

And we get this!

100 most read R posts in 2012

source: http://www.r-bloggers.com/100-most-read-r-posts-for-2012-stats-from-r-bloggers-big-data-visualization-data-manipulation-and-other-languages/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

Top 100 R posts of 2012
R-bloggers’ success is thanks to the content submitted by the over 400 R bloggers who have joined r-bloggers. The R community currently has around 245 active R bloggers (links to the blogs are clearly visible in the right navigation bar on the R-bloggers homepage). In the past year, these bloggers wrote around 3200 posts about R!
Here is a list of the top visited posts on the site in 2012 (you can see the number of page views in parentheses):