Noel O'Boyle

Supervised Classification

Here I discuss how to measure variable importance in CART and how to carry out leave-one-out cross-validation for linear discriminant analysis, and I summarise the usage of a number of different classification methods in R.

Variable importance in CART (Classification & Regression Trees)

The library 'rpart' implements the CART algorithm of Breiman et al., as described in their excellent 1984 book. This library should be used instead of 'tree' (for the reasons why, search the R-help mailing list). I have written a function to calculate variable importance.

Variable importance in the CART book is defined for each variable as the sum of the decrease in impurity for each of the surrogate variables at each node. I believe it also includes the decrease in impurity for the best split at that node (although this is not explicitly stated in the book). According to Therneau in a posting to the S users mailing list: "For variable importance, Breiman et al credit each variable with improvement if it is a primary split, [and] improvement*agreement, if it is not a primary split." I don't know whether this is in some other published work of Breiman, or what the story is, but this is what I've implemented in the following function, importance.R.

Note: it's probably a good idea to include as many surrogates as possible, as shown in the example below. Probable bug: only the best-scoring surrogate for each variable should be considered - I don't do this at present.

An example of its usage is:

> library(rpart)
> data(iris)
> mydata <- cbind(iris[5],iris[1:4]) # The first column needs to be the response
> rp <- rpart(Species ~ ., mydata, control=rpart.control(maxsurrogate=100))
> source("importance.R")
> a <- importance(rp)
> summary(a)
 Looking at nodeID: 1
  (1) Need to find subset
    Choices made to get here:...
    Size of current subset: 150
  (2) Look at importance of chosen split
    The best delta_i is: 50 for Petal.Length and 2.45
  (3) Look at importance of surrogate splits
    The best delta_i is: 50 for Petal.Width and 0.8
    The best delta_i is: 34.1640502354788 for Sepal.Length and 5.45
    The best delta_i is: 19.0385075340828 for Sepal.Width and 3.35
 Looking at nodeID: 3
  (1) Need to find subset
    Choices made to get here:...
     Came from nodeID: 1
       Applying command subset(allData, Petal.Length >= 2.45 )
    Size of current subset: 100
  (2) Look at importance of chosen split
    The best delta_i is: 38.9694041867955 for Petal.Width and 1.75
  (3) Look at importance of surrogate splits
    The best delta_i is: 37.3535353535353 for Petal.Length and 4.75
    The best delta_i is: 10.6868686868687 for Sepal.Length and 6.15
    The best delta_i is: 3.41414141414142 for Sepal.Width and 2.95
> print(a)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    39.23234     18.15290     83.99172     88.96940

(That is, Petal.Width is most important, followed closely by Petal.Length.)
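
As an aside, recent versions of rpart calculate a very similar quantity themselves: if I read the current help page correctly, each variable is credited with the goodness of split where it is the primary variable, plus goodness times adjusted agreement where it acts as a surrogate. On an up-to-date installation you can compare my numbers with:

rp$variable.importance # also printed near the top of summary(rp)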

Cross-validation of linear-discriminant analysis models

Bare background

Linear discriminant analysis uses linear combinations of the variables to map the samples onto a lower-dimensional space (at most one dimension fewer than the number of groups), in such a way that the ratio of between-group variance to within-group variance is maximised.

library(MASS)
mylda <- lda(Species ~ ., iris)
mylda.pred <- predict(mylda, iris)
table(mylda.pred$class,iris$Species)
#             setosa versicolor virginica
#  setosa     50      0          0
#  versicolor  0     48          1
#  virginica   0      2         49

# If you have more than one linear discriminant, you can plot the results
# in the transformed space as follows (samples *and* variables)
plot(mylda)
a <- mylda$scaling/2 # 2 is a scaling factor for the lines
for (i in 1:nrow(a)) {lines(x=c(0,a[i,1]), y=c(0,a[i,2]), lty=3, col="green" )} # one line per variable
text(a,dimnames(a)[[1]],cex=0.5,col="blue")

# If you have a single linear discriminant, try:
plot(mylda)
# and
plot(mylda,type="density")

Cross validation

In the above example, the 'apparent error rate' is 3/150. This is the misclassification rate when the model is used to predict the class of its own training set. A more realistic measure of how well a model built on the training set will predict the class of a test set is the leave-one-out cross-validation (LOOCV) score: for each member of the training set, you build a model based on the other n-1 members and try to predict the class of the member that was left out. The overall misclassification rate is the LOOCV score. Some classification methods in R will calculate this for you. For example, for LDA:

mylda <- lda(Species ~ ., iris, CV=TRUE)
table(mylda$class, iris$Species)
#               setosa versicolor virginica
#    setosa     50      0          0
#    versicolor  0     48          1
#    virginica   0      2         49
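
If all you want is the overall LOOCV misclassification rate rather than the full table, it can be read straight off the $class component that is returned when CV=TRUE:

mean(mylda$class != iris$Species) # 3/150 = 2% for this example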

Unfortunately, if the variance of any column is 0, LDA doesn't work. That's not so bad, you might say: you can just remove any column with zero variance before you create the model. True, but the same holds during LOOCV. If any column is one value away from zero variance (for example, all 2's except for one 5), then not only will LOOCV not work, it will not tell you that it hasn't worked - it will quietly produce a nonsense result (in one case, for me, it classified everything as category 1; I don't know whether this happens in general).

To find out the name of any column with zero variance, use:

names(data)[sapply(data,var)==0]
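
And to drop them before building the model (assuming here that 'data' contains only numeric columns), something like:

data <- data[,sapply(data,var)!=0] # keep only the columns with non-zero variance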

To find out whether any column is one row away from zero variance, use:

for (i in 1:length(data[,1])) {print(names(data)[sapply(data[-i,],var)==0])}

So what do you do if you have this problem and you want to do LOOCV? Well, you just need to do it by hand, and remove any columns with zero variance as you leave out each row. Here's the code, preceded by an example of its use (note that the built-in LOOCV is much faster than doing it by hand):

source("ldacv.R") # loads my function from an appropriately-named file
mycv <- ldacv(Species ~ ., iris)
table(mycv, iris$Species)
# mycv         setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     48          1
#   virginica   0      2         49

ldacv <- function(myformula, data) {
    response <- all.vars(myformula)[1]
    ans <- numeric(length(data[,1]))
    for (i in 1:length(data[,1])) {
        train <- data[-i,]
        test <- data[i,]
        # Applies var to every column (var is not defined for the factor response, hence the NA)
        train.var <- sapply(train, function(x) if (is.numeric(x)) var(x) else NA)
        keep <- is.na(train.var) | train.var != 0 # Keeps the response and any column with non-zero variance
        train <- train[,keep]
        test <- test[,keep]
        mylda <- lda(myformula, train)
        ans[i] <- predict(mylda, test)$class
        }
    factor(ans,labels=levels(data[,response]))
    }
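
To see the speed difference mentioned above, here is a rough comparison (the exact timings will depend on your machine):

system.time(lda(Species ~ ., iris, CV=TRUE)) # built-in LOOCV
system.time(ldacv(Species ~ ., iris)) # by hand - refits the model n times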

Other methods available in R

There are a number of methods for supervised classification available in R, and by and large they share the same interface. This is highlighted by the examples below. Of course, each method also has additional options which are not shown here.

This list is not exhaustive. There are several methods I have not included (because I didn't see them immediately): bagging methods, kernel discriminant analysis (ks), projection pursuit, k-Nearest Neighbours (knntree, knncat), recursive partitioning (rpart), mixture and flexible discriminant analysis (in package mda), and methods suited to binary classification (logistic regression, boosting methods (boost) and binary QSAR).

These examples use the famous (Fisher's or Anderson's) iris data set, which gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. The dataset ships with R itself and may be loaded using:

library(MASS) # not needed for the data, but used below for lda and qda
data(iris)

Decision trees

library(rpart)
mytree <- rpart(Species ~ ., iris)
plot(mytree); text(mytree) # in "iris_tree.ps"
summary(mytree)
mytree.pred <- predict(mytree,iris,type="class")
table(mytree.pred,iris$Species)
# mytree.pred  setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     47          1
#   virginica   0      3         49

# Leave-1-Out Cross Validation (LOOCV)
ans <- numeric(length(iris[,1]))
for (i in 1:length(iris[,1])) {
  mytree <- rpart(Species ~ ., iris[-i,])
  mytree.pred <- predict(mytree,iris[i,],type="class")
  ans[i] <- mytree.pred
  }
ans <- factor(ans,labels=levels(iris$Species))
table(ans,iris$Species)
# The same as above in this case
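
Note that rpart also carries out 10-fold cross-validation internally while growing the tree (the xval parameter of rpart.control); the cross-validated error appears in the xerror column of the complexity parameter table:

mytree <- rpart(Species ~ ., iris)
printcp(mytree) # xerror is the cross-validated error, relative to the error at the root node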

Quadratic Discriminant Analysis

library(MASS)
myqda <- qda(Species ~ ., iris)
myqda.pred <- predict(myqda, iris)
table(myqda.pred$class,iris$Species)
#              setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     48          1
#   virginica   0      2         49
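
As with lda, passing CV=TRUE gives the leave-one-out cross-validated predictions directly:

myqda.cv <- qda(Species ~ ., iris, CV=TRUE)
table(myqda.cv$class, iris$Species)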

Regularised Discriminant Analysis

library(klaR)
myrda <- rda(Species ~ ., iris)
myrda.pred <- predict(myrda, iris)
table(myrda.pred$class,iris$Species)
#              setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     48          0
#   virginica   0      2         50

Support Vector Machines

library(e1071)
mysvm <- svm(Species ~ ., iris)
mysvm.pred <- predict(mysvm, iris)
table(mysvm.pred,iris$Species)
# mysvm.pred   setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     48          2
#   virginica   0      2         48
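
e1071 can also estimate a k-fold cross-validated accuracy at fitting time (here 10-fold); the total and per-fold accuracies are reported by summary():

mysvm.cv <- svm(Species ~ ., iris, cross=10)
summary(mysvm.cv)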

Random Forests

library(randomForest)
myrf <- randomForest(Species ~ .,iris)
myrf.pred <- predict(myrf, iris)
table(myrf.pred, iris$Species)
# myrf.pred    setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     50          0
#   virginica   0      0         50
# (Suspiciously correct! - need to read the manual)
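
The suspiciously perfect table above arises because predict() on the training data uses every tree, including the ones that each sample helped to build. The out-of-bag (OOB) predictions, which randomForest stores in the fitted object, give a fairer picture:

table(myrf$predicted, iris$Species) # OOB: each sample is predicted only by trees that did not see it
print(myrf) # also reports the OOB error rate and confusion matrix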

Neural Nets

library(nnet)
mynnet <- nnet(Species ~ ., iris, size=1)
mynnet.pred <- predict(mynnet,iris,type="class")
table(mynnet.pred,iris$Species)
# mynnet.pred  setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     49          1
#   virginica   0      1         49
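
Since nnet starts from random initial weights, the exact numbers above will vary from run to run; set a seed first if you want reproducible results:

set.seed(1) # any fixed seed makes the random starting weights reproducible
mynnet <- nnet(Species ~ ., iris, size=1, trace=FALSE)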

Naive Bayes

library(klaR)
mynb <- NaiveBayes(Species ~ ., iris)
mynb.pred <- predict(mynb,iris)
table(mynb.pred$class,iris$Species)
#              setosa versicolor virginica
#   setosa     50      0          0
#   versicolor  0     47          3
#   virginica   0      3         47
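
The e1071 package (loaded above for svm) provides an equivalent naiveBayes() with much the same interface:

library(e1071)
mynb2 <- naiveBayes(Species ~ ., iris)
table(predict(mynb2, iris), iris$Species)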