Regression

Given the values for a number of predictor variables, how well can we predict the value of a response variable?

Multilinear regression

Assuming that the response variable depends linearly on the predictor variables, we can use multilinear regression. The residuals are the differences between the actual values and the predicted values. The quality of a model can be assessed by the residual sum of squares (RSS).

> (myLM <- lm(Ozone ~ ., data=airquality))

Call:
lm(formula = Ozone ~ ., data = airquality)

Coefficients:
(Intercept)      Solar.R         Wind         Temp        Month          Day
  -64.11632      0.05027     -3.31844      1.89579     -3.03996      0.27388

> sum(residuals(myLM)**2)
[1] 45682.93

Forward selection/backward elimination

Could we leave out some of the predictor variables and still get a good fit? Clearly, the RSS will increase or stay the same if we do this, but it may be worth the cost, as fewer variables means less chance of overfitting. We can use stepwise forward selection or backward elimination to investigate this effect:

> step(myLM, direction="backward")
Start:  AIC= 680.21
 Ozone ~ Solar.R + Wind + Temp + Month + Day

          Df Sum of Sq   RSS   AIC
- Day      1       619 46302   680
<none>                 45683   680
- Month    1      1755 47438   682
- Solar.R  1      2005 47688   683
- Wind     1     11534 57217   703
- Temp     1     20845 66528   720

Step:  AIC= 679.71
 Ozone ~ Solar.R + Wind + Temp + Month

          Df Sum of Sq   RSS   AIC
<none>                 46302   680
- Month    1      1701 48003   682
- Solar.R  1      1953 48254   682
- Wind     1     11520 57822   702
- Temp     1     20420 66721   718

Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

Coefficients:
(Intercept)      Solar.R         Wind         Temp        Month
   -58.0538       0.0496      -3.3165       1.8709      -2.9916

(To understand this fully, you should look up AIC and the 'step' function.) The first table shows the effect of removing a single variable from the model (note that <none> has the same RSS as we computed earlier). Removing the 'Day' variable has the least effect on the RSS (that is, the smallest difference in the sum of squares, 619). The second table considers dropping a second variable from the four that remain. However, none of these removals reduces the AIC, and so the final selected model has four components.
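
As a quick sanity check, we can reproduce the comparison by hand (a hypothetical session; note that step uses extractAIC, which drops constant terms, so its numbers differ from those of the AIC function). The two values should match the 680.21 and 679.71 reported in the step trace above:

> myLM2 <- update(myLM, . ~ . - Day)   # drop 'Day', as step suggested
> extractAIC(myLM)                     # full model: edf and AIC
> extractAIC(myLM2)                    # reduced model: lower AIC is better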

The best way to assess the quality of a model, and to guide stepwise selection, is to use separate training and test sets. Otherwise, the RSS is an overoptimistic measure of predictive accuracy.
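
A minimal sketch of this idea (the 70/30 split and the use of na.omit to drop rows with missing values are my own choices here, not part of any standard recipe):

> complete <- na.omit(airquality)      # lm silently drops these rows anyway
> set.seed(1)                          # make the random split reproducible
> trainRows <- sample(nrow(complete), round(0.7 * nrow(complete)))
> trainLM <- lm(Ozone ~ ., data=complete[trainRows, ])
> # RSS on the held-out rows is a fairer measure of predictive accuracy
> sum((complete$Ozone[-trainRows] - predict(trainLM, newdata=complete[-trainRows, ]))^2)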

Subset/feature selection using Ant Colony Optimization

The step function, described above, does not sample every possible combination of descriptors. In situations with a large number of descriptors, it may not perform very well (I think!). What is required are methods that rapidly sample large area of 'feature space' such as simulated annealing, genetic and evolutionary algorithms, or the new kid on the block, ant colony optimization. Here I discuss a modified ant colony optimization algorithm by Shen, Jiang, Tao, Shen and Yu (J. Chem. Inf. Model., 2005, 45, 1024), which sounds quite interesting.
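
To give the flavour of the approach, here is a toy sketch of the general ant colony idea applied to descriptor selection. This is not the modified algorithm of Shen et al.; the pheromone rules and the parameters nAnts, nIter, rho and Q are the simplest assumptions I could make, purely for illustration. Each subset is scored by AIC (via extractAIC, as step does) so that smaller subsets can beat the full model.

# Toy ant colony optimization for descriptor selection. NOT Shen et al.'s
# modified algorithm -- the update rules below are simplistic placeholders.
aco.select <- function(data, response, nAnts=20, nIter=50, rho=0.1, Q=0.5) {
    predictors <- setdiff(names(data), response)
    tau <- rep(1, length(predictors))      # pheromone level per descriptor
    best <- list(subset=predictors, aic=Inf)
    for (iter in 1:nIter) {
        iterBest <- NULL
        for (ant in 1:nAnts) {
            # each ant includes descriptor i with probability tau[i]/(tau[i]+1)
            chosen <- runif(length(tau)) < tau/(tau + 1)
            if (!any(chosen)) next
            form <- reformulate(predictors[chosen], response)
            aic <- extractAIC(lm(form, data=data))[2]
            if (is.null(iterBest) || aic < iterBest$aic)
                iterBest <- list(chosen=chosen, aic=aic)
        }
        tau <- (1 - rho) * tau             # pheromone evaporation
        if (!is.null(iterBest)) {
            # lay extra pheromone on the best trail found this iteration
            tau[iterBest$chosen] <- tau[iterBest$chosen] + Q
            if (iterBest$aic < best$aic)
                best <- list(subset=predictors[iterBest$chosen], aic=iterBest$aic)
        }
    }
    best
}

> aco.select(na.omit(airquality), "Ozone")

Passing na.omit(airquality) ensures every candidate model is fitted to the same rows, so the AIC values are comparable across subsets.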

I have implemented this algorithm in R, and will post it here as soon as it is in a decent state.