Given the values for a number of predictor variables, how well can we predict the value of a response variable?
Assuming that the response variable depends linearly on the predictor variables, we can use multiple linear regression. The residuals are the differences between the predicted values and the actual values. The quality of a model can be assessed by the residual sum of squares (RSS), that is, the sum of the squared residuals.
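Written out, with notation that is my own choice rather than anything fixed by R (n observations, p predictors, fitted coefficients with hats, and the usual sign convention for the residual), the quantities above are:

```latex
\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij},
\qquad
e_i = y_i - \hat{y}_i,
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} e_i^{2}
```

Least-squares regression picks the coefficients that make the RSS as small as possible for the chosen set of predictors.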
    > library(randomForest)
    > (myLM <- lm(Ozone ~ ., data=airquality))

    Call:
    lm(formula = Ozone ~ ., data = airquality)

    Coefficients:
    (Intercept)      Solar.R         Wind         Temp        Month          Day
      -64.11632      0.05027     -3.31844      1.89579     -3.03996      0.27388

    > sum(residuals(myLM)**2)
    [1] 45682.93
Could we leave out some of the predictor variables and still get a good fit? Clearly, the RSS will increase or remain the same if we do this, but it may be worth the cost, as fewer variables mean less chance of overfitting. We can use stepwise forward selection or backward elimination to investigate this effect:
    > step(myLM, direction="backward")
    Start:  AIC= 680.21
     Ozone ~ Solar.R + Wind + Temp + Month + Day

              Df Sum of Sq   RSS AIC
    - Day      1       619 46302 680
    <none>                 45683 680
    - Month    1      1755 47438 682
    - Solar.R  1      2005 47688 683
    - Wind     1     11534 57217 703
    - Temp     1     20845 66528 720

    Step:  AIC= 679.71
     Ozone ~ Solar.R + Wind + Temp + Month

              Df Sum of Sq   RSS AIC
    <none>                 46302 680
    - Month    1      1701 48003 682
    - Solar.R  1      1953 48254 682
    - Wind     1     11520 57822 702
    - Temp     1     20420 66721 718

    Call:
    lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

    Coefficients:
    (Intercept)      Solar.R         Wind         Temp        Month
       -58.0538       0.0496      -3.3165       1.8709      -2.9916
(To understand this fully, you should look up AIC and the 'step' function.) The first table shows the effect of removing a single variable from the model (note that <none> has the same value for RSS as we computed earlier). Removing the 'Day' variable has the least effect on the RSS (that is, the smallest difference in the sum of squares, 619) and gives the lowest AIC, so 'Day' is dropped. The second table shows the effect of dropping a second variable from the four remaining. However, none of these removals reduces the AIC, and so the final selected model has four predictors.
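To see where the AIC column comes from, here is a minimal sketch that recomputes it by hand. As far as I know, 'step' scores linear models with 'extractAIC', which for an 'lm' fit uses n*log(RSS/n) + 2*k (k being the number of fitted coefficients) and drops an additive constant; the variable names below are just for illustration.

```r
# Full model, as fitted earlier (lm silently drops rows containing NA)
fit <- lm(Ozone ~ ., data = airquality)

n   <- length(residuals(fit))   # observations actually used in the fit
rss <- sum(residuals(fit)^2)    # residual sum of squares
k   <- length(coef(fit))        # intercept plus five predictors

# AIC in the form used by extractAIC for lm objects (up to a constant)
aic_by_hand <- n * log(rss / n) + 2 * k
aic_by_hand

# extractAIC returns c(equivalent degrees of freedom, AIC)
extractAIC(fit)
```

The hand-computed value should agree with the "Start: AIC" figure in the output above. Because of the 2*k penalty, dropping a variable only lowers the AIC if the RSS does not rise by too much.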
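The best way to assess the quality of a model, and to drive the step function, is to split the data into a training set and a test set: fit the model on the training set and measure the error on the test set. Otherwise, the RSS is computed on the same data that was used to fit the model, and is overoptimistic as a measure of predictive accuracy.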
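A rough sketch of what that could look like for the airquality data follows. The 70/30 split, the seed, and the helper name 'test_rss' are arbitrary choices of mine, and with so few rows the result will vary quite a bit from split to split.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility

aq <- na.omit(airquality)        # keep complete rows only
train_idx <- sample(nrow(aq), size = round(0.7 * nrow(aq)))
train <- aq[train_idx, ]
test  <- aq[-train_idx, ]

# Fit and select on the training set only
full    <- lm(Ozone ~ ., data = train)
reduced <- step(full, direction = "backward", trace = 0)

# RSS on data the model has not seen (hypothetical helper)
test_rss <- function(model, newdata) {
  sum((newdata$Ozone - predict(model, newdata = newdata))^2)
}

errs <- c(full = test_rss(full, test), reduced = test_rss(reduced, test))
errs
```

Comparing the two numbers gives some idea of whether the variables removed by 'step' were actually helping prediction on unseen data.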
The step function, described above, does not sample every possible combination of descriptors. In situations with a large number of descriptors, it may not perform very well (I think!). What is required are methods that rapidly sample large areas of 'feature space', such as simulated annealing, genetic and evolutionary algorithms, or the new kid on the block, ant colony optimization. Here I discuss a modified ant colony optimization algorithm by Shen, Jiang, Tao, Shen and Yu (J. Chem. Inf. Model., 2005, 45, 1024), which sounds quite interesting.
I have implemented this algorithm in R, and will post it here as soon as it is in a decent state.
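In the meantime, for a data set as small as airquality, every subset of predictors can simply be enumerated, which gives a baseline to compare any search method against. This is only a sketch of brute-force search (not the ant colony method), scoring each subset with the same AIC that 'step' uses; with p predictors there are 2^p - 1 non-empty subsets, so it stops being practical very quickly.

```r
aq <- na.omit(airquality)                    # same rows for every model
predictors <- setdiff(names(aq), "Ozone")

# All non-empty subsets of the five predictors
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Score each subset by the AIC of its linear model
aic <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Ozone"), data = aq)
  extractAIC(fit)[2]
})

best <- subsets[[which.min(aic)]]
best        # predictors of the lowest-AIC model
min(aic)    # its AIC
```

If the model found here has a lower AIC than the one returned by 'step', then the stepwise search missed it.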