Using Genetic Algorithms and Decision Trees for Feature Selection with Cross-Validation

library(genalg)   # genetic algorithm
library(party)    # decision trees (ctree)
require(caret)    # cross-validation folds

dataset = read.csv(, header=TRUE)          # supply the CSV file path as the first argument
X = dataset[, 1:(ncol(dataset)-1)]         # extract predictors
Y = data.frame(dataset[, ncol(dataset)])   # extract the target attribute; it must be the last column

n = 10                 # number of folds
trainindex = list()    # list of 10 training index sets
testindex = list()     # list of 10 testing index sets
len = length(data.frame(X)) + 1       # number of predictor columns + 1 (kept for reference)
folds = createFolds(Y[, 1], k = n)    # create the folds

for (k in 1:n) {   # 10 folds, so 10 training sets and 10 testing sets
  trainindex[[k]] = folds[[k]]
  for (i in 0:(n-3)) {
    if (k + i + 1 <= n)
      trainindex[[k]] = c(trainindex[[k]], folds[[k+i+1]])
    else
      trainindex[[k]] = c(trainindex[[k]], folds[[k+i+1-n]])
  }
  if (k + (n-1) <= n)
    testindex[[k]] = folds[[k+n-1]]
  else
    testindex[[k]] = folds[[k-1]]
}

evaluateFunc <- function(indices) {
  matrices = list()   # list of 10 confusion matrices
  result = 1
  if (sum(indices) > 2) {
    for (j in 1:n) {   # 10 folds, so 10 training sets and 10 testing sets
      trainValues = X[trainindex[[j]], ]
      trainTarget = data.frame(Y[trainindex[[j]], 1])
      colnames(trainTarget) = c('Target')
      testValues = X[testindex[[j]], ]
      testTarget = data.frame(Y[testindex[[j]], 1])
      colnames(testTarget) = c('Target')
      # Decision tree model training
      model = ctree(Target ~ ., data = data.frame(trainValues[, indices == 1], trainTarget))
      # Use the test set for prediction
      predictions <- predict(model, data.frame(testValues[, indices == 1], testTarget))
      # Generate the confusion matrix for each fold
      matrices[[j]] = table(Y[testindex[[j]], 1], predictions)
    }
    cfMatrix = Reduce('+', matrices)   # combine all the confusion matrices into a single one
    result = 1 - (sum(diag(cfMatrix)) / sum(cfMatrix))   # the error rate is returned
  }
  result
}

monitor <- function(obj) {
  minEval = min(obj$evaluations)
  filter = obj$evaluations == minEval
  bestObjectCount = sum(rep(1, obj$popSize)[filter])
  # deal with the situation that more than one object is best
  if (bestObjectCount > 1) {
    bestSolution = obj$population[filter, ][1, ]
  } else {
    bestSolution = obj$population[filter, ]
  }
  outputBest = paste(obj$iter, " #selected=", sum(bestSolution),
                     " Best (Error=", minEval, "): ", sep = "")
  for (var in 1:length(bestSolution)) {
    outputBest = paste(outputBest, bestSolution[var], " ", sep = "")
  }
  outputBest = paste(outputBest, "\n", sep = "")
  cat(outputBest)            # print out the best individual
  plot(obj, type = "hist")   # plot the histogram of the current generation
}

model <- rbga.bin(size = ncol(dataset)-1, mutationChance = 0.1, zeroToOneRatio = 20,
                  evalFunc = evaluateFunc, verbose = TRUE, monitorFunc = monitor)
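The train/test index bookkeeping above can also be written more compactly. As a sketch (using `createFolds` on a toy target vector, which is an assumption for illustration): each fold serves exactly once as the test set, and the remaining n-1 folds are concatenated into the matching training set. This produces the same partition as the rotation loop above, just paired with `k` in a different order.

```r
library(caret)

set.seed(1)
y = factor(rep(c("yes", "no"), each = 25))   # toy target vector (assumption)
n = 10
folds = createFolds(y, k = n)                # same call as in the code above

# each fold is the test set exactly once;
# the other n-1 folds together form the training set
testindex  = folds
trainindex = lapply(1:n, function(k) unlist(folds[-k], use.names = FALSE))
```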
 * 1) Introduction: In this session, given a dataset whose predictor and target attributes have been identified beforehand, we perform feature selection to find the most significant attributes, using a Genetic Algorithm (GA) with a decision-tree-based evaluation (fitness function). Throughout the process, 10-fold cross-validation is applied to avoid overfitting.
 * 2) Required packages
 * 3) * genalg: A simple Genetic Algorithm in R
 * 4) * caret: For creating the n-fold cross-validation splits
 * 5) * party: A very comprehensive package for Decision Tree in R
 * 6) Monitor Function (monitorFunc): A function that runs once per generation; it allows monitoring of variables and progress, and can be used for debugging.
 * 7) Fitness Function (evalFunc): A user-supplied function that calculates the evaluation (fitness) value of a given chromosome.
 * 8) The code:
 * 9) Import the required packages.
 * 10) Read the CSV file into a data frame.
 * 11) Evaluation function, using a decision tree: each chromosome is evaluated by the error rate of its decision-tree model.
 * 12) Monitor function for progress monitoring.
 * 13) The actual GA execution.
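Once `rbga.bin` finishes, the selected features can be read off from the returned object. A short sketch, assuming the `model` object and `dataset` from the code above (the variable names `best` and `selected` are introduced here for illustration):

```r
# best chromosome in the final population (a 0/1 vector over the predictors)
best = model$population[which.min(model$evaluations), ]

# map it back to the predictor names (the target is the last column of dataset)
selected = colnames(dataset)[1:(ncol(dataset)-1)][best == 1]
cat("Selected features:", selected, "\n")

summary(model)   # genalg's textual summary of the GA run
```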