Random Forests

Introduction

Following on from the previous post about decision trees, let us move on to random forests. We will use the BreastCancer data from the 'mlbench' package, which contains 699 observations of 9 cell-level features, each labelled as either benign or malignant.

Why care about Random Forests?

Let us look at how well a single decision tree predicts previously unseen data. First we will load the data in:

library(mlbench)
library(caret)
library(knitr)

data("BreastCancer")
dim(BreastCancer)
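
Before modelling it is worth glancing at the class balance, since any model has to beat the majority-class baseline. A quick look, using only base R on the data we just loaded:

# Roughly 65% of the tumours are benign, so predicting 'benign' for
# everything would already be about 65% accurate
table(BreastCancer$Class)
prop.table(table(BreastCancer$Class))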

Let us now split the data into a training set and a test set.

# Drop rows with missing values and the Id column, which is not a predictor
BreastCancer <- na.omit(BreastCancer)
BreastCancer <- subset(BreastCancer, select = -Id)

# Create a stratified 75/25 train/test split
index <- createDataPartition(y = BreastCancer$Class, times = 1, p = 0.75, list = FALSE)
train0 <- BreastCancer[index, ]
test0 <- BreastCancer[-index, ]
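
Because createDataPartition samples within each class, the benign/malignant ratio should be roughly the same in both sets. A quick check of the split made above:

# The class proportions in the training and test sets should match closely
prop.table(table(train0$Class))
prop.table(table(test0$Class))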

We can now train a decision tree as we did in the previous tutorial and see how well the model predicts the test set.

Tree <- caret::train(Class ~ ., data = train0, method = 'rpart')
# Equivalently, without caret: Tree <- rpart(Class ~ ., data = train0)
results <- predict(Tree, newdata = test0)
hello <- caret::confusionMatrix(results, test0$Class)
overall <- hello$overall
kable(t(overall), digits = 3, format = 'markdown')
| Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
|---------:|------:|--------------:|--------------:|-------------:|---------------:|--------------:|
|    0.941 | 0.871 |         0.894 |         0.971 |        0.653 |              0 |         0.752 |
print(knitr::kable(hello$table))
|           | benign | malignant |
|:----------|-------:|----------:|
| benign    |    105 |         4 |
| malignant |      6 |        55 |

That is a pretty good model, but let's see if we can improve it. Decision trees are known to have low bias at the cost of high variance: they tend to overfit the training data, so small changes in the training set can produce very different trees. Random forests help us overcome this issue by averaging over many trees grown on bootstrap samples of the data.
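
Averaging is what makes this work. A standard result (not derived in the original post) is that if each of B trees has variance sigma^2 and the trees have an average pairwise correlation rho, then the variance of their average is

    rho * sigma^2 + (1 - rho) * sigma^2 / B

The second term shrinks as more trees are added, and the randomness injected by the bootstrap (plus the random feature selection discussed below) keeps rho small, which is where the variance reduction comes from.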

RF <- caret::train(Class ~ ., data = train0, method = 'rf', ntree = 300)
results <- predict(RF, newdata = test0)
hello1 <- caret::confusionMatrix(results, test0$Class)
overall1 <- hello1$overall
print(knitr::kable(t(overall1), digits = 3))
| Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
|---------:|------:|--------------:|--------------:|-------------:|---------------:|--------------:|
|    0.965 | 0.923 |         0.925 |         0.987 |        0.653 |              0 |         0.683 |
print(knitr::kable(hello1$table))
|           | benign | malignant |
|:----------|-------:|----------:|
| benign    |    107 |         2 |
| malignant |      4 |        57 |

The random forest achieves a slightly higher accuracy on this test set, which means a few more tumours were classified correctly. For a single train/test split the difference is small, but it is a good illustration of why a random forest is usually worth trying over a single decision tree.
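
For a quick side-by-side look at the two accuracies, we can pull them out of the confusionMatrix summaries computed above (overall for the tree, overall1 for the forest):

# Accuracy of the single tree versus the random forest on the test set
data.frame(tree = overall["Accuracy"], forest = overall1["Accuracy"])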

The Algorithm

Conceptually, a random forest is just a collection of trees, each grown on a bootstrap sample of the training data. A rough sketch in R, using rpart for the individual trees and the train0 data from above:

library(rpart)

numberOfTrees <- 300
RandomForest <- list()
for (i in seq_len(numberOfTrees)) {
  # Draw a bootstrap sample: n rows drawn with replacement
  bootstrapRows <- sample(nrow(train0), size = nrow(train0), replace = TRUE)
  sampleData <- train0[bootstrapRows, ]

  # Grow a tree on the bootstrap sample
  singleTree <- rpart(Class ~ ., data = sampleData)

  # Add the tree to the forest (double brackets store the tree object itself)
  RandomForest[[i]] <- singleTree
}

One thing this sketch leaves out: a true random forest also considers only a random subset of the features at each split (the mtry parameter in the 'rf' method used earlier). rpart does not do that, so strictly speaking the loop above is bagging, but it captures the bootstrap-and-average idea.

Regression Prediction

For regression, the forest's prediction is simply the average of the individual trees' predictions. Sketching that (and assuming here that the trees were grown on a continuous outcome rather than Class):

# predictionData holds the new observations we want predictions for
regressionPrediction <- 0
for (i in seq_len(numberOfTrees)) {
  # Each tree makes its own numeric prediction
  singlePrediction <- predict(RandomForest[[i]], newdata = predictionData)

  # Accumulate the predictions
  regressionPrediction <- regressionPrediction + singlePrediction
}
# Average over all the trees
regressionPrediction <- regressionPrediction / numberOfTrees

Classification Prediction

For classification, each tree votes for a class and the forest returns the majority vote:

# For simplicity, assume predictionData contains a single new observation
classificationPrediction <- character(numberOfTrees)
for (i in seq_len(numberOfTrees)) {
  # Each tree casts a vote for a class
  classificationPrediction[i] <- as.character(
    predict(RandomForest[[i]], newdata = predictionData, type = "class")
  )
}
# The forest predicts the class with the most votes
names(which.max(table(classificationPrediction)))

The benefits

The MAJOR benefit of random forests is that they are easy to use and require very little tinkering. Many machine learning models need careful hyperparameter tuning, whereas a random forest is about as close to a working model out of the box as you can get: in practice the only parameter that usually matters is mtry, the number of features considered at each split.
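
If you do want to tune that one parameter, caret makes it straightforward. A minimal sketch, reusing train0 from earlier (the candidate mtry values here are just illustrative):

# Let caret pick the best mtry by resampling over a small grid
grid <- expand.grid(mtry = c(2, 4, 6, 9))
RFtuned <- caret::train(Class ~ ., data = train0, method = 'rf',
                        tuneGrid = grid, ntree = 300)
RFtuned$bestTune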
