Introduction
Following on from the previous post about decision trees, let us move on to random forests. We will use the BreastCancer data from the ‘mlbench’ package: 699 observations of 9 predictor variables, each labelled as benign or malignant.
Why care about Random Forests?
Let us look at how our decision trees predict previously unseen data. First we will load the data in:
library(mlbench)
library(caret)
library(knitr)
data("BreastCancer")
dim(BreastCancer)
Let us now split the data up into a training and test data set.
# Drop rows with missing values and the Id column, which is not a predictor
BreastCancer <- na.omit(BreastCancer)
BreastCancer <- subset(BreastCancer, select = -Id)
# 75/25 train/test split, stratified on Class
index <- createDataPartition(y = BreastCancer$Class, times = 1, p = 0.75, list = FALSE)
train0 <- BreastCancer[index, ]
test0 <- BreastCancer[-index, ]
We can now train a decision tree as in the previous tutorial and see how well it predicts the test set.
Tree <- caret::train(Class ~ ., data = train0, method = 'rpart')
# Equivalently, without caret: Tree <- rpart(Class ~ ., data = train0)
results <- predict(Tree, newdata = test0)
hello <- caret::confusionMatrix(results, test0$Class)
overall <- hello$overall
kable(t(overall), digits = 3, format = 'markdown')
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
---|---|---|---|---|---|---|
0.941 | 0.871 | 0.894 | 0.971 | 0.653 | 0 | 0.752 |
print(knitr::kable(hello$table))
|           | benign | malignant |
|-----------|--------|-----------|
| benign    | 105    | 4         |
| malignant | 6      | 55        |
That is a pretty good model, but let's see if we can improve it. Decision trees are known to have low bias but high variance: they overfit the training data. Random forests overcome this by averaging many trees grown on bootstrap samples, keeping the bias low while reducing the variance.
RF <- caret::train(Class ~ ., data = train0, method = 'rf', ntree = 300)
results <- predict(RF, newdata = test0)
hello1 <- caret::confusionMatrix(results, test0$Class)
overall1 <- hello1$overall
print(knitr::kable(t(overall1), digits = 3))
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
---|---|---|---|---|---|---|
0.965 | 0.923 | 0.925 | 0.987 | 0.653 | 0 | 0.683 |
print(knitr::kable(hello1$table))
|           | benign | malignant |
|-----------|--------|-----------|
| benign    | 107    | 2         |
| malignant | 4      | 57        |
We achieved a slightly higher accuracy, which means we correctly classified more of the patients in the test set. That alone makes a random forest well worth considering over a single decision tree.
The Algorithm
A random forest grows many trees, each on its own bootstrap sample of the training data; a true random forest also considers only a random subset of the features at each split. The sketch below captures the bootstrap (bagging) part using rpart:
library(rpart)
numberOfTrees <- 300
treeIndices <- seq_len(numberOfTrees)
RandomForest <- list()
for(i in treeIndices) {
# Draw a bootstrap sample of the training data (same size, with replacement)
sampleData <- train0[sample(nrow(train0), size = nrow(train0), replace = TRUE), ]
# Grow a tree on the bootstrap sample
singleTree <- rpart(Class ~ ., data = sampleData)
# Add the tree to the forest
RandomForest[[i]] <- singleTree
}
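For comparison, the randomForest package implements the full algorithm, including the random feature subset at each split (controlled by mtry). A minimal sketch, assuming the train0 data frame from above:
library(randomForest)
# 300 trees, each split considering roughly sqrt(number of predictors) features
rfFit <- randomForest(Class ~ ., data = train0, ntree = 300,
                      mtry = floor(sqrt(ncol(train0) - 1)))
rfFit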
Regression Prediction
# predictionData: a data frame of new observations to predict
# (assumes the trees in RandomForest were grown on a continuous response, ContinuousY)
regressionPrediction <- 0
for(i in treeIndices) {
# Make a single prediction from tree i
singlePrediction <- predict(RandomForest[[i]], newdata = predictionData)
# Accumulate the predictions
regressionPrediction <- regressionPrediction + singlePrediction
}
# Average over all trees
regressionPrediction <- regressionPrediction / numberOfTrees
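The loop above is just an average over trees, so it can also be written in one line; a sketch assuming the same RandomForest list and predictionData (regression trees) as above:
# Sum the per-tree prediction vectors, then divide by the number of trees
regressionPrediction <- Reduce(`+`, lapply(RandomForest, predict, newdata = predictionData)) / numberOfTrees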
Classification Prediction
# predictionData: a single new observation to classify, e.g. one row of test0
classificationPrediction <- character(numberOfTrees)
for(i in treeIndices) {
# Predict the class from tree i
singlePrediction <- predict(RandomForest[[i]], newdata = predictionData, type = "class")
# Record the vote
classificationPrediction[i] <- as.character(singlePrediction)
}
# Majority vote: the most frequently predicted class wins
names(which.max(table(classificationPrediction)))
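The caret model trained earlier performs this voting internally, so the hand-rolled vote can be checked against it; a sketch using the first row of test0 as the new observation:
# predict() on the caret random forest returns the majority-vote class directly
predict(RF, newdata = test0[1, ])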
The benefits
The MAJOR benefit of random forests is that they are easy to implement and require very little tinkering. Many machine learning models need careful parameter tuning, whereas a random forest is about as close to a working model out of the box as you can get.
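That said, the one parameter worth checking is mtry, the number of features considered at each split, and caret will tune it over a small grid. A minimal sketch; the grid values are illustrative, not a recommendation:
RF_tuned <- caret::train(Class ~ ., data = train0, method = 'rf', ntree = 300,
                         tuneGrid = expand.grid(mtry = c(2, 3, 5)))
RF_tuned$bestTune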