Introduction
Following on from the previous post about decision trees, let us move on to random forests. We will use the BreastCancer data from the ‘mlbench’ package: 699 observations of 9 predictor variables, each labelled as benign or malignant.
Why care about Random Forests?
Let us look at how our decision trees predict previously unseen data. First we will load the data in:
library(mlbench)
library(caret)
library(knitr)
data("BreastCancer")
dim(BreastCancer)
Let us now split the data up into a training and test data set.
# Drop rows with missing values and the Id column, which is not a predictor
BreastCancer <- na.omit(BreastCancer)
BreastCancer <- subset(BreastCancer, select = -Id)
# 75/25 train/test split, stratified on Class
index <- createDataPartition(y = BreastCancer$Class, times = 1, p = 0.75, list = FALSE)
train0 <- BreastCancer[index, ]
test0 <- BreastCancer[-index, ]
We can now train a decision tree as in the previous tutorial and see how well it predicts the test set.
Tree <- caret::train(Class ~ ., data = train0, method = 'rpart')
# Equivalently, without caret: Tree <- rpart(Class ~ ., data = train0)
results <- predict(Tree, newdata = test0)
hello <- caret::confusionMatrix(results, test0$Class)
overall <- hello$overall
kable(t(overall), digits = 3, format = 'markdown')
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
---|---|---|---|---|---|---|
0.941 | 0.871 | 0.894 | 0.971 | 0.653 | 0 | 0.752 |
print(knitr::kable(hello$table))
|           | benign | malignant |
|-----------|--------|-----------|
| benign    | 105    | 4         |
| malignant | 6      | 55        |
That is a pretty good model, but let's see if we can improve it. Decision trees are known to have low bias but high variance: they overfit the training data. Random forests overcome this by averaging many trees grown on bootstrap samples, keeping the bias low while reducing the variance.
RF <- caret::train(Class ~ ., data = train0, method = 'rf', ntree = 300)
results <- predict(RF, newdata = test0)
hello1 <- caret::confusionMatrix(results, test0$Class)
overall1 <- hello1$overall
print(knitr::kable(t(overall1), digits = 3))
Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
---|---|---|---|---|---|---|
0.965 | 0.923 | 0.925 | 0.987 | 0.653 | 0 | 0.683 |
print(knitr::kable(hello1$table))
|           | benign | malignant |
|-----------|--------|-----------|
| benign    | 107    | 2         |
| malignant | 4      | 57        |
We achieved a slightly higher accuracy, which means we correctly classified more of the patients in the test set. That alone makes a random forest well worth considering over a single decision tree.
The Algorithm
A random forest grows many trees, each on its own bootstrap sample of the training data; a true random forest also considers only a random subset of the features at each split. The sketch below captures the bootstrap (bagging) part using rpart:
library(rpart)
numberOfTrees <- 300
treeIndices <- seq_len(numberOfTrees)
RandomForest <- list()
for(i in treeIndices) {
# Draw a bootstrap sample of the training data (same size, with replacement)
sampleData <- train0[sample(nrow(train0), size = nrow(train0), replace = TRUE), ]
# Grow a tree on the bootstrap sample
singleTree <- rpart(Class ~ ., data = sampleData)
# Add the tree to the forest
RandomForest[[i]] <- singleTree
}
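For comparison, the randomForest package implements the full algorithm, including the random feature subset at each split (controlled by mtry). A minimal sketch, assuming the train0 data frame from above:
library(randomForest)
# 300 trees, each split considering roughly sqrt(number of predictors) features
rfFit <- randomForest(Class ~ ., data = train0, ntree = 300,
                      mtry = floor(sqrt(ncol(train0) - 1)))
rfFit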
Regression Prediction
# predictionData: a data frame of new observations to predict
# (assumes the trees in RandomForest were grown on a continuous response, ContinuousY)
regressionPrediction <- 0
for(i in treeIndices) {
# Make a single prediction from tree i
singlePrediction <- predict(RandomForest[[i]], newdata = predictionData)
# Accumulate the predictions
regressionPrediction <- regressionPrediction + singlePrediction
}
# Average over all trees
regressionPrediction <- regressionPrediction / numberOfTrees
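The loop above is just an average over trees, so it can also be written in one line; a sketch assuming the same RandomForest list and predictionData (regression trees) as above:
# Sum the per-tree prediction vectors, then divide by the number of trees
regressionPrediction <- Reduce(`+`, lapply(RandomForest, predict, newdata = predictionData)) / numberOfTrees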
Classification Prediction
# predictionData: a single new observation to classify, e.g. one row of test0
classificationPrediction <- character(numberOfTrees)
for(i in treeIndices) {
# Predict the class from tree i
singlePrediction <- predict(RandomForest[[i]], newdata = predictionData, type = "class")
# Record the vote
classificationPrediction[i] <- as.character(singlePrediction)
}
# Majority vote: the most frequently predicted class wins
names(which.max(table(classificationPrediction)))
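The caret model trained earlier performs this voting internally, so the hand-rolled vote can be checked against it; a sketch using the first row of test0 as the new observation:
# predict() on the caret random forest returns the majority-vote class directly
predict(RF, newdata = test0[1, ])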
The benefits
The MAJOR benefit of random forests is that they are easy to implement and require very little tinkering. Many machine learning models need careful parameter tuning, whereas a random forest is about as close to a working model out of the box as you can get.
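That said, the one parameter worth checking is mtry, the number of features considered at each split, and caret will tune it over a small grid. A minimal sketch; the grid values are illustrative, not a recommendation:
RF_tuned <- caret::train(Class ~ ., data = train0, method = 'rf', ntree = 300,
                         tuneGrid = expand.grid(mtry = c(2, 3, 5)))
RF_tuned$bestTune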