Data Analysis | Justin Dixon

Decision Trees

Have you been struggling to understand what decision trees are? Finding it difficult to link pictures of trees with machine learning algorithms? If you answered yes to either question, then this post is for you. Decision trees are a powerful predictive machine learning method that every data analyst should know. When I was researching tree-based methods I could never find a hand-worked problem. Most other sources simply list the maths or show the results of a grown tree.

Random Forests

Introduction

Following on from the previous post about decision trees, let us move on to Random Forests. Let us use the Soybean data from the ‘mlbench’ package. There are 35 features and 683 observations with 19 classes of Soybean disease. Why care about Random Forests? Let us look at how our decision trees predict previously unseen data. First we will load the data in:

    library(mlbench)
    library(caret)
    data("Soybean")   # the Soybean data described above
    dim(Soybean)

Let us now split the data up into a training and test data set.

Who is the angriest?

Overall sentiments - magnitude

    library(ggplot2)
    library(dplyr)

    # Keep only the columns needed for the overall plots
    overallData <- subset(sentimentData, select = c('file', 'Date', 'magnitude', 'score'))

    p <- ggplot(overallData, aes(x = Date, y = magnitude, colour = file)) +
      geom_line() +
      ggtitle('Overall show sentiment magnitude') +
      xlab('Date') +
      ylab('Magnitude') +
      labs(color = "Shock Jock") +
      theme_bw()
    p
    ggsave('1.png', p)

Overall sentiments - score

    p <- ggplot(overallData, aes(x = Date, y = score, colour = file)) +
      geom_line() +
      ggtitle('Overall show sentiment score') +
      xlab('Date') +
      ylab('Score') +
      labs(color = "Shock Jock") +
      theme_bw()
    p
    ggsave('2.png', p)

Segment Analysis - By Day - 1st August

    # Select a single day of segments
    dateData <- filter(sentimentData, Date == '2018-08-01')

    # Express each segment's position as a fraction of that host's show
    dateData <- mutate(dateData, percentageDone = case_when(
      file == 'Ben Fordham' ~ X / nrow(filter(dateData, file == 'Ben Fordham')),
      file == 'Ray Hadley' ~ X / nrow(filter(dateData, file == 'Ray Hadley')),
      file == 'Chris Smith' ~ X / nrow(filter(dateData, file == 'Chris Smith')),
      file == 'Alan Jones' ~ X / nrow(filter(dateData, file == 'Alan Jones'))
    ))

    p <- ggplot(dateData, aes(x=percentageDone, y = sentiment.