“Where does a machine learning engineer go camping…?”
Random Forest is a useful algorithm for solving classification problems in machine learning. In this post, I will try to explain my understanding of the basics of the algorithm and some of its potential applications.
As the name suggests, a simple analogy for the algorithm is a forest, which is made of many trees. A random forest is essentially a collection of different decision trees, each built from a random sample of the data drawn with replacement. Let’s think of a simple example.
Say you have to classify whether an animal is a dog or not in a data set of different animals. Although a 3-year-old could do this easily, it takes a computer some work to accomplish. This is an example of a binary classification task, where the output can only take one of two possible values. The task becomes more complicated (multiclass classification) when the output can take several values, for instance classifying cats, parrots, and other animals in the data set instead of just deciding whether or not the animal is a dog.
A decision tree has a hierarchical structure. The top node contains the entire unclassified data set, and each following layer partitions the data based on certain features until we reach the final layer, whose nodes are the “leaves” of the tree and can be formatted to give a suitable output. Each node is split by choosing the feature that maximizes the information gain, which is a function of the entropy before and after the split. An example of a splitting feature would be a dog’s ear shape and whether it is pointy or not. For each split we want to pick the feature that minimizes the weighted entropy of the resulting nodes, so the data gets more organized at every step and each node becomes purer (grouping more dogs together, for example).
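To make the entropy idea concrete, here is a minimal sketch in Python. The tiny animal set and the single boolean “pointy ears” feature are made up purely for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a binary label array (1 = dog, 0 = not a dog)."""
    p = np.mean(labels)
    if p == 0 or p == 1:
        return 0.0  # a pure node has zero entropy
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(labels, feature):
    """Entropy reduction from splitting `labels` on a boolean `feature`."""
    left, right = labels[feature], labels[~feature]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical toy data: 5 animals, label = is it a dog, feature = pointy ears
is_dog     = np.array([1, 1, 1, 0, 0], dtype=bool)
pointy_ear = np.array([1, 1, 0, 0, 0], dtype=bool)
print(information_gain(is_dog, pointy_ear))  # higher gain means a better split
```

At each node, the tree computes this gain for every candidate feature and splits on the one with the highest value.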
There are different criteria for deciding the depth of the tree (i.e. the number of layers), and the user can set a threshold for the classification accuracy. One might think we should keep splitting the data set until every training example is perfectly classified, but this makes the algorithm very prone to overfitting, which means it does not generalize well to new, unseen examples.
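As a rough illustration of that trade-off, the sketch below compares an unrestricted tree with one whose depth is capped. It uses synthetic data generated with scikit-learn rather than the animal example, so the exact numbers are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the animal data set (assumption: 1000 samples, 10 features)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = keep splitting until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
# The unrestricted tree typically scores near 1.0 on the training data
# but noticeably worse on the held-out test data.
```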
This is a major problem with a single decision tree: the algorithm has high variance, which causes the model to overfit the training data and perform poorly on new data. This is why a single decision tree is considered a weak learner; it does not adapt well beyond what it has seen. It is roughly analogous to rote memorization of a single textbook for a class: you might get a good grade, but you are unlikely to perform well on problems the textbook never mentions!
This is where the random forest algorithm comes in. If we combine multiple decision trees and aggregate their results, we get a more robust algorithm in which each decision tree has low correlation with the others, making the ensemble much less prone to overfitting. This process has three important steps: bootstrapping, training, and aggregation.
Bootstrapping is a sampling-with-replacement technique in which we randomly pick a subset of training examples to train a decision tree. Let’s come back to the dog classification example. Imagine placing all the animals in a very large opaque bag, picking animals one at a time, recording each animal’s type, and putting it back in the bag before picking the next one, until a fixed number have been drawn. This is what sampling with replacement means, and we can use it to create multiple data sets, one for each decision tree.
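A minimal sketch of this bootstrapping step, assuming the data has already been turned into a numeric feature matrix X and a label vector y (both hypothetical here), might look like this:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_sample(X, y):
    """Draw len(X) indices with replacement -- some rows repeat, others are left out."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

# Hypothetical feature matrix and labels for the animal example
X = np.arange(20).reshape(10, 2)                 # 10 animals, 2 features each
y = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])     # 1 = dog, 0 = not a dog

X_boot, y_boot = bootstrap_sample(X, y)          # one "bag" -- repeat once per tree
```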
The next step in the process is to train each decision tree, and this involves splitting nodes based on features. Instead of considering every feature and choosing the one that maximizes the information gain, the algorithm considers only a random subset of the features at each split and picks the best one within that subset. Since the random forest combines multiple trees, certain “strong” features in the data set might otherwise end up being used repeatedly by all the trees, defeating the purpose of reducing variance. Once each decision tree has been trained, it gives an output that needs to be aggregated.
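As a sketch of this feature subsampling, scikit-learn’s max_features parameter limits how many randomly chosen features each split is allowed to consider. The bootstrap sample below is made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=0)

# Hypothetical bootstrap sample (in practice, the output of the previous step)
X_boot = rng.normal(size=(100, 8))          # 100 animals, 8 numeric features
y_boot = rng.integers(0, 2, size=100)       # 1 = dog, 0 = not a dog

# Restrict every split to a random subset of features (sqrt of the feature count here),
# so no single "strong" feature dominates every tree in the forest.
tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
tree.fit(X_boot, y_boot)
```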
In the final and most revealing step of the process, the outputs of all the trees are combined, by majority vote for classification or by averaging for regression, to give the final prediction. This is analogous to the wisdom of the crowd: any single individual’s prediction is unlikely to be correct, but the aggregate result is far more accurate.
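A minimal sketch of the aggregation step, assuming made-up predictions from five already-trained trees:

```python
import numpy as np

# Hypothetical predictions from 5 trees for 4 animals (1 = dog, 0 = not a dog)
tree_predictions = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
])

# Majority vote across trees (for regression you would simply keep the mean)
votes = tree_predictions.mean(axis=0)
forest_prediction = (votes >= 0.5).astype(int)
print(forest_prediction)   # [1 0 1 0]
```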
You might have observed that random sampling repeats some data points and leaves others out entirely, and that is perfectly fine: the left-out examples form the “out-of-bag” data, which can be used to estimate the model’s accuracy without setting aside a separate test set.
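In scikit-learn this is exposed through the oob_score option; the sketch below uses synthetic data purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption); oob_score=True evaluates each tree
# on the samples it never saw during bootstrapping.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print(forest.oob_score_)   # out-of-bag accuracy estimate, no separate test set needed
```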
In conclusion, a random forest consists of multiple decision trees, each trained on a different data set created from the original data via sampling with replacement, and the aggregated output of all the trees is used to make the final prediction.
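To tie the three steps together, here is a rough end-to-end sketch that builds a small forest by hand on synthetic data; the tree count, seeds, and data are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(seed=0)
trees = []
for i in range(25):                                    # 25 trees in the forest
    idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap: sample with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx])) # train on the bootstrap sample

# Aggregate: majority vote over the individual trees' predictions
votes = np.mean([t.predict(X_test) for t in trees], axis=0)
accuracy = np.mean((votes >= 0.5).astype(int) == y_test)
print(accuracy)
```

In practice you would reach for a ready-made implementation such as scikit-learn’s RandomForestClassifier, which wraps these same three steps.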
“… in a random forest 😉”