If you understand how CART decision trees are trained, then you can see why random forests are powerful by thinking through their training process.
In the first step of decision tree training we pick the best feature split, divide the data into two groups, and then train a sub-tree on each group independently. But what about the second-best feature split? In a sense, we lose the information the other candidate splits could have provided.
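The greedy split step can be sketched as follows. This is a minimal, illustrative implementation assuming Gini impurity as the split criterion and a tiny made-up dataset; the function names are hypothetical, not a library API.

```python
# Sketch of the greedy split step in CART training: score every
# (feature, threshold) candidate by weighted Gini impurity and keep
# the best one. Names and data here are illustrative.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y):
    """Return (feature_index, threshold) of the lowest-impurity split."""
    n = len(y)
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]

# Toy data: feature 0 separates the classes perfectly, feature 1 does not.
X = [[1, 10], [2, 20], [3, 10], [4, 20]]
y = [0, 0, 1, 1]
print(best_split(X, y))  # → (0, 2)
```

Note that feature 1 is never consulted again once feature 0 wins: the second-best split's information is simply discarded at this node.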
To see why, consider how a query is evaluated: the first step is to check that best split and pass the query down to one of the two sub-trees. But each sub-tree was trained on only part of the training data (roughly half, if the split is balanced), and so has weaker discriminatory power. Each successive split down the tree yields diminishing returns in how much information it provides.
Now consider what a random forest does: each tree is trained on a random subset of the features. If the feature containing the best split is available to a particular tree, that tree makes the same first split as the original CART tree. If it is not, the tree falls back to the second-best feature, if that one is present; if the top two features are both absent, the third-best split is chosen, and so on.
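This fallback behavior can be simulated directly. The sketch below assumes a fixed, illustrative ranking of features by split quality and draws a random two-feature subset per tree; the names and the ranking are made up for demonstration, not taken from any library.

```python
# Sketch of feature subsampling in a random forest: each tree sees
# only a random subset of features, so when the globally best feature
# is absent, the next-best available one is chosen instead.
import random

# Assumed ranking of features by split quality: 0 is best, then 1, etc.
FEATURE_RANKING = [0, 1, 2, 3]

def root_split_feature(available):
    """Pick the best-ranked feature among those this tree was given."""
    for f in FEATURE_RANKING:
        if f in available:
            return f
    return None

random.seed(0)
# Simulate 1000 trees, each seeing 2 of the 4 features.
chosen = [root_split_feature(random.sample(FEATURE_RANKING, 2))
          for _ in range(1000)]
# Across many trees, several different root splits are represented.
print({f: chosen.count(f) for f in FEATURE_RANKING})
```

With two of four features per tree, the best feature wins whenever it is drawn, the second-best wins whenever the best is absent, and so on; the worst-ranked feature never wins, but the second- and third-best splits now appear as root splits in a substantial fraction of trees.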
Thus, across the forest, a range of feature splits is represented, and each appears at a shallower depth where more training data is available, giving it more discriminatory power per split. The final aggregation step combines the information gleaned from these different models. Each tree is individually weaker than the original CART decision tree, but it has extracted more information from the data for the features it was given. Together, the trees are much better predictors than any one of them alone.
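For classification, the aggregation step is typically a majority vote over the trees' predictions, which can be sketched in a few lines. The per-tree votes below are made up for illustration.

```python
# Sketch of the aggregation step: each tree in the forest votes, and
# the forest predicts the majority class.
from collections import Counter

def forest_predict(tree_predictions):
    """Majority vote over the individual trees' class predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Five weak trees: three vote "cat", two vote "dog".
votes = ["cat", "dog", "cat", "cat", "dog"]
print(forest_predict(votes))  # → cat
```

Even though two of the five trees are wrong here, the ensemble's vote is correct, which is the essence of why combining many weaker, decorrelated trees beats a single strong one.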