Elements of Statistical Learning – Hastie, Tibshirani, Friedman

2. Overview of Supervised Learning

Statistical decision theory

10. Boosting and Additive trees

10.1 Discrete AdaBoost

The AdaBoost algorithm

Set $~w_i = 1/N, ~~i=1,2,\ldots, N$

For $~m=1,\ldots, M$:

1. Fit a classifier $G_m(x)$ to the training data using the weights $w_i$
2. Compute the weighted error $\epsilon_m = \sum_{i=1}^N w_i I(y_i \neq G_m(x_i)) \big/ \sum_{i=1}^N w_i$
3. Compute $\alpha_m = \log[(1 - \epsilon_m)/\epsilon_m]$
4. Update $w_i \leftarrow w_i \exp[\alpha_m I(y_i \neq G_m(x_i))], ~~i=1,\ldots,N$

Return $~G(x) = \operatorname{sign}\left[ \sum_{m=1}^M \alpha_m G_m(x) \right]$
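
As a concrete illustration (not from the book), here is a minimal Python sketch of discrete AdaBoost, assuming decision stumps as the base classifiers and labels $y_i \in \{-1, +1\}$; scikit-learn's `DecisionTreeClassifier` handles the weighted fit in step 1, and the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=50):
    """Discrete AdaBoost with decision stumps; assumes y takes values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # w_i = 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1)  # base classifier G_m
        stump.fit(X, y, sample_weight=w)             # fit using current weights
        miss = stump.predict(X) != y
        eps = np.clip(np.sum(w * miss) / np.sum(w), 1e-12, 1 - 1e-12)  # weighted error
        alpha = np.log((1 - eps) / eps)              # classifier weight alpha_m
        w = w * np.exp(alpha * miss)                 # upweight misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """G(x): sign of the alpha-weighted vote of the M classifiers."""
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```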

10.2 Boosting and additive models

10.3 Forward Stagewise Additive Modelling (FSAM)

FSAM algorithm

Set $~f_0(x) = 0$

For $~m=1,\ldots, M$:

1. Compute $(\beta_m, \gamma_m) = \arg \min_{\beta, \gamma} \sum_{i=1}^N L\big(y_i, f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)$
2. Set $f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)$

Return $~f_M(x)$
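
Step 1 above depends on the chosen loss and basis family. As one concrete instance (a sketch, not from the book), the following assumes squared-error loss and regression stumps as basis functions; for this loss the stagewise fit reduces to fitting each new basis function to the current residuals $y_i - f_{m-1}(x_i)$, with $\beta_m$ absorbed into the stump.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fsam_squared_error(X, y, M=100):
    """Forward stagewise additive modelling with squared-error loss and stumps."""
    f = np.zeros(len(y))                              # f_0(x) = 0
    basis = []
    for m in range(M):
        residual = y - f                              # stagewise targets
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        f = f + stump.predict(X)                      # f_m = f_{m-1} + b(x; gamma_m)
        basis.append(stump)
    return basis

def fsam_predict(basis, X):
    return sum(b.predict(X) for b in basis)
```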

10.4 Exponential Loss and AdaBoost

We can show that AdaBoost is equivalent to FSAM with the exponential loss:

$$L(y, f(x)) = \exp[-y f(x)]$$

For AdaBoost, the basis functions are classifiers $G_m(x) \in \{ -1, 1 \}$. With an exponential loss function in FSAM, we have to solve at each step:

$$(\beta_m, G_m) = \arg \min_{\beta, G} \sum_{i=1}^N \exp\big[-y_i \big( f_{m-1}(x_i) + \beta G(x_i) \big)\big] = \arg \min_{\beta, G} \sum_{i=1}^N \exp[-y_i f_{m-1}(x_i)] \, \exp[-\beta y_i G(x_i)]$$

Because $\exp[-y_i f_{m-1}(x_i)]$ depends on neither $\beta$ nor $G$, it can be regarded as a weight $w_i^{(m)} = \exp[-y_i f_{m-1}(x_i)]$ on observation $i$ at that particular iteration, so the objective becomes $\sum_{i=1}^N w_i^{(m)} \exp[-\beta y_i G(x_i)]$.

To find the minimising $\beta$ and $G$, we manipulate the sum by splitting it into correctly and incorrectly classified observations:

$$\sum_{i=1}^N w_i^{(m)} e^{-\beta y_i G(x_i)} = e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)} = (e^{\beta} - e^{-\beta}) \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)) + e^{-\beta} \sum_{i=1}^N w_i^{(m)}$$

By observation, for any $\beta > 0$ the minimising $G$ is the classifier that minimises the weighted number of misclassifications:

$$G_m = \arg \min_G \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i))$$

We can find $\beta$ by differentiating with respect to $\beta$ and setting the result to zero, giving us:

$$\beta_m = \frac{1}{2} \log \frac{1 - \epsilon_m}{\epsilon_m}$$

where

$$\epsilon_m = \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i^{(m)}}$$

is the weighted error. Altogether, the update $f_m(x) = f_{m-1}(x) + \beta_m G_m(x)$ means the weights for the next iteration are

$$w_i^{(m+1)} = w_i^{(m)} e^{-\beta_m y_i G_m(x_i)} = w_i^{(m)} e^{\alpha_m I(y_i \neq G_m(x_i))} \, e^{-\beta_m}, \qquad \alpha_m = 2\beta_m$$

The factor $e^{-\beta_m}$ is common to every observation and so has no effect, which means this is the same weight update, and $\alpha_m$ the same classifier weight, as in AdaBoost.
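
As a quick check of the $\beta_m$ formula, write $E = \sum_i w_i^{(m)} I(y_i \neq G_m(x_i))$ and $W = \sum_i w_i^{(m)}$ (shorthands introduced here, so $\epsilon_m = E/W$). Differentiating the manipulated sum and setting it to zero:

$$\frac{\partial}{\partial \beta} \Big[ (e^{\beta} - e^{-\beta}) E + e^{-\beta} W \Big] = (e^{\beta} + e^{-\beta}) E - e^{-\beta} W = 0 \;\;\Longrightarrow\;\; e^{2\beta} = \frac{W - E}{E} \;\;\Longrightarrow\;\; \beta_m = \frac{1}{2} \log \frac{1 - \epsilon_m}{\epsilon_m}$$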

10.5 Why Exponential Loss?

10.6 Loss functions and robustness

10.9 Boosting trees

10.10 Numerical Optimisation via Gradient Boosting

It is possible to numerically find fast approximations for any differentiable loss function. Generally, we want to minimise the loss over the training data:

$$L(f) = \sum_{i=1}^N L(y_i, f(x_i))$$

This can be treated as a numerical optimisation problem

$$\hat{\mathbf{f}} = \arg \min_{\mathbf{f}} L(\mathbf{f})$$

with ‘parameters’ $\mathbf{f} \in \mathbb{R}^N$ being the values of the approximating function $f(x_i)$ at each of the $N$ data points, i.e.

$$\mathbf{f} = \{ f(x_1), f(x_2), \ldots, f(x_N) \}^T$$

Numerical optimisation procedures solve this with a sum of component vectors (steps):

$$\mathbf{f}_M = \sum_{m=0}^M \mathbf{h}_m, \qquad \mathbf{h}_m \in \mathbb{R}^N$$

Here $\mathbf{f}_0 = \mathbf{h}_0$ is an initial guess, and each subsequent $\mathbf{f}_m = \mathbf{f}_{m-1} + \mathbf{h}_m$ adds a step $\mathbf{h}_m$. Different algorithms choose this step differently.

10.10.1 Steepest Descent

In steepest descent, we choose $\mathbf{h}_m = -\rho_m \mathbf{g}_m$, where $\rho_m$ is a scalar step length and $\mathbf{g}_m \in \mathbb{R}^N$ is the gradient of $L(\mathbf{f})$ evaluated at $\mathbf{f} = \mathbf{f}_{m-1}$, with components

$$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}$$

The step length can either be constant, or chosen by a line search:

$$\rho_m = \arg \min_{\rho} L(\mathbf{f}_{m-1} - \rho \mathbf{g}_m)$$

Then the update is simply:

$$\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m$$

Steepest descent is very greedy in that it has no consideration for global optima: the only thing that matters is the direction of steepest descent at the current iteration. Additionally, the gradient is defined only at the $N$ training points, so the resulting $\mathbf{f}_M$ says nothing about new data. However, it is computationally efficient and can work for any differentiable loss function.
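
As a tiny illustration (not from the book), here is the steepest-descent update in numpy, assuming squared-error loss $L(\mathbf{f}) = \tfrac{1}{2} \sum_i (y_i - f_i)^2$ and a constant step length; the iterates simply move $\mathbf{f}$ towards $\mathbf{y}$ at the training points, which is exactly why the raw result does not generalise beyond the training set.

```python
import numpy as np

def steepest_descent_fit(y, M=100, rho=0.5):
    """Steepest descent on the 'parameters' f in R^N for squared-error loss.
    The gradient at f_{m-1} is g_m = f_{m-1} - y."""
    f = np.zeros_like(y, dtype=float)   # initial guess f_0
    for m in range(M):
        g = f - y                       # gradient of L at f_{m-1}
        f = f - rho * g                 # f_m = f_{m-1} - rho * g_m
    return f                            # approaches y itself as M grows
```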

10.10.2 Gradient boosting

FSAM is similar to steepest descent; the new tree added at each iteration can be thought of as the gradient update.

The goal of boosted trees is to minimise, at each stage,

$$\hat{\Theta}_m = \arg \min_{\Theta_m} \sum_{i=1}^N L\big( y_i, f_{m-1}(x_i) + T(x_i; \Theta_m) \big)$$

where $\Theta_m = \{ R_{jm}, \gamma_{jm} \}_{j=1}^{J_m}$ are the regions and leaf constants of the $m$th tree $T(x; \Theta_m)$.

But this is difficult for robust loss functions. We can attempt to combine the benefits of steepest descent with those of boosted trees by fitting each new tree to the negative gradient values, called the pseudo-residuals:

$$r_{im} = - \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$$

This is a proxy method: fitting trees to the negative gradient (which points in the direction of decreasing loss) should decrease the loss without explicitly having to minimise the loss function, and it is computationally efficient.

Gradient tree boosting algorithm

Set $~f_0(x) = \arg \min_\gamma \sum_{i=1}^N L(y_i, \gamma)$

For $~m=1,\ldots, M$:

1. For $i = 1, \ldots, N$ compute the pseudo-residuals $r_{im} = - \left[ \dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$
2. Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}, ~~j = 1, \ldots, J_m$
3. For $j = 1, \ldots, J_m$ compute $\gamma_{jm} = \arg \min_\gamma \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma\big)$
4. Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$

Return $~\hat{f}(x) = f_M(x)$
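
As an illustrative sketch (not from the book), here is the algorithm in Python for absolute-error (LAD) loss, using scikit-learn's `DecisionTreeRegressor` for step 2; the pseudo-residuals are $\operatorname{sign}(y_i - f_{m-1}(x_i))$ and the per-leaf $\gamma_{jm}$ is the median residual in that leaf. The `shrinkage` factor anticipates §10.12 and is not part of the algorithm above (set it to 1.0 for the plain version).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_lad(X, y, M=100, max_depth=3, shrinkage=0.1):
    """Gradient tree boosting with absolute-error (LAD) loss."""
    f0 = np.median(y)                                # f_0 = arg min_gamma sum |y_i - gamma|
    f = np.full(len(y), f0)
    trees, gammas = [], []
    for m in range(M):
        r = np.sign(y - f)                           # pseudo-residuals for L = |y - f|
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        leaf = tree.apply(X)                         # terminal region of each training point
        gamma = {j: np.median((y - f)[leaf == j])    # gamma_jm: median residual in leaf j
                 for j in np.unique(leaf)}
        f = f + shrinkage * np.array([gamma[j] for j in leaf])
        trees.append(tree)
        gammas.append(gamma)
    return f0, trees, gammas, shrinkage

def gradient_boost_predict(model, X):
    f0, trees, gammas, shrinkage = model
    pred = np.full(X.shape[0], f0, dtype=float)
    for tree, gamma in zip(trees, gammas):
        leaf = tree.apply(X)
        pred += shrinkage * np.array([gamma[j] for j in leaf])
    return pred
```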

10.11 Right-sized trees for boosting

10.12 Regularisation

10.13 Interpretation

For a single tree $T$, the (squared) importance of variable $X_l$ is given by summing, over all internal nodes $t$ where $X_l$ is the splitting variable, the improvement in squared error produced by that split:

$$I_l^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^{\,2} \, I\big(v(t) = l\big)$$

where the sum runs over the $J-1$ internal nodes, $v(t)$ is the splitting variable at node $t$, and $\hat{i}_t^{\,2}$ is the estimated improvement in squared error from that split.

This is generalised to additive trees by averaging over the $M$ trees:

$$I_l^2 = \frac{1}{M} \sum_{m=1}^M I_l^2(T_m)$$

These importance measures are relative: the convention is to assign the largest a value of 100 and scale the others accordingly.
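
As a rough sketch (not from the book), assuming the boosted trees are fitted scikit-learn tree objects (e.g. from the gradient boosting sketch above): each tree's `feature_importances_` is an impurity-based, per-tree-normalised variant of $I_l^2(T_m)$, so averaging them and rescaling gives an approximate version of the relative measure described here.

```python
import numpy as np

def relative_importances(trees):
    """Average each variable's impurity-based importance over the M trees and
    rescale so that the most important variable scores 100."""
    avg = np.mean([t.feature_importances_ for t in trees], axis=0)
    return 100.0 * avg / avg.max()
```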