Boosting Methods#
Creating a strong classifier from a bunch (ensemble) of weak classifiers.
The basic concept is to build a second model that corrects the errors made by the first model.
[1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
AdaBoost Classification#
References
As compared to Random Forest:
- In a random forest, a bunch of decision trees is built; the trees are grown fully, so they end up with different sizes and no predetermined depth.
- In a random forest, each tree gets an equal weight/vote on the final classification.
- In a random forest, each decision tree is independent of the other trees.
Adaptive Boosting
In AdaBoost the forest of trees used is made of trees with just one node and two leaves, called
stump
(we can say that it is a forest of stumps):
________
|Question|
|________|
/ \
________/ \________
|___Yes__| |___No___|
A stump is basically a weak learner, a bad classifier, as it has only one feature to learn on and its results are mostly inconclusive.
But AdaBoost only needs weak learners.
In AdaBoost some stumps get
more say
in the classification than others. In AdaBoost stumps are dependent. How? The error made by the first stump determines how the second stump will be made, the second influences the third one, and so on.
Algorithm AdaBoost.M1#
Initialize the observation weights \(w_i\) = 1/N, i = 1, 2, … , N.
For m = 1 to M:
Fit a classifier \(G_m(x)\) to the training data using weights \(w_i\)
Compute \(err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \ne G_m(x_i))}{\sum_{i=1}^{N} w_i}\)
Compute \(\alpha_m = log((1 − err_m)/err_m)\)
Set \(w_i \leftarrow w_i · exp[\alpha_m · I(y_i \ne G_m(x_i))]\), i = 1, 2, … , N
Output \(G(x) = sign[\sum_{m=1}^{M} \alpha_m G_m(x) ]\)
Okay, this looks confusing, but I try to work through it step by step with code below.
Installing mightypy, a package developed by me, to have more control over the examples in this doc.
!pip install git+https://github.com/nishantbaheti/mightypy
[5]:
from mightypy.ml.tree_models import DecisionTreeClassifier
[6]:
sample_data = np.array([
['yes', 'yes', 205, 'yes'],
['no', 'yes', 180, 'yes'],
['yes', 'no', 210, 'yes'],
['yes', 'yes', 167, 'yes'],
['no', 'yes', 156, 'no'],
['no', 'yes', 125, 'yes'],
['yes', 'no', 168, 'no'],
['yes', 'yes', 172, 'no']
],dtype='O')
headers = ['Chest Pain', 'Blocked Arteries', 'Patient Weight', 'Heart Disease']
df = pd.DataFrame(sample_data,columns=headers)
df
[6]:
Chest Pain | Blocked Arteries | Patient Weight | Heart Disease | |
---|---|---|---|---|
0 | yes | yes | 205 | yes |
1 | no | yes | 180 | yes |
2 | yes | no | 210 | yes |
3 | yes | yes | 167 | yes |
4 | no | yes | 156 | no |
5 | no | yes | 125 | yes |
6 | yes | no | 168 | no |
7 | yes | yes | 172 | no |
[7]:
X = sample_data[...,[0,1,2]]
y = sample_data[...,[-1]]
feature_name =headers[:-1]
target_name = headers[-1]
stump = DecisionTreeClassifier(max_depth=0)
stump.train(
X=X,
y=y,
feature_name=feature_name,
target_name=target_name
)
stump.print_tree()
|-Is Patient Weight >= 210 ? | gini :0.46875
|---> True:
|-- Predict : {'yes': 1}
|---> False:
|-- Predict : {'no': 3, 'yes': 4}
Sample weight#
At the start, all samples get the same weight.
sample weight = \(\frac{1}{n_{samples}}\)
[8]:
sample_weight = 1 / X.shape[0]
X.shape[0], sample_weight
[8]:
(8, 0.125)
[9]:
sample_weights = np.ones((X.shape[0],1)) * sample_weight
df['sample_weights'] = sample_weights
df
[9]:
Chest Pain | Blocked Arteries | Patient Weight | Heart Disease | sample_weights | |
---|---|---|---|---|---|
0 | yes | yes | 205 | yes | 0.125 |
1 | no | yes | 180 | yes | 0.125 |
2 | yes | no | 210 | yes | 0.125 |
3 | yes | yes | 167 | yes | 0.125 |
4 | no | yes | 156 | no | 0.125 |
5 | no | yes | 125 | yes | 0.125 |
6 | yes | no | 168 | no | 0.125 |
7 | yes | yes | 172 | no | 0.125 |
Total error(TE)#
The sum of the sample weights of the wrongly classified samples gives the Total Error of the stump.
[11]:
predicted = stump.predict(X)
df['predicted'] = predicted
df
[11]:
Chest Pain | Blocked Arteries | Patient Weight | Heart Disease | sample_weights | predicted | |
---|---|---|---|---|---|---|
0 | yes | yes | 205 | yes | 0.125 | yes |
1 | no | yes | 180 | yes | 0.125 | yes |
2 | yes | no | 210 | yes | 0.125 | yes |
3 | yes | yes | 167 | yes | 0.125 | yes |
4 | no | yes | 156 | no | 0.125 | yes |
5 | no | yes | 125 | yes | 0.125 | yes |
6 | yes | no | 168 | no | 0.125 | yes |
7 | yes | yes | 172 | no | 0.125 | yes |
Get all the results where the classification is not equal to the original label (i.e. not classified correctly).
[12]:
predicted != y
[12]:
array([[False],
[False],
[False],
[False],
[ True],
[False],
[ True],
[ True]])
Now get the sum of the weights of the wrongly classified results.
[13]:
(predicted != y) * sample_weights
[13]:
array([[0. ],
[0. ],
[0. ],
[0. ],
[0.125],
[0. ],
[0.125],
[0.125]])
[14]:
total_error = ((predicted != y) * sample_weights).sum()
total_error
[14]:
0.375
Amount of say#
Amount of Say = \(\frac{1}{2} \log(\frac{1 - TE}{TE})\)
As the name suggests, this is how much say a stump has in the final classification, calculated from its Total Error (TE).
[15]:
def amount_of_say(total_error):
return np.log((1 - total_error)/total_error)/2
Plotting the amount of say vs. total error graph.
[16]:
%matplotlib inline
te = np.linspace(0.01,0.99,num=50)
aos = amount_of_say(te)
ax = plt.subplot(111)
ax.plot(te,aos,'.-')
ax.set_xlabel("Total Error")
ax.set_ylabel("Amount of Say")
plt.grid()
plt.show()

It is visible from the graph that when the total error is low the amount of say is high, and when the total error is high the amount of say decreases.
Let's say the total error is 0.5 (a 50-50 stump); then the amount of say is 0, which means this stump doesn't matter to the final result.
Amount of say of the stump
[17]:
stump_aos = amount_of_say(total_error)
total_error, stump_aos
[17]:
(0.375, 0.25541281188299536)
Updating Sample weight#
Increase sample weight for incorrectly classified samples#
New Sample Weight = Sample Weight x \(e^{\text{amount of say}}\)
[18]:
plt.plot(np.linspace(0,5,50),np.exp(np.linspace(0,5,50)), label="increase")
plt.xlabel("Amount of say")
plt.ylabel("New sample weight")
plt.legend()
plt.grid()

So the new sample weight will be a little larger than the old sample weight when the amount of say is positive.
Decrease sample weight for correctly classified samples#
New Sample Weight = Sample Weight x \(e^{-\text{amount of say}}\)
[19]:
plt.plot(np.linspace(0,5,50),np.exp(-np.linspace(0,5,50)), label="decrease")
plt.xlabel("Amount of say")
plt.ylabel("New sample weight")
plt.legend()
plt.grid()

If the amount of say of the last stump was high, the scaling factor \(e^{-\text{amount of say}}\) is close to zero and the new sample weight shrinks a lot; if the amount of say was low, the factor is close to 1 and the weight barely changes.
[20]:
def update_sample_weight(sample_weight, amount_of_say, increase=True):
if increase:
return sample_weight * np.exp(amount_of_say)
else:
return sample_weight * np.exp(-amount_of_say)
[21]:
inc_class_weights = (predicted != y) * sample_weights
inc_class_weights
[21]:
array([[0. ],
[0. ],
[0. ],
[0. ],
[0.125],
[0. ],
[0.125],
[0.125]])
[22]:
corr_class_weights = (predicted == y) * sample_weights
corr_class_weights
[22]:
array([[0.125],
[0.125],
[0.125],
[0.125],
[0. ],
[0.125],
[0. ],
[0. ]])
[23]:
new_inc_class_weights = update_sample_weight(inc_class_weights,stump_aos,increase=True)
new_inc_class_weights
[23]:
array([[0. ],
[0. ],
[0. ],
[0. ],
[0.16137431],
[0. ],
[0.16137431],
[0.16137431]])
[24]:
new_corr_class_weights = update_sample_weight(corr_class_weights,stump_aos,increase=False)
new_corr_class_weights
[24]:
array([[0.09682458],
[0.09682458],
[0.09682458],
[0.09682458],
[0. ],
[0.09682458],
[0. ],
[0. ]])
[25]:
new_sample_weights = new_inc_class_weights + new_corr_class_weights
df['new_sample_weights'] = new_sample_weights
df
[25]:
Chest Pain | Blocked Arteries | Patient Weight | Heart Disease | sample_weights | predicted | new_sample_weights | |
---|---|---|---|---|---|---|---|
0 | yes | yes | 205 | yes | 0.125 | yes | 0.096825 |
1 | no | yes | 180 | yes | 0.125 | yes | 0.096825 |
2 | yes | no | 210 | yes | 0.125 | yes | 0.096825 |
3 | yes | yes | 167 | yes | 0.125 | yes | 0.096825 |
4 | no | yes | 156 | no | 0.125 | yes | 0.161374 |
5 | no | yes | 125 | yes | 0.125 | yes | 0.096825 |
6 | yes | no | 168 | no | 0.125 | yes | 0.161374 |
7 | yes | yes | 172 | no | 0.125 | yes | 0.161374 |
Normalizing Sample Weights#
After updating, the new sample weights need to be normalized so that they add up to 1. These normalized weights are then used to create the next stump.
[26]:
new_sample_weights.sum()
[26]:
0.9682458365518543
The sum is not 1, so let's normalize.
[27]:
new_sample_weights = new_sample_weights/ new_sample_weights.sum()
df['new_sample_weights'] = new_sample_weights
df
[27]:
Chest Pain | Blocked Arteries | Patient Weight | Heart Disease | sample_weights | predicted | new_sample_weights | |
---|---|---|---|---|---|---|---|
0 | yes | yes | 205 | yes | 0.125 | yes | 0.100000 |
1 | no | yes | 180 | yes | 0.125 | yes | 0.100000 |
2 | yes | no | 210 | yes | 0.125 | yes | 0.100000 |
3 | yes | yes | 167 | yes | 0.125 | yes | 0.100000 |
4 | no | yes | 156 | no | 0.125 | yes | 0.166667 |
5 | no | yes | 125 | yes | 0.125 | yes | 0.100000 |
6 | yes | no | 168 | no | 0.125 | yes | 0.166667 |
7 | yes | yes | 172 | no | 0.125 | yes | 0.166667 |
[28]:
new_sample_weights.sum()
[28]:
1.0
Generating new samples#
Normally a new dataset is drawn by resampling with the new weights, so the incorrectly classified samples from the previous round are picked more often.
Generate an empty matrix for the new samples.
[29]:
new_sample_data = np.zeros_like(sample_data)
new_sample_data
[29]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]], dtype=object)
Pick samples from the previous dataset.
For this example we mostly want rows 4, 6 and 7.
NumPy's random choice with p (probabilities) gives us random indexes from the range, weighted by the new sample weights.
[30]:
idxs = np.random.choice(range(sample_data.shape[0]),size=(sample_data.shape[0],),p=new_sample_weights[...,-1])
idxs
[30]:
array([7, 7, 1, 1, 4, 6, 3, 5])
[31]:
new_sample_data = sample_data[idxs]
new_sample_data
[31]:
array([['yes', 'yes', 172, 'no'],
['yes', 'yes', 172, 'no'],
['no', 'yes', 180, 'yes'],
['no', 'yes', 180, 'yes'],
['no', 'yes', 156, 'no'],
['yes', 'no', 168, 'no'],
['yes', 'yes', 167, 'yes'],
['no', 'yes', 125, 'yes']], dtype=object)
[32]:
X = new_sample_data[...,[0,1,2]]
y = new_sample_data[...,[-1]]
feature_name =headers[:-1]
target_name = headers[-1]
stump = DecisionTreeClassifier(max_depth=0)
stump.train(
X=X,
y=y,
feature_name=feature_name,
target_name=target_name
)
stump.print_tree()
|-Is Blocked Arteries == no ? | gini :0.5
|---> True:
|-- Predict : {'no': 1}
|---> False:
|-- Predict : {'no': 3, 'yes': 4}
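To tie the pieces together, here is a compact sketch of the whole AdaBoost loop. It is only an illustration of the steps above, and it assumes scikit-learn (which this doc does not otherwise use), numeric features, and labels encoded as -1/+1 so the final vote can use the sign function.
# Illustrative sketch, not from the walkthrough above: assumes scikit-learn, numeric X, y in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier as SkStump

def adaboost_fit(X, y, n_stumps=10):
    n = X.shape[0]
    w = np.full(n, 1 / n)                                    # equal sample weights at the start
    stumps, says = [], []
    for _ in range(n_stumps):
        stump = SkStump(max_depth=1).fit(X, y, sample_weight=w)
        wrong = stump.predict(X) != y
        total_error = np.clip(w[wrong].sum() / w.sum(), 1e-10, 1 - 1e-10)
        say = 0.5 * np.log((1 - total_error) / total_error)  # amount of say
        w = w * np.where(wrong, np.exp(say), np.exp(-say))   # increase wrong, decrease correct
        w = w / w.sum()                                      # normalize
        stumps.append(stump)
        says.append(say)
    return stumps, says

def adaboost_predict(stumps, says, X):
    # each stump votes with its amount of say; the sign of the weighted sum decides
    return np.sign(sum(say * stump.predict(X) for stump, say in zip(stumps, says)))
To run it on the heart-disease table above, the categorical columns would first have to be one-hot encoded and the Heart Disease labels mapped to -1/+1. Instead of resampling, this sketch passes the weights directly to the stump via sample_weight, which is the other common way to make the weights count.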
Gradient Boosting Regression#
So how is it different from AdaBoost? (We already know regression is where we try to predict continuous values.)
Instead of starting with a stump, in gradient boosting we start with a single leaf. This single leaf can be the average of the target data.
Then, instead of using stumps for the following iterations, we use a tree with no more than 4 leaves here (in practice it can be 8-32). So, like AdaBoost, we are using fixed-sized trees, but here the trees are bigger than AdaBoost's stumps.
Gradient boosting scales all the trees by the same amount, unlike AdaBoost where each stump gets its own amount of say.
Algorithm#
Reference https://en.wikipedia.org/wiki/Gradient_boosting
Initial model with a constant value \(F_0(x) = {\text{arg min}\atop{\gamma}} {\sum_{i=1}^{n} L(y_i, \gamma)}\).
For m=1 to M:
Compute so-called pseudo-residuals: \(r_{i,m} = -[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} ]_{F(x) = F_{m-1}(x)}\), for i = 1,…,n.
Fit a regression tree to the \(r_{i,m}\) values and create terminal regions \(R_{jm}\) for \(j = 1,...,J_m\).
For \(j = 1,...,J_m\) compute \(\gamma_{jm} = {\text{arg min}\atop{\gamma}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma )\).
Update the model \(F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})\).
Output \(F_M(x)\).
Here we have a different multiplier \(\gamma_{jm}\) for every terminal region, not a single one per tree; the code needs to be updated accordingly.
loss function calculations#
\begin{align} L(y_i, F(x)) &= \frac{1}{2}{(Observed - Predicted)} ^2\\ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} &= -(Observed - Predicted) \end{align}
Sample data#
[2]:
sample_data = np.array([
[1.6, 'Blue', 'Male', 88],
[1.6, 'Green', 'Female', 76],
[1.5, 'Blue', 'Female', 56],
[1.8, 'Red', 'Male', 73],
[1.5, 'Green', 'Male', 77],
[1.4, 'Blue', 'Female', 57]
],dtype='O')
headers = ['Height','FavColor','Gender','Weight']
df = pd.DataFrame(sample_data,columns=headers)
df
[2]:
Height | FavColor | Gender | Weight | |
---|---|---|---|---|
0 | 1.6 | Blue | Male | 88 |
1 | 1.6 | Green | Female | 76 |
2 | 1.5 | Blue | Female | 56 |
3 | 1.8 | Red | Male | 73 |
4 | 1.5 | Green | Male | 77 |
5 | 1.4 | Blue | Female | 57 |
[3]:
X = sample_data[...,[0,1,2]]
y = sample_data[...,[-1]]
Initial model/ leaf \(F_0(x)\)#
Calculating the initial leaf value. We need a constant value that we can then increase or decrease to reach the expected values.
\begin{align} F_0(x) &= {\text{arg min}\atop{\gamma}} {\sum_{i=1}^{n} L(y_i, \gamma)}\\ \frac{\partial L(y_i, \gamma)}{\partial \gamma} &= -(Observed - Predicted)\\ \end{align}
Differentiation is used for the minimization: we set the derivative of the loss to zero and solve for the prediction (\(\gamma\)) that minimizes the loss.
\begin{align} & \sum_{i=1}^{n} \frac{\partial L(y_i, \gamma)}{\partial \gamma} = 0\\ & -{(O_1 - \gamma_1)} - {(O_2 - \gamma_2)} - ... - {(O_n - \gamma_n)} = 0\\ \because & \text{ we need a constant initial value for the predicted}\\ \therefore & \gamma = \gamma_1 = \gamma_2 = ... = \gamma_n\\ & (O_1 - \gamma) + (O_2 - \gamma) + ... + (O_n - \gamma) = 0\\ & n \gamma = O_1 + O_2 + ... + O_n\\ & \gamma = \frac{(O_1 + O_2 + ... + O_n)}{n} \end{align}
So the initial predicted value is the mean of the observed values.
[4]:
initial_const = y.mean()
print(initial_const)
71.16666666666667
The leaf value means that the initial predicted weight for every sample is 71.1666…
Pseudo residual (\(r_{i,m}\))#
\(r_{im} = -[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} ]_{F(x) = F_{m-1}(x)}\) for i = 1, 2, … , n
So for the 1st iteration (the first tree we are trying to build), m = 1:
- \(F(x) = F_0(x) = \gamma\)
- \(r_{i1} = -(-(y_i - \gamma)) = y_i - \gamma\)
pseudo residual = observed value - predicted value from last tree/leaf
[5]:
initial_const
[5]:
71.16666666666667
[6]:
pseudo_residual = y - initial_const
df['pseudo_residual1'] = pseudo_residual
df
[6]:
Height | FavColor | Gender | Weight | pseudo_residual1 | |
---|---|---|---|---|---|
0 | 1.6 | Blue | Male | 88 | 16.8333 |
1 | 1.6 | Green | Female | 76 | 4.83333 |
2 | 1.5 | Blue | Female | 56 | -15.1667 |
3 | 1.8 | Red | Male | 73 | 1.83333 |
4 | 1.5 | Green | Male | 77 | 5.83333 |
5 | 1.4 | Blue | Female | 57 | -14.1667 |
Building regression tree to predict residuals#
Yes, this is different from other algorithms: we are building the regression tree to predict the residuals, not the target values.
[7]:
from mightypy.ml.tree_models import DecisionTreeRegressor
[8]:
model_tree = DecisionTreeRegressor(max_depth=2)
model_tree.train(
X=X,
y=pseudo_residual,
feature_name = headers[:-1],
target_name = headers[-1]
)
model_tree.print_tree()
|-Is Height >= 1.8 ? | variance :129.13888888888889
|---> True:
|-- Predict : 1.8333333333333286
|---> False:
|--Is FavColor == Green ? | variance :154.16000000000003
|----> True:
|--- Predict : 5.333333333333329
|----> False:
|---Is Height >= 1.6 ? | variance :220.66666666666666
|-----> True:
|---- Predict : 16.83333333333333
|-----> False:
|---- Predict : -14.666666666666671
[9]:
predicted_residual = model_tree.predict(X)
df['predicted_residual1'] = predicted_residual
df
[9]:
Height | FavColor | Gender | Weight | pseudo_residual1 | predicted_residual1 | |
---|---|---|---|---|---|---|
0 | 1.6 | Blue | Male | 88 | 16.8333 | 16.8333 |
1 | 1.6 | Green | Female | 76 | 4.83333 | 5.33333 |
2 | 1.5 | Blue | Female | 56 | -15.1667 | -14.6667 |
3 | 1.8 | Red | Male | 73 | 1.83333 | 1.83333 |
4 | 1.5 | Green | Male | 77 | 5.83333 | 5.33333 |
5 | 1.4 | Blue | Female | 57 | -14.1667 | -14.6667 |
Chaining predicted results with earlier results#
Let's add the predicted residuals to the average of the target (the initial leaf).
[10]:
df['tree_prediction1'] = initial_const + predicted_residual
[11]:
df
[11]:
Height | FavColor | Gender | Weight | pseudo_residual1 | predicted_residual1 | tree_prediction1 | |
---|---|---|---|---|---|---|---|
0 | 1.6 | Blue | Male | 88 | 16.8333 | 16.8333 | 88 |
1 | 1.6 | Green | Female | 76 | 4.83333 | 5.33333 | 76.5 |
2 | 1.5 | Blue | Female | 56 | -15.1667 | -14.6667 | 56.5 |
3 | 1.8 | Red | Male | 73 | 1.83333 | 1.83333 | 73 |
4 | 1.5 | Green | Male | 77 | 5.83333 | 5.33333 | 76.5 |
5 | 1.4 | Blue | Female | 57 | -14.1667 | -14.6667 | 56.5 |
The problem here is that if we directly add the new residuals to the previous tree's predictions, the results get too close to the original observations, and this will cause overfitting. We need a parameter that slows down the learning, called the learning rate.
Introducing learning rate (\(\nu\))#
So in the last two cells the values are really close to the observations, which looks like overfitting.
Let's introduce a learning rate, so we gradually work our way towards a proper result.
Multiply the learning rate (\(\nu\)) by the tree's predicted residuals and add that to the last tree's output (predictions).
For the first tree, the last output is actually the initial leaf value (the mean of the observations).
[12]:
learning_rate = 0.3
last_tree_output = initial_const + (learning_rate *predicted_residual)
df['tree_prediction1'] = last_tree_output
df
[12]:
Height | FavColor | Gender | Weight | pseudo_residual1 | predicted_residual1 | tree_prediction1 | |
---|---|---|---|---|---|---|---|
0 | 1.6 | Blue | Male | 88 | 16.8333 | 16.8333 | 76.2167 |
1 | 1.6 | Green | Female | 76 | 4.83333 | 5.33333 | 72.7667 |
2 | 1.5 | Blue | Female | 56 | -15.1667 | -14.6667 | 66.7667 |
3 | 1.8 | Red | Male | 73 | 1.83333 | 1.83333 | 71.7167 |
4 | 1.5 | Green | Male | 77 | 5.83333 | 5.33333 | 72.7667 |
5 | 1.4 | Blue | Female | 57 | -14.1667 | -14.6667 | 66.7667 |
Now these values are not exactly equal to the target data, but they are getting closer. Slowly and steadily we can reach the target values. Now use this output to create a new tree.
Repeating for next tree#
Again, calculate the residuals.
[13]:
pseudo_residual = y - last_tree_output
df['pseudo_residual2'] = pseudo_residual
df
[13]:
Height | FavColor | Gender | Weight | pseudo_residual1 | predicted_residual1 | tree_prediction1 | pseudo_residual2 | |
---|---|---|---|---|---|---|---|---|
0 | 1.6 | Blue | Male | 88 | 16.8333 | 16.8333 | 76.2167 | 11.7833 |
1 | 1.6 | Green | Female | 76 | 4.83333 | 5.33333 | 72.7667 | 3.23333 |
2 | 1.5 | Blue | Female | 56 | -15.1667 | -14.6667 | 66.7667 | -10.7667 |
3 | 1.8 | Red | Male | 73 | 1.83333 | 1.83333 | 71.7167 | 1.28333 |
4 | 1.5 | Green | Male | 77 | 5.83333 | 5.33333 | 72.7667 | 4.23333 |
5 | 1.4 | Blue | Female | 57 | -14.1667 | -14.6667 | 66.7667 | -9.76667 |
[15]:
model_tree = DecisionTreeRegressor(max_depth=2)
model_tree.train(
X=X,
y=pseudo_residual,
feature_name = headers[:-1],
target_name = headers[-1]
)
model_tree.print_tree()
predicted_residual = model_tree.predict(X)
df['predicted_residual2'] = predicted_residual
learning_rate = 0.3
last_tree_output = last_tree_output + (learning_rate *predicted_residual)
df['tree_prediction2'] = last_tree_output
|-Is Height >= 1.8 ? | variance :63.36305555555555
|---> True:
|-- Predict : 1.2833333333333314
|---> False:
|--Is FavColor == Green ? | variance :75.6404
|----> True:
|--- Predict : 3.7333333333333343
|----> False:
|---Is Height >= 1.6 ? | variance :108.21166666666664
|-----> True:
|---- Predict : 11.783333333333331
|-----> False:
|---- Predict : -10.266666666666666
Conclusion from the process#
[16]:
df
[16]:
Height | FavColor | Gender | Weight | pseudo_residual1 | predicted_residual1 | tree_prediction1 | pseudo_residual2 | predicted_residual2 | tree_prediction2 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1.6 | Blue | Male | 88 | 16.8333 | 16.8333 | 76.2167 | 11.7833 | 11.7833 | 80.93 |
1 | 1.6 | Green | Female | 76 | 4.83333 | 5.33333 | 72.7667 | 3.23333 | 3.73333 | 74.26 |
2 | 1.5 | Blue | Female | 56 | -15.1667 | -14.6667 | 66.7667 | -10.7667 | -10.2667 | 62.66 |
3 | 1.8 | Red | Male | 73 | 1.83333 | 1.83333 | 71.7167 | 1.28333 | 1.28333 | 72.23 |
4 | 1.5 | Green | Male | 77 | 5.83333 | 5.33333 | 72.7667 | 4.23333 | 3.73333 | 74.26 |
5 | 1.4 | Blue | Female | 57 | -14.1667 | -14.6667 | 66.7667 | -9.76667 | -10.2667 | 62.66 |
[51]:
fig, ax = plt.subplots(1,2,figsize=(15,7))
ax[0].axhline(0,color='k', label='minimum residual = 0')
ax[0].plot(df['pseudo_residual1'],'o-', label='Pseudo Residual 1')
ax[0].vlines(df.index,ymin = 0,ymax = df['pseudo_residual1'],color= 'grey',lw=5,alpha=0.6)
ax[0].plot(df['pseudo_residual2'],'o-', label='Pseudo Residual 2')
ax[0].vlines(df.index,ymin = 0,ymax = df['pseudo_residual2'],color= 'orange', lw=3,alpha=1)
ax[0].set_title('Pseudo Residual')
ax[0].legend()
ax[1].plot(df['Weight'],'o-' ,color='k', label='Observations')
ax[1].plot(df['tree_prediction1'],'o-', label='Tree predictions 1')
ax[1].plot(df['tree_prediction2'],'o-', label='Tree predictions 2')
ax[1].set_title('Predictions')
ax[1].legend()
plt.show()

It is clearly visible that after each iteration the pseudo residuals decrease and the tree predictions get closer to the original observations.
Understand chaining/adding trees#
leaf value + (lr) * tree 1 predictions + (lr) * tree 2 predictions + …
So here we are actually adding trees built on the residuals of the last tree (the errors made by the last tree), chaining them together while scaling each one by the learning rate.
We keep adding trees until a predefined number of trees is reached or adding new trees no longer significantly reduces the residuals.
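Here is a minimal sketch of this whole loop with squared-error loss, where the pseudo-residuals are simply observed minus predicted, exactly as above. It assumes scikit-learn's DecisionTreeRegressor and a numeric X (the walkthrough uses mightypy and raw categorical columns), so it is an illustration rather than the doc's implementation.
# Illustrative sketch, not the mightypy-based walkthrough above: assumes scikit-learn and numeric X.
import numpy as np
from sklearn.tree import DecisionTreeRegressor as SkTree

def gradient_boost_fit(X, y, n_trees=50, learning_rate=0.3, max_depth=2):
    initial_leaf = y.mean()                          # F0(x): the mean of the observed values
    F = np.full(len(y), initial_leaf, dtype=float)   # running predictions
    trees = []
    for _ in range(n_trees):
        residual = y - F                             # pseudo-residuals for squared-error loss
        tree = SkTree(max_depth=max_depth).fit(X, residual)
        F = F + learning_rate * tree.predict(X)      # chain: previous output + nu * new tree
        trees.append(tree)
    return initial_leaf, trees

def gradient_boost_predict(initial_leaf, trees, X, learning_rate=0.3):
    # leaf value + (lr) * tree 1 predictions + (lr) * tree 2 predictions + ...
    return initial_leaf + learning_rate * sum(tree.predict(X) for tree in trees)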
Gradient Boosting Classification#
Algorithm#
Initial model with a constant value \(F_0(x) = {\text{arg min}\atop{\gamma}} {\sum_{i=1}^{n} L(y_i, \gamma)}\).
For m=1 to M:
Compute so-called pseudo-residuals: \(r_{i,m} = -[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} ]_{F(x) = F_{m-1}(x)}\), for i = 1,…,n.
Fit a regression tree to the \(r_{i,m}\) values and create terminal regions \(R_{jm}\) for \(j = 1,...,J_m\)
For \(j = 1,...,J_m\) compute \(\gamma_{jm} = {\text{arg min}\atop{\gamma}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)\).
Update the model \(F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} {\gamma_{jm} I(x \in R_{jm})}\).
Output \(F_M(x)\).
Loss function calculations#
\begin{align} L(y, p) &= -\sum_{i=1}^{n} y_i \times log(p_i) + (1 - y_i) \times log(1 - p_i)\\ L(y, p) &= -\sum_{i=1}^{n} y_i log(p_i) + log(1 - p_i) - y_i log(1 - p_i)\\ L(y, p) &= -\sum_{i=1}^{n} y_i log(p_i) - y_i log(1 - p_i) + log(1 - p_i)\\ L(y, p) &= -\sum_{i=1}^{n} y_i log(\frac{p_i}{1 - p_i}) + log(1 - p_i)\\ \because \text{odds} &= \frac{p}{1 - p}\\ \therefore L(y, p) &= -\sum_{i=1}^{n} y_i log(odds) + log(1 - p_i) \end{align}
\begin{align} log(1 - p_i) &= log(1 - \frac{e^{log(odds)}}{1 + e^{log(odds)}})\\ &= log(\frac{1 + e^{log(odds)} - e^{log(odds)}}{1 + e^{log(odds)}})\\ &= log(\frac{1}{1 + e^{log(odds)}})\\ &= log(1) - log(1 + e^{log(odds)})\\ &= 0 - log(1 + e^{log(odds)}) \end{align}
\begin{align} L(y, p) &= -\sum_{i=1}^{n} y_i log(odds) - log(1 + e^{log(odds)}) \end{align}
\begin{align} \frac{\partial L}{\partial log(odds)} &= - \sum \left( y_i - \frac{e^{log(odds)}}{1 + e^{log(odds)}} \right)\\ \because \frac{e^{log(odds)}}{1 + e^{log(odds)}} &= p_i \text{ , the predicted probability}\\ \therefore \frac{\partial L}{\partial log(odds)} &= - \sum (y_i - p_i) \end{align}
\begin{align} L(y, p) &= -\sum_{i=1}^{n} \left[ y_i \times log(p_i) + (1 - y_i) \times log(1 - p_i) \right]\\ \frac{\partial L(y, p)}{\partial p} &= -\sum \left[ \frac{y_i}{p_i} + \frac{1-y_i}{1-p_i} {(-1)} \right]\\ \frac{\partial L(y, p)}{\partial p} &= -\sum \left[ \frac{y_i}{p_i} - \frac{1-y_i}{1-p_i} \right]\\ \frac{\partial L(y, p)}{\partial p} &= -\sum \frac{y_i - p_i y_i - p_i + p_i y_i}{p_i (1-p_i)}\\ \frac{\partial L(y, p)}{\partial p} &= -\sum \frac{y_i - p_i}{p_i (1-p_i)} \end{align}
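As a quick sanity check of the derivative above (an illustrative snippet I am adding, not part of the original derivation), a finite-difference estimate of \(\frac{\partial L}{\partial log(odds)}\) for a single sample matches \(-(y_i - p_i)\):
# Illustrative check only: numerical derivative of the log loss w.r.t. log(odds).
import numpy as np

def loss(y, log_odds):
    # L = -(y * log(odds) - log(1 + e^{log(odds)}))
    return -(y * log_odds - np.log(1 + np.exp(log_odds)))

y, log_odds, eps = 1.0, 0.4, 1e-6
p = np.exp(log_odds) / (1 + np.exp(log_odds))
numeric = (loss(y, log_odds + eps) - loss(y, log_odds - eps)) / (2 * eps)
print(round(numeric, 4), round(-(y - p), 4))   # both print -0.4013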
Sample data#
[2]:
sample_data = np.array([
['yes', 12, 'Blue', 1],
['yes', 19, 'Green', 1],
['no', 44, 'Blue', 0],
['yes', 87, 'Red', 0],
['no', 32, 'Green', 1],
['no', 14, 'Blue', 1]
],dtype='O')
headers = ['LikesPopcorn', 'Age', 'FavColor', 'LovesTroll2']
df = pd.DataFrame(sample_data,columns=headers)
df
[2]:
LikesPopcorn | Age | FavColor | LovesTroll2 | |
---|---|---|---|---|
0 | yes | 12 | Blue | 1 |
1 | yes | 19 | Green | 1 |
2 | no | 44 | Blue | 0 |
3 | yes | 87 | Red | 0 |
4 | no | 32 | Green | 1 |
5 | no | 14 | Blue | 1 |
[3]:
X = sample_data[...,[0,1,2]]
y = sample_data[...,[-1]]
Initial model/ leaf \(F_0(x)\)#
Concept of log odds#
The log of the odds is the logistic-regression equivalent of the average: it is the constant prediction we start from.
\(odds = \frac{P}{1 - P}\)
I am considering two classes, 1 and 0.
\begin{align} odds &= \frac{\frac{count_{1}}{count_T}}{1 - \frac{count_{1}}{count_T}}= \frac{\frac{count_{1}}{count_T}}{\frac{count_T - count_{1}}{count_T}} = \frac{\frac{count_{1}}{count_T}}{\frac{count_0}{count_T}}\\ odds &= \frac{count_{1}}{count_{0}} \end{align}
[4]:
total_count = y.shape[0]
class_counts = np.unique(y,return_counts=True)
class_counts_dict = dict(zip(class_counts[0],class_counts[1]))
class_counts_dict
[4]:
{0: 2, 1: 4}
[5]:
odds = class_counts_dict[1]/ class_counts_dict[0]
log_odds = np.log(odds)
df['ini_leaf_val'] = log_odds
df
[5]:
LikesPopcorn | Age | FavColor | LovesTroll2 | ini_leaf_val | |
---|---|---|---|---|---|
0 | yes | 12 | Blue | 1 | 0.693147 |
1 | yes | 19 | Green | 1 | 0.693147 |
2 | no | 44 | Blue | 0 | 0.693147 |
3 | yes | 87 | Red | 0 | 0.693147 |
4 | no | 32 | Green | 1 | 0.693147 |
5 | no | 14 | Blue | 1 | 0.693147 |
Applying the logistic (sigmoid) function to turn log odds into a probability#
Sigmoid \(f(x) = \frac{1}{1 + e^{-x}}\)
\(f(x) = \frac{e^{x}}{1 + e^{x}}\)
[6]:
probs = np.exp(log_odds) / (1 + np.exp(log_odds))
df['initial_probs'] = probs
df
[6]:
LikesPopcorn | Age | FavColor | LovesTroll2 | ini_leaf_val | initial_probs | |
---|---|---|---|---|---|---|
0 | yes | 12 | Blue | 1 | 0.693147 | 0.666667 |
1 | yes | 19 | Green | 1 | 0.693147 | 0.666667 |
2 | no | 44 | Blue | 0 | 0.693147 | 0.666667 |
3 | yes | 87 | Red | 0 | 0.693147 | 0.666667 |
4 | no | 32 | Green | 1 | 0.693147 | 0.666667 |
5 | no | 14 | Blue | 1 | 0.693147 | 0.666667 |
So we start all observations from the same initial probability, and this probability will move, in a sigmoid fashion, towards either 0 or 1.
Pseudo Residual (\(r_{i,m}\))#
[7]:
pseudo_residual = y - probs
df['pseudo_residual1'] = pseudo_residual
df
[7]:
LikesPopcorn | Age | FavColor | LovesTroll2 | ini_leaf_val | initial_probs | pseudo_residual1 | |
---|---|---|---|---|---|---|---|
0 | yes | 12 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 |
1 | yes | 19 | Green | 1 | 0.693147 | 0.666667 | 0.333333 |
2 | no | 44 | Blue | 0 | 0.693147 | 0.666667 | -0.666667 |
3 | yes | 87 | Red | 0 | 0.693147 | 0.666667 | -0.666667 |
4 | no | 32 | Green | 1 | 0.693147 | 0.666667 | 0.333333 |
5 | no | 14 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 |
Building regression tree to predict residuals#
Yes, this is different from other algorithms: we are building the regression tree to predict the residuals, not the target values.
[8]:
from mightypy.ml.tree_models import DecisionTreeRegressor
[9]:
model_tree = DecisionTreeRegressor(max_depth=2)
model_tree.train(
X=X,
y=pseudo_residual,
feature_name = headers[:-1],
target_name = headers[-1]
)
[10]:
model_tree.print_tree(mean_preds=False)
|-Is FavColor == Green ? | variance :0.22222222222222224
|---> True:
|-- Predict : [0.33333333333333337 0.33333333333333337]
|---> False:
|--Is Age >= 87 ? | variance :0.25
|----> True:
|--- Predict : [-0.6666666666666666]
|----> False:
|---Is LikesPopcorn == yes ? | variance :0.22222222222222224
|-----> True:
|---- Predict : [0.33333333333333337]
|-----> False:
|---- Predict : [-0.6666666666666666 0.33333333333333337]
[11]:
predicted_residuals = model_tree.predict(X,False)
predicted_residuals
[11]:
array([[array([0.33333333333333337], dtype=object)],
[array([0.33333333333333337, 0.33333333333333337], dtype=object)],
[array([-0.6666666666666666, 0.33333333333333337], dtype=object)],
[array([-0.6666666666666666], dtype=object)],
[array([0.33333333333333337, 0.33333333333333337], dtype=object)],
[array([-0.6666666666666666, 0.33333333333333337], dtype=object)]],
dtype=object)
\begin{align} \gamma_{j, m} &= -\frac{\partial L(y, p)}{\partial p} = \sum \frac{y_i - p_i}{p_i (1-p_i)}\\ \gamma_{j, m} &= \frac{\sum{Residual_i}}{\sum \left[ \text{prev prob}_i \times (1 - \text{prev prob}_i) \right]} \end{align}
\begin{align} &\text{ But when we are talking about tree regions, let's assume we have a leaf with multiple values.}\\ &\text{ Then this calculation turns out to be}\\ \gamma_{j, m} &= \frac{{Residual_1 + Residual_2}}{[\text{prev prob}_1 \times (1 - \text{prev prob}_1)] + [\text{prev prob}_2 \times (1 - \text{prev prob}_2)]}\\ &\text{ Now, if they are part of the same leaf, the previous probability is the same for both of them,}\\ &\text{ so the expression is just the mean residual of the leaf divided by } \text{prev prob}_i \times (1 - \text{prev prob}_i)\text{:}\\ \gamma_{j, m} &= \frac{\sum{Residual_i}}{n \times \text{prev prob}_i \times (1 - \text{prev prob}_i)}\\ &\text{ where } n \text{ is the number of values in the leaf.} \end{align}
________
|Question|
|________|
/ \
________/ \________
|___1 __| |__2 3 4_|
So, based on my decision tree regression model, I can simply take the mean prediction of a leaf instead of keeping the individual values in it.
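For example, the last leaf of the tree printed above holds two residuals, -0.667 and 0.333, and both samples share the same previous probability of 0.667. A quick check of the formula with these numbers (an illustrative snippet):
# Illustrative check only: leaf value from summed residuals vs. from the leaf's mean residual.
import numpy as np

residuals = np.array([-2/3, 1/3])   # the two residuals in the leaf
prev_prob = 2/3                     # same previous probability for both samples
n = len(residuals)

gamma = residuals.sum() / (n * prev_prob * (1 - prev_prob))
mean_based = residuals.mean() / (prev_prob * (1 - prev_prob))
print(gamma, mean_based)            # both -0.75
Both routes give -0.75, which is why taking the leaf's mean prediction works here.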
[12]:
model_tree.print_tree(mean_preds=True)
predicted_residuals = model_tree.predict(X,True)
predicted_residuals
|-Is FavColor == Green ? | variance :0.22222222222222224
|---> True:
|-- Predict : 0.33333333333333337
|---> False:
|--Is Age >= 87 ? | variance :0.25
|----> True:
|--- Predict : -0.6666666666666666
|----> False:
|---Is LikesPopcorn == yes ? | variance :0.22222222222222224
|-----> True:
|---- Predict : 0.33333333333333337
|-----> False:
|---- Predict : -0.16666666666666663
[12]:
array([[0.33333333333333337],
[0.33333333333333337],
[-0.16666666666666663],
[-0.6666666666666666],
[0.33333333333333337],
[-0.16666666666666663]], dtype=object)
[13]:
tree_output = predicted_residuals / ( probs * (1 - probs))
df['tree_output1'] = tree_output
df
[13]:
LikesPopcorn | Age | FavColor | LovesTroll2 | ini_leaf_val | initial_probs | pseudo_residual1 | tree_output1 | |
---|---|---|---|---|---|---|---|---|
0 | yes | 12 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 |
1 | yes | 19 | Green | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 |
2 | no | 44 | Blue | 0 | 0.693147 | 0.666667 | -0.666667 | -0.75 |
3 | yes | 87 | Red | 0 | 0.693147 | 0.666667 | -0.666667 | -3 |
4 | no | 32 | Green | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 |
5 | no | 14 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 | -0.75 |
Chaining predicted results with earlier results#
[14]:
learning_rate = 0.1
predictions = probs + (learning_rate * tree_output)
df["predictions1"] = predictions
df
[14]:
LikesPopcorn | Age | FavColor | LovesTroll2 | ini_leaf_val | initial_probs | pseudo_residual1 | tree_output1 | predictions1 | |
---|---|---|---|---|---|---|---|---|---|
0 | yes | 12 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 |
1 | yes | 19 | Green | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 |
2 | no | 44 | Blue | 0 | 0.693147 | 0.666667 | -0.666667 | -0.75 | 0.591667 |
3 | yes | 87 | Red | 0 | 0.693147 | 0.666667 | -0.666667 | -3 | 0.366667 |
4 | no | 32 | Green | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 |
5 | no | 14 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 | -0.75 | 0.591667 |
Repeating for next tree#
[67]:
probs = np.exp(predictions) / (1 + np.exp(predictions))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
AttributeError: 'float' object has no attribute 'exp'
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-67-974572c91fc1> in <module>
----> 1 probs = np.exp(predictions) / (1 + np.exp(predictions))
TypeError: loop of ufunc does not support argument 0 of type float which has no callable exp method
[15]:
predictions = np.array(predictions,dtype='float')
probs = np.exp(predictions) / (1 + np.exp(predictions))
[16]:
df['prob1'] = probs
df
[16]:
LikesPopcorn | Age | FavColor | LovesTroll2 | ini_leaf_val | initial_probs | pseudo_residual1 | tree_output1 | predictions1 | prob1 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | yes | 12 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 | 0.693528 |
1 | yes | 19 | Green | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 | 0.693528 |
2 | no | 44 | Blue | 0 | 0.693147 | 0.666667 | -0.666667 | -0.75 | 0.591667 | 0.643747 |
3 | yes | 87 | Red | 0 | 0.693147 | 0.666667 | -0.666667 | -3 | 0.366667 | 0.590653 |
4 | no | 32 | Green | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 | 0.693528 |
5 | no | 14 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 | -0.75 | 0.591667 | 0.643747 |
[70]:
pseudo_residual = y - probs
df['pseudo_residual2'] = pseudo_residual
model_tree = DecisionTreeRegressor(max_depth=2)
model_tree.train(
X=X,
y=pseudo_residual,
feature_name = headers[:-1],
target_name = headers[-1]
)
predicted_residuals = model_tree.predict(X,True)
tree_output = predicted_residuals / ( probs * (1 - probs))
df['tree_output2'] = tree_output
predictions = probs + (learning_rate * tree_output)
predictions = np.array(predictions,dtype='float')
df["predictions2"] = predictions
probs = np.exp(predictions) / (1 + np.exp(predictions))
df['prob2'] = probs
df
[70]:
LikesPopcorn | Age | FavColor | LovesTroll2 | ini_leaf_val | initial_probs | pseudo_residual1 | tree_output1 | predictions1 | prob1 | pseudo_residual2 | tree_output2 | predictions2 | prob2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | yes | 12 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 | 0.693528 | 0.306472 | 1.4419 | 0.837719 | 0.697984 |
1 | yes | 19 | Green | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 | 0.693528 | 0.306472 | 1.4419 | 0.837719 | 0.697984 |
2 | no | 44 | Blue | 0 | 0.693147 | 0.666667 | -0.666667 | -0.75 | 0.591667 | 0.643747 | -0.643747 | -0.626797 | 0.581068 | 0.641313 |
3 | yes | 87 | Red | 0 | 0.693147 | 0.666667 | -0.666667 | -3 | 0.366667 | 0.590653 | -0.590653 | -2.44292 | 0.346362 | 0.585735 |
4 | no | 32 | Green | 1 | 0.693147 | 0.666667 | 0.333333 | 1.5 | 0.816667 | 0.693528 | 0.306472 | 1.4419 | 0.837719 | 0.697984 |
5 | no | 14 | Blue | 1 | 0.693147 | 0.666667 | 0.333333 | -0.75 | 0.591667 | 0.643747 | 0.356253 | -0.626797 | 0.581068 | 0.641313 |
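To wrap up, here is a minimal sketch of the whole gradient boosting classification loop. It is an illustration under a few assumptions (scikit-learn's DecisionTreeRegressor, a numeric X, y encoded as 0/1), not the mightypy-based walkthrough above. It keeps the running prediction in log-odds space, as the algorithm statement prescribes, and converts each leaf's residuals into a log-odds step with the \(\frac{\sum{Residual_i}}{\sum \text{prev prob}_i (1 - \text{prev prob}_i)}\) formula derived above.
# Illustrative sketch, not the walkthrough above: assumes scikit-learn, numeric X, y in {0, 1}.
import numpy as np
from sklearn.tree import DecisionTreeRegressor as SkTree

def sigmoid(log_odds):
    return 1 / (1 + np.exp(-log_odds))

def gb_classify_fit(X, y, n_trees=20, learning_rate=0.1, max_depth=2):
    # initial leaf: log(odds) = log(count_1 / count_0)
    F = np.full(len(y), np.log(y.sum() / (len(y) - y.sum())), dtype=float)
    trees = []
    for _ in range(n_trees):
        p = sigmoid(F)
        residual = y - p                              # pseudo-residuals
        tree = SkTree(max_depth=max_depth).fit(X, residual)
        # turn each leaf's residuals into a log-odds step: sum(residual) / sum(p * (1 - p))
        leaf_ids = tree.apply(X)
        gamma = np.zeros(len(y))
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            gamma[mask] = residual[mask].sum() / (p[mask] * (1 - p[mask])).sum()
        F = F + learning_rate * gamma                 # chain the trees in log-odds space
        trees.append(tree)
    return F, sigmoid(F)                              # final log-odds and fitted probabilities
To score new data, the per-leaf gamma values would have to be stored alongside each tree; this sketch only returns the fitted training probabilities. With the LovesTroll2 table above, LikesPopcorn and FavColor would first need to be one-hot encoded.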