ML notes personal 4

Gradient Boosting


How does it work?

step 1 : model 1 always predicts the average value of the output column; call this pred1

step 2 : calculate residual 1 -> res1 = actual - pred1

step 3 : train a decision tree on iq and cgpa with res1 as the target

step 4 : use this tree to make pred2

step 5 : res2 = actual - (pred1 + LR*pred2)


step 6 : again train a decision tree on iq, cgpa with res2 as the target, and repeat until the residuals are close to 0
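
A minimal sketch of these steps in Python, assuming a small toy DataFrame with iq, cgpa and package columns (the column names, learning rate and number of rounds are illustrative, not from the notes):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# hypothetical toy data: iq, cgpa -> package
df = pd.DataFrame({"iq": [90, 100, 110, 120],
                   "cgpa": [6.5, 7.0, 8.2, 9.1],
                   "package": [3.0, 4.0, 6.0, 8.0]})
X, y = df[["iq", "cgpa"]], df["package"]

LR = 0.1
pred = np.full(len(y), y.mean())                 # step 1: model 1 predicts the mean
trees = []
for _ in range(50):                              # step 6: repeat
    res = y - pred                               # steps 2/5: residual = actual - predicted
    tree = DecisionTreeRegressor(max_depth=2).fit(X, res)   # step 3: tree on residuals
    pred = pred + LR * tree.predict(X)           # steps 4/5: update predictions
    trees.append(tree)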


Gradient Boosting for classification : 

step 1 : a simple model that predicts the same initial log odds for every row




step 2 : 



this is the log odds, but we need a probability, so we convert it with the sigmoid function






step 3: calculate error



step 4 : make model2


and repeat

Final output : 



Some intuition : 

calculate the log odds for each leaf node using the given formula



detailed explanation for 1 node : 



observe that the number of samples = 2, so in the summation we are calculating over 2 values, highlighted in red in the table


so, the column pred2 (log_odds) is the sum:

previous log odds + the current leaf's log odds

    



and loop..

res2 = actual - current_probability

If a new point comes in, this is how we make a prediction :



0.35 < 0.5, so the prediction is class 0
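
A rough sketch of this prediction step, assuming we already have the initial log odds and trees whose leaf values store the log-odds updates (for log loss, each leaf's value is sum of residuals / sum of p*(1-p) over the samples in that leaf):

import numpy as np

def predict_class(x, initial_log_odds, trees, learning_rate=0.1, threshold=0.5):
    # sum the initial log odds and each tree's leaf output for this point
    log_odds = initial_log_odds + learning_rate * sum(t.predict([x])[0] for t in trees)
    prob = 1 / (1 + np.exp(-log_odds))       # sigmoid: log odds -> probability
    return 1 if prob >= threshold else 0     # e.g. 0.35 < 0.5 -> class 0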



Stacking Ensemble



step 1 : train the base models



step 2 : 

take the base models' outputs as the input and the actual package column as the expected output, and then train a meta model.


    


problem : 

for the base models, you are predicting on the same data the models were trained on, so the meta model is trained on overly optimistic predictions

solution : 

1. blending

2. k-fold stacking



1. Blending


level 0 = entire dataset D0

level 1 = 2 subsets D1, D2

level 2 = subparts of D1 = D3, D4

steps:

a. train the base models on D3

b. get the base models' predictions on D4

c. train the meta model on those predictions

d. evaluate on D2
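
A minimal sketch of blending with sklearn, assuming a feature matrix X and labels y already exist (the base models, meta model and split sizes are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# D0 -> D1 and D2; D1 -> D3 (for base models) and D4 (for the meta model)
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.2, random_state=42)
X3, X4, y3, y4 = train_test_split(X1, y1, test_size=0.25, random_state=42)

base_models = [DecisionTreeClassifier(), KNeighborsClassifier()]
for m in base_models:
    m.fit(X3, y3)                                              # a. train base models on D3

meta_X = np.column_stack([m.predict(X4) for m in base_models]) # b. predictions on D4
meta_model = LogisticRegression().fit(meta_X, y4)              # c. train meta model

test_meta_X = np.column_stack([m.predict(X2) for m in base_models])
print(meta_model.score(test_meta_X, y2))                       # d. evaluate on D2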


2. k-fold stacking

step 1:





step 2 : take their out-of-fold predictions and train the meta model
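
sklearn's StackingClassifier does this k-fold procedure internally (the cv parameter controls the number of folds); a minimal sketch, assuming X and y are already defined and the chosen models are illustrative:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),   # the meta model
    cv=5,                                    # out-of-fold predictions via 5-fold CV
)
stack.fit(X, y)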



Multi layer stacking



Where does k-means fail?



why?

because it measures distance from the centroid, so it assumes roughly spherical clusters


solution  :



1. Agglomerative Hierarchical Clustering





the distance matrix will look like this :





it merges the nearest clusters at each step


2. Divisive Clustering

it's the opposite: it starts from one large cluster and splits it



We have different types of agglomerative clustering based on how we calculate the distance between 2 clusters.


1. single link : calculate the distance between every pair of points across the two clusters, then take the minimum distance.

works best if the clusters are well separated; fails in the case of outliers



2. complete link : same, but take the maximum distance

it handles the outlier case

disadvantage :

can break a big cluster




3. group/average - a balance between min and max

take the average of all pairwise distances

4. ward - the default in sklearn

calculate the centroid of each of the two clusters and then the common (merged) centroid


How to find the ideal no. of clusters?


look at the dendrogram and take the longest vertical line that is not cut by any horizontal line; cut the dendrogram at that level to get the clusters
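
A minimal sketch with scipy/sklearn, assuming a small feature matrix X (the number of clusters is illustrative; the linkage names match the four options above):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

Z = linkage(X, method="ward")        # also: "single", "complete", "average"
dendrogram(Z)                         # look for the longest uncut vertical line here
plt.show()

labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)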



Benefits :

1. widely used, can work good on complex patterns

2. we have the info that particular point is closest to which point because of dendogram


Limitations  :

1. we are measuring distance between each points, at almost every step, so the matrix calculation will take a lot of space and computations, so not good for large dataset


KNN : 


How to select k?

1.

Generally, this technique is not advisable


2. experiment with k = 1, ..., 25

so, 25 models; select the model with the highest accuracy
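
A minimal sketch of that experiment using cross-validation, assuming X and y exist:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

scores = {}
for k in range(1, 26):                               # k = 1 ... 25
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)                 # k with the highest accuracy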



Decision Surface : 



take the range of x and y, and for each pixel in the grid, send that point to the KNN model; if the prediction is 0, mark it blue, if 1, mark it orange
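
A minimal sketch of drawing such a decision surface, assuming a fitted 2-feature model called knn and numpy arrays X, y:

import numpy as np
import matplotlib.pyplot as plt

# grid covering the range of both features
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
grid = np.c_[xx.ravel(), yy.ravel()]

Z = knn.predict(grid).reshape(xx.shape)   # 0 or 1 for every pixel
plt.contourf(xx, yy, Z, alpha=0.3)        # the blue/orange regions
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()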


for a very small k value, there is a chance of overfitting



underfitting for a very high k value: say there are 100 points in total, 70 orange and 30 blue; if k > 95 or so, then the majority vote will always be orange

Limitations of KNN : 

1. for a very large dataset, prediction takes a lot of time


2. high-dimensional data: e.g. f = 500; it is said that for high-dimensional data, distance is not a good metric, so there is a chance of error, and KNN relies entirely on distance

3. outliers



4. 


5. we don't know how much weight each feature carries for the output; it works like a black-box model


Assumptions of Linear Regression




1. each feature is expected to have a linear relationship with the output


2. There should not be multicollinearity; features should not depend on each other. In the graph below, the values are very small, so the features don't depend on each other



3. Errors should be normally distributed.



    



4. there must be homoscedasticity (constant variance of the residuals); this is also checked on the residuals


5. there should not be any pattern in residuals


1st is not valid, 2nd is correct
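
A minimal sketch of checking assumptions 3 to 5 with residual plots, assuming a fitted sklearn LinearRegression called model and data X, y:

import matplotlib.pyplot as plt

residuals = y - model.predict(X)

plt.hist(residuals, bins=30)               # assumption 3: should look roughly normal
plt.show()

plt.scatter(model.predict(X), residuals)   # assumptions 4 & 5: constant spread, no pattern
plt.axhline(0)
plt.show()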


SVM


in the image below, all of these lines correctly classify the points, so which model is best? which line is perfect?


So, we take the nearest +ve and -ve points, measure the margin/distance between them and the line, and choose the line where that margin is maximum


SVM kernels for classifying this type of data : 






how does it work?

it converts the 2D data into 3D data using a mathematical function, in such a way that the middle points end up higher than the side points and become separable
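
A minimal sketch comparing a linear SVM with an RBF-kernel SVM on data of exactly this type (sklearn's make_circles generates a ring inside a ring; the parameters are illustrative):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05)

linear_svm = SVC(kernel="linear").fit(X, y)   # struggles: no straight line separates the circles
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernel implicitly maps to a higher dimension

print(linear_svm.score(X, y), rbf_svm.score(X, y))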




A. AdaBoost vs B. Gradient Boost

- A - max depth of decision tree = 1 (decision stumps)

- B - max leaf nodes - (8 to 32)



Probabilities : 

Independent events:

an event that does not depend on the outcome of previous events

let's say you flipped a coin 3 times and got heads

now, for the 4th flip, the probability of heads remains the same, 1/2; it is independent of the previous results


Mutually Exclusive Events

two events that cannot happen together; for example, you cannot get both heads and tails in one flip



Naive Bayes Theorem:
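
For reference, the equation in question is Bayes' theorem, and the "naive" part is the added assumption that the features are independent given the class:

P(A|B) = P(B|A) * P(A) / P(B)

P(y | x1, ..., xn) ∝ P(y) * P(x1|y) * P(x2|y) * ... * P(xn|y)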



Proof of equation :





XGBoost

- it is a library built on the gradient boosting algorithm




parallel processing : example : we can calculate the gini index for each feature in parallel

4. out-of-core computing : 
say you have 8 GB of RAM but the dataset is 10 GB; in this case XGBoost will automatically divide the data into chunks, train the model on them sequentially, and also use the cache for better performance.

5. distributed computing :
we can use multiple machines, so when we divide the dataset, training can happen in parallel

XGBoost = Extreme Gradient Boosting



Performance 



1 = better regularization; less chance of overfitting because of the regularization term added to the objective

2, 3 = no need to remove/fill missing values; XGBoost handles them as they are.

4 = instead of evaluating a split at every value of a column, we perform binning and build an approximate (histogram-based) tree




5 = gives one more hyperparameter, called gamma, for better pruning decisions
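
A minimal sketch showing where these features surface as hyperparameters in the xgboost Python API (the values are illustrative only, and X, y are assumed to exist):

from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    reg_lambda=1.0,        # 1: L2 regularization term in the objective
    gamma=0.5,             # 5: minimum gain required to keep a split (pruning)
    tree_method="hist",    # 4: binning / approximate, histogram-based splits
    max_bin=256,           # 4: number of bins
    n_jobs=-1,             # parallel processing across cores
)
model.fit(X, y)            # 2, 3: missing values (NaN) in X are handled natively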



XGBoost for Regression






The initial setup is the same as gradient boosting,

but the DT is different here.

So, what is the difference?

Instead of using Gini/entropy, we build the DT with a different criterion: similarity score and gain.


step 1 : mean
step 2 : residual1 (actual - predicted)
step 3 : calculate the similarity score for residual1. For simplicity, take lambda = 0
step 4 : split res1 based on cgpa

step 5 : calculate the similarity score for the left and right nodes, and then the gain; whichever splitting criterion gives the highest gain is the one we use.


step 6 : splitting at 8.25 gave the maximum gain, so


step 7 : for these 3 points, decide the best splitting option



step 8 : calculate the output value for each leaf node



step 9 : calculate model 2 predictions
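
A minimal sketch of the criterion used in these steps, based on the standard XGBoost regression formulas (the residuals, lambda and candidate split are illustrative):

import numpy as np

def similarity(residuals, lam=0.0):
    # similarity score = (sum of residuals)^2 / (number of residuals + lambda)
    return np.sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left_res, right_res, lam=0.0):
    root = np.concatenate([left_res, right_res])
    return similarity(left_res, lam) + similarity(right_res, lam) - similarity(root, lam)

def leaf_output(residuals, lam=0.0):
    # output value of a leaf = sum of residuals / (number of residuals + lambda)
    return np.sum(residuals) / (len(residuals) + lam)

# e.g. try the split "cgpa < 8.25", keep the split with the highest gain,
# then new prediction = previous prediction + learning_rate * leaf_output(...)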






XGBoost for Classification

step 1 : the same value for each row: the initial log(odds)



step 2 : convert the log odds to probabilities




step 3 : calculate the similarity score of the root node



step 4 : calculate the gain for each splitting criterion



and repeat the process, same as regression
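
For classification, the only change in the criterion is the denominator; a sketch with the standard formulas, where prev_probs holds the previously predicted probability for each row in the node:

import numpy as np

def similarity_clf(residuals, prev_probs, lam=0.0):
    # (sum of residuals)^2 / (sum of p * (1 - p) + lambda)
    return np.sum(residuals) ** 2 / (np.sum(prev_probs * (1 - prev_probs)) + lam)

def leaf_output_clf(residuals, prev_probs, lam=0.0):
    return np.sum(residuals) / (np.sum(prev_probs * (1 - prev_probs)) + lam)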






DBSCAN Clustering


DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Dense : 

you make a circle with the current point as the center and radius = epsilon
If the number of points in the circle >= min points, it's dense


Sparse : 

If the number of points in the circle < min points, it's sparse




core point : a point whose epsilon-circle contains at least min points

Border point : a point whose circle does not contain at least min points but contains at least one core point

Noise point : a point that is neither a core point nor a border point (its circle has fewer than min points and no core point)
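
A minimal sketch with sklearn, assuming a feature matrix X (eps is the circle radius epsilon, min_samples is min points; the values are illustrative):

from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# labels >= 0 are cluster ids (core + border points); label -1 marks noise points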



    




Imbalanced data in ML


    
solution : 







2. oversampling - duplicate data of the minority class





3. SMOTE - generate synthetic data for the minority class and make it meaningful (a minimal sketch follows after this list)








5.2 - custom loss function
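
A minimal sketch of SMOTE, referenced from point 3 above, assuming the imbalanced-learn package is installed and X, y exist:

from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
# the minority class now gets synthetic, interpolated samples instead of plain duplicates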


Bayesian Search :

For hyperparameter tuning, GridSearchCV measures accuracy for every possible combination, but Bayesian search tries only a few points and then tries to model the mathematical relationship between them


bayesian




Optuna library/framework

you can give it a list of algorithms and a list of parameters to tune for each algorithm; the library will automatically choose the best algorithm and the best values for the hyperparameters
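
A minimal sketch of that idea with Optuna, assuming X and y exist (the algorithms and parameter ranges are illustrative):

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def objective(trial):
    algo = trial.suggest_categorical("algo", ["rf", "svm"])
    if algo == "rf":
        model = RandomForestClassifier(
            n_estimators=trial.suggest_int("n_estimators", 50, 500),
            max_depth=trial.suggest_int("max_depth", 2, 20),
        )
    else:
        model = SVC(C=trial.suggest_float("C", 0.01, 100, log=True))
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)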

ROC Curve


- used for threshold selection, mostly in binary classification.

AUC - area under the curve


It helps decide a threshold point from which we decide which class a particular point belongs to
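
A minimal sketch of threshold selection from the ROC curve, assuming a fitted binary classifier called model (with predict_proba) and test data X_test, y_test; picking the threshold that maximizes TPR - FPR is one common heuristic:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)

print("AUC:", roc_auc_score(y_test, probs))
best_threshold = thresholds[np.argmax(tpr - fpr)]   # Youden's J statistic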




    



