ML notes personal 3

 

🧠 Ridge Regression aka L2 Regularization (because of the squared penalty)



📌 What is Ridge Regression?

Ridge Regression is a regularized version of linear regression that helps reduce overfitting by penalizing large coefficients.


🧮 Key Formula:



\text{Loss} = \text{RSS} + \lambda \sum_{j=1}^{n} \beta_j^2

Where:

  • RSS = Residual Sum of Squares (normal linear regression loss)

  • λ (lambda) = regularization strength (also called the tuning parameter)

  • βⱼ = model coefficients

  • The penalty term is L2 norm


🎯 Why Use Ridge?

  • Controls model complexity

  • Prevents overfitting

  • Useful when features are highly correlated (multicollinearity)


⚙️ How It Works

  • Adds a penalty proportional to the squared values of the coefficients

  • Forces coefficients to be smaller, but not zero

  • Helps in bias-variance tradeoff (increases bias, reduces variance)


๐Ÿ” Difference from Other Methods

Method | Penalty | Shrinks Coefficients to Zero?
Linear Regression | None | No
Ridge | L2 (squared) | No
Lasso | L1 (absolute) | Yes

🧪 When to Use Ridge?

  • Many features with small effects

  • Multicollinearity present

  • Want to keep all features, but reduce their impact


🧩 How to Choose λ?

  • Use cross-validation to find the best λ (see the RidgeCV sketch after the implementation below)

  • As λ ↑ → the model becomes more biased but generalizes better


📈 Implementation (sklearn)

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)   # alpha plays the role of λ in the formula above
model.fit(X_train, y_train)
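To pick alpha by cross-validation, a minimal sketch using sklearn's RidgeCV (the alpha grid is illustrative; X_train/y_train as above):

from sklearn.linear_model import RidgeCV

# try a grid of alphas; RidgeCV keeps the one with the best cross-validation score
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
model.fit(X_train, y_train)
print(model.alpha_)   # the selected regularization strength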

Pros

  • Reduces overfitting

  • Keeps all features

  • Improves generalization

Cons

  • Coefficients are harder to interpret

  • Doesn't perform feature selection

Key points:

1. When you increase lambda, the coefficient values converge towards 0 but never reach exactly 0; columns with larger coefficients are affected more.



If lambda is too high, the model will underfit.


2. There is an ideal value of lambda/alpha (typically found via cross-validation).


3. Notice how the coefficients converge towards 0 as lambda increases.


When to use ridge regression?

- When the number of input columns is >= 2 (i.e., multiple, possibly correlated, features)


Now, why does the coefficient never become exactly 0?




That would require the numerator to be 0, which is very hard to reach, because the numerator is a sum of (actual − predicted) terms.

Lambda sits in the denominator, so it can shrink the coefficient but never force it to exactly 0.
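One common way to see this is the closed-form coefficient for single-feature ridge regression (an illustrative special case, not the general matrix form):

\beta_{\text{ridge}} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2 + \lambda}

The denominator grows with λ, so the coefficient shrinks, but the numerator is almost never exactly 0.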

Lasso Regression (L1 Regularization)




In ridge regression, a coefficient never becomes exactly 0; here (in lasso) it can.

Term: sparsity means some coefficients are exactly 0.

So, what's the benefit of making them 0?

- When you have very high-dimensional data (e.g., features x1 ... x3000), keeping all the features makes the model likely to overfit. In lasso, if we increase alpha, some columns' coefficients become exactly 0, so it is helpful for feature selection (see the sketch below).
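A minimal sketch of lasso-driven feature selection (X_train/y_train are assumed as above; the alpha value is illustrative):

import numpy as np
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)               # larger alpha -> more coefficients pushed to exactly 0
model.fit(X_train, y_train)

kept = np.where(model.coef_ != 0)[0]   # indices of the features the model kept
print("non-zero coefficients:", kept)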



In the image above, notice the selected row: a few features' coefficients are 0, so we can easily ignore those features for training.



Why do the coefficients become exactly 0 in lasso?


ElasticNet Regression :

Used when you don't know whether to use ridge or lasso.

We take values of lambda (alpha) and l1_ratio, and from those the L1 and L2 penalty weights (a and b) are calculated.

In sklearn, the defaults are alpha = 1.0 and l1_ratio = 0.5.
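A minimal sketch, assuming the same X_train/y_train as before:

from sklearn.linear_model import ElasticNet

# l1_ratio=0.5 mixes the L1 (lasso) and L2 (ridge) penalties equally
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X_train, y_train)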




Logistic Regression

Used for linearly separable data


  • Logistic Regression (yes, name is misleading — it’s for classification)




  • The actual equation is ax + by + c = 0,
    but to write it in summation (Σ wᵢxᵢ) form, we add a column x0 = 1.



    1. Run for a fixed number (x) of epochs.
    2. Start with random weights for a, b, c.
    3. In each epoch, select one random point and ask: from that point's perspective, is the line classifying it correctly?
    4. If yes, do nothing; if no, update the weights so that the line puts that point in the right region.




    This is too many operations, so we can use a simplified algorithm.



    There are only 4 possibilities (see the bottom-left corner of the figure).

    Now, if you look at the top-right corner, the new (simplified) formula gives the same results.

    Problem with this technique : 


    The model will predict the red line, but in reality we need it to predict the black line.

    So, now we will make a new approach :

    Mis-classified points pull the line towards them.

    Correctly-classified points push the line away, so at some point the blue and green points are pushing from both sides and the line settles into equilibrium.


    So, to solve this, the sigmoid function comes into the picture.

    We simply replace the step function with the sigmoid.


    Instead of hard 0s and 1s, we get values in between: roughly 0.00000000001 to 0.5 on one side
    and 0.5 to 0.9999999999 on the other.
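    For reference, the sigmoid and the resulting probability model:

    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(w^T x)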



    When you add something to the weights, the line goes down; when you subtract something, the line goes up.


    so, now we have got this output : 


    In this logistic regression procedure, we take random points for n epochs, so the line will look correct many times. But how do we know which particular line is the best one? For that, we use a loss function based on maximum likelihood.



    Model likelihood = the product, over all data points, of each point's probability of its actual colour.

    That means for a green point we use P(G), and for a red point we use P(R).

    further mathematics,
       

    log loss error equation : 
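    The standard binary cross-entropy (log loss) form, for reference:

    \text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]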



    Derivative (d/dx) of sigmoid : 
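    The standard result, for reference:

    \frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))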



    Just use gradient descent on the log loss in logistic regression, and you will get (approximately) the black line.
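    In practice, sklearn handles the optimisation; a minimal sketch (X_train/y_train and an X_test split are assumed, with 0/1 class labels in y):

    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()         # minimises log loss internally
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)  # sigmoid outputs, one column per class
    preds = clf.predict(X_test)        # hard class labels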


    Accuracy : 


    How much accuracy is good?

    - It totally depends on the use case.
    - For medical diagnosis or a self-driving car, 99% accuracy can still be bad, because the remaining 1% can mean a death or an accident.
    - For estimating a Swiggy sales increase during holidays, 80% accuracy is also good.

    Confusion Matrix :


    Type 1 Error:
    FP (False Positive)

    Type 2 Error:
    FN (False Negative)


    When is the accuracy score misleading?
    - Example: defect detection where 99% of the part is black and only 1% is white (the defect). Even if the model predicts an all-black mask, the accuracy will still be 99%.

    Precision: use when false positives (Type 1 errors) are the more important mistake to avoid.

    Recall: use when false negatives (Type 2 errors) are the more important mistake to avoid.
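    For reference, in terms of the confusion-matrix counts:

    \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}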



        

    F1 Score : 



    The F1 Score is the harmonic mean of precision and recall.

    The harmonic mean ensures that the F1 score is low if either precision or recall is low
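    For reference:

    \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}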

    For example:

    • If precision = 1.0 and recall = 0.0, the F1 score is 0.0, not 0.5. That’s important — it reflects the fact that zero recall means your model missed all real positives, which is bad.





    same for recall and f1 score.





    Softmax Function : 




    The sum of these 3 class probabilities will be 1.
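    The softmax definition, for reference (K = number of classes):

    \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}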



    Polynomial Features in Logistic Regression : 






    Logistic regression is mostly used for linearly separable data; for non-linear data we have better algorithms. But polynomial logistic regression does exist, and it is worth understanding the concept (see the sketch below).
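    A minimal sketch of polynomial logistic regression via a pipeline (degree=2 is an arbitrary choice; X_train/y_train assumed as before):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression

    # expand the features into polynomial terms, then fit a linear boundary in that expanded space
    clf = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
    clf.fit(X_train, y_train)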



    Decision Tree : 










    Geometric Intuition:





    1. We always have to decide the best feature to split on at the root, and then the best split at each level and step; we will see how to do this next.

    2. How do we decide the splitting threshold for numerical data?


    Also known as CART (Classification and Regression Trees).

    Calculate Entropy:
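    The entropy of a node, for reference (p_i is the fraction of samples belonging to class i):

    H(S) = -\sum_{i} p_i \log_2 p_i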









    Information Gain 
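    Information gain compares the parent's entropy with the weighted average entropy of the children after a split:

    \text{IG} = H(\text{parent}) - \sum_{k} \frac{n_k}{n} H(\text{child}_k)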





    Entropy range: 0 to 1 for binary classification (with more than two classes it can go above 1).



        




    Gini Impurity : 



    Range: 0 to 0.5 for binary classification.

    ๐Ÿ” Interpretation:

    • 0 → Pure (only one class)

    • > 0 → Impure (more than one class)

    • Maximum Gini (for two classes) is 0.5, when classes are perfectly balanced (50/50)
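    The Gini impurity formula, for reference:

    \text{Gini} = 1 - \sum_{i} p_i^2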


    For numerical data : 



    So, for each candidate split value we get 2 datasets (rows below the threshold and rows above it), and we pick the threshold that gives the best impurity reduction.







    Overfitting in Decision Trees : 





    The lone green point should have been treated as part of the red region to avoid overfitting.



    Underfitting in Decision Tree:


    Since 80 are 'yes', the (underfit) model will always predict 'yes'.




    So, we have some hyperparameters that control how far the tree grows, e.g.:
    • min_samples_split = 4

    • min_samples_leaf = 3

    Step 1: min_samples_split : Can we split this node?

    • We have 10 samples

    • min_samples_split = 4, so ✅ we can split (since 10 ≥ 4)

    Step 2: min_samples_leaf: after the split, will each leaf (ending point) have enough samples? Is the split valid?

    Say the split would result in:

    • Left child: 2 samples ❌

    • Right child: 8 samples ✅

    But you set min_samples_leaf = 3, so the left child is invalid because it’s too small. This split will be rejected.
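    A minimal sketch with these hyperparameters (the values mirror the example above; X_train/y_train assumed):

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        criterion="gini",        # or "entropy"
        min_samples_split=4,     # a node needs at least 4 samples to be considered for splitting
        min_samples_leaf=3,      # every resulting leaf must keep at least 3 samples
    )
    clf.fit(X_train, y_train)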


    More params:

    max_leaf_nodes (maximum number of leaf nodes)
    min_impurity_decrease (minimum decrease in impurity required for a split)

    etc.

    Purity means all samples in a node belong to the same class after that split; impurity means the node still contains more than one class.



    1. Voting



    2. Stacking


    After the predictions from the first layer of models, another model (a meta-model) is placed on top, which predicts the final output.
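    A minimal stacking sketch (the base models and the meta-model here are arbitrary choices; X_train/y_train assumed):

    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # layer 1: diverse base models; layer 2: a meta-model trained on their predictions
    clf = StackingClassifier(
        estimators=[("dt", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
        final_estimator=LogisticRegression(),
    )
    clf.fit(X_train, y_train)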



    Check out the bagging/boosting blog first, then continue:

    The different coloured lines are the predictions of different models.


    Benefits :

    1. Helps achieve low bias + low variance.

    When to use?
    Always.


    Voting Ensemble

    Assumptions :

    1. Each model is different from the others.
    2. Each model has at least 51% accuracy (better than random guessing).


    Classification : 




    hard voting vs soft voting :



        
    Soft voting averages the predicted class probabilities, so you get a score as well.
    Hard voting takes a majority vote over the predicted labels and gives you the winning class directly.
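    A minimal sketch of both modes (the base models are arbitrary; soft voting needs models that expose predict_proba; X_train/y_train assumed):

    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    estimators = [("lr", LogisticRegression()),
                  ("dt", DecisionTreeClassifier()),
                  ("knn", KNeighborsClassifier())]

    hard_clf = VotingClassifier(estimators=estimators, voting="hard")  # majority vote on labels
    soft_clf = VotingClassifier(estimators=estimators, voting="soft")  # average of predict_proba
    hard_clf.fit(X_train, y_train)
    soft_clf.fit(X_train, y_train)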


    Bagging has 2 stages:
    1. Bootstrapping - sample subsets of the dataset
    2. Aggregating - combine the outputs of all the models



    Note that in bagging all the models usually use the same algorithm, such as all KNNs or all SVMs. It is not mandatory (you can mix algorithms), but practically it rarely makes sense, so we keep the same algorithm.

    Now, when we want to predict for a data point, it always passes through all the models, and whatever the majority predicts is taken as the final prediction.




    4 types:
    Bagging: row sampling with replacement (can have duplicates)
    Pasting: row sampling without replacement (no duplicates)
    Random Subspaces: column sampling, with or without replacement
    Random Patches: row & column sampling, with or without replacement

    (axis = 0 or 1 decides whether rows or columns are sampled. A BaggingClassifier sketch with these options follows the OOB note below.)


    Bagging, or bootstrap aggregating, is a useful ensemble learning technique when you have a model with high variance and are concerned about overfitting, or when you want to improve the stability and accuracy of predictions. It's particularly effective for unstable models like decision trees.

    In reality, you can directly use bagging like this:


    samples = rows, features = columns
    (in sklearn these correspond to the max_samples and max_features parameters)

    Now, during this random selection with replacement, there is a high chance of duplicating some rows and missing others entirely; to make use of the missed rows you can set oob_score=True.

    OOB = out of bag

    The model is then scored on those left-out rows, so you are basically using them as validation data.
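    A minimal sketch covering these options (the base estimator and the 0.5 fractions are illustrative; in older sklearn versions the first parameter is named base_estimator; X_train/y_train assumed):

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    clf = BaggingClassifier(
        estimator=DecisionTreeClassifier(),
        n_estimators=100,
        max_samples=0.5,      # row sampling: fraction of rows given to each model
        max_features=0.5,     # column sampling: fraction of columns given to each model
        bootstrap=True,       # True = bagging (with replacement); False = pasting
        oob_score=True,       # score on the rows each model never saw
    )
    clf.fit(X_train, y_train)
    print(clf.oob_score_)     # out-of-bag accuracy estimate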






     
    Bagging vs Random Forest  : 


    1)
    bagging can have any algorithm for model training.

    Random Forest is always trained on Decision Trees. 

    2)

    Bagging : Column sampling happens at tree level

    Random Forest : Column Sampling happens at each node of tree


    The OOB score uses the skipped (out-of-bag) rows as validation data.




    RF in Feature Selection : 



    Here it is the digits dataset; the middle part shows the digit, and 1 pixel = 1 feature.

    The RF feature importances show that the light-coloured pixels are the important features.


    AdaBoost:



    Let's say we have 5 rows; to start, we assign an equal weight of 1/5 = 0.2 to each row.
    Say we use a decision tree for prediction and get y_pred; now assume that out of the 5 rows, 1 row's prediction is wrong.




    So now we calculate alpha based on the model's error, and use it to update the weight of each row.


    Let's say we have 3 models, a, b, c: a has a 0% error rate (always correct), b has a 100% error rate (always wrong), and c is 50-50.

    If you think carefully, a and b are equally useful: for b we can just flip its prediction (1 - prediction) and it will always be right. The useless one is c, which carries no information.

    So the alpha-vs-error graph looks like this: U-shaped, but in the negative direction.



    Rows predicted correctly (low error) get a low weight: no need to focus on them more.

    Rows predicted wrongly (high error) get a high weight: the next model needs to pay more attention to them.


    Take the new weights and normalize them so they sum to 1.


    Turn the normalized weights into cumulative ranges (bins between 0 and 1).

    Then draw 5 random values between 0 and 1 and see which row's bin each value falls into.



    This is called upsampling: only the drawn rows are taken for the next round of training.

    So, for example, take the 1st and 4th rows, and the 3rd row 3 times, and just drop the other 2 rows (2nd and 5th).

    Below is a different dataset, but observe that the row starting with [0 1 5 ...] repeats many times.



    Final step (predict):

    The final output is the combined (alpha-weighted) prediction of all the models.
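    A minimal sklearn sketch (by default the base estimators are shallow decision stumps; n_estimators and the X_test variable are illustrative; X_train/y_train assumed):

    from sklearn.ensemble import AdaBoostClassifier

    clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)  # sequential, weighted learners
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)   # alpha-weighted vote of all the weak learners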


    when to use bagging and boosting?


    Bagging:
    1. When you have a low-bias, high-variance (LBHV) model, e.g., a fully grown decision tree.
    2. Parallel training.
    3. All models have the same weight.

    Boosting:
    1. When you have a high-bias, low-variance (HBLV) model, e.g., a shallow decision tree.
    2. Sequential training.
    3. Models have different weights (alpha).



    KMeans Clustering




    a. Plot all data points.
    b. Select 2 random points (k = 2) as the initial centroids.
    c. For every other point, measure its distance from both centroids and put it in the closest cluster.
    d. Recalculate the centroid of each cluster (look at the x marks).
    e. Repeat step c.
    f. Repeat step d.

    When to stop? - When the current and previous centroid positions are the same (convergence).

    How to decide value of k?

    Using the elbow method:

    Terminology: WCSS (within-cluster sum of squares)

    The sum of squared distances of all points in a cluster from that cluster's centroid.

    Plot a graph: calculate the WCSS for k = 1, k = 2, k = 3, ... (usually we go up to about 20).

    k=1



    k=2


    The plotted value for k = 2 is the sum of WCSS1 and WCSS2 (one term per cluster).

    k=3


    .....

    Finally, select the elbow point - the k after which the WCSS stops dropping sharply (see the sketch below).
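    A minimal elbow-method sketch (X is assumed to be the unlabeled data; inertia_ is sklearn's name for the WCSS):

    from sklearn.cluster import KMeans

    wcss = []
    for k in range(1, 21):                         # try k = 1 .. 20
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        km.fit(X)
        wcss.append(km.inertia_)                   # within-cluster sum of squares for this k

    # plot k vs wcss and pick the elbow where the curve flattens
    for k, w in zip(range(1, 21), wcss):
        print(k, w)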


