ML Notes Personal2

Handling Mixed Variables | Feature Engineering

Mixed variable = numeric + categorical data in a single cell/column

cell:

a single value carries both parts (for example a code like 'A101': a letter plus a number).

solution:

split each value into a categorical column (the letter part) and a numeric column (the number part).

column : 

some rows of the column hold numbers while other rows hold category labels.

solution:

extract the numeric part and the categorical part into two separate columns; each row has a value in one of them and NaN in the other.
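A minimal pandas sketch of both cases (the 'ticket' column and the mixed Series below are made up for illustration):

```python
import pandas as pd

# Case 1: one cell carries both parts, e.g. 'A101' -> letter + number.
df = pd.DataFrame({'ticket': ['A101', 'B202', 'C303']})
df['ticket_cat'] = df['ticket'].str.extract(r'([A-Za-z]+)')              # categorical part
df['ticket_num'] = pd.to_numeric(df['ticket'].str.extract(r'(\d+)')[0])  # numeric part

# Case 2: some rows are numbers, others are category labels.
mixed = pd.Series([10, 'LINE', 25, 'PC', 40])
num_part = pd.to_numeric(mixed, errors='coerce')   # NaN where the value is a label
cat_part = mixed.where(num_part.isna())            # NaN where the value is numeric
```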


Handling date and time related data:
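A minimal sketch, assuming a hypothetical 'order_date' column: convert it with pd.to_datetime and then pull out the parts a model can actually use.

```python
import pandas as pd

df = pd.DataFrame({'order_date': ['2025-07-27 14:30', '2025-01-03 09:10']})
df['order_date'] = pd.to_datetime(df['order_date'])

# Break the timestamp into separate numeric features.
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['day'] = df['order_date'].dt.day
df['weekday'] = df['order_date'].dt.dayofweek   # 0 = Monday
df['hour'] = df['order_date'].dt.hour
```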



Handling missing values in dataset





Solution 1 : CCA - Complete Case Analysis

Remove the entire row if any column has a null value in that row.

when to use this?
- Say we have 1000 rows and the age column is missing in 50 of them. We remove those 50 rows only if the missingness is scattered at random, not concentrated at the top or bottom of the data. This is called MCAR (Missing Completely At Random).
- When the missing data, combined across all columns, is < 5%.
- The same idea works for columns too: if one column has, say, 95% missing data, just drop that column.

Example : 


Remove the rows where these columns contain missing values.
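A minimal sketch of CCA with pandas, assuming a hypothetical data.csv:

```python
import pandas as pd

df = pd.read_csv('data.csv')   # hypothetical file

# Columns where less than 5% of the values are missing.
cca_cols = [c for c in df.columns if df[c].isnull().mean() < 0.05]

# Drop every row that has a missing value in any of those columns.
df_cca = df.dropna(subset=cca_cols)
print(len(df), '->', len(df_cca))
```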



Solution 2: Mean/Median Imputation



If the data is roughly symmetric and centred, use the mean.
If it is skewed to the left or right, go with the median.


You can use the fillna() function.
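For example (hypothetical 'age' and 'fare' columns; SimpleImputer is the scikit-learn equivalent that remembers the statistic for new data):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('data.csv')                         # hypothetical file

df['age'] = df['age'].fillna(df['age'].mean())       # roughly symmetric column
df['fare'] = df['fare'].fillna(df['fare'].median())  # skewed column

# scikit-learn version: the fitted imputer reuses the same statistic later.
imputer = SimpleImputer(strategy='median')
df[['fare']] = imputer.fit_transform(df[['fare']])
```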


3. Arbitrary Value Imputation : 

Fill the missing values with a fixed, arbitrary constant (a word or a number); see the sketch below.

If categorical : 

use words like 'none', 'missing', 'other'

If numerical : 0, -1, etc.

When to use?

- missing data is not random
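A small sketch with hypothetical 'age' and 'city' columns:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('data.csv')                 # hypothetical file

df['age'] = df['age'].fillna(-1)             # numeric: a constant outside the normal range
df['city'] = df['city'].fillna('Missing')    # categorical: an explicit label

# scikit-learn equivalent:
imputer = SimpleImputer(strategy='constant', fill_value=-1)
df[['age']] = imputer.fit_transform(df[['age']])
```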

4. End of distribution:

When it is hard to pick an arbitrary value, we replace nulls with a value at the far end (tail) of the distribution, e.g. mean + 3 standard deviations for roughly normal data (see the sketch below).



When to use?

- missing data is not random
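A sketch assuming a hypothetical, roughly normal 'age' column:

```python
import pandas as pd

df = pd.read_csv('data.csv')                 # hypothetical file

# Push missing values to the far right end of the distribution.
end_value = df['age'].mean() + 3 * df['age'].std()
df['age'] = df['age'].fillna(end_value)

# For a skewed column, Q3 + 1.5 * IQR is a common end point instead.
```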


5. For categorical data,

option 1 : Replace with most frequent value (Mode)

option 2 : Replace with an arbitrary word - 'none', 'missing', 'other'
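Both options with SimpleImputer, assuming a hypothetical 'city' column:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('data.csv')                                                   # hypothetical file

mode_imputer = SimpleImputer(strategy='most_frequent')                         # option 1: mode
arbitrary_imputer = SimpleImputer(strategy='constant', fill_value='Missing')   # option 2

df[['city']] = mode_imputer.fit_transform(df[['city']])
```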

6. Common Techniques for numeric and categorical data :

1. Random Imputation :

Fill each missing cell with a value sampled at random from the rows where this column is present.

Benefit : the distribution (shape) of the column stays roughly the same. (The sketch after item 2 covers both techniques.)

2. Missing Value Indicator :

Keep one extra column that marks whether the value was present or not.


The model can then learn separate behaviour for rows where the value was available vs. missing, which can help improve accuracy.
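A sketch of both ideas on a hypothetical 'age' column (scikit-learn also offers sklearn.impute.MissingIndicator for the indicator part):

```python
import pandas as pd

df = pd.read_csv('data.csv')                       # hypothetical file

# Missing value indicator: an extra column marking where the value was absent.
df['age_was_missing'] = df['age'].isnull()

# Random sample imputation: fill NaNs with values drawn from the observed rows.
missing = df['age'].isnull()
sampled = df['age'].dropna().sample(missing.sum(), replace=True, random_state=42)
sampled.index = df.index[missing]                  # align with the rows to fill
df.loc[missing, 'age'] = sampled
```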

Now, the question is: how do we decide which technique is best for a particular use case?
Solution : use GridSearchCV. It will try all the permutations and combinations of imputation strategies and keep the one that scores best.
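A sketch of that idea, assuming a numeric feature matrix, a hypothetical data.csv with a 'target' column, and a classifier on the end of the pipeline:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

df = pd.read_csv('data.csv')                          # hypothetical file
X, y = df.drop(columns='target'), df['target']        # hypothetical target column

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Cross-validation tries each imputation strategy and keeps the best scorer.
grid = GridSearchCV(pipe,
                    param_grid={'imputer__strategy': ['mean', 'median', 'most_frequent']},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```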




KNN Imputer :

In all the above methods we used only the concerned column to fill in the missing values. Here, the whole row is used: a missing cell is filled from the nearest rows, where nearness is measured with Euclidean distance over the columns that are present.
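A minimal sketch with sklearn's KNNImputer, assuming the dataframe is all numeric:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('data.csv')                  # hypothetical file, numeric columns only

# Each missing cell is filled from the average of its 5 nearest rows,
# using Euclidean distance computed on the columns that are present.
imputer = KNNImputer(n_neighbors=5)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```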



Iterative Imputer





Step 1 :

Fill every missing cell with the mean of its column (an initial rough guess).


Step 2 : Set the current column's originally-missing cells back to NaN, keeping the mean-filled values in the other columns.


Step 3:

Treat the other columns as X and the current column as y, train a model on the complete rows, and predict the missing values.


Step 4-5 : Do the same for all the columns.


Step 6:

- Take the difference between the mean-imputed values and the values predicted in this iteration (in the example it comes out to 13).

- Then take the result of iteration 1 as the new starting point, repeat the whole process, and keep iterating until the difference between successive iterations is close to 0. At that point the imputer has converged and cannot improve any more.
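scikit-learn's IterativeImputer follows this loop; a minimal sketch, assuming numeric columns:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

df = pd.read_csv('data.csv')                  # hypothetical file, numeric columns only

# Each column with missing values is modelled from the others; the loop
# repeats until the imputed values stop changing (or max_iter is reached).
imputer = IterativeImputer(max_iter=10, random_state=42)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```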




Outliers



How to treat outliers?

1. Trimming:
Remove the outliers.
Attention : if there are too many outliers, your dataset will shrink a lot.
Pros : very fast.

2. Capping


Replace values above and below a certain range with the allowed maximum and minimum values. Let's say the range is -3 to 3 (see the sketch after this list).

Now, if an outlier is -5, it gets converted to -3.


3. Replace outliers with NaN values and then treat them as missing data.

4. Discretization: let's say 100 is an outlier; after discretization it falls into the bin (90-100), which is treated like any normal bin.
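A quick sketch of trimming vs capping on a hypothetical 'value' column, with bounds of -3 and 3 as in the example above:

```python
import pandas as pd

df = pd.read_csv('data.csv')                  # hypothetical file
lower, upper = -3, 3                          # bounds from whichever detection method you use

# Trimming: drop the rows that fall outside the bounds.
trimmed = df[df['value'].between(lower, upper)]

# Capping: clip the values into the bounds instead of dropping rows.
df['value'] = df['value'].clip(lower, upper)
```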

How to detect outliers?

1. Z-score 



2. IQR



3. Percentile based


------------------------------------------------------------------------------------------------

1. Z-score
(use on normally distributed, bell-shaped data)

A point is flagged as an outlier if it lies more than 3 standard deviations from the mean:


upper limit = mean + 3 * standard deviation


lower limit = mean - 3 * standard deviation
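For example, on a hypothetical 'height' column:

```python
import pandas as pd

df = pd.read_csv('data.csv')                  # hypothetical file

mean, std = df['height'].mean(), df['height'].std()
upper_limit = mean + 3 * std
lower_limit = mean - 3 * std

outliers = df[(df['height'] > upper_limit) | (df['height'] < lower_limit)]
```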




2. IQR : (use on skewed data)

IQR = Q3 - Q1
upper limit = Q3 + 1.5 * IQR
lower limit = Q1 - 1.5 * IQR
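For example, on a hypothetical skewed 'fare' column:

```python
import pandas as pd

df = pd.read_csv('data.csv')                  # hypothetical file

q1, q3 = df['fare'].quantile(0.25), df['fare'].quantile(0.75)
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

outliers = df[(df['fare'] > upper_limit) | (df['fare'] < lower_limit)]
```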



3. Percentile : 
We choose the upper and lower cut-offs at specific percentiles and keep them symmetric: for example, the top 5% and the bottom 5% are treated as outliers, not something lopsided like the top 10% and the bottom 3%.

Now, in this percentile method:
1. If you remove the flagged values, it is plain trimming.
2. If you cap them instead, it is called winsorization.
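A sketch of both on a hypothetical 'income' column (5% on each side):

```python
import pandas as pd

df = pd.read_csv('data.csv')                    # hypothetical file

lower_limit = df['income'].quantile(0.05)       # bottom 5%
upper_limit = df['income'].quantile(0.95)       # top 5%

# Trimming: drop the rows outside the percentile range.
trimmed = df[df['income'].between(lower_limit, upper_limit)]

# Winsorization: cap the rows instead of dropping them.
df['income'] = df['income'].clip(lower_limit, upper_limit)
```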

Feature Construction

Feature construction is the process of creating new input features from existing ones to improve model performance. 

🔧 Examples:

  1. Combining Features:

    • BMI = weight / (height^2) (constructed from weight and height)

Feature Splitting

Feature splitting refers to breaking a single feature into multiple features. This is useful when a column contains composite or encoded information.

🔧 Examples:

  1. Splitting a full name:

    • Name → First Name, Last Name

  2. Splitting a date:

    • 2025-07-27 → year, month, day
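A small pandas sketch of both ideas, using made-up columns:

```python
import pandas as pd

df = pd.DataFrame({
    'weight_kg': [70, 85],
    'height_m': [1.75, 1.80],
    'name': ['Ada Lovelace', 'Alan Turing'],
    'date': ['2025-07-27', '2024-12-01'],
})

# Feature construction: combine existing columns into a new one.
df['bmi'] = df['weight_kg'] / df['height_m'] ** 2

# Feature splitting: break composite columns apart.
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True)
df['date'] = pd.to_datetime(df['date'])
df['year'], df['month'], df['day'] = df['date'].dt.year, df['date'].dt.month, df['date'].dt.day
```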

Curse of Dimensionality :
- We want to select the optimal number of features: not too few and not too many. Past a certain point, adding more features no longer improves the model's performance; the computational cost keeps increasing, and performance can even get worse. This scenario is called the Curse of Dimensionality.

Solution : Dimensionality Reduction


PCA : Principal Component Analysis Technique


- It is an unsupervised technique (it works without labels)
- PCA converts data from high dimensions to the best possible lower dimensions
- For example, it can reduce 700-dimensional data to 3D, so it helps with visualization too

📌 What is PCA?

Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and data analysis.

It transforms the original features into a new set of uncorrelated features called principal components, ordered by how much variance they capture from the original data.


✅ Why is PCA Needed?

  1. Reduce Dimensionality:

    • Many datasets have too many features (high-dimensional), which can lead to the curse of dimensionality.

    • PCA reduces the number of features while retaining most of the information.

  2. Remove Redundancy:

    • If features are correlated, PCA combines them into fewer uncorrelated components, reducing redundancy.

  3. Improve Model Efficiency:

    • Fewer features = faster training and less overfitting.

  4. Visualization:

    • PCA can reduce data to 2D or 3D for easier visualization of complex datasets.


📈 Importance of Variance in PCA

  • Variance = Information.

    • PCA looks for the directions (axes) in which the data varies the most.

    • The first principal component captures the maximum variance, the second captures the next highest, and so on.









Notice that PC1 has higher variance compared to PC2.

What Are PC1 and PC2 (Green Lines)?

Even though the original features are "Radius" and "Area", PCA reorients the coordinate system to find new axes:

  • PC1 (Principal Component 1): Direction of maximum variance.

  • PC2 (Principal Component 2): Orthogonal (perpendicular) to PC1, captures remaining variance.

These green lines (PC1 and PC2) are linear combinations of the original features — but the transformation has already happened in this visual context. That’s why they don't align with "Radius" or "Area" axes.

🧠 Think of it like this:

You're rotating the coordinate system to a new angle that better captures the shape of the data.


📌 A common question: "So the features are not combined yet?"

Correct — before PCA, the features are independent (Radius, Area). But PCA creates the new features (PC1, PC2) by combining them:

  • PC1 = a1 * Radius + a2 * Area

  • PC2 = b1 * Radius + b2 * Area

So, in this plot, PCA has already been applied, and the green arrows show the new directions (eigenvectors).


Eigenvector for PC1 = [0.89, 0.45]  

=> PC1 = 0.89 * Radius + 0.45 * Area

📌 What is an Eigenvector?

An eigenvector is a special kind of vector that, when a matrix transformation is applied to it, does not change its direction—it only gets stretched or shrunk.


Normal vector: when the transformation is applied, both its direction and its length can change.



Eigenvector: when the transformation is applied, its direction stays the same; only its length changes.



If the eigenvector is stretched to twice its length, the eigenvalue is 2: the eigenvalue is the factor by which the eigenvector is scaled.


A 2D matrix has 2 eigenvectors, a 3D matrix has 3, and so on.
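A quick NumPy check with an illustrative 2x2 matrix (not from the notes):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])            # a simple stretching transformation

values, vectors = np.linalg.eig(A)
print(values)                          # eigenvalues, here 2.0 and 1.0
print(vectors)                         # eigenvectors as columns

v = vectors[:, 0]
print(A @ v, values[0] * v)            # A @ v equals lambda * v: direction unchanged, only scaled
```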

🔍 Intuition (Very Important for PCA):

Imagine a transformation applied to a 2D space — most vectors will rotate and change direction, but eigenvectors are the exception: they stay in the same direction, only their length changes.

In PCA :

  • Eigenvectors of this matrix are the principal directions (axes) where variance is maximized.

  • Eigenvalues tell how much variance is along each eigenvector.


🎯 Why Use These New Directions?

In the plot:

If you look only at the original (black) axes and imagine the variance along them, the variance on the x and y axes is about the same here, so we cannot tell which feature to keep and which to ignore; hence we rotate the axes.

  • You can see that the spread (variance) of data along PC1 is greater than along PC2.

  • So, PC1 captures the most important information.

  • That’s why the image suggests reducing 2D → 1D by keeping only PC1.



Step - By - Step solution to implement PCA :

step 1 : make the data mean-centred (subtract each column's mean from that column)



step 2 : Find covariance matrix



step 3 : Find eigen values and vectors

highest eigen value = highest variance = pc1


From 3D,

keep only pc1 - 1D
keep pc1 and pc2 - 2D


Assume we have 784 dimensions, so 784 eigenvalues (lambdas).

For each lambda we compute the percentage of variance it explains:

variance explained by component i = λᵢ / (λ₁ + λ₂ + ... + λ₇₈₄)

After that, we keep the largest lambdas until their cumulative sum reaches about 90% of the total variance.
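A NumPy sketch of the whole pipeline on random placeholder data (5 features instead of 784, just to keep it small):

```python
import numpy as np

X = np.random.rand(200, 5)                       # placeholder data, 5 features

# Step 1: mean-centre each column.
X_centred = X - X.mean(axis=0)

# Step 2: covariance matrix of the features.
cov = np.cov(X_centred, rowvar=False)

# Step 3: eigenvalues (variance per direction) and eigenvectors (directions).
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]                # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep enough components to explain ~90% of the variance.
explained = eigvals / eigvals.sum()
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1

X_reduced = X_centred @ eigvecs[:, :k]
print(k, X_reduced.shape)
```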


PCA does not work well on circular or otherwise non-linearly patterned data.
It works best where the data is linearly separable.
PCA is a linear technique: it finds directions (lines/planes) that maximize variance.


Linear Regression









Regression Metrics







R2 Score:




SSR : sum of squared errors around the regression line

SSM : sum of squared errors around the mean line

R² = 1 - SSR / SSM


Value of R²     | Meaning
1.0             | Perfect fit — predictions match actual values exactly
0.0             | Model does no better than predicting the mean
< 0.0           | Model performs worse than just using the mean
Between 0 and 1 | % of the variance explained by the model


The higher the score, the better the model.
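A tiny sketch comparing the manual formula with sklearn's r2_score on made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.2, 7.1, 8.7])

ssr = np.sum((y_true - y_pred) ** 2)            # errors around the regression line
ssm = np.sum((y_true - y_true.mean()) ** 2)     # errors around the mean line

r2 = 1 - ssr / ssm
print(r2, r2_score(y_true, y_pred))             # both give the same value
```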



Adjusted R²

📐 Formula:

Adjusted R² = 1 - [ (1 - R²) × (n - 1) / (n - p - 1) ]

Where:


  • R² : regular R² score

  • n : number of observations (rows)

  • p : number of features

Now, if a newly added feature column is not relevant:
R² barely changes, but p increases, so the denominator (n - p - 1) shrinks, the fraction grows, and the adjusted R² decreases.

But if the added feature is relevant:
R² increases noticeably, so the numerator (1 - R²)(n - 1) drops by more than the denominator shrinks; the fraction gets smaller, and since it is subtracted from 1, we end up with a better, higher adjusted R² score.
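A small helper, assuming you already have predictions and know how many features were used:

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1)."""
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Example: adjusted_r2(y_test, model.predict(X_test), n_features=X_test.shape[1])
```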





Multiple linear Regression : 


Now, to derive these coefficient and intercept values in closed form, we need a matrix inversion, which costs roughly O(p³) in the number of features. So if there are too many features, this takes a lot of computational effort.

So, gradient descent comes into the picture.
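A sketch of the closed-form (normal equation) solution on synthetic data; inverting the (p+1)×(p+1) matrix is the costly part:

```python
import numpy as np

X = np.random.rand(100, 3)                                     # synthetic features
y = 4 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(100)

Xb = np.c_[np.ones(len(X)), X]                                 # add the intercept column

# Normal equation: beta = (X^T X)^-1 X^T y
beta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print(beta)                                                     # [intercept, coef1, coef2, coef3]
```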



Gradient : derivative of the loss w.r.t. a weight. It tells us the direction in which the loss increases the fastest.
Gradient Descent : the method of updating model parameters to minimize the loss.
Learning Rate : controls the size of the steps during updates.
Local Minima : a point where the loss is low, but not the lowest.
Global Minimum : the absolute lowest point of the loss function.
Solutions for local minima : techniques like Adam, momentum, etc., help avoid bad local minima.
Convergence : the training process has reached a stable point; it is no longer making significant improvements in minimizing the loss function.


Problems with Gradient Descent :



Plateau : the slope is almost flat, so the steps become tiny and it takes a very long time to reach convergence.


Types : 


Variant                           | Data Used     | Update Frequency | Speed    | Memory Use | Best For
Batch Gradient Descent            | Whole dataset | Once per epoch   | Slowest  | Highest    | Small/medium datasets
Stochastic Gradient Descent (SGD) | One sample    | Every sample     | Fastest  | Lowest     | Online/streaming/large data
Mini-Batch Gradient Descent (MGD) | Small batch   | Every mini-batch | Balanced | Moderate   | Deep learning, large datasets
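A batch gradient descent sketch for the same linear regression set-up as above (synthetic data, MSE loss):

```python
import numpy as np

X = np.random.rand(100, 3)                                     # synthetic features
y = 4 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(100)
Xb = np.c_[np.ones(len(X)), X]                                 # intercept column

beta = np.zeros(Xb.shape[1])
lr = 0.1                                                        # learning rate

# Batch gradient descent: one parameter update per pass over the whole dataset.
for epoch in range(2000):
    error = Xb @ beta - y
    gradient = 2 / len(y) * Xb.T @ error                        # gradient of MSE w.r.t. beta
    beta -= lr * gradient

print(beta)                                                     # approaches the closed-form solution
```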



Polynomial Regression :

Linear Regression: p = β₀ + β₁x

Polynomial Regression: p = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ

n = degree of polynomial

-----------------------------------------

Scenario 1: Single Input Feature (x → p)

Degree 2 Polynomial (Quadratic)

p = β₀ + β₁x + β₂x²


Degree 4 Polynomial (Quartic)

p = β₀ + β₁x + β₂x² + β₃x³ + β₄x⁴

-----------------------------------------

Scenario 2: Two Input Features (x, y → p)


Degree 2 Polynomial (Quadratic Surface)

p = β₀ + β₁x + β₂y + β₃x² + β₄xy + β₅y²


Degree 4 Polynomial (Complex Surface)

p = β₀ + β₁x + β₂y + β₃x² + β₄xy + β₅y² + β₆x³ + β₇x²y + β₈xy² + β₉y³ + β₁₀x⁴ + β₁₁x³y + β₁₂x²y² + β₁₃xy³ + β₁₄y⁴

Terms by Degree:

  • Degree 0: β₀ (constant)
  • Degree 1: β₁x + β₂y (linear terms)
  • Degree 2: β₃x² + β₄xy + β₅y² (quadratic terms)
  • Degree 3: β₆x³ + β₇x²y + β₈xy² + β₉y³ (cubic terms)
  • Degree 4: β₁₀x⁴ + β₁₁x³y + β₁₂x²y² + β₁₃xy³ + β₁₄y⁴ (quartic terms)
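A sketch with scikit-learn, where PolynomialFeatures generates exactly these terms before a plain LinearRegression is fitted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.random.rand(100, 1) * 4 - 2                  # synthetic single feature in [-2, 2]
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + 0.1 * np.random.randn(100)

# degree controls how many polynomial terms are generated from x.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.named_steps['linearregression'].coef_)  # should be close to [0, 2, 3]
```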

Bias Variance TradeOff

We have to choose the model in the middle (balanced bias and variance); there are 3 techniques for that:


