ML Notes Personal2
Handling Mixed Variables | Feature Engineering
Mixed variable = numeric + categorical data in a single cell/column
Solution: split the mixed column into two separate columns, one numeric and one categorical, so each part can be processed with its own techniques.
Handling missing data:
- When less than ~5% of the data is missing across all columns, you can simply fill the gaps with the fillna() function.
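A minimal pandas sketch of this simple fill (the column name is made up for illustration):

```python
import pandas as pd

# Toy column with one missing value (hypothetical data)
df = pd.DataFrame({"age": [25, None, 40, 31]})

# With only a small fraction missing, a simple fill is usually fine
df["age"] = df["age"].fillna(df["age"].mean())
```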
3. Arbitrary Value Imputation:
You fill the missing values with a fixed, arbitrary constant (a word or a number).
If categorical:
use words like 'none', 'missing', 'other'
If numerical: 0, -1, etc.
When to use?
- when the missing data is not random
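A quick pandas sketch of this, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"income": [50000, None, 72000], "gender": ["M", None, "F"]})

# Numerical column: impute with an arbitrary sentinel value
df["income"] = df["income"].fillna(-1)

# Categorical column: impute with an arbitrary label
df["gender"] = df["gender"].fillna("missing")
```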
4. End of Distribution Imputation:
When it is hard to pick a good arbitrary value, replace nulls with a value from the far end (tail) of the distribution, for example mean + 3 * standard deviation.
When to use?
- when the missing data is not random
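A small sketch using the mean + 3 standard deviations rule mentioned above (column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"salary": [30000, 45000, None, 52000, 61000]})

# Pick a value at the far right tail of the distribution
end_value = df["salary"].mean() + 3 * df["salary"].std()
df["salary"] = df["salary"].fillna(end_value)
```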
5. For categorical data:
- Option 1: replace with the most frequent value (mode)
- Option 2: replace with an arbitrary word: 'none', 'missing', 'other'
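A one-liner sketch of option 1 (mode imputation) on a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Delhi", "Pune", None]})

# Option 1: fill with the most frequent category (mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])
```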
6. Common techniques for both numeric and categorical data:
1. Random Sample Imputation:
For each missing value, take a random value from the rows that do have data for this column.
Benefit: the shape (distribution) of the data stays roughly the same.
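A pandas sketch of random sample imputation (hypothetical column):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, None, 31, 28]})

# Draw as many observed values as there are missing cells
missing_mask = df["age"].isna()
samples = df["age"].dropna().sample(missing_mask.sum(), replace=True, random_state=42)
samples.index = df.index[missing_mask]        # align sampled values with the missing rows
df.loc[missing_mask, "age"] = samples
```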
2. Missing Value Indicator:
Keep one extra column (a 0/1 flag) that marks whether the value was present or not.
The model learns from both the filled-in value and the "was it missing?" flag, which can help improve accuracy when the missingness itself carries information.
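A small sketch of adding such an indicator before imputing (column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, None]})

# Add a 0/1 flag column before imputing, so the model still "sees" the missingness
df["age_was_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())
```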
3. Iterative Imputation (MICE-style):
Step 1: Fill every missing value with a simple estimate (e.g. the column mean).
Step 2: For one column, set its imputed cells back to NaN.
Step 3: Consider the other cols as X and the current col as y, train a model on the rows where y is known, and predict the missing values.
Step 4-5: Do it for all the cols.
Step 6: Take iteration I1 as input, repeat the process, and compare the new imputations with the previous ones; keep iterating until the difference is close to 0, meaning the imputations have converged and the model cannot improve them any more.
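scikit-learn provides this idea as IterativeImputer (an experimental API); a minimal sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental API)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 8.0, 9.0],
              [np.nan, 4.0, 7.0]])

# Each column with missing values is modelled from the other columns,
# and the round-robin regression repeats until the imputations stabilise.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```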
Outliers
One common treatment is capping (clipping) outliers at a boundary value. For example, if the lower cap is -3 and an outlier is -5, it will get converted to -3.
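A small sketch of IQR-based capping; the 1.5 * IQR bounds here are an assumption, not something fixed in these notes:

```python
import pandas as pd

s = pd.Series([-5, 1, 2, 2, 3, 3, 4, 50])

# Compute IQR-based caps and clip everything outside them
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
s_capped = s.clip(lower=lower, upper=upper)
```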
✅ Feature Construction
Feature construction is the process of creating new input features from existing ones to improve model performance.
🔧 Examples:
- Combining features:
  BMI = weight / (height²)
  (constructed from weight and height)
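A quick pandas sketch of this construction (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"weight_kg": [70, 85], "height_m": [1.75, 1.80]})

# Construct a new feature from two existing ones
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)
```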
✅ Feature Splitting
Feature splitting refers to breaking a single feature into multiple features. This is useful when a column contains composite or encoded information.
🔧 Examples:
- Splitting a full name:
  Name → First Name, Last Name
- Splitting a date:
  2025-07-27 → year, month, day
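A short pandas sketch of both splits (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"],
                   "date": ["2025-07-27", "2024-01-15"]})

# Split a full name into first and last name
df[["first_name", "last_name"]] = df["name"].str.split(" ", n=1, expand=True)

# Split a date into year, month, day
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
```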
📌 What is PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and data analysis.
It transforms the original features into a new set of uncorrelated features called principal components, ordered by how much variance they capture from the original data.
✅ Why is PCA Needed?
- Reduce Dimensionality:
  Many datasets have too many features (high-dimensional), which can lead to the curse of dimensionality.
  PCA reduces the number of features while retaining most of the information.
- Remove Redundancy:
  If features are correlated, PCA combines them into fewer uncorrelated components, reducing redundancy.
- Improve Model Efficiency:
  Fewer features = faster training and less overfitting.
- Visualization:
  PCA can reduce data to 2D or 3D for easier visualization of complex datasets.
📈 Importance of Variance in PCA
- Variance = Information.
- PCA looks for the directions (axes) in which the data varies the most.
- The first principal component captures the maximum variance, the second captures the next highest, and so on.
✅ What Are PC1 and PC2 (Green Lines)?
Even though the original features are "Radius" and "Area", PCA reorients the coordinate system to find new axes:
- PC1 (Principal Component 1): Direction of maximum variance.
- PC2 (Principal Component 2): Orthogonal (perpendicular) to PC1, captures the remaining variance.
These green lines (PC1 and PC2) are linear combinations of the original features — but the transformation has already happened in this visual context. That’s why they don't align with "Radius" or "Area" axes.
🧠 Think of it like this:
You're rotating the coordinate system to a new angle that better captures the shape of the data.
📌 Your Point: "The features are not combined yet?"
Correct: before PCA, the features stay separate (Radius, Area). But PCA creates the new features (PC1, PC2) by combining them:
- PC1 = a1 * Radius + a2 * Area
- PC2 = b1 * Radius + b2 * Area
So, in this plot, PCA has already been applied, and the green arrows show the new directions (eigenvectors).
Eigenvector for PC1 = [0.89, 0.45]
=> PC1 = 0.89 * Radius + 0.45 * Area
📌 What is an Eigenvector?
An eigenvector is a special kind of vector that, when a matrix transformation is applied to it, does not change its direction—it only gets stretched or shrunk.
A normal (non-eigen) vector changes direction when the transformation is applied; an eigenvector keeps its direction and only gets scaled.
If the eigenvector is stretched to twice its length, the eigenvalue is 2: the eigenvalue is the factor by which the vector is scaled along that direction.
A 2D space has 2 eigenvectors (for a symmetric matrix such as a covariance matrix), 3D has 3, and so on.
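A quick numpy check of this idea; the matrix below is just an illustrative stand-in for a covariance matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)       # e.g. [4. 2.]
print(eigenvectors)      # columns are the eigenvectors

# Applying A to an eigenvector only scales it; the direction is unchanged
v = eigenvectors[:, 0]
print(A @ v, eigenvalues[0] * v)   # the two results match
```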
🔍 Intuition (Very Important for PCA):
Imagine a transformation applied to a 2D space — most vectors will rotate and change direction, but eigenvectors are the exception: they stay in the same direction, only their length changes.
In PCA:
- Eigenvectors of the covariance matrix are the principal directions (axes) where variance is maximized.
- Eigenvalues tell how much variance lies along each eigenvector.
🎯 Why Use These New Directions?
In the plot:
If you look only at the original (black) axes and try to imagine the variance there, in this case the variance along the x and y axes is about the same, so we cannot tell which feature to keep and which to ignore; hence we rotate the axes.
- You can see that the spread (variance) of the data along PC1 is greater than along PC2.
- So, PC1 captures the most important information.
- That's why the image suggests reducing 2D → 1D by keeping only PC1.
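A minimal scikit-learn sketch of this 2D → 1D reduction; the "Radius"/"Area" data here is simulated for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two correlated features, standing in for "Radius" and "Area"
rng = np.random.default_rng(0)
radius = rng.normal(10, 2, size=200)
area = 3.14 * radius ** 2 + rng.normal(0, 5, size=200)
X = np.column_stack([radius, area])
# (In practice you would usually standardise the features before PCA.)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(pca.components_)                # rows are the eigenvectors (PC1, PC2 directions)
print(pca.explained_variance_ratio_)  # share of variance captured by each component

# Keep only PC1 to go from 2D to 1D
X_1d = X_pca[:, [0]]
```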
Linear Regression
| Value of R² | Meaning |
|---|---|
| 1.0 | Perfect fit: predictions match actual values exactly |
| 0.0 | Model does no better than predicting the mean |
| < 0.0 | Model performs worse than just using the mean |
| Between 0 and 1 | Fraction (%) of the variance explained by the model |
📐 Formula:
Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)
Where:
- R² : regular R² score
- n : number of observations (rows)
- k : number of features
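A tiny sketch computing R² and then adjusted R² from it (toy numbers; k is an illustrative feature count):

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 6.9, 9.3, 10.8]

r2 = r2_score(y_true, y_pred)

n, k = len(y_true), 2   # n = observations, k = features (illustrative)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adjusted_r2)
```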
Now, to derive these coefficient and intercept values in closed form (the normal equation), we need to perform a matrix inversion, which costs O(n³). So, if there are too many features, this carries a lot of computational cost; this is where gradient descent comes in.
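A minimal numpy sketch of this closed-form (normal equation) solution, β = (XᵀX)⁻¹ Xᵀ y, on toy data:

```python
import numpy as np

# Toy data: y ≈ 2 + 3x plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 + 3 * x + np.random.default_rng(0).normal(0, 0.1, size=5)

# Add a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equation: inverting XᵀX is the O(n³) step
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)   # approximately [intercept, slope]
```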
| Variant | Data Used | Update Frequency | Speed | Memory Use | Best For |
|---|---|---|---|---|---|
| Batch Gradient Descent | Whole dataset | Once per epoch | Slowest | Highest | Small/medium datasets |
| Stochastic Gradient Descent (SGD) | One sample | Every sample | Fastest | Lowest | Online/streaming/large data |
| Mini-Batch Gradient Descent (MGD) | Small batch | Every mini-batch | Balanced | Moderate | Deep learning, large datasets |
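A rough sketch of mini-batch gradient descent for linear regression; the batch size, learning rate, and epoch count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0 + rng.normal(0, 0.1, size=500)

w, b = np.zeros(3), 0.0
lr, batch_size, epochs = 0.05, 32, 50

for _ in range(epochs):
    idx = rng.permutation(len(X))                  # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                      # predictions minus targets
        w -= lr * (Xb.T @ err) / len(batch)        # gradient step on the weights
        b -= lr * err.mean()                       # gradient step on the intercept

print(w, b)   # should approach [1.5, -2.0, 0.7] and 4.0
```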
Linear Regression: p = β₀ + β₁x
Polynomial Regression: p = β₀ + β₁x + β₂x² + β₃x³ + ... + βₙxⁿ
n = degree of polynomial
-----------------------------------------
Scenario 1: Single Input Feature (x → p)
Degree 2 Polynomial (Quadratic)
p = β₀ + β₁x + β₂x²
Degree 4 Polynomial (Quartic)
p = β₀ + β₁x + β₂x² + β₃x³ + β₄x⁴
-----------------------------------------
Scenario 2: Two Input Features (x, y → p)
Degree 2 Polynomial (Quadratic Surface)
p = β₀ + β₁x + β₂y + β₃x² + β₄xy + β₅y²
Degree 4 Polynomial (Complex Surface)
p = β₀ + β₁x + β₂y + β₃x² + β₄xy + β₅y² + β₆x³ + β₇x²y + β₈xy² + β₉y³ + β₁₀x⁴ + β₁₁x³y + β₁₂x²y² + β₁₃xy³ + β₁₄y⁴
Terms by Degree:
- Degree 0: β₀ (constant)
- Degree 1: β₁x + β₂y (linear terms)
- Degree 2: β₃x² + β₄xy + β₅y² (quadratic terms)
- Degree 3: β₆x³ + β₇x²y + β₈xy² + β₉y³ (cubic terms)
- Degree 4: β₁₀x⁴ + β₁₁x³y + β₁₂x²y² + β₁₃xy³ + β₁₄y⁴ (quartic terms)
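A short scikit-learn sketch of fitting such a polynomial model (degree 2, two input features; the data is made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))   # two input features: x and y
p = 1 + 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 200)

# PolynomialFeatures expands [x, y] into [1, x, y, x², xy, y²]
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, p)

print(model.named_steps["linearregression"].coef_)
```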