ML Notes personal

 



L21:

Types of Analysis :

Type         | Variables | Purpose                          | Complexity
Univariate   | 1         | Describe a single variable       | Low
Bivariate    | 2         | Find relationships between pairs | Medium
Multivariate | 3+        | Understand complex interactions  | High

Univariate :

Only one variable; let's say we need to see the age of people travelling in business class.
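For instance, a quick sketch using seaborn's built-in Titanic dataset (an assumption, since the notes don't name a dataset), treating first class as the "business class" in question:

import seaborn as sns
import matplotlib.pyplot as plt

# Load seaborn's built-in Titanic dataset (assumed here for illustration)
df = sns.load_dataset('titanic')

# Univariate analysis: distribution of age for first-class passengers
first_class_ages = df[df['class'] == 'First']['age'].dropna()
first_class_ages.plot(kind='hist', bins=20, title='Age of first-class passengers')
plt.xlabel('Age')
plt.show()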




Bivariate :




Multivariate :


We are observing 3 variables at a time:

1. The higher the bill amount, the higher the tip.
2. Which gender gave the tip, male or female?
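A minimal sketch, assuming seaborn's built-in tips dataset (which has total_bill, tip and sex columns):

import seaborn as sns
import matplotlib.pyplot as plt

# seaborn's built-in tips dataset (assumed): total_bill, tip, sex
tips = sns.load_dataset('tips')

# Three variables at once: bill amount vs tip, coloured by gender
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='sex')
plt.show()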




L22:

Pandas Profiling 

    





It generates an insight report for each column along with plots.
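A minimal usage sketch; note that the package has been renamed, so recent installs import it as ydata_profiling rather than pandas_profiling:

import seaborn as sns
from ydata_profiling import ProfileReport  # formerly pandas_profiling

df = sns.load_dataset('titanic')

# One-line EDA: per-column statistics, distributions, correlations, missing values
profile = ProfileReport(df, title='Titanic Profiling Report')
profile.to_file('titanic_report.html')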



L23:

Feature Engineering



1. Feature Transformation




Feature scaling means bringing the values into some range, e.g. (-1 to 1).



2. Feature Construction :

Here, we have combined 2 columns (sibling-spouse and parent-child) into a single column 'family' by taking their sum, and then categorized family size into [small, medium, large].
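A rough sketch of this construction, assuming Titanic-style column names SibSp and Parch and made-up bin edges:

import pandas as pd

# Assumed Titanic-style columns: siblings/spouses and parents/children aboard
df = pd.DataFrame({'SibSp': [1, 0, 3, 0], 'Parch': [0, 0, 2, 0]})

# Feature construction: combine the two columns into one 'family' column
df['family'] = df['SibSp'] + df['Parch']

# Then categorize family size into small / medium / large
df['family_size'] = pd.cut(df['family'], bins=[-1, 1, 3, 10], labels=['small', 'medium', 'large'])
print(df)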

3. Feature Selection :

keep the necessary columns only and remove unnecessary ones.

-- Doubt in feature construction vs feature extraction, will be solved later on.

Feature Scaling

why?


Without scaling, the salary feature will dominate the model, so accuracy suffers because important columns with smaller magnitudes, such as age, get insufficient weight.
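A small sketch (with made-up numbers) showing how an unscaled salary column dominates a distance-based comparison, and how standardization fixes that:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up example: two people described by (age, salary)
X = np.array([[25, 50000],
              [45, 52000]])

# Without scaling, the distance is dominated almost entirely by salary
print(np.linalg.norm(X[0] - X[1]))            # ~2000.1, the age difference barely matters

# After standardization both features contribute on the same scale
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))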

 Types:


Normalization (Min-Max Scaling)

Scales data to a fixed range, typically [0, 1]

Formula: (x - min) / (max - min)

Characteristics:

  • Preserves the original distribution shape
  • Bounded to a specific range (usually 0-1)
  • Sensitive to outliers (they can compress the rest of the data)
  • Also called Min-Max scaling

Example:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Original data: [10, 20, 30, 40, 50]
# Normalized:    [0, 0.25, 0.5, 0.75, 1.0]
data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)  # sklearn expects 2D input

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

Standardization (Z-score Scaling)

Centers data around mean=0 with standard deviation=1

Formula: (x - mean) / standard_deviation

Characteristics:

  • Creates data with mean=0, std=1
  • Not bounded to a specific range
  • Less sensitive to outliers
  • Preserves the relationship between data points
  • Also called Z-score normalization

Example:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Original data: [10, 20, 30, 40, 50] (mean=30, std≈14.14)
# Standardized:  [-1.41, -0.71, 0, 0.71, 1.41]
data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)  # sklearn expects 2D input

scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

Geometric intuition

After standardization, the entire data is centred at (0, 0).

Original data positions vs positions after performing standardization:
Before standardization :


After Standardization :


So now both columns get equal priority, and the model performs better after scaling.

What happens with outliers?

The relative differences stay visible with standardization, while normalization squashes them together. Let's say 100 is an outlier.

NORMALIZATION (gets affected by outliers):

Without outlier: [1, 2, 3, 4, 5]
|----|----|----|----|
0   0.25   0.5   0.75   1.0          ← Good spread

With outlier: [1, 2, 3, 4, 100]
||||--------------------|
0  0.01  0.02  0.03          1.0     ← All normal values crushed together (not good; very difficult to distinguish 1 to 4)

STANDARDIZATION (doesn't get affected much by outliers):

Without outlier: [1, 2, 3, 4, 5]
|----|----|----|----|
-1.41   -0.71   0   0.71   1.41      ← Good spread

With outlier: [1, 2, 3, 4, 100]
|---|---|---|---|--------------------|
-0.54  -0.51  -0.49  -0.46      2.00 ← Still can see differences



Types of Normalization:




Geometric intuition for normalization :


Min-max scaling just places the entire data between 0 and 1.

Encoding Categorical Data

Ordinal, nominal and label encoding; fit vs transform.

What is fit vs transform?

Simple Explanation

Fit = learn the rules. Transform = apply the rules.

Easy Example with Min-Max Scaling

Let's say you have training data: [10, 20, 30, 40, 50]

FIT: Learning the Rules

python
scaler = MinMaxScaler()
scaler.fit([[10], [20], [30], [40], [50]])  # sklearn expects 2D input (n_samples, n_features)

The scaler learns:

  • Min = 10
  • Max = 50
  • Rule: (x - 10) / (50 - 10)

Nothing gets transformed yet! It just remembers the min and max.

TRANSFORM: Applying the Rules

python
scaler.transform([[10], [20], [30], [40], [50]])
# Result: [[0.], [0.25], [0.5], [0.75], [1.]]

Now it applies the rule it learned to actually scale the data.

Why Separate Them?

Training Data

python
train_data = [[10], [20], [30], [40], [50]]

# Learn rules from training data
scaler.fit(train_data)
# Apply rules to training data
train_scaled = scaler.transform(train_data)
# Result: [[0.], [0.25], [0.5], [0.75], [1.]]

New Test Data

python
test_data = [[15], [25], [35]]

# DON'T fit again! Use the same rules learned from training
test_scaled = scaler.transform(test_data)
# Result: [[0.125], [0.375], [0.625]]

Key Point: Test data uses the same min (10) and max (50) learned from training!

What if You Fit Again on Test Data? BAD!

python
# WRONG WAY:
scaler.fit(test_data)  # This learns NEW rules: min=15, max=35
test_scaled = scaler.transform(test_data)
# Result: [[0.], [0.5], [1.]]  # Different scale! Can't compare with training!

Quick Rule

  • Ordinal: Has order → Use ordinal encoding
    • Examples: Size (S,M,L), Grade (A,B,C), Rating (1-5 stars)
  • Nominal: No order → Use one-hot encoding
    • Examples: Color, Country, Gender, Brand
  • Label: Only for target variable in classification
    • Example: Disease type → [0, 1, 2] for prediction

Visual Summary

Ordinal:    Small → Medium → Large     (Order matters)
            0    →   1     →   2

Nominal:    Red, Blue, Green           (No order)
            [1,0,0] [0,1,0] [0,0,1]    (one-hot encoding)
            If there are 500 categories, you will need vectors of 500 dimensions

Label:      Cat, Dog, Fish             (Target only)
            0,   1,   2
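A minimal sketch of ordinal and label encoding in sklearn, with assumed toy data (one-hot encoding is covered in more detail below):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Toy data (assumed) to illustrate ordinal vs label encoding
df = pd.DataFrame({'size': ['S', 'M', 'L', 'M'], 'target': ['cat', 'dog', 'fish', 'cat']})

# Ordinal: pass the category order explicitly so S < M < L is preserved
ord_enc = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_encoded'] = ord_enc.fit_transform(df[['size']]).ravel()   # [0, 1, 2, 1]

# Label: only for the target variable
lbl_enc = LabelEncoder()
df['target_encoded'] = lbl_enc.fit_transform(df['target'])         # cat=0, dog=1, fish=2
print(df)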
Let's dive deeper into one-hot encoding.
The one-hot vector of each row always sums to 1, so the encoded columns become linearly dependent and the model starts picking up a relationship between them. To avoid this dependency among the input columns, we skip one column from the vector and keep n-1. This is the multicollinearity issue, and the situation is known as the dummy variable trap.
What if there are 1000 categories? We can keep the n most important ones (let's say 800 frequent/important categories) and group the rest, which are less frequent or unimportant, into a single 'other' column.

Lets take one example :




Notice the increased number of columns.
To drop the first column from this new dataset, use drop_first=True (pandas get_dummies) or drop='first' (sklearn OneHotEncoder).
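A small sketch of both approaches, using an assumed toy 'color' column:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Blue']})  # assumed toy column

# pandas way: drop the first dummy column to avoid the dummy variable trap
dummies = pd.get_dummies(df['color'], drop_first=True)
print(dummies)

# sklearn way: drop='first' does the same thing
# (newer sklearn versions also offer min_frequency / max_categories to group rare categories)
ohe = OneHotEncoder(drop='first')
print(ohe.fit_transform(df[['color']]).toarray())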


Manual way of encoding (tedious):

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Step by step - messy!
age_salary_scaled = StandardScaler().fit_transform(df[['Age', 'Salary']])
city_encoded = OneHotEncoder().fit_transform(df[['City']]).toarray()  # densify the sparse output
education_encoded = OrdinalEncoder().fit_transform(df[['Education']])

# Then combine them back - pain!
final_data = np.concatenate([age_salary_scaled, city_encoded, education_encoded], axis=1)
column transformer :
Applies different transformations to different columns in one step
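A sketch rewriting the manual snippet above with ColumnTransformer (assuming the same df with Age, Salary, City and Education columns):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Each tuple: (name, transformer, columns it applies to)
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age', 'Salary']),
    ('city', OneHotEncoder(), ['City']),
    ('edu', OrdinalEncoder(), ['Education'])
], remainder='passthrough')  # keep any other columns unchanged

final_data = preprocessor.fit_transform(df)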


Pipeline in ML :
It is the same idea as a LangChain pipeline: a chain of steps executed in order, where each step's output feeds the next.

1. Train Script Without Pipeline:

# Step 1: Data Cleaning
X_train_clean = clean_data(X_train)

# Step 2: Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_clean)

# Step 3: Model Training
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Step 4: Hyperparameter Tuning (Optional)
# Let's assume hyperparameter tuning is done here

2. Eval Script Without Pipeline:

# Step 1: Data Cleaning
X_test_clean = clean_data(X_test)

# Step 2: Feature Scaling
X_test_scaled = scaler.transform(X_test_clean)  # Use the same scaler from training

# Step 3: Model Evaluation
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

3. Train Script With Pipeline:


from sklearn.pipeline import Pipeline

# Define the pipeline with all the steps
pipeline = Pipeline([
    ('cleaner', DataCleaner()),           # Step 1: Custom Data Cleaning
    ('scaler', StandardScaler()),         # Step 2: Feature Scaling
    ('classifier', LogisticRegression())  # Step 3: Model Training
])

# Train the pipeline
pipeline.fit(X_train, y_train)

4. Eval Script With Pipeline:

# Evaluation using the pipeline (no need to manually transform)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Key Differences:

  • Without Pipeline: You manually apply each step (cleaning, scaling, and model training). During evaluation, you must repeat those steps (like transforming the test data).

  • With Pipeline: The pipeline takes care of the entire workflow. When calling pipeline.fit() and pipeline.predict(), it applies all the necessary steps automatically.


If the data is not normally distributed, some sort of transformation is required to bring it closer to a normal distribution.


What is a normal distribution, and how do we check for it?
- There is a concept called the QQ plot, which shows:
1. If the black line (the sample quantiles) lies exactly on the green reference line, the data is approximately normally distributed.
2. If the black line is far from the green line, the data is not normally distributed.
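A minimal QQ-plot sketch using scipy (the sample data here is assumed/random):

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

data = np.random.normal(loc=0, scale=1, size=500)   # assumed sample data

# QQ plot: sample quantiles (points) vs theoretical normal quantiles (reference line)
stats.probplot(data, dist='norm', plot=plt)
plt.show()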


For some algorithms to work properly, a (roughly) normal distribution is required.
Algorithm            | Requires Normality? | Notes
Linear Regression    | Residuals only      | For valid p-values and confidence intervals
Logistic Regression  | No (but helps)      | Performance improves with scaled/normalized data
LDA                  | ✅ Yes              | Assumes normal distribution within classes
QDA                  | ✅ Yes              | Like LDA but allows different covariances
Gaussian Naive Bayes | ✅ Yes              | Assumes Gaussian features per class
PCA                  | No (but helps)      | Normality helps in better variance capture
Tree-based models    | ❌ No               | Insensitive to feature distribution
Neural Networks      | ❌ No               | Benefit from normalization, not normality

Transforms:
1. log transform : log x (base 2 or 10, depends on requirement) 
[log transform is mostly used on right skewed data]
2. reciprocal : 1/x
3. square : x^2
4. sqrt : square root of x
sklearn's FunctionTransformer can apply any of these functions :

Example :

We can pass our own custom logic to FunctionTransformer as well.
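A minimal sketch of FunctionTransformer with a built-in function and with custom logic (the data is made up):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1], [10], [100], [1000]])   # right-skewed toy data (assumed)

# Log transform (log1p avoids problems with zero values)
log_tf = FunctionTransformer(func=np.log1p)
print(log_tf.fit_transform(X))

# Custom logic can be passed as well
custom_tf = FunctionTransformer(func=lambda x: 1 / (x + 1))   # reciprocal-style transform
print(custom_tf.fit_transform(X))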

Power Transformer :
1. Box-Cox
2. Yeo-Johnson
There is one more transformer called the quantile transformer.
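A minimal sketch of these transformers in sklearn (the skewed sample data is assumed):

import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

X = np.random.lognormal(size=(500, 1))   # skewed, strictly positive toy data (assumed)

# Box-Cox requires strictly positive data; Yeo-Johnson also handles zero/negative values
pt_boxcox = PowerTransformer(method='box-cox')
pt_yeojohnson = PowerTransformer(method='yeo-johnson')   # the default method

# QuantileTransformer maps the data to a uniform or normal distribution via quantiles
qt = QuantileTransformer(output_distribution='normal', n_quantiles=100)

X_bc = pt_boxcox.fit_transform(X)
X_yj = pt_yeojohnson.fit_transform(X)
X_qt = qt.fit_transform(X)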

Binning and Binarization

Binning :


Used to convert numerical data into categorical data.
What's the need? It can be useful in decision trees, for example.
How does it work? The data gets converted into categories.
- Let's say app A has been downloaded by 6,000 users and app B by 1,034,333 users; these figures can be simplified into categories such as 'Downloaded by 5k+ users' and 'Downloaded by 1M+ users'.

Another example : Age

# Example: A list of ages
import pandas as pd

ages = [5, 12, 18, 25, 30, 40, 50, 65, 70]

# Define the bin edges
bins = [0, 18, 35, 50, 100]

# Define the labels for each bin
labels = ['0-18', '19-35', '36-50', '51-100']

# Apply binning using pandas' cut function
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups)

Output:

[0-18, 0-18, 0-18, 19-35, 19-35, 36-50, 36-50, 51-100, 51-100]




1. Equal Width/Uniform Binning

It is called Equal width or uniform because the width of the interval is the same or uniform.


Suppose we have the age data of some people, where the maximum age in the data is 100 and the minimum is 0. If we take the number of bins as 10, then the width of each interval will be (100 - 0) / 10 = 10.

Here, the graph will be a histogram with 10 bins, each of width 10.



It is used because :-

  • It handles outliers, since outliers fall into the extreme (first or last) bins.
  • There is no change in the spread of the data.


2. Equal Frequency/Quantile Binning

It is also known as quantile binning. Here, we work with quantiles such as the 10th percentile, 20th percentile, and so on. Unlike equal-width binning, the width of the intervals is not the same.

Let's understand this with the help of an example, using the age data from above.

In quantile binning, let's say we want bins of 10 percentiles each.

The first 10% of observations might fall in the age range 0-16, the next 10% in 16-20 (so 0-20 covers 20% of the data), the next 10% in 20-22 (so 0-22 covers 30% of the data), and so on.

Each interval contains 10% of total observations.


Why use equal frequency :-

  • It works better with outliers.
  • It improves the value spread.




3. KMeans Binning
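The notes don't include code here; a minimal sketch using sklearn's KBinsDiscretizer, which implements equal-width ('uniform'), equal-frequency ('quantile') and k-means ('kmeans') binning, reusing the age values from the earlier example. With strategy='kmeans', the bin edges come from a 1D k-means clustering of the values:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[5], [12], [18], [25], [30], [40], [50], [65], [70]])

# strategy='uniform'  -> equal-width binning
# strategy='quantile' -> equal-frequency binning
# strategy='kmeans'   -> bin edges derived from 1D k-means cluster centres
for strategy in ['uniform', 'quantile', 'kmeans']:
    kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
    print(strategy, kbd.fit_transform(ages).ravel())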


4. Custom Binning
We use our domain knowledge to perform this binning, for example:
0-5 = kids
6-12 = children
13-18 = Teens
etc.

Binarization :

What is Binarization?

  • Binarization is the process of converting continuous numerical values into binary (0 or 1) values.

  • It applies a threshold to the data:

    • Values greater than the threshold become 1

    • Values less than or equal to the threshold become 0

🧪 Example: Exam Scores

Suppose you have students' exam scores, and you want to classify whether they passed (score ≥ 50) or failed (score < 50).

🔢 Original Data:

scores = [[30], [45], [60], [75], [90]]

⚙️ Binarization Code:

from sklearn.preprocessing import Binarizer

# Define threshold
binarizer = Binarizer(threshold=50)

# Apply binarization (fit does nothing for Binarizer; it is stateless)
binary_scores = binarizer.fit_transform(scores)
print(binary_scores)

📤 Output:

[[0.]
 [0.]
 [1.]
 [1.]
 [1.]]


Example : convert an RGB image to grayscale and then binarize it: set a pixel threshold, let's say 127; pixel values below 127 become 0 (black) and values above 127 become 1 (or 255, i.e. white).
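A tiny numpy sketch of this thresholding idea on an assumed 3x3 "image":

import numpy as np

# Toy 'grayscale image' (assumed): pixel intensities in 0-255
gray = np.array([[ 12, 200,  90],
                 [130,  40, 250],
                 [127, 128,   0]])

# Binarize with threshold 127: values above 127 become 255 (white), the rest 0 (black)
binary = np.where(gray > 127, 255, 0)
print(binary)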
