ML Notes personal

 



L21:

Types of Analysis :

Type         | Variables | Purpose                          | Complexity
Univariate   | 1         | Describe a single variable       | Low
Bivariate    | 2         | Find relationships between pairs | Medium
Multivariate | 3+        | Understand complex interactions  | High

Univariate :

Only one variable; let's say we need to see the age of people travelling in business class.
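For instance, a quick sketch using seaborn's built-in Titanic dataset (an assumption, since the notes don't name a dataset), treating first class as the "business class" in question:

import seaborn as sns
import matplotlib.pyplot as plt

# Load seaborn's built-in Titanic dataset (assumed here for illustration)
df = sns.load_dataset('titanic')

# Univariate analysis: distribution of age for first-class passengers
first_class_ages = df[df['class'] == 'First']['age'].dropna()
first_class_ages.plot(kind='hist', bins=20, title='Age of first-class passengers')
plt.xlabel('Age')
plt.show()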




Bivariate :




Multivariate :


We are observing 3 variables at a time:

1. The higher the bill amount, the higher the tip.
2. Which gender gave the tip, male or female?
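A minimal sketch, assuming seaborn's built-in tips dataset (which has total_bill, tip and sex columns):

import seaborn as sns
import matplotlib.pyplot as plt

# seaborn's built-in tips dataset (assumed): total_bill, tip, sex
tips = sns.load_dataset('tips')

# Three variables at once: bill amount vs tip, coloured by gender
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='sex')
plt.show()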




L22:

Pandas Profiling 

    





It generates an insight report for each column along with plots.
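A minimal usage sketch; note that the package has been renamed, so recent installs import it as ydata_profiling rather than pandas_profiling:

import seaborn as sns
from ydata_profiling import ProfileReport  # formerly pandas_profiling

df = sns.load_dataset('titanic')

# One-line EDA: per-column statistics, distributions, correlations, missing values
profile = ProfileReport(df, title='Titanic Profiling Report')
profile.to_file('titanic_report.html')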



L23:

Feature Engineering



1. Feature Transformation




Feature scaling means bringing the values into some range, e.g. (-1 to 1).



2. Feature Construction :

Here, we have combined 2 columns (sibling-spouse and parent-child) into a single column 'family' by taking their sum, and then categorized family size into [small, medium, large].
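A rough sketch of this construction, assuming Titanic-style column names SibSp and Parch and made-up bin edges:

import pandas as pd

# Assumed Titanic-style columns: siblings/spouses and parents/children aboard
df = pd.DataFrame({'SibSp': [1, 0, 3, 0], 'Parch': [0, 0, 2, 0]})

# Feature construction: combine the two columns into one 'family' column
df['family'] = df['SibSp'] + df['Parch']

# Then categorize family size into small / medium / large
df['family_size'] = pd.cut(df['family'], bins=[-1, 1, 3, 10], labels=['small', 'medium', 'large'])
print(df)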

3. Feature Selection :

keep the necessary columns only and remove unnecessary ones.

-- Doubt in feature construction vs feature extraction, will be solved later on.

Feature Scaling

why?


Without scaling, the salary feature will dominate the model, so accuracy suffers because important columns with smaller magnitudes, such as age, get insufficient weight.
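A small sketch (with made-up numbers) showing how an unscaled salary column dominates a distance-based comparison, and how standardization fixes that:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up example: two people described by (age, salary)
X = np.array([[25, 50000],
              [45, 52000]])

# Without scaling, the distance is dominated almost entirely by salary
print(np.linalg.norm(X[0] - X[1]))            # ~2000.1, the age difference barely matters

# After standardization both features contribute on the same scale
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))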

 Types:


Normalization (Min-Max Scaling)

Scales data to a fixed range, typically [0, 1]

Formula: (x - min) / (max - min)

Characteristics:

  • Preserves the original distribution shape
  • Bounded to a specific range (usually 0-1)
  • Sensitive to outliers (they can compress the rest of the data)
  • Also called Min-Max scaling

Example:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Original data: [10, 20, 30, 40, 50]
# Normalized:    [0, 0.25, 0.5, 0.75, 1.0]
data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)  # sklearn expects 2D input

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

Standardization (Z-score Scaling)

Centers data around mean=0 with standard deviation=1

Formula: (x - mean) / standard_deviation

Characteristics:

  • Creates data with mean=0, std=1
  • Not bounded to a specific range
  • Less sensitive to outliers
  • Preserves the relationship between data points
  • Also called Z-score normalization

Example:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Original data: [10, 20, 30, 40, 50] (mean=30, std≈14.14)
# Standardized:  [-1.41, -0.71, 0, 0.71, 1.41]
data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)  # sklearn expects 2D input

scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

Geometric intuition

After standardization, the entire data is centred at (0, 0).

Original data positions vs positions after performing standardization:
Before standardization :


After Standardization :


So now both columns get equal priority, and the model performs better after scaling.

What happens with outliers?

The relative differences stay visible with standardization, while normalization squashes them together. Let's say 100 is an outlier.

NORMALIZATION (gets affected by outliers):

Without outlier: [1, 2, 3, 4, 5]
|----|----|----|----|
0   0.25   0.5   0.75   1.0          ← Good spread

With outlier: [1, 2, 3, 4, 100]
||||--------------------|
0  0.01  0.02  0.03          1.0     ← All normal values crushed together (not good; very difficult to distinguish 1 to 4)

STANDARDIZATION (doesn't get affected much by outliers):

Without outlier: [1, 2, 3, 4, 5]
|----|----|----|----|
-1.41   -0.71   0   0.71   1.41      ← Good spread

With outlier: [1, 2, 3, 4, 100]
|---|---|---|---|--------------------|
-0.54  -0.51  -0.49  -0.46      2.00 ← Still can see differences



Types of Normalization:




Geometric intuition for normalization :


Min-max scaling just places the entire data between 0 and 1.

Encoding Categorical Data

Ordinal, nominal and label encoding; fit vs transform.

What is fit vs transform?

Simple Explanation

Fit = learn the rules. Transform = apply the rules.

Easy Example with Min-Max Scaling

Let's say you have training data: [10, 20, 30, 40, 50]

FIT: Learning the Rules

python
scaler = MinMaxScaler()
scaler.fit([[10], [20], [30], [40], [50]])  # sklearn expects 2D input (n_samples, n_features)

The scaler learns:

  • Min = 10
  • Max = 50
  • Rule: (x - 10) / (50 - 10)

Nothing gets transformed yet! It just remembers the min and max.

TRANSFORM: Applying the Rules

python
scaler.transform([[10], [20], [30], [40], [50]])
# Result: [[0.], [0.25], [0.5], [0.75], [1.]]

Now it applies the rule it learned to actually scale the data.

Why Separate Them?

Training Data

python
train_data = [[10], [20], [30], [40], [50]]

# Learn rules from training data
scaler.fit(train_data)
# Apply rules to training data
train_scaled = scaler.transform(train_data)
# Result: [[0.], [0.25], [0.5], [0.75], [1.]]

New Test Data

python
test_data = [[15], [25], [35]]

# DON'T fit again! Use the same rules learned from training
test_scaled = scaler.transform(test_data)
# Result: [[0.125], [0.375], [0.625]]

Key Point: Test data uses the same min (10) and max (50) learned from training!

What if You Fit Again on Test Data? BAD!

python
# WRONG WAY:
scaler.fit(test_data)  # This learns NEW rules: min=15, max=35
test_scaled = scaler.transform(test_data)
# Result: [[0.], [0.5], [1.]]  # Different scale! Can't compare with training!

Quick Rule

  • Ordinal: Has order → Use ordinal encoding
    • Examples: Size (S,M,L), Grade (A,B,C), Rating (1-5 stars)
  • Nominal: No order → Use one-hot encoding
    • Examples: Color, Country, Gender, Brand
  • Label: Only for target variable in classification
    • Example: Disease type → [0, 1, 2] for prediction

Visual Summary

Ordinal:    Small → Medium → Large     (Order matters)
            0    →   1     →   2

Nominal:    Red, Blue, Green           (No order)
            [1,0,0] [0,1,0] [0,0,1]    (one-hot encoding)
            If there are 500 categories, you will need vectors of 500 dimensions

Label:      Cat, Dog, Fish             (Target only)
            0,   1,   2
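A minimal sketch of ordinal and label encoding in sklearn, with assumed toy data (one-hot encoding is covered in more detail below):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Toy data (assumed) to illustrate ordinal vs label encoding
df = pd.DataFrame({'size': ['S', 'M', 'L', 'M'], 'target': ['cat', 'dog', 'fish', 'cat']})

# Ordinal: pass the category order explicitly so S < M < L is preserved
ord_enc = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_encoded'] = ord_enc.fit_transform(df[['size']]).ravel()   # [0, 1, 2, 1]

# Label: only for the target variable
lbl_enc = LabelEncoder()
df['target_encoded'] = lbl_enc.fit_transform(df['target'])         # cat=0, dog=1, fish=2
print(df)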
Let's dive deeper into one-hot encoding.
The one-hot vector of each row always sums to 1, so the encoded columns become linearly dependent and the model starts picking up a relationship between them. To avoid this dependency among the input columns, we skip one column from the vector and keep n-1. This is the multicollinearity issue, and the situation is known as the dummy variable trap.
What if there are 1000 categories? We can keep the n most important ones (let's say 800 frequent/important categories) and group the rest, which are less frequent or unimportant, into a single 'other' column.

Lets take one example :




Notice the increased number of columns.
To drop the first column from this new dataset, use drop_first=True (pandas get_dummies) or drop='first' (sklearn OneHotEncoder).
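A small sketch of both approaches, using an assumed toy 'color' column:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Blue']})  # assumed toy column

# pandas way: drop the first dummy column to avoid the dummy variable trap
dummies = pd.get_dummies(df['color'], drop_first=True)
print(dummies)

# sklearn way: drop='first' does the same thing
# (newer sklearn versions also offer min_frequency / max_categories to group rare categories)
ohe = OneHotEncoder(drop='first')
print(ohe.fit_transform(df[['color']]).toarray())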


Manual way of encoding (tedious):

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Step by step - messy!
age_salary_scaled = StandardScaler().fit_transform(df[['Age', 'Salary']])
city_encoded = OneHotEncoder().fit_transform(df[['City']]).toarray()  # densify the sparse output
education_encoded = OrdinalEncoder().fit_transform(df[['Education']])

# Then combine them back - pain!
final_data = np.concatenate([age_salary_scaled, city_encoded, education_encoded], axis=1)
column transformer :
Applies different transformations to different columns in one step
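A sketch rewriting the manual snippet above with ColumnTransformer (assuming the same df with Age, Salary, City and Education columns):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Each tuple: (name, transformer, columns it applies to)
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age', 'Salary']),
    ('city', OneHotEncoder(), ['City']),
    ('edu', OrdinalEncoder(), ['Education'])
], remainder='passthrough')  # keep any other columns unchanged

final_data = preprocessor.fit_transform(df)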


Pipeline in ML :
It is the same idea as a LangChain pipeline: a chain of steps executed in order, where each step's output feeds the next.

1. Train Script Without Pipeline:

# Step 1: Data Cleaning
X_train_clean = clean_data(X_train)

# Step 2: Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_clean)

# Step 3: Model Training
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Step 4: Hyperparameter Tuning (Optional)
# Let's assume hyperparameter tuning is done here

2. Eval Script Without Pipeline:

# Step 1: Data Cleaning
X_test_clean = clean_data(X_test)

# Step 2: Feature Scaling
X_test_scaled = scaler.transform(X_test_clean)  # Use the same scaler from training

# Step 3: Model Evaluation
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

3. Train Script With Pipeline:


from sklearn.pipeline import Pipeline

# Define the pipeline with all the steps
pipeline = Pipeline([
    ('cleaner', DataCleaner()),           # Step 1: Custom Data Cleaning
    ('scaler', StandardScaler()),         # Step 2: Feature Scaling
    ('classifier', LogisticRegression())  # Step 3: Model Training
])

# Train the pipeline
pipeline.fit(X_train, y_train)

4. Eval Script With Pipeline:

# Evaluation using the pipeline (no need to manually transform)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Key Differences:

  • Without Pipeline: You manually apply each step (cleaning, scaling, and model training). During evaluation, you must repeat those steps (like transforming the test data).

  • With Pipeline: The pipeline takes care of the entire workflow. When calling pipeline.fit() and pipeline.predict(), it applies all the necessary steps automatically.


If the data is not normally distributed, some sort of transformation is required to bring it closer to a normal distribution.


What is a normal distribution, and how do we check for it?
- There is a concept called the QQ plot, which shows:
1. If the black line (the sample quantiles) lies exactly on the green reference line, the data is approximately normally distributed.
2. If the black line is far from the green line, the data is not normally distributed.
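A minimal QQ-plot sketch using scipy (the sample data here is assumed/random):

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

data = np.random.normal(loc=0, scale=1, size=500)   # assumed sample data

# QQ plot: sample quantiles (points) vs theoretical normal quantiles (reference line)
stats.probplot(data, dist='norm', plot=plt)
plt.show()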


For some algorithms to work properly, a (roughly) normal distribution is required.
Algorithm            | Requires Normality? | Notes
Linear Regression    | Residuals only      | For valid p-values and confidence intervals
Logistic Regression  | No (but helps)      | Performance improves with scaled/normalized data
LDA                  | ✅ Yes              | Assumes normal distribution within classes
QDA                  | ✅ Yes              | Like LDA but allows different covariances
Gaussian Naive Bayes | ✅ Yes              | Assumes Gaussian features per class
PCA                  | No (but helps)      | Normality helps in better variance capture
Tree-based models    | ❌ No               | Insensitive to feature distribution
Neural Networks      | ❌ No               | Benefit from normalization, not normality

Transforms:
1. log transform : log x (base 2 or 10, depends on requirement) 
[log transform is mostly used on right skewed data]
2. reciprocal : 1/x
3. square : x^2
4. sqrt : square root of x
sklearn's FunctionTransformer can apply any of these functions :

Example :

We can pass our own custom logic to FunctionTransformer as well.
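A minimal sketch of FunctionTransformer with a built-in function and with custom logic (the data is made up):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1], [10], [100], [1000]])   # right-skewed toy data (assumed)

# Log transform (log1p avoids problems with zero values)
log_tf = FunctionTransformer(func=np.log1p)
print(log_tf.fit_transform(X))

# Custom logic can be passed as well
custom_tf = FunctionTransformer(func=lambda x: 1 / (x + 1))   # reciprocal-style transform
print(custom_tf.fit_transform(X))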

Power Transformer :
1. Box-Cox
2. Yeo-Johnson
There is one more transformer called the quantile transformer.
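A minimal sketch of these transformers in sklearn (the skewed sample data is assumed):

import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

X = np.random.lognormal(size=(500, 1))   # skewed, strictly positive toy data (assumed)

# Box-Cox requires strictly positive data; Yeo-Johnson also handles zero/negative values
pt_boxcox = PowerTransformer(method='box-cox')
pt_yeojohnson = PowerTransformer(method='yeo-johnson')   # the default method

# QuantileTransformer maps the data to a uniform or normal distribution via quantiles
qt = QuantileTransformer(output_distribution='normal', n_quantiles=100)

X_bc = pt_boxcox.fit_transform(X)
X_yj = pt_yeojohnson.fit_transform(X)
X_qt = qt.fit_transform(X)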

Binning and Binarization

Binning :


Used to convert numerical data into categorical data.
What's the need? It can be useful in decision trees, for example.
How does it work? The data gets converted into categories.
- Let's say app A has been downloaded by 6,000 users and app B by 1,034,333 users; these figures can be simplified into categories such as 'Downloaded by 5k+ users' and 'Downloaded by 1M+ users'.

Another example : Age

# Example: A list of ages
import pandas as pd

ages = [5, 12, 18, 25, 30, 40, 50, 65, 70]

# Define the bin edges
bins = [0, 18, 35, 50, 100]

# Define the labels for each bin
labels = ['0-18', '19-35', '36-50', '51-100']

# Apply binning using pandas' cut function
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups)

Output:

[0-18, 0-18, 0-18, 19-35, 19-35, 36-50, 36-50, 51-100, 51-100]




1. Equal Width/Uniform Binning

It is called Equal width or uniform because the width of the interval is the same or uniform.


Suppose we have the age data of some people, where the maximum age in the data is 100 and the minimum is 0. If we take the number of bins as 10, then the width of each interval will be (100 - 0) / 10 = 10.

Here, the graph will be a histogram with 10 bins, each of width 10.



It is used because :-

  • It handles outliers, since outliers fall into the extreme (first or last) bins.
  • There is no change in the spread of the data.


2. Equal Frequency/Quantile Binning

It is also known as quantile binning. Here, we work with quantiles such as the 10th percentile, 20th percentile, and so on. Unlike equal-width binning, the width of the intervals is not the same.

Let's understand this with the help of an example, using the age data from above.

In quantile binning, let's say we want bins of 10 percentiles each.

The first 10% of observations might fall in the age range 0-16, the next 10% in 16-20 (so 0-20 covers 20% of the data), the next 10% in 20-22 (so 0-22 covers 30% of the data), and so on.

Each interval contains 10% of total observations.


Why use equal frequency :-

  • It works better with outliers.
  • It improves the value spread.




3. KMeans Binning
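The notes don't include code here; a minimal sketch using sklearn's KBinsDiscretizer, which implements equal-width ('uniform'), equal-frequency ('quantile') and k-means ('kmeans') binning, reusing the age values from the earlier example. With strategy='kmeans', the bin edges come from a 1D k-means clustering of the values:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[5], [12], [18], [25], [30], [40], [50], [65], [70]])

# strategy='uniform'  -> equal-width binning
# strategy='quantile' -> equal-frequency binning
# strategy='kmeans'   -> bin edges derived from 1D k-means cluster centres
for strategy in ['uniform', 'quantile', 'kmeans']:
    kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
    print(strategy, kbd.fit_transform(ages).ravel())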


4. Custom Binning
We use our domain knowledge to perform this binning, for example:
0-5 = kids
6-12 = children
13-18 = Teens
etc.

Binarization :

What is Binarization?

  • Binarization is the process of converting continuous numerical values into binary (0 or 1) values.

  • It applies a threshold to the data:

    • Values greater than the threshold become 1

    • Values less than or equal to the threshold become 0

🧪 Example: Exam Scores

Suppose you have students' exam scores, and you want to classify whether they passed (score ≥ 50) or failed (score < 50).

🔢 Original Data:

scores = [[30], [45], [60], [75], [90]]

⚙️ Binarization Code:

from sklearn.preprocessing import Binarizer

# Define threshold
binarizer = Binarizer(threshold=50)

# Apply binarization (fit does nothing for Binarizer; it is stateless)
binary_scores = binarizer.fit_transform(scores)
print(binary_scores)

📤 Output:

[[0.]
 [0.]
 [1.]
 [1.]
 [1.]]


Example : convert an RGB image to grayscale and then binarize it: set a pixel threshold, let's say 127; pixel values below 127 become 0 (black) and values above 127 become 1 (or 255, i.e. white).
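A tiny numpy sketch of this thresholding idea on an assumed 3x3 "image":

import numpy as np

# Toy 'grayscale image' (assumed): pixel intensities in 0-255
gray = np.array([[ 12, 200,  90],
                 [130,  40, 250],
                 [127, 128,   0]])

# Binarize with threshold 127: values above 127 become 255 (white), the rest 0 (black)
binary = np.where(gray > 127, 255, 0)
print(binary)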
