Types of Neural Networks
Perceptron :
x1, x2 = inputs
w1, w2 = weights
b = bias
z = w1*x1 + w2*x2 + b (the weighted sum of the inputs plus the bias)
Now, z is passed to an activation function.
For example, with a step function the output will be 0 or 1.
How to use it?
You train the model to find the values of w1, w2 and b.
Perceptron is a line :
It's a binary classifier, that divides data into 2 regions
No matter how many features we have, it will always divide the data into 2 parts
Limitation : Perceptron works only on linearly separable data.
Code :
Prediction using perceptron
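A minimal sketch of prediction with a trained perceptron, using NumPy; the weight and bias values below are made up purely for illustration:

```python
import numpy as np

def step(z):
    # Step activation: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def perceptron_predict(x, w, b):
    # z = weighted sum of inputs plus bias, then passed through the step function
    z = np.dot(w, x) + b
    return step(z)

# Hypothetical trained values for w1, w2 and b
w = np.array([0.4, -0.2])
b = 0.1
print(perceptron_predict(np.array([1.0, 2.0]), w, b))   # -> 1
```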
How do we find the correct values for weights?
Step 1 : take random values for each weight
Step 2 : randomly pick one data point; if the point is in the correct region, do nothing, else move the line
Step 3 : repeat step 2 n times (e.g. 1000)
How do we know whether a point is in the correct region?
- we know that blue points should be in the -ve region and green points in the +ve region
- so, we just have to transform the line accordingly
How do we move towards correct values of A,B,C?
1. If you change C, the line shifts parallel to itself
2. If you change A (the coefficient of x), the line moves along the x-axis; the y-intercept stays the same
3. Similarly, changing B (the coefficient of y) moves the line along the y-axis
Example :
If you want to push a point into the -ve region, you subtract the point's coordinates from the line's coefficients
If you want to push a point into the +ve region, you add the point's coordinates to the line's coefficients
But this way the line moves very drastically, so we scale the update with a learning rate.
Algorithm :
Instead of using these 2 if-conditions, the update can be simplified into 1 formula (see the sketch after the explanation below)
Explanation :
3rd row = green point
4th row = red point
This method will find a line, but it is not guaranteed to be the best-fit line, because multiple separating lines can exist in the empty space between the 2 regions.
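A minimal sketch of this training loop (the "perceptron trick"), assuming 0/1 labels and the single update rule w = w + lr * (y - y_hat) * x; the variable names and defaults are illustrative:

```python
import numpy as np

def perceptron_trick(X, y, lr=0.1, epochs=1000):
    # X: (n_samples, n_features), y: 0/1 labels
    # Add a column of 1s so the bias is learned as just another weight
    X = np.insert(X, 0, 1, axis=1)
    w = np.ones(X.shape[1])                       # step 1: some starting weights
    for _ in range(epochs):                       # step 3: repeat n times
        i = np.random.randint(len(X))             # step 2: pick a random point
        y_hat = 1 if np.dot(w, X[i]) >= 0 else 0
        # One formula instead of two if-conditions: does nothing when y == y_hat,
        # adds the point for a false negative, subtracts it for a false positive
        w = w + lr * (y[i] - y_hat) * X[i]
    return w
```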
Perceptron loss function
So, how do we find the values such that this loss becomes minimum?
Intuition :
Assume that w2 and b are constant; then, for w1, we have to find the point where L is minimum.
For that , we will use gradient descent.
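A toy sketch of that idea: gradient descent on a single weight w1 while the others are held constant. The loss used here is a made-up one-dimensional function, chosen only to show the update rule:

```python
# L(w1) = (w1 - 3)^2 is a made-up loss with its minimum at w1 = 3.
def dL_dw1(w1):
    return 2 * (w1 - 3)          # derivative of the made-up loss

w1, lr = 0.0, 0.1
for _ in range(100):
    w1 = w1 - lr * dL_dw1(w1)    # step in the direction opposite to the gradient
print(round(w1, 4))              # ends up very close to 3, where L is minimum
```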
Problem with Perceptron :
- Works only on linearly separable data
Example : the XOR dataset, where the perceptron fails
MLP - Multi layer perceptron :
Notation for a weight w(k, i, j) :
k = layer number this weight is going to
i = "from" node
j = "to" node
For a bias b(i, j) : i = layer number, j = node number.
Perceptron with Sigmoid
Intuition :
how do we overlap?

so, for this diagram, MLP will look like below :
In multi-class classification, there will be multiple nodes in the output layer, e.g. dog, cat, cow.
Hidden layers with the correct activation function can identify any data pattern
Trainable parameters in Neural network
What is it? - the total number of weights + biases across all layers
Formula : Σ over layers of (no_of_nodes_in_current_layer * no_of_nodes_in_next_layer) + no_of_nodes_in_next_layer (the biases)
26 trainable params in above image
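A small sketch of that formula in code; the [3, 4, 2] architecture below is an assumption on my part that happens to give the 26 parameters mentioned for the image:

```python
def trainable_params(layer_sizes):
    # Sum over consecutive layer pairs of (weights) + (biases)
    # = n_current * n_next + n_next
    total = 0
    for n_cur, n_next in zip(layer_sizes, layer_sizes[1:]):
        total += n_cur * n_next + n_next
    return total

# Assuming the network in the image has 3 inputs, 4 hidden nodes and 2 outputs
print(trainable_params([3, 4, 2]))   # (3*4 + 4) + (4*2 + 2) = 26
```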
Once the training is done, make prediction
cost function vs loss function
loss function : computed on a single row, applied for each row
cost function : computed over a batch of data (e.g. the average of the per-row losses)
MAE (Mean Absolute Error)
BCE (Binary Cross-Entropy)
CCE (Categorical Cross-Entropy)
SCE (Sparse Categorical Cross-Entropy)
Task Type      | Loss Function | Output Activation Function
Regression     | MSE / MAE     | Linear (None)
Binary Class.  | BCE           | Sigmoid
Multi-class    | CCE / SCE     | Softmax
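A hedged Keras sketch of how one row of this table maps to a model; the layer sizes are made up, and the comments show how the other rows would change only the last layer and the loss:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Binary classification row of the table: sigmoid output + BCE loss
model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(4,)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# For the other rows, only the last layer and the loss change:
#   Regression  -> Dense(1)                        + loss='mse' (or 'mae')
#   Multi-class -> Dense(n, activation='softmax')  + loss='categorical_crossentropy'
#                  (or 'sparse_categorical_crossentropy' for integer labels)
```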
Backpropagation
First 3 done
Notice that, inside each iteration, step (c) will be performed 9 times, as we have 9 weights
Finally, it will look like this :

Memoization :
It is the concept of using extra memory to save computation time: intermediate results are stored and reused instead of being recomputed (backpropagation reuses the stored activations and derivatives in this way).
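A tiny illustration of memoization in Python (unrelated to neural networks, just to show the trade-off of extra memory for far fewer computations):

```python
from functools import lru_cache

@lru_cache(maxsize=None)      # store every result: extra memory, far less recomputation
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))               # instant, because each sub-result is computed only once
```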
Types of Gradient Descent
Batch/Stochastic/Mini batch
Let's say epochs = 10 and the dataset has 50 rows
Batch : all rows (1 to 50) are used together for each weight update, once per epoch
=> 10 weight updates in total
Stochastic : same data, but shuffled before each pass, and the weights are updated after every single row => 50 * 10 = 500 weight updates
Stochastic is faster than batch because it converges and gives better results in fewer epochs
The spiky nature of SGD has one drawback : it oscillates around the actual minimum and gives an approximate value, not the exact one
Mini batch : update the weights after every small batch of rows, a middle ground between the two (see the sketch below)
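A hedged Keras sketch of the three variants; the toy data (50 rows, as in the example above) and the model are made up, and the variant is controlled purely through batch_size:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy data: 50 rows, as in the example above
X = np.random.rand(50, 2)
y = (X.sum(axis=1) > 1).astype(int)

model = keras.Sequential([layers.Dense(4, activation='relu', input_shape=(2,)),
                          layers.Dense(1, activation='sigmoid')])
model.compile(optimizer='sgd', loss='binary_crossentropy')

# Batch GD: the whole dataset per update      -> 10 updates over 10 epochs
model.fit(X, y, epochs=10, batch_size=len(X), verbose=0)

# Stochastic GD: one shuffled row per update  -> 50 * 10 = 500 updates
model.fit(X, y, epochs=10, batch_size=1, shuffle=True, verbose=0)

# Mini-batch GD: e.g. 10 rows per update      -> 5 * 10 = 50 updates
model.fit(X, y, epochs=10, batch_size=10, verbose=0)
```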
The weight barely changes (e.g. from 1 to 0.999), which does not help minimize the loss (the vanishing gradient problem)
1) use fewer hidden layers
One more problem with Gradient
Problems in Neural Networks and its Solutions
Overfitting in NN- Solutions :
Dropouts :
For each epoch, you randomly switch off some nodes in the input layer and the hidden layers.
p = 0.5 means we switch off 50% of the nodes at each layer
During prediction, we use all the nodes.
During testing, we multiply the weights by 0.75 when the node was present only 75% of the time and missing 25% of the time (i.e. p = 0.25).
p value :
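A minimal Keras sketch of dropout with made-up layer sizes; note that Keras applies the rescaling during training ("inverted dropout"), so no manual weight scaling is needed at test time:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,)),
    layers.Dropout(0.5),    # switch off 50% of these nodes on every training update
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.25),   # drop 25%, i.e. each node is present 75% of the time
    layers.Dense(1, activation='sigmoid'),
])
```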
Add a penalty to the cost function (L1 / L2 regularization)
This penalty helps reduce the weights, hence reducing overfitting. L2 is preferred over L1 here because with L1 there is a chance that weights become exactly 0; with L2 they become nearly 0, but not exactly 0.
Orange curve = regularized weights
Blue curve = normal weights
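A hedged Keras sketch of adding an L2 penalty to a layer; the layer sizes and the penalty strength 0.01 are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# L2 penalty added to the cost function; weights are pushed close to 0 but rarely
# become exactly 0. Swap in regularizers.l1(...) for the L1 variant.
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,),
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(1, activation='sigmoid'),
])
```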
Why do we need activation functions?
- to introduce non-linearity into the network (without them, stacked linear layers collapse into a single linear model)
Activation Functions :
1. Sigmoid - used for binary classification
- non-linear
- differentiable
Disadvantages : vanishing gradient problem, so it is mostly used only in the output layer
- computationally expensive because of the exponential in the formula
- not zero-centered, which means that in one update you can only add to all the weights or subtract from all of them, not both, so training is slow
2. tanh
range(-1,1)
3. ReLU
Not zero-centered : this problem is solved by Batch Normalization
If more than 50% of the neurons die, it's the dying ReLU problem
Why dead?
- the problem occurs when z is -ve : the output is 0 and so is the gradient, so the weights stop updating
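A small NumPy sketch of the three activations discussed above, just to make their ranges and the z < 0 behaviour of ReLU concrete:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # range (0, 1), saturates for large |z|

def tanh(z):
    return np.tanh(z)             # range (-1, 1), zero-centered

def relu(z):
    return np.maximum(0, z)       # 0 for z < 0, so the gradient there is 0 (dying ReLU)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```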
Weight Initialization :
What not to do?
1. Never keep all weights as 0 - the derivatives will be 0 and hence the weights never change
2. Never keep the same value for all weights - in this case most of the terms have the same value, so z11 = z12 and a11 = a12; therefore, in the next layer, no matter how many neurons you take, they all behave as a single neuron (the symmetry is never broken), and the model is wrong
3. Never randomly initialize weights with very small values - vanishing gradient and slow convergence
4. Never randomly initialize weights with very large values - vanishing gradient (saturation) and slow convergence, plus spikes and unstable training because the gradient values differ wildly
What to do?
1. Xavier initialization - normal
All these formulas have been derived by researchers through experiments.
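A hedged Keras sketch of picking initializers per layer; the layer sizes are made up, and the usual pairing (Xavier/Glorot with tanh or sigmoid, He with ReLU) is assumed:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Xavier (Glorot) initialization is typically paired with tanh/sigmoid,
# He initialization with ReLU; both have 'normal' and 'uniform' variants.
model = keras.Sequential([
    layers.Dense(64, activation='tanh', input_shape=(10,),
                 kernel_initializer='glorot_normal'),
    layers.Dense(32, activation='relu',
                 kernel_initializer='he_normal'),
    layers.Dense(1, activation='sigmoid'),
])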
Batch Normalization
g(z) is an activation function
Why are we using gamma and beta? Isn't that the opposite of normalization? Because it gives the NN flexibility: not every data distribution benefits from normalization, so the network can learn to keep it or undo it.
Gamma and beta are learned during training, via backpropagation.
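A minimal Keras sketch of where the BatchNormalization layer sits relative to z and g(z); the layer sizes are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, input_shape=(20,)),
    layers.BatchNormalization(),   # normalizes z, then rescales with learnable gamma and beta
    layers.Activation('relu'),     # g(z) applied after the normalization
    layers.Dense(1, activation='sigmoid'),
])
```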
Advantages of Batch Normalization :
1. same LR throughout the training
2. local minima
Solution :
EWMA (Exponentially Weighted Moving Average)
Remember one thing : newer points get higher weightage than older points
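A short sketch of EWMA as code, using the standard recurrence v_t = beta * v_(t-1) + (1 - beta) * x_t; beta = 0.9 and the sample points are made up:

```python
def ewma(points, beta=0.9):
    # v_t = beta * v_(t-1) + (1 - beta) * x_t
    # The newest point gets weight (1 - beta); older points decay by beta at every step.
    v, out = 0.0, []
    for x in points:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return out

print(ewma([10, 12, 9, 11, 30]))   # the last (newest) value pulls the average up the most
```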
Optimization Technique
Keras Tuner -
works like GridSearchCV, but for neural networks
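A hedged Keras Tuner sketch; the search space (units, learning rate), the input shape and the placeholder data names X_train/y_train/X_val/y_val are all assumptions for illustration:

```python
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential([
        layers.Dense(hp.Int('units', min_value=32, max_value=128, step=32),
                     activation='relu', input_shape=(20,)),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer=keras.optimizers.Adam(hp.Choice('lr', [1e-2, 1e-3, 1e-4])),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model

tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=5)
# tuner.search(X_train, y_train, epochs=10, validation_data=(X_val, y_val))
# best_model = tuner.get_best_models(num_models=1)[0]
```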
CNN
CNN will work better than ANN for images
Why is ANN not good for images?
Too many weights and a high chance of overfitting
How will we know whether it is an edge or not?
A drastic change in colour between 2 adjacent pixels
example taken of horizontal edge detection filter
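A small NumPy/SciPy sketch of that idea with a made-up image (dark top half, bright bottom half) and a simple horizontal edge-detection kernel:

```python
import numpy as np
from scipy.signal import convolve2d

# Horizontal edge-detection kernel: responds strongly where the colour changes
# drastically between adjacent rows of pixels.
kernel = np.array([[ 1,  1,  1],
                   [ 0,  0,  0],
                   [-1, -1, -1]])

image = np.zeros((6, 6))
image[3:, :] = 255          # top half dark, bottom half bright -> one horizontal edge

edges = convolve2d(image, kernel, mode='valid')
print(edges)                # large magnitudes only around the edge row
```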
- the original image size is reduced, and the middle pixels appear in more filter windows, so they have more impact than the pixels at the border
- so we should apply padding
If you apply the filter on the image below, the row with index 2 is repeated in every filter window, so it has a higher impact than the other rows.
without padding : size is decreasing
with padding : size remains same
stride = step/jump
stride=1
stride=2
the larger the stride, the smaller the output
size will decrease
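A hedged Keras sketch of how padding and stride change the output size; the 28x28 input and filter counts are made up:

```python
from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(28, 28, 1))
a = layers.Conv2D(8, (3, 3), padding='valid')(inp)              # no padding   -> 26x26
b = layers.Conv2D(8, (3, 3), padding='same')(inp)               # with padding -> 28x28
c = layers.Conv2D(8, (3, 3), padding='valid', strides=2)(inp)   # stride 2     -> 13x13
print(a.shape, b.shape, c.shape)
```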
pooling makes the detected features location-independent
what is convolution?
It is just a filter that is slid over the image
CNN Architecture
LeNet-5 Architecture
Weights, biases and training in a CNN : all of these apply to the filters (the filter values are the learned parameters)
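A hedged Keras sketch of a LeNet-5-style stack, assuming 32x32 grayscale inputs and 10 classes:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A LeNet-5-style stack: conv + pooling blocks, then fully connected layers.
model = keras.Sequential([
    layers.Conv2D(6, (5, 5), activation='tanh', input_shape=(32, 32, 1)),
    layers.AveragePooling2D((2, 2)),
    layers.Conv2D(16, (5, 5), activation='tanh'),
    layers.AveragePooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(120, activation='tanh'),
    layers.Dense(84, activation='tanh'),
    layers.Dense(10, activation='softmax'),
])
model.summary()   # the trainable parameters live in the filters and the dense layers
```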
Difference between ANN and CNN
ANN : no. of trainable params depends on input size
CNN : no. of trainable params depends on the filter size, so no matter how big the input is, the number of weights stays the same
Freeze the conv layers and train only the FC layers
This is how you can freeze layers starting from a specific layer (see the sketch below)
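A hedged Keras sketch of freezing a pretrained convolutional base and training only the FC head; VGG16, the input size and the head sizes are assumptions for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze all conv layers

# To freeze only up to a specific layer instead:
# for layer in base.layers[:10]:
#     layer.trainable = False

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),   # only these FC layers are trained
    layers.Dense(10, activation='softmax'),
])
```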
So, we need non-linear (branching, non-sequential) model architectures for multiple use cases.
Example :
Example : multiple outputs - age and some classification
Example : multiple inputs
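A hedged sketch of such a branching model with the Keras functional API; the input size, head sizes and output names ('age', 'label') are made up for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

# One shared body, two heads: a regression output (age) and a classification output.
inp = keras.Input(shape=(128,))
x = layers.Dense(64, activation='relu')(inp)
age = layers.Dense(1, name='age')(x)                            # regression head
label = layers.Dense(3, activation='softmax', name='label')(x)  # classification head

model = keras.Model(inputs=inp, outputs=[age, label])
model.compile(optimizer='adam',
              loss={'age': 'mse', 'label': 'sparse_categorical_crossentropy'})

# Multiple inputs work the same way: pass a list of keras.Input tensors
# and merge them with layers.concatenate before the shared body.
```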
Even within a single dataset, the input size is not the same for every row. Zero padding also does not work well : if the current sentence has 5 words and the longest one has 100, then we are using 95 unnecessary vectors.
For consistency, we provide O0 (the initial hidden state) for the 1st step as zeros or random values.
Types of RNN
why?
Forget Gate : Remembers only necessary things
ft = [1, 1, 1] - after the element-wise multiplication, the model remembers everything
ft = [0.5, 0.5, 0.5] - the model remembers 50% of the things
Input Gate
Gated Recurrent Unit
update :
reset :
For Vikram Junior, we will modify Vikram's, as there is no mention of a conflict for Vikram Jr.
Bi directional RNN/LSTM/GRU
Why? Because we need not only the previous words but the next words as well.
2 parts : 1 forward RNN + 1 backward RNN
why?
Attention mechanism,
To predict the next word in a translation, we use the previously translated words + all the input words weighted by attention weights; the weights are calculated by an ANN.
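A toy NumPy sketch of one attention step; the encoder states and alignment scores are random/hypothetical (in a real model the scores come from a small ANN comparing the decoder state with each encoder state):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_states = np.random.rand(4, 8)      # hidden states for 4 input words, hidden size 8
scores = np.array([0.1, 2.0, 0.3, 0.5])    # hypothetical alignment scores

weights = softmax(scores)                  # attention weights, sum to 1
context = weights @ encoder_states         # weighted sum over all input words
print(weights, context.shape)              # the context vector feeds the next-word prediction
```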
Luong Attention
___ Play List Completed _____________