Encoder Architecture - Transformer
Transformer Architecture:
The Transformer architecture has 2 parts:
1. Encoder
2. Decoder
Each of the encoder and decoder parts is repeated 6 times (as mentioned in the "Attention Is All You Need" paper).
The authors found this value worked best in their experiments.
Each encoder block is identical in structure to the others, and the same holds for the decoder blocks.
Now, let's zoom in on the Encoder Block.
6 Encoder Blocks:
The first step (turning the sentence into vectors and adding positional encoding) looks like this:
- Input: the sentence
- Output: positionally encoded vectors
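As a rough sketch (not the exact code from the paper), the sinusoidal positional encoding can be computed like this; `d_model = 512` matches the paper, while the function and variable names are my own:

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pe = np.zeros((seq_len, d_model))
    position = np.arange(seq_len)[:, np.newaxis]                   # (seq_len, 1)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)                      # even dimensions
    pe[:, 1::2] = np.cos(position / div_term)                      # odd dimensions
    return pe

# The positional encoding is added to the token embeddings:
# x = token_embeddings + positional_encoding(len(sentence))
```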
Next comes the multi-head attention sub-layer, followed by Add & Norm:
- Input: output of the first step (positionally encoded vectors)
- Output: normalized vectors
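To make the Input → Output above concrete, here is a minimal sketch of a single attention head in NumPy (multi-head attention runs several of these in parallel and concatenates the results); the projection matrices `Wq`, `Wk`, `Wv` are placeholders standing in for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) positionally encoded vectors.
    Wq, Wk, Wv: (d_model, d_k) projection matrices (learned in practice)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how much each token attends to every other token
    weights = softmax(scores, axis=-1)
    return weights @ V                   # (seq_len, d_k) context-aware vectors
```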
Why is there a residual connection, i.e. why do we add the input to the output of the multi-head attention sub-layer?
- The "Attention Is All You Need" paper does not explain this in detail, but two reasons are commonly given:
- 1. It keeps training stable (gradients can flow directly through the skip connection).
- 2. If the current layer produces a poor transformation, the original input information is still preserved.
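Whatever the exact motivation, mechanically the "Add & Norm" step is simple. Here is a sketch (LayerNorm simplified, without the learned scale and shift parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # Residual connection: the sub-layer's output is added back to its input,
    # so the original information is never lost, then the sum is normalized.
    return layer_norm(x + sublayer_output)

# Used as: normalized = add_and_norm(x, multi_head_attention(x))
```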
Now, let's focus on the Feed-Forward part:
Output: vectors with non-linearity introduced, then normalized again
Steps:
1. Take 3 normalized vectors as input (one 512-dimensional vector per token).
2. Stack them together into a matrix of shape 3×512.
3. The first feed-forward layer has 2048 neurons, so its weight matrix is 512×2048.
4. Multiplying the two matrices gives a 3×2048 matrix (so we have increased the dimensionality).
5. These values then go to the next layer, whose weight matrix is 2048×512.
6. Multiplying the result of step 4 by the matrix from step 5 gives 3×512, which was the original size (dimensionality reduced back).
7. What is the need for increasing and then reducing the dimensions?
- It lets us introduce non-linearity through ReLU between the two layers.
8. Finally, a residual connection: sum the feed-forward output with its input and normalize.
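Putting steps 1-8 together, here is a sketch of the position-wise feed-forward sub-layer; random weights stand in for the learned ones:

```python
import numpy as np

d_model, d_ff = 512, 2048
W1 = np.random.randn(d_model, d_ff) * 0.02   # 512 x 2048 (expand)
W2 = np.random.randn(d_ff, d_model) * 0.02   # 2048 x 512 (reduce)

def feed_forward(x):
    """x: (3, 512) -- a stack of 3 normalized token vectors."""
    hidden = np.maximum(0, x @ W1)   # (3, 2048): ReLU introduces non-linearity
    return hidden @ W2               # (3, 512): back to the original size

x = np.random.randn(3, d_model)
print(feed_forward(x).shape)         # (3, 512)
# Step 8 then applies the same Add & Norm as after the attention sub-layer.
```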
This is how one encoder block works; its output goes to the next block and the process repeats, as in the sketch below.
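In other words, the whole encoder stack is just a loop; the `encoder_block` callables below are hypothetical stand-ins for the attention + Add & Norm + feed-forward + Add & Norm steps described above:

```python
def encode(x, blocks):
    """x: (seq_len, 512) positionally encoded input.
    blocks: a list of 6 encoder blocks; the output of one feeds the next."""
    for block in blocks:
        x = block(x)   # attention -> Add & Norm -> feed-forward -> Add & Norm
    return x           # final (seq_len, 512) representation passed to the decoder
```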
Why do we need multiple encoder blocks?
- Natural language is complex, and stacking multiple blocks lets the model build up a richer understanding of it.
Example (for intuition):
- First block: "Paris is the capital of France." Learns that "Paris" and "France" are related.
- Second block: Uses that info to understand that "capital" relates to "France" through "Paris."
- Third block: Sees the whole phrase and understands the full meaning in context.