Encoder Architecture - Transformer

Transformer Architecture:



The Transformer architecture has 2 parts:

1. Encoder
2. Decoder

Both the encoder and the decoder part are repeated 6 times (as mentioned in the "Attention Is All You Need" paper).
The authors found this to be the best-fitting value in their experiments.

Each encoder block has the same structure as the others (though each learns its own weights); the same is true for the decoder blocks.

Now, let's zoom in on the encoder block.


6 Encoder Blocks:

The input and output of this first step (embedding + positional encoding) look like this (a short sketch follows below):

  • Input: sentence
  • Output: positionally encoded vectors
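
To make this concrete, here is a minimal sketch of that step, assuming PyTorch. The vocabulary size and token ids are made-up illustration values; d_model = 512 and the sinusoidal formula follow the paper.

```python
import torch
import torch.nn as nn

d_model = 512                       # embedding size used in the paper
vocab_size = 10000                  # hypothetical vocabulary size
embedding = nn.Embedding(vocab_size, d_model)

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding: sin on even dimensions, cos on odd dimensions
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / (10000 ** (two_i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

token_ids = torch.tensor([[5, 42, 7]])                    # a 3-token "sentence" (hypothetical ids)
x = embedding(token_ids)                                  # shape: (1, 3, 512)
x = x + positional_encoding(token_ids.size(1), d_model)   # positionally encoded vectors
print(x.shape)                                            # torch.Size([1, 3, 512])
```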


Now, let's focus on the multi-head attention and normalization part:




Input: output of the first step (positionally encoded vectors)
Output: normalized vectors


Why is there a residual connection, i.e. why do we add the input to the output of multi-head attention?
  • The paper does not justify this choice explicitly; two commonly given explanations are:
  • 1. It keeps training stable (gradients can flow directly through the skip connection).
  • 2. If the current layer learns something unhelpful, we still keep some of the good original data (see the sketch below).
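
Here is a minimal sketch of this sub-layer, assuming PyTorch: multi-head self-attention, then the residual addition and layer normalization. d_model = 512 and 8 heads follow the paper; the random input is just a stand-in for the positionally encoded vectors.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

x = torch.randn(1, 3, d_model)      # stand-in for 3 positionally encoded vectors
attn_out, _ = attn(x, x, x)         # self-attention: query, key, and value are all x
out = norm(x + attn_out)            # residual connection (x + ...), then layer normalization
print(out.shape)                    # torch.Size([1, 3, 512])
```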

Now, let's focus on the feed-forward part:




Input: output of the previous step (normalized vectors)
Output: final normalized vectors, with non-linearity introduced along the way





Steps:

1. The 3 normalized vectors are the input.
2. Stack them together into a matrix of shape 3*512.
3. The first feed-forward layer has 2048 neurons, so its weight matrix is 512*2048.
4. Multiplying these two matrices gives 3*2048 (so we have increased the dimensionality).
5. The result then goes to the next layer, whose weight matrix is 2048*512.
6. Multiplying the matrices from steps 4 and 5 gives 3*512, the original size (reduced dimensionality).
7. Why increase and then reduce the dimensions?
  • To introduce non-linearity through ReLU, applied between the two layers.
8. Residual connection, sum, and normalization give the final normalized vectors (see the sketch below).
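
Here is a minimal sketch of this feed-forward sub-layer for the 3-token example, assuming PyTorch: 3*512 -> 3*2048 with ReLU -> 3*512, followed by the residual sum and normalization.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # weight matrix of shape 512*2048 (step 3)
    nn.ReLU(),                  # non-linearity (step 7)
    nn.Linear(d_ff, d_model),   # weight matrix of shape 2048*512 (step 5)
)
norm = nn.LayerNorm(d_model)

x = torch.randn(3, d_model)     # 3 normalized vectors stacked into a 3*512 matrix
out = norm(x + ffn(x))          # residual connection, sum, normalization (step 8)
print(out.shape)                # torch.Size([3, 512])
```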


This is how one encoder block works; its output now goes to the next block, and the process repeats.
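
Putting the two sub-layers together, here is a minimal sketch of one encoder block and a stack of six of them, assuming PyTorch. Only d_model = 512, 8 heads, d_ff = 2048, and the 6 repetitions come from the paper; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention
        x = self.norm1(x + attn_out)       # Add & Norm
        x = self.norm2(x + self.ffn(x))    # feed-forward, then Add & Norm
        return x

encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])   # 6 blocks with the same structure
x = torch.randn(1, 3, 512)        # positionally encoded input for a 3-token sentence
print(encoder(x).shape)           # torch.Size([1, 3, 512])
```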

Why do we need multiple encoder blocks?


- Natural language is complex; multiple blocks let the model build up a deeper understanding layer by layer.

Example (for intuition), using the sentence “Paris is the capital of France”:

  • First block: learns that “Paris” and “France” are related.

  • Second block: uses that information to understand that “capital” relates to “France” through “Paris.”

  • Third block: sees the whole phrase and understands the full meaning in context.



