Posts
Cross Attention in Decoder Block of Transformer
Notice where the cross attention is marked: two arrows come from the encoder block and one comes from the decoder block. Why do we need to consider the encoder block at all? Say we have already predicted 2 words and now need to predict the 3rd. What does it depend on? Of course, the first 2 words from the decoder side, and the context of the original sentence from the encoder block. So we need to figure out the relationship between these two. How do we get that relationship?
q : Hindi (from the Decoder Block)
k : English (from the Encoder Block)
v : English (from the Encoder Block)
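Here is a minimal NumPy sketch of that idea, with toy shapes and random weights just for illustration: the queries come from the decoder states (Hindi side) and the keys/values come from the encoder outputs (English side).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, Wq, Wk, Wv):
    """decoder_states: (T_dec, d_model), encoder_outputs: (T_enc, d_model)."""
    Q = decoder_states @ Wq        # queries from the decoder (Hindi side)
    K = encoder_outputs @ Wk       # keys from the encoder (English side)
    V = encoder_outputs @ Wv       # values from the encoder (English side)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T_dec, T_enc) relationship scores
    return softmax(scores) @ V                # each decoder step attends over the source

# Toy usage: 2 words predicted so far, 5 source-sentence tokens
d_model = 8
dec = np.random.randn(2, d_model)
enc = np.random.randn(5, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)   # (2, 8)
```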
Decoder Block in Transformer: Understanding Masked Self-Attention and Masked Multi-Head Attention
Masked Self-Attention: What is an autoregressive model? A model that predicts the next value based on previous values, like next-word prediction: to predict the next word, you need the previous words. Now, the question is why the model behaves differently during training and inference. To answer that, and to justify the statement above, 'masked self-attention' comes into the picture.

Now, focus on the example below.

During inference: to predict the next word, we use the previously generated word as input.

During training: if you look at this diagram, observe that even when the model predicts the wrong word, we still pass the correct word from our dataset as the input to the next step, so that the model learns correctly. So during training we are not dependent on the model's previous outputs. Hence, it is non-autoregressive during training. Now, pause a little and think for a while: what does s...
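To make the training-time behaviour concrete, here is a minimal NumPy sketch of masked (causal) self-attention; the shapes and names are illustrative, not taken from the post. Every position can attend to itself and earlier positions only, which is what lets us feed the whole correct sequence at once during training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    """X: (T, d_model). Each position may only attend to itself and earlier positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (T, T) attention scores
    mask = np.triu(np.ones_like(scores), k=1)        # 1s above the diagonal mark "future" tokens
    scores = np.where(mask == 1, -1e9, scores)       # block attention to the future
    return softmax(scores) @ V

T, d_model = 4, 8
X = np.random.randn(T, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
print(masked_self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```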
Encoder Architecture - Transformer
Transformer Architecture: The Transformer architecture has 2 parts:
1. Encoder
2. Decoder
Each of the encoder and decoder parts repeats 6 times (as mentioned in the "Attention Is All You Need" paper); the authors found this to be the best value in their experiments. Each encoder block is identical to the others, and the same holds for the decoder blocks. Now, let's zoom in to the encoder block.

6 Encoder Blocks: the input would look like this. Input: sentence. Output: positionally encoded vectors.

Now, let's focus on the multi-head attention and normalization part. Input: the output of the first step (positionally encoded vectors). Output: normalized vectors. Why is there a residual connection, i.e. why do we add the input back to the vector coming out of multi-head attention? The "Attention Is All You Need" paper does not explain it, and explanations online are largely speculation. People assume 2 possibilities:
1. It stabilizes training.
2. If the current layer messes up, we still keep some of the good original signal.
Now, let's focus on the Feed-Forward...
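A rough PyTorch sketch of one encoder block (self-attention, add & norm, feed-forward, add & norm) and a stack of 6 of them; the hyperparameters below are illustrative defaults, not a claim about the exact paper configuration.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)       # self-attention over the same sequence
        x = self.norm1(x + attn_out)           # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))        # feed-forward, again with residual + norm
        return x

blocks = nn.Sequential(*[EncoderBlock() for _ in range(6)])   # 6 identical blocks stacked
x = torch.randn(1, 10, 512)                                   # (batch, tokens, d_model)
print(blocks(x).shape)                                        # torch.Size([1, 10, 512])
```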
Layer Normalization in Transformer
1. Why can't you apply batch normalization in Transformers? Let's say we have 4 sentences and we want to pass them through the self-attention layers in batches. Now take r1 and r2: notice that the number of words is different in the two sentences, so we have to apply padding.

Now, just focus on d1 and notice those 0s from padding. You can clearly see the problem: if a sentence is very long and the batch size is also large (say 32 sentences in one batch and s1 has 20k tokens), then far too many unnecessary 0s get pulled into the batch statistics, which leads to wrong calculations and unstable training. Hence, batch normalization is not a good solution for transformers...
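A small NumPy sketch of the contrast, assuming a padded toy batch like the one described above (the numbers are made up for illustration): batch norm averages per feature across the whole batch, so padding zeros pollute the statistics, while layer norm averages per token across its own features, so padding never leaks into other tokens.

```python
import numpy as np

batch = np.array([               # (2 sentences, 4 token slots, 3 features)
    [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # 2 real tokens + padding
    [[4.0, 3.0, 2.0], [1.0, 2.0, 1.0], [3.0, 1.0, 2.0], [2.0, 2.0, 2.0]],  # 4 real tokens
])

# Batch norm: statistics per feature across the whole batch -> padding zeros drag the mean down
bn_mean = batch.mean(axis=(0, 1))
print("batch-norm means (skewed by padding):", bn_mean)

# Layer norm: statistics per token across its own features -> padding stays isolated
ln_mean = batch.mean(axis=-1, keepdims=True)
ln_std = batch.std(axis=-1, keepdims=True) + 1e-5
normalized = (batch - ln_mean) / ln_std
print("layer-norm output shape:", normalized.shape)   # (2, 4, 3)
```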
Positional Encoding in Transformer
1. Why does position matter in Transformers? Transformers rely on self-attention, which processes tokens in parallel. This means that, unlike RNNs, they don't inherently know the order of words. So sentences like "Ravi killed the lion" vs. "The lion killed Ravi" would look identical to a vanilla Transformer, which is clearly problematic!

🧪 Idea #1: The Naïve Approach. A simple fix would be to add the token's index/position to its embedding vector. Issues: Unbounded values: position IDs can become huge (e.g. 100,000+ in long texts), destabilizing training. Discrete steps: sharp jumps between integers disrupt gradient flow.

🧪 Idea #2: Normalize the Position Numbers. What if we divide the position numbers by a constant to make them small and smooth? That helps a bit; values don't explode anymore. Issue: now, if you observe both sentences, the word at the second position gets different values, 1 for sentence 1 and 0.5 for sentence 2. So the neural network will get confused while training: what a...
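A tiny sketch of that normalization problem: divide each token's position by the last index and the same position ends up with different values in sentences of different lengths (the example sentences and lengths below are assumed for illustration).

```python
def normalized_positions(n_tokens):
    """Position index divided by the largest index, so values stay in [0, 1]."""
    return [i / (n_tokens - 1) for i in range(n_tokens)]

sentence1 = ["Ravi", "slept"]                 # 2 tokens
sentence2 = ["Ravi", "killed", "lion"]        # 3 tokens

print(normalized_positions(len(sentence1)))   # [0.0, 1.0]      -> 2nd word gets 1.0
print(normalized_positions(len(sentence2)))   # [0.0, 0.5, 1.0] -> 2nd word gets 0.5
```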