Posts

Cross Attention in Decoder Block of Transformer

Notice where the cross attention is marked: two arrows come from the encoder block and one comes from the decoder block. Why do we need to consider the encoder block at all? Let's say we have already predicted 2 words and need to predict the 3rd word. What does it depend on? Of course, on the first 2 words from the decoder block, and on the context of the original sentence from the encoder block. So we need to figure out the relationship between these two. How do we get that relationship?
q : Hindi (from the Decoder Block)
k : Eng (from the Encoder Block)
v : Eng (from the Encoder Block)
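To make that q/k/v wiring concrete, here is a minimal single-head cross-attention sketch in NumPy. The shapes, weight matrices, and toy data are my own assumptions for illustration, not taken from the post.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder."""
    Q = decoder_states @ Wq                      # (tgt_len, d_k)
    K = encoder_states @ Wk                      # (src_len, d_k)
    V = encoder_states @ Wv                      # (src_len, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)           # attention over source tokens
    return weights @ V                           # (tgt_len, d_k)

# toy shapes: 2 Hindi tokens generated so far, 4 English source tokens, d_model = 8
d_model = 8
dec = np.random.randn(2, d_model)   # decoder-side representations (Hindi)
enc = np.random.randn(4, d_model)   # encoder output (English)
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)   # (2, 8)
```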

Decoder Block in Transformer : Understanding Masked Self-Attention and Masked Multi-Head Attention

Masked Self-Attention : What is an autoregressive model? A model that predicts the next value based on previous values, like next-word prediction: to predict the next word, you need the previous words. Now, the question is why the model behaves differently during training and inference. To answer this, and to make sense of the statement above, 'masked self-attention' comes into the picture. Now, focus on the example below. During inference: to predict the next word, we use the previously generated word as input. During training: if you look at this diagram, observe that even if the model predicts the wrong word, we pass the correct word from our dataset as the input at the next step (teacher forcing), so that the model learns correctly. So, during training, we are not dependent on the model's previous predictions. Hence, it is non-autoregressive during training. Now, pause for a moment and think: what does s...
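As a rough sketch of how the mask enforces this behaviour, here is single-head masked self-attention in NumPy: positions above the diagonal are blocked, so each token can only attend to itself and earlier tokens. The names, shapes, and random data are illustrative assumptions, not the post's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    """Each position may only attend to itself and earlier positions."""
    T = x.shape[0]
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (T, T)
    causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(causal_mask, -1e9, scores)      # block future positions
    return softmax(scores, axis=-1) @ V

x = np.random.randn(5, 8)                             # 5 tokens, d_model = 8
Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
print(masked_self_attention(x, Wq, Wk, Wv).shape)     # (5, 8)
```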

Encoder Architecture - Transformer

Transformer Architecture : The Transformer has 2 parts : 1. Encoder 2. Decoder. The encoder and decoder stacks each repeat 6 times (as mentioned in the "Attention Is All You Need" paper); the authors found this to be the best value in their experiments. Each encoder block is identical to the others, and the same holds for the decoder blocks. Now, let's zoom in to the Encoder Block. 6 Encoder Blocks: The first step looks like this. Input : sentence. Output : positionally encoded vectors. Now, let's focus on the multi-head attention and normalization part. Input : output of the first step (positionally encoded vectors). Output : normalized vectors. Why is there a residual connection, i.e. why do we add the input back to the output of multi-head attention? The paper does not explain this explicitly, but two commonly given explanations are: 1. It keeps training stable. 2. If the current layer messes up, we still keep some of the good original data. A sketch of the whole block is given after this paragraph. Now, let's focus on the Feed-Forwar...
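Putting the pieces together, here is a minimal sketch of one encoder block, assuming the post-layer-norm arrangement from the paper (Add & Norm after each sub-layer), a stand-in function for multi-head attention, and no learnable layer-norm parameters. It is an illustration of the structure, not the post's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector independently."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block(x, attention_fn, W1, b1, W2, b2):
    # sub-layer 1: multi-head attention, then residual connection + layer norm
    x = layer_norm(x + attention_fn(x))
    # sub-layer 2: position-wise feed-forward network, again residual + norm
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2     # ReLU(x W1 + b1) W2 + b2
    return layer_norm(x + ff)

d_model, d_ff, T = 8, 32, 5
x = np.random.randn(T, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
dummy_attention = lambda h: h    # stand-in for real multi-head self-attention
print(encoder_block(x, dummy_attention, W1, b1, W2, b2).shape)   # (5, 8)
```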

Layer Normalization in Transformer

1. Why can't you apply batch normalization in Transformers? Let's say we have 4 sentences and we want to pass them through the self-attention layers in batches. Now take r1 and r2: notice that the number of words differs between the two sentences, so we apply padding. Now focus on d1 and notice those 0s from the padding. You can clearly see the problem: if a sentence is very long and the batch is also large (say 32 sentences in one batch and s1 has 20k tokens), there will be far too many unnecessary 0s. Since batch normalization computes its mean and variance across the whole batch, those padded 0s skew the statistics and distort training. Hence, batch normalization is not a good solution for transf...
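A toy NumPy comparison makes this visible (the batch, shapes, and numbers below are made up for illustration): batch norm pools statistics over all sentences and positions, so the padding zeros pollute them, while layer norm only looks at each token's own feature vector.

```python
import numpy as np

# toy batch: 2 sentences padded to 4 tokens, d_model = 3
# the last three rows of sentence 2 are all-zero padding
batch = np.array([
    [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5], [0.3, 0.7, 1.2], [1.1, 0.9, 0.4]],
    [[4.0, 5.0, 6.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
])

def batch_norm(x, eps=1e-5):
    """Normalizes each feature across the whole batch (all sentences and
    positions), so the padding zeros drag the mean and variance around."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Normalizes across the feature dimension of each token separately,
    so other sentences and their padding never enter the statistics."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

print(batch_norm(batch)[0, 0])   # statistics polluted by the padded zeros
print(layer_norm(batch)[0, 0])   # unaffected by other tokens or padding
```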

Positional Encoding in Transformer

1. Why Position Matters in Transformers? Transformers rely on self-attention, which processes tokens in parallel. This means that, unlike RNNs, they don't inherently know the order of words. So sentences like “Ravi killed the lion” vs. “The lion killed Ravi” would look identical to a vanilla Transformer, which is clearly problematic!
🧪 Idea #1: The Naïve Approach. A simple fix would be to append the token's index/position to its embedding vector. Issues: Unbounded values: position IDs can become huge (e.g. 100,000+ in long texts), destabilizing training. Discrete steps: sharp jumps between integers disrupt gradient flow.
🧪 Idea #2: Normalize the Position Numbers. What if we scale the position numbers down, say by dividing by the sentence length, to make them small and smooth? That helps a bit; values don't explode anymore. Issues: if you observe both sentences, the word at the second position gets a different value in each: 1 for sentence 1 and 0.5 for sentence 2. So the neural network will get confused while training: what a...
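A tiny sketch of that second idea makes the mismatch explicit. It assumes the normalization is "position divided by sentence length", and the example sentences are mine, chosen only to reproduce the 1 vs. 0.5 values mentioned above.

```python
def normalized_positions(sentence):
    """Naive idea #2: scale each 1-based position by the sentence length."""
    n = len(sentence)
    return [(i + 1) / n for i in range(n)]

s1 = ["Ravi", "sleeps"]                        # 2 tokens
s2 = ["The", "lion", "killed", "Ravi"]         # 4 tokens
print(normalized_positions(s1))   # [0.5, 1.0]
print(normalized_positions(s2))   # [0.25, 0.5, 0.75, 1.0]
# the word at position 2 gets 1.0 in s1 but 0.5 in s2: the same position
# maps to different values depending on sentence length, which confuses training
```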

How Is the Word Embedding Generated?

How Is the Word Embedding Generated? | A Simple Guide with Examples. What is a Word Embedding? Imagine you have a word like “apple.” Now think of representing “apple” as a point in a space, say a 3D space. Each dimension can represent a feature or meaning category. For example: Dimension 1: Tech (related to Apple Inc.). Dimension 2: Fruit (edible apple). Dimension 3: Vehicle (rare but possible use). Let's say our model generates this vector for “apple”: apple → [0.2, 0.8, 0.00001], where the dimensions represent [tech, fruit, vehicle]. This tells us that “apple” here mostly refers to the fruit (0.8), a little to tech (0.2), and is barely related to vehicles. Similar Words Stay Closer: In this vector space, words with similar meanings are closer together, and words with different meanings are far apart. 🐶 dog and 🐱 cat might be nearby. 🚗 car and 🌳 tree would be far apart. 🍎 “apple” (fruit) and 🍇 “grape” will likely be close. This closeness is captur...
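That closeness is usually measured with cosine similarity. Here is a small sketch using hand-made toy vectors over the same [tech, fruit, vehicle] dimensions; the numbers are invented for illustration and are not from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; values near 0 mean unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy vectors over [tech, fruit, vehicle]
apple = np.array([0.2, 0.8, 0.00001])
grape = np.array([0.05, 0.9, 0.0])
car   = np.array([0.1, 0.0, 0.95])

print(cosine_similarity(apple, grape))   # high: both are mostly "fruit"
print(cosine_similarity(apple, car))     # low: little overlap in meaning
```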