Layer Normalization in Transformers
1. Why can't you apply batch normalization in Transformers?
Let's say we have 4 sentences that we want to pass through the self-attention layers in batches.
Now, let's take r1 and r2.
Notice that the number of words is different in the two sentences, so we will apply padding.
[Figures omitted: the sentence representations r1 and r2 after padding, stacked into a batch with feature column d1.]
Now, just focus on d1 and notice the 0s introduced by padding. The problem is clear: if one sentence is very long and the batch size is also large (say 32 sentences per batch, with s1 having 20k tokens), every shorter sentence gets padded to that length and the batch fills up with unnecessary 0s. Batch normalization computes its mean and variance per feature across the whole batch, so those padding 0s get counted too, skewing the statistics and hurting training. Hence, batch normalization is not a good solution for Transformers.
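To make this concrete, here is a minimal sketch (assuming PyTorch; the shapes, sentence lengths, and variable names are made up for illustration) of a padded batch. It shows that a batch-norm-style mean, taken per feature across the batch and positions, gets dragged toward zero by the padding, while a layer-norm-style mean, taken per token across its own features, is unaffected by the other tokens' padding.

```python
import torch

torch.manual_seed(0)

batch_size, seq_len, d_model = 4, 6, 8          # illustrative sizes

# Pretend embeddings for 4 sentences with true lengths 6, 2, 3 and 1 tokens.
true_lengths = torch.tensor([6, 2, 3, 1])
x = torch.randn(batch_size, seq_len, d_model)

# Zero out the padded positions, just like padding would.
mask = torch.arange(seq_len).unsqueeze(0) < true_lengths.unsqueeze(1)  # (4, 6)
x = x * mask.unsqueeze(-1)

# Batch-norm style: one mean per feature, computed ACROSS the batch and positions,
# so all the padding zeros are counted and drag the statistics toward 0.
bn_mean = x.mean(dim=(0, 1))                     # shape: (d_model,)
print("per-feature mean, skewed by padding:", bn_mean)

# Layer-norm style: one mean per token, computed ACROSS its own features,
# so a real token's statistics never mix with the padded tokens.
ln_mean = x.mean(dim=-1)                         # shape: (batch_size, seq_len)
print("per-token mean of the first real token:", ln_mean[0, 0].item())
```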
Important Notes:
Batch normalization works across the batch dimension (vertical, across sentences).
Layer normalization works across the feature dimension (horizontal, within each token).
This justifies why Layer Normalization is used in Transformers instead of Batch Normalization!
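For completeness, below is a minimal sketch (again assuming PyTorch; the shapes are illustrative) of how layer normalization is applied in a Transformer block: torch.nn.LayerNorm(d_model) normalizes each token over its feature dimension, independently of every other token in the batch.

```python
import torch
import torch.nn as nn

d_model = 8
layer_norm = nn.LayerNorm(d_model)       # normalizes over the last (feature) dim

x = torch.randn(4, 6, d_model)           # (batch, sequence, features)
y = layer_norm(x)

# Each token is normalized independently over its d_model features,
# so its mean is ~0 and std is ~1 regardless of what else is in the batch.
print(y.mean(dim=-1)[0, :3])                     # ~0 for each token
print(y.std(dim=-1, unbiased=False)[0, :3])      # ~1 for each token
```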