NLP Notes Personal

bag of words: build a vocab of all unique words; each document/row becomes a vector of word counts. Word order is ignored.
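A minimal pure-Python sketch of bag-of-words counting (the example sentences are made up; libraries like scikit-learn's `CountVectorizer` do the same thing with more options):

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the mat"]

# Vocab = every unique word across all documents, sorted for a stable order
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    # Each document becomes a vector of word counts over the vocab
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bow_vector(d) for d in docs]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

Note that "the cat sat" and "sat the cat" would get identical vectors, which is exactly the limitation n-grams try to fix.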




N-gram:




bi-gram: keep 2 words together while making the vocab

tri-gram: keep 3 words together, same idea

bag of words = uni-gram


you can take a range as well, e.g. (1,3) - so the final vocab will have unigrams + bigrams + trigrams
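A small sketch of the same idea in pure Python (the example sentence is made up; with scikit-learn you'd pass `ngram_range=(1, 3)` to `CountVectorizer`):

```python
def ngrams(tokens, n):
    # Slide a window of n tokens and join each window into one vocab entry
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()

# Range (1, 3): the vocab holds unigrams + bigrams + trigrams
vocab = []
for n in range(1, 4):
    vocab.extend(ngrams(tokens, n))

print(vocab)
```

For 6 tokens this gives 6 unigrams + 5 bigrams + 4 trigrams = 15 entries, so the vocab (and vector length) grows quickly with the range.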


why n-gram?


- capturing some word order makes vectors for sentences with different meanings more distant from each other (plain bag of words can't tell "good" from "not good"; the bi-gram "not good" can)




TF-IDF

- if a word appears many times in one document/row but is rare in the other rows, then for this row we give that word a high weight (tf-idf = term frequency × inverse document frequency)
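A minimal sketch of one common TF-IDF variant (documents are made up; real libraries like scikit-learn use a smoothed formula, so exact values differ):

```python
import math

docs = [
    "cat sat on the mat".split(),
    "dog sat on the log".split(),
    "the cat and the dog".split(),
]

def tf(word, doc):
    # Term frequency: how often the word occurs in this document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency: log(N / df), where df = number of
    # documents containing the word (unsmoothed variant)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)
```

Here "the" appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF is 0 everywhere, while a word like "mat" (only in one row) gets a high weight in that row.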






why take log in IDF formula?


- raw N/df blows up for rare words; the log squashes it onto a smaller scale, so TF also gets fair priority and IDF doesn't dominate the product
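A tiny numeric check of the point above (the corpus size and document frequency are hypothetical):

```python
import math

N = 1_000_000  # total documents in the corpus (hypothetical)
df = 10        # a rare word appearing in only 10 documents

raw = N / df              # 100000.0 — would swamp any TF value
logged = math.log10(raw)  # 5.0 — comparable in scale to typical TF values

print(raw, logged)
```

Multiplying TF by 100000 would make document frequency the only thing that matters; multiplying by 5 keeps both factors in play.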

continuous bag of words (CBOW)
- keep a window of context words on each side of a target word (e.g. window = 1 → 1 word on each side)
- the model predicts the target word from its context, which helps it learn contextual word representations
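A minimal sketch of how CBOW builds its (context, target) training pairs with a window of 1 word each side (the sentence and function name are made up; the actual model then learns embeddings by predicting each target from its context, e.g. gensim's `Word2Vec` with `sg=0`):

```python
def cbow_pairs(tokens, window=1):
    # For each position: context = words within `window` on each side,
    # target = the centre word the model learns to predict
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

pairs = cbow_pairs("the cat sat on the mat".split())
print(pairs[1])  # (['the', 'sat'], 'cat')
```

Words at the edges simply get a smaller context, which is why the first pair has only one context word.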


