bag of words
N-gram:
bi-gram: keep 2 consecutive words together while building the vocab
tri-gram: keep 3 consecutive words together
bag of words = uni-gram
you can take a range as well, e.g. (1,3) - the final vocab will then have single words + 2-word combos + 3-word combos
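A minimal sketch of how the vocab grows with the range, in plain Python. The helper names (ngrams, build_vocab), the whitespace tokenizer and the two example sentences are assumptions made for illustration, not part of any library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(docs, ngram_range=(1, 1)):
    """Collect all n-grams for every n from low to high (inclusive)."""
    low, high = ngram_range
    vocab = set()
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        for n in range(low, high + 1):
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)


docs = ["the food is good", "the food is not good"]

unigram_vocab = build_vocab(docs, (1, 1))
print(unigram_vocab)   # ['food', 'good', 'is', 'not', 'the']

range_vocab = build_vocab(docs, (1, 3))
print(len(range_vocab))  # 14 = 5 single words + 5 two-word combos + 4 three-word combos
```

Note that the range is inclusive at both ends, so (1,3) also brings in the 2-word combos such as "not good".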
why n-gram?
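-with single words only, sentences like "the food is good" and "the food is not good" get almost the same vector; n-grams add features such as "not good", so sentences with different meaning end up further apart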
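To check the distance claim on a toy case, here is a small sketch that counts n-grams for two sentences and compares their Euclidean distance with and without bi-grams. The function names and the example sentences are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, max_n):
    """Count every n-gram of length 1..max_n in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def euclidean(a, b):
    """Euclidean distance between two sparse count vectors (missing key = 0)."""
    keys = set(a) | set(b)
    return sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


s1 = "the food is good"
s2 = "the food is not good"

d_uni = euclidean(ngram_counts(s1, 1), ngram_counts(s2, 1))
d_bi = euclidean(ngram_counts(s1, 2), ngram_counts(s2, 2))
print(d_uni, d_bi)  # 1.0 2.0
```

With uni-grams the two vectors differ only in "not"; with bi-grams they also differ in "is good", "is not" and "not good", so the distance doubles in this example.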
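TF-IDF
- if a word appears many times in one document/row but is rare in the other rows, we give that word a high weight for this row.
why take log in the IDF formula?
so that IDF doesn't dominate and TF keeps a comparable say in the score. Without the log, a word found in 1 of 1,000,000 documents would get a raw ratio of 1,000,000; with the natural log it is about 13.8.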
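A rough sketch of one common textbook variant, tf = count / words in the document and idf = log(N / number of documents containing the word). Libraries often use smoothed or normalized versions, so their numbers will differ from this one. The function name tf_idf and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)

    # df[word] = number of documents that contain the word at least once
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


docs = ["pizza pizza is tasty", "pasta is tasty", "salad is healthy"]
scores = tf_idf(docs)

for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

In the first row, "pizza" is frequent there and absent elsewhere, so it gets the highest score; "is" appears in every row, so its idf is log(1) = 0 and its score is 0.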
continuous bag of words
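-keep a window of 1 word on each side of the target word
-the model predicts the target (middle) word from these surrounding context words, so the learned word vectors capture context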
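A small sketch of how the (context, target) training pairs could be built with a window of 1. It only covers the data preparation step, not the neural network that learns the vectors; cbow_pairs is a hypothetical helper name and the sentence is a made-up example.

```python
def cbow_pairs(tokens, window=1):
    """Build (context_words, target_word) pairs for CBOW-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]     # up to `window` words before
        right = tokens[i + 1:i + 1 + window]    # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


pairs = cbow_pairs("the food is good".split(), window=1)
for context, target in pairs:
    print(context, "->", target)
# ['food'] -> the
# ['the', 'is'] -> food
# ['food', 'good'] -> is
# ['is'] -> good
```

Words at the edges of the sentence have only one neighbour, so their context is shorter.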