Training Word2vec using gensim
--
Word2vec is an efficient method for creating word embeddings and has been around since 2013. It is a two-layer neural network that processes text by “vectorizing” words: its input is a text corpus and its output is a set of feature vectors, one for each word in that corpus.
Contents of the blog:
- What is Word embedding?
- Why do we need word embedding?
- How to train our own word vectors using gensim word2vec?
- How to use the trained model?
- How to use pre-trained word2vec models?
- References
What is Word embedding?
Embeddings are vector representations of words. They can be calculated in many ways, for example:
One-hot encoding (where a 1 marks the position of the word and 0 appears everywhere else)
Explanation: take the sentence “Word Embeddings are a representation of a Word in numbers.”
First, we form a list of the unique words in the sentence: [‘Word’, ‘Embeddings’, ‘are’, ‘a’, ‘representation’, ‘of’, ‘in’, ‘numbers’]
One hot representation of words:
Word: [1,0,0,0,0,0,0,0]
Embeddings: [0,1,0,0,0,0,0,0]
are: [0,0,1,0,0,0,0,0]
and so on.
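To make the mapping concrete, here is a minimal sketch of this one-hot scheme in plain Python. The `one_hot` helper and variable names are illustrative only, not part of any library.

```python
# Minimal one-hot encoding of the example sentence (illustrative helper, not a library API).
sentence = "Word Embeddings are a representation of a Word in numbers."

# Build the vocabulary of unique words, preserving first-occurrence order.
vocab = list(dict.fromkeys(sentence.replace(".", "").split()))

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's vocabulary index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(vocab)                   # ['Word', 'Embeddings', 'are', 'a', 'representation', 'of', 'in', 'numbers']
print(one_hot("Word", vocab))  # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("are", vocab))   # [0, 0, 1, 0, 0, 0, 0, 0]
```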
Frequency-based Embedding
- Count Vector
- TF-IDF Vector
- Co-Occurrence Vector
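Count and TF-IDF vectors can be computed in a few lines with scikit-learn; the snippet below is only a sketch to illustrate the idea (scikit-learn is not required for the gensim workflow later in this post, and the corpus is made up). A co-occurrence vector would instead be built by counting how often word pairs appear together within a window.

```python
# Illustrative count and TF-IDF vectors using scikit-learn (for illustration only).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "word embeddings are a representation of a word in numbers",
    "word2vec learns word vectors from a text corpus",
]

# Count vector: each column is a vocabulary word, each value a raw term count.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(corpus)
print(counts.toarray())

# TF-IDF vector: term counts reweighted by how rare a word is across documents.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus)
print(tfidf.toarray())
```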
Prediction-based Embedding
- CBOW (Continuous Bag of Words): this method takes the context of each word as input and tries to predict the word corresponding to that context. For example, in ‘You are a great learner’, the target is “learner” and its context is the surrounding words, which depends on the context window. With a context window of 2, up to two words on each side of the target are taken as context; since ‘learner’ ends the sentence, its context is ‘a great’.
- Skip-gram model: skip-gram works in the opposite direction, aiming to predict the context given the word. Both variants can be trained with gensim, as shown in the sketch below.
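As a quick preview of the training section below, both architectures are selected in gensim through the `sg` parameter of `Word2Vec`. This is a minimal sketch assuming gensim 4.x, where the embedding dimension is set with `vector_size` (older 3.x releases call it `size`); the toy sentences are made up for illustration.

```python
# Toy preview of CBOW vs. skip-gram in gensim 4.x (sentences are illustrative only).
from gensim.models import Word2Vec

# gensim expects an iterable of tokenized sentences.
sentences = [
    ["you", "are", "a", "great", "learner"],
    ["word", "embeddings", "represent", "words", "as", "vectors"],
]

# sg=0 trains CBOW (predict the target word from its context);
# sg=1 trains skip-gram (predict the context words from the target).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["learner"][:5])      # first five dimensions of the CBOW vector for "learner"
print(skipgram.wv["learner"][:5])  # same word under skip-gram
```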