Training Word2vec using gensim

7 min read · Sep 29, 2020

Word2vec, around since 2013, is an efficient method for creating word embeddings. It is a two-layer neural network that processes text by “vectorizing” words: its input is a text corpus, and its output is a set of feature vectors, one for each word in that corpus.


What is Word embedding?

Embeddings are vector representations of words. They can be computed in many ways, for example:

One-hot encoding (where 1 stands for the position where the word exists and 0 everywhere else)

Explanation: Sentence: “Word Embeddings are a representation of a Word in numbers.”

First, we form a list of unique words from the given sentence, i.e. [‘Word’, ‘Embeddings’, ‘are’, ‘a’, ‘representation’, ‘of’, ‘in’, ‘numbers’]

One hot representation of words:

Word :       [1,0,0,0,0,0,0,0]
Embeddings : [0,1,0,0,0,0,0,0]
are :        [0,0,1,0,0,0,0,0]

and so on.
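The one-hot scheme above can be sketched in a few lines of plain Python (the function name `one_hot` is my own, not from any library):

```python
# Example sentence from the text above.
sentence = "Word Embeddings are a representation of a Word in numbers"

# Build the list of unique words, preserving first-occurrence order
# (note "Word" appears twice but is kept only once).
vocab = list(dict.fromkeys(sentence.split()))

def one_hot(word, vocab):
    """Return a vector with 1 at the word's position and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("Word", vocab))        # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("Embeddings", vocab))  # [0, 1, 0, 0, 0, 0, 0, 0]
```

The dimensionality of every vector equals the vocabulary size, which is why one-hot encodings become impractical for large corpora.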

Frequency-based Embedding

  1. Count Vector
  2. TF-IDF Vector
  3. Co-Occurrence Vector

Prediction based Embedding

  1. CBOW (Continuous Bag of Words): takes the context of each word as input and tries to predict the word from that context. E.g. in ‘You are a great learner’, the target is “learner” and the context depends on the context window. If the context window is 2, then up to two words on each side of the target form the context; here that is ‘a great’.

  2. Skip-gram: the inverse of CBOW; given the target word, it tries to predict the surrounding context words.

Why do we need word embeddings?


Data Scientist @Sprinklr | IIT Bombay | IIT (ISM) Dhanbad