Training Word2vec using gensim
Word2vec is a method to create word embeddings efficiently and has been around since 2013. Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus.
What is Word embedding?
Embeddings are vector representations of a word. They can be calculated in many ways, for example:
One-hot encoding (where 1 stands for the position where the word exists and 0 everywhere else)
Explanation: Sentence: “Word Embeddings are a representation of a Word in numbers.”
First, we form a list of unique words from the given sentence, i.e. ['Word', 'Embeddings', 'are', 'a', 'representation', 'of', 'in', 'numbers']
One hot representation of words:
Word : [1,0,0,0,0,0,0,0]
Embeddings : [0,1,0,0,0,0,0,0]
are: [0,0,1,0,0,0,0,0]
and so on.
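To make the one-hot idea concrete, here is a minimal sketch (not from the original post) that builds such vectors for the example sentence:
# Minimal one-hot sketch for the example sentence above.
sentence = "Word Embeddings are a representation of a Word in numbers"
# Unique words, in order of first appearance.
vocab = list(dict.fromkeys(sentence.split()))

def one_hot(word, vocab):
    # 1 at the word's position in the vocabulary, 0 everywhere else.
    return [1 if w == word else 0 for w in vocab]

print(one_hot('Word', vocab))        # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot('Embeddings', vocab))  # [0, 1, 0, 0, 0, 0, 0, 0]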
Frequency-based Embedding
- Count Vector
- TF-IDF Vector
- Co-Occurrence Vector
Prediction-based Embedding
- CBOW (Continuous Bag of Words) — This method takes the context of each word as input and tries to predict the word corresponding to that context. E.g. in 'You are a great learner', the target is "learner" and the context is 'great' (depending on the context window). If the context window is 2, then up to 2 words on each side of the target form the context, so here the context is 'a great' (a small sketch of this windowing appears after this list).
- Skip-gram — Skip-gram aims to predict the context given the word.
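As a small toy illustration (not from the original post), the snippet below prints the (context, target) pairs for a context window of 2; CBOW predicts the target from the context, and skip-gram does the reverse:
# Toy sketch: (context, target) pairs with a context window of 2.
sentence = "You are a great learner".split()
window = 2

for i, target in enumerate(sentence):
    # Up to `window` words on each side of the target form its context.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, '->', target)
# For the target 'learner' the context is ['a', 'great'].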
Why do we need word embeddings?
Many ML algorithms and almost all DL architectures are incapable of processing strings or plain text in their raw form. They require numbers as input to perform tasks such as classification or similarity calculations.
In this blog, let's cover how to train word2vec using the gensim library.
The word2vec algorithms include skip-gram and CBOW models, using either hierarchical softmax or negative sampling. For the basics of CBOW and skip-gram models, follow this blog.
We can use pre-trained word2vec models such as 'GoogleNews-vectors-negative300.bin' to get word vectors, or we can train our own. Using the Python package gensim, we can train our word2vec model very easily.✌🏻
There are more ways to train word vectors in Gensim than just Word2Vec, like Doc2Vec and FastText.
How to train our own word vectors using gensim word2vec?
- Gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words.
- Gensim only requires that the input provides sentences sequentially when iterated over: we can provide one sentence, process it, forget it, load another sentence, and keep doing this. Let's see what kind of input is accepted by our word2vec model.
- If we have sentences like:
Sent1 = 'Train your own word vectors'
Sent2 = 'Find similar words from the model vocabulary'
- We need to give the input as:
model_input =
[['Train', 'your', 'own', 'word', 'vectors'], ['Find', 'similar', 'words', 'from', 'the', 'model', 'vocabulary']]
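For instance, a minimal sketch of that conversion, assuming simple whitespace tokenization:
# Turn raw sentences into the list of token lists that gensim's Word2Vec expects.
sentences = ['Train your own word vectors',
             'Find similar words from the model vocabulary']
model_input = [sent.split() for sent in sentences]
print(model_input)
# [['Train', 'your', 'own', 'word', 'vectors'],
#  ['Find', 'similar', 'words', 'from', 'the', 'model', 'vocabulary']]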
Preparing input:
- Import all the required packages:
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec,KeyedVectors
from gensim.test.utils import datapath
import re
import unicodedata
from tqdm import tqdm
import gensim
import multiprocessing
import random
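If the NLTK stopword list has not been downloaded on your machine yet, fetch it once before using it:
nltk.download('stopwords')  # one-time download of the NLTK stopword corpus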
- The very first step is to clean the dataset, i.e., remove unnecessary words and punctuation.
stopwords_list = stopwords.words('english')

def clean_data(w):
    w = w.lower()                          # lower-case everything
    w = re.sub(r'[^\w\s]', '', w)          # strip punctuation
    w = re.sub(r"([0-9])", r" ", w)        # replace digits with spaces
    words = w.split()
    clean_words = [word for word in words if (word not in stopwords_list) and len(word) > 2]
    return " ".join(clean_words)
- For training word2vec, we need a huge dataset, and to handle huge datasets, we can write a generator function that provides sentences sequentially when iterated over.
def get_inp(fname):
    with open(fname, 'r') as f:
        text = f.readlines()
    sent = list(map(clean_data, text))     # clean every line of the file
    for lines in tqdm(sent):
        yield lines.split()                # yield one tokenized sentence at a time
Explanation:
The function reads a .txt file (fname), cleans the text using the clean_data function, and yields the sentences one at a time as lists of tokens.
Input : 'Train your own word vectors'
Output: ['Train', 'your','own','word','vectors']
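One caveat worth noting: a plain generator like get_inp can be iterated only once, so the same object cannot be reused for both building the vocabulary and training. A minimal sketch of a restartable wrapper (my own addition, not part of the original post) that gensim can iterate over multiple times:
# Restartable corpus: gensim can iterate over it more than once
# (e.g. once for build_vocab and again for every training pass).
class SentenceCorpus:
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname, 'r') as f:
            for line in f:
                tokens = clean_data(line).split()
                if tokens:
                    yield tokens

# inp_data = SentenceCorpus(data)  # drop-in replacement for the one-shot generator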
Taking input:
>>> data = datapath('lee_background.cor')
>>> inp_data = get_inp(data)
- Replace datapath('lee_background.cor') with the path to the required .txt file.
Create an empty model:
cores = multiprocessing.cpu_count()
model = Word2Vec(min_count=5, window=5, size=300, workers=cores-1, max_vocab_size=100000)
- size (int, optional): Dimensionality of the word vectors.
- window (int, optional): Maximum distance between the current and predicted word within a sentence.
- workers (int, optional): Number of worker threads used to train the model.
- sg (int, optional): Training algorithm: 1 for skip-gram; otherwise CBOW (the default, sg=0, is CBOW).
- max_vocab_size (int, optional) — Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
- min_count (int, optional) — Ignores all words with a total frequency lower than this.
- callbacks (iterable of CallbackAny2Vec, optional): Sequence of callbacks to be executed at specific stages during training (e.g., to save the model after each epoch, as sketched below).
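For example, a minimal sketch of such a callback that saves the model after every epoch (the class name and file name pattern here are just illustrations):
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    """Save the model at the end of every training epoch."""
    def __init__(self, prefix):
        self.prefix = prefix
        self.epoch = 0

    def on_epoch_end(self, model):
        model.save('{}_epoch{}.model'.format(self.prefix, self.epoch))
        self.epoch += 1

# Pass callbacks=[EpochSaver('w2v')] to Word2Vec(...) or model.train(...).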
Build the vocabulary:
- Before training the model, we need to build the vocabulary which contains unique words from our dataset.
- One way is to initialize the vocabulary with two or three tokens, and we can keep updating the vocabulary in our loop.
model.build_vocab([['word','embeddings'],['training','model']])
- We can print the vocabulary using
model.wv.vocab
# A dictionary with words as keys and Vocab objects (count, index, etc.) as values.
# or
model.wv.vocab.keys()
# To see the words in the vocabulary.
- Then update the vocabulary in the loop
model.build_vocab(inp, update=True)
# inp will be another list of lists of tokens/words from the dataset.
- Another way is to build the vocabulary using all the sentences in the dataset, but for large datasets this approach will be very slow. We can instead randomly pick some sentences, build our vocabulary on those, and then train the model.
Note: The words added to the vocabulary are fixed, and we can perform tasks (e.g. finding similar words) only on those words.
model.build_vocab(inp_data)
#For using the entire dataset for vocabulary.
- Set
texts = texts[:samples]
# To use some of the sentences from the dataset; samples can be any integer value.
# or
texts = random.sample(texts, samples)
# To select samples randomly.
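Putting the sampling idea together, a rough sketch (the variable names and the sample size of 1000 are placeholders, not from the original post):
# Build the vocabulary from a random sample of sentences, then train on the full data.
texts = list(get_inp(data))                                  # materialize cleaned sentences (assumes they fit in memory)
vocab_sample = random.sample(texts, min(1000, len(texts)))   # 1000 is an arbitrary sample size
model.build_vocab(vocab_sample)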
Possible Errors:
RuntimeError: you must first build vocabulary before training the model
Reason: Trying to train with an empty vocabulary.
Possible Solution: Print model.wv.vocab and check whether it is empty. If it is, try reducing min_count from 5 to 1 or another lower value. The default min_count in gensim's Word2Vec is 5, so if no word in your corpus appears more than 4 times, your vocabulary will be empty, hence the error.
Cannot sort the vocabulary
Reason: Trying to change/modify the built vocabulary.
Possible solution: Either set update=True in model.build_vocab if you are trying to add new words to the vocabulary, or define a fixed vocabulary first and don't change it.
Training the model:
model.train(inp_data,total_examples=model.corpus_count,epochs=50)
- total_examples (int): Count of sentences (set to model.corpus_count).
- epochs (int): Number of iterations (epochs) over the corpus.
- callbacks: can also be added here to save the model epoch-wise or based on other tasks.
Saving the model:
model.wv.save_word2vec_format(path_to_save_string, binary=True)
# path_to_save_string: '/Desktop/gensim_w2v_model.bin'
# or
model.save(path_to_save_string)
# path_to_save_string: '/Desktop/gensim_w2v_model.model'
Loading the model:
trained_model = KeyedVectors.load_word2vec_format(saved_model_path, binary=True)
# or
trained_model = gensim.models.Word2Vec.load('saved_model_path')
How to use the trained model:
- Similar words: find the words most similar to a given key from the model
key = 'word_string'
trained_model.wv.most_similar(positive=[key], topn=5)
#Gives top 5 similar words from the vocabulary.
The key should be present in the vocabulary. The line above returns the similar words along with their similarity scores.
- Word similarity: One way to check model quality is to verify that it reports a high level of similarity between two semantically (or syntactically) equivalent words.
trained_model.similarity('word1', 'word2')
This will give the similarity score between word1 and word2, which should be high for similar words. (Both words should be present in the vocabulary.)
- Unmatching word:
trained_model.wv.doesnt_match(['word1','word2','word3'])
Returns the word that does not belong to the list.
- Analogy difference:
trained_model.most_similar(positive=['word1', 'word2'], negative=['word3'], topn=5)
Returns the top 5 words based on: Which word is to word1 as word2 is to word3?
Eg.
trained_model.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=5)
Returns the top 5 words based on: Which word is to woman as homer is to marge?
>>> trained_model.most_similar(positive=['father', 'girl'], negative=['mother'], topn=5)
Output:
[('boy', 0.9240204095840454),
('woman', 0.8523120880126953),
('dog', 0.843636155128479),
('lady', 0.8371353149414062),
('man', 0.8356690406799316)]
- Access the vectors
word_vector = trained_model['word_string']
print(word_vector) # Returns a 300-dim vector
print(word_vector.size)
- To get the frequency of a word in the dataset
trained_model.wv.vocab['word_string'].count
- To get all the words in the vocabulary and the length of the vocabulary
print(list(trained_model.wv.vocab.keys())) #All words
print(len(trained_model.wv.vocab)) #Length
Summarizing:
- Iterating over multiple .txt files with multiple sentences per file, building the vocabulary using all the files and all sentences from each file.
If we have a folder named 'txt_data' and all files are inside txt_data, then dirname should be 'txt_data/' (see the sketch below). Skip this part if you only have a single data file.
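The original post embeds the full script; as a rough sketch of the idea (assuming a folder 'txt_data/' that contains only .txt files, and reusing clean_data and the model settings from above):
import os

class MultiFileCorpus:
    """Yield one cleaned, tokenized sentence at a time from every .txt file in a directory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            if not fname.endswith('.txt'):
                continue
            with open(os.path.join(self.dirname, fname), 'r') as f:
                for line in f:
                    tokens = clean_data(line).split()
                    if tokens:
                        yield tokens

corpus = MultiFileCorpus('txt_data/')
model = Word2Vec(min_count=5, window=5, size=300, workers=cores-1)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=50)
The same class can be paired with random.sample for the sampled-vocabulary variant, or with model.build_vocab(..., update=True) inside a per-file loop for the incremental variant described below.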
- Iterating over multiple .txt files with multiple sentences per file, building the vocabulary using some of the files randomly and random samples of the sentences from a file.
texts_sample (Line 38) and vocab_sample (Line 50) can be any integer values less than len(texts) and len(input_data), respectively. Also, input_data should be a list of .txt file names, e.g. ['f1.txt', 'f2.txt', …].
- Iterating over multiple .txt files with multiple sentences per file, building the vocabulary first with random tokens and updating it after each file.
How to use pre-trained word2vec models?
- Download the pre-trained model.
- Eg. model ‘GoogleNews-vectors-negative300.bin’ can be downloaded from here
- Load the model and perform all of the above tasks
model = KeyedVectors.load_word2vec_format('path/GoogleNews-vectors-negative300.bin', binary=True)
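Once loaded, the model supports the same queries shown earlier; a short usage sketch:
print(model.most_similar('king', topn=5))   # nearest neighbours of 'king' in the pre-trained space
print(model.similarity('king', 'queen'))    # cosine similarity between the two words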
References:
- Explore gensim library:
- Different approaches for word embedding