A beginner's guide to text preprocessing in NLP

Swatimeena
14 min read · Sep 15, 2020

Preprocessing the data is the most essential part of any kind of problem statement in Machine Learning. For image tasks, preprocessing means normalization, making patches out of bigger images so they can be processed efficiently, ignoring patches with little information, and so on. This blog covers the most common yet important preprocessing steps for NLP tasks.

We can perform the following steps on our text data:

  1. Lower casing
  2. Punctuation free text
  3. Stop word removal
  4. Removing numerical data from the text
  5. Removing multiple whitespaces from the text
  6. Removing duplicate characters from the word
  7. Tokenization
  8. Lemmatization
  9. Stemming

Prerequisites: nltk and regular expression (re) packages

Import all the necessary packages first:

Import all packages
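A minimal sketch of the imports, assuming only the packages used later in this post:

import re
import string
import unicodedata

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer, LancasterStemmer

# Download the nltk resources used below (run once)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')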

Lower casing:

  • Analyze the effect of having “Natural”, “NATURAL”, “natural”, and other casing variants in the dataset.
  • There could be many occurrences of “natural” in the dataset, but because of the mixed casing they are treated as different tokens, so there is insufficient evidence for the neural network to effectively learn the weights for the less common variants.
  • This type of issue is bound to happen when your dataset is fairly small, and lowercasing is a great way to deal with such sparsity issues.
  • Converting everything to lower case is therefore the first step in preprocessing, as in the sketch below.
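A minimal sketch (assuming the raw text is stored in string_inp, as in the rest of this post):

string_inp = 'Hey, @all do youuuuu want to learn Natural Language   Processinggg 100% ??'
string_inp = string_inp.lower()
# 'hey, @all do youuuuu want to learn natural language   processinggg 100% ??'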

Punctuation removal

  • To further reduce the noise in the dataset, let’s remove all the punctuation in the dataset (also remove @,%,#, and other special characters).
  • There are many ways to do this, but let's keep it to a straightforward one-liner.
  • Using regular expressions (re):
Special character and punctuation removal using re.

Explanation:

string_no_punct=re.sub(r'[^\w\s]','',string_inp)
  1. re.sub: Returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the string with the replacement '' (an empty string). If the pattern isn't found, the string is returned unchanged.
  2. The pattern here is, r’[^\w\s]’: Let’s break this into parts-

2.1. \w : Matches Unicode word characters

2.2. \s : Matches Unicode whitespace characters

2.3. ^ : placed at the start of the set, it negates the set, so all characters that are not in the set will be matched

3. This means that this line will replace every character (in the given string) that is not (due to ^) present in the set with '', i.e. remove it.

4. The input string starts with the ‘h’ of ‘hey’, which is a word character, so it will not be removed; only characters that are neither word characters nor whitespace characters will be replaced. So @, #, ?, |, ’ will all be removed.

  • Using string:
Removing punctuations by string.punctuation
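A minimal sketch of the string-based approach (str.translate is one common way to apply string.punctuation; the embedded snippet may differ):

import string

string_no_punct = string_inp.translate(str.maketrans('', '', string.punctuation))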

string.punctuation contains

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Punctuation removal output

The output of both methods is the same.

Stop words removal

  • The intuition behind removing stop words is that, by dropping low-information words from the text, we can focus on the important words instead.
  • We can also remove all short words (e.g. length ≤ 2).
  • We can also append other unnecessary words to the stopwords list (see the sketch below).
Stopwords and short words removal
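A minimal sketch using nltk's English stopword list (the appended word and the length threshold here are assumptions; adjust them to your dataset):

from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')
stopwords_list.append('rt')   # append other unnecessary words if needed

words = string_inp.split()
string_inp = ' '.join(w for w in words if w not in stopwords_list and len(w) > 2)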
Stopwords output

So far:

input- 'Hey, @all do youuuuu want to learn Natural Language   Processinggg 100% ??'

Output after lowercasing, removing stopwords and punctuation:

output- 'hey youuuuu want learn natural language processinggg 100'

Removing numerical data from the text

  • Numerical values in the dataset are often redundant as well; we usually want to keep only keywords in our text data.
  • This task is very basic and simple using re.
string_no_num = re.sub(r'[0-9]', '', string_inp)

Explanation:

re.sub(r'[0-9]', '', string_inp)

This line replaces every character in string_inp that matches the set [0-9] with an empty string.

So far we changed our input to

input: hey youuuuu want learn natural language processinggg 100

After executing the above line we will have the below output

hey youuuuu want learn natural language processinggg

Note: If we apply the below pattern to the string

re.sub(r'[^0-9]', '', string_inp)

output:

100

It will replace everything that doesn’t match the set [0–9] due to the ^ operator at the beginning of the set.

Removing multiple whitespaces in the text

  • If we have an input string like
Input string: "hey you want   learn       natural language     processing"
  • We can remove the multiple whitespaces using re expressions.
re.sub(' +', ' ',string_inp)

Explanation

  • ‘+’ : matches one or more occurrences of the preceding character (here, a space)
  • If there are multiple consecutive whitespaces in the text, they will be replaced by a single whitespace using re.sub

What if we have leading and trailing whitespaces?

  • In that case, we can use the strip() method, which returns a copy of the string with both leading and trailing characters removed (the characters given as the argument to strip, whitespace by default).
  • For eg.
>> Input string: " hey you want learn natural language processing "
>> string_inp.strip()
>> Output: "hey you want learn natural language processing"
  • The leading and trailing whitespaces are removed here.
  • Clubbing both
string_inp = re.sub(' +', ' ', string_inp).strip(' ')
or
string_inp = re.sub(' +', ' ', string_inp).strip()

Removing duplicate characters in a word

  • It is possible to have words like ‘youuuuu’ or ‘okkkk’ in the dataset when we work on real-world problem statements.
  • Whether to remove repeated characters depends on the noise content of the dataset, and it should be performed before removing stopwords, so that ‘youuuu’ becomes ‘you’ and can then be dropped by the stopword-removal step.
  • We also need to decide how many repeated characters to allow in a word. For example, if we collapse runs of two or more duplicate characters, the word ‘look’ will become ‘lok’ and ‘cool’ will become ‘col’.
  • Analyze the text data first, then remove the duplicate characters, as in the sketch below.
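A minimal sketch of this step with re (collapsing runs of three or more identical characters down to a single one, as explained next):

string_inp = re.sub(r'(.)\1{2,}', r'\1', string_inp)
# 'youuuuu' -> 'you', 'okkkk' -> 'ok'; 'look' and 'cool' are left untouched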

Explanation

(.) matches and captures any single character; \1{2,} then matches the same character two or more additional times. In the replacement, \1 refers back to the first capture group.

  • Be careful with the quantifier values here: otherwise, if you have the word “looook”, the above line will make it “lok”, so change accordingly !! 🤓
input : 'are youuu okkkk'
output : 'are you ok'

Tokenization

  • In NLP, most operations are token-based (tokens are typically the words in a sentence)
  • Tokenization means splitting a string of words into individual words (tokens)
  • Tokenization can be done in many ways: word tokenization, sentence tokenization, whitespace tokenization, and so on.
  • The most commonly used tokenizers are offered by nltk and spaCy.
  • Advanced, subword-level tokenization is used by transformer models like BERT, DistilBERT, and many more.
  • Word tokenization:
Word tokenization
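A minimal sketch of both options (assuming string_inp holds the cleaned sentence):

from nltk.tokenize import word_tokenize

tokens = word_tokenize(string_inp)   # nltk word tokenizer
tokens = string_inp.split()          # plain whitespace split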

We can tokenize a sentence using nltk or spacy or directly split() method.

So far after all the above preprocessing we have the string_input as:

input: hey you want learn natural language processing

The output after word_tokenize() from nltk:

output: ['hey', 'you', 'want', 'learn', 'natural', 'language', 'processing']

The output from split() method :

['hey', 'you', 'want', 'learn', 'natural', 'language', 'processing']

Both of the above methods behave like space-based tokenization for this input (the punctuation has already been removed), hence the output is the same. The output is a list of all the words in the sentence.

  • Sentence tokenization:

For some tasks, where we want to calculate metrics at the sentence level, we need sentence tokenization as well.

Sentence tokenization
input: "hey you want to learn natural language processing! let's go"
output: ['hey you want to learn natural language processing!', "let's go"]

The output will be a list of sub sentences in the sentence.
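A minimal sketch of the call that produces the output above:

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize("hey you want to learn natural language processing! let's go")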

Explanation:

The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which is pre-trained and therefore knows very well at which characters and punctuation marks a sentence begins and ends. (For this to work, you need to keep the punctuation in the sentence, i.e. skip the punctuation-removal step.)

Lemmatization

  • Lemmatization is the process of mapping a word to its root form (a form that actually exists in the language).
  • Libraries like nltk and spaCy have different kinds of lemmatizers implemented.
  • Lemmatization using nltk:
Import WordNetLemmatizer from nltk
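A minimal sketch of the setup (assuming the WordNet data has been downloaded):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('feet'))   # 'foot'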

Points to remember:

  1. ‘pos’: the part-of-speech tag of the word, such as noun, verb, adjective, and so on (the default is noun).
  2. The lemmatizer returns the input word unchanged if it cannot be found in WordNet.
  3. The importance of the ‘pos’ argument in lemmatizer.lemmatize():
input: 'better'

After performing:

lemmatizer.lemmatize('better')

The output will be:

'better'

But, after giving the pos tag, the output will be different

lemmatizer.lemmatize('better',pos='a')
output: 'good'

4. What if we give an entire string to the lemmatizer object?

  • Let’s have a look:
input : 'I am better'
lemmatizer.lemmatize(input)
output: 'I am better'
  • If we give pos tag,
input : 'I am better'
lemmatizer.lemmatize(input,pos='a')
output: 'I am better'
  • To have a better root form of a word, we need to tokenize the sentence (by any method) first and then give the pos tag to each word, and then we can apply the lemmatizer.

5. How can we give appropriate pos tags to a word?

  • We can get the pos tags in so many ways (nltk and spaCy)
  • Using nltk:
>> nltk.pos_tag(['caring'])
>> [('caring', 'VBG')]
>> nltk.pos_tag(['care'])
>> [('care', 'NN')]
  • nltk.pos_tag(): accepts only a list (a list of words), even if it's a single word, and returns a list of (word, pos tag) tuples.
  • First, we need to convert the pos tags returned by nltk.pos_tag into the tag format that the lemmatizer accepts.
def map_wordnet_pos(word):
    nltk_tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(nltk_tag, wordnet.NOUN)

Explanation:

Define a dictionary that maps the first character of the nltk.pos_tag to wordnet pos tags that lemmatizer accepts.

Return the mapped value for the tag; if there is no match (for example, ‘They’ gets the ‘DT’ tag), this function returns wordnet.NOUN (which is the lemmatizer's default anyway).

  • Now, find pos tag w.r.t each word in the text and obtain the root form.
print(" ".join([lemmatizer.lemmatize(w, map_wordnet_pos(w)) for w in nltk.word_tokenize(string_inp)]))
or
print(" ".join([lemmatizer.lemmatize(w, map_wordnet_pos(w)) for w in string_inp.split()]))

for eg:

>>> input: "The striped bats are hanging on their feet for best"
>>> output: "The strip bat be hang on their foot for best"

Stemming

  • Stemming is the process of reducing inflection in words to their root form. The “root” in this case may not be a real root word, but just a canonical form of the original word.
  • There are different algorithms for stemming like Porter, Lancaster, Krovetz Stemmer, Lovins Stemmer, Dawson Stemmer, Xerox Stemmer, N-Gram Stemmer, and so on.
  • Let’s see the two most commonly used stemmers:
Import Stemmer objects
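A minimal sketch of the setup:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()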
>> input = 'cats'
>> print(porter.stem(input), lancaster.stem(input))
>> cat cat
  • The string input we got so far is:
>>  string_inp= 'hey want learn natural language processing'
>> print(" ".join([porter.stem(word) for word in string_inp.split()]))
>> output: 'hey want learn natur languag process'
>> print(" ".join([lancaster.stem(word) for word in string_inp.split()]))
>> output: 'hey want learn nat langu process'
  • What if we give an entire string to the Stemmer object?

Like lemmatization, here also we need to tokenize first, and then we can apply the stemmer.

for eg:

>> porter.stem(string_inp)
>> output: 'hey want learn natural language process'

Difference between stemming and lemmatizing:

  • The lemmatization process doesn’t just chop things off, it actually transforms words to the actual root.
  • The lemmatization process brings context to the words (see the example above, in which stemming gives root words that don't make sense, so there can be a loss of information due to the stemming process).
  • The stemming process gives rise to two errors: over-stemming and under-stemming.

Feel free to explore both topics further and apply these preprocessing steps to your dataset. In practice, stemming and lemmatization are often skipped, as they rarely produce a significant change in the performance of the network. Again, that depends on the task you have, and learning never hurts, right? ✌🏻

Summing it up:

  • Noise Removal: This is the first step to perform in any NLP task. It includes punctuation removal, special character removal, number removal, HTML formatting removal, domain-specific keyword removal (e.g. ‘RT’ for retweet), source code removal, header removal, and more.
  • Let’s write a complete function for cleaning the data:
Input: data=['@a bank of clouds was building to the ##northeast!!',
'Her bank account was rarely over two hundred???',
'She sat on the river bank across from a series of wide large steps leading up a hill to the park where the Arch stood framed against a black sky',
'Deidres gaze was caught by the bank of windows lining one side of the penthouse.',
'How could a man with four million in the bank be in financiallll danger',
'Jackson picked up an apple from the bowl of fruit tossed it in the air caught it then bit into it',
'If he walked beneath the #apple_trees in the orchard would she be waiting for him with her sweet smile as she had the day they met',
'The mistletoe so #extensively used in England at Christmas is largely derived from the apple orchards of Normandy a quantity is also sent from the apple orchards of Herefordshire',
'About 80000 went in payments on all the estatessss to the Land Bank about 30000 went for the upkeep of the estate near Moscow the town house and the allowance to the three princesses about 15000 was given in pensions and the same amount for asylums 150000 alimony was sent to the countess about went for interest on debts',
'Apple was started in 1976 by Steve Jobs and Steve Wozniakkkk Before they made the company they sold blue boxes which had telephone buttons on them People could use them to make telephone calls from payphones without paying any money It did this by pretending to be a telephone operator',
'Apple Inc. is a mmmmultinational company that makes computer hardware (the Macintoshes), software (macOS, iOS, watchOS and tvOS), and mobile devices (iPod, iPhone and iPad) like music players. Apple calls its computers Macintoshes or Macs, and it calls its laptops MacBooks. Their popular line of mobile music players is called iPod, their smartphone line is called iPhone and their tablet line is called iPad. Apple sells their products all around the world.[5] Apple Inc. used to be called #Apple @Computer, Inc., but Apple changed their name after introducing the original iPhone']
  • Print out and have a look at each sentence:
>> for sent in data:
       print(sent, '\n')
>> OUTPUT:
@a bank of clouds was building to the ##northeast!!

Her bank account was rarely over two hundred???

She sat on the river bank across from a series of wide large steps leading up a hill to the park where the Arch stood framed against a black sky

Deidres gaze was caught by the bank of windows lining one side of the penthouse.

How could a man with four million in the bank be in financiallll danger

Jackson picked up an apple from the bowl of fruit tossed it in the air caught it then bit into it

If he walked beneath the #apple_trees in the orchard would she be waiting for him with her sweet smile as she had the day they met

The mistletoe so #extensively used in England at Christmas is largely derived from the apple orchards of Normandy a quantity is also sent from the apple orchards of Herefordshire

About 80000 went in payments on all the estatessss to the Land Bank about 30000 went for the upkeep of the estate near Moscow the town house and the allowance to the three princesses about 15000 was given in pensions and the same amount for asylums 150000 alimony was sent to the countess about went for interest on debts

Apple was started in 1976 by Steve Jobs and Steve Wozniakkkk Before they made the company they sold blue boxes which had telephone buttons on them People could use them to make telephone calls from payphones without paying any money It did this by pretending to be a telephone operator

Apple Inc. is a mmmmultinational company that makes computer hardware (the Macintoshes), software (macOS, iOS, watchOS and tvOS), and mobile devices (iPod, iPhone and iPad) like music players. Apple calls its computers Macintoshes or Macs, and it calls its laptops MacBooks. Their popular line of mobile music players is called iPod, their smartphone line is called iPhone and their tablet line is called iPad. Apple sells their products all around the world.[5] Apple Inc. used to be called #Apple @Computer, Inc., but Apple changed their name after introducing the original iPhone
  • Write a function that takes a string, preprocesses it, and returns the preprocessed string.
def clean_data(w):
    w = w.lower()
    w = re.sub(' +', ' ', w).strip(' ')
    w = re.sub(r'[^\w\s]', '', w)
    w = re.sub(r"([0-9])", r" ", w)
    w = re.sub(r"(.)\1{2,}", r"\1", w)
    # stopwords_list is assumed to be defined earlier, e.g. stopwords.words('english')
    words = w.split()
    clean_words = [word for word in words if (word not in stopwords_list) and len(word) > 2]
    return " ".join(clean_words)
  • Now, we can map this function directly to the list of sentences:
preprocessed_sent=list(map(clean_data,data))
  • Preprocessed Output:
>> for i in range(len(preprocessed_sent)):
       print(i+1, "Original text: \n{}\n\nPreprocessed text: \n{}\n".format(data[i], preprocessed_sent[i]))
>>
1 Original text:
@a bank of clouds was building to the ##northeast!!

Preprocessed text:
bank clouds building northeast

2 Original text:
Her bank account was rarely over two hundred???

Preprocessed text:
bank account rarely two hundred

3 Original text:
She sat on the river bank across from a series of wide large steps leading up a hill to the park where the Arch stood framed against a black sky

Preprocessed text:
sat river bank across series wide large steps leading hill park arch stood framed black sky

4 Original text:
Deidres gaze was caught by the bank of windows lining one side of the penthouse.

Preprocessed text:
deidres gaze caught bank windows lining one side penthouse

5 Original text:
How could a man with four million in the bank be in financiallll danger

Preprocessed text:
could man four million bank financial danger

6 Original text:
Jackson picked up an apple from the bowl of fruit tossed it in the air caught it then bit into it

Preprocessed text:
jackson picked apple bowl fruit tossed air caught bit

7 Original text:
If he walked beneath the #apple_trees in the orchard would she be waiting for him with her sweet smile as she had the day they met

Preprocessed text:
walked beneath apple_trees orchard would waiting sweet smile day met

8 Original text:
The mistletoe so #extensively used in England at Christmas is largely derived from the apple orchards of Normandy a quantity is also sent from the apple orchards of Herefordshire

Preprocessed text:
mistletoe extensively used england christmas largely derived apple orchards normandy quantity also sent apple orchards herefordshire

9 Original text:
About 80000 went in payments on all the estatessss to the Land Bank about 30000 went for the upkeep of the estate near Moscow the town house and the allowance to the three princesses about 15000 was given in pensions and the same amount for asylums 150000 alimony was sent to the countess about went for interest on debts

Preprocessed text:
went payments estates land bank went upkeep estate near moscow town house allowance three princesses given pensions amount asylums alimony sent countess went interest debts

10 Original text:
Apple was started in 1976 by Steve Jobs and Steve Wozniakkkk Before they made the company they sold blue boxes which had telephone buttons on them People could use them to make telephone calls from payphones without paying any money It did this by pretending to be a telephone operator

Preprocessed text:
apple started steve jobs steve wozniak made company sold blue boxes telephone buttons people could use make telephone calls payphones without paying money pretending telephone operator

11 Original text:
Apple Inc. is a mmmmultinational company that makes computer hardware (the Macintoshes), software (macOS, iOS, watchOS and tvOS), and mobile devices (iPod, iPhone and iPad) like music players. Apple calls its computers Macintoshes or Macs, and it calls its laptops MacBooks. Their popular line of mobile music players is called iPod, their smartphone line is called iPhone and their tablet line is called iPad. Apple sells their products all around the world.[5] Apple Inc. used to be called #Apple @Computer, Inc., but Apple changed their name after introducing the original iPhone

Preprocessed text:
apple inc multinational company makes computer hardware macintoshes software macos ios watchos tvos mobile devices ipod iphone ipad like music players apple calls computers macintoshes macs calls laptops macbooks popular line mobile music players called ipod smartphone line called iphone tablet line called ipad apple sells products around world apple inc used called apple computer inc apple changed name introducing original iphone
  • We can also add a stemmer or lemmatizer to the function.
  • If the dataset contains strings like the one below, or if we need to convert Unicode characters (the Unicode Standard includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc.) to ASCII:

Check out this blog for a complete understanding of unicode and normalization.

For eg:

string='@ôÖ__ôÖ-->ôÖ__ôaô'
  • Use this function:
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
  • Complete function:
Preprocess the text data
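A minimal sketch of what the complete function could look like, combining the pieces above (the helper name preprocess and the exact ordering of the steps are assumptions, not the original code; imports are assumed from the top of the post):

stopwords_list = stopwords.words('english')

def preprocess(w):
    # Normalize unicode characters to ASCII first
    w = unicode_to_ascii(w)
    # Lowercase and collapse extra whitespace
    w = re.sub(' +', ' ', w.lower()).strip()
    # Remove punctuation / special characters, then digits
    w = re.sub(r'[^\w\s]', '', w)
    w = re.sub(r'[0-9]', ' ', w)
    # Collapse runs of three or more repeated characters
    w = re.sub(r'(.)\1{2,}', r'\1', w)
    # Drop stopwords and very short words
    words = [word for word in w.split() if word not in stopwords_list and len(word) > 2]
    return ' '.join(words)

preprocessed_sent = [preprocess(sent) for sent in data]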
  • Run nltk.download('all') in a .py script or Jupyter notebook, or download only the resources you need, such as nltk.download('stopwords'), after installing nltk.

Enjoy and feel free to give suggestions. Please let me know if I missed something in the blog🤓✌🏻

References:

  1. Explore the libraries
  2. Regular expressions (re)
  3. Unicodedata
  4. Lemmatization approaches

