Natural language processing

Ekta Kumari
Aug 7, 2021

What is NLP?

NLP stands for natural language processing. A simple way to understand it is through an example we all know: Alexa, a voice-controlled virtual assistant. We use Alexa in our day-to-day lives. We say “Alexa, turn on the light” and the light turns on, or “Alexa, switch off the fan”, “Alexa, play music”, and so on, and Alexa performs each action we ask for. But have you ever wondered how Alexa understands our language well enough to do what we tell it?

It’s because of NLP. Alexa and other machines can understand human language because of natural language processing. In other words, NLP helps computers communicate with humans in their own language and perform other language-related tasks. More broadly, NLP is an approach to processing, analyzing, and understanding large amounts of text data.

Why NLP?

Look around: messaging apps (WhatsApp), social media (Instagram, Facebook, YouTube), blogs, Google search, and many other channels generate large amounts of text data every second, adding up to billions of pieces of text every day. Common manual approaches can no longer keep up with this volume of highly unstructured data, and this is where NLP comes in.

To handle large amounts of text data:

Suppose you have to identify the sentiment (positive, negative, or neutral) of a given sentence manually. It’s very easy and takes a few seconds. Now suppose you have millions of sentences and have to perform the same sentiment analysis. How long would it take you to complete the task? Well… you get the point.

Today, machines can analyze more language-based data than humans can, without fatigue and in a consistent way. NLP plays a big role in handling data at this scale, and with cloud and distributed computing it can be applied to large amounts of text at an unprecedented speed.

To structure highly unstructured data sources:

Human language is surprisingly complex and diverse. We communicate in endless ways, both verbally and in writing. Not only are there hundreds of languages and dialects, but every language has its own grammar and syntax rules, terms, and slang.

For many applications, such as speech recognition and text analytics, NLP helps resolve ambiguity in language and adds useful numeric structure to the data.

Applications of NLP:

Sentiment analysis
Chatbots
Speech recognition
Machine translation
Spell checking
Keyword searching
Advertisement matching

Components of NLP:

NLP is divided into two major components.

NLU: NLU stands for natural language understanding. Understanding refers to mapping a given natural-language input into a useful representation and analyzing the different aspects of the language. NLU is usually the harder of the two, because natural language is ambiguous and depends heavily on context.

NLG: NLG stands for natural language generation. Generation is the process of producing meaningful phrases and sentences in natural language from some internal representation.

Steps involved in NLP:

Now let’s go through these steps in detail, one by one.

Tokenization:

Tokenization breaks raw text into smaller units called tokens. It plays a central role in dealing with text data: it is the step in NLP that cuts long text into small pieces. Tokens can be words, characters, or subwords, and text can also be split into sentences. In its most common form, tokenization is the process of splitting text into words. For example:

Example: He is my best friend.

['He', 'is', 'my', 'best', 'friend']

Implementation of tokenization:

Word tokenization:

import re
text = """ Ash is my best friend.i met him when i was in 7 standard. he was good in singing.
           now he becomes a singer.i am happy for him. 
           today i am missing him so much. our friendship is still going."""
# raw string so that \w reaches the regex engine unchanged
tokens = re.findall(r"[\w']+", text)
tokens

In the code above, re is Python’s regular expression module (regular expressions are also called REs, regexes, or regex patterns). The function re.findall() finds every substring that matches the pattern passed to it and returns the matches as a list. In the pattern, “\w” represents any word character (a letter, a digit, or an underscore), the brackets also allow an apostrophe, and “+” means one or more times.

Sentence tokenization

import re
text = """ Ash is my best friend.i met him when i was in 7 standard. he was good in singing.
           now he becomes a singer.i am happy for him. 
           today i am missing him so much. our friendship is still going."""
sentences = re.compile('[.]').split(text)
sentences

To perform sentence tokenization, we can split the text with a regular expression. In the code above, we use re.compile() to compile the pattern [.] and then call split() on it, which means the text is split wherever a period is encountered.
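Splitting on a period alone leaves leading whitespace on each piece and an empty string after the final period. The sketch below is one possible refinement, not part of the original code: the helper name split_sentences is hypothetical, and it assumes that '.', '!' and '?' always end a sentence, which is not true for abbreviations such as "Dr." or for decimal numbers.

```python
import re

def split_sentences(text):
    """Split text on '.', '!' or '?' and return the non-empty, trimmed pieces."""
    pieces = re.split(r"[.!?]", text)
    # Drop empty pieces (e.g. after the last period) and trim whitespace
    return [piece.strip() for piece in pieces if piece.strip()]

text = """ Ash is my best friend.i met him when i was in 7 standard. he was good in singing.
           now he becomes a singer."""

for sentence in split_sentences(text):
    print(sentence)
```

For real text, a trained sentence tokenizer usually copes better with abbreviations and numbers than a hand-written pattern like this one.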

Stemming:

Stemming is the process of normalizing words into their base or root form. The algorithms that do this are commonly referred to as stemming algorithms or stemmers. Stemming plays an important role in NLP pipelines. For example, stemming reduces the words “waits”, “waited”, and “waiting” to their root word, “wait”.

Errors in stemming:

There are two types of error in stemming: overstemming and understemming. Overstemming occurs when two words that have different stems are stemmed to the same root. Understemming occurs when two words that share the same stem are stemmed to different roots.
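To make the two error types concrete, here is a minimal sketch of a naive suffix-stripping stemmer. It is a toy written for illustration only: the function toy_stem and its suffix list are assumptions, not the Porter algorithm or any library API.

```python
# Hypothetical suffix list for the toy stemmer, checked in this order
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Intended behaviour: related forms share one root
print([toy_stem(w) for w in ["waits", "waited", "waiting"]])

# Overstemming: "news" and "new" are different words but get the same root
print(toy_stem("news"), toy_stem("new"))

# Understemming: "run" and "running" are related but get different roots
print(toy_stem("run"), toy_stem("running"))
```

Real stemmers use more careful rules to reduce both kinds of error, but they cannot remove them completely.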

Implementation of stemming

# import the stemmer class
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed 
words = ["waiting","connections","respected","programmer"]
for w in words:
    print(w, " : " , ps.stem(w))

In the code above we use NLTK, so let’s first understand what it is. NLTK stands for Natural Language Toolkit, a Python library for building programs that work with natural language. It can perform operations such as tokenizing, stemming, parsing, tagging, and semantic reasoning. The nltk.stem package performs stemming using different classes. PorterStemmer is one of them, so we import it. The Porter stemmer uses suffix stripping to produce stems and is known for its speed and simplicity.

A stemming algorithm works by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always. So let’s now look at the concept of lemmatization.

Lemmatization:

Lemmatization takes the morphological analysis of the word into consideration. To do so, it needs a detailed dictionary that the algorithm can look through to link each form back to its original or root word, which is known as the lemma. In other words, lemmatization groups together the different inflected forms of a word under its lemma. It is somewhat similar to stemming in that it maps several words to one common root, but the major difference is that the output of lemmatization is a proper word. For example, a lemmatizer should map the words “gone”, “going”, and “went” to “go”. That would not be the output of stemming.

Implementation of lemmatization

from nltk.stem import WordNetLemmatizer
#creating WordNetLemmatizer object

wnl = WordNetLemmatizer()

# lemmatize each word (every word is treated as a noun by default)
word_list = ['affected', 'flying', 'waits', 'tried','feet']

for word in word_list:
    print(word + " -> " + wnl.lemmatize(word))

In the code above we import WordNetLemmatizer, so let’s first understand what it is. WordNet is a lexical database that provides semantic relationships between words. It was originally built for English, and wordnets have since been created for many other languages. WordNet-based lemmatization is one of the earliest and most commonly used lemmatization techniques. We also see that words like ‘flying’ and ‘tried’ remained the same after lemmatization. This is because lemmatize() treats every word as a noun unless told otherwise. To overcome this, we use POS tags.
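The effect of the part of speech can be illustrated with a small dictionary lookup. This is only a sketch: the table LEMMAS and the function toy_lemmatize are hypothetical stand-ins for the WordNet lookup, with a handful of hand-written entries.

```python
# Hypothetical hand-written table: (word, part of speech) -> lemma
LEMMAS = {
    ("flying", "v"): "fly",
    ("tried", "v"): "try",
    ("waits", "v"): "wait",
    ("feet", "n"): "foot",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return LEMMAS.get((word.lower(), pos), word)

for word in ["flying", "tried", "feet"]:
    print(word, "| as noun:", toy_lemmatize(word, "n"), "| as verb:", toy_lemmatize(word, "v"))
```

With NLTK, the same idea is expressed by passing the part of speech to the lemmatizer, for example wnl.lemmatize(word, pos="v") to treat the word as a verb.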

POS tags:

POS tags stands for part-of-speech tags. A POS tag is the grammatical type of a word: verb, noun, adjective, adverb, article, and so on. It indicates how a word functions, both in meaning and grammatically, within the sentence. To understand this, let’s see the example below:

The cat killed the rat

In this example, “the” is a determiner, “cat” is a noun, “killed” is a verb, “the” is again a determiner, and “rat” is a noun. So POS tagging labels each word with its grammatical role.

Implementation of POS tags

from nltk import pos_tag
# no trailing period, so that split() does not produce the token "rat."
sentence = "the cat killed the rat".split()
print("after split:",sentence)
tokens_tag = pos_tag(sentence)
print("after tagging:",tokens_tag)

But sometimes a word can have more than one part of speech, depending on the context in which it is used. For example, in “google something on the internet”, “google” is used as a verb even though it is originally a proper noun. These are some of the problems that occur while processing natural language. To help with such challenges, we have named entity recognition.

Named entity recognition

Named entity recognition (NER) is the process of detecting named entities such as person names, company names, quantities, and locations. It has three steps: noun phrase identification, phrase classification, and entity disambiguation. Consider this example: “Google CEO Sundar Pichai introduced the new pixel3 at New York central mall.” Here, Google is identified as an organization, Sundar Pichai as a person, New York as a location, and central mall is also identified as an organization.

# Step one: import nltk and download the necessary packages

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Step two: load the data

sentence = "google ceo Sundar Pichai introduced the new pixel3 at New York central mall."

# Step three: tokenize, find parts of speech and chunk words

for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

Looking at the output, NER found a total of 2 entities: a name and a geographical location. Now let’s see what happened in the code. First, we imported the nltk package and downloaded all the necessary resources. Then we defined the sentence as a Python variable. In step three, we use nltk.sent_tokenize() to split the text into sentences. Next, we tokenize each sentence into words with nltk.word_tokenize() and use nltk.pos_tag() to find the part of speech of each word, which returns a list of tuples of words and their POS tags.

Finally, we perform a chunking operation with nltk.ne_chunk(), which returns a nested nltk.tree.Tree object that we can traverse to get to the named entities.

Implementation of NER using spacy:

import spacy
from spacy import displacy

NER = spacy.load("en_core_web_sm")

text = "Google CEO Sundar Pichai introduced the new pixel3 at New York central mall."
text1= NER(text)

for word in text1.ents:
    print(word.text,word.label_)
    
    
displacy.render(text1,style="ent",jupyter=True)

In the code above we use spaCy, so let’s understand what it is. spaCy is an open-source library for advanced natural language processing. Once spaCy and the en_core_web_sm model are installed, we load the model into the variable NER. Next, we pass the text through the loaded model and store the result in a variable named text1. We then iterate over text1.ents to find the entities and print each entity’s text and label. Lastly, we display the annotated text with displacy.

Once we have divided the sentences into tokens, done the stemming and lemmatization, and added the POS tags and named entities, it’s time to group the tokens back together and make sense of them. For that, we have chunking.

Chunking:

Chunking means picking up individual pieces of information and grouping them into bigger pieces, which are known as chunks. In the context of NLP, chunking means grouping words and tokens into chunks. Let’s take an example:

We caught the pink panther.

Here, “the” is a determiner, “pink” is an adjective, and “panther” is a noun, and together they are chunked into a noun phrase. This helps in getting insights and meaningful information from the given text.

import nltk
text = "we caught the pink panther"
tokens = nltk.word_tokenize(text)
print(tokens)
tag = nltk.pos_tag(tokens)
print(tag)
# noun phrase: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()    # It will draw the pattern graphically which can be seen in Noun Phrase chunking

In the code above, we import nltk and store our example sentence in the variable text. Then we tokenize the sentence and find the POS tag of each word. Next, we define a chunk grammar for noun phrases and parse the tagged tokens with nltk.RegexpParser. Depending on the task, we may need to chunk nouns, verbs, adjectives, and coordinating conjunctions from the sentence. An important use of chunking is extracting named entities from a sentence.

If you need to chunk nouns, verbs, and adjectives from the sentence, you can use a rule like the one below:

Chunk: {<NN.?>*<VBD.?>*<JJ.?>?}

The rule can be changed. How you define your chunks is up to you, but the definition has to fit the sentences you are working with.

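The meaning of the symbols is as follows:

. matches any single character, so <NN.?> matches NN as well as tags such as NNS or NNP
? makes the preceding item optional (0 or 1 occurrences)
* matches 0 or more occurrences of the preceding item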

Looking at the tree drawn by result.draw(), we can see that “pink” and “panther” are two different tokens but are grouped into the same noun phrase.
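To show what a chunk rule does under the hood, here is a rough sketch that applies the noun-phrase pattern (optional determiner, any number of adjectives, then a noun) with Python’s re module. It is a simplified illustration and not how nltk.RegexpParser is implemented: the function chunk_noun_phrases is hypothetical, and the POS tags are assigned by hand rather than by a tagger.

```python
import re

def chunk_noun_phrases(tagged):
    """Group tokens matching <DT>?<JJ>*<NN> into noun-phrase strings."""
    # Encode the tag sequence as one string, e.g. "<PRP><VBD><DT><JJ><NN>"
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(<DT>)?(<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Convert character offsets back to token positions by counting "<"
        start = tag_string[:match.start()].count("<")
        end = start + match.group(0).count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

# Hand-assigned tags for the example sentence (an assumption, not tagger output)
tagged = [("we", "PRP"), ("caught", "VBD"), ("the", "DT"),
          ("pink", "JJ"), ("panther", "NN")]
print(chunk_noun_phrases(tagged))
```

The determiner, the adjective, and the noun come back as one unit, while “we” and “caught” stay outside the chunk.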

Cosine similarity

Cosine similarity is a metric for measuring text similarity: it determines how close two texts are to each other in terms of their content. Each document is represented as a vector in an n-dimensional space, for example with one dimension per word in the vocabulary. Mathematically, cosine similarity is the cosine of the angle between two n-dimensional vectors. For word-count vectors, which have no negative entries, the cosine similarity between two documents ranges from 0 to 1. A score of 1 means the two vectors have the same orientation, while a score of 0 means the two documents share no words at all.

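The cosine similarity between two non-zero vectors A and B is:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

To understand cosine similarity, let’s take an example in which we calculate the cosine similarity between two sentences and see how it works.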

sentence_1 = "Python is an interpreted high-level general-purpose programming language."
sentence_2 = "Python is a general-purpose programming language"

data = [sentence_1, sentence_2]

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
vector_matrix = count_vectorizer.fit_transform(data)
vector_matrix

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tokens = count_vectorizer.get_feature_names_out()
tokens

vector_matrix.toarray()

import pandas as pd

def create_dataframe(matrix, tokens):

    sentence_names = [f'sentence_{i+1}' for i, _ in enumerate(matrix)]
    df = pd.DataFrame(data=matrix, index=sentence_names, columns=tokens)
    return(df)

create_dataframe(vector_matrix.toarray(),tokens)

First we need to count how often each word appears in each sentence. For this we use CountVectorizer, which is provided by the scikit-learn library. We define the sentences and call fit_transform() on them. The result is a sparse matrix, which does not print its contents directly, so we convert it to a NumPy array with toarray() and display it alongside the token names. To get a clear view of the vectorized data together with its tokens, we create a pandas DataFrame.

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity_matrix = cosine_similarity(vector_matrix)
create_dataframe(cosine_similarity_matrix,['sentence_1','sentence_2'])

Here, scikit-learn provides the cosine_similarity function, which calculates the cosine similarity for us, so we can easily find the cosine similarity between sentence_1 and sentence_2. Looking at the table above, the cosine similarity between sentence_1 and sentence_2 is 0.77.
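As a cross-check, the same score can be computed by hand with the standard library only. This is a sketch under one assumption: the helper bag_of_words imitates CountVectorizer’s default behaviour (lowercasing and keeping tokens of two or more word characters), and both function names are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase tokens of at least two word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b)

sentence_1 = "Python is an interpreted high-level general-purpose programming language."
sentence_2 = "Python is a general-purpose programming language"

score = cosine(bag_of_words(sentence_1), bag_of_words(sentence_2))
print(round(score, 2))  # 0.77
```

The first sentence yields 10 tokens and the second 6 (the single-letter word “a” is dropped), all 6 of which are shared, so the score is 6 / (sqrt(10) * sqrt(6)), which is about 0.77.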

Jaccard similarity

Jaccard similarity measures the similarity between two sentences as the share of words they have in common out of all the distinct words they contain. It is also known as the Jaccard index or intersection over union. It is defined as the size of the intersection of the two sentences’ word sets divided by the size of their union. The Jaccard similarity score ranges from 0 to 1. If the two sentences contain exactly the same set of words, the Jaccard similarity is 1. If they have no words in common, it is 0.

The mathematical representation of Jaccard similarity for two word sets A and B is:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

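Now let’s take an example: we will calculate the intersection and union of two sentences and find the Jaccard similarity between sent1 and sent2.

sent1 = he is going abroad for studying

sent2 = he is going abroad

The intersection {he, is, going, abroad} has 4 words and the union {he, is, going, abroad, for, studying} has 6 words, so the Jaccard similarity between the two sentences is 4 / 6 ≈ 0.67.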

Python implementation of Jaccard similarity

def Jaccard_Similarity(sent1, sent2): 
    
    # List the unique words in a sentence
    words_sent1 = set(sent1.lower().split()) 
    words_sent2 = set(sent2.lower().split())
    
    # Find the intersection of words list of sent1 & sent2
    intersection = words_sent1.intersection(words_sent2)

    # Find the union of words list of sent1 & sent2
    union = words_sent1.union(words_sent2)
        
    # Calculate Jaccard similarity score 
    # using length of intersection set divided by length of union set
    return float(len(intersection)) / len(union)


sent1 = "he is going abroad for studying"
sent2 = "he is going abroad"

Jaccard_Similarity(sent1,sent2)

In the code above, we define a function to calculate the Jaccard similarity between two sentences. As we can see, the two sentences have a Jaccard similarity of about 0.67 (4 shared words out of 6 distinct words).

Conclusion:

Natural language processing is a field of computer science and artificial intelligence that focuses mainly on the interaction between computers and human language. Research in this area is ongoing, and many developments are making machines smarter at learning and understanding human language. It is one of today’s growing technologies.
