Natural language processing
What is NLP?
NLP stands for natural language processing. To understand it in a simple way, let’s take an example we all know: Alexa, a voice-controlled virtual assistant. We use Alexa in our day-to-day lives. We say “Alexa, turn on the light” and the light turns on, or “Alexa, switch off the fan”, “Alexa, play music”, and much more, and Alexa performs every action we ask for. But wait, did you ever think about how Alexa understands our language and does whatever we tell it to do?
It’s because of NLP (natural language processing). Alexa and other machines can understand people’s language because of natural language processing. In other words, natural language processing helps machines or computers to communicate with humans in their own language and to perform other language-related tasks. More generally, NLP is an approach to processing, analyzing, and understanding large amounts of text data.
Look around us: messaging apps (WhatsApp), social media (Instagram, Facebook, YouTube), blogs, Google search, and many other channels. Every second, these channels generate a large amount of text data, and every day billions of pieces of text are produced. To understand such a large volume of highly unstructured text, we can no longer rely on common manual approaches, and this is where NLP comes in.
To handle large amounts of text data:
Suppose you have to manually identify the sentiment (positive, negative, or neutral) of a given sentence. It’s very easy, and you can do it in a few seconds. Now suppose you have millions of sentences and must perform sentiment analysis on all of them. How long will it take you to complete the task? Well... you get the point.
In today’s world, machines can analyze more language-based data than humans can, without getting tired and in a consistent and, ideally, unbiased way. NLP plays a big role in handling such large amounts of data. Combined with cloud or distributed computing, NLP can now be applied to large amounts of text data at unprecedented speed.
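To make the sentiment example concrete, here is a minimal sketch of how a program could label many sentences automatically. The word lists and the function name toy_sentiment are assumptions made up for this illustration; real sentiment analysis uses much larger lexicons or trained models.

```python
# Toy sentiment lexicon: these word lists are illustrative assumptions,
# not a real sentiment resource.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "hate", "terrible", "poor"}


def toy_sentiment(sentence):
    """Label a sentence by counting positive and negative words."""
    # Lowercase, split on whitespace, and strip simple punctuation
    # so that "happy." still matches "happy".
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


sentences = [
    "I am happy for him.",
    "The service was terrible.",
    "He met him in school.",
]
labels = [toy_sentiment(s) for s in sentences]
print(labels)
```

The same loop works unchanged whether the list holds three sentences or three million, which is the point of letting a machine do the task.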
To structure highly unstructured data sources:
Human language is surprisingly complex and diverse. We communicate in endless ways, both verbally and in writing. Not only are there hundreds of languages and dialects, but every language has its own set of grammar and syntax rules, terms, and slang.
For many applications, such as speech recognition and text analytics, NLP helps to resolve the ambiguity in language and adds useful numeric structure to the data.
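One simple way to add numeric structure to text is a bag-of-words count, sketched below. This is only one possible representation, chosen here for illustration; the helper name bag_of_words and the two sample sentences are assumptions, not part of any specific library.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Count how often each lowercase word occurs in a sentence."""
    tokens = re.findall(r"[\w']+", sentence.lower())
    return Counter(tokens)


docs = ["He is my best friend.", "My friend is a singer."]

# Shared vocabulary: every distinct word across all documents, sorted.
vocab = sorted(set().union(*(bag_of_words(d) for d in docs)))

# One count vector per document, aligned with the vocabulary.
vectors = [[bag_of_words(d)[w] for w in vocab] for d in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each sentence becomes a list of numbers of the same length, which is the kind of structure that later processing steps can work with.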
Applications of NLP:
Components of NLP:
NLP is divided into two major components.
NLU: NLU stands for natural language understanding. Understanding generally refers to mapping the given natural language input into a useful representation and analyzing the different aspects of the language. Natural language understanding is usually the harder of the two, because it takes a lot of time and resources to understand a particular language.
NLG: NLG stands for natural language generation. Generation is the process of producing meaningful phrases and sentences in natural language from some internal representation.
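The two components can be illustrated with a deliberately tiny sketch: one function maps a spoken-style command to an internal representation (NLU), and another turns that representation back into a sentence (NLG). The function names toy_nlu and toy_nlg, the dictionary keys, and the supported phrases are all assumptions invented for this example; real assistants are far more sophisticated.

```python
import re


def toy_nlu(utterance):
    """NLU side: map a command to a small internal representation."""
    match = re.search(
        r"(turn on|switch on|turn off|switch off)\s+(?:the\s+)?(\w+)",
        utterance.lower(),
    )
    if match is None:
        return None  # the command is not one this toy example understands
    action = "on" if match.group(1).endswith("on") else "off"
    return {"device": match.group(2), "action": action}


def toy_nlg(representation):
    """NLG side: produce a sentence from the internal representation."""
    return f"OK, the {representation['device']} is now {representation['action']}."


frame = toy_nlu("Alexa, turn on the light")
print(frame)
print(toy_nlg(frame))
```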
Steps involved in NLP:
Now we will look at these steps in detail, one by one. So, let’s start.
Tokenization breaks raw text into smaller units called tokens, such as sentences or words. It plays a main role in dealing with text data, because the later steps work on tokens rather than on the raw text. Tokens can be words, characters, or subwords. In its most common form, tokenization is the process of splitting the text into words.
Example: He is my best friend.
['He', 'is', 'my', 'best', 'friend']
Implementation of tokenization:
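import re

text = """ Ash is my best friend.i met him when i was in 7 standard. he was good in singing. now he becomes a singer.i am happy for him. today i am missing him so much. our friendship is still going."""

# Collect every run of word characters or apostrophes as one token.
# The raw string (r"...") keeps the backslash in \w from being treated as a string escape.
tokens = re.findall(r"[\w']+", text)
print(tokens)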
In the code above, let’s first understand what re is. It is Python’s regular expression module (regular expressions are also called REs, regexes, or regex patterns). The function re.findall() finds every substring that matches the pattern passed to it and returns the matches in a list. In the pattern, \w matches any word character (letters, digits, and the underscore), the square brackets let either \w or an apostrophe match, and + means one or more times.
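To see what the pattern [\w']+ does with punctuation and apostrophes, here is a small sketch on a made-up sentence. It also shows character-level tokens, one of the other token types mentioned earlier. The sample strings are assumptions chosen for illustration.

```python
import re

sample = "Don't stop, it's fine."

# Word-level tokens: the apostrophe stays inside the word,
# while the comma and the period are dropped.
word_tokens = re.findall(r"[\w']+", sample)
print(word_tokens)

# Character-level tokens: every character of a word becomes a token.
char_tokens = list("friend")
print(char_tokens)
```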
import re

text = """ Ash is my best friend.i met him when i was in 7 standard. he was good in singing. now he becomes a singer.i am happy for him. today i am missing him so much. our friendship is still going."""

# Split the text at every period to get sentence-level tokens.
sentences = re.compile('[.]').split(text)
print(sentences)
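To perform sentence tokenization, we can split the text with a regular expression, either with the re.split() function or, as in the code above, with the split() method of a pattern compiled by re.compile(). The pattern [.] matches a literal period, so the text is split into sentences each time a period is encountered. Because the text ends with a period, the last element of the resulting list is an empty string.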
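Splitting on periods leaves stray spaces and can produce empty strings, so a slightly tidier variant is sketched below. The helper name split_sentences is an assumption for this example, and the approach is still naive: abbreviations and decimal numbers would be split incorrectly.

```python
import re


def split_sentences(text):
    """Split on ., ! or ? and drop empty or whitespace-only pieces."""
    parts = re.split(r"[.!?]", text)
    return [p.strip() for p in parts if p.strip()]


text = "Ash is my best friend.i met him when i was in 7 standard. he was good in singing."
sentences = split_sentences(text)
print(sentences)
```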
Stemming is the process of normalizing words into their base or root form. The algorithms that do this are commonly referred to as stemming algorithms or stemmers. Stemming plays an important role in the NLP pipeline. For example, stemming reduces the words “waits”, “waited”, and “waiting” to the root word “wait”.
Errors in stemming:
There are two types of error in stemming: overstemming and understemming. Overstemming occurs when two words that have different stems are reduced to the same root. Understemming occurs when two words that should share the same stem are reduced to different roots.
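Both kinds of error can be reproduced with a very small suffix stripper. The sketch below is a toy written for this illustration, not the Porter algorithm; the suffix list, the minimum stem length, and the name toy_stem are all assumptions.

```python
# Suffixes to strip, checked in this order (longer ones first).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# The intended behaviour: related forms end up with the same stem.
print([toy_stem(w) for w in ["waits", "waited", "waiting"]])

# Overstemming: "news" and "new" are different words but get the same stem.
print(toy_stem("news"), toy_stem("new"))

# Understemming: "running" and "run" should share a stem but do not.
print(toy_stem("running"), toy_stem("run"))
```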
Implementation of stemming:
# import the stemmer class
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed
words = ["waiting", "connections", "respected", "programmer"]

# print each word next to its stem
for w in words:
    print(w, " : ", ps.stem(w))
The code above uses NLTK, so first let’s understand what NLTK is. NLTK stands for Natural Language Toolkit, a Python library for building programs that work with natural language. It can perform different operations such as tokenizing, stemming, parsing, tagging, and semantic reasoning. The package performs stemming using different classes. PorterStemmer is one of these classes, so we import it. The Porter stemmer uses suffix stripping to produce stems and is known for its speed and simplicity.
A stemming algorithm works by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always. So, let’s now understand the concept of lemmatization.
Lemmatization takes the morphological analysis of a word into consideration. To do so, it needs a detailed dictionary that the algorithm can look through to link an inflected form back to its original or root word, which is known as the lemma. In other words, lemmatization groups together the different inflected forms of a word under one lemma. It is similar to stemming in that it maps several words to one common root, but the major difference is that the output of lemmatization is a proper word. For example, a lemmatizer should map the words “gone”, “going”, and “went” to “go”. That will not be the output of stemming.
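The dictionary lookup idea can be sketched in a few lines. The table below is a tiny hand-written assumption covering only a handful of words, and the name toy_lemmatize is made up for this example; real lemmatizers rely on large dictionaries and on the part of speech of each word.

```python
# Tiny hand-written lemma dictionary (an illustrative assumption,
# nowhere near a complete resource).
LEMMA_TABLE = {
    "gone": "go",
    "going": "go",
    "went": "go",
    "goes": "go",
    "better": "good",
    "best": "good",
}


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ["gone", "going", "went", "friend"]:
    print(w, ":", toy_lemmatize(w))
```

Unlike suffix stripping, the lookup can link an irregular form such as “went” to “go”, and every output is a proper word.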
Implementation of lemmatization