Natural Language Processing 101

Aryan Khimani
6 min read · Sep 9, 2022


We have all sat in English class completely not paying attention to the teacher, daydreaming about last night's dream or about when school would finally end. Language is confusing but also super important: the ability to communicate and share information is one of the defining characteristics of humans.

Languages are also very diverse and complex, with roughly 6,500 of them spoken around the world.

But when we text someone or post on social media, the sentences we write become very unstructured. Thankfully our brains can still understand them pretty easily, but hand the same text to a machine and you get back an error. So how do you teach a machine to understand a language?

Natural Language Processing (NLP)

Well, that is where machine learning comes in handy; more specifically, Natural Language Processing (NLP). To understand how it works, we first need to understand text mining, also called text analytics.

Text Mining

Text mining, or text analytics, is simply the process of extracting meaning from text. It usually involves breaking down and then structuring the input text. An NLP model does this by deriving patterns from the structured data, then evaluating and interpreting those patterns to give us an output.

TL;DR: NLP turns text into analyzable data by applying machine learning.

Applications of NLP

One of the most important tools, and one used heavily by Twitter, is sentiment analysis, which picks up the emotion in a text or tweet, such as happiness or hate speech. Next are chatbots, which have become very common on company websites, acting as customer service agents that answer questions. Another one we all know is speech recognition, used by assistants such as Siri, Cortana, and Google Assistant. Another use case of NLP is machine translation; Google Translate, for example, uses NLP to translate languages in real time. Lastly, there's one we use all the time and have a love-hate relationship with: autocorrect. All of these applications come from Natural Language Processing (NLP).

Breaking Down NLP

NLP can be broken down into two main categories. One is Natural Language Understanding, which takes the input text, maps it, and converts it into a useful representation of the input. The other is Natural Language Generation, the process of producing text, such as phrases and sentences, that resembles human language and style. Within those two categories there are even more subcategories, so let's take a look at a few…

Tokenization

This is the process of taking an input string and turning it into tokens: fragments of the structure that can be used to find the meaning of the text. For example, the sentence “One Fish, Two Fish, Red Fish, Blue Fish” can be divided into the tokens “One”, “Fish”, “Two”, “Fish”, “Red”, “Fish”, “Blue”, “Fish”, one for each word.
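Here's a minimal sketch of that step using NLTK's word_tokenize function (assuming the punkt tokenizer data has been downloaded):

import nltk
nltk.download("punkt")  # one-time download of the tokenizer data

sentence = "One Fish, Two Fish, Red Fish, Blue Fish"
print(nltk.word_tokenize(sentence))
# ['One', 'Fish', ',', 'Two', 'Fish', ',', 'Red', 'Fish', ',', 'Blue', 'Fish']

Notice that the punctuation comes out as tokens too; we will deal with that later.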

Stemming

One of the first steps an NLP model can take with its tokens is stemming, which normalizes words into their root form. For example, the words “consultant”, “consulting”, and “consultative” can all be reduced to the single root word “consult”. Stemming works by chopping off the suffixes and prefixes of a word, so sometimes the stem it produces doesn't fit the context.
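As a quick sketch using NLTK's PorterStemmer, the classic suffix-stripping stemmer:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["consultant", "consulting", "consultative"]
print([stemmer.stem(w) for w in words])
# ['consult', 'consult', 'consult']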

Lemmatization

Lemmatization, on the other hand, considers the context, which it does through a pre-downloaded dictionary that links each word to its root. Like stemming, lemmatization groups words under a root form, here called a lemma. The main difference between the two is that the output of lemmatization fits the context better and is always a proper word. For example, “better” should be grouped under “good” rather than stripped down to a meaningless stem.
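Here's a minimal sketch using NLTK's WordNetLemmatizer, whose pre-downloaded dictionary is WordNet:

import nltk
nltk.download("wordnet")  # the dictionary the lemmatizer looks words up in
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' (pos="a" marks it as an adjective)
print(lemmatizer.lemmatize("feet"))             # 'foot'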

POS Tags

We have all learned about nouns, verbs, adjectives, adverbs, etc… These part-of-speech (POS) tags are the grammatical roles that determine a word's meaning and function in a sentence. A word can have more than one POS tag: “book” or “watch” can each be a verb or a noun depending on the context it's put in. Tagging is crucial for an NLP model to understand what a sentence is saying and which words to give more weight to.
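A small sketch with NLTK's default POS tagger, using the “book” example (the exact tags can vary slightly):

import nltk
nltk.download("averaged_perceptron_tagger")  # data for the default POS tagger

print(nltk.pos_tag(nltk.word_tokenize("She wants to book a flight")))
# here 'book' comes back tagged as a verb (VB)
print(nltk.pos_tag(nltk.word_tokenize("She read her book")))
# here 'book' comes back tagged as a noun (NN)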

Named Entity Recognition

One problem you might have noticed is that names don't fall under POS tags. That is where Named Entity Recognition (NER) comes in: it detects named entities such as people's names, company names, locations, etc… Take an example sentence mentioning Google, Sambhav, Toronto, and the Empire State Building.

The NER system would first see the company Google and label it as an Organization. Then it would see the name Sambhav and label it as a Person. Next, it would identify the city Toronto and label it as a Location. Lastly, seeing the Empire State Building, it would call that an Organization as well. Once the model has converted everything into tokens, you still need something to make sense of them together.
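As a rough sketch with NLTK's built-in NER (the example sentence is made up to match the entities above, and note that NLTK labels places GPE rather than Location):

import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)

sentence = "Sambhav from Google visited Toronto and the Empire State Building"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
print(tree)
# produces labelled chunks like (PERSON Sambhav), (ORGANIZATION Google), (GPE Toronto)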

Chunking

Once we are done producing tokens through stemming, lemmatization, and Named Entity Recognition, we need to bring them all together. That is where chunking comes in to help make sense of all of this. Chunking takes the individual tokens of text and groups them into phrases, with these structures being known as chunks. For NLP, chunking is a useful tool for combining the different kinds of tokens made by the previous systems. For example, a sentence like “Cat Dog IBM KFC Canada London” can be broken down to its root words through stemming and lemmatization, given POS tags, and finally combined with chunking.
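Here's a minimal chunking sketch using NLTK's RegexpParser, which groups POS-tagged tokens into noun-phrase chunks:

import nltk

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # a noun phrase: optional determiner, adjectives, then nouns
parser = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumped over the lazy dog"))
print(parser.parse(tagged))
# groups the tokens into chunks like (NP The quick brown fox) and (NP the lazy dog)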

Math Deep Dive

Outline

  1. NLP Concept Implementation
  2. Creating Training Data
  3. PyTorch Model and Training
  4. Save/Load Model and Implement the Chat

Alright, to start we need to train our model, and to do so we need to give it training phrases to practice with. But we can't just give the machine a plain string; we need to convert each phrase into a vector for the NLP model to understand it. To do that, we implement something called Bag of Words.

Bag of Words

So you have strings of words that you want to use, but you need to convert them into vectors. That is where Bag of Words comes in. For example, take the vocabulary [zebra, chicken, food, the, before, art]. For each word, such as “art”, we make a new array with a pattern like [0, 0, 0, 0, 0, 1] to mark which vocabulary words are in use.
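A minimal sketch of that idea, using the same toy vocabulary (the helper name bag_of_words is my own):

import numpy as np

def bag_of_words(tokenized_sentence, all_words):
    # put a 1 at every position whose vocabulary word appears in the sentence
    bag = np.zeros(len(all_words), dtype=np.float32)
    for idx, word in enumerate(all_words):
        if word in tokenized_sentence:
            bag[idx] = 1.0
    return bag

vocab = ["zebra", "chicken", "food", "the", "before", "art"]
print(bag_of_words(["art"], vocab))
# [0. 0. 0. 0. 0. 1.]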

Now that we know how to turn a string of words into a vector the NLP model can understand, we can start setting up our neural network. The model I'm going to use is a simple feed-forward neural net, which you can use as a base to play around with.

The NLP Preprocessing

The preprocessing pipeline starts with tokenization, breaking the string down into words and punctuation. After that come stemming and lemmatization. For now, we also get rid of the punctuation and implement Bag of Words. To do all of this we need a package called the Natural Language Toolkit (NLTK).

Step 1: Natural Language Toolkit (NLTK)

This is a Python library that is very useful for Natural Language Processing, with Bag of Words, stemming, tokenization, and other resources ready to use. To start, create a separate file; you can call it whatever you like, for example nltk_import.py.
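Here's one way that file might look, loosely following the pytorch-chatbot repo linked at the end; the function names are just suggestions:

# nltk_import.py
import nltk
nltk.download("punkt")  # only needed on the first run

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

def tokenize(sentence):
    # split a sentence into word and punctuation tokens
    return nltk.word_tokenize(sentence)

def stem(word):
    # lowercase, then reduce the word to its root form
    return stemmer.stem(word.lower())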

Creating Training Data

Alright, now that we have imported all the packages we need for tokenization, stemming, and Bag of Words, it's time to create training data for our machine to practice on. To start, create a new file called train.py.

Now, stemming can't take in punctuation, since punctuation marks aren't words and would only confuse our machine. So we need to make the code ignore those tokens and then do the stemming.
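Here's a sketch of that step, assuming the tokenize and stem helpers from nltk_import.py above (the training phrases are stand-ins for your own):

# train.py
from nltk_import import tokenize, stem

all_words = []
for phrase in ["Hi there!", "How are you?", "See you later."]:
    all_words.extend(tokenize(phrase))

ignore_words = ["?", "!", ".", ","]  # punctuation tokens to skip
all_words = [stem(w) for w in all_words if w not in ignore_words]
all_words = sorted(set(all_words))   # dedupe so each root word appears once
print(all_words)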

Now that we have that sorted, all that's left is to set up the model itself. That part should be pretty easy, and it's where you can mess around with the hidden layers, inputs, etc. A minimal sketch is below.
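This simple feed-forward net follows the lines of the pytorch-chatbot repo linked below: two hidden layers with ReLU activations, sized however you like:

# model.py
import torch.nn as nn

class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.l1 = nn.Linear(input_size, hidden_size)
        self.l2 = nn.Linear(hidden_size, hidden_size)
        self.l3 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.l1(x))
        out = self.relu(self.l2(out))
        return self.l3(out)  # raw scores; CrossEntropyLoss handles the softmax

Hopefully you have learned the basic concepts of NLP and how you can use them to build your own chatbot.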

Inspiration and resources used in the making of this article:

https://github.com/python-engineer/pytorch-chatbot/blob/master/train.py

https://paperswithcode.com/task/language-modelling

https://paperswithcode.com/task/speech-recognition

https://www.youtube.com/watch?v=RpWeNzfSUHw&list=PLqnslRFeH2UrFW4AUgn-eY37qOAWQpJyg

https://www.youtube.com/watch?v=5ctbvkAMQO4
