Natural Language Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence that deals with text and human language. Chatbots, sentiment analysis, text summarization, speech recognition, spam detection, content recommendation, text generation, and translation are all examples of NLP. In short, any project that works with text is an NLP project 🎯

Preprocessing

Preprocessing is the step where you clean and transform raw text 🧹 before it becomes the input for your model. Data quality is crucial for both model accuracy and time efficiency, so this step is worth doing carefully. The preprocessing steps are explained below.

Note: Not every preprocessing step below is needed in every project. The right steps depend on your project goals. For example, preprocessing for sentiment analysis may differ from preprocessing for text generation. The techniques can be the same while their order differs, so there is no ONE-SIZE-FITS-ALL recipe. The more NLP projects you do, the better your instinct becomes for which techniques serve the project goals.

Normalization: Remove or convert irrelevant data in the text.

Normalization can include case folding (lowercasing), removing links, removing punctuation, and so on. The goal is to clean the data by removing or converting irrelevant parts of the text. This standardizes the text and makes subsequent processing easier.

text	normalization
The cats are running and jumping quickly https://www.google.com/	the cats are running and jumping quickly
He enjoys swimming in the pool every morning.	he enjoys swimming in the pool every morning
#$@%^ They are playing soccer in the park.	they are playing soccer in the park
She loves to read books before bedtime.	she loves to read books before bedtime
We were watching a movie when the power went out.	we were watching a movie when the power went out
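
As a rough illustration of the three normalization steps named above (case folding, link removal, punctuation removal), here is a minimal sketch using only the Python standard library. The `normalize` function name and the URL pattern are assumptions made for this example, not code from the linked notebook, and the pattern only catches simple `http(s)://` and `www.` links.

```python
import re
import string

# Assumed pattern: matches simple http(s):// and www. links only.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase the text, then strip links and punctuation."""
    text = text.lower()                      # case folding
    text = URL_PATTERN.sub(" ", text)        # remove links
    text = text.translate(PUNCTUATION_TABLE)  # remove punctuation
    return " ".join(text.split())            # collapse repeated whitespace


print(normalize("#$@%^ They are playing soccer in the park."))
```

Links are removed before punctuation on purpose: stripping punctuation first would break `https://...` into fragments the URL pattern can no longer recognize.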

Remove Stop Words: Remove common words that carry little information

Common words that carry little information, known as "stop words", are often irrelevant and add noise to text analysis. See the list of stop words 👉 list of stopwords

text	remove stopwords
the cats are running and jumping quickly	cats running jumping quickly
he enjoys swimming in the pool every morning	enjoys swimming pool every morning
they are playing soccer in the park	playing soccer park
she loves to read books before bedtime	loves read books bedtime
we were watching a movie when the power went out	watching movie power went
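
A minimal sketch of stop word filtering, assuming the text is already normalized. The `STOP_WORDS` set here is a tiny hand-picked sample for illustration only; a real project would normally load a fuller list such as the one linked above.

```python
# Tiny illustrative stop word set (an assumption, not a complete list).
STOP_WORDS = {
    "the", "a", "an", "and", "are", "were", "he", "she", "they", "we",
    "in", "to", "before", "when", "out",
}


def remove_stop_words(text: str) -> str:
    """Drop every whitespace-separated word that appears in STOP_WORDS."""
    kept = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("they are playing soccer in the park"))
```

Using a set keeps each membership check constant-time, which matters once the stop word list grows to a few hundred entries.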

Tokenization: Splitting text by spaces

Tokenization splits text on spaces into single words (unigrams). This step is mandatory before moving on to the next steps (stemming or lemmatization).

text	tokenization
cats running jumping quickly	[cats, running, jumping, quickly]
enjoys swimming pool every morning	[enjoys, swimming, pool, every, morning]
playing soccer park	[playing, soccer, park]
loves read books bedtime	[loves, read, books, bedtime]
watching movie power went	[watching, movie, power, went]
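
Space-based tokenization as described here needs nothing beyond Python's built-in `str.split`. The `tokenize` wrapper below is just a hypothetical name for this example; library tokenizers handle punctuation and contractions more carefully, which this sketch does not attempt.

```python
def tokenize(text: str) -> list[str]:
    """Split text into unigram tokens on any run of whitespace."""
    # str.split() with no argument also ignores leading/trailing spaces.
    return text.split()


print(tokenize("playing soccer park"))
```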

Stemming: Convert all words to their base form.

Stemming reduces words to their root form by stripping suffixes, which helps lower dimensionality. However, the result is often not an actual word: "quickly" becomes "quickli", for example.

text	stemming
[cats, running, jumping, quickly]	cat run jump quickli
[enjoys, swimming, pool, every, morning]	enjoy swim pool everi morn
[playing, soccer, park]	play soccer park
[loves, read, books, bedtime]	love read book bedtim
[watching, movie, power, went]	watch movi power went
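
To make the suffix-stripping idea concrete, here is a deliberately crude sketch. It is not the Porter algorithm or any library's stemmer, so its output will differ from real stemmers; the suffix list and the `toy_stem` name are made up for illustration.

```python
# Assumed toy suffix list, checked in this order.
SUFFIXES = ("ing", "ly", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word  # no rule applied


tokens = ["cats", "running", "jumping", "quickly"]
print([toy_stem(token) for token in tokens])
```

Even this toy version shows the trade-off: the rules are fast and need no dictionary, but `toy_stem("loves")` gives `lov`, which is not a real word.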

Lemmatization: Convert all words to their base form using a dictionary.

Lemmatization is similar to stemming but uses a dictionary to replace words with their base forms. It is more accurate but slower than stemming.

text	lemmatization
[cats, running, jumping, quickly]	cat running jumping quickly
[enjoys, swimming, pool, every, morning]	enjoys swimming pool every morning
[playing, soccer, park]	playing soccer park
[loves, read, books, bedtime]	love read book bedtime
[watching, movie, power, went]	watching movie power went
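
The dictionary lookup at the heart of lemmatization can be sketched as below. The `LEMMA_DICT` mapping is a tiny hand-written stand-in for a real lexical dictionary, and `lemmatize` is a hypothetical helper; real lemmatizers also use part-of-speech information, which this sketch ignores.

```python
# Hand-written sample dictionary (an assumption for illustration only).
LEMMA_DICT = {
    "cats": "cat",
    "loves": "love",
    "books": "book",
    "mice": "mouse",
    "feet": "foot",
}


def lemmatize(tokens: list[str]) -> list[str]:
    """Replace each token with its dictionary base form, if one is known."""
    # Tokens missing from the dictionary are returned unchanged.
    return [LEMMA_DICT.get(token, token) for token in tokens]


print(lemmatize(["loves", "read", "books", "bedtime"]))
```

Because the mapping comes from a dictionary instead of suffix rules, irregular forms such as "mice" map to a real word ("mouse"), which plain suffix stripping cannot do.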

Note: You can choose to use stemming or lemmatization. Choose stemming when speed and simplicity are needed, and the exact word form is less critical. Choose lemmatization when accuracy and readability are important, and the context and part of speech need to be preserved (e.g., sentiment analysis, machine translation).

I provide the preprocessing code here ⭐ Colab: Preprocessing. You can reuse it across your projects. It covers stemming and lemmatization for both Indonesian and English text.