- Published
Preprocessing in NLP
- Gaudhiwaa Hendrasto
Natural Language Processing
Natural Language Processing (NLP) is a branch of Artificial Intelligence that deals with human language, usually in the form of text. Chatbots, sentiment analysis, text summarization, speech recognition, spam detection, content recommendation, text generation, and translation are all examples of NLP. In short, any project that works with text is an NLP project 🎯
Preprocessing
Preprocessing is the step where you clean and transform raw text 🧹 before it becomes the input for your model. Data quality is crucial for both model accuracy and time efficiency, so this step is worth doing carefully. The preprocessing steps are explained below.
Note: Before you read the preprocessing steps below, keep in mind that you do not have to use all of them. Which steps you need depends on your project goals. For example, preprocessing for sentiment analysis may differ from preprocessing for text generation. The techniques can be the same while the order in which you apply them differs. There is no one-size-fits-all solution for preprocessing. The more NLP projects you do, the better your instinct becomes for which techniques will help you reach your project goals.
Normalization: Remove or convert irrelevant data in the text.
Normalization can include case-folding (converting to lower case), removing links, removing punctuation, and so on. The point of normalization is to clean the data by removing or converting the parts of the text that carry no useful information. This standardizes the text and makes the subsequent processing steps easier.
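As a rough illustration of the steps just described, here is a minimal sketch using only the Python standard library. The function name `normalize` and the link pattern are assumptions made for this example, not the exact code from the author's notebook, and your own project may need different rules.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase the text, then strip links, punctuation, and extra spaces."""
    text = text.lower()                                 # case-folding
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links (assumed pattern)
    # Remove ASCII punctuation characters such as . , # $ @ % ^
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                       # collapse repeated whitespace

print(normalize("The cats are running and jumping quickly https://www.google.com/"))
# the cats are running and jumping quickly
print(normalize("#$@%^ They are playing soccer in the park."))
# they are playing soccer in the park
```

Links are removed before punctuation on purpose: stripping punctuation first would break the URL into pieces that the link pattern no longer matches.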
| text | normalization |
| --- | --- |
| The cats are running and jumping quickly https://www.google.com/ | the cats are running and jumping quickly |
| He enjoys swimming in the pool every morning. | he enjoys swimming in the pool every morning |
| #$@%^ They are playing soccer in the park. | they are playing soccer in the park |
| She loves to read books before bedtime. | she loves to read books before bedtime |
| We were watching a movie when the power went out. | we were watching a movie when the power went out |

Remove Stop Words: Remove common, low-information words from sentences.
Common, low-information words, known as "stop words", are often irrelevant and introduce noise into text analysis. Here is the list of stop words: list of stopwords
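A minimal sketch of stop-word removal might look like the following. The `STOP_WORDS` set here is a small, hand-picked list invented for this example; a real project would normally load a complete list from a library or a file instead.

```python
# Hypothetical, hand-picked stop words for illustration only (not a complete list).
STOP_WORDS = {
    "the", "a", "an", "and", "are", "were", "he", "she", "they", "we",
    "in", "to", "before", "when", "out",
}

def remove_stop_words(text: str) -> str:
    """Drop every whitespace-separated word that appears in STOP_WORDS."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(remove_stop_words("he enjoys swimming in the pool every morning"))
# enjoys swimming pool every morning
```

The lookup is case-sensitive, so this sketch assumes the text has already been lowercased during normalization.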
| text | remove stopwords |
| --- | --- |
| the cats are running and jumping quickly | cats running jumping quickly |
| he enjoys swimming in the pool every morning | enjoys swimming pool every morning |
| they are playing soccer in the park | playing soccer park |
| she loves to read books before bedtime | loves read books bedtime |
| we were watching a movie when the power went out | watching movie power went |

Tokenization: Splitting text by spaces
Tokenization here means splitting text on spaces into single words (unigrams). This step is required before the next steps (stemming or lemmatization), because both operate on individual words.
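Splitting on spaces, as described above, can be sketched with Python's built-in `str.split`. The helper name `tokenize` is just a label chosen for this example.

```python
def tokenize(text: str) -> list[str]:
    """Split text on whitespace into a list of single-word tokens."""
    return text.split()

print(tokenize("playing soccer park"))
# ['playing', 'soccer', 'park']
```

This simple approach works here because punctuation was already removed during normalization; on raw text, a comma or period would stay glued to the neighbouring word.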
| text | tokenization |
| --- | --- |
| cats running jumping quickly | [cats, running, jumping, quickly] |
| enjoys swimming pool every morning | [enjoys, swimming, pool, every, morning] |
| playing soccer park | [playing, soccer, park] |
| loves read books bedtime | [loves, read, books, bedtime] |
| watching movie power went | [watching, movie, power, went] |

Stemming: Convert all words to their root form.
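Stemming reduces words to their root form by chopping off endings. This lowers dimensionality, because different inflections of a word collapse into a single feature. However, the resulting stems are often not actual words (for example, "quickly" becomes "quickli").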
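To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any library's stemmer: the `SUFFIXES` rule set and the `toy_stem` helper are made up for illustration, and the stems it produces differ from those of a real stemmer. In practice you would more likely use an existing implementation such as NLTK's `PorterStemmer`.

```python
# Hypothetical rule set for this toy example, checked in this order.
SUFFIXES = ("ing", "ly", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, provided at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied, keep the word unchanged

tokens = ["cats", "running", "jumping", "quickly"]
print([toy_stem(token) for token in tokens])
# ['cat', 'runn', 'jump', 'quick']
```

Note that "running" becomes "runn", which is not a real word. Rule-based stemmers accept this kind of result in exchange for speed and simplicity.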
| text | stemming |
| --- | --- |
| [cats, running, jumping, quickly] | cat run jump quickli |
| [enjoys, swimming, pool, every, morning] | enjoy swim pool everi morn |
| [playing, soccer, park] | play soccer park |
| [loves, read, books, bedtime] | love read book bedtim |
| [watching, movie, power, went] | watch movi power went |

Lemmatization: Convert all words to their base form, based on a dictionary.
Lemmatization is similar to stemming, but it uses a dictionary to replace words with their base forms. It is more accurate than stemming, but it also takes longer.
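The dictionary-lookup idea can be sketched as follows. The `LEMMA_DICT` mapping is a tiny hypothetical dictionary written for this example; real lemmatizers (for instance NLTK's `WordNetLemmatizer`) rely on a full lexical database and can also take the part of speech into account.

```python
# Hypothetical mini-dictionary mapping inflected forms to their base forms.
LEMMA_DICT = {
    "cats": "cat",
    "books": "book",
    "loves": "love",
}

def lemmatize(tokens: list[str]) -> list[str]:
    """Replace each token with its dictionary base form, if one is known."""
    return [LEMMA_DICT.get(token, token) for token in tokens]

print(lemmatize(["loves", "read", "books", "bedtime"]))
# ['love', 'read', 'book', 'bedtime']
```

Words missing from the dictionary are returned unchanged, so the output always consists of real words, unlike the output of a stemmer.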
| text | lemmatization |
| --- | --- |
| [cats, running, jumping, quickly] | cat running jumping quickly |
| [enjoys, swimming, pool, every, morning] | enjoys swimming pool every morning |
| [playing, soccer, park] | playing soccer park |
| [loves, read, books, bedtime] | love read book bedtime |
| [watching, movie, power, went] | watching movie power went |

Note: You can choose to use stemming or lemmatization. Choose stemming when speed and simplicity are needed, and the exact word form is less critical. Choose lemmatization when accuracy and readability are important, and the context and part of speech need to be preserved (e.g., sentiment analysis, machine translation).
I have provided the preprocessing code here ⭐ Colab: Preprocessing. You can reuse it across your projects. It includes stemming and lemmatization for both Indonesian and English text.