Published

Preprocessing in NLP

Gaudhiwaa Hendrasto

Natural Language Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence that deals with human language, most often in the form of text. Chatbots, sentiment analysis, text summarization, speech recognition, spam detection, content recommendation, text generation, and translation are all examples of NLP. In short, any project that deals with text is an NLP project 🎯

Preprocessing

Preprocessing is the step where you modify the characteristics of a text. You need it because the data has to be cleaned 🧹 before it becomes the input for your model: data quality is crucial for both model accuracy and time efficiency. The preprocessing steps are explained below.

Note: Before you read the preprocessing steps below, keep in mind that you do not have to use all of them. Which steps you need depends on your project goals. For example, preprocessing for sentiment analysis may differ from preprocessing for text generation. The techniques can be the same while the order in which they are applied is different, so there is no one-size-fits-all solution. The more NLP projects you do, the better your instinct becomes for which techniques help you achieve the project goals.
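
To make the point about order and selection concrete, here is a minimal sketch in Python. The helper names (`lowercase`, `strip_punctuation`, `run_pipeline`) and the two step lists are hypothetical and only for illustration; they are not recommendations for any particular project.

```python
import string


def lowercase(text):
    """Case-fold the text."""
    return text.lower()


def strip_punctuation(text):
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def run_pipeline(text, steps):
    """Apply each preprocessing step in the given order."""
    for step in steps:
        text = step(text)
    return text


# Hypothetical per-project configurations (assumptions for illustration only).
SENTIMENT_STEPS = [lowercase, strip_punctuation]
GENERATION_STEPS = [lowercase]  # keeps punctuation in the text

sample = "She LOVES to read books!"
print(run_pipeline(sample, SENTIMENT_STEPS))   # she loves to read books
print(run_pipeline(sample, GENERATION_STEPS))  # she loves to read books!
```

Because a pipeline is just a list of functions, you can add, drop, or reorder steps per project without touching the steps themselves.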

  1. Normalization: Remove or convert irrelevant data in the text.

    Normalization can include case-folding (lowercasing), removing links, removing punctuation, and so on. The point is to clean the data by removing or converting irrelevant parts of the text. This standardizes the text and makes subsequent processing easier.

    | text | normalization |
    | --- | --- |
    | The cats are running and jumping quickly https://www.google.com/ | the cats are running and jumping quickly |
    | He enjoys swimming in the pool every morning. | he enjoys swimming in the pool every morning |
    | #$@%^ They are playing soccer in the park. | they are playing soccer in the park |
    | She loves to read books before bedtime. | she loves to read books before bedtime |
    | We were watching a movie when the power went out. | we were watching a movie when the power went out |
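
A minimal sketch of normalization using only Python's standard library. The function name `normalize` and the URL pattern are assumptions made for this example; real projects often need a stricter pattern and extra rules (numbers, emojis, and so on).

```python
import re
import string

# Simple URL pattern (assumption): anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def normalize(text):
    """Lowercase the text, then remove links, punctuation, and extra spaces."""
    text = text.lower()                      # case-folding
    text = URL_PATTERN.sub(" ", text)        # remove links
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(text.split())            # collapse repeated whitespace


print(normalize("#$@%^ They are playing soccer in the park."))
# they are playing soccer in the park
```

Links are removed before punctuation on purpose: if punctuation went first, `https://www.google.com/` would turn into `httpswwwgooglecom` and no longer match the URL pattern.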
  2. Remove Stop Words: Remove common words that carry little information.

    Common words that carry little information, known as "stop words", are often irrelevant and introduce noise into text analysis. See the full list here 👉 list of stopwords

    | text | remove stopwords |
    | --- | --- |
    | the cats are running and jumping quickly | cats running jumping quickly |
    | he enjoys swimming in the pool every morning | enjoys swimming pool every morning |
    | they are playing soccer in the park | playing soccer park |
    | she loves to read books before bedtime | loves read books bedtime |
    | we were watching a movie when the power went out | watching movie power went |
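
Stop-word removal can be sketched as a simple set lookup. The `STOP_WORDS` set below is a tiny hand-picked sample (an assumption for this example), not a complete list; in practice you would load a full list from a library or a file.

```python
# Tiny illustrative stop-word set (assumption); real lists are much longer.
STOP_WORDS = frozenset({
    "the", "a", "an", "and", "are", "is", "was", "were",
    "in", "on", "to", "he", "she", "they", "we",
    "before", "when", "out",
})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every word that appears in the stop-word set."""
    # Expects text that is already lowercased and free of punctuation.
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)


print(remove_stop_words("they are playing soccer in the park"))
# playing soccer park
```

A set is used because membership checks stay fast even when the stop-word list grows to hundreds of entries.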
  3. Tokenization: Splitting text by spaces

    Tokenization splits text on spaces into single words (unigrams). This step is mandatory before you move on to the next steps (stemming or lemmatization).

    | text | tokenization |
    | --- | --- |
    | cats running jumping quickly | [cats, running, jumping, quickly] |
    | enjoys swimming pool every morning | [enjoys, swimming, pool, every, morning] |
    | playing soccer park | [playing, soccer, park] |
    | loves read books bedtime | [loves, read, books, bedtime] |
    | watching movie power went | [watching, movie, power, went] |
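
For text that is already cleaned, whitespace tokenization is a one-liner in Python. The function name `tokenize` is an assumption for this sketch; library tokenizers handle harder cases (contractions, punctuation, other languages) that this does not.

```python
def tokenize(text):
    """Split a cleaned string into a list of single-word tokens (unigrams)."""
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces do not produce empty tokens.
    return text.split()


print(tokenize("cats running  jumping quickly"))
# ['cats', 'running', 'jumping', 'quickly']
```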
  4. Stemming: Reduce all words to their root form.

    Stemming reduces words to their root form, which shrinks the vocabulary and so helps lower dimensionality. However, the result is often not an actual word (for example, "quickly" becomes "quickli").

    | text | stemming |
    | --- | --- |
    | [cats, running, jumping, quickly] | cat run jump quickli |
    | [enjoys, swimming, pool, every, morning] | enjoy swim pool everi morn |
    | [playing, soccer, park] | play soccer park |
    | [loves, read, books, bedtime] | love read book bedtim |
    | [watching, movie, power, went] | watch movi power went |
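
To show the idea behind stemming, here is a deliberately simplified suffix-stripping sketch. It is a toy, not the algorithm that produced the table above (those outputs look like the results of a Porter-style stemmer, which has many more rules), so some results differ. The suffix list and the name `toy_stem` are assumptions for this example.

```python
# Suffixes to strip, checked in this order (illustrative, far from complete).
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run", "swimm" -> "swim".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word  # no rule applied


print([toy_stem(w) for w in ["cats", "running", "jumping", "quickly"]])
# ['cat', 'run', 'jump', 'quick']
print(toy_stem("loves"))
# lov
```

Note that "loves" becomes "lov", which is not a real word. This is the trade-off of stemming: the rules are fast and simple, but they only look at the letters, not at a dictionary.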
  5. Lemmatization: Convert all words to their base form, based on a dictionary.

    Lemmatization is similar to stemming but uses a dictionary to replace words with their base forms. It is more accurate than stemming but takes longer.

    | text | lemmatization |
    | --- | --- |
    | [cats, running, jumping, quickly] | cat running jumping quickly |
    | [enjoys, swimming, pool, every, morning] | enjoys swimming pool every morning |
    | [playing, soccer, park] | playing soccer park |
    | [loves, read, books, bedtime] | love read book bedtime |
    | [watching, movie, power, went] | watching movie power went |

    Note: You can choose to use stemming or lemmatization. Choose stemming when speed and simplicity are needed, and the exact word form is less critical. Choose lemmatization when accuracy and readability are important, and the context and part of speech need to be preserved (e.g., sentiment analysis, machine translation).
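
The dictionary lookup behind lemmatization (step 5) can be sketched with a plain Python dict. The three entries in `LEMMA_DICT` are a hand-made sample (an assumption for this example); a real lemmatizer relies on a full lexical resource and usually on part-of-speech information as well.

```python
# Tiny illustrative lemma dictionary (assumption); real ones are far larger.
LEMMA_DICT = {
    "cats": "cat",
    "loves": "love",
    "books": "book",
}


def lemmatize(tokens, lemma_dict=LEMMA_DICT):
    """Replace each token with its dictionary base form when one is known."""
    # Tokens that are not in the dictionary are returned unchanged.
    return [lemma_dict.get(token, token) for token in tokens]


print(lemmatize(["loves", "read", "books", "bedtime"]))
# ['love', 'read', 'book', 'bedtime']
```

Because unknown words pass through untouched, the output always consists of real words, which is the main readability advantage over stemming.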

I provide the preprocessing code here ⭐ Colab: Preprocessing. You can reuse it across your projects. It includes Indonesian and English text for stemming and lemmatization.

Written by Gaudhiwaa Hendrasto