Natural Language Processing Fundamentals

Cleaning Text Data

Most of the time, text data cannot be used as it is, because stray symbols, links, and other noise make it dirty or unfit for use. Data cleaning is the art of extracting the meaningful portions of data by eliminating unnecessary details. Consider the sentence, He tweeted, 'Live coverage of General Elections available at this.tv/show/ge2019. _/\_ Please tune in :) '.

Various symbols, such as "_/\_" and ":)," are present in the sentence. They do not contribute much to its meaning. We need to remove such unwanted details. This is done not only to focus more on the actual content but also to reduce computations. To achieve this, methods such as tokenization and stemming are used. We will learn about them one by one in the upcoming sections.
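
As a quick illustration, here is a minimal sketch that strips such symbols from the tweet above, using the same regular expression that the exercises in this section use (the variable name tweet is just illustrative):

    import re

    tweet = "Live coverage of General Elections available at this.tv/show/ge2019. _/\\_ Please tune in :)"
    # Replace every run of characters that are not letters, digits, or whitespace
    # (as well as underscores) with a single space, then split on whitespace
    re.sub(r'([^\s\w]|_)+', ' ', tweet).split()
    # ['Live', 'coverage', 'of', 'General', 'Elections', 'available', 'at',
    #  'this', 'tv', 'show', 'ge2019', 'Please', 'tune', 'in']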

Tokenization

Tokenization and word tokenizers were briefly described in Chapter 1, Introduction to Natural Language Processing. Tokenization is the process of splitting sentences into their constituents; that is, words. In this chapter, we will see how tokenization is done using various packages.

Cleaning text data is essential before tokenization, and regular expressions are widely used for this. A regular expression is a sequence of characters that defines a pattern, which is then searched for in the text. In Python, the re package is used to work with regular expressions. To get a better understanding of this, we will carry out the exercise in the next section.
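
As another small, hypothetical example of the re package (the pattern and text here are illustrative only), re.findall returns every substring that matches a given pattern:

    import re

    # Find every run of word characters (letters, digits, and underscores) in the text
    re.findall(r'\w+', 'Live coverage available at this.tv/show/ge2019')
    # ['Live', 'coverage', 'available', 'at', 'this', 'tv', 'show', 'ge2019']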

Exercise 12: Text Cleaning and Tokenization

In this exercise, we will clean a text and extract the tokens from it. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import the re package:

    import re

  3. Store the text to be cleaned in a sentence variable:

    sentence = 'Sunil tweeted, "Witnessing 70th Republic Day of India from Rajpath, \
    New Delhi. Mesmerizing performance by Indian Army! Awesome airshow! @india_official \
    @indian_army #India #70thRepublic_Day. For more photos ping me sunil@photoking.com :)"'

  4. Delete all characters other than digits, alphabetic characters, and whitespace from the text, and then use the split() function to split the cleaned string into tokens. Add the following code to implement this:

    re.sub(r'([^\s\w]|_)+', ' ', sentence).split()

    This command replaces every character other than letters, digits, and whitespace (as well as underscores) with a space and then splits the resulting string on whitespace. The output should be as follows:

Figure 2.6: Fragmented string

We have learned about how to extract the tokens from a text. Often, extracting each token separately does not help. For instance, consider the sentence, "I don't hate you, but your behavior." Here, if we process each of the tokens, such as "hate" and "behavior," separately, then the true meaning of the sentence would not be comprehended. In this case, the context in which these tokens are present becomes essential. Thus, we consider n consecutive tokens at a time. n-grams refers to the grouping of n consecutive tokens together. In the next section, we will look at an exercise where n-grams can be extracted from a given text.
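
Before the exercise, here is a minimal, illustrative sketch that pairs consecutive tokens into bi-grams using Python's built-in zip; the exercise that follows shows a more general, reusable function:

    # Pair every token with the token that follows it to form bi-grams (n = 2)
    tokens = "I don't hate you, but your behavior".split()
    list(zip(tokens, tokens[1:]))
    # [('I', "don't"), ("don't", 'hate'), ('hate', 'you,'),
    #  ('you,', 'but'), ('but', 'your'), ('your', 'behavior')]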

Exercise 13: Extracting n-grams

In this exercise, we will extract n-grams using three different methods; namely, via custom-defined functions, via nltk, and via TextBlob. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import the re package and define a custom function that we can use to extract n-grams. Add the following code to do this:

    import re

    def n_gram_extractor(sentence, n):
        # Remove everything except letters, digits, and whitespace, then split into tokens
        tokens = re.sub(r'([^\s\w]|_)+', ' ', sentence).split()
        # Print every group of n consecutive tokens
        for i in range(len(tokens)-n+1):
            print(tokens[i:i+n])

  3. To check the bi-grams, we call the function with the text and n=2. Add the following code to do this:

    n_gram_extractor('The cute little boy is playing with the kitten.', 2)

    The code generates the following output:

    Figure 2.7: Bi-grams

  4. To check the tri-grams, we call the function with the text and n=3. Add the following code to do this:

    n_gram_extractor('The cute little boy is playing with the kitten.', 3)

    The code generates the following output:

    Figure 2.8: Tri-grams

  5. To check the bi-grams using the nltk library, add the following code:

    from nltk import ngrams

    list(ngrams('The cute little boy is playing with the kitten.'.split(), 2))

    The code generates the following output:

    Figure 2.9: Bi-grams

  6. To check the tri-grams using the nltk library, add the following code:

    list(ngrams('The cute little boy is playing with the kitten.'.split(), 3))

    The code generates the following output:

    Figure 2.10: Tri-grams

  7. To check the bi-grams using the TextBlob library, add the following code:

    from textblob import TextBlob

    blob = TextBlob("The cute little boy is playing with the kitten.")

    blob.ngrams(n=2)

    The code generates the following output:

    Figure 2.11: Bi-grams

  8. To check the tri-grams using the TextBlob library, add the following code:

    blob.ngrams(n=3)

    The code generates the following output:

Figure 2.12: Tri-grams

Keras and TextBlob are two of the most popular Python libraries used for performing various NLP tasks. TextBlob provides a simple and easy-to-use interface to do so. Keras is used mainly for performing deep learning-based NLP tasks. In the next section, we will carry out an exercise where we use the Keras and TextBlob libraries to tokenize texts.

Exercise 14: Tokenizing Texts with Different Packages – Keras and TextBlob

In this exercise, we will make use of Keras and TextBlob to tokenize texts. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook, insert a new cell, and declare a variable sentence:

    sentence = 'Sunil tweeted, "Witnessing 70th Republic Day of India from Rajpath, \
    New Delhi. Mesmerizing performance by Indian Army! Awesome airshow! @india_official \
    @indian_army #India #70thRepublic_Day. For more photos ping me sunil@photoking.com :)"'

  2. Import the keras and textblob libraries:

    from keras.preprocessing.text import text_to_word_sequence

    from textblob import TextBlob

  3. To tokenize using the keras library, add the following code:

    text_to_word_sequence(sentence)

    The code generates the following output:

    Figure 2.13: Tokenization using Keras

  4. To tokenize using the TextBlob library, add the following code:

    blob = TextBlob(sentence)

    blob.words

    The code generates the following output:

Figure 2.14: Tokenization using TextBlob

We have learned how to tokenize texts using the Keras and TextBlob libraries. In the next section, we will discuss different types of tokenizers.

Types of Tokenizers

There are different types of tokenizers that come in handy for specific tasks. Let's look at them one by one:

  • Tweet tokenizer: This is specifically designed for tokenizing tweets. It is capable of dealing with emoticons, hashtags, and other expressions of sentiment that are used widely on Twitter.
  • MWE tokenizer: MWE stands for Multi-Word Expression. Here, certain groups of multiple words are treated as one entity during tokenization, such as "United States of America," "People's Republic of China," "not only," and "but also."
  • Regular expression tokenizer: These tokenizers are developed using regular expressions. Sentences are split based on the occurrence of a particular pattern.
  • Whitespace tokenizer: This tokenizer splits a string whenever a space, tab, or newline character is present.
  • WordPunct tokenizer: This splits a text into a list of alphanumeric and non-alphanumeric tokens, so punctuation marks become separate tokens.

Now that we have learned about the different types of tokenizers, in the next section, we will carry out an exercise to get a better understanding of them.

Exercise 15: Tokenizing Text Using Various Tokenizers

In this exercise, we will make use of different tokenizers to tokenize text. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Insert a new cell and add the following code to declare a sentence variable:

    sentence = 'Sunil tweeted, "Witnessing 70th Republic Day of India from Rajpath, \
    New Delhi. Mesmerizing performance by Indian Army! Awesome airshow! @india_official \
    @indian_army #India #70thRepublic_Day. For more photos ping me sunil@photoking.com :)"'

  3. To tokenize the text using TweetTokenizer, add the following code:

    from nltk.tokenize import TweetTokenizer

    tweet_tokenizer = TweetTokenizer()

    tweet_tokenizer.tokenize(sentence)

    The code generates the following output:

    Figure 2.15: Tokenization using TweetTokenizer

  4. To tokenize the text using the MWE tokenizer, add the following code:

    from nltk.tokenize import MWETokenizer

    mwe_tokenizer = MWETokenizer([('Republic', 'Day')])

    mwe_tokenizer.add_mwe(('Indian', 'Army'))

    mwe_tokenizer.tokenize(sentence.split())

    The code generates the following output:

    Figure 2.16: Tokenization using the MWE tokenizer

  5. In the preceding figure, the words "Indian" and "Army!" were supposed to be treated as a single entity, but they got separated. This is because "Army!" (not "Army") is treated as a token. Let's see how this can be fixed in the next step.
  6. Add the following code to fix the issues in the previous step:

    mwe_tokenizer.tokenize(sentence.replace('!','').split())

    The code generates the following output:

    Figure 2.17: Tokenization using the MWE tokenizer after removing the "!" sign

  7. To tokenize the text using the regular expression tokenizer, add the following code:

    from nltk.tokenize import RegexpTokenizer

    reg_tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

    reg_tokenizer.tokenize(sentence)

    The code generates the following output:

    Figure 2.18: Tokenization using the regular expression tokenizer

  8. To tokenize the text using the whitespace tokenizer, add the following code:

    from nltk.tokenize import WhitespaceTokenizer

    wh_tokenizer = WhitespaceTokenizer()

    wh_tokenizer.tokenize(sentence)

    The code generates the following output:

    Figure 2.19: Tokenization using the whitespace tokenizer

  9. To tokenize the text using the WordPunct tokenizer, add the following code:

    from nltk.tokenize import WordPunctTokenizer

    wp_tokenizer = WordPunctTokenizer()

    wp_tokenizer.tokenize(sentence)

    The code generates the following output:

Figure 2.20: Tokenization using the WordPunct tokenizer

We have learned how to tokenize text using different tokenizers.

Issues with Tokenization

Although tokenization appears to be an easy task, in reality, it is not so. This is primarily because of ambiguities that arise due to the presence of whitespaces and hyphens. Moreover, sentences in certain languages, such as Chinese and Japanese, do not have words separated by whitespaces, thus making it difficult to tokenize them. In the next section, we will discuss another pre-processing step: stemming.

Stemming

In languages such as English, words change their form when they are used in sentences; for example, "play" becomes "plays," "playing," or "played." Stemming is the process of reducing such words back to their base form. This is essential because, without it, a program would treat two or more different forms of the same word as different entities, despite them having the same meaning. RegexpStemmer and the Porter stemmer are the most widely used stemmers. Let's learn about them one by one.
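
For instance, here is a minimal sketch (using NLTK's Porter stemmer, which is introduced properly later in this section) showing that several inflected forms of "play" reduce to the same stem:

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    # All three inflected forms reduce to the same stem, 'play'
    [stemmer.stem(word) for word in ['playing', 'played', 'plays']]
    # ['play', 'play', 'play']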

RegexpStemmer

RegexpStemmer uses regular expressions to check whether morphological or structural prefixes or suffixes are present. For instance, in many cases, the gerund form of a verb (so, the form ending with "ing") can be restored back to the base form simply by removing "ing" from the end; for example, "playing" -> "play".

Let's do the following exercise to get some hands-on experience of using RegexpStemmer.

Exercise 16: Converting Words in Gerund Form into Base Words Using RegexpStemmer

In this exercise, we will use RegexpStemmer on text to transform words ending with "ing" into their base form. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Insert a new cell and add the following code to declare a sentence variable:

    sentence = "I love playing football"

  3. Now we'll make use of regex_stemmer to stem each word of the sentence variable. Add the following code to do this:

    from nltk.stem import RegexpStemmer

    regex_stemmer = RegexpStemmer('ing$', min=4)

    ' '.join([regex_stemmer.stem(wd) for wd in sentence.split()])

    The code generates the following output:

Figure 2.21: Stemming using RegexpStemmer

In the next section, we will discuss the Porter stemmer.

The Porter Stemmer

The Porter stemmer is the most commonly used stemmer for English words. It removes various morphological and inflectional suffixes from English words and, in doing so, helps us to extract the base form of a word from its variations. To get a better understanding of this, we will carry out an exercise in the next section.

Exercise 17: The Porter Stemmer

In this exercise, we will apply the Porter stemmer to some text. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import nltk and related packages, and declare a sentence variable. Add the following code to do this:

    sentence = "Before eating, it would be nice to sanitize your hands with a sanitizer"

    from nltk.stem.porter import *

  3. Now we'll make use of the Porter stemmer to stem each word of the sentence variable:

    ps_stemmer = PorterStemmer()

    ' '.join([ps_stemmer.stem(wd) for wd in sentence.split()])

    The code generates the following output:

Figure 2.22: Stemming using the Porter stemmer

In the next section, we will learn about another pre-processing step: lemmatization.

Lemmatization

A problem that occurs while stemming is that, often, stemmed words do not carry any meaning. For instance, if we use the Porter stemmer on the word "independence," we get "independ." Now, the word "independ" is not present in the English dictionary; it does not carry any meaning. Lemmatization deals with such cases by using a vocabulary and analyzing the words' morphologies. It returns the base forms of words that can actually be found in dictionaries. To get a better understanding of this, let's look at an exercise in the next section.
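
The difference is easy to see in code. Below is a minimal sketch (assuming NLTK is installed and the wordnet corpus has been downloaded) that compares the Porter stemmer from the previous exercise with the WordNet lemmatizer used in the next one:

    import nltk
    from nltk.stem.porter import PorterStemmer
    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')  # required the first time the WordNet lemmatizer is used

    PorterStemmer().stem('independence')           # returns 'independ' -- not a dictionary word
    WordNetLemmatizer().lemmatize('independence')  # returns 'independence' -- a valid dictionary word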

Exercise 18: Lemmatization

In this exercise, we will make use of lemmatization to lemmatize a given text. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import nltk and related packages, then declare a sentence variable. Add the following code to implement this:

    import nltk

    from nltk.stem import WordNetLemmatizer

    from nltk import word_tokenize

    nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()

    sentence = "The products produced by the process today are far better than what it produces generally."

  3. To lemmatize the tokens extracted from the sentence, add the following code:

    ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(sentence)])

    The code generates the following output:

Figure 2.23: Lemmatization using the WordNet lemmatizer

In the next section, we will deal with another kind of word variation: singularizing and pluralizing words. TextBlob provides handy functions for both. Let's see how this is done in the following exercise.

Exercise 19: Singularizing and Pluralizing Words

In this exercise, we will make use of the TextBlob library to singularize and pluralize the words in a given text. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import TextBlob and declare a sentence variable. Add the following code to implement this:

    from textblob import TextBlob

    sentence = TextBlob('She sells seashells on the seashore')

    To check the list of words in sentence, type the following code:

    sentence.words

    The code generates the following output:

    Figure 2.24: Extracting words from a sentence using TextBlob

  3. To singularize the third word ("seashells") in the given sentence, type the following code:

    sentence.words[2].singularize()

    The code generates the following output:

    Figure 2.25: Singularizing a word using TextBlob

  4. To pluralize the sixth word ("seashore") in the given sentence, type the following code:

    sentence.words[5].pluralize()

    The code generates the following output:

Figure 2.26: Pluralizing a word using TextBlob

In the next section, we will learn about another pre-processing task: language translation.

Language Translation

Different languages are often used together to convey something. In such cases, translating the entire text into a single language becomes an essential pre-processing task for analyzing it. Let's look at an exercise in the next section.

Exercise 20: Language Translation

In this exercise, we will make use of the TextBlob library to translate a sentence from Spanish to English. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import the TextBlob library:

    from textblob import TextBlob

  3. Make use of the translate function of TextBlob to translate the input text from Spanish to English. Add the following code to do this:

    es_blob = TextBlob(u'muy bien')

    es_blob.translate(from_lang='es', to='en')

    The code generates the following output:

Figure 2.27: Language translation using TextBlob

In the next section, we will look at another pre-processing task: stop-word removal.

Stop-Word Removal

Stop words, such as "am," "the," and "are," help us construct sentences, but they contribute little to the meaning of the sentences in which they appear. Thus, we can often safely remove them. To get a better understanding of this, let's look at an exercise in the next section.

Exercise 21: Stop-Word Removal

In this exercise, we will remove the stop words from a given text. Follow these steps to implement this exercise:

  1. Open a Jupyter notebook.
  2. Import nltk and declare a sentence variable with the text in question:

    from nltk import word_tokenize

    sentence = "She sells seashells on the seashore"

  3. Define a custom list of stop words and execute the following lines of code:

    custom_stop_word_list = ['she', 'on', 'the', 'am', 'is', 'not']

    ' '.join([word for word in word_tokenize(sentence) if word.lower() not in custom_stop_word_list])

    The code generates the following output:

Figure 2.28: Removing a custom set of stop words
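
The exercise above uses a custom stop-word list. NLTK also ships a built-in English stop-word list; here is a minimal sketch of using it instead (it assumes the stopwords corpus can be downloaded):

    import nltk
    from nltk import word_tokenize
    from nltk.corpus import stopwords

    nltk.download('stopwords')  # download the stop-word lists, if not already present
    stop_words = set(stopwords.words('english'))

    sentence = "She sells seashells on the seashore"
    # Keep only the tokens that are not in NLTK's English stop-word list
    ' '.join([word for word in word_tokenize(sentence) if word.lower() not in stop_words])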

We have learned how to remove the stop words from a given sentence. In the next section, we will explore the concept of extracting features from texts.