Getting Started with Natural Language Processing MEAP V06

This is an excerpt from Manning's book Getting Started with Natural Language Processing MEAP V06.

This suggests the first improvement to our algorithm: let’s remove the less meaningful words. In NLP applications, such words are called stopwords, and luckily you don’t have to bother with enumerating them yourself: since stopwords are highly repetitive in English, most NLP toolkits ship with a predefined stopwords list that you can rely on when processing your data. If you want to customize it – for example, if you believe the list should be extended with more words, or that some words included in the standard list should not be there – you can use your own list of stopwords instead.

In addition to removing stopwords, note that Figure 3.6 doesn’t highlight punctuation marks, i.e., full stops, commas, and question marks. Punctuation may prove useful in some applications, but it is unlikely to help here: many queries will contain question marks while documents won’t necessarily have any, and all documents will contain commas and full stops, so punctuation marks are not going to be informative. Let’s filter them out, too. Listing 6 shows how to do that.

Listing 6 Preprocessing: Stopwords and punctuation marks removal
import nltk
import string    #A
from nltk import word_tokenize
from nltk.corpus import stopwords    #B
 
def process(text):
    stoplist = set(stopwords.words('english'))    #C
    word_list = [word for word in word_tokenize(text.lower())
                 if word not in stoplist and word not in string.punctuation]    #D
    return word_list
 
word_list = process(documents.get("1"))
print(word_list)    #E

If you run the code from Listing 6 to preprocess document 1, it returns a list of words beginning with ['18', 'editions', 'dewey', 'decimal', 'classifications', …] for the original text of document 1 from Table 3.3, which reads “18 Editions of the Dewey Decimal Classifications …” That is, the preprocessing step removes stopwords like “of” and “the” from the word list.

6.2.2   Counts of stopwords and proportion of stopwords as features

You came across the notion of stopwords several times in this book: stopwords are words that are used frequently in language, and most of the time, they don’t have a meaning of their own. Usually, they connect other meaningful words to each other or express some other function rather than meaning: for instance, articles like a and the are stopwords – they are very frequent, and their main goal in language is to express whether you came across a mention of a particular object before (definite article the) or not (indefinite article a). Prepositions like at, on, about and others usually connect meaningful words to the notions of location (stay at home), time (meet on Friday), topic (talk about politics), and so on. In many applications they can be disposed of, as they don’t contribute much to the task itself – you’ve seen examples of such applications in Chapters 3 and 4.
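To make the idea concrete, here is a minimal sketch of stopword removal using a tiny hand-picked stopword set – a stand-in for a toolkit’s full list, chosen only for illustration:

```python
# A tiny illustrative stopword set (a hand-picked subset for this example,
# not any toolkit's official list).
STOPWORDS = {"a", "the", "of", "at", "on", "about", "is", "to", "in"}

def strip_stopwords(tokens):
    """Return the tokens with stopwords removed (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

sentence = "stay at home and talk about politics".split()
print(strip_stopwords(sentence))  # ['stay', 'home', 'and', 'talk', 'politics']
```

Note how the prepositions “at” and “about” disappear while the content words – the ones carrying the location and the topic – survive.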

However, one’s particular writing style is a whole different matter. As it turns out, different authors use function words of different types with different frequencies. For instance, if you prefer using the word “but” whenever I use “however”, our writing styles will differ with respect to the use of these stopwords even if we otherwise use exactly the same set of words. If you notice that you tend to use expressions like “well”, “sort of”, or “you know”, these, too, are mostly composed of stopwords.
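The “but” versus “however” example can be sketched as a quick frequency comparison. The two passages below are invented for illustration, not drawn from any real author:

```python
def connective_rates(tokens, connectives=("but", "however")):
    """Relative frequency of each connective in a token list."""
    lowered = [t.lower() for t in tokens]
    return {c: lowered.count(c) / len(lowered) for c in connectives}

# Same content words, different preferred connective (toy examples):
author_a = "i agree but the plan is risky but workable".split()
author_b = "i agree however the plan is risky however workable".split()
print(connective_rates(author_a))  # 'but' frequent, 'however' absent
print(connective_rates(author_b))  # the mirror image
```

Even though both passages say the same thing, the two frequency profiles differ – which is exactly the signal a stopword-based stylistic feature picks up.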

To this end, let’s introduce a new feature type that counts the number of times each stopword is used in a sentence, plus a feature estimating the proportion of stopwords in a sentence. As opposed to counting occurrences of arbitrary words across texts, with stopwords we are dealing with a much more compact set of words (e.g., spaCy’s stopwords list contains only 305 words) that occur frequently across sentences. Listing 6.5 shows how to implement a method that counts the number of times each word occurs in a text, as well as the proportion of words of a certain type in a sentence:

Listing 6.5 Code to calculate the number and proportion of times certain words occur
def word_counts(text):
    counts = {}
    for word in text:
        counts[word.lower()] = counts.get(word.lower(), 0) + 1    #A
    return counts
 
def proportion_words(text, wordlist):
    count = 0
    for word in text:
        if word.lower() in wordlist:
            count += 1
    return float(count)/float(len(text))   #B

Now let’s use these methods to calculate the number of times we see each of the stopwords in sentences written by each of the authors, as well as the proportion of stopwords in their sentences. We are going to use the stopwords list from spaCy, and for each of the 305 words we are going to add one feature to the feature set representing the number of times this particular stopword occurs in a particular sentence, and then add one extra feature representing the proportion of stopwords among all words used in each sentence by each writer. This means that our feature set at this point will contain 308 features. For instance, one of the new features in this feature set corresponds to the count of the stopword ‘a’: for the sentence “It is yet too early in life to despair of such a happiness” this count equals 1.0, while for the sentence “Not so happy, yet much happier” it is 0.0. At the same time, for the feature representing the count of the stopword ‘not’, the feature values are exactly the opposite: 0.0 and 1.0. 9 out of the total of 13 words in the sentence “It is yet too early in life to despair of such a happiness” are stopwords (namely it, is, yet, too, in, to, of, such, and a), therefore the stopwords proportion for this sentence equals 0.69. There are 7 words in the sentence “Not so happy, yet much happier” in total, and 4 of them (not, so, yet, and much) are stopwords, making the proportion for this sentence equal to 0.57, as Figure 6.11 illustrates:
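Putting the pieces together, a minimal sketch of the combined feature vector might look as follows. It reuses `word_counts` and `proportion_words` from Listing 6.5 so the snippet is self-contained; `STOPLIST` is a small hand-picked stand-in for spaCy’s 305-word list, so with the full list the vector would have 305 count features plus the proportion feature:

```python
# Hand-picked stand-in for spaCy's stopwords list (illustration only).
STOPLIST = ["a", "is", "it", "not", "of", "so", "such", "to", "too", "yet", "in", "much"]

def word_counts(text):
    """Count occurrences of each word (lowercased), as in Listing 6.5."""
    counts = {}
    for word in text:
        counts[word.lower()] = counts.get(word.lower(), 0) + 1
    return counts

def proportion_words(text, wordlist):
    """Proportion of tokens that belong to wordlist, as in Listing 6.5."""
    count = 0
    for word in text:
        if word.lower() in wordlist:
            count += 1
    return float(count) / float(len(text))

def stopword_features(tokens):
    """One count feature per stopword, plus one overall proportion feature."""
    counts = word_counts(tokens)
    features = [float(counts.get(sw, 0)) for sw in STOPLIST]
    features.append(proportion_words(tokens, STOPLIST))
    return features

tokens = "It is yet too early in life to despair of such a happiness".split()
print(stopword_features(tokens))  # last value is the 9/13 ≈ 0.69 proportion
```

For this sentence the count feature for ‘a’ is 1.0, the count for ‘not’ is 0.0, and the final proportion feature comes out at 9/13 ≈ 0.69, matching the worked example above.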

Figure 6.12 Accuracy scores after adding the new types of features – counts of stopwords (F3) and their proportion in a sentence (F4) – compared to the previous models