Concept: word (category: Python)

This is an excerpt from Manning's book Tiny Python Projects: Learn coding and testing with puzzles and games.
Named options

Most command-line programs define a short name like -n (one dash and a single character) and a long name like --name (two dashes and a word), followed by some value, like the name in the hello.py program. Named options allow arguments to be provided in any order; their position is not relevant. This makes them the right choice when the user is not required to provide them (they are options, after all). It’s good to provide reasonable default values for options. When we changed the required positional name argument of hello.py to the optional --name argument, we used “World” for the default so that the program could run with no input from the user. Note that some other languages, like Java, might define long names with a single dash, like -jar.
$ ./scrambler.py -h
usage: scrambler.py [-h] [-s seed] text

Scramble the letters of words

positional arguments:
  text                  Input text or file

optional arguments:
  -h, --help            show this help message and exit
  -s seed, --seed seed  Random seed (default: None)
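Help output like the above could be produced by an argparse setup along these lines. This is a minimal sketch, not the book's scrambler.py source; the argument names simply mirror the usage message.

#!/usr/bin/env python3
"""A minimal sketch of an argparse definition whose -h output
resembles the scrambler.py usage message above (not the book's code)."""
import argparse

def get_args():
    parser = argparse.ArgumentParser(description='Scramble the letters of words')
    parser.add_argument('text', help='Input text or file')
    parser.add_argument('-s', '--seed', metavar='seed', type=int, default=None,
                        help='Random seed (default: %(default)s)')
    return parser.parse_args()

if __name__ == '__main__':
    args = get_args()
    print(args.text, args.seed)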

This is an excerpt from Manning's book Python Workout: 50 ten-minute exercises.
Try the same thing, but have the program choose a random word from the dictionary, and then ask the user to guess the word. (You might want to limit yourself to words containing two to five letters, to avoid making it too horribly difficult.) Instead of telling the user that they should guess a smaller or larger number, have them choose an earlier or later word in the dict.
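One way to approach that variant is sketched below. This is my own illustration, not the book's solution; it assumes a newline-delimited word list at the hypothetical path /usr/share/dict/words.

"""A minimal sketch of the word-guessing variant described above.
Assumes a newline-delimited word list at /usr/share/dict/words (hypothetical path)."""
import random

def guessing_game(wordlist_path='/usr/share/dict/words'):
    with open(wordlist_path) as f:
        words = [w.strip().lower() for w in f if 2 <= len(w.strip()) <= 5]
    secret = random.choice(words)
    guesses = 0
    while True:
        guess = input('Guess the word: ').strip().lower()
        guesses += 1
        if guess == secret:
            print(f'You got it in {guesses} guesses!')
            break
        elif guess < secret:
            print('Try a later word in the dictionary.')
        else:
            print('Try an earlier word in the dictionary.')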
def plword(word):
    if word[0] in 'aeiou':
        return word + 'way'
    return word[1:] + word[0] + 'ay'

def plfile(filename):
    return ' '.join(plword(one_word)
                    for one_line in open(filename)      #1
                    for one_word in one_line.split())   #2

This is an excerpt from Manning's book Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code.
Let’s look at our tweet-level transformation first. At the tweet level, we’ll convert a Tweet ID into a single score for that tweet, representing the gender score of that tweet. We’ll score the tweets by giving them points based on the words they use. Some words will make the tweet more of a “man’s tweet,” and some will make the tweet more of a “woman’s tweet.” We can see this process playing out in figure 3.9.
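The scoring code itself isn't included in this excerpt, but the idea can be sketched as follows. The word sets and point values here are hypothetical placeholders, not the book's lexicon.

"""A rough sketch of word-based tweet scoring as described above.
The word sets and point values are hypothetical placeholders, not the book's lexicon."""
MANLY_WORDS = {'beard', 'football'}    # placeholder lexicon
WOMANLY_WORDS = {'makeup', 'dress'}    # placeholder lexicon

def score_tweet(tweet_text):
    score = 0
    for word in tweet_text.lower().split():
        if word in MANLY_WORDS:
            score -= 1   # word nudges the tweet toward "man's tweet"
        elif word in WOMANLY_WORDS:
            score += 1   # word nudges the tweet toward "woman's tweet"
    return score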
We can turn this dataset into a sequence of words by calling the .flatMap method of the RDD. The .flatMap method is like map but results in a flat sequence, not a nested sequence. .flatMap also returns an RDD, so we can use the .filter method of the RDD to filter down to only the large words, and then the .countByValue method of that resulting RDD to gather the counts. We can see this whole process in just a few lines in the following listing.
Listing 7.7. Counting words of more than six letters in Spark
#!/usr/bin/env python3
import re
from pyspark import SparkContext

if __name__ == "__main__":                               #1
    sc = SparkContext(appName="Nightingale")             #2
    PAT = re.compile(r'[-./:\s\xa0]+')                   #3
    fp = "/path/to/florence/nightingale/*"
    text_files = sc.textFile(fp)                         #4
    xs = (text_files.flatMap(lambda x: PAT.split(x))     #5
                    .filter(lambda x: len(x) > 6)        #6
                    .countByValue())                     #7
    for k, v in xs.items():                              #8
        print("{:<30}{}".format(k.encode("ascii", "ignore"), v))

When you’re done running the code, you should see a long list of large words output. If all’s right, the words should all be over six letters in length. There will also be a bunch of output related to the Spark job that was run to process this code. The final result will look something like the following listing.
Let’s turn back to our example of finding the counts of long words. For Hadoop, we’ll focus only on the words by Florence and the Machine. (We’ll save the texts of Florence Nightingale for Spark later in this chapter.) To get counts of specific words with Hadoop—instead of simply an overall count of words—we’ll have to modify our mapper and our reducer. Before we jump right into the code, let’s take a look at how this process will compare with our word counting example. I’ve diagrammed both processes, step by step, in figure 7.7.
Figure 7.7. Counting words and getting the frequencies of a subset of words have similar forms but require different mappers and reducers.
With our word count mapper, we had to extract the words from the document and print them to the terminal. We’ll do something very similar for our long word frequency example; however, we’ll want to add a check to ensure we’re only printing out long words. Note that this behavior—doing our filtering and breaking our documents into sequences of words—is very similar to how the workflow might execute in Python. As we iterated through the sequence, both the transformation and the filter would lazily be called on the lines of a document.
For our word count reducer, we had a counter that we incremented every time we saw a word. This time, we’ll need more complex behavior. Luckily, we already have this behavior on hand. We’ve implemented a frequency reduction several times and can reuse that reduction code here. Let’s modify our reducer from listing 7.2 so it uses the make_counts function we first wrote back in chapter 5. Our mapper will look like listing 7.4, and our reducer will look like listing 7.5.
Listing 7.4. Hadoop mapper script to get and filter words
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    for word in line.split():
        if len(word) > 6:
            print(word)

Listing 7.5. Hadoop reducer script to accumulate counts
#!/usr/bin/env python3
import sys
from functools import reduce

def make_counts(acc, nxt):                                           #1
    acc[nxt] = acc.get(nxt, 0) + 1
    return acc

counts = reduce(make_counts, (w.strip() for w in sys.stdin), {})     #2
for word, count in counts.items():
    print(word, count)

The output of our MapReduce job will be a single file with a sequence of words and their counts in it. The results should look like figure 7.8. We also should see some log text printed to the screen. We can quickly check to see that all the words are longer than six letters, just as we’d hoped. In chapter 8, we’ll explore Hadoop in more depth and tackle scenarios beyond word filtering and counting.

This is an excerpt from Manning's book Hello World! Third Edition.
Do you know how to figure out how long it will take to get somewhere in a car? The formula (in words) is “travel time equals distance divided by speed.” Make a program to calculate the time it will take to drive 200 km at 80 km per hour and display the answer.
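A minimal sketch of one possible solution (the variable names are my own):

# travel time = distance / speed
distance = 200   # km
speed = 80       # km per hour
travel_time = distance / speed
print("Travel time:", travel_time, "hours")   # prints 2.5 hours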
def wrong(self):
    self.pieces_shown += 1
    for i in range(self.pieces_shown):
        self.pieces[i].setHidden(False)
    if self.pieces_shown == len(self.pieces):
        message = "You lose. The word was " + self.currentword
        QtWidgets.QMessageBox.warning(self, "Hangman", message)
        self.new_game()

This is an excerpt from Manning's book Natural Language Processing in Action: Understanding, analyzing, and generating text with Python.
The meaning and intent of words can be deciphered by machines.
So you’re ready to save the world with the power of natural language processing? Well the first thing you need is a powerful vocabulary. This chapter will help you split a document, any string, into discrete tokens of meaning. Our tokens are limited to words, punctuation marks, and numbers, but the techniques we use are easily extended to any other units of meaning contained in a sequence of characters, like ASCII emoticons, Unicode emojis, mathematical symbols, and so on.
Retrieving tokens from a document will require some string manipulation beyond just the str.split() method employed in chapter 1. You’ll need to separate punctuation from words, like quotes at the beginning and end of a statement. And you’ll need to split contractions like “we’ll” into the words that were combined to form them. Once you’ve identified the tokens in a document that you’d like to include in your vocabulary, you’ll return to the regular expression toolbox to try to combine words with similar meaning in a process called stemming. Then you’ll assemble a vector representation of your documents called a bag of words, and you’ll try to use this vector to see if it can help you improve upon the greeting recognizer sketched out at the end of chapter 1.
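As a rough illustration of that kind of string manipulation, a regular expression can pull punctuation and simple contractions apart from words. This is a sketch of my own, not the book's tokenizer.

"""A rough sketch (not the book's tokenizer) of separating punctuation and
simple contractions from words using the re module."""
import re

def simple_tokenize(text):
    # keep word characters and apostrophes together; everything else becomes its own token
    tokens = re.findall(r"[\w']+|[^\w\s]", text)
    out = []
    for tok in tokens:
        # naively split common contractions like "we'll" -> "we", "'ll"
        match = re.match(r"^(\w+)('(?:ll|re|ve|s|d|m|t))$", tok)
        if match:
            out.extend(match.groups())
        else:
            out.append(tok)
    return out

print(simple_tokenize('"We\'ll build Monticello," he said.'))
# ['"', 'We', "'ll", 'build', 'Monticello', ',', '"', 'he', 'said', '.']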
Think for a moment about what a word or token represents to you. Does it represent a single concept, or some blurry cloud of concepts? Could you be sure you could always recognize a word? Are natural language words like programming language keywords that have precise definitions and a set of grammatical usage rules? Could you write software that could recognize a word? Is “ice cream” one word or two to you? Don’t both words have entries in your mental dictionary that are separate from the compound word “ice cream”? What about the contraction “don’t”? Should that string of characters be split into one or two “packets of meaning?”
And words can be divided even further into smaller packets of meaning. Syllables, prefixes, and suffixes, like “re,” “pre,” and “ing,” have intrinsic meaning of their own. Parts of words can be divided further still: letters, or graphemes (https://en.wikipedia.org/wiki/Grapheme), carry sentiment and meaning.[1]
1 Morphemes are parts of words that contain meaning in and of themselves. Geoffrey Hinton and other deep learning deep thinkers have demonstrated that even graphemes (letters)—the smallest indivisible piece of written text—can be treated as if they are intrinsically meaningful.
We’ll talk about character-based vector space models in later chapters. But for now let’s just try to resolve the question of what a word is and how to divide up text into words.
What about invisible or implied words? Can you think of additional words that are implied by the single-word command “Don’t!”? If you can force yourself to think like a machine and then switch back to thinking like a human, you might realize that there are three invisible words in that command. The single statement “Don’t!” means “Don’t you do that!” or “You, do not do that!” That’s three hidden packets of meaning for a total of five tokens you’d like your machine to know about. But don’t worry about invisible words for now. All you need for this chapter is a tokenizer that can recognize words that are spelled out. You’ll worry about implied words and connotation and even meaning itself in chapter 4 and beyond.[2]
So storing all those zeros, and trying to remember the order of the words in all your documents, doesn’t make much sense. It’s not practical. And what you really want to do is compress the meaning of a document down to its essence. You’d like to compress your document down to a single vector rather than a big table. And you’re willing to give up perfect “recall.” You just want to capture most of the meaning (information) in a document, not all of it.
What if you split your documents into much shorter chunks of meaning, say sentences? And what if you assumed that most of the meaning of a sentence can be gleaned from just the words themselves? Let’s assume you can ignore the order and grammar of the words, and jumble them all up together into a “bag,” one bag for each sentence or short document. That turns out to be a reasonable assumption. Even for documents several pages long, a bag-of-words vector is still useful for summarizing the essence of a document. You can see that for your sentence about Jefferson, even after you sorted all the words lexically, a human can still guess what the sentence was about. So can a machine. You can use this new bag-of-words vector approach to compress the information content for each document into a data structure that’s easier to work with.
If you summed all these one-hot vectors together, rather than “replaying” them one at a time, you’d get a bag-of-words vector. This is also called a word frequency vector, because it only counts the frequency of words, not their order. You could use this single vector to represent the whole document or sentence in a single, reasonable-length vector. It would only be as long as your vocabulary size (the number of unique tokens you want to keep track of).
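For example, collections.Counter gives you such a word frequency vector directly. This is a sketch; the sentence is the Jefferson example used throughout this chapter.

"""A minimal sketch of a word frequency (bag-of-words) vector using collections.Counter."""
from collections import Counter

sentence = "Thomas Jefferson began building Monticello at the age of 26."
bag_of_words = Counter(sentence.split())
print(bag_of_words)
# Counter({'Thomas': 1, 'Jefferson': 1, 'began': 1, 'building': 1, ...})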
Alternatively, if you’re doing basic keyword search, you could OR the one-hot word vectors into a binary bag-of-words vector. And you could ignore a lot of words that wouldn’t be interesting as search terms or keywords. This would be fine for a search engine index or the first filter for an information retrieval system. Search indexes only need to know the presence or absence of each word in each document to help you find those documents later.
Just like laying your arm on the piano, hitting all the notes (words) at once doesn’t make for a pleasant, meaningful experience. Nonetheless this approach turns out to be critical to helping a machine “understand” a whole group of words as a unit. And if you limit your tokens to the 10,000 most important words, you can compress your numerical representation of your imaginary 3,500 sentence book down to 10 kilobytes, or about 30 megabytes for your imaginary 3,000-book corpus. One-hot vector sequences would require hundreds of gigabytes.
Fortunately, the words in your vocabulary are sparsely utilized in any given text. And for most bag-of-words applications, we keep the documents short; sometimes just a sentence will do. So rather than hitting all the notes on a piano at once, your bag-of-words vector is more like a broad and pleasant piano chord, a combination of notes (words) that work well together and contain meaning. Your chatbot can handle these chords even if there’s a lot of “dissonance” from words in the same statement that aren’t normally used together. Even dissonance (odd word usage) is useful information about a statement that a machine learning pipeline can make use of.
Here’s how you can put the tokens into a binary vector indicating the presence or absence of a particular word in a particular sentence. This vector representation of a set of sentences could be “indexed” to indicate which words were used in which document. This index is equivalent to the index you find at the end of many textbooks, except that instead of keeping track of which page a word occurs on, you can keep track of the sentence (or the associated vector) where it occurred. Whereas a textbook index generally only cares about important words relevant to the subject of the book, you keep track of every single word (at least for now).
Here’s what your single text document, the sentence about Thomas Jefferson, looks like as a binary bag-of-words vector:
>>> sentence_bow = {}
>>> for token in sentence.split():
...     sentence_bow[token] = 1
>>> sorted(sentence_bow.items())
[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]

One thing you might notice is that Python’s sorted() puts decimal numbers before characters, and capitalized words before lowercase words. This is the ordering of characters in the ASCII and Unicode character sets. Capital letters come before lowercase letters in the ASCII table. The order of your vocabulary is unimportant. As long as you are consistent across all the documents you tokenize this way, a machine learning pipeline will work equally well with any vocabulary order.
And you might also notice that using a dict (or any paired mapping of words to their 0/1 values) to store a binary vector shouldn’t waste much space. Using a dictionary to represent your vector ensures that it only has to store a 1 when any one of the thousands, or even millions, of possible words in your dictionary appear in a particular document. You can see how it would be much less efficient to represent a bag of words as a continuous list of 0’s and 1’s with an assigned location in a “dense” vector for each of the words in a vocabulary of, say, 100,000 words. This dense binary vector representation of your “Thomas Jefferson” sentence would require 100 kB of storage. Because a dictionary “ignores” the absent words, the words labeled with a 0, the dictionary representation only requires a few bytes for each word in your 10-word sentence. And this dictionary could be made even more efficient if you represented each word as an integer pointer to each word’s location within your lexicon—the list of words that makes up your vocabulary for a particular application.
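You can get a rough feel for the difference with sys.getsizeof. This is a sketch; the reported sizes are approximate and depend on the Python implementation, and they measure only the container itself.

"""A rough sketch comparing a sparse dict representation with a dense
100,000-slot list (sizes are approximate and implementation-dependent)."""
import sys

vocab_size = 100_000
sentence_bow = {'Thomas': 1, 'Jefferson': 1, 'began': 1, 'building': 1,
                'Monticello': 1, 'at': 1, 'the': 1, 'age': 1, 'of': 1, '26.': 1}
dense_bow = [0] * vocab_size   # one slot per vocabulary word, almost all zeros

print(sys.getsizeof(sentence_bow))   # a few hundred bytes for the 10-entry dict
print(sys.getsizeof(dense_bow))      # roughly 800 KB just for the list's pointers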
The set of English stop words that sklearn uses is quite different from those in NLTK. At the time of this writing, sklearn has 318 stop words. Even NLTK upgrades its corpora periodically, including the stop words list. When we reran listing 2.8 to count the NLTK stop words with nltk version 3.2.5 in Python 3.6, we got 179 stop words instead of 153 from an earlier version.
This is another reason to consider not filtering stop words. If you do, others may not be able to reproduce your results.
Depending on how much natural language information you want to discard ;), you can take the union or the intersection of multiple stop word lists for your pipeline. Here’s a comparison of sklearn stop words (version 0.19.2) and nltk stop words (version 3.2.5).
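A sketch of that comparison is shown below. The exact counts depend on the installed versions, and nltk's stopwords corpus must be downloaded before first use.

"""A sketch of comparing the sklearn and nltk stop word lists."""
import nltk
nltk.download('stopwords')   # one-time download of the nltk stop word corpus
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words

nltk_stop_words = set(stopwords.words('english'))
print(len(sklearn_stop_words), len(nltk_stop_words))
print(len(sklearn_stop_words.union(nltk_stop_words)))         # words in either list
print(len(sklearn_stop_words.intersection(nltk_stop_words)))  # words in both lists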

This is an excerpt from Manning's book Get Programming: Learn to code with Python.
Figure 18.1. Flowchart of the guessing game. A user guesses a word. The guessing loop is represented by the gray diamond, which checks whether the user guess is equal to the secret word. If it is, the game finishes. If it isn’t equal, you tell the player that they’re wrong, ask them to guess again, and add 1 to the number of times that they made a guess.
Do you care about the ordering of the words in files? If two files have the same words but in different order, are they still the same?
As you’re going through the keys of one dictionary, you check whether the key is also in the other dictionary. Recall that you’re looking at the value for each key; the value is the number of times the word occurs in one text. If the word is in both dictionaries, take the difference between the two frequency counts. If it isn’t, take the count from the one dictionary in which it exists.
After you finish going through one dictionary, go through the other dictionary. You no longer need to look at the difference between the two dictionary values because you already counted that previously. Now you’re just looking to see whether any words in the other dictionary weren’t in the first one. If so, add up their counts.
Finally, when you have the sum of the differences, divide that by the total number of words in both dictionaries. Take 1 minus that value to match the original problem specifications for scoring.
Listing 29.4. Calculate similarity given two input dictionaries
def calculate_similarity(dict1, dict2):
    """
    dict1: frequency dictionary for one text
    dict2: frequency dictionary for another text
    returns: float, representing how similar both texts are to each other
    """
    diff = 0
    total = 0
    for word in dict1.keys():                       #1
        if word in dict2.keys():                    #2
            diff += abs(dict1[word] - dict2[word])  #3
        else:                                       #4
            diff += dict1[word]                     #5
    for word in dict2.keys():                       #6
        if word not in dict1.keys():                #7
            diff += dict2[word]                     #8
    total = sum(dict1.values()) + sum(dict2.values())  #9
    difference = diff / total                       #10
    similar = 1.0 - difference                      #11
    return round(similar, 2)                        #12
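For instance, here is a small usage sketch with made-up frequency dictionaries (not the book's test data):

dict_a = {'the': 3, 'cat': 1, 'sat': 1}
dict_b = {'the': 2, 'dog': 1, 'sat': 1}
print(calculate_similarity(dict_a, dict_b))   # diff = 3, total = 9, so 1 - 3/9 = 0.67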
You wrote modular code by using functions that could be reused.

This is an excerpt from Manning's book Deep Learning with Python.
A dataset of text documents, where we represent each document by the counts of how many times each word appears in it (out of a dictionary of 20,000 common words). Each document can be encoded as a vector of 20,000 values (one count per word in the dictionary), and thus an entire dataset of 500 documents can be stored in a tensor of shape (500, 20000).
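A minimal sketch of that shape, using NumPy with random counts standing in for a real corpus:

"""A minimal sketch of the (500, 20000) count tensor described above,
with random integers in place of real word counts."""
import numpy as np

num_documents = 500
vocabulary_size = 20000
# each row is one document; each column is the count of one dictionary word
word_counts = np.random.randint(0, 10, size=(num_documents, vocabulary_size))
print(word_counts.shape)   # (500, 20000)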
This chapter explores deep-learning models that can process text (understood as sequences of words or sequences of characters), timeseries, and sequence data in general. The two fundamental deep-learning algorithms for sequence processing are recurrent neural networks and 1D convnets, the one-dimensional version of the 2D convnets that we covered in the previous chapters. We’ll discuss both of these approaches in this chapter.
Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters. Collectively, the different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into deep neural networks. There are multiple ways to associate a vector with a token. In this section, I’ll present two major ones: one-hot encoding of tokens, and token embedding (typically used exclusively for words, and called word embedding). The remainder of this section explains these techniques and shows how to use them to go from raw text to a Numpy tensor that you can send to a Keras network.
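For instance, word-level 2-grams can be extracted with a few lines of plain Python. This is a sketch of my own, not one of the book's listings.

# A sketch (not one of the book's listings) of extracting word-level n-grams.
def word_ngrams(text, n=2):
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams('The cat sat on the mat.', n=2))
# ['The cat', 'cat sat', 'sat on', 'on the', 'the mat.']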
Understanding n-grams and bag-of-words
One-hot encoding is the most common, most basic way to turn a token into a vector. You saw it in action in the initial IMDB and Reuters examples in chapter 3 (done with words, in that case). It consists of associating a unique integer index with every word and then turning this integer index i into a binary vector of size N (the size of the vocabulary); the vector is all zeros except for the ith entry, which is 1.
Of course, one-hot encoding can be done at the character level, as well. To unambiguously drive home what one-hot encoding is and how to implement it, listings 6.1 and 6.2 show two toy examples: one for words, the other for characters.
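Those listings aren't reproduced in this excerpt, but a word-level toy example along the same lines looks like the following sketch (hand-rolled, rather than Keras's built-in Tokenizer).

"""A word-level one-hot encoding toy example in the spirit of the listings
referenced above (a sketch, not the book's exact code)."""
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# build a word index, reserving index 0 (never assigned to a word)
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 10   # only consider the first max_length words of each sample
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.   # set the single nonzero entry for this token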