
This is an excerpt from Manning's book Real-World Natural Language Processing MEAP V06.
These word-to-ID assignments are usually managed by a look-up table. The entire, finite set of words that an NLP application or task deals with is called its vocabulary. But this method isn’t any better than dealing with raw words. Just because words are now represented by numbers doesn’t mean you can do arithmetic on them and conclude that “cat” (ID 1) is as similar to “dog” (ID 2) as “dog” is to “pizza” (ID 3) simply because the differences between the IDs are equal. Those indices are still discrete and arbitrary.
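A minimal sketch of such a look-up table in Python (the words and IDs here are made up for illustration):

word_to_id = {'cat': 1, 'dog': 2, 'pizza': 3}       # item -> ID
id_to_word = {i: w for w, i in word_to_id.items()}  # ID -> item

# The IDs let a model consume words as numbers, but arithmetic on them is meaningless:
# abs(1 - 2) == abs(2 - 3), yet that says nothing about how similar the words are.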
The second step in many NLP applications is to build the vocabulary. In computer science, a vocabulary is a theoretical concept that represents the set of all possible words in a language. In NLP, though, it usually means just the set of all unique tokens that appear in a dataset; it is simply impossible to know all the possible words in a language, nor is it necessary for an NLP application. What is stored in a vocabulary is called a vocabulary item (or just an item). A vocabulary item is usually a word, although depending on the task at hand it can be any kind of linguistic unit, including characters, character n-grams, and labels for linguistic annotation.
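In this narrow sense, a vocabulary is nothing more than the set of unique tokens seen in the data, as this plain-Python sketch shows (the token list is hypothetical):

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']  # a tiny tokenized dataset
vocabulary = set(tokens)                            # unique tokens = the vocabulary
print(sorted(vocabulary))                           # ['cat', 'mat', 'on', 'sat', 'the']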
AllenNLP provides a class called Vocabulary. It not only stores the vocabulary items that appeared in a dataset but also holds the mappings between vocabulary items and their IDs. As mentioned before, neural networks, and machine learning models in general, can deal only with numbers, so there needs to be a way to map discrete items such as words to numerical representations such as word IDs. The vocabulary is also used to map the results of an NLP model back to the original words and labels so that humans can actually read them.
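Here is a minimal sketch of this two-way mapping using Vocabulary’s public methods (add_token_to_namespace, get_token_index, and get_token_from_index); the token itself is made up:

from allennlp.data.vocabulary import Vocabulary

vocab = Vocabulary()
cat_id = vocab.add_token_to_namespace('cat', namespace='tokens')        # item -> ID
assert vocab.get_token_index('cat', namespace='tokens') == cat_id       # look up the ID again
assert vocab.get_token_from_index(cat_id, namespace='tokens') == 'cat'  # ID -> item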
You can create a Vocabulary object from instances as follows:
vocab = Vocabulary.from_instances(train_dataset + dev_dataset, min_count={'tokens': 3})

There are a couple of things to note here. AllenNLP’s Vocabulary class supports namespaces, a mechanism that keeps different sets of items separate so that they don’t get mixed up. Here’s why they are useful: say you are building a machine translation system, and you just read a dataset that contains English and French translations. Without namespaces, you’d have a single set containing all the words in both English and French. This is usually not a big issue, because English words (“hi”, “thank you”, “language”) and French words (“bonjour”, “au revoir”, “langue”) look quite different in most cases. However, a number of words are spelled exactly the same in both languages. For example, “chat” means “cat” in French, and it’s hard to imagine anybody wanting to conflate those two words and assign them the same ID (and embedding). To avoid this conflict, Vocabulary implements namespaces and assigns separate sets of IDs to items of different types.
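A small sketch of namespaces in action, again using Vocabulary’s public methods (the namespace names en_tokens and fr_tokens are arbitrary choices for this example):

from allennlp.data.vocabulary import Vocabulary

vocab = Vocabulary()
vocab.add_token_to_namespace('chat', namespace='en_tokens')  # English "chat"
vocab.add_token_to_namespace('chat', namespace='fr_tokens')  # French "chat" ("cat")

# Each namespace keeps its own item-to-ID table, so these are independent entries
# even though they are spelled identically. The printed IDs may even coincide
# numerically, but they index different tables, and an embedding layer built per
# namespace would give the two words different vectors.
print(vocab.get_token_index('chat', namespace='en_tokens'))
print(vocab.get_token_index('chat', namespace='fr_tokens'))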
You may have noticed that the from_instances() call has a min_count argument. For each namespace, this specifies the minimum number of occurrences an item needs in the dataset to be included in the vocabulary. All items that appear less frequently than this threshold are treated as “unknown” items. Here’s why this is a good idea: in a typical language, a very small number of words appear a lot (in English: “the”, “a”, “of” …) while a very large number of words appear very infrequently, so word frequencies exhibit a long-tail distribution. These very infrequent words are unlikely to add anything useful to the model, and precisely because they appear so rarely it is difficult to learn any useful patterns from them anyway. Also, because there are so many of them, they inflate the size of the vocabulary and the number of model parameters. A common practice in NLP is therefore to cut this long tail and collapse all the infrequent words into a single entry, <UNK> (for “unknown” words).
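Conceptually, the cut works like the following plain-Python sketch; AllenNLP applies the equivalent filtering internally when you pass min_count (the tokens and threshold here are invented):

from collections import Counter

counts = Counter(['the', 'the', 'the', 'cat', 'cat', 'zyzzyva'])
min_count = 2
vocabulary = {w for w, c in counts.items() if c >= min_count}  # drop the long tail

# Any token below the threshold collapses into the single entry <UNK>:
mapped = [w if w in vocabulary else '<UNK>' for w in ['the', 'cat', 'zyzzyva']]
print(mapped)  # ['the', 'cat', '<UNK>']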