chapter eleven

11 Named Entity Recognition

This chapter covers

Introduction to the task of Named Entity Recognition (NER)
Overview of sequence labelling approaches in NLP using NER as an example
Integration of NER into downstream tasks
Introduction to further data preprocessing tools and techniques (pandas)

Previous chapters surveyed a number of NLP tasks, from binary classification tasks, such as author identification and sentiment analysis, to multi-class classification tasks, such as topic analysis. These applications deployed machine learning models and relied on a range of linguistic features, most often related to words or word characteristics. While individual words do carry information that is useful in many NLP applications, the information-bearing unit is often larger than a single word. In chapter 4, you looked into the task of Information Extraction. As a reminder, this task allows you to extract facts and relevant information from otherwise unstructured data, such as raw unprocessed text. As we discussed in chapter 4, this task is instrumental in a number of applications, from information management to database completion to question answering. For instance, suppose you have a collection of texts on various personalities, including the Wikipedia article on Albert Einstein.^[1] Figure 11.1 shows a sentence from this article:

11.1 Named Entity Recognition: Definitions and Challenges

11.1.1 Named Entity Types

11.1.2 Challenges in Named Entity Recognition

11.2 Named Entity Recognition as a Sequence Labelling Task

11.2.1 The Basics: BIO Scheme

11.2.2 What does it Mean for a Task to be Sequential?

11.2.3 Sequential Solution for NER

11.3 Practical Applications of NER

11.3.1 Data Loading and Exploration

11.3.2 Named Entity Types Exploration with spaCy

11.3.3 Information Extraction Revisited

11.3.4 Named Entities Visualization

11.4 Summary

11.5 Conclusions

11.6 Solutions to exercises