8 Nearest neighbors search


This chapter covers

  • Finding closest points in a multidimensional dataset
  • Introducing data structures to index multidimensional space
  • Understanding the issues with indexing high-dimensional spaces
  • Introducing efficient nearest neighbor search

So far in this book we have worked with containers holding unidimensional data: the entries we stored in queues, trees, and hash tables were always assumed to be (or to be translatable to) numbers—simple values that could be compared in the most intuitive mathematical sense.
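To see concretely what this assumption buys us, and what breaks without it, consider the following sketch. Numbers have a single natural total order, so sorting and binary search "just work"; for 2-D points there is no single intuitive order, only arbitrary choices (the points and orderings below are illustrative, not from the book):

```python
# Unidimensional entries have one natural ordering:
values = [7, 2, 9, 4]
print(sorted(values))  # [2, 4, 7, 9]

# 2-D points do not: each ordering below is equally "valid",
# and neither captures proximity in the plane.
points = [(3, 1), (1, 4), (2, 2)]
print(sorted(points, key=lambda p: p[0]))  # order by x-coordinate
print(sorted(points, key=lambda p: p[1]))  # order by y-coordinate
```

This is exactly why the containers from part 1 cannot be reused as-is for multidimensional data.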

In this chapter, we will see how this simplification doesn’t always hold true for real datasets, and we’ll examine the issues involved in handling more complex, multidimensional data. Do not despair, though: in the next chapters we will also describe data structures that can handle this kind of data, and we’ll see real applications that leverage efficient nearest neighbor search as part of their workflow, such as clustering.
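As a baseline for the problem this chapter tackles, here is the obvious brute-force solution: scan every point and keep the closest one. This is a minimal sketch (the function name and sample points are ours, not the book's); the data structures introduced in the coming chapters aim to answer the same query without examining every point:

```python
from math import dist  # Euclidean distance, available since Python 3.8


def nearest_neighbor(points, target):
    """Brute-force nearest neighbor search: computes the distance from
    target to every point, so it costs O(n) distance computations."""
    return min(points, key=lambda p: dist(p, target))


points = [(1.0, 2.0), (4.0, 0.5), (3.0, 3.0)]
print(nearest_neighbor(points, (3.5, 0.0)))  # → (4.0, 0.5)
```

The linear scan is correct but slow on large datasets, which is what motivates the indexing structures discussed in part 2.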

As we’ll see, the discussion of this area is particularly rich, and there is no way we can cover it, not even just its crucial parts, in a single chapter. Therefore, while in part 1 each chapter followed the same pattern to explain its topic, here we’ll stretch that pattern across the whole of part 2, with each chapter covering only a single piece of our usual discussion:

8.1 The nearest neighbors search problem

8.2 Solutions

8.2.1 First attempts

8.2.2 Sometimes caching is not the answer

8.2.3 Simplifying things to get a hint

8.2.4 Carefully choose a data structure

8.3 Description and API

8.4 Moving to k-dimensional spaces

8.4.1 Unidimensional binary search

8.4.2 Moving to higher dimensions

8.4.3 Modeling 2-D partitions with a data structure

Summary