9 K-d trees: Multidimensional data indexing


This chapter covers

  • Indexing a 2-D (and in general k-D) dataset efficiently
  • Implementing nearest neighbor search with k-d trees
  • Discussing k-d trees’ strengths and flaws

This chapter is structured slightly differently from the rest of the book, because it continues a discussion started in chapter 8. There we introduced a problem: searching multidimensional data for the nearest neighbor(s) of a generic point (possibly not in the dataset itself).

In this chapter, we follow up on those topics, so we won’t introduce a new problem; instead, we pick up the “closest hub” example from chapter 8 and show a different way to solve it, using k-d trees.

9.1 Right where we left off

Let’s recap where we left off in previous chapters. We are designing software for an e-commerce company: an application that, for any point on a very large map, finds the closest warehouse selling a given product. See figure 9.1 for a visualization. To get a ballpark idea of the scale involved, we want to serve millions of clients per day across the country, shipping products from thousands of warehouses, also spread across the map.

In section 8.2, we established that a brute-force solution, where we skim through the whole list of points and compare distances, can’t work for a live system.
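As a reminder, the brute-force approach boils down to a linear scan over all points. The sketch below is illustrative, not the book’s own code; names like `closest_warehouse` and the tuple representation of points are assumptions made for this example:

```python
import math

def closest_warehouse(warehouses, client):
    """Brute-force nearest neighbor search.

    `warehouses` is a list of (x, y) tuples, `client` is an (x, y) tuple.
    Every query costs O(n) distance computations over the whole dataset.
    """
    best, best_dist = None, math.inf
    for w in warehouses:
        d = math.dist(w, client)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best, best_dist = w, d
    return best

# With millions of queries per day against thousands of warehouses,
# paying a full linear scan on every query is too slow for a live service.
```

This is exactly the cost that an index such as a k-d tree is meant to avoid, by pruning most of the points from consideration on each query.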

We have also seen how the multidimensional structure of the data prevents us from using the basic solutions we saw in the first part of the book, from heaps to hash maps.

9.2 Moving to k-D spaces: Cycle through dimensions

9.2.1 Constructing the BST

9.2.2 Invariants

9.2.3 The importance of being balanced

9.3 Methods

9.3.1 Search

9.3.2 Insert

9.3.3 Balanced tree

9.3.4 Remove

9.3.5 Nearest neighbor

9.3.6 Region search

9.3.7 A recap of all methods

9.4 Limits and possible improvements

Summary
