chapter ten

10 Similarity Search Trees:Approximate Nearest Neighbors Search for Image Retrieval

 

This chapter covers

  • Discussing the limits of k-d trees
  • Describing image retrieval as a use case where k-d trees would struggle
  • Introducing a new data structure, the R-tree
  • Presenting SS-trees, a scalable variant of R-trees
  • Comparing SS-trees and k-d trees
  • Introducing approximate similarity search

This chapter will be structured slightly differently from our book’s standard, simply because we will continue here a discussion started in chapter 8. Back there, we introduced a problem, searching multidimensional data for the nearest neighbor(s) of a generic point (possibly not in the dataset itself); in chapter 9, we introduce k-d trees, a data structure specifically invented to solve this problem.

K-d trees are the best solution to date for indexing low to medium dimensional datasets that will completely fit in memory; when, however, either we have to operate on high-dimensional data, or with big datasets that won’t fit in memory, then k-d trees are not enough, we will need to use more advanced data structures.

In this chapter we first present a new problem, one that will push our indexing data-structure beyond its limits, and then introduce two new data structures, R-trees and SS-trees, that can help us solve this category of problems efficiently.

10.1   Right Where We Left

10.1.1   A New (More Complex) Example

10.1.2   Overcoming k-d trees Flaws

10.2   R-tree

10.2.1   A step back: Introducing B-trees

10.2.2   From B-Tree to R-tree

10.2.3   Inserting Points in an R-tree

10.2.4   Search

10.3   Similarity Search Tree

10.3.1   SS-tree Search

10.3.2   Insert

10.3.3   Insertion: Variance, Means, and Projections

10.3.4   Insertion: Split Nodes

10.3.5   Delete

10.4   Similarity Search

10.4.1   Nearest Neighbor Search

10.4.2   Region Search

10.4.3   Approximated Similarity Search

10.5   Ss+-Tree[70]

10.5.1   Are Ss-trees Better?

10.5.2   Mitigating Hyper-Spheres Limitations

10.5.3   Improved Split Heuristic

10.5.4   Reducing Overlap

10.6   Summary