2 Review of hash tables and modern hashing

 

This chapter covers:

  • Reviewing dictionaries and why hashing is ubiquitous in modern systems
  • Refreshing some basic collision-resolution techniques: theory and real-life implementations
  • Exploring cache-efficiency in hash tables
  • Using hash tables for distributed systems and consistent hashing
  • Learning how consistent hashing works in P2P networks: use case of Chord

We begin with the topic of hashing for a number of reasons. First, classical hash tables have proved irreplaceable in modern systems, deeming it harder to find a system that does not use them than the one that does. Second, recently there has been a lot of innovative work addressing algorithmic issues that arise as hash tables grow to fit massive data, such as efficient resizing, compact representation and space-saving tricks, etc. In a similar vein, hashing has over time been adapted to serve in massive peer-to-peer systems where the hash table is split among servers; here, the key challenge is assignment of resources to servers and load-balancing of resources as servers dynamically join and leave the network. Lastly, we begin with hashing because it forms the backbone of all succinct data structures we present in Part I of the book.

2.1    Ubiquitous hashing

2.2    A crash course on data structures

2.3    Usage scenarios in modern systems

2.3.1   Deduplication in backup/storage solutions

2.3.2   Plagiarism detection with MOSS and Rabin-Karp fingerprinting

2.4    O(1) --- what’s the big deal?

2.5    Collision Resolution: theory vs. practice

2.6    Usage scenario: How Python’s dict does it

2.7    MurmurHash

2.8    Hash Tables for Distributed Systems: Consistent Hashing

2.8.1   A typical hashing problem?

2.8.2   Hashring

2.8.3   Lookup

2.8.4   Adding a new node/resource

2.8.5   Removing a node

2.8.6   Consistent hashing scenario: Chord

2.9    Summary