chapter two

2 Review of hash hables and modern hashing

This chapter covers:

Reviewing dictionaries and why hashing is ubiquitous in modern systems
Refreshing some basic collision-resolution techniques: theory and real-life implementations
Exploring cache-efficiency in hash tables
Using hash tables for distributed systems and consistent hashing
Learning how consistent hashing works in P2P networks: use case of Chord

We begin with the topic of hashing for a number of reasons. First, classical hash tables have proved irreplaceable in modern systems, deeming it harder to find a system that does not use them than the one that does. Second, recently there has been a lot of innovative work addressing algorithmic issues that arise as hash tables grow to fit massive data, such as efficient resizing, compact representation and space-saving tricks, etc. In a similar vein, hashing has over time been adapted to serve in massive peer-to-peer systems where the hash table is split among servers; here, the key challenge is assignment of resources to servers and load-balancing of resources as servers dynamically join and leave the network. Lastly, we begin with hashing because it forms the backbone of all succinct data structures we present in Part I of the book.

2.1 Ubiquitous hashing

2.2 A crash course on data structures

2.3 Usage scenarios in modern systems

2.3.1 Deduplication in backup/storage solutions

2.3.2 Plagiarism detection with MOSS and Rabin-Karp fingerprinting

2.4 O(1) --- what’s the big deal?

2.5 Collision Resolution: theory vs. practice

2.6 Usage scenario: How Python’s dict does it

2.7 MurmurHash

2.8 Hash Tables for Distributed Systems: Consistent Hashing

2.8.1 A typical hashing problem?

2.8.2 Hashring

2.8.3 Lookup

2.8.4 Adding a new node/resource

2.8.5 Removing a node

2.8.6 Consistent hashing scenario: Chord

2.9 Summary