11 Autocomplete/typeahead

This chapter covers

Comparing autocomplete with search
Separating data collection and processing from querying
Processing a continuous data stream
Dividing a large aggregation pipeline into stages to reduce storage costs
Employing the byproducts of data processing pipelines for other purposes

We wish to design an autocomplete system. Autocomplete is a useful question to test a candidate’s ability to design a distributed system that continuously ingests and processes large amounts of data into a small (few MBs) data structure that users can query for a specific purpose. An autocomplete system obtains its data from strings submitted by up to billions of users and then processes this data into a weighted trie. When a user types in a string, the weighted trie provides them with autocomplete suggestions. We can also add personalization and machine learning elements to our autocomplete system.

11.1 Possible uses of autocomplete

We first discuss and clarify the intended use cases of this system to ensure we determine the appropriate requirements. Possible uses of autocomplete include:

11.2 Search vs. autocomplete

11.3 Functional requirements

11.3.1 Scope of our autocomplete service

11.3.2 Some UX details

11.3.3 Considering search history

11.3.4 Content moderation and fairness

11.4 Non-functional requirements

11.5 Planning the high-level architecture

11.6 Weighted trie approach and initial high-level architecture

11.7 Detailed implementation

11.7.1 Each step should be an independent task

11.7.2 Fetch relevant logs from Elasticsearch to HDFS

11.7.3 Split the search strings into words and other simple operations

11.7.4 Filter out inappropriate words

11.7.5 Fuzzy matching and spelling correction

11.7.6 Count the words

11.7.7 Filter for appropriate words

11.7.8 Managing new popular unknown words

11.7.9 Generate and deliver the weighted trie

11.8 Sampling approach