chapter five

5 What’s up with you, Directed Acyclic Word Graph?

 

This chapter covers

  • Several techniques for storing a large list of words and searching it by prefix
  • Costs and benefits of hash sets, sorted lists, prefix trees (tries) and word graphs (DAWGs)
  • Algorithms for building tries and DAWGs when given a word list in sorted order
  • The connection between DAWGs and automata theory

My day improved only slightly after the disastrous first interview I described in the preface. The second interviewer asked me a much better question: given the seven letters on your rack in a turn of Scrabble, what algorithm finds all the “bingos”? That is, how do you list the legal seven letter words made up of exactly those letters? The function should take as its input a string such as “NISATEV” and produce the sequence “NAIVEST”, “NATIVES” and “VAINEST”.

It's not a super hard problem and I was quite pleased with my solution, but I struggled with the follow-up questions about harder word list searching problems. I knew that a prefix tree – a “trie” -- was a suitable data structure for word lists but was fuzzy on the details, and I admitted confusion when the interviewer asked me if I knew what a “dog” was; it turns out he meant a Directed Acyclic Word Graph, DAWG. That was my second NO HIRE of the day.

5.1 Two problems, two weak solutions

5.1.1 Hash sets are a non-solution

5.1.2 Sorted lists are… fine?

5.2 The prefix tree

5.2.1 Nodes and edges

5.2.2 The trie building algorithm

5.2.3 Using a trie as a word list

5.2.4 How much memory are we saving with a trie?

5.3 Tries are not-so-secretly finite state automata

5.4 Building a DAWG

5.4.1 What problems must we solve to build a DAWG?

5.4.2 Equivalence of nodes

5.4.3 The DAWG building algorithm.

5.4.4 How much expense does optimizing the graph as you go add?

5.5 DAWG vs trie for ENABLE

5.6 Summing Up