concept large dataset in category algorithms
appears as: large dataset, large datasets

This is an excerpt from Manning's book Algorithms and Data Structures for Massive Datasets MEAP V01.
Figure 1.1: In this example, we build a (comment-id, frequency) hash table to help us eliminate duplicate comments. For example, the comment identified by comment-id 36457 occurs 6 times in the dataset. We also build “keyword” hash tables, where, for each keyword of interest, we count how many times that keyword is mentioned in the comments of a particular article. For example, the word ‘science’ is mentioned 21 times in the comments of the article identified by article-id 8999. For a large dataset of 3 billion comments, storing all these data structures can easily require anywhere from tens of gigabytes to a hundred gigabytes of RAM.
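To make the bookkeeping in Figure 1.1 concrete, here is a minimal sketch of the two kinds of hash tables it describes, assuming comments arrive as (comment_id, article_id, text) records; the record layout, keyword list, and sample data are hypothetical and only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical keywords of interest.
KEYWORDS = {"science", "politics", "climate"}

# Hypothetical sample records: (comment_id, article_id, text).
comments = [
    (36457, 8999, "I love science and more science"),
    (36457, 8999, "I love science and more science"),  # duplicate comment
    (50211, 8999, "politics again"),
]

# (comment-id, frequency) hash table: any comment-id with a count > 1
# is a duplicate we can eliminate.
comment_frequency = Counter(comment_id for comment_id, _, _ in comments)

# "Keyword" hash tables: for each keyword of interest, count how many
# times it is mentioned in the comments of each article.
keyword_counts = {kw: defaultdict(int) for kw in KEYWORDS}
for _, article_id, text in comments:
    for word in text.lower().split():
        if word in KEYWORDS:
            keyword_counts[word][article_id] += 1

print(comment_frequency[36457])         # 2 -> this comment is duplicated
print(keyword_counts["science"][8999])  # 'science' mentions for article 8999
```

Holding exact counters like these for 3 billion comments is what pushes memory consumption into the tens to hundreds of gigabytes mentioned in the caption.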