Appendix F. Working with large datasets

R holds all of its objects in virtual memory. For most of us, this design decision has led to a zippy interactive experience, but for analysts working with large datasets, it can lead to slow program execution and memory-related errors.

Memory limits depend primarily on the R build (32- versus 64-bit) and the OS version involved. Error messages starting with “cannot allocate vector of size” typically indicate a failure to obtain sufficient contiguous memory, whereas error messages starting with “cannot allocate vector of length” indicate that an address limit has been exceeded. When working with large datasets, try to use a 64-bit build if at all possible. See ?Memory for more information.
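If you're not sure which build you're running, or how much memory a given object consumes, base R can tell you. The short sketch below uses only base functions; the vector x is an illustrative object, not one drawn from a particular analysis.

.Machine$sizeof.pointer                # 8 on a 64-bit build, 4 on a 32-bit build
R.version$arch                         # architecture string, e.g., "x86_64"

x <- rnorm(1e6)                        # one million doubles, roughly 8 MB
print(object.size(x), units = "MB")    # report x's approximate memory footprint

rm(x)                                  # delete the object when it's no longer needed
gc()                                   # and prompt R to return the freed memory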

There are three issues to consider when working with large datasets: efficient programming to speed execution, storing data externally to limit memory issues, and using specialized statistical routines designed to efficiently analyze massive amounts of data. First we’ll consider simple solutions for each. Then we’ll turn to more comprehensive (and complex) solutions for working with big data.

F.1. Efficient programming

A number of programming tips can help you improve performance when working with large datasets.
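To give a flavor of the kinds of tips this section covers, the sketch below compares three ways of computing the same result (a growing loop, a preallocated loop, and a vectorized call) and shows how declaring column classes can speed up file import. The file mydata.csv and its column types are hypothetical.

n <- 100000

system.time({                          # slow: the vector is regrown at every iteration
  a <- numeric(0)
  for (i in 1:n) a <- c(a, sqrt(i))
})

system.time({                          # faster: preallocate the result, then fill it in
  b <- numeric(n)
  for (i in 1:n) b[i] <- sqrt(i)
})

system.time(d <- sqrt(1:n))            # fastest: vectorize the whole computation

# Declaring column classes lets read.table() skip type inference on large files.
# The file name and column types here are hypothetical.
# mydata <- read.table("mydata.csv", header = TRUE, sep = ",",
#                      colClasses = c("numeric", "character", "integer"))

On a typical machine the growing loop is by far the slowest of the three; preallocation and, especially, vectorization are standard first steps when speeding up R code.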

F.2. Storing data outside of RAM

F.3. Analytic packages for out-of-memory data

F.4. Comprehensive solutions for working with enormous datasets