16 File operations for a parallel world
This chapter covers
- Modifying a parallel application for standard file operations
- Writing out data using parallel file operations with MPI-IO and HDF5
- Tuning parallel file operations for different parallel file systems
File systems provide a streamlined workflow for retrieving, storing, and updating data. For any computing work, the product is the output, whether it be data, graphics, or statistics. This includes final results, but also intermediate output for graphics, checkpointing, and analysis. Checkpointing is a special need on large HPC systems with long-running calculations that might span days, weeks, or months.
Definition: Checkpointing is the practice of periodically storing the state of a calculation to disk so that the calculation can be restarted after a system failure or after a batch system's run-time limit ends the job.
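To make the definition concrete, here is a minimal serial sketch of checkpoint and restart using only standard C file operations. The `SimState` struct, the function names, and the file layout are hypothetical, chosen for illustration. The sketch also assumes a POSIX system, where `rename` replaces an existing file atomically, so a failure in the middle of a write leaves the previous checkpoint intact.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical calculation state: an iteration counter plus a data array. */
typedef struct {
    int iteration;
    size_t ncells;
    double *data;
} SimState;

/* Write the state to a temporary file, then rename it over the checkpoint.
   Assumes POSIX rename semantics (atomic replacement of an existing file).
   Returns 0 on success, -1 on failure. */
int write_checkpoint(const char *filename, const SimState *state)
{
    char tmpname[512];
    int n = snprintf(tmpname, sizeof tmpname, "%s.tmp", filename);
    if (n < 0 || (size_t)n >= sizeof tmpname) return -1;

    FILE *fp = fopen(tmpname, "wb");
    if (fp == NULL) return -1;

    /* Header (iteration, ncells) followed by the raw array values. */
    int ok = fwrite(&state->iteration, sizeof state->iteration, 1, fp) == 1
          && fwrite(&state->ncells, sizeof state->ncells, 1, fp) == 1
          && fwrite(state->data, sizeof(double), state->ncells, fp)
                 == state->ncells;

    if (fclose(fp) != 0) ok = 0;   /* fclose flushes; a failure means data loss */
    if (!ok) {
        remove(tmpname);
        return -1;
    }
    return rename(tmpname, filename) == 0 ? 0 : -1;
}

/* Read a checkpoint written by write_checkpoint. On success, state->data is
   newly allocated and the caller must free it. Returns 0 on success, -1 if
   the file is missing, truncated, or empty (the caller then starts fresh). */
int read_checkpoint(const char *filename, SimState *state)
{
    FILE *fp = fopen(filename, "rb");
    if (fp == NULL) return -1;

    int iteration;
    size_t ncells;
    if (fread(&iteration, sizeof iteration, 1, fp) != 1 ||
        fread(&ncells, sizeof ncells, 1, fp) != 1 || ncells == 0) {
        fclose(fp);
        return -1;
    }

    /* calloc checks the ncells * sizeof(double) multiplication for overflow. */
    double *data = calloc(ncells, sizeof(double));
    if (data == NULL) {
        fclose(fp);
        return -1;
    }
    if (fread(data, sizeof(double), ncells, fp) != ncells) {
        free(data);
        fclose(fp);
        return -1;
    }
    fclose(fp);

    state->iteration = iteration;
    state->ncells = ncells;
    state->data = data;
    return 0;
}
```

A time-stepping code would typically call `read_checkpoint` once at startup, fall back to its normal initialization if that call fails, and call `write_checkpoint` every so many iterations. This version writes raw binary from a single process, so the file is only portable between machines with the same data sizes and byte order.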
When processing data for highly parallel applications, there needs to be a safe and performant way of reading and storing data at runtime. Therein lies the need to understand file operations in a parallel world. Some of the concerns you should keep in mind are correctness, reducing duplicate output, and performance.