Part 2. Processing data files
One of the key ingredients in data science projects is, obviously, data.
Much of the time that data will be stored in files. Those files might be found in different places, and they will probably have different formats and different structures. It will be your job to get those files, decide how to extract their data, and combine it in a way that is meaningful for your project. You will also almost certainly need to massage and clean that data in various ways.
The following chapter, “Processing data files,” is from my book, The Quick Python Book, 3rd edition. While the book is intended as an introduction to the whole Python language, for the 3rd edition I chose to use the last five chapters to focus a bit more on how to use Python to handle data.
This chapter introduces several aspects of reading and writing data files, from plain text files and delimited files to more structured formats like JSON and XML, even to spreadsheet files. It also discusses several common scenarios for massaging and cleaning the data extracted from those files, including handling incorrectly encoded files, null bytes, and other common hassles.
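As a small taste of the kind of cleanup described above, here is a minimal sketch (not taken from the chapter itself) of reading a text file that contains an invalid byte and a stray null byte. The helper name read_clean_lines and the sample data are hypothetical, chosen only for illustration; the sketch assumes the file is meant to be UTF-8 and that replacing undecodable bytes is acceptable for your project.

```python
import os
import tempfile


def read_clean_lines(path, encoding="utf-8"):
    """Return the file's lines with undecodable bytes replaced and NUL bytes removed.

    Hypothetical helper for illustration only.
    """
    # errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
    with open(path, encoding=encoding, errors="replace") as infile:
        return [line.replace("\x00", "").rstrip("\n") for line in infile]


# Build a small sample file with a stray NUL byte and an invalid UTF-8 byte.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"id,name\n1,Al\x00ice\n2,Caf\xe9\n")
    sample_path = tmp.name

lines = read_clean_lines(sample_path)
os.remove(sample_path)
print(lines)
```

Whether to replace, ignore, or reject bad bytes depends on the data; replacing them keeps every line readable while leaving a visible marker where something went wrong.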