Chapter 21 Processing data files

 

This chapter covers:

    Using ETL (extract-transform-load)

    Reading text data files (plain text and CSV)

    Reading spreadsheet files

    Normalizing, cleaning, and sorting data

    Writing data files

    Much of the data you'll work with is stored in text files. It ranges from unstructured text, such as a corpus of tweets or literary texts, to more structured data in which each row is a record and the fields are delimited by a special character, such as a comma, a tab, or a pipe (|). Text files can be huge; a data set can be spread over tens or even hundreds of files, and the data in it can be incomplete or horribly dirty. Because text files are so common and so varied, it’s almost inevitable that you’ll need to read and use data from them. This chapter gives you strategies for using Python to do exactly that.
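As a small preview of the delimited format described above, here is a minimal sketch that reads pipe-delimited records with the standard library's csv module. The sample data and the field names (city, state, temp_f) are made up for illustration, and an in-memory io.StringIO object stands in for a real file.

```python
import csv
import io

# Hypothetical sample data: a header row plus two pipe-delimited records.
sample = io.StringIO(
    "city|state|temp_f\n"
    "Chicago|IL|41\n"
    "Austin|TX|78\n"
)

# csv.reader splits each row on the delimiter we specify.
reader = csv.reader(sample, delimiter="|")
header = next(reader)  # first row holds the field names

# Pair each row's values with the header to get one dict per record.
records = [dict(zip(header, row)) for row in reader]

for record in records:
    print(record["city"], record["temp_f"])
```

Every field comes back as a string, so numeric values such as temp_f still need converting before you can do arithmetic on them.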

    21.1 Welcome to ETL

    21.2 Reading text files

    21.2.1 Text encoding: ASCII, Unicode, and others

    21.2.2 Unstructured text

    21.2.3 Delimited flat files

    21.2.4 The csv module

    21.2.5 Reading a csv file as a list of dictionaries

    21.3 Excel files

    21.4 Data cleaning

    21.4.1 Cleaning

    21.4.2 Sorting

    21.4.3 Data cleaning issues and pitfalls

    21.5 Writing data files

    21.5.1 CSV and other delimited files

    21.5.2 Writing Excel files

    21.5.3 Packaging data files

    Summary