concept DataFrame in category dask

This is an excerpt from Manning's book Data Science with Python and Dask.

If you’re an experienced Pandas user, listing 2.1 will look very familiar. In fact, it is syntactically equivalent to the Pandas version! For simplicity’s sake, I’ve unzipped the data into the same folder as the Python notebook I’m working in. If you put your data elsewhere, you will need to either supply the correct path or use os.chdir to change your working directory to the folder that contains your data. Inspecting the DataFrame we just created yields the output shown in figure 2.2.

Figure 2.2 Inspecting the Dask DataFrame


The output of listing 2.1 might not be what you expected. While Pandas would display a sample of the data, inspecting a Dask DataFrame shows the DataFrame’s metadata instead. The column names are along the top, and underneath is each column’s respective datatype.

Dask tries very hard to intelligently infer datatypes from the data, just as Pandas does. But its ability to do so accurately is limited by the fact that Dask was built to handle medium and large datasets that can’t be loaded into RAM at once. Since Pandas performs operations entirely in memory, it can quickly and easily scan the entire DataFrame to find the best datatype for each column. Dask, on the other hand, must work just as well with local datasets as with large datasets that could be scattered across multiple physical machines in a distributed filesystem. Therefore, Dask DataFrames profile and infer datatypes from a small sample of the data rather than from the full dataset. This works fine if data anomalies, such as letters appearing in a numeric column, are widespread. However, if there’s a single anomalous row among millions or billions of rows, it’s very improbable that the anomalous row will land in the sample. Dask will then pick an incompatible datatype, which will cause errors later on when performing computations.

A best practice to avoid that situation is to explicitly set datatypes rather than relying on Dask’s inference process. Even better, storing data in a binary file format that supports explicit data types, such as Parquet, avoids the issue altogether and brings some additional performance gains as well. We will return to this issue in a later chapter, but for now we will let Dask infer datatypes.

Listing 5.1 should look very familiar. With the first couple of lines we’re importing the modules we’ll need for the chapter. Next, we’re loading the schema dictionary we created in chapter 4. Finally, we create a DataFrame called nyc_data_raw by reading the four CSV files, applying the schema, and selecting the columns that we defined in the schema (usecols=dtypes.keys()). Now we’re ready to go!

In chapter 3 you learned that Dask DataFrames have three structural elements: an index and two axes (rows and columns). To refresh your memory, figure 5.2 shows a visual guide to the structure of a DataFrame.

Figure 5.2 The structure of a DataFrame

Listing 5.2 Selecting a single column from a DataFrame
with ProgressBar():
    display(nyc_data_raw['Plate ID'].head())

# Produces the following output:
# 0    GBB9093
# 1    62416MB
# 2    78755JZ
# 3    63009MA
# 4    91648MC
# Name: Plate ID, dtype: object

You’ve already seen a few times that the head method retrieves the first n rows of a DataFrame, but in those examples we retrieved the first n rows of the entire DataFrame. In listing 5.2 we’ve put a pair of square brackets ([…]) to the right of nyc_data_raw, and inside those square brackets we specified the name of one of the DataFrame’s columns (Plate ID). The column selector accepts either a string or a list of strings and returns only the requested columns. In this particular case, since we passed a single column name as a string, what we get back is not another DataFrame. Instead, we get back a Series object, which is like a DataFrame that doesn’t have a column axis. You can see that, like a DataFrame, a Series object has an index, which is actually copied over from the DataFrame. Oftentimes when selecting columns, however, you’ll want to bring back more than one. Listing 5.3 demonstrates how to select more than one column from a DataFrame, and figure 5.3 shows the output of the listing.

Listing 5.3 Selecting multiple columns from a DataFrame using an inline list
with ProgressBar():
    print(nyc_data_raw[['Plate ID', 'Registration State']].head())