
This is an excerpt from Manning's book Data Science with Python and Dask.
Finally, since a Dask DataFrame is made up of many Pandas DataFrames, operations that are inefficient in Pandas will also be inefficient in Dask. For example, iterating over rows by using the apply and iterrows methods is notoriously inefficient in Pandas. Therefore, following Pandas best practices will give you the best performance possible when using Dask DataFrames. If you're not well on your way to mastering Pandas yet, continuing to sharpen your skills will not only benefit you as you get more familiar with Dask and distributed workloads, but it will also help you in general as a data scientist!
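To make the performance gap concrete, here's a small sketch in plain Pandas comparing row-by-row iteration with a vectorized column operation. The DataFrame and column names are invented for illustration; on real data the vectorized form is typically orders of magnitude faster:

```python
import pandas as pd

# Toy data; the column names are made up for this example
df = pd.DataFrame({"fine": [65, 115, 50], "surcharge": [10, 10, 10]})

# Slow: iterating row by row with iterrows
totals_slow = []
for _, row in df.iterrows():
    totals_slow.append(row["fine"] + row["surcharge"])

# Fast: a single vectorized operation over whole columns
totals_fast = df["fine"] + df["surcharge"]

print(totals_slow)            # [75, 125, 60]
print(totals_fast.tolist())   # [75, 125, 60]
```

Both produce the same result, but the vectorized version executes in optimized C code inside Pandas rather than in a Python loop, which is why the same habit pays off when each partition of a Dask DataFrame is processed.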
While the methods for recoding values that you learned in the previous section are very useful, and you're likely to use them often, it's also good to know how to create new columns that are derived from other existing columns in the DataFrame. One scenario that comes up often with structured data, like our NYC Parking Ticket dataset, is the need to parse and work with date/time dimensions. Back in chapter 4, when we constructed our schema for the dataset, we opted to import the date columns as strings. However, to properly use dates in our analyses, we need to convert those strings to datetime objects. Dask gives you the ability to automatically parse dates when reading data, but it can be finicky with formatting. An alternative approach that gives you more control over how the dates are parsed is to import the date columns as strings and manually parse them as part of your data prep workflow. In this section, we'll learn how to use the apply method on DataFrames to apply generic functions to our data and create derived columns. More specifically, we'll parse the Issue Date column, which represents the date the parking citation was issued, and convert that column to a datetime datatype. We'll then create a new column containing the month and year that the citation was issued, which we'll use again later in the chapter. With that in mind, let's get to it!

Listing 5.23 Parsing the Issue Date column
    from datetime import datetime

    issue_date_parsed = nyc_data_recode_stage6['Issue Date'].apply(
        lambda x: datetime.strptime(x, "%m/%d/%Y"), meta=datetime)
    nyc_data_derived_stage1 = nyc_data_recode_stage6.drop('Issue Date', axis=1)
    nyc_data_derived_stage2 = nyc_data_derived_stage1.assign(IssueDate=issue_date_parsed)
    nyc_data_derived_stage3 = nyc_data_derived_stage2.rename(columns={'IssueDate':'Issue Date'})

In listing 5.23, we first import the datetime object from Python's standard library. Then, as you've seen in a few previous examples, we create a new Series object by selecting the Issue Date series from our DataFrame (nyc_data_recode_stage6) and use the apply method to perform the transformation. In this particular call to apply, we create an anonymous (lambda) function that takes a value from the input Series, runs it through the datetime.strptime function, and returns a parsed datetime object. The datetime.strptime function simply takes a string as input and parses it into a datetime object using the specified format. The format we specified, "%m/%d/%Y", is equivalent to an mm/dd/yyyy date. The last thing to note about the apply method is the meta argument we had to specify. Dask tries to infer the output type of the function passed into it, but it's better to explicitly specify the datatype. In this case, datatype inference would fail, so we're required to pass an explicit datetime datatype. The next three lines of code should be very familiar by now: drop, assign, rename, the pattern we learned before to add a column to our DataFrame. Let's take a look at what happened.
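To see how strptime behaves on its own, here's a quick standalone sketch (the sample date is made up for illustration):

```python
from datetime import datetime

# "%m/%d/%Y" matches dates written as mm/dd/yyyy,
# the format used by the Issue Date column
parsed = datetime.strptime("06/14/2017", "%m/%d/%Y")

print(parsed)  # 2017-06-14 00:00:00
```

If a string doesn't match the format, strptime raises a ValueError, which is one reason explicit parsing gives you more control than automatic date inference.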
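The same drop/assign/rename pattern can be sketched in plain Pandas, which is useful for testing your transformation logic before running it on the full Dask DataFrame. The toy data and column values below are invented for illustration (here pd.to_datetime stands in for the apply-based parsing, since plain Pandas needs no meta argument):

```python
import pandas as pd

# Toy stand-in for the NYC data; values are illustrative
df = pd.DataFrame({"Issue Date": ["06/14/2017", "11/02/2016"]})

# Parse the string column into datetimes
parsed = pd.to_datetime(df["Issue Date"], format="%m/%d/%Y")

# drop -> assign -> rename: swap the original column for the derived one
stage1 = df.drop("Issue Date", axis=1)
stage2 = stage1.assign(IssueDate=parsed)
stage3 = stage2.rename(columns={"IssueDate": "Issue Date"})

print(stage3["Issue Date"].dtype)  # datetime64[ns]
```

Because assign aligns the new Series on the DataFrame's index, each parsed date lands back on the row it came from.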