13 Project: Finding the fastest way to get around NYC

This chapter covers

  • Setting up an Airflow pipeline from scratch
  • Structuring intermediate output data
  • Developing idempotent tasks
  • Implementing one operator to handle multiple similar transformations

By now, we’ve discussed most of the ins and outs of using Airflow, and you’re well on your way to becoming an Airflow expert. It’s time for you to make good use of all that knowledge and see how to apply your new skills to a real-life use case.

13.1 Use case: Investigating traffic in New York City

Transportation in New York City (NYC) can be hectic. It’s always rush hour. Fortunately, more alternative modes of transportation are available than ever. In May 2013, Citi Bike started operating in New York City with 6,000 bikes. Over the years, Citi Bike has grown and expanded, becoming a popular method of transportation in the city.

Another iconic method of transportation is the Yellow Cab taxi. Taxis were introduced in NYC in the late 1890s and have always been popular. In recent years, however, the number of taxi drivers has plummeted, and many drivers have started driving for ride-sharing services such as Uber and Lyft.

Regardless of the type of transportation you choose in NYC, your goal is typically to get from point A to point B as fast as possible. Luckily, the city of New York actively publishes data, including rides from Citi Bike and Yellow Cab taxis.

13.2 Understanding the data

13.2.1 Yellow Cab file share

13.2.2 Citi Bike REST API

13.2.3 Deciding on a plan of approach

13.3 Extracting the data

13.3.1 Downloading Citi Bike data

13.3.2 Downloading Yellow Cab data

13.4 Applying similar transformations to data

13.5 Structuring a data pipeline

13.6 Developing idempotent data pipelines

Summary