chapter thirty three

33 The movies dataset

In this capstone, you will

Define ordered sequences of elements
Transform and count the items of a list
Find the minimum and maximum elements according to specific features
Filter items based on their characteristics and select them based on their position
Sort lists and produce string representations of them

In this capstone, you’ll analyze data for more than 45,000 movies. The information is a subset of a popular and publicly accessible dataset called “The Movies Dataset” by Rounak Banik. On its website, you can find its latest version as well as an extensive description of its content:

These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1–5 and have been obtained from the official GroupLens website.

From https://www.kaggle.com/rounakbanik/the-movies-dataset

For this capstone, you’ll focus on a subset of its data contained in a file called movies_metadata.csv. Each row describes one movie, including its title, language, release date, vote average, and popularity. Table 33.1 lists the properties you’ll consider for this capstone.
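To make these goals concrete before you touch the real file, here is a minimal sketch of the kinds of list operations the capstone exercises. It is written in Python purely for illustration and is not the project's implementation. The `Movie` record, its field names, and the sample values are all assumptions: the fields mirror the properties named above, and the records are invented placeholders rather than rows from movies_metadata.csv.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Movie:
    # Hypothetical record: field names are assumptions based on the
    # properties named in the text, not the CSV's actual column headers.
    title: str
    language: str
    release_date: date
    vote_average: float
    popularity: float


# Invented placeholder records, not data from the real dataset.
movies = [
    Movie("Sample A", "en", date(1995, 1, 1), 7.7, 21.9),
    Movie("Sample B", "fr", date(2001, 1, 1), 7.8, 12.3),
    Movie("Sample C", "en", date(2010, 1, 1), 8.1, 29.1),
    Movie("Sample D", "ja", date(1988, 1, 1), 8.0, 10.5),
]

# Transform the items of a list, and count those matching a condition.
titles = [m.title for m in movies]
english_count = sum(1 for m in movies if m.language == "en")

# Find the minimum and maximum according to a specific feature.
oldest = min(movies, key=lambda m: m.release_date)
top_rated = max(movies, key=lambda m: m.vote_average)

# Filter by a characteristic, then select by position.
english = [m for m in movies if m.language == "en"]
first_two = movies[:2]

# Sort by popularity (highest first) and build a string representation.
by_popularity = sorted(movies, key=lambda m: m.popularity, reverse=True)
report = "\n".join(f"{m.title} ({m.popularity})" for m in by_popularity)
print(report)
```

Each step maps onto one of the goals listed at the start of the capstone: an ordered sequence, a transformation and a count, a minimum and a maximum, a filter and a positional selection, and finally a sort followed by a printable summary.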
