6 Working with strings

This chapter covers

  • UTF-8 encoding of Julia strings; byte versus character indexing
  • Manipulating strings: interpolation, splitting, using regular expressions, parsing
  • Working with symbols
  • Using the InlineStrings.jl package to work with fixed-width strings
  • Using the PooledArrays.jl package to compress vectors of strings

In this chapter, you will learn how to handle text data in Julia. Text data is stored in strings, which are among the most common data types you will encounter in data science projects, especially those involving natural language processing.

As an application of string processing, we will analyze the genres of movies that were rated by Twitter users. We want to find out which movie genre is most common and how the relative frequency of this genre changes with the movie year.

For this analysis, we will use the movies.dat file, available at http://mng.bz/9Vao. The file is shared in the GitHub repository https://github.com/sidooms/MovieTweetings under an MIT license.

We will analyze the movie genre data according to the following steps, which are described in the subsequent sections of this chapter and depicted in figure 6.1:

  1. Get the data from the web.
  2. Read the data into Julia.
  3. Parse the original data to extract the year and genre list for each analyzed movie.
  4. Create frequency tables to find which movie genre is most common.
  5. Plot the popularity of the most common genre by year.
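
To give a feel for where these steps lead, here is a minimal sketch of steps 3 to 5 that uses only Base Julia. It assumes that each record has the form `id::Title (Year)::Genre1|Genre2`; the four sample lines are invented for illustration and are not taken from movies.dat, and the names `parse_record`, `genre_counts`, and `share_by_year` are hypothetical helpers rather than anything defined by the dataset or a package. Downloading and reading the file (steps 1 and 2) are left out so that the example runs offline.

```julia
# Invented sample records (NOT real data); the layout is an assumption:
# id::Title (Year)::Genre1|Genre2
lines = [
    "0000001::Example Movie A (1999)::Drama|Comedy",
    "0000002::Example Movie B (2005)::Comedy",
    "0000003::Example Movie C (2005)::Action|Drama",
    "0000004::Example Movie D (2005)::Drama",
]

# Step 3: extract the year and the genre list from one record.
function parse_record(line::AbstractString)
    parts = split(line, "::")
    # The year is assumed to be the four digits in parentheses ending the title.
    m = match(r"\((\d{4})\)$", parts[2])
    m === nothing && error("no year found in: $line")
    year = parse(Int, m[1])
    genres = String.(split(parts[3], "|"))
    return (year = year, genres = genres)
end

records = parse_record.(lines)

# Step 4: count how many movies list each genre.
genre_counts = Dict{String,Int}()
for r in records, g in r.genres
    genre_counts[g] = get(genre_counts, g, 0) + 1
end

# The most common genre is the key with the largest count.
top_genre = first(sort(collect(genre_counts); by = last, rev = true)).first

# Step 5 (numbers only): per year, the share of movies listing the top genre.
# These are the values one would pass to a plotting function.
years = sort(unique(r.year for r in records))
share_by_year = Dict(
    y => count(r -> r.year == y && top_genre in r.genres, records) /
         count(r -> r.year == y, records)
    for y in years
)

println(top_genre)
println(share_by_year)
```

The sketch deliberately sticks to `split`, `match`, and `parse`, which are all part of Base Julia. For the real file, a frequency-table package and a plotting package would likely replace the hand-written counting and the final dictionary.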

6.1 Getting and inspecting the data

6.1.1 Downloading files from the web

6.1.2 Using common techniques of string construction

6.1.3 Reading the contents of a file

6.2 Splitting strings

6.3 Using regular expressions to work with strings

6.3.1 Working with regular expressions