6 Working with strings

 

This chapter covers

  • UTF-8 encoding of Julia strings; byte vs. character indexing
  • Manipulating strings: interpolation, splitting, using regular expressions, parsing
  • Working with symbols
  • Using the InlineStrings.jl package to work with fixed-width strings
  • Using the PooledArrays.jl package to compress vectors of strings

In this chapter, you will learn how to handle text data in Julia. Text data is stored in strings, which are among the most common data types you will encounter in data science projects, especially those involving natural language processing.

As an application of string processing, we will analyze the genres of movies that were rated by Twitter users. We want to find out which movie genre is most common and how the relative frequency of that genre changes with the movie's year.

For the analysis, we will use the movies.dat file, which is available at https://github.com/sidooms/MovieTweetings/blob/44c525d0c766944910686c60697203cda39305d6/snapshots/10K/movies.dat. The file is shared in the GitHub repository https://github.com/sidooms/MovieTweetings under the MIT license.

The analysis of the movie genre data proceeds in the following steps, which are described in the subsequent sections of this chapter (see also figure 6.1):

6.1 Getting and inspecting the data

6.2 Splitting strings

6.3 Working with strings using regular expressions

6.4 Extracting a subset from a string with indexing

6.5 Analyzing genres frequency in movies.dat

6.6 Introducing symbols

6.6.1 Creating symbols

6.6.2 Using symbols

6.7 Using fixed-width string types to improve performance

6.8 Compressing vectors of strings with PooledArrays.jl

6.8.1 Creating a file containing flower names

6.8.2 Reading in the data to a vector and compressing it

6.8.3 Internal design of PooledArray

6.9 Choosing an appropriate storage for collections of strings

6.10 Summary