6 Working with strings

 

This chapter covers

  • UTF-8 encoding of Julia strings; byte vs. character indexing
  • Manipulating strings: interpolation, splitting, using regular expressions, parsing
  • Working with symbols
  • Using the InlineStrings.jl package to work with fixed-width strings
  • Using the PooledArrays.jl package to compress vectors of strings
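Several of the topics listed above can be previewed with functions from Julia's standard library alone. The following is a minimal sketch rather than the chapter's own code; all sample strings are made up for illustration, and the two package-based topics (InlineStrings.jl and PooledArrays.jl) are left out so that the snippet runs without installing anything.

```julia
# Byte vs. character indexing: "ï" occupies two bytes in UTF-8.
s = "naïve"
n_chars = length(s)        # number of characters
n_bytes = ncodeunits(s)    # number of bytes
third = s[3]               # byte index 3 is where 'ï' starts
valid4 = isvalid(s, 4)     # byte 4 is in the middle of 'ï', so s[4] would throw
next_start = nextind(s, 3) # byte index where the next character starts

# Interpolation: $name and $(expression) are substituted into the literal.
name = "Julia"
greeting = "Hello, $name! 1 + 2 = $(1 + 2)"

# Splitting on a delimiter (hypothetical genre list).
genres = split("Crime|Drama", "|")

# Regular expression: capture four digits enclosed in parentheses.
m = match(r"\((\d{4})\)", "Some Title (1999)")

# Parsing: turn the captured text into an integer.
year = parse(Int, m[1])

# Symbols: two ways of creating the same symbol.
sym = Symbol("genre")
same = sym === :genre
```

Indexing a string uses byte offsets, which is why `isvalid` and `nextind` are useful when a string may contain non-ASCII characters.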

In this chapter you will learn how to handle text data in Julia. Text data is stored in strings, one of the most common data types you will encounter in data science projects, especially those that involve natural language processing.

As an application of string processing, we will analyze the genres of movies that were rated by Twitter users. We want to find out which movie genre was the most common and how the relative frequency of this genre changed with the movie year.
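To make these two questions concrete, here is a minimal sketch of how they could be answered once each movie has been reduced to a year and a list of genres. The `records` vector and the `yearly_share` helper are hypothetical and use made-up values, not data from movies.dat; the chapter's own analysis may be organized differently.

```julia
# Hypothetical toy records: (year, list of genres). Not taken from movies.dat.
records = [
    (2010, ["Drama", "Crime"]),
    (2010, ["Comedy"]),
    (2011, ["Drama"]),
    (2011, ["Drama", "Comedy"]),
    (2011, ["Action"]),
]

# Count how many movies carry each genre.
genre_counts = Dict{String,Int}()
for (_, genres) in records
    for g in genres
        genre_counts[g] = get(genre_counts, g, 0) + 1
    end
end

# For a dictionary, findmax returns the largest value and its key.
top_count, top_genre = findmax(genre_counts)

# Share of movies per year that carry the given genre (hypothetical helper).
function yearly_share(records, genre)
    total = Dict{Int,Int}()
    hits = Dict{Int,Int}()
    for (year, genres) in records
        total[year] = get(total, year, 0) + 1
        hits[year] = get(hits, year, 0) + (genre in genres)
    end
    return Dict(year => hits[year] / total[year] for year in keys(total))
end

share = yearly_share(records, top_genre)
```

With real data the same two steps apply: count genres over all movies, then compute the share of the most common genre within each year.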

For the analysis we will use the movies.dat file, available at https://github.com/sidooms/MovieTweetings/blob/44c525d0c766944910686c60697203cda39305d6/snapshots/10K/movies.dat. It is part of the GitHub repository https://github.com/sidooms/MovieTweetings, which is shared under the MIT license.

The analysis of the movie genre data is carried out in steps that are described in the consecutive sections of this chapter (see also figure 6.1):

6.1 Getting and inspecting the data

6.2 Splitting strings

6.3 Working with strings using regular expressions

6.4 Extracting a subset from a string with indexing

6.5 Analyzing genre frequency in movies.dat

6.6 Introducing symbols

6.6.1 Creating symbols

6.6.2 Using symbols

6.7 Using fixed-width string types to improve performance

6.8 Compressing vectors of strings with PooledArrays.jl

6.8.1 Creating a file containing flower names

6.8.2 Reading in the data to a vector and compressing it

6.8.3 Internal design of PooledArray

6.9 Choosing an appropriate storage for collections of strings

6.10 Summary