6 Working with strings
This chapter covers
- UTF-8 encoding of Julia strings; byte vs. character indexing
- Manipulating strings: interpolation, splitting, using regular expressions, parsing
- Working with symbols
- Using the InlineStrings.jl package to work with fixed-width strings
- Using the PooledArrays.jl package to compress vectors of strings
In this chapter you will learn how to handle text data in the Julia language. Text data is stored in strings, which are one of the most common data types that you will encounter in data science projects, especially those involving natural language processing.
As an application of string processing, we will analyze the genres of movies that were rated by Twitter users. We want to understand which movie genre was most common and how the relative frequency of this genre changed with the movie year.
For the analysis we will use the movies.dat file, which is shared in the GitHub repository https://github.com/sidooms/MovieTweetings under the MIT license. The file URL is https://github.com/sidooms/MovieTweetings/blob/44c525d0c766944910686c60697203cda39305d6/snapshots/10K/movies.dat.
We will analyze the movie genre data in the following steps, which are described in the subsequent sections of this chapter (see also figure 6.1):