This chapter covers
- UTF-8 encoding of Julia strings; byte versus character indexing
- Manipulating strings: interpolation, splitting, using regular expressions, parsing
- Working with symbols
- Using the InlineStrings.jl package to work with fixed-width strings
- Using the PooledArrays.jl package to compress vectors of strings
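As a brief preview of the first item above, here is a minimal sketch of how byte indexing and character indexing can differ. The sample string `title` is a hypothetical value chosen only because it contains one non-ASCII character; the functions used (`length`, `ncodeunits`, `isvalid`, `nextind`, `eachindex`) are part of Base Julia.

```julia
# Hypothetical sample string; 'ô' occupies 2 bytes in UTF-8, every other character 1 byte.
title = "Fantômas"

n_chars = length(title)       # number of characters
n_bytes = ncodeunits(title)   # number of bytes (UTF-8 code units)

# Square-bracket indexing uses byte positions, not character positions.
# 'ô' starts at byte 5 and spans bytes 5 and 6.
o_char = title[5]
byte6_valid = isvalid(title, 6)   # false: byte 6 falls inside 'ô'
next_index = nextind(title, 5)    # byte index of the character that follows 'ô'

# Byte indices at which each character starts.
char_indices = collect(eachindex(title))

println((n_chars, n_bytes, o_char, byte6_valid, next_index))
println(char_indices)
```

In this sketch the string has 8 characters but 9 bytes, so `title[6]` would throw a `StringIndexError`; functions such as `nextind` and `eachindex` are one way to move between valid positions.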
In this chapter, you will learn how to handle text data in the Julia language. Text data is stored in strings, which are among the most common data types you will encounter in data science projects, especially those involving natural language processing.
As an application of string processing, we will analyze the genres of movies that were rated by Twitter users. We want to find out which movie genre is most common and how the relative frequency of this genre changes with the year of the movie.
For this analysis, we will use the movies.dat file. The file URL is http://mng.bz/9Vao, and the file is shared in the GitHub repository https://github.com/sidooms/MovieTweetings under an MIT license.