chapter six

6 Strings

This chapter covers

Problems with characters that don’t fit the Java char type
Bugs caused by relying on the default system locale
Discrepancies between format string and subsequent format arguments
Accidental use of regular expressions
Pitfalls associated with Java escape sequences
Possible mistakes when using the indexOf() method

A variety of bugs may come up when using strings. Strings may appear deceptively simple, but working with them correctly is quite difficult, as many common assumptions about them are incorrect.

6.1 Mistake 45: Assuming that char value is a character

Developers often assume that the Java char type corresponds to a single displayed character. They naturally expect the String.length() method to return the number of displayed characters and think it’s OK to process strings char by char. This holds in simple cases, but a character whose Unicode code point is higher than 0xFFFF (65,535) lies outside the so-called Basic Multilingual Plane (BMP) and is represented as a surrogate pair: two Java char values that together encode a single character. Many emoji characters are located outside the BMP and require a surrogate pair to be represented.
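To make the difference between char values and code points concrete, here is a minimal sketch. The class name SurrogateDemo and its helper methods are illustrative assumptions; only String.length(), String.codePointCount(), and the Character surrogate checks are standard library calls.

```java
public class SurrogateDemo {
    // Number of UTF-16 code units (Java char values) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face emoji) lies outside the BMP,
        // so it is written here as an explicit surrogate pair.
        String smile = "\uD83D\uDE00";

        System.out.println(charCount(smile));       // 2
        System.out.println(codePointCount(smile));  // 1

        // Each half of the pair is a char on its own, but not a character.
        System.out.println(Character.isHighSurrogate(smile.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(smile.charAt(1)));  // true
    }
}
```

For text that stays within the BMP, such as "abc", both counts agree, which is why this mistake often goes unnoticed until non-BMP input arrives.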

For example, if it’s necessary to split the text into fixed chunks to distribute it to several rows in the UI, a naïve approach would be to use something like this (for simplicity, let’s omit bounds checking):
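A char-based split of this kind, alongside a code-point-aware variant, might be sketched as follows. This is an illustration under stated assumptions rather than the book's own listing: the class name RowChunker, the method names naiveRow and codePointRow, and the parameters rowNumber and rowLength are hypothetical. As in the text, bounds checking is omitted.

```java
public class RowChunker {
    // Naive version: indexes by char, so a row boundary may fall
    // in the middle of a surrogate pair.
    static String naiveRow(String text, int rowNumber, int rowLength) {
        return text.substring(rowNumber * rowLength, (rowNumber + 1) * rowLength);
    }

    // Code-point-aware version: translates code point offsets into
    // char indexes first, so surrogate pairs are never split.
    static String codePointRow(String text, int rowNumber, int rowLength) {
        int start = text.offsetByCodePoints(0, rowNumber * rowLength);
        int end = text.offsetByCodePoints(start, rowLength);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Six code points, seven chars: 'a', 'b', U+1F600, 'c', 'd', 'e'.
        String text = "ab\uD83D\uDE00cde";

        // The naive split leaves a lone high surrogate at the end of row 0
        // and a lone low surrogate at the start of row 1.
        String row0 = naiveRow(text, 0, 3);
        System.out.println(Character.isHighSurrogate(row0.charAt(2))); // true

        // The code-point-aware split keeps the emoji intact in row 0.
        System.out.println(codePointRow(text, 0, 3).length()); // 4
        System.out.println(codePointRow(text, 1, 3));          // cde
    }
}
```

With the naive version, the two halves of the emoji end up in different rows, and each half is typically rendered as a placeholder or garbage glyph. The second version relies on String.offsetByCodePoints(), which throws IndexOutOfBoundsException when the text contains fewer code points than requested, so real code would still need the bounds checks omitted here.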

6.2 Mistake 46: Unexpected case conversions

6.3 Mistake 47: Using String.format with the default locale

6.4 Mistake 48: Mismatched format arguments

6.5 Mistake 49: Using plain strings instead of regular expressions

6.6 Mistake 50: Accidental use of replaceAll

6.7 Mistake 51: Accidental use of escape sequences

6.8 Mistake 52: Comparing strings in different case

6.9 Mistake 53: Not checking the result of indexOf method

6.10 Mistake 54: Mixing arguments of indexOf