6 Strings
This chapter covers
- Problems with characters that don’t fit the Java ‘char’ type
- Bugs caused by relying on the default system locale
- Discrepancies between format string and subsequent format arguments
- Accidental use of regular expressions
- Pitfalls around Java escape sequences
- Possible mistakes when using the indexOf() method
There are many possible bugs involving strings. Strings look deceptively simple, but working with them correctly is quite difficult. Many common assumptions about strings are wrong.
6.1 Mistake #45. Assuming that a char value is a character
Developers often assume that the Java char type corresponds to a single displayed character, that the String.length method returns the number of displayed characters, or that strings can be processed char by char. This is true in simple cases, but characters whose Unicode code point is 0x10000 or higher lie outside the so-called Basic Multilingual Plane (BMP) and are represented as surrogate pairs: two Java char values represent a single character. Many emoji characters are located outside the BMP and require a surrogate pair to be represented.
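To make the difference between char values and code points concrete, here is a minimal sketch. The class name SurrogateDemo and the helper codePoints are hypothetical names chosen for illustration; the emoji U+1F600 is just one example of a character outside the BMP.

```java
public class SurrogateDemo {
    // Hypothetical helper: counts Unicode code points rather than char values
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face) written as its surrogate pair
        String emoji = "\uD83D\uDE00";

        System.out.println(emoji.length());       // 2: two char values
        System.out.println(codePoints(emoji));    // 1: a single code point

        // The first char is a high surrogate, the second a low surrogate
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(emoji.charAt(1)));  // true

        // codePointAt combines the pair back into the original code point
        System.out.println(Integer.toHexString(emoji.codePointAt(0)));  // 1f600
    }
}
```

Neither char of the pair is meaningful on its own, so any code that treats them independently risks producing broken text.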
For example, if you need to split text into fixed-size chunks to distribute it across several rows in the UI, a naïve approach would be something like this (for simplicity, bounds checking is omitted):
String part = string.substring(rowNumber * rowLength, (rowNumber + 1) * rowLength);
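The following sketch shows how such a substring call can cut a surrogate pair in half, along with one possible alternative that counts code points instead of char values. The class name ChunkDemo and the method chunk are hypothetical; like the snippet above, the sketch omits bounds checking, so offsetByCodePoints throws IndexOutOfBoundsException if the text is too short for the requested row.

```java
public class ChunkDemo {
    // Hypothetical helper: returns the given row, measured in code points
    static String chunk(String string, int rowNumber, int rowLength) {
        // Translate code point counts into char indexes before slicing
        int start = string.offsetByCodePoints(0, rowNumber * rowLength);
        int end = string.offsetByCodePoints(start, rowLength);
        return string.substring(start, end);
    }

    public static void main(String[] args) {
        // a, b, emoji U+1F600, c, d: 6 char values but 5 code points
        String text = "ab\uD83D\uDE00cd";

        // Char-based slicing ends in the middle of the surrogate pair
        String naive = text.substring(0, 3);
        System.out.println(Character.isHighSurrogate(naive.charAt(2))); // true

        // Code-point-based slicing keeps the pair together
        String safe = chunk(text, 0, 3);
        System.out.println(safe.equals("ab\uD83D\uDE00")); // true
    }
}
```

Counting code points avoids splitting surrogate pairs, but a code point is still not necessarily one displayed character, so this should be read as an illustration of the surrogate problem rather than a complete solution.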