6 Strings
This chapter covers
- Problems with characters that don’t fit the Java char type
- Bugs caused by relying on the default system locale
- Discrepancies between format string and subsequent format arguments
- Accidental use of regular expressions
- Pitfalls associated with Java escape sequences
- Possible mistakes when using the indexOf() method
A variety of bugs may come up when using strings. Strings may appear deceptively simple, but working with them correctly is quite difficult, because many common assumptions about them are incorrect.
6.1 Mistake 45: Assuming that a char value is a character
Developers often assume the Java char type corresponds to a single displayed character. They naturally expect the String.length() method to return the number of displayed characters and think it’s OK to process strings char by char. This is true in simple cases, but characters whose Unicode code point is higher than 0xffff (65,535) lie outside the so-called Basic Multilingual Plane (BMP) and are represented as surrogate pairs: two Java char values that together represent a single character. Many emoji characters are located outside the BMP and require a surrogate pair to be represented.
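To make the difference concrete, here is a minimal sketch that compares the number of char values with the number of code points for a string containing one emoji. The class name SurrogatePairDemo and the helper codePointLength() are hypothetical names chosen for illustration. The emoji U+1F600 is written as an explicit surrogate pair so that the example does not depend on the source file encoding. Counting code points is closer to counting displayed characters, but the two are not always the same.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: counts Unicode code points instead of char values.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The letter 'a' followed by U+1F600 (grinning face), which lies
        // outside the BMP and therefore occupies two char values.
        String text = "a\uD83D\uDE00";

        System.out.println(text.length());          // 3 char values
        System.out.println(codePointLength(text));  // 2 code points

        // The two halves of the pair are not characters on their own.
        System.out.println(Character.isHighSurrogate(text.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(text.charAt(2)));  // true

        // codePointAt() combines the pair back into a single code point.
        System.out.println(Integer.toHexString(text.codePointAt(1)));  // 1f600
    }
}
```

Any code that indexes into this string with charAt(2), or cuts it with substring(0, 2), lands in the middle of the pair and produces an unpaired surrogate.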
For example, if it’s necessary to split the text into fixed chunks to distribute it to several rows in the UI, a naïve approach would be to use something like this (for simplicity, let’s omit bounds checking):