8 Unicode and wide characters

In the beginning was Morse code, a simple method of translating electrical pulses—long and short—into a string of characters and readable text. Morse wasn’t the first electronic encoding method, but it’s perhaps the best known. Developed in 1840, it’s named after Samuel Morse, who helped invent the telegraph and who also bears an uncanny resemblance to Lost in Space’s Dr. Smith.

Some 30 years after Morse code came the Baudot code. Also used in telegraph communications, Baudot (baw-DOH) represents letters of the alphabet using a 5-bit sequence. This code was later modified into Murray code for use on teletype machines with keyboards, as well as early computers. Then came IBM's Binary Coded Decimal (BCD) for use on its mainframe computers. Eventually, the ASCII encoding standard ruled the computer roost until Unicode solved everyone's text encoding problems in the late 20th century.

This chapter's topic is character encoding, the art of taking an alphabet soup of characters and assigning each one a code value for digital representation in a computer. The culmination of this effort is Unicode, which slaps a value on almost every imaginable written scribble in the history of mankind. To help explore Unicode in the C language, this chapter covers:

8.1 Text representation in computers

8.1.1 Reviewing early text formats

8.1.2 Evolving into ASCII text and code pages

8.1.3 Diving into Unicode

8.2 Wide character programming

8.2.1 Setting the locale

8.2.2 Exploring character types

8.2.3 Generating wide character output

8.2.4 Receiving wide character input

8.2.5 Working with wide characters in files
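The sections ahead work through each of these topics in detail. As a rough preview, and assuming a C99 (or later) compiler and a terminal able to display Unicode text, the short sketch below touches on two of them: it sets the locale and then generates wide character output with wprintf().

#include <locale.h>
#include <wchar.h>

int main(void)
{
    /* Trade the default "C" locale for the user's native locale so
       wide characters are translated properly for output. */
    setlocale(LC_ALL, "");

    /* The L prefix builds a wide string; the \u escapes name Unicode
       code points: e-acute, n-tilde, and a snowman. */
    wprintf(L"Wide character output: \u00e9 \u00f1 \u2603\n");

    return 0;
}

Build and run the sketch as you would any other C program; on a terminal using UTF-8 encoding, the three characters appear as written.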