Java: Unicode
Unicode is a system of encoding characters. All characters and Strings in Java use the Unicode encoding, which allows truly international programming.
About Unicode
- The Unicode effort is not coordinated with Java. At the time that Java was started, all of the characters then defined in Unicode (well under 65,536) could be represented with 16 bits (2 bytes). Consequently, Java used a 2-byte representation for characters. This fixed-width form is properly called UCS-2; Java now treats its 16-bit characters as UTF-16.
- ASCII. Most programming languages before Java (C/C++, Pascal, Basic, ...) use 8-bit characters, usually holding ASCII (American Standard Code for Information Interchange). ASCII is a 7-bit code that defines only 128 characters, and the other 128 values of an 8-bit byte are often used for various extensions.
- All of the world's major human languages can be represented in Unicode (including Chinese, Japanese, and Korean).
- The first 128 characters of Unicode have the same values as the equivalent ASCII characters. The first 256 characters are the same as ISO-8859-1 (Latin-1).
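The points above about 16-bit characters and ASCII compatibility can be checked with a few lines of Java. This is only a minimal sketch: the class name UnicodeBasics and the helper codeOf are made up for illustration and are not part of any library.

```java
public class UnicodeBasics {
    // Hypothetical helper: returns the numeric Unicode value of a char.
    // A Java char is a 16-bit value, so widening it to int is lossless.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        // Number of bits used to store one char.
        System.out.println(Character.SIZE);             // 16

        // 'A' has the same value in ASCII and in Unicode.
        System.out.println(codeOf('A'));                // 65

        // e with acute accent has the same value in Latin-1 and in Unicode.
        System.out.println(codeOf('\u00E9'));           // 233

        // Largest value a single char can hold.
        System.out.println((int) Character.MAX_VALUE);  // 65535
    }
}
```

The cast to int is what exposes the underlying code; printing a char directly would print the character itself.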
However, Unicode, now at version 4.0, has defined more characters than fit into two bytes. To accommodate this unfortunate occurrence, Java 5 has added facilities to work with surrogate pairs, in which a single character outside the 16-bit range is represented by two consecutive char values. As a practical matter, most Java programs are written with the assumption that every character fits in one two-byte char. The characters that don't fit into two bytes are rarely used, so it doesn't seem to be a serious deficiency. We'll see how this works out in the future.
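To make the surrogate pair idea concrete, here is a small sketch using methods that were added to Character and String in Java 5. The class name SurrogateDemo and its two helper methods are invented for this example; the character chosen, U+1D11E (MUSICAL SYMBOL G CLEF), is just one convenient character that lies outside the 16-bit range.

```java
public class SurrogateDemo {
    // Hypothetical helper: builds a String holding one Unicode code point.
    // Character.toChars returns one char, or two for a surrogate pair.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Hypothetical helper: counts Unicode characters (code points)
    // rather than 16-bit char values.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E);

        // length() counts 16-bit chars, so the single character counts twice.
        System.out.println(clef.length());                              // 2
        System.out.println(countCodePoints(clef));                      // 1

        // The two chars are the high and low halves of the pair.
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));   // true

        // codePointAt reassembles the pair into the original value.
        System.out.println(Integer.toHexString(clef.codePointAt(0)));   // 1d11e
    }
}
```

Code that assumes one char per character (for example, loops using length() and charAt()) will see such a character as two separate values, which is the deficiency described above.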
Unicode Fonts
Although Java stores characters as Unicode, there are still some very practical operating system problems in entering or displaying many Unicode characters. Most fonts display only a very small subset of all Unicode characters, typically about 100 different characters.
References
- www.unicode.org.
- Counting Characters, from Tom White's blog.