The 7 Ways of Counting Characters

Hello, I am SJ, an engineer working at LINE.

In this post, I would like to talk about counting characters. Many LINE services need to count characters in places such as profile names, group names, and status messages. Counting the characters on-screen is important for several reasons. The text must not be shorter or longer than necessary, and storage capacity must be allocated accordingly. As LINE is used worldwide, it is crucial that string length can be calculated precisely for various languages. One day, we encountered an issue where an emoji would be counted as 2 characters when it should have counted as 1. Originating in Japan, emoji is a language composed of images that is used all around the world today.

Now, it is even officially recognized and included in the Unicode standard. At first, we started analyzing the issue thinking that the problem lay in surrogates not being counted correctly. Simply put, a surrogate pair is two 16-bit code units that UTF-16 combines to represent a character beyond the 16-bit range. Since some emoji are represented with surrogates, we naturally assumed that the issue would lie there. On further investigation, we discovered that some emoji counted as 2 characters regardless of surrogates. With this in mind, we came to the conclusion that some emoji had an extra character added to the end. Just as we were planning to set up an exception rule, we received another issue report.

“Text in Thai is not counted correctly.”

Soon we learned that other languages such as Arabic and Hindi also had the same issue. Thailand was one of the countries with the most LINE users, India was the 2nd most populous country in the world, and there were many LINE users who wrote in the Arabic script, such as those from Iran. A fundamental solution was needed. The first thing we discovered was that Thai (ภาษาไทย), Devanagari (देवनागरी, Hindi) and Arabic (العربية) were all segmental writing systems. As we continued our research on the Unicode standard for segmental writing systems, we learned that something as simple as counting characters is not so simple when you are working on a global service.

The Details

Q: How many characters are there in “
A: It depends on how you define a “character.”

  1. Bytes : 8-bit. The number of bytes that a Unicode string will take up in memory or storage depends on the encoding.
  2. Code Units : The smallest bit combination that can represent a single unit of encoded text. For example, 1 code unit is 1 byte in UTF-8, 2 bytes in UTF-16, and 4 bytes in UTF-32.
  3. Code Points : Unicode character. A single integer value (from U+0000 to U+10FFFF) in the Unicode code space.
  4. Grapheme clusters : A single character as perceived by the user. 1 grapheme cluster consists of one or more code points.
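The four definitions above can be compared side by side on one string. The following is a minimal sketch, not the code used in LINE services; the class name CharacterCounts and its method names are made up for illustration, and grapheme boundaries are whatever java.text.BreakIterator reports on the running JDK.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

public class CharacterCounts {
    // Bytes: depends on the chosen encoding (UTF-8 here).
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Code units: Java strings are UTF-16, so this is the number of 'char' values.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Code points: a surrogate pair counts once.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Grapheme clusters: count the boundaries reported by BreakIterator.
    public static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        System.out.println("bytes (UTF-8): " + utf8Bytes(s));
        System.out.println("code units:    " + codeUnits(s));
        System.out.println("code points:   " + codePoints(s));
        System.out.println("graphemes:     " + graphemes(s));
    }
}
```

For the accented "e" in main, the four counts are 3, 2, 2 and 1 respectively, so the same user-visible character gives three different answers depending on the definition.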

Several methods of defining “characters” and how to count them.

  • Grapheme
    • Characters perceived by the user. The smallest distinctive unit of writing in the context of a particular writing system. 1 Grapheme consists of N code points.
    • e.g. : A 각
    • How to count
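import java.text.BreakIterator;

// Counts grapheme clusters by walking the character boundaries of the string.
public static int getGraphemeLength(String value) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(value);
    int count = 0;
    while (it.next() != BreakIterator.DONE) {
        count++;
    }
    return count;
}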
  • Code Point
    • A Unicode character. Any value in the Unicode code space; the range of integers is from 0 to 0x10FFFF.
    • e.g. : U+AC01
    • How to count
str.codePointCount(0, str.length())
  • UTF-16BE
    • A multibyte encoding for text that represents each code point with 2 or 4 bytes (big endian). Each 2-byte code unit maps 1:1 to the Java primitive ‘char’. Code points in the range U+10000-U+10FFFF must be encoded in 4 bytes (2 code units) as a high/low surrogate pair.
    • e.g. : 0xAC01
    • How to count
String.length() 
(code unit count) 
  • UTF-8
    • The Unicode encoding form that assigns each code point value to an unsigned byte sequence of 1 to 4 bytes in length.
    • e.g. : 0xEA,0xB0,0x81,0xF0,0x9F,0x85,0xB1
    • How to count
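String.getBytes(StandardCharsets.UTF_8).length
String.getBytes("UTF-8").length
(byte count; getBytes() with no argument uses the platform default charset, which may not be UTF-8)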
  • CESU-8
    • The Unicode encoding form that assigns each code point value to an unsigned byte sequence of 1,2,3 or 6 bytes in length.
    • e.g. : 0xED,0xA0,0xBC,0xED,0xB5,0xB1
    • How to count
public static int getCESU8Length(String str) {
    int strlen = str.length(), utflen = 0, c = 0;
    for (int i = 0; i < strlen; i++) {
        c = str.charAt(i); 
        if ((c >= 0x0000) && (c <= 0x007F)) utflen++;
        else if (c > 0x07FF) utflen += 3;
        else utflen += 2; 
    } 
    return utflen; 
} 
  • Modified UTF-8
    • Modified UTF-8 is a special form of CESU-8 that encodes null (U+0000) as 0xC0,0x80 (used only in Java serialization, class files, etc.)
    • e.g. : 0xED,0xA0,0xBC,0xED,0xB5,0xB1,0xC0,0x80
    • How to count
public static int getModifiedUTF8Length(String str) { 
    int strlen = str.length(), utflen = 0, c = 0; 
    for (int i = 0; i < strlen; i++) { 
        c = str.charAt(i); 
        if ((c >= 0x0001) && (c <= 0x007F)) utflen++;
        else if (c > 0x07FF) utflen += 3;
        else utflen += 2; 
    } 
    return utflen; 
} 
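One way to sanity-check the Modified UTF-8 arithmetic above is to compare it with java.io.DataOutputStream.writeUTF, which writes a 2-byte length prefix followed by the string in Modified UTF-8. This is a sketch for verification only; the class name ModifiedUtf8Check and the helper writeUtfPayloadLength are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Check {
    // Same per-code-unit arithmetic as getModifiedUTF8Length in the text.
    public static int getModifiedUTF8Length(String str) {
        int utflen = 0;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) utflen++;
            else if (c > 0x07FF) utflen += 3;
            else utflen += 2; // includes U+0000, written as 0xC0,0x80
        }
        return utflen;
    }

    // Number of payload bytes that writeUTF actually emits for str.
    // Note: writeUTF rejects strings whose encoded form exceeds 65535 bytes.
    public static int writeUtfPayloadLength(String str) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(str);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2; // drop the 2-byte length prefix
    }

    public static void main(String[] args) {
        // U+1F171 as a surrogate pair, followed by U+0000.
        String s = "\uD83C\uDD71\0";
        System.out.println(getModifiedUTF8Length(s));
        System.out.println(writeUtfPayloadLength(s));
    }
}
```

Both calls in main give 8 for this string, matching the 8-byte Modified UTF-8 example listed above (6 bytes for the surrogate pair plus 2 bytes for null).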

Examples

GEMINI is U+264A as a code point, and is encoded in 3 bytes in UTF-8.
When GEMINI is entered on devices such as the iPhone, a variation selector character (VS15) is attached, so it is expressed as 2 code points.
Here is another emoji example. The base character U+1F171 is allocated in an area that exceeds 16 bits. It is encoded as a high/low surrogate pair of 4 bytes in UTF-16, as 4 bytes in UTF-8, and as 6 bytes in CESU-8.
As you can see above, there are also emoji that are expressed in 3 code points.
When writing in Devanagari, a single character can sometimes be expressed in 4 code points. Arabic and Thai also normally have several code points expressing a single character as well.
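The sizes quoted for the two emoji examples can be reproduced with a few standard library calls. This is only an illustrative sketch; the class name EmojiSizes and its helpers are made up, and VS15 is written as its code point U+FE0E.

```java
import java.nio.charset.StandardCharsets;

public class EmojiSizes {
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static int utf16Bytes(String s) {
        // UTF_16BE writes no byte order mark, so this is 2 bytes per code unit.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String gemini = "\u264A";             // GEMINI
        String geminiVs15 = "\u264A\uFE0E";   // GEMINI followed by VS15
        String squaredB = new String(Character.toChars(0x1F171));

        System.out.println(utf8Bytes(gemini));      // UTF-8 bytes of GEMINI alone
        System.out.println(codePoints(geminiVs15)); // code points with VS15 attached
        System.out.println(squaredB.length());      // UTF-16 code units of U+1F171
        System.out.println(utf16Bytes(squaredB));   // UTF-16 bytes of U+1F171
        System.out.println(utf8Bytes(squaredB));    // UTF-8 bytes of U+1F171
    }
}
```

The printed values are 3, 2, 2, 4 and 4, in line with the figures in the examples.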

The Korean language and some Latin texts also have composed and decomposed forms (e.g. Korean jamo, diacritics). For example, “각” (U+AC01) can also be expressed as “ㄱㅏㄱ” (U+1100, U+1161, U+11A8). These forms are called NFC (각) and NFD (ㄱㅏㄱ). Any program that handles Unicode must recognize the two as the same character, and this is where Unicode normalization comes into play. While modern Korean can be expressed with 1 code point per syllable when using NFC, ancient Korean, Devanagari, Arabic, and Thai still require several code points even in NFC. (The Devanagari character kshi in the example above is in NFC.) The use of NFC and NFD can differ between operating systems. Mac OS uses NFD when handling Unicode file paths, which leads to Korean filenames in archive files being displayed incorrectly when opened in Windows.
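In Java, the two forms can be converted into each other with java.text.Normalizer. The sketch below uses the “각” example from the paragraph above; the class name HangulNormalization and the helper canonicallyEqual are made up for illustration, and comparing after NFC is just one possible way to treat the two forms as the same text.

```java
import java.text.Normalizer;

public class HangulNormalization {
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Treats two strings as equal if their NFC forms are identical.
    public static boolean canonicallyEqual(String a, String b) {
        return nfc(a).equals(nfc(b));
    }

    public static void main(String[] args) {
        String composed = "\uAC01";               // NFC: one code point
        String decomposed = "\u1100\u1161\u11A8"; // NFD: three conjoining jamo

        System.out.println(composed.equals(decomposed));           // false
        System.out.println(nfd(composed).equals(decomposed));      // true
        System.out.println(canonicallyEqual(composed, decomposed)); // true
    }
}
```

A plain equals() sees two different strings, which is why normalizing before comparing or storing matters when input can arrive in either form.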

Nevertheless, grapheme clusters are always counted as 1. This is why grapheme cluster counts should be used to count the number of characters a user would perceive, rather than using code units or code points.
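If a length limit is expressed in graphemes, any shortening of the text should also cut at grapheme boundaries, so that a base character is never separated from its combining marks or a surrogate pair split in half. The following is a minimal sketch under that assumption; the class name GraphemeLimit and the method truncateToGraphemes are hypothetical and not taken from LINE's code.

```java
import java.text.BreakIterator;

public class GraphemeLimit {
    // Returns the longest prefix of value that contains at most maxGraphemes
    // grapheme clusters, cutting only at boundaries reported by BreakIterator.
    public static String truncateToGraphemes(String value, int maxGraphemes) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(value);
        int end = 0;
        for (int i = 0; i < maxGraphemes; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                break; // the string has fewer clusters than the limit
            }
            end = next;
        }
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        // 'e' + U+0301 COMBINING ACUTE ACCENT, then "abc"
        String s = "e\u0301abc";
        // Keeps the accent attached to its base character.
        System.out.println(truncateToGraphemes(s, 2));
    }
}
```

With a limit of 2, the result is the accented "e" followed by "a" (3 code units), whereas a naive substring(0, 2) would return only the accented "e".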

Counting Graphemes in Other Programming Languages

  • Java
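import java.text.BreakIterator;

// Counts grapheme clusters by walking the character boundaries of the string.
public static int getGraphemeLength(String value) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(value);
    int count = 0;
    while (it.next() != BreakIterator.DONE) {
        count++;
    }
    return count;
}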
  • C++
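#include <cassert>
#include <memory>
#include <unicode/brkiter.h>
#include <unicode/unistr.h>

using namespace icu;

// Counts grapheme clusters with ICU's character BreakIterator.
int getGraphemeLength(const UnicodeString &str) {
    UErrorCode err = U_ZERO_ERROR;
    std::unique_ptr<BreakIterator> iter(
        BreakIterator::createCharacterInstance(Locale::getDefault(), err));
    assert(U_SUCCESS(err));
    iter->setText(str);
    int count = 0;
    while(iter->next() != BreakIterator::DONE) ++count;
    return count;
}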
  • Go
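import (
    "unicode"
    "unicode/utf8"
)

// grLen approximates the grapheme count: every rune after the first that is
// not a nonspacing mark (Mn) is treated as the start of a new grapheme.
// This is not full Unicode grapheme cluster segmentation.
func grLen(s string) int {
    if len(s) == 0 {
        return 0
    }
    gr := 1
    _, s1 := utf8.DecodeRuneInString(s)
    for _, r := range s[s1:] {
        if !unicode.Is(unicode.Mn, r) {
            gr++
        }
    }
    return gr
}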
  • Perl 6
say 'møøse'.graphs;
  • PHP
$length = grapheme_strlen('Hello, world!');
  • Swift
str.count // Swift 4 and later; countElements(str) in Swift 1.0 and 1.1

Why is Java’s primitive “char” designed to correspond to 1 code unit of UTF-16 instead of 1 grapheme or 1 code point? Because when Java was first designed, every Unicode code point fit in 16 bits.

The concept of “encoding every character in 16 bits” was something that the original designers of Unicode were proud enough of to include in their design principles. (Not long after Java was announced, Unicode was expanded beyond 16 bits. As of Unicode 7.0, the code space extends to U+10FFFF, which is 17*65536=1,114,112 code points.) Meanwhile, the “utf8” charset in MySQL and Oracle is closer to CESU-8 than it is to UTF-8, and may require more space. To store text as real UTF-8, the charset “AL32UTF8” (Oracle) or “utf8mb4” (MySQL) must be used. Swift, one of the most recent programming languages, defines its character type so that it represents 1 grapheme.

The recommended allocation of space for saving 1 grapheme is 4 code points, which means 4 ‘char’ values or 12 bytes in UTF-8. Finding the maximum number of code points required for 1 grapheme is a complicated matter; the different automata and writing systems for each language must be taken into account. For that very reason, not even the Unicode standard has a specific section on the issue.

Going back to the first question, the string is 6 graphemes, 13 code points, and 36 bytes when encoded in UTF-8.