Game Localization and UTF-8

After some research I have decided a pretty good way to localize your game is to store all text in text files, in UTF-8 format. UTF-8 is extremely widely used and 100% backwards compatible with ASCII. The clever design of the UTF-8 encoding also lets programmers treat UTF-8 buffers as plain byte arrays. As an example, UTF-8 buffers can be sorted byte-by-byte and the result still comes out in valid codepoint order!
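Here is a tiny sketch (not from the original post) showing that sorting property: comparing UTF-8 strings byte-by-byte with plain strcmp gives the same ordering as comparing their decoded codepoints.

    #include <stdio.h>
    #include <string.h>

    int main( void )
    {
        const char* a = "z";            // U+007A
        const char* b = "\xC3\x9F";     // U+00DF, the character ß
        const char* c = "\xE6\xB0\xB4"; // U+6C34, the character 水

        // strcmp only sees raw bytes, yet the ordering matches codepoint
        // order: U+007A < U+00DF < U+6C34.
        printf( "%d %d\n", strcmp( a, b ) < 0, strcmp( b, c ) < 0 ); // prints: 1 1
        return 0;
    }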


https://en.wikipedia.org/wiki/UTF-8

However, sometimes UTF-16 is needed to deal with Windows APIs, so conversions from UTF-8 to UTF-16 can be pretty useful. To make things worse, there isn't much good information on encoding/decoding UTF-8 and UTF-16 other than the original RFC documents. So in swift fashion I hooked up a tiny header, with the help of Mitton's UTF-8 encoder/decoder from tigr.
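For context, the usual Win32 route for that conversion looks roughly like the sketch below (the helper name is mine and error handling is omitted); it's exactly the kind of boilerplate a tiny header can wrap up:

    #include <windows.h>
    #include <stdlib.h>

    // Hypothetical helper: convert a null-terminated UTF-8 string to UTF-16
    // with the Win32 API. The first call measures, the second call converts.
    wchar_t* Utf8ToUtf16( const char* utf8 )
    {
        int count = MultiByteToWideChar( CP_UTF8, 0, utf8, -1, NULL, 0 );
        wchar_t* utf16 = (wchar_t*)malloc( count * sizeof( wchar_t ) );
        MultiByteToWideChar( CP_UTF8, 0, utf8, -1, utf16, count );
        return utf16;
    }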

The result is tinyutf.h, yet another single-file header library. It's perfect for doing string processing work, and not overly complicated in any way. Do a quick Google search for UTF-8 string libraries and nearly every one of them is nuts: over-engineered and heavyweight, probably because they all attempt to be general-purpose string libraries, which are debatably dumb to begin with.

The hard part of localization is probably rendering all kinds of glyphs. Rendering aside, localization can be as simple as storing the original English text in text files. A localizer (translator) can read the English and swap out each English phrase for UTF-8 encoded text by typing the phrase in their native language. As long as localization is planned from the start of the project, it can be very easy to support!
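As a hypothetical sketch of that workflow (the file names and keys below are made up for illustration), the English file pairs an identifier with each phrase, and the translator hands back a file with the same keys but UTF-8 text in their language:

    strings_en.txt:
    MENU_NEW_GAME = New Game
    MENU_QUIT     = Quit

    strings_ja.txt:
    MENU_NEW_GAME = ニューゲーム
    MENU_QUIT     = 終了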


2 thoughts on “Game Localization and UTF-8”

    1. Randy Gaul (post author)

      That looks decent, but it does come with downsides. Just check out the example for one downside:

      std::string u8 = u8"z\u00df\u6c34\U0001f34c";
      std::u16string u16 = u"z\u00df\u6c34\U0001f34c";

      // UTF-8 to UTF-16/char16_t
      std::u16string u16_conv = std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>{}.from_bytes(u8);
      assert(u16 == u16_conv);
      std::cout << "UTF-8 to UTF-16 conversion produced " << u16_conv.size() << " code units:\n";
      for (char16_t c : u16_conv)
        std::cout << std::hex << std::showbase << c << ' ';

      In my opinion it's just... ridiculous! This is *not* a good API, or an easy-to-use one. Compare that to the header I wrote:

      int size;
      const char* utf8_text = tdReadFileToMemory( "utf8.txt", &size );
      wchar_t* utf16_text = (wchar_t*)malloc( size * 2 );
      char* utf8_processed = (char*)malloc( size ); // buffer for the round-trip back to utf8

      // to utf16
      tuWiden( utf8_text, size, utf16_text );

      // to utf8
      tuShorten( utf16_text, size, utf8_processed );

      And a big problem with that API is that, in practice, a bunch of different little processing functions need to be written for specific tasks. The most common is going from utf8 to utf32 in order to map characters to run-time operations. For example, mapping characters in text to quads that render letters on the screen is easily done by mapping ints to quads, so we want a utf8 to utf32 function that decodes a single codepoint. Most use-cases just want to decode a single character at a time:

      const char* utf8_text = text;
      int cp; // utf32 codepoint
      utf8_text = tuDecode8( utf8_text, &cp );
      // do something with cp ...
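
      As a sketch of the quad-mapping idea (Quad, GetQuadForCodepoint, and DrawQuad are hypothetical stand-ins for whatever the renderer actually uses, not part of tinyutf.h), decoding a whole string one codepoint at a time looks like:

      const char* p = utf8_text;
      while ( *p )
      {
        int cp;
        p = tuDecode8( p, &cp );            // decode one utf32 codepoint
        Quad q = GetQuadForCodepoint( cp ); // hypothetical glyph lookup
        DrawQuad( q );                      // hypothetical render call
      }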

      Or perhaps the user is typing in letters one at a time, and we're grabbing them off of some kind of event signaler. Again, just encoding a single codepoint at a time:

      // inside the input event loop
      int key = PressedKey( );
      if ( !IsLetter( key ) ) continue;
      utf8_text = tuEncode8( utf8_text, key );

