Text Encoding FAQ

To: "Getting Started" <gettingstarted at lists dot realsoftware dot com>
Subject: Text Encoding FAQ
From: "Joseph J. Strout" <joe at realsoftware dot com>
Date: Sun, 23 Nov 2003 12:49:24 -0600

Frequently Asked Questions
about Text Encoding in REALbasic 5
----------------------------------

0. What's all this about text encodings?  Isn't a string just a string?

Not to a computer. A computer contains only numbers; in particular, it contains bytes, each of which has the value 0-255. A string contains a bunch of bytes. But usually, you don't want to think of them as bytes; you want to think of them as representing some text which you can read, find words in, and so on. So early computer developers had to answer the question: how are we going to map these numbers into text? The mapping from numbers to text (and vice versa) is known as a text encoding.

The earliest and most widely accepted standard encoding is known as "ASCII" (American Standard Code for Information Interchange). ASCII values range from 0 to 127, and represent all the uppercase and lowercase English letters, the digits from 0-9, and a variety of common punctuation marks. The text you're reading can be represented entirely in ASCII.

But ASCII is insufficient for representing text in almost any language other than English. For example, it has none of the accented characters needed by many European languages. So a number of extensions to ASCII were created that use byte values 128-255 to represent additional characters. But different people need different additional characters, so there is a wide variety of single-byte encodings of this sort: MacRoman, MacIcelandic, ISO-Latin-1, and so on.

And for many languages, 256 different characters just aren't enough. Japanese and Chinese require tens of thousands of characters, for example. Again there were various encodings developed to solve this problem, such as MacJapanese and Shift-JIS. These use one byte for each ASCII character, but typically two bytes for a non-ASCII character (and are commonly referred to as "double-byte character sets").

But this proliferation of encodings presents a big problem: you can no longer tell what text a bunch of bytes is meant to represent, just by looking at the bytes. You need to also know the encoding associated with them. So, in REALbasic, every string contains both the bytes and the encoding (which may be nil if the encoding is unknown or undefined). This is how REALbasic knows how to draw the string as text, do text-based operations like InStr and Mid, etc.

In recent years, an industry group called the Unicode Consortium has been developing a new standard designed to supersede all the others. Unicode can represent every character in every writing system on the planet, all in one encoding. This has been embraced by all major OS vendors, and also by REALbasic. Unicode allows you to represent different languages (e.g., a mixture of Greek and Japanese) in one string, and eventually it will mean that you can safely assume that any text you receive is Unicode, rather than having to handle hundreds of possible encodings. (See http://www.unicode.org/ for more info straight from the source.)


1. OK, so a Unicode string is a Unicode string?

Well, not quite. Unicode is a mapping between numbers and text. But we still have to map the numbers into bytes, since a byte can contain only 0-255 and Unicode values can be much larger. REALbasic supports two different formats of Unicode: UTF-8 and UTF-16. See question 2.


2. What encoding are my string literals, constants, etc. in?

All strings in your REALbasic project are compiled as UTF-8. This is a Unicode encoding that uses one byte for each ASCII character, and two to four bytes for each non-ASCII character. It has a number of other handy properties too; for example, a byte in the ASCII range (0-127) never appears as part of a multi-byte character.
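
One way to see the variable-width property is to compare Len, which counts characters, with LenB, which counts bytes. The following is only a sketch; the variable name is made up, and the accented letter is built with Chr so that the source stays plain ASCII:

```realbasic
Dim s As String

// "caf" is a UTF-8 literal; code point 233 is e-acute
s = "caf" + Encodings.UTF8.Chr(233)

// Len counts characters, LenB counts bytes. The e-acute
// takes two bytes in UTF-8, so this should report
// 4 characters and 5 bytes.
MsgBox "Characters: " + Str(Len(s)) + ", bytes: " + Str(LenB(s))
```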


3. So when do I need to care about encodings?

Usually, not at all. However, when you receive text data from some outside source, such as a database or file, you need to let REALbasic know what encoding that text is in. You can use the DefineEncoding method to do this, or in RB 5.1 or later, you can set the Encoding property of the TextInputStream, or use the optional encoding parameter of functions like Read, ReadLine, and ReadAll.

And sometimes, you'll need to provide text to another app which requires it to be in a certain encoding. In that case, use ConvertEncoding to change your text into that other encoding.
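
The difference between the two calls is easy to mix up: DefineEncoding only labels the bytes you already have, while ConvertEncoding produces new bytes. Here is a rough sketch, assuming the incoming bytes are known to be ISO Latin-1 (the variable names are just for illustration):

```realbasic
Dim raw, txt, utf8Text As String

// Four bytes from an outside source: "caf" plus byte &hE9,
// which is e-acute in ISO Latin-1.
raw = ChrB(&h63) + ChrB(&h61) + ChrB(&h66) + ChrB(&hE9)

// DefineEncoding labels the bytes without changing them.
txt = DefineEncoding(raw, Encodings.ISOLatin1)

// ConvertEncoding rewrites the bytes for the target encoding.
utf8Text = ConvertEncoding(txt, Encodings.UTF8)

// txt is 4 bytes long; utf8Text should be 5 bytes, because
// e-acute needs two bytes in UTF-8. Both are 4 characters.
MsgBox Str(LenB(txt)) + " -> " + Str(LenB(utf8Text))
```

If you call ConvertEncoding on a string whose encoding was never defined, REALbasic has no way to know what the bytes mean, so define the encoding first.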


4. Which is faster, ConvertEncoding or TextConverter.Convert?

In most cases, ConvertEncoding is much faster than using TextConverter.Convert. ConvertEncoding has a number of optimizations for common cases, such as converting the same string multiple times, or converting from one superset of ASCII to another. (All WorldScript encodings, most Windows encodings, and UTF-8 are supersets of ASCII.)

So, you should usually use ConvertEncoding, but if you really need the speed then you should just measure it both ways and see which performs better in your particular situation.
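
A simple way to measure is to time a loop around each call with Microseconds. This is only a sketch: the sample text, the loop count, and the choice of MacRoman-to-UTF-8 are arbitrary assumptions, so substitute your real data and encodings.

```realbasic
Dim i As Integer
Dim t0, t1, t2 As Double
Dim src, result As String
Dim conv As TextConverter

src = ConvertEncoding("Some sample text", Encodings.MacRoman)

// Time ConvertEncoding
t0 = Microseconds
For i = 1 To 10000
  result = ConvertEncoding(src, Encodings.UTF8)
Next
t1 = Microseconds

// Time TextConverter.Convert, reusing one converter
conv = GetTextConverter(Encodings.MacRoman, Encodings.UTF8)
For i = 1 To 10000
  result = conv.Convert(src)
Next
t2 = Microseconds

MsgBox "ConvertEncoding: " + Str(t1 - t0) + " us, TextConverter: " + Str(t2 - t1) + " us"
```

Keep in mind that converting the same string over and over is one of the cases ConvertEncoding optimizes, so a loop like this may flatter it; feeding in a variety of real strings gives a fairer picture.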


5. How do I get a specific byte into a string?

Use ChrB. ChrB takes a byte value (0-255) and returns a string with undefined encoding, containing exactly that byte. You can build a string containing multiple bytes by just adding these together.

Of course, don't expect such a string to display as text in any sensible way. If you want to make text, see the next question.
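
For instance, here is a small sketch that builds a four-byte string and reads the first byte back (the byte values are arbitrary examples):

```realbasic
Dim s As String

// Build a four-byte string: &h89 followed by the bytes for "PNG"
s = ChrB(&h89) + ChrB(&h50) + ChrB(&h4E) + ChrB(&h47)

// LenB counts bytes, and AscB reads the first byte back as a
// number, so this should show 4 bytes and a first byte of 137.
MsgBox Str(LenB(s)) + " bytes, first byte = " + Str(AscB(s))
```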


6. How do I get a specific character by its code point (or "ASCII value")?

Use TextEncoding.Chr. This returns a one-character string with the character you specified by its code point within that encoding. For example, a capital A in the ASCII character set would be:

   s = Encodings.ASCII.Chr(65)

A copyright symbol represented in UTF-8 would be:

   s = Encodings.UTF8.Chr(169)


7. How do I find the code point of a given character?

Use the Asc function. This returns the code point of the first character of the given string, in the encoding of that string. So, for example, if you have a string s in any variant of Unicode, then Asc(s) is the Unicode code point of the first character of s.
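
As a quick round trip with the copyright symbol from the previous question (a sketch only):

```realbasic
Dim s As String

s = Encodings.UTF8.Chr(169)   // copyright symbol, U+00A9

// Asc works on characters, so it should give back 169.
MsgBox Str(Asc(s))

// AscB works on bytes, so it returns only the first byte of
// the two-byte UTF-8 sequence (&hC2, or 194).
MsgBox Str(AscB(s))
```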


8. What encoding do I get when I add two strings together?

When you concatenate two strings (e.g. A + B), if the two have the same encoding, then the result is in the same encoding. If one encoding is a superset of the other -- e.g., as MacRoman is a superset of ASCII -- then the result is that encoding (MacRoman in our example). Note that most encodings, with the notable exception of UTF-16, are supersets of ASCII, so in most cases adding an ASCII string to some other string will result in the encoding of that other string. Finally, if you add two strings of incompatible encodings -- say, MacRoman and UTF-8, or MacJapanese and MacIcelandic -- then both strings will be converted internally to UTF-8, and the result will be represented in UTF-8.
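
You can check these rules yourself by asking each result for its encoding. A sketch, using made-up variable names and the TextEncoding InternetName property to get a readable name:

```realbasic
Dim plain, roman, japanese, result As String

plain = ConvertEncoding("abc", Encodings.ASCII)
roman = ConvertEncoding("def", Encodings.MacRoman)
japanese = ConvertEncoding("ghi", Encodings.MacJapanese)

// ASCII + MacRoman: MacRoman is a superset of ASCII, so by
// the rule above the result should be MacRoman.
result = plain + roman
MsgBox Encoding(result).InternetName

// MacRoman + MacJapanese: incompatible encodings, so by the
// rule above the result should be UTF-8.
result = roman + japanese
MsgBox Encoding(result).InternetName
```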


9. How do I find out what encoding a string is in?

Use the Encoding function, which can be called in either of two ways. Like this:

  enc = Encoding(s)

or like this:

  enc = s.Encoding

This returns a TextEncoding object, or if the string's encoding is undefined, it returns nil.
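
Since the result can be nil, it is worth checking before you use it. A minimal sketch (the sample bytes are arbitrary):

```realbasic
Dim s As String
Dim enc As TextEncoding

s = ChrB(200) + ChrB(201)   // raw bytes, encoding undefined

enc = Encoding(s)
If enc = Nil Then
  MsgBox "Encoding is undefined"
Else
  MsgBox "Encoding: " + enc.InternetName
End If
```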


10. When I write Unicode text to a text file, why do some other apps fail to properly load and render the text?

The problem is that there is no standard file type or file-name extension that distinguishes a UTF-8 text file from a file in some legacy encoding, like MacRoman. For backwards compatibility, many text file readers will assume that an unknown file is in some common legacy encoding rather than UTF-8, unless you specifically tell them otherwise (through some option in the Preferences or file-open dialog). In addition, if you're using UTF-16, then endian issues come into play: PCs usually write the low-order byte of each character first, while other computers write the high-order byte first. Getting the endianness wrong will turn a UTF-16 file into gibberish.

However, there is a trick that may help in both cases. You can add a special character known as a "Byte Order Mark" (or BOM for short) to the beginning of the file. This is character U+FEFF, which normally means "zero width no-break space". Many apps will interpret this character at the start of a file as a signature indicating a Unicode file with a particular encoding and endianness. And those which don't should simply render it as an invisible character.

To use this to tag a UTF-8 file, just write Encodings.UTF8.Chr(&hFEFF) as the first character in the file. The file name should end in ".txt" in this case. For example:

   f = GetFolderItem("sample.txt")
   outp = f.CreateTextFile
   outp.Write Encodings.UTF8.Chr(&hFEFF)   // BOM goes first
   outp.Write ConvertEncoding(myData, Encodings.UTF8)
   outp.Close

For a UTF-16 file, you would use Encodings.UTF16.Chr(&hFEFF) as the first character in the file, and the name should end in ".utxt".

When reading the file back in, be sure to check whether the first character of the first line equals the BOM character, and if so, strip it off like so:

   data = inp.ReadAll
   If Left(data,1) = Encodings.UTF8.Chr(&hFEFF) Then
      data = Mid(data,2)
   End If

The above would work for a UTF-8 file; for a UTF-16 file it would be similar.

For more information on the BOM, see:
   <http://www.unicode.org/unicode/faq/utf_bom.html>


11. How do I assign the encoding of a string when I read it from a file, socket, etc.?

The easiest way is to use REALbasic 5.2 or later, where all Read methods take an optional "encoding" parameter. Simply pass in an encoding (e.g. Encodings.UTF8), and the string read will be defined as that encoding. In addition, a TextInputStream has an encoding property, which defaults to UTF-8; any strings returned by the stream will be defined as that encoding, unless you override it by passing an encoding to the Read method.

In RB 5.0 or 5.1, these facilities are not available, so you must instead use DefineEncoding after reading your string.
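
Here is a sketch of the DefineEncoding approach, which should also work in later versions. It assumes a file named "sample.txt" that you already know to be MacRoman; both the file name and the encoding are placeholders.

```realbasic
Dim f As FolderItem
Dim inp As TextInputStream
Dim data As String

f = GetFolderItem("sample.txt")
inp = f.OpenAsTextFile
If inp <> Nil Then
  data = inp.ReadAll
  inp.Close
  // Label the bytes as MacRoman; the bytes are not changed.
  data = DefineEncoding(data, Encodings.MacRoman)
End If
```

With the optional encoding parameter described above, the ReadAll and DefineEncoding lines collapse into a single call such as inp.ReadAll(Encodings.MacRoman).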


--
,------------------------------------------------------------------.
|    Joseph J. Strout           REAL Software, Inc.                |
|    joe at realsoftware dot com       http://www.realsoftware.com        |
`------------------------------------------------------------------'
