To: "REALbasic Tips" <realbasic-tips at lists dot realsoftware dot com>
Subject: REALbasic Tip: Working with text files? You need to know about text encodings.
From: Geoff Perlman <geoff at realsoftware dot com>
Date: Tue, 2 Dec 2003 16:20:15 -0600

Introduction
The way REALbasic reads text files and saves data to text files has changed in version 5 to fully support Unicode (a type of text encoding). If you are creating applications that open, create or modify text files, you will need to understand how text encodings work and what changes you may need to make to your code to make sure it continues to work properly.

Text Encodings: From ASCII to Unicode
As you probably already know, computers don't really store or understand characters. They store everything as a number. For example, the carriage return is character number 13. When the computer industry was in its infancy, each computer maker came up with its own numbering scheme. As a result, information could not easily be exchanged between computers made by different manufacturers. It quickly became clear that a standard character encoding scheme was needed. In 1963 the American Standards Association (which later changed its name to the American National Standards Institute) announced the American Standard Code for Information Interchange (ASCII), which was based on the character set available on an English-language typewriter. Finally computers could exchange information easily.

Over the years, computers became more and more popular outside of the United States and ASCII started to show its weaknesses. ASCII defines only 128 characters, which barely covers what is available on an English-language typewriter. Some languages (like French and German) use accented characters that were not defined as part of the ASCII specification. So if a German Mac user sent a file with the word "öffnen" (which means "open") in it to a friend running Windows, the word would not display properly, because the accented character is not part of the ASCII standard and its character number differs between the Mac and Windows. The problem is even worse for users of languages that have large character sets, like Japanese. Because there are so many characters, most characters require two bytes of data (rather than the one byte per character used by ASCII). Apple eventually created various text encodings to make it easier to manage data. MacRoman is a text encoding that extends ASCII with the accented characters used by Western European languages. MacJapanese is a text encoding for files that store Japanese characters. There are others as well. But these encodings were Mac-specific. They didn't make exchanging data with other operating systems any easier, and mixing data with different encodings (typing a sentence in Japanese in the middle of an English-language document, for example) was problematic.
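
To make the "öffnen" example concrete, here is a minimal sketch in REALbasic 5 that converts the letter "ö" to two different single-byte encodings and displays the number each one stores. It uses the ConvertEncoding function covered later in this tip and assumes Encodings.WindowsANSI is available alongside Encodings.MacRoman:

  dim mac, win as String
  // Convert the same character to two single-byte encodings
  mac = ConvertEncoding("ö", Encodings.MacRoman)
  win = ConvertEncoding("ö", Encodings.WindowsANSI)
  // AscB returns the value of the first byte, regardless of encoding
  MsgBox "MacRoman: " + Str(AscB(mac)) + "  Windows ANSI: " + Str(AscB(win))

MacRoman stores "ö" as character number 154 while Windows ANSI stores it as 246, which is why a file written with one encoding shows the wrong character when it is read as the other.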

In 1986, people working at Xerox and Apple Computer had different problems to solve that required the same solution. Before long, the concept of a universal character encoding that contained all the characters for all languages became the obvious solution. The universal encoding was dubbed "Unicode" by one of the people at Xerox who helped to create it. Unicode solves all of these problems. Any character you need from any language is supported and will be the same character on any computer that supports Unicode. And as a bonus, you can mix characters from different languages together in one document since all are defined in Unicode.

Unicode support began appearing on the Mac with System 7.6 and on Windows with Windows 95. You could translate files between other text encodings and Unicode, but Unicode was still the exception and not the rule. It wasn't until Mac OS X and Windows 2000 that Unicode became the standard.

Computer users are now in a transition. Some are still using older systems where Unicode is not the standard, while all new systems running Mac OS X, Windows or Linux use Unicode as the standard encoding. As a result, you may have to deal with text files in different encodings for a while. At some point in the future, other encodings may be so rare that you can assume all files are in Unicode format, but until then you may need to modify your code so that your application operates properly when it encounters text in different encodings.

Changing Your Code To Handle Text Encodings
Unfortunately, there is no 100% accurate way to determine the encoding of a file. You have to know what encoding the file is using. If it's coming from an English user of Mac OS 9 (for example) you can probably assume it's MacRoman (but it never hurts to ask). If it's coming from an English user of Windows 95, it's probably Windows ANSI.

This example code reads data from a text file and displays it in an EditField. It makes no assumptions about encoding and, as a result, if it's not a Unicode file, it probably won't display properly. This example has been kept simple intentionally to focus on the encoding issue:

  dim f As FolderItem
  dim t as TextInputStream
  f = GetFolderItem("Sample.txt")
  t = f.OpenAsTextFile
  // No encoding is specified, so the text is assumed to be UTF-8
  editfield1.text = t.ReadAll
  t.close

If you know the encoding, this code can easily be changed to read the file properly. In this example, the Encoding property of the TextInputStream is set to MacRoman. Since the example file is in MacRoman format, it displays properly.

  dim f As FolderItem
  dim t as TextInputStream
  f = GetFolderItem("Sample.txt")
  t = f.OpenAsTextFile
  // Tell the stream which encoding the file uses before reading from it
  t.Encoding = Encodings.MacRoman
  editfield1.text = t.ReadAll
  t.close

The Encodings object contains functions for all the different encodings.
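
For instance, a file that came from an English user of Windows 95 could be read the same way by choosing a different member of the Encodings object. This is only a sketch: it assumes Encodings.WindowsANSI is the right choice for the file, which (as noted above) you have to know or ask about:

  dim f As FolderItem
  dim t as TextInputStream
  f = GetFolderItem("Sample.txt")
  t = f.OpenAsTextFile
  // Assumption: the file was saved as Windows ANSI by its author
  t.Encoding = Encodings.WindowsANSI
  editfield1.text = t.ReadAll
  t.close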

If your application needs to write out a file in a particular encoding, you must specify that each time you write out data. Here's a simple example that writes text from an editfield to a file specifying MacRoman as the encoding. It uses the ConvertEncoding function to convert the encoding of the text to MacRoman before writing it to the file:

  dim f As FolderItem
  dim t as TextOutputStream
  f = GetFolderItem("Sample.txt")
  t = f.CreateTextFile
  // Convert the text to MacRoman before it is written to the file
  t.Write ConvertEncoding(Editfield1.text, Encodings.MacRoman)
  t.close
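
The reading and writing examples can be combined to convert an older file to Unicode. The following is a rough sketch rather than a finished utility: the output file name "Sample UTF8.txt" is made up for illustration, the source file is assumed to be MacRoman, and error checking is left out to keep the focus on the encoding calls:

  dim source, dest As FolderItem
  dim inStream as TextInputStream
  dim outStream as TextOutputStream
  dim s as String
  source = GetFolderItem("Sample.txt")
  dest = GetFolderItem("Sample UTF8.txt") // hypothetical output file
  // Read the old file, telling REALbasic that its contents are MacRoman
  inStream = source.OpenAsTextFile
  inStream.Encoding = Encodings.MacRoman
  s = inStream.ReadAll
  inStream.close
  // Write the same text back out, converted to UTF-8
  outStream = dest.CreateTextFile
  outStream.Write ConvertEncoding(s, Encodings.UTF8)
  outStream.close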

If your application reads and writes only its own files, you don't have to worry about this issue. UTF-8 (a specific format of Unicode) is assumed both when reading text files and when writing text to a file. So if you do nothing, your files will be read and written in Unicode format.
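
As an illustration of that default, here is a small sketch that saves text and reads it back without mentioning an encoding anywhere. The file name "MyData.txt" and the second field, editfield2, are made up for this example:

  dim f As FolderItem
  dim w as TextOutputStream
  dim r as TextInputStream
  f = GetFolderItem("MyData.txt") // hypothetical file owned by your application
  // Write the text; no conversion is requested
  w = f.CreateTextFile
  w.Write editfield1.text
  w.close
  // Read it back; no Encoding is set, so UTF-8 is assumed
  r = f.OpenAsTextFile
  editfield2.text = r.ReadAll
  r.close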

More Information on Text Encodings
This tip covers reading and writing files. This is the area where you are most likely to encounter encoding issues. There are other areas, but they are less common. To learn more about text encoding, read Joe Strout's Text Encoding FAQ: <http://support.realsoftware.com/listarchives/realbasic-nug/2003-11/msg00389.html>

There's also a good article called "Text Encodings Explained" by Matt Neuburg in the current issue of REALbasic Developer magazine:
<http://www.rbdeveloper.com/>
--
Geoff Perlman
President and CEO
REAL Software, Inc.
512-328-7325 x711 (voice)
512-328-7372 (fax)

- - - - - - - - - -
Got a useful tip to share? Send it to us at:
<REALbasic-tips at lists dot realsoftware dot com>
