Introduction
The way REALbasic reads text files and saves data to text files has
changed in version 5 to fully support Unicode (a type of text
encoding). If you are creating applications that open, create or modify
text files, you will need to understand how text encodings work and
what changes you may need to make to your code to make sure it
continues to work properly.
Text Encodings: From ASCII to Unicode
As you probably already know, computers don't really store or
understand characters. They store everything as a number. For example,
the carriage return is character number 13. When the computer industry
was in its infancy, each computer maker came up with their own
numbering scheme. As a result, information could not easily be
exchanged between computers made by different manufacturers. It quickly
became clear that a standard character encoding scheme was needed. In
1963 the American Standards Association (which later changed its name
to the American National Standards Institute) announced the American
Standard Code for Information Interchange (ASCII), which was based on
the character set available on an English-language typewriter. Finally,
computers could exchange information easily.
Over the years, computers became more and more popular outside of the
United States and ASCII started to show its weaknesses. For example,
some languages (like French and German) use accented characters which
were not defined as part of the ASCII specification. ASCII only defines
128 characters. That barely covers what is available on an
English-language typewriter. So if a German Mac user sent a file with
the word "öffnen" (which means "open") in it to his buddy who is
running Windows, the word would not display properly because the
accented character is not part of the ASCII standard and is therefore
stored as a different number on the Mac than on Windows. The problem is
even worse for users of languages that have large character sets like
Japanese. Because there are so many characters, most characters require
two bytes of data (rather than one byte per character in ASCII). Apple
eventually created various text encodings to make it easier to manage
data. MacRoman is a text encoding that extends ASCII with the accented
characters used in Western European languages. MacJapanese is a text
encoding for files that store Japanese characters. There are others as
well. But these encodings were Mac-specific. They didn't make
exchanging data with other operating systems any easier, and mixing
data with different encodings (typing a sentence in Japanese in the
middle of an English-language document, for example) was problematic.
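To see why the word breaks, it can help to look at the number each encoding assigns to "ö". The following is a minimal sketch, assuming it is pasted into something like a PushButton's Action event and that the string literal is stored as UTF-8 (as it should be in version 5). The variable names are only for illustration. It uses ConvertEncoding and the Encodings object, both covered later in this tip, together with AscB, which returns the value of the first byte of a string:

```realbasic
Dim s As String
Dim mac As String
Dim win As String

s = "ö"

// Re-encode the same character for each platform
mac = ConvertEncoding(s, Encodings.MacRoman)
win = ConvertEncoding(s, Encodings.WindowsANSI)

// AscB returns the value of the first byte of each string
MsgBox "MacRoman: " + Str(AscB(mac)) + ", Windows ANSI: " + Str(AscB(win))
```

MacRoman stores "ö" as byte 154, while Windows ANSI stores it as byte 246. In Windows ANSI, byte 154 is "š", so the Windows user in the example would most likely see "šffnen".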
In 1986, people working at Xerox and Apple Computer both had different
problems to solve that required the same solution. Before long, the
concept of a universal character encoding that contained all the
characters for all languages became the obvious solution. The
universal encoding was dubbed "Unicode" by one of the people at Xerox
who helped to create it. Unicode solves all of these problems. Any
character you need from any language is supported and will be the same
character on any computer that supports Unicode. And as a bonus, you
can mix characters from different languages together in one document
since all are defined in Unicode.
Unicode support began appearing on the Mac with System 7.6 and on
Windows with Windows 95. You could translate files between other text
encodings and Unicode but Unicode was still the exception and not the
rule. It wasn't until Mac OS X and Windows 2000 that Unicode became
the standard.
Computer users are now in a transition. Some are still using older
systems where Unicode is not the standard, while all new systems
running Mac OS X, Windows or Linux use Unicode as the standard
encoding. As a result, you may have to deal with text files of
different encodings for a while. At some point in the future, other
encodings may be so rare that you can assume all files are in Unicode
format, but until then you may need to modify your code so that your
application operates properly when it encounters text in a different
encoding.
Changing Your Code To Handle Text Encodings
Unfortunately, there is no 100% accurate way to determine the encoding
of a file. You have to know what encoding the file is using. If it's
coming from an English user of Mac OS 9, for example, you can probably
assume it's MacRoman (but it never hurts to ask). If it's coming from
an English user of Windows 95, it's probably Windows ANSI.
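When you can't ask, one common fallback is to test whether the bytes are valid UTF-8 and, if they are not, to treat them as the encoding you think is most likely. The sketch below is only a heuristic, not a reliable detector. ReadTextGuessingEncoding and its fallback parameter are hypothetical names invented for this example, and it assumes the IsValidData method of the TextEncoding class is available in your version of REALbasic:

```realbasic
Function ReadTextGuessingEncoding(f As FolderItem, fallback As TextEncoding) As String
  // Hypothetical helper, not part of REALbasic
  Dim t As TextInputStream
  Dim s As String

  If f = Nil Then Return ""
  If Not f.Exists Then Return ""

  t = f.OpenAsTextFile
  If t = Nil Then Return ""

  // With no Encoding set, the bytes are labeled as UTF-8 but not changed
  s = t.ReadAll
  t.Close

  // Assumption: IsValidData reports whether the bytes are legal UTF-8
  If Not Encodings.UTF8.IsValidData(s) Then
    // Not UTF-8, so re-label the same bytes with the fallback encoding
    s = DefineEncoding(s, fallback)
  End If

  Return s
End Function
```

It could be called like this: EditField1.Text = ReadTextGuessingEncoding(GetFolderItem("Sample.txt"), Encodings.MacRoman). Plain ASCII files pass the UTF-8 test, which is harmless because ASCII characters have the same values in UTF-8, MacRoman and Windows ANSI. A file in another encoding can still happen to contain valid UTF-8 byte sequences, so treat the result as a guess.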
This example code reads data from a text file and displays it in an
EditField. It makes no assumptions about encoding and, as a result, if
the file is not in Unicode, it probably won't display properly. This
example has been kept simple intentionally to focus on the encoding
issue:
Dim f As FolderItem
Dim t As TextInputStream
f = GetFolderItem("Sample.txt")
t = f.OpenAsTextFile
// No encoding is set, so the text is assumed to be UTF8
EditField1.Text = t.ReadAll
t.Close
If you know the encoding, this code can easily be changed to read the
file properly. In this example, the Encoding property of the
TextInputStream is set to MacRoman. Since the example file is in
MacRoman format, it displays properly.
Dim f As FolderItem
Dim t As TextInputStream
f = GetFolderItem("Sample.txt")
t = f.OpenAsTextFile
// Tell the stream how the bytes in the file are encoded
t.Encoding = Encodings.MacRoman
EditField1.Text = t.ReadAll
t.Close
The Encodings object contains functions for all the different encodings.
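For example, a file from a Windows 95 user could be read the same way with Encodings.WindowsANSI. This sketch also adds Nil checks, which the examples above leave out to stay short. It assumes that GetFolderItem or OpenAsTextFile may return Nil when the file can't be found or opened:

```realbasic
Dim f As FolderItem
Dim t As TextInputStream

f = GetFolderItem("Sample.txt")
If f <> Nil Then
  If f.Exists Then
    t = f.OpenAsTextFile
    If t <> Nil Then
      // The file is assumed to come from an English-language Windows system
      t.Encoding = Encodings.WindowsANSI
      EditField1.Text = t.ReadAll
      t.Close
    End If
  End If
End If
```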
If your application needs to write out a file in a particular encoding,
you must specify that each time you write out data. Here's a simple
example that writes text from an editfield to a file specifying
MacRoman as the encoding. It uses the ConvertEncoding function to
convert the encoding of the text to MacRoman before writing it to the
file:
Dim f As FolderItem
Dim t As TextOutputStream
f = GetFolderItem("Sample.txt")
t = f.CreateTextFile
// Convert the text to MacRoman before writing it to the file
t.Write ConvertEncoding(EditField1.Text, Encodings.MacRoman)
t.Close
If your application reads and writes its own files, you don't have to
worry about this issue. UTF8 (a specific format of Unicode) is assumed
both when reading text files and when writing text to a file. So if
you do nothing, your files will be read and written in Unicode format.
More Information on Text Encodings
This tip covers reading and writing files. This is the area where you
are most likely to encounter encoding issues. There are other areas,
but they are less common. To learn more about text encoding, read Joe
Strout's Text Encoding FAQ:
<http://support.realsoftware.com/listarchives/realbasic-nug/2003-11/msg00389.html>
There's also a good article called "Text Encodings Explained" by Matt
Neuburg in the current issue of REALbasic Developer magazine:
<http://www.rbdeveloper.com/>
--
Geoff Perlman
President and CEO
REAL Software, Inc.
512-328-7325 x711 (voice)
512-328-7372 (fax)