Department of Engineering

University of Cambridge > Engineering Department > computing help

Character sets

Contents

Introduction
Encodings and character sets
Unicode
Mime
WWW
Other Language Issues
Other applications
- Emacs
- LaTeX
- Java
See Also

Introduction

Files are made up of bytes. Each byte stores a number from 0 to 255. When a file is displayed as text, what happens?

A long time ago, things were "easy" (because there was so little choice).

On IBM mainframes the EBCDIC system was used: a byte with a value 193 represented 'A', 194 represented 'B', etc
On most other machines ASCII was used: 65=A, 66=B, etc

Not all the possible numbers in the range 0-255 represented a character. Sometimes they represented a newline or a beep. Sometimes they had no meaning at all. In ASCII, for example, values greater than 127 had no meaning.

However, non-english-speakers soon wanted to use their own languages with many extra characters. Many varieties of ASCII appeared, and it wasn't always obvious from looking at the file which version of ASCII the author had used. Languages with more than 256 characters required more than one byte to represent each character. The situation became more confusing as files were increasingly exchanged between applications and machines. In particular e-mail and the WWW posed problems.

Encodings and character sets

One solution is to keep many encodings (i.e. mappings from bytes to characters) but to standardise them. As long as any program that reads a file can find out the encoding used when writing the file, the contents should be faithfully interpreted.

In the simplest case, one byte still corresponds to one character according to some mapping (encoding). There are several different encodings, such as US-ASCII and the ISO Latin family of encodings. The correct interpretation and processing of character data requires knowledge about the encoding used. Previously an ASCII encoding was usually assumed by default. Nowadays things are more standardised. ISO Latin 1 (defined in the ISO 8859-1 standard, which in turn is part of the ISO 8859 family of standards) is often the default and can be regarded as an extension of ASCII.

Many of these encodings use the same values as ASCII to represent common characters, but differ in the details, so you may find that even if your program treats ISO Latin 3 as uS-ASCII, many characters will look ok. Proprietary encodings aren't always well handled. In ISO 8859-1, code positions 128 - 159 are explicitly reserved for control purposes. The so-called Windows character set (WinLatin1) uses some of those positions for printable characters which can cause trouble if a WinLatin1-encoded file is treated as if it were an ISO Latin 1 file.

Unicode

An alternative approach to having hundreds of character sets is to create an all-inclusive character set with a huge number of characters in it. ISO 10646 is an international standard that defines UCS, Universal Character Set. Currently tens of thousands of characters have been defined, and new amendments are defined fairly often. Unicode is a standard, by the Unicode Consortium, intended to be fully compatible with ISO 10646, and an encoding for it.

The "native" Unicode encoding, UCS-2, presents each code number as two consecutive bytes m and n so that the number equals 256*m+n (which is in the range 0-65535). ISO 10646 can be, and often is, encoded in other ways, to save space. For example in UTF-8 character codes less than 128 (effectively, the ASCII repertoire) are presented using one byte for each character. All other codes are presented, according to a relatively complicated method, so that one character is defined by a sequence of two to six bytes, each of which is in the range 128 - 255.

Mime

As said earlier, it's not enough for files to be correctly written, programs that read the files need to know how to interpret the files. "Internet media types", often called MIME media types, can be used to tell an application (in particular a mail program or a web browser) how to deal with an associated file. There's an agreed list of terms to specify major media types (such as text), subtypes (such as html or plain), and an encoding. So for example, in the header of a mail message there might be the line

    Content-Type: text/plain; charset=us-ascii

WWW

On the WWW it's common for a file from one country written using a particular program to be viewed in a different country using a different program. When a browser asks for a file it receives a stream of bytes. Some of those bytes are the page contents but first a 'header' is transmitted that tells the browser about the file. On most Unix systems you can see this info by typing something like

   lynx -head  http://www.eng.cam.ac.uk/

In the header there should be a 'Content-Type' line which help to indicate to the browser how the rest of the message is to be interpreted. Typical is something like

   Content-Type: text/html; charset=iso-8859-1

Knowing this the browser can then

convert the characters it receives into a stream of unicode characters
parse the result as HTML
display the result

HTML

HTML has 2 ways to specify "funny" characters

&entity; representations (for example é). 252 of these characters are available officially - they're symbolic names defined for some numeric character references.
&#number; character references (for example ̕). Note that these numbers refer to the intended character's position in the unicode character set (i.e the HTML "Document Character Set"), and not to the transmitted character coding (which at this stage is no longer relevant - that information is only used to convert the received bytes into unicode).

Browser issues

Not all browsers do as they should.

charset problems -
- There are various ways for the charset to be specified. Browsers should ignore the meta-tag method and honour the server headers
- MIME says the default encoding for a text/* mime type is US-ASCII
- HTTP says the default encoding for text/html is iso-8859-1
- HTML4.0 introduced a rule that there is no charset default; and CERT advisory CA-2000-02 describes some actual security reasons for not omitting an explicit charset specification.
Under the View menu recent versions of Internet Explorer have an Encoding submenu so that users can force the issue. In Mozilla the submenu's entitled Character Coding.
Font problems - Finally, having interpreted the character stream as HTML, the browser needs to render it onto a display. Although browsers may be able to support Unicode encoding schemes (eg utf-8), they typically rely on the platform/OS they run on to show text, so that platforms without complete Unicode support won't be able to show text correctly. E.g. win 98, and Lynx apparently recognises the charset but can only show docs in charsets compatible with that on which the terminal is running.

Writing WWW docs

It can be hard to writing "funny" characters. Sometimes you need specialised software and keyboards. In HTML the whole document is in the one character coding, so you cannot switch between codings in mid document in the way that you might switch fonts. To cope with old and new browsers, A.J. Flavell suggests the following

Compose documents using only the characters of us-ascii
Represent characters that are not in iso-8859-1 by using their &#bignumber; references ("bignumber" > 255)
Represent "8-bit" iso-8859-1 characters by using their &entity; (preferred) or &#number; representations,
Send out the document with its charset advertised as utf-8

The rationale is as follows: Netscape browser versions 4.xx fail to render most of the unicode characters represented by &#bignumber; at most settings of the charset attribute. They can however work when the charset is set to utf-8. It's a very useful fact that us-ascii (7-bit) is a subset of utf-8. However, this forces you to represent your "8-bit" Latin-1 characters by means of &-constructions, unless you know how to generate correctly-coded utf-8 data streams.

The result can validly be advertised with any charset that includes us-ascii, such as us-ascii itself, iso-8859-1, or utf-8. By advertising it as utf-8, Netscape versions 4.xx are capable of displaying it correctly.

Other Language Issues

In HTML and Word Processors one can select fonts (using HTML's FACE, perhaps) and natural languages (LANG in HTML). Both of these concepts are distinct from the other issues so far mentioned. One can write in english using an italian font, and there are many hundreds of fonts that each show 'a' in a different though recognisable way.

Other applications

Emacs

With the emacs editor you can read, edit and write files in a variety of encodings - Chinese-BIG5, Latin-8 (Celtic), Latin-9 (updated Latin-1, with the Euro sign), etc. If you open an existing file emacs will try to guess its encoding system. New files will be in a default encoding which might be determined by your LANG environmental variable. Typing

  M-x describe-coding-system

will display the current settings and

  M-x list-coding-systems

will display all the options. Each coding system has variants to cope with different end-of-line conventions: e.g. iso-latin-1-unix, iso-latin-1-dos and iso-latin-1-mac all exist.

LaTeX

The inputenc package lets the LaTeX word processing program cope with files of various encodings. For example

   \usepackage[latin5]{inputenc}

can deal with ISO Latin 5 encoded files.

Java

The java compiler takes a -encoding command line option so that it can cope with source code in various encodings. If you want to do I/O with various character sets, note that Readers and Writers are Unicode-aware and can perform all the necessary encoding and decoding for you.