Characters and Fonts

Character Sets

ASCII and ye shall receive...

TIP: Don't use special characters (anything other than letters, numbers, or common punctuation) unless you know the standards and how to use them!
The standard character set for computers has traditionally been ASCII (American Standard Code for Information Interchange). (Actually, the current standard version is called "US-ASCII", presumably distinguishing it from non-US ASCIIs.) This group of characters is numbered from 0 through 127, and comprises the upper and lower case letters, numbers, and punctuation, as well as some control characters such as tabs and linefeeds. No provision is made in ASCII for foreign characters or specialized symbols. Hence, various so-called "extended ASCII" sets (a misnomer since these extensions aren't part of the ASCII standard) have been developed to provide these things. Windows, the Mac, and the IBM PC's text-mode have different sets of extended characters (and versions of these in various countries have yet different versions). Since they're widely divergent from one another, the only "safe" characters to transmit and count on the user receiving correctly have traditionally been the "7-bit" characters 0-127.
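If you want to check whether a piece of text stays inside the "safe" 7-bit range, a few lines of Python will do it. This is only a sketch: the function name `non_ascii_chars` is made up for illustration, and it simply reports every character numbered above 127.

```python
def non_ascii_chars(text):
    """Return (position, character, code number) for each character above 127."""
    return [(i, ch, ord(ch)) for i, ch in enumerate(text) if ord(ch) > 127]


sample = "caf\u00e9 menu"            # "café menu"; the é is character #233
print(non_ascii_chars(sample))       # one entry, for the é at position 3
print(non_ascii_chars("plain text")) # empty list: nothing outside 0-127
```

An empty result means the text should survive a trip through any system that handles plain ASCII correctly.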
(Well, usually the 7-bit ASCII characters are "safe," but there have been some exceptions due to nonstandard character set handling of various computer models. For instance, the old Commodore PET, 64, and 128 computers used a weird "PET-ASCII" set with lowercase letters where uppercase should be and uppercase letters somewhere else entirely, with various graphic characters sitting in the normal lowercase section. The earliest Apple II computers had no lowercase letters at all, and instead of doing the sensible thing by mapping those characters onto their corresponding uppercase letters, they instead displayed as random garbage. But, as of now, most computers in common usage get the original ASCII range correctly, aside from the annoyance of the inconsistent handling of carriage returns and linefeeds that makes for headaches in transferring files with line breaks.)
However, the Web is set up to support a broader range of characters. Prior to HTML 4.0, the "standard" character set for HTML was ISO 8859-1 (sometimes referred to as ISO Latin-1), an extended character set with twice as many character positions as ASCII (256 rather than 128). Out of international political-correctness, however, newer versions of HTML have no default character set in order not to favor the languages (mostly Western European) supported by ISO 8859-1 over the ones that aren't. Hence, an explicit "charset" parameter is required in the HTTP content-type header (although the HTTP protocol standards themselves still say that "ISO-8859-1" is the default value for this). The Windows character set is mostly the same as the ISO character set, but the Macintosh set is very different, so you need to be aware of whether your editing program is inserting characters in accordance with the standard or in a proprietary, platform-specific character set. (In recent times, it's getting more popular to use UTF-8, an encoding different from the older traditional ones which supports the entire range of Unicode; this will be discussed more later.) Or you can learn the character numbers of the characters you want and insert them with escape sequences starting with an ampersand, like &#200; for character #200 (È). (Don't forget the semicolon (;) at the end of the escape sequence!) Or, for certain characters, you can use "entity names" like &eacute; for an "e" with an acute accent (é), but these are not always as widely supported on all browsers, so character numbers are "safer" if you know them. (See links below for some sources for character number lists.) Note that, according to the HTML standards, numeric character references are always supposed to be in the ISO 8859-1 character set (or its superset, Unicode, which I discuss below), even if the character encoding is specified as something different. The actual performance of browsers can vary, unfortunately.
Some "entity names" you should use are &amp; for ampersands (&) used as literal characters in your text, so they don't get interpreted as part of a character entity code, and &lt; and &gt; for the less-than (<) and greater-than (>) signs, so they don't get interpreted as part of HTML tags. Some people also replace the double quote (") with &quot;, but this is not necessary except within a tag's quoted attribute value; that's the only place where the quote mark has meaning that must be protected. In normal body text, quotes are harmless. (That is, the standard ASCII single and double quotes, ' and ", are harmless... those non-standard word-processor "smart quotes" are another story!)
NOTE: One particular, often-overlooked instance where you should use an entity name instead of the raw character is for ampersands found within URLs, generally in the parameter string passed to a CGI script. It's common to use a series of parameters separated by ampersands (e.g., stuff.cgi?this=1&that=2), deriving from the standard use of this syntax by browsers generating requests from a Web form. Unfortunately, this syntax isn't proper within an HTML document, and some browsers have been known to interpret parts of such parameter strings as embedded special characters. In particular, the sequence &section=, often found as part of the parameter string for a database query, is sometimes interpreted as containing the entity &sect;, a "section header character" (§). To prevent this from happening, replace & with &amp; anywhere it occurs, or else reprogram your CGI scripts to accept a "safer" parameter separator, like a semicolon. (A semicolon, however, might be a problem if you're using such URLs in META refresh tags, as the semicolon is a reserved separator in such contexts.)
As a final note, when you do intend an ampersand sequence to be interpreted as a character reference, be sure to include the concluding semicolon. It's &amp;, not just &amp. There are complicated syntax rules about when the semicolon is or isn't required, and even more complicated browser-specific error-correction behavior to deal with cases where the syntax rules aren't followed properly (one such error-correction attempt leads to the &sect; problem noted above). It's safest to always terminate your entity references with semicolons, and always "entify" ampersands when they're not intended as the start of an entity reference.
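If you generate your HTML from a program, the escaping rules above can be handled for you. Here is a minimal sketch in Python using the standard `html` module; the helper names `to_ascii_html` and `attribute_value` are my own inventions for this example, not part of any library.

```python
import html


def to_ascii_html(text):
    """Escape &, <, > and turn every non-ASCII character into a numeric reference."""
    escaped = html.escape(text, quote=False)   # handles &, < and > only
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def attribute_value(text):
    """Escape text for use inside a quoted attribute value (quotes included)."""
    return html.escape(text, quote=True)


print(to_ascii_html("AT&T < \u00c8"))               # AT&amp;T &lt; &#200;
print(attribute_value("stuff.cgi?this=1&that=2"))   # stuff.cgi?this=1&amp;that=2
```

Note that both helpers always emit the trailing semicolon, and the second one takes care of the ampersands-in-URLs problem when the URL goes into an HREF attribute.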

Watch out for those not-so-smart quotes!

TIP: Beware of word processors inserting nonstandard characters without you even noticing it!
Don't use platform-specific characters not found in a standardized character set. In particular, many word processors will change your quotes and apostrophes to so-called "smart quotes", which curl to the right or left depending on which side of a quotation they're on. These are not part of the ISO 8859-1 character set, and are likely to have unpredictable effects on Web pages. If you're viewing somebody's Web page and it has something weird like an "AE" ligature where an apostrophe should be, you know the developer used an operating system's proprietary characters that aren't supported by the standard. (For that matter, you shouldn't use these funny characters in e-mail either; they make messages look really weird in mail readers that don't support them. A lot of "spammers," annoying enough already, insist on using this sort of junk to make their messages an even bigger pain. E-mail, to be safe, should stick wherever possible to the standard US-ASCII 7-bit character set; Web pages, as you'll see later, have a larger character repertoire available if you use the proper coding techniques.)
Note that some operating systems put such characters in the range #128-#159, but these are reserved for control characters and are not used for printable characters in the ISO standard. The only control characters you're supposed to use in HTML documents are the tab (#9), the linefeed (#10), and the carriage return (#13). Other control characters from #0-31 and #128-159 are undefined in their effect and are not supposed to be present in standard HTML documents. (Of course, some of these other characters have meaning in various programs and operating systems, but not in Web documents. One character, #7, is the "Bell" character in the official ASCII standard, calling for the computer or terminal to beep when it's received; thank goodness Web browser developers didn't implement this, or all the teenage Web page creators who now fill their pages with <BLINK> tags would also be making pages that beep at you. However, such new "innovations" as the <BGSOUND> tag in some browsers let people do this sort of thing anyway.)
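To see the problem concretely, here is a small Python sketch. It assumes the word processor was using the Windows character set (called `cp1252` in Python), where byte 146 is a curly apostrophe; the `plain_quotes` helper and its translation table are hypothetical names for this example and cover only the four most common curly quotes.

```python
# Byte 0x92 (#146) is a curly apostrophe in Windows-1252,
# but falls in the reserved control range of ISO 8859-1.
raw = b"don\x92t"

as_windows = raw.decode("cp1252")    # what the word processor meant
as_iso = raw.decode("latin-1")       # what the ISO standard says the byte is

SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # single curly quotes
    "\u201c": '"', "\u201d": '"',    # double curly quotes
})


def plain_quotes(text):
    """Replace the four common curly quotes with straight ASCII quotes."""
    return text.translate(SMART_TO_PLAIN)


print(ord(as_windows[3]))            # 8217, the Unicode curly apostrophe
print(ord(as_iso[3]))                # 146, inside the #128-#159 control range
print(plain_quotes(as_windows))      # don't
```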
One thing that was added to HTML 4.0 (but browser makers were slow to implement, so it was seldom used) was the ability to do quotes in a "smart" fashion using the <Q> tag, which can be rendered using appropriate left and right quotes on the browser's target system -- either standard ASCII quotes or "smart quotes" depending on what's available on your system.
Browser support for <Q> slowly arrived over the course of the decade after the standard was released. Mozilla-based browsers have long supported the <Q> tag, but at first only used "straight" quotes, not the curly typographic ones; later, curly quote support was added. Some more obscure browsers like Alis Tango, Cyberdog (for the Mac), and iCab (also for the Mac) used proper typographic quotes for this tag at an early date, and Lynx uses quotes as appropriate for the character set of the terminal being used.
If your browser supports the <Q> tag, this sentence will be shown in quotes!
One more thing on the subject of quotes: A quoting style found often on the Internet is what I call "Unix-Geek Style". It consists of the use of a backquote (`) for the opening quote, and a normal straight quote (') for the closing one. Or, with double quotes, this style uses two backquotes (``) as the opening quote, and a normal double quote (") to close. This looks really strange in most computer fonts currently in use, where the backquotes lean to the side, but the regular quotes are straight. This style is based on an outdated version of the ASCII standard, obsolete at least since the '80s, which implied that the apostrophe character ought to lean forward and complement the backward-leaning backquote. The current standard calls for the single and double quotes to be straight, and that is how modern fonts show it. The old IBM PC monochrome text-mode font did have a leaning apostrophe, but it didn't quite lean at the same angle as the backquote, so they still didn't match well. Some Unix fonts, however, have matching back and forward quotes as the presentation glyphs for these characters, which is why the "Unix geeks" like this style of quoting. This sort of quoting can also be found in news articles from wire services, which probably follow standards based on archaic Teletypes. But since these quotes don't match in most fonts in current use, nor do the current ASCII standards imply that they ought to, I would suggest avoiding this style and using the straight single and double quotes as both opening and closing quotes.

UNICODE

TIP: Be familiar with the Unicode standard and its standard encodings when using characters outside the ASCII range. In the past, you needed to be careful about using them, even in a standards-compliant way, due to inconsistent browser support, but these days they're pretty safe to use if you do it correctly.
As of version 4.0 of the HTML standard (by now an old, established standard), Unicode is the official document character set, meaning that numeric character references are always interpreted with regard to Unicode. This is distinct from the character encoding, which is the character set used to transmit the characters over the network (and possibly also to store the Web pages on the server's file system, though not necessarily, as the server might transform the characters as it transmits them). This encoding has no standard value under the HTML specs, and is supposed to be specified in the HTTP content-type header, but the numerical character references should be unaffected by the chosen encoding of a document.
The first 256 characters of Unicode (#0-#255) are equivalent to the ISO Latin 1 standard, which in turn has its first 128 characters equivalent to the older US-ASCII (with the minor exception that Unicode has chosen not to give any definition as to the functions of the control characters from #0-#31 and #128-#159, leaving them entirely system-specific) so existing Web documents will work the same as always. But additional characters #256 and up are also available, including many other foreign languages, mathematical characters, and more, including curly quotes. (Look at the series of characters beginning with &#8216;. Here's what your browser displays: ‘’‚‛“”„‟) Originally, character numbers went to #65535 (the 16-bit character range), but the standard was later amended to include even higher numbers than that, to encompass character repertoires not originally granted a place in the Unicode system (such as highly specialized symbol sets or characters from obscure or dead languages). The original Unicode set could be represented by files with two bytes per character (which would take twice the space of the equivalent ASCII file), but the current Unicode standard must be encoded in a manner that involves variable numbers of bytes for different characters, if the characters are to be directly represented instead of being given as numeric references as with the "ampersand" codes above. (UTF-8 is the most popular such encoding, as will be mentioned below.)
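The relationship between ampersand codes and Unicode numbers can be checked with Python's standard `html.unescape` function. This is just an illustrative sketch; the particular references in the list are ones mentioned on this page, plus one number above 65535 that I picked arbitrarily.

```python
import html

# Numeric references name Unicode characters, whatever the document encoding is.
for ref in ["&#200;", "&#945;", "&#8216;", "&#8220;"]:
    ch = html.unescape(ref)
    print(ref, "->", ch, "is character number", ord(ch))

# A number above #65535 is still one character, outside the old 16-bit range.
high = html.unescape("&#128512;")
print(len(high), ord(high))
```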
Note that there are some really old browsers that don't support Unicode characters (but by now they're too ancient for most people to need to worry about), and even Unicode-supporting browsers may not have access to all the foreign characters (you may not have a Cyrillic font on your system unless you deal regularly with Russian documents; and Chinese and Japanese documents require thousands of different characters), so don't count on your Web pages being as widely readable. "Smart" quotes and apostrophes are rather more commonly available to users than Cyrillic, etc., and the latest browsers all support the Unicode character references for these. My own innate conservatism led me to continue for a very long time using plain US-ASCII single and double quote characters, which both old and new browsers support, but by now pretty much everybody has Unicode fonts installed which contain most of the characters in question, so these concerns are pretty archaic.

Setting a different character set

The server can send, as part of its MIME type identifier for HTML documents, a character set code like:
Content-type: text/html; charset=iso-8859-1
This tells the browser to expect the document to be in the indicated character set, which would allow the use of special characters from that set without needing escape codes or entity names. Characters inserted with ampersand escape codes will still always be from the Unicode set, while the actual characters of the document will be in the "local" character set selected by the document encoding. (At least that's how it's supposed to work; some browsers may vary.)
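For the curious, here is roughly how a program on the receiving end might pull the charset out of such a header. It's a sketch using Python's standard `email.message.Message` class as a convenient header parser; the wrapper name `charset_from_header` is my own, and real browsers have their own parsing code.

```python
from email.message import Message


def charset_from_header(content_type, default=None):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


print(charset_from_header("text/html; charset=iso-8859-1"))   # iso-8859-1
print(charset_from_header("text/html"))                        # None
```

When the header carries no charset at all, the caller has to fall back on something else, which is where META tags and guesswork come in.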
You might think that, because you're not the server administrator, you're not able to configure this sort of MIME type header. It can be "faked" through a META HTTP-EQUIV tag, but that's not a very good solution as it makes some browsers draw the screen twice, first in the default character set and then in the one you've selected once it realizes it needs to do that. Besides, it's logically a silly thing to specify your character set within the document itself; if the browser doesn't know what character set you're using in advance, how does it know how to interpret any embedded tags that specify one? What if your document were in EBCDIC, a character set with nothing in common with ASCII or ISO-8859-1? Actually, in real life, all character sets commonly used on the Web share the 128 standard ASCII characters, which are all that are needed for standard HTML tags, so that's not actually much of a problem.
Some WYSIWYG editors throw in such a META tag automatically even when your character set is the normal one; this is unnecessary, and produces an annoying screen flicker on some browsers. Not to mention that some of these editors will also merrily use nonstandard characters like the MS-Windows "smart quotes", in which case any header it inserts to the effect that ISO-8859-1 is the character set in use is actually a lie.
On the other hand, if you're creating pages which are intended to be used in situations other than HTTP servers, such as to be placed on CD-ROMs, in which case there are no server headers to identify the character encoding, then the use of a META tag may be the only reasonable option.
But, at any rate, the real server headers are more accessible than you might think; if your site is hosted via the Apache server software (the most popular server), try putting a file named .htaccess in your site's root directory, with this line:
AddType text/html;charset=ISO-8859-1 .html
Replace "ISO-8859-1" with another official character set name if you're using a different character set (e.g., one with Cyrillic, Greek, or Hebrew characters; preferably UTF-8 these days), and replace ".html" with whatever file extension you're using for HTML files if it's different. (You might use different extensions for pages in different character sets, like ".ru.html" for pages in Russian that need the Cyrillic set.)

UTF-8

Of the several encodings for the full Unicode character repertoire, UTF-8 is by far the most popular. Unlike earlier encodings that had one character per byte (or, in the case of some Asian encodings, two bytes per character), UTF-8 uses a variable number of bytes per character. The bytes with decimal values from 0 through 127 are used in the same manner as in US-ASCII and ISO-8859-1, each standing for a single character in the ASCII range, but byte values from 128 to 255 appear only as part of a longer sequence of two to four bytes that together encode one character (the value of the first byte in the sequence determines how many bytes follow). A UTF-8 file needs to be parsed from beginning to end in accordance with the encoding rules in order even to determine how many characters are in it, unlike an ASCII file where the number of characters can be seen in the file size in bytes.
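The variable-length behavior is easy to see in Python. The sketch below prints how many bytes a few sample characters take, and counts characters in a byte string by skipping "continuation" bytes (those of the binary form 10xxxxxx). The function name `utf8_char_count` is made up for this example, and it assumes the input is valid UTF-8.

```python
def utf8_char_count(data):
    """Count characters in UTF-8 bytes by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# An ASCII letter, an accented letter, the euro sign, and a character above #65535
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")

data = "na\u00efve \u20ac5".encode("utf-8")
print(len(data), "bytes,", utf8_char_count(data), "characters")
```

The file size and the character count agree only when everything is in the ASCII range.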
Many editor programs these days support UTF-8 natively, so it may be possible for you to create documents in this encoding, letting you use all the characters in Unicode without having to use the "ampersand codes" noted above. If you do use UTF-8, be sure you get your Web server to announce this properly; the text may look like gibberish if served with an incorrect encoding header.
If you stick to ASCII characters, then you can announce the encoding as UTF-8, ISO-8859-1, US-ASCII, or a number of other values and have it work identically, since the ASCII range is encoded identically in all of these encodings. You can even have whatever characters you wish in the form of ampersand codes in HTML and still have this equivalence, but if you insert any raw characters outside the ASCII range such as an accented letter or a curly quote, you need to know what encoding you're saving them in.
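You can confirm this equivalence yourself with a quick Python sketch. The sample strings are arbitrary; the point is that ASCII-only text (ampersand codes included) produces identical bytes under all three encodings, while a raw accented letter does not.

```python
ENCODINGS = ["ascii", "latin-1", "utf-8"]

plain = "Hello &#233;"                  # ASCII only; the accent is an ampersand code
print({enc: plain.encode(enc) for enc in ENCODINGS})

raw = "\u00e9"                          # a raw é, outside the ASCII range
print(raw.encode("latin-1"))            # b'\xe9'
print(raw.encode("utf-8"))              # b'\xc3\xa9'
```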

Fonts

TIP: The FONT FACE attribute is deprecated; use stylesheets instead. But if you insist on using it, at least know how it works and what to watch out for.
Originally, HTML authors had no way to set fonts; all documents were displayed in the fonts chosen by the browser. Many people think this is the way it ought to be; the author of a browser for a particular platform is better able to choose attractive, readable fonts for that system than the author of a Web site which will be viewed on many different systems. However, current-day browsers support the use of the FONT tag to change fonts. This is done with <FONT FACE="Arial,Helvetica,Sans">, where the value of the attribute is a list of font names. If the first font in the list is available on the user's system, it will be used; if not, the second one will be used if available, and so on. If none are available, the normal default font is used.
This is a feature to use very carefully if you use it at all. You don't know for sure what fonts are available on a user's system, which may be running under Windows, MacOS, UNIX, or another platform. Maybe one user has a font named "Arial" that's totally different from the "Arial" you're familiar with. Maybe the user's implementation of "Helvetica" can't scale well to the point sizes your document needs. You can easily produce a total mess on some users' systems. These days, now that Cascading Style Sheets are widely supported, the stylesheet is the proper place to be suggesting fonts, and I've pretty much completely eliminated the use of FONT tags myself (not that I was ever a heavy user of them in the first place). Since I wrote the original version of this page before stylesheets were as well established, I didn't mention them much, and gave examples instead using the old deprecated tags like FONT. However, many of my notes and cautions about the use and misuse of fonts still apply even when you use stylesheets to suggest them.
If you use any special non-ASCII characters, such as foreign-language accented characters or non-Latin alphabets (Cyrillic, etc.), it was for a long time considered a bad idea to use any hardcoded font settings... there was enormous variation in the availability of special characters in different fonts, and often there were both U.S. and foreign versions of a given font that have a different character repertoire. Not to mention the multiple character encodings for some alphabets that have different characters at different positions, so that some fonts may be in a different order from others. A properly Unicode-compliant browser ought to adjust for this and display the correct characters in any case, but you couldn't always count on all browsers being properly compliant. If you refrain from specifying a specific font, you give the browser and the user's own configurations a chance to find a font that works for the particular language of the document, but if you specify a font you may be forcing the use of a font that just doesn't work. However, this is hopefully only an academic concern now, with Unicode support well-established in modern browsers.
If you use font settings for such things as headers, captions, and sidebars, at least restrain yourself from changing the font on the normal body text of your document. It is especially important that this text be readable, and generally the browser default font does the best job of this. In particular, lots of developers these days seem to like to use Arial as their body text font; I don't know why, since I actually like the looks of the normal default (Times New Roman) better. Arial is a sans-serif font, and I've heard of studies that show that serif fonts are generally more readable for large blocks of text. Then again, I've heard of other studies that show that this applies only to paper, not computer screens. However, I still prefer the serif fonts for body text myself (and people do sometimes print out Web pages!). Arial is better suited for brief headlines. Arial also tends to look a bit larger than other fonts, which in turn encourages developers to use the SIZE attribute of the FONT tag to make it smaller, another bad idea because it can produce hard-to-read text for some users.
Avoid FONT SIZE settings for your normal body text. By definition, the browser's normal font size is supposed to be the most readable size for normal text. That's why browsers have a configuration setting to let the user choose a font size for normal text, so that they can choose one that is good for them. When you use <FONT SIZE="-1">, what you're actually saying is "Take the font size the user has chosen as a readable size for normal text, and make the font one size smaller than that." That's not so polite to the user you're expecting to read your text. You should use variant font sizes only for special cases where a particular text element must be smaller or larger than normal, like small-print legalese at the bottom of your page mandated by your lawyer but that you don't really expect anyone to read. And, by the way, there's really no difference between "relative" sizes like "-1" and "absolute" sizes like "2". They're really all relative. Normal browser text is defined as size "3" (unless you use the rarely-used <BASEFONT> tag), and the other sizes are relative to this. The numbers are in an arbitrary scale with no relation to any absolute units like points, picas, pixels, or millimeters. The actual sizes they represent can and will vary by browser and platform, even if the users don't change their defaults through preference options. There's no way to force an exact font size on all users of all browsers, though stylesheets let you do a better job of suggesting one than the old font tags did.
TIP: Don't "fake" special characters (foreign alphabets, mathematical symbols, "dingbat" images, etc.) using FONT FACE elements. This is bound to fail on many browsers now and even more in the future, while support for the proper Unicode representations will only increase.
The use of non-ASCII-based fonts like "Dingbats" or "Symbol", or specialized fonts for foreign alphabets, is a "presentational hack" that obscures the logical structure of a document and should be avoided. Any user who doesn't have that particular specialized font will see the ASCII character at the same position, which will probably totally change the meaning of your document. But, in the future, it's likely that even browsers on platforms that do have the given special font will fail to render this "hack" technique as the author intended. The reason is that full-fledged Unicode support entails decoupling the logical characters from the particular fonts used to display them. An "a" is always a Latin lowercase letter "a", and an "alpha" (Unicode character 945 decimal, entity &#945;: α) is always a Greek letter "alpha", no matter what FONT tags might surround its text block for presentational reasons. When a sequence is encountered like <FONT FACE="Symbol">a</FONT> or <FONT FACE="Arial">&#945;</FONT>, it's regarded as an attempt, respectively, to display an "a" using the Symbol font, and an alpha using the Arial font. If the browser then finds that the respective characters aren't present in these fonts (there's no "a" in Symbol and no alpha in the American version of Arial), then it is supposed to ignore the font tag and render the correct character in another font which does have it. What character, if any, happens to be at the same code position in one font as the given character is in another, is totally irrelevant. Full, correct support of this in future browser versions will make it possible to express a wide range of mixed characters in documents in a platform-independent way, helpful to linguists and mathematicians. But it will also break some earlier, nonstandard attempts to "force" special characters via font tags.
One final note about the FONT element; the HTML 4.0 specs define this as a character-level element, not a block-level element. What this means in plain English is that a span of text between <FONT> and </FONT> cannot include any block-level markup such as <P>; font elements can be within a paragraph, but paragraphs can't be within a font element. You need to open a new font element in each paragraph or other block construct that you want to specify a font for, and close it at the end of the paragraph (properly nested within any other constructs you use). Failure to do this will produce a lot of weird errors in a validator if you try to validate your site.

Hall of Shame

Make your site better by looking at other sites that show, by example, what not to do!
NOTE: The inclusion of a site in my "Hall of Shame" links should not be construed as any sort of personal attack on the site's creator, who may be a really great person, or even an attack on the linked Web site as a whole, which may be a source of really great information and/or entertainment. Rather, it is simply to highlight specific features (intentional or accidental) of the linked sites which cause problems that could have been avoided by better design. If you find one of your sites is linked here, don't get offended; improve your site so that I'll have to take down the link!
  • People do fix their mistakes sometimes; a couple of pages I formerly cited here for messed-up characters later changed their code so it worked properly. One of them is no longer online, but this page was the other; its accented letters formerly showed as something weird due to characters apparently input in one encoding and served in another, but now they're correct. Congratulations... you're not on my Hall of Shame any more!
  • Another former Hall-of-Shamer that is now reformed: Many of the domain name dispute decisions on the National Arbitration Forum site used to be hard to read because all the apostrophes, and some of the quotation marks, came out as some weird accented letters. This was apparently the result of some program putting in so-called "smart quotes" in a platform-specific character set that got badly mangled somewhere in the process of getting it onto the Web. However, they seem to have fixed it since then.
  • On the other hand, this decision on the competing dispute resolution provider WIPO displays some character set confusion. The page is sent by its server with the MIME "charset" parameter of ISO-8859-1. The authors then attempt to override this with a META tag giving the charset euc-cn, but under the standards the server-supplied charset takes precedence over document META tags. Thus, a standards-compliant browser (e.g., Mozilla) will attempt to render the page as ISO-8859-1, meaning that the Chinese characters, curly apostrophes, and other non-ASCII characters come out as gibberish.
  • Similarly, this page is served as ISO-8859-1 but has UTF-8 in its META tag; standards-compliant browsers show a big mess everywhere such characters as so-called "smart quotes" are used.
  • This Internet draft proposal is served as iso-8859-1 but has a contradictory META tag saying that it's IBM437. In Mozilla, it shows with funny characters all over the place.
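The kind of mess these pages show can be reproduced in a few lines of Python. This is only a sketch: the `misread` helper is a made-up name, and I'm using Windows-1252 (`cp1252`) as a stand-in for how many browsers actually treat text labeled ISO-8859-1, which is an assumption about browser behavior rather than anything the pages above specify.

```python
def misread(text, saved_as, served_as):
    """Show what a reader sees when bytes saved in one encoding are decoded as another."""
    return text.encode(saved_as).decode(served_as, errors="replace")


# A curly apostrophe saved as UTF-8 but decoded as Windows-1252
print(misread("it\u2019s", "utf-8", "cp1252"))     # itâ€™s

# An accented letter saved as ISO 8859-1 but decoded as UTF-8
print(misread("caf\u00e9", "latin-1", "utf-8"))    # "caf" plus a replacement character
```

In both cases the bytes on the wire are perfectly fine; it's the mismatch between the announced and the actual encoding that produces the gibberish.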
