Languages

Since "World Wide" is part of the name of the Web, it was never intended to be an English-only medium. You can use any language in your Web sites, and various facilities have been added to the standards of the Web to allow you to indicate which languages you are using for the benefit of indexers and translators, and even to intelligently serve different language versions of your pages to suit user preferences. This page describes some of these techniques.
The "languages" considered here are human languages like Spanish and Mandarin Chinese, not computer languages such as HTML and JavaScript. What computer languages you use in your site is also an important choice, but that's not the subject of this article!

Indicating Language

HTML gives you the ability to mark a document, or a part of it, with what language it is written in. This is done with the LANG attribute, which can be placed in just about any tag.
To indicate that the entire page is in English, you would put this in the HTML tag at the beginning:
<HTML lang="en">
The value of the attribute is the code for the language the document is in. Most of the common languages of the world have two-letter codes (defined in the standard ISO 639-1), and a larger list (including "dead" languages) has three-letter codes (in ISO 639-2). RFC 3066 states that the two-letter code must be used in preference to the three-letter code when one exists, so normally only two-letter codes will be found in Web language attributes.
In order to specify a particular dialect, a language code can be suffixed with a dash and an additional code, often a country code. When this is done, traditionally the base language code is given in lowercase and the suffixed country code in uppercase (though this isn't required; the codes are case-insensitive). So, American English can be indicated as en-US and British English as en-GB. (Note that GB is the proper country code for the United Kingdom, not UK, even though .uk is the domain suffix used there. In most other cases, though, the standard country code agrees with the country-code domain for a given country.) es-MX is the code for Mexican Spanish. Non-national dialects can be indicated by their full names -- en-cockney for the Cockney dialect of English.
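Because the codes are case-insensitive, software comparing them should normalize case first. The following is a minimal Python sketch of that idea; the function names normalize_tag and same_tag are made up for illustration, and the "two letters means a country code" rule is a simplifying assumption that covers only the kinds of tags shown above.

```python
def normalize_tag(tag):
    """Return a tag in the traditional casing: base lowercase, country uppercase."""
    parts = tag.strip().split("-")
    base = parts[0].lower()
    # Assumption: a two-letter suffix is a country code; anything else
    # (such as "cockney") is a dialect name and is left in lowercase.
    suffixes = [p.upper() if len(p) == 2 else p.lower() for p in parts[1:]]
    return "-".join([base] + suffixes)


def same_tag(first, second):
    """Compare two language tags without regard to case."""
    return first.strip().lower() == second.strip().lower()
```

With this, normalize_tag("EN-gb") gives en-GB, and same_tag("es-mx", "ES-MX") is true.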
You can also indicate smaller portions of a document as being in a particular language by putting a lang attribute on the opening tag of an element. For instance, you can give the language of a quotation by putting a lang attribute in the blockquote tag. If no single element covers the section of the document you wish to so mark, you can use a DIV or SPAN element for this purpose. (DIV is a block-level element which can enclose entire paragraphs and other block elements; SPAN is a character-level element which can be contained within a paragraph or other block.) For instance:
<BLOCKQUOTE lang="en">How are you? Or, as I'd say in Spanish, "<SPAN lang="es">¿Como estás, o como diga en inglés, '<SPAN lang="en">How are you?</SPAN>'?</SPAN>"</BLOCKQUOTE>
This comes out like this in your browser:
How are you? Or, as I'd say in Spanish, "¿Como estás, o como diga en inglés, 'How are you?'?"
Probably there is no visible effect from those various language attributes, though in some cases a browser will make use of them to vary the presentation in some manner that is suited to a particular language, like using a different default font. However, in a normal graphical browser, displaying languages such as English and Spanish that are expressible in standard ASCII and Latin-1 characters, there is little need for such things. (For other languages, which use more "exotic" character sets, Netscape 7 and Mozilla do indeed vary the font based on the lang attribute.) In any case, the information about what language each section of the passage is in is there to be used by any user agent that does have a use for it, like a speaking browser that can adjust its pronunciation accordingly, or an indexing robot that builds different indices of material in different languages. (In fact, the IBM Home Page Reader speaking browser does recognize "lang" attributes and this affects the way it pronounces text.) The W3C Accessibility Guidelines require that you faithfully mark up the language of all text in your site.
Note that elements with language attributes can be nested to an arbitrary depth. The innermost element surrounding a particular segment of text is always the one that governs what language it is in. Thus, in the above example, the quotation as a whole is in English, but it has a section in Spanish which in turn contains a subsection that is in English.
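To make the nesting rule concrete, here is a rough Python sketch of how a user agent might work out the language in effect for each run of text. It uses the standard library's html.parser module; the class name LangTracker and the function language_segments are invented for this example, and the sketch assumes every opening tag has a matching closing tag (so an unclosed element such as a bare BR would confuse it).

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Record which language is in effect for each piece of text."""

    def __init__(self):
        super().__init__()
        self.open_langs = []  # language in effect inside each open element
        self.segments = []    # (language, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # An element without its own lang attribute inherits from its parent.
        inherited = self.open_langs[-1] if self.open_langs else None
        self.open_langs.append(dict(attrs).get("lang") or inherited)

    def handle_endtag(self, tag):
        if self.open_langs:
            self.open_langs.pop()

    def handle_data(self, data):
        if data.strip():
            current = self.open_langs[-1] if self.open_langs else None
            self.segments.append((current, data))


def language_segments(markup):
    """Return a list of (language, text) pairs for an HTML fragment."""
    tracker = LangTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.segments
```

Fed the blockquote example above, this reports the outer text as en, the Spanish span as es, and the innermost quotation as en again, because the innermost enclosing element always wins.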

Another way to indicate the document's language

The above attributes are part of HTML. However, another layer of the standards and protocols by which Web documents reach the users is HTTP, the HyperText Transfer Protocol. That, too, has a method of specifying what language the document is in. If you have your server send this header:
Content-Language: en-US
then this will indicate that the page is in American English. I discuss the HTTP protocol and how to configure your server for language-related settings below in the section on language negotiation. If you're not doing content language negotiation, there's probably no need to mess with HTTP protocol settings; the HTML language attributes are easier to use and more flexible (since they can mark sections of a page, not just the page as a whole).
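If your pages are produced by a program rather than served as static files, the header can be set by the program itself. As a hedged illustration only, here is a tiny Python WSGI application that sends it; the function name app and the page body are placeholders, and how you would hook this into a real server depends on your hosting setup.

```python
def app(environ, start_response):
    """Minimal WSGI application that labels its response as American English."""
    body = '<html lang="en-US"><body>Hello</body></html>'.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Language", "en-US"),  # the HTTP-level language label
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Note that the page still carries a lang attribute in its HTML tag; the header and the attribute say the same thing at two different layers.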

Online Translators

One type of user agent that could make great use of these attributes is an online translator, which takes a Web page and translates its text into a different language. Unfortunately, the ones in present use don't seem to make much use of these attributes. The AltaVista Babelfish translator, the most popular one at present, doesn't seem to care about these attributes at all, instead relying on the user to specify what language the source document is in through a pulldown menu, and then trusting this regardless of what the page actually indicates. A few other online translators make some minimal use of lang attributes, noticing one in the top-level HTML element indicating the language of the entire document, but not heeding other lang attributes elsewhere in the document. This produces odd situations when a page is mostly in one language but contains quotations in other languages; the translators will try to translate the quotes as if they were in the main language (e.g., in a Spanish quotation within an English page, it will see the word "sin", Spanish for "without", as the English word "sin", and translate it as "pecado".) This is a case where a more intelligent translator could notice attributes denoting the language of document parts and realize that those sections should either be left untranslated or be translated from a different origin language.
This situation seems to be one of the many "chicken-or-the-egg" dilemmas that plague the introduction of any logical, structural elements and attributes in HTML; program developers don't bother to implement support for them because few Web authors are using them, and Web authors don't bother to use them because few programs do anything with them. This sort of Gordian knot doesn't seem to affect "presentationalist" enhancements that aim to create "neat" visual effects, such as were introduced many times by Netscape and Microsoft over the years; those things seem to have a large crowd of fans of flashiness eager to start using them even while browser support is still iffy (slapping a "Best Viewed with the Latest Version of My Favorite Browser" icon on the front page to get people to upgrade). Or, at least, this was the case in the early years of the mass popularity of the Web; things seem to have matured and stabilized much more now. But any enhancement that affects the underlying logical structure, behind the scenes, without noticeable visual effect, doesn't excite the same interest among Web designers or software developers, so those things are much slower to be deployed.
To break this cycle, I advocate that Web authors do their best to use this sort of logical markup; it won't hurt the rendering of their pages, and it will help encourage future generations of user agent authors to finally attempt to do something useful with the tags and attributes in question.

Tagging Links

In addition to indicating the language of sections of text in your own page, you can also indicate the language of other documents you link to; this uses the hreflang attribute. For instance:
<A hreflang="es" lang="en" href="spanish.html">A page in Spanish</A>
Note that the above element has both a lang and an hreflang attribute. The lang attribute indicates the language of the text within the element, in this case "A page in Spanish", which is in English as indicated. The hreflang attribute indicates that the page being linked to is in Spanish.
While there isn't much user agent support for this attribute, the Mozilla browser does show the destination document language in the information window which can be brought up with the "Properties" item in the right-click menu when clicking on a link.

Language Negotiation

Now for something more complex. You might have versions of a page in several different languages. A mechanism exists to let you give users their preferred language, all from the same URL. This mechanism, called "content language negotiation", has been present in the HTTP protocol for years, but is fairly seldom used. Instead, Web developers have picked this as one of the wheels that they keep reinventing, in various ways that are usually inferior to the original system. Thus, sites are full of complicated contraptions with cookies and JavaScript to attempt to route people to their preferred language (possibly failing to take them anywhere if cookies or JavaScript are disabled), or attempt to redirect the user based on what country the server thinks they are in (a very unreliable thing to determine, as users might be logged in through an ISP in a different country than their actual location, and anyway their physical location does not determine with certainty what language they speak).
There is a better way. Browsers offer as part of their configurations the ability to specify what languages their user wants to use, in the order of preference. This is sent as part of the HTTP request for any Web pages the user retrieves, and the server can use it to choose which version to send.
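In HTTP, this preference list travels in the Accept-Language request header, where each language can carry an optional weight ("q-value") between 0 and 1. Those details come from the HTTP specification rather than from anything stated in this article. Here is a minimal Python sketch of how a server-side program might read the header; the function name parse_accept_language is invented for this example, and real servers handle more edge cases than this does.

```python
def parse_accept_language(header):
    """Return [(tag, q), ...] from an Accept-Language value, best first."""
    preferences = []
    for item in header.split(","):
        tag, _, parameter = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # default weight when no q-value is given
        parameter = parameter.strip()
        if parameter.startswith("q="):
            try:
                quality = float(parameter[2:])
            except ValueError:
                quality = 0.0  # unparseable weight: treat as not acceptable
        preferences.append((tag, quality))
    # sorted() is stable, so entries with equal weight keep header order.
    return sorted(preferences, key=lambda entry: -entry[1])
```

For a header value of "es-MX, es;q=0.9, en-US;q=0.8, en;q=0.7", this yields the four tags in that order, from most to least preferred.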

Configuring Your Browser's Language Settings

Usually, the browser preference is in the form of a pick list of languages which the user can select one at a time, forming an ordered list. In Mozilla and Netscape, this can be found in the "Preferences" item of the "Edit" menu; within this, it is the "Languages" subsection in the "Navigator" section. In Opera, it's the "Preferences" item of the "File" menu, under "Languages". In MSIE, it's the "Internet Options" item in the "Tools" menu; push the "Languages" button at the bottom of the "General" tab.
Unfortunately, none of these browsers do that good a job of explaining to the user how to set their language preferences. In particular, the relationship of generic languages (es = Spanish) and specific language varieties (es-MX = Mexican Spanish) isn't explained. If a user selects a specific variety of a language (one country's variant, for instance), but also understands other dialects of that language, then the user should also select the generic version to ensure that pages of all varieties of the user's preferred language are available, not just the ones of the particular country the user most prefers. For instance, somebody who understands Spanish and English, and prefers the Mexican variety of Spanish and the U.S. variety of English (but will accept other varieties) can set this preference order:
es-MX, es, en-US, en
In this case, if a requested document is available in Mexican Spanish, it will be served in preference to any other versions. However, if no such version is available, but there is one in Venezuelan Spanish (es-VE), then it will be treated as a match to the generic Spanish entry, and be served in preference to other languages. If no Spanish version is available, then an American English version will be checked for first, but if none is available a British English (en-GB) or an undifferentiated generic English (en) version will be served. Note that it is important to put the generic version in your list after any specific variants of the same language, or else the generic version will match any subtype of the given language, not necessarily the subtype you most prefer. If a site happens to have both a U.S. English and a British English version, the en item in your preference list will match either (with the preference between the two resolved by random chance or by a "quality level" setting in the server configuration where the webmaster decided which version is preferable when the end user doesn't care), but if the en-US item comes first in your list, it will unambiguously express your preference for American English.
However, if your preference list omits the generic versions:
es-MX, en-US
then you are telling the server that you want only Mexican Spanish or U.S. English, no other variety. If the site has versions in Panamanian Spanish and Australian English, neither would match your preference list, so you'd be at the mercy of what the server decides to do in cases where no language matches. (Below you'll see some hints on how a Webmaster can deal with such cases.) It's likely you won't end up with your preferred language in this case, so you should avoid this. Regrettably, though, the major browsers don't warn you of this, and some of them will merrily accept such flawed preferences without complaint. (There are rare cases where somebody might actually have a preference of this sort, especially in the case of languages with mutually-incomprehensible dialects, but in most common cases a user would be better served by including the generic as well as specific varieties.)
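The difference between the two preference lists can be sketched in a few lines of Python. This is a simplified model of the prefix-matching rule described above, not the actual algorithm of any particular server; the function names covers and choose_language are invented for the example, and server-side "quality level" settings are ignored.

```python
def covers(preference, available):
    """True if a preference entry matches an available language tag.

    A generic entry such as "es" covers "es-VE"; a specific entry such as
    "es-MX" covers only itself.
    """
    preference, available = preference.lower(), available.lower()
    return available == preference or available.startswith(preference + "-")


def choose_language(preferences, available):
    """Return the first available tag matched by the ordered preferences."""
    for preference in preferences:
        for tag in available:
            if covers(preference, tag):
                return tag
    return None  # nothing matched; the server must decide what to do
```

With the list es-MX, es, en-US, en and a site offering en-GB and es-VE, choose_language returns es-VE. With the list es-MX, en-US and a site offering es-PA and en-AU, it returns None, which is the "at the mercy of the server" case.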
Opera avoids this issue by, in general, including only generic languages and not specific varieties (with a few exceptions, such as Chinese, for which several varieties are included); this removes the ability of the user to get into the "jam" of picking only a variety without its generic parent, at the expense of removing the ability to express a finely-tuned preference among dialects. On the other hand, MSIE includes mostly specific varieties in its list, and in the case of many languages the generic variety isn't even available (though there's a "write-in slot" where you can add additional languages of your own choice). This makes it difficult for users to configure their language settings correctly, with both specific and generic versions, even if they are aware that they should do so.
The best way to handle this, in my opinion, would be to include both generics and specifics (as Mozilla and Netscape do) but to have a warning message if the user sets a preference that omits the generics (as these browsers unfortunately don't).
"Out of the box," browsers will usually be configured with language preferences that match the user interface language of the browser. If you buy a computer that's been set up specifically for a particular country, any preinstalled browser is most likely set for the dominant language of that country. If you bought the computer from another country (e.g., Mexicans often get their computers from the United States) and it wasn't reconfigured by a local dealer, it might be set for a foreign language. If you download a browser, you might have a choice of downloading versions in different languages, or a configuration setting for language during the installation process. Since most users tend to leave their software in its initial default configuration, this means that some, but not all, browsers in use are properly configured to the language preferences of their users. In addition, due to the browser makers' failure to properly handle the generic vs. specific issue, an awful lot of browsers out there are configured to accept only a particular variant of a language and not that language in general -- en-US instead of en.
Because of these issues, some webmasters question whether the use of language negotiation is a good idea, and this can lead into philosophical debates over whether it is better to try to educate users on how to take advantage of their browsers' features, or whether developers ought to give up on that and just pander to the users' assumed level of ignorance. On such debates, I normally side with the faction that wants to use features properly and try to teach others to do so; the other route leads to the dumbing down of the Internet and the perpetual reinventing of wheels in inferior ways.

Setting Up Sites for Language Negotiation

Exactly how you set up your site to use this feature will depend on the server software that is in use. Since Apache is the most popular Web server at present, I'll describe how to do it under that system, but there should be methods of accomplishing this in other server software as well.
Create Documents
First, you need to create different language versions of your documents with a consistent naming scheme. Actually, it's possible to configure language negotiation no matter what you name the different versions, but it's easiest to do it by giving the files extensions based on what language they are, in addition to any extensions that may already be there to indicate the data type. For instance, if you formerly had only a single-language version of a page, named mypage.html, and you now want to negotiate between English and Spanish versions of it, you can name them mypage.html.en and mypage.html.es. (I'll discuss more later on the issues of whether to put the language suffix before or after the other file extension like .html, and how that affects how you access the pages. For now, I'll assume the language suffix is put last.)
Set Up Configuration
Now you need to tell the server what the suffixes mean and that you intend them to be used for negotiation. One way to do this is with a .htaccess file. (The filename is exactly that, .htaccess, with the dot as the first character, as unnatural as this may seem to MS-Windows users more used to the scheme filename.ext; unfortunately, some PC software makes it difficult to create files with such names, as it tries to fill in stuff to transform the filename into a more "familiar" style.) The file should contain the following lines:
Options +MultiViews
AddLanguage en-US .en
AddLanguage es-MX .es
LanguagePriority en-US es-MX

The first line tells Apache to enable the use of the "MultiViews" mode, where it can choose between alternative versions of a document instead of just serving the same file for everybody. The next two lines tell it that .en represents a document in English (U.S. variety), and .es represents a document in Spanish (Mexican variety). (Put the languages of your choice here; you can have as many different languages as you wish, with a different extension for each, which may but needn't match the language code -- usually, people using Polish as one of the supported languages use something other than the .pl language code because this is in common use on the Web as the file extension for Perl scripts). The final line suggests the order of priority for the different languages in cases where the user agent expresses no preference; here, I made English the first choice. You probably want to use whichever of your languages you think generally has the highest quality in your site, perhaps the language of your original writing from which the others might possibly lose something in the translation.
.htaccess files may be placed in any subdirectory of your site, and are applicable to everything in that directory and any "child" directories beneath it. Thus, you should place your file in the server root directory of your site (the topmost directory where public Web data is stored, usually where the main home page is located) if you want it to apply to your entire site, or in a subdirectory if you're only using these features in a subset of your site and don't wish them to interfere with the rest of your content.
Now try to access it!
Now that you've set up the .htaccess file and placed mypage.html.en and mypage.html.es files online, you can try accessing the URL http://yoursite.example/mypage.html (where yoursite.example is replaced with your site's domain and any path information necessary to get to the place where your page is). If you did everything correctly, you should end up with the English or Spanish version of the page depending on your language configuration.
Whoops... I still got the old version of the page!
If, instead, you wind up with the old version of mypage.html from before you started trying to add language negotiation, that means that you left that file in place alongside the new mypage.html.en and mypage.html.es files. When a user attempts to access the URL ending in mypage.html, Apache first looks for a file exactly matching that name. If it finds one, no "MultiViews" are attempted; the found file is immediately served. Hence, you should delete the old file to allow Apache to find the files you really want to be served. If no matching file is found, Apache proceeds to the negotiation phase, looking at all the files that are named mypage.html with additional extension(s) after it, and seeing which one of them best matches the acceptable languages (or, for that matter, data formats; negotiation can be used, in theory at least, to serve HTML, PDF, and MS-Word versions, for instance, of a document based on user preferences, though popular browsers don't really make much use of this capability at present). If English is the user's first language choice, mypage.html.en will be matched and served.
In fact, the .html part of the filename isn't even necessary to include in the URL when MultiViews is enabled, since that part is also subject to negotiation, and all common browsers include HTML as one of their accepted content types. A request for the URL ending in mypage (with no extension given) will cause mypage.html.en to be matched, since Apache associates .html with the text/html MIME type, and you have associated .en with the English language.
Some developers favor using extensionless URLs of this form, as they're shorter, cleaner, and more "future-proof" (you can change the data format or scripting language in the future without changing the URLs). Others find them to look a bit unnatural, as they're used to URLs having file extensions at the end when they're not subdirectories ending in slashes. If you're adding MultiViews negotiation to an already-existing site, you probably already use links to URLs ending in .html (or other extensions), and may not want to change them. So there are pros and cons to both ways of doing things, but they both work. One advantage of dropping the extension from the URL is that you can then add multiple extensions in any order for the purpose of content type, language, and other negotiation -- .html.en and .en.html will work identically. If your URLs end in .html, your filenames must not embed other extensions before this, or they won't match properly.
But what if the user doesn't include either language in the request?
One problem you may run into is dealing with requests that do not indicate any of your supported languages as acceptable. If a user sends the acceptance string fr-CA, fr, de, signifying that their preferred languages are Canadian French, generic French, and generic German, then neither your English nor your Spanish page will match. Technically, nothing in your site is acceptable to the user, and with the above configuration the user will be served an error page saying so. (Unfortunately, that error page is generally in English, even though the user has indicated that this isn't an acceptable language!) While this is technically accurate, most webmasters would prefer it not happen; it's an ugly and hard-to-understand page that starts by saying "Not Acceptable", making users think they did something wrong. Thus, most webmasters would rather that some version of their real content be served instead, even if it does happen to be in a language the user doesn't understand. This need becomes more urgent due to the fact of the existence of misconfigured browsers that don't send language acceptance strings matching their users' true preferences. If, for instance, a browser sends a string that excludes the generic versions of the preferred languages, like en-GB, es-VE, then neither the U.S. English nor Mexican Spanish versions will be matched and the user will get the error page, an unfortunate turn of events.
Fortunately, there's a workaround. If you include a version of the file with no language code, that will get served when none of the language-coded versions match. You've got to do this carefully, though. We've already seen that if you leave a plain mypage.html file around next to mypage.html.en, etc., then it will get served in preference to the other files in response to a request for mypage.html. If you use the "extensionless" version of the URL, however, then you can have those files side by side, with the negotiation being used to decide which version to serve and the one without a language code getting used only if no language matches. If your link does have the extension, you can still make it work, though it is a little more complicated. Just create a file named mypage.html.html, with a double extension. This won't immediately match the request, enabling negotiation to proceed, but it will still match as a default after all else fails.
Even if you use "extensionless" links, and thus are able to get away with avoiding the double extension, you still need to consider the default index pages for each directory; most likely, your server has been configured to look for a few specific names like index.html, and not an extensionless index. index.html.en and index.html.html will be matched by negotiation, but if there is an index.html, it would get matched first and prevent the negotiation for the default index. Hence, you probably need to use "double extensions" for the default versions of these, even if you can get away with single extensions elsewhere.
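The lookup order described in the last few subsections (exact filename first, then negotiation, then the language-less default) can be summed up in a short Python sketch. This is a deliberately simplified model written for illustration, not Apache's real implementation: the function names pick_file and covers are invented, content-type negotiation is ignored, and the extension table simply mirrors the AddLanguage lines shown earlier.

```python
# Mirrors the AddLanguage lines in the .htaccess example above.
LANGUAGE_BY_EXTENSION = {"en": "en-US", "es": "es-MX"}


def covers(preference, language):
    """Prefix rule: "es" covers "es-MX", but "es-VE" does not."""
    preference, language = preference.lower(), language.lower()
    return language == preference or language.startswith(preference + "-")


def pick_file(requested, files, preferences):
    """Return the file served for `requested`, or None for "Not Acceptable"."""
    # Step 1: an exact filename match wins and negotiation never starts.
    if requested in files:
        return requested

    # Step 2: collect files named requested + ".ext[.ext...]".
    by_language = {}  # language tag -> filename
    default = None    # variant carrying no language extension at all
    for name in sorted(files):
        if not name.startswith(requested + "."):
            continue
        extensions = name[len(requested) + 1:].split(".")
        languages = [LANGUAGE_BY_EXTENSION[e] for e in extensions
                     if e in LANGUAGE_BY_EXTENSION]
        if languages:
            for language in languages:
                by_language.setdefault(language, name)
        elif default is None:
            default = name

    # Step 3: the first preference that covers an available language wins.
    for preference in preferences:
        for language, name in by_language.items():
            if covers(preference, language):
                return name

    # Step 4: nothing matched, so fall back to the language-less variant.
    return default
```

For example, with the files mypage.html.en, mypage.html.es, and mypage.html.html, a request for mypage.html from a browser sending fr-CA, fr, de falls through to mypage.html.html instead of producing the error page.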
You probably want the default version to be identical to the file of your preferred language. You can simply copy the other file to this new name, but then you have to remember to update both versions whenever the page changes. A better technique, if you can do it, is to make the default filename a "link" to one of the other file versions. If your server is on a Unix-like platform, and you have shell access, you can use the command:
ln mypage.html.en mypage.html.html
to create mypage.html.html as a "link" that always returns the same contents as mypage.html.en, updating automatically whenever that file changes.
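Note that ln without options creates a hard link, meaning both names refer to the same underlying file. If you would rather script this than type it for every page, the same thing can be done from Python with os.link; the helper name make_default_link below is made up for this sketch, and it assumes a filesystem that supports hard links. One caveat worth knowing: some editors and upload tools save by writing a new file and renaming it over the old one, which silently breaks a hard link, so it is worth re-running the step after such an update.

```python
import os


def make_default_link(source, default):
    """Create `default` as a hard link to `source`, replacing any stale copy."""
    if os.path.exists(default):
        os.remove(default)
    os.link(source, default)  # same effect as: ln source default
```

Calling make_default_link("mypage.html.en", "mypage.html.html") in the page's directory corresponds to the ln command above.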
The newest version of Apache (the 2.0 series) has some more directives, such as ForceLanguagePriority, that are usable in .htaccess files to specify fallback behavior for negotiation in a simpler manner, but many sites (including mine) are still hosted by providers using older versions, so the above technique is necessary for now.
Another Problem... And An Ugly Kludge Around It
As noted above, there's a distressing tendency for current browsers to be configured with only one regional variant of a language and not its generic variety -- like en-US not accompanied by en. Technically, this is saying that the user wants only American English and not any other variety of English; thus, if a document is available in British English and French, and the default language is set to French in this site, then the user would be served the French version even though the English one is probably preferred. I ran into this problem in a site I set up, where the two versions were en-US and es-MX, the default was the English version, and I found that many visitors were configured for varieties of Spanish such as es-PR (Puerto Rican Spanish) and thus got the English version where the Mexican Spanish one would have made more sense. This is all in perfect compliance with the standards -- that's what the user actually asked for -- but the purpose of the site in question is not to argue with visitors or teach them how to configure their browsers. Hence, after some trial and error, I came up with a really ugly "kludge" that gave a better chance of giving users what they probably really wanted instead of what they actually asked for. This is not something I can in good conscience recommend -- it goes entirely against my normal Web development philosophy of following the standards meticulously and avoiding the slightest bit of pandering to the dumbed-down mindset that pervades the Internet these days -- but it does in this case improve the user experience, so here it is in case you want to try it too.
What I did was to make the Spanish versions of my documents "pretend" to be other versions of Spanish in addition to the Mexican variety. By looking at server logs and browser configurations, I listed the varieties of Spanish in common use and their associated codes, and added to my .htaccess file:
AddLanguage es .es1
AddLanguage es-CL .es2
AddLanguage es-CO .es3
AddLanguage es-CR .es4
AddLanguage es-PA .es5
AddLanguage es-PE .es6
AddLanguage es-PR .es7
AddLanguage es-PY .es8
AddLanguage es-SV .es9
AddLanguage es-AR .es10
AddLanguage es-DO .es11
AddLanguage es-US .es12
AddLanguage es-UY .es13
AddLanguage es-BO .es14
AddLanguage es-EC .es15
AddLanguage es-GT .es16
AddLanguage es-HN .es17
AddLanguage es-NI .es18
AddLanguage es-ES .es19
AddLanguage es-VE .es20

This covers a bunch of national variants, plus the generic one for good measure (which isn't really necessary, as it should be matched by any variant, but my trial-and-error showed that on occasion it actually caused the Spanish version to be served in response to a variant not on the list, for some reason), associating each with a different file extension, .es1, .es2, etc.
Next, I created "symlinked" files for each Spanish-version page, with names like:
index.html.es1.es2.es3.es4.es5.es6.es7.es8.es9.es10
index.html.es11.es12.es13.es14.es15.es16.es17.es18.es19.es20

This would be really tedious to create by hand, but I'm a programmer... I just whipped up a Perl script to do it for my entire site!
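The original Perl script isn't reproduced here, but the name-generation part is easy to sketch. The following Python is an illustration of the idea only, not that script: the function name kludge_names and its parameters are invented, and grouping ten extensions per filename simply mirrors the two example names shown above.

```python
def kludge_names(page, total=20, per_file=10):
    """Build link names like page.es1.es2...es10 and page.es11...es20."""
    names = []
    for start in range(1, total + 1, per_file):
        stop = min(start + per_file, total + 1)
        extensions = "".join(".es%d" % n for n in range(start, stop))
        names.append(page + extensions)
    return names
```

Each returned name would then be created as a link to the real Spanish file for that page, for instance by looping over the list and linking each name to the existing Spanish version.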
The end result is that, if a user's browser is configured for es-PE, the server "lies" to it and claims that the Spanish version actually is of this variety. In other words, I'm breaking the standards, but the result is that the user gets a Spanish version as desired instead of an English version.
Do this or not as you will... we can hope that someday browsers will default to more sensible configurations so this won't be necessary.
Always Include a Regular Link Too
As noted a few times above, users unfortunately don't always configure their browsers correctly for their language preferences. Also, users are sometimes using browsers belonging to other people (including in public places such as libraries and cybercafes), which might be configured to different language preferences from that of the user. Users might also be interested in looking at more than one of your language versions, because they're fluent in more than one language or because they're trying to learn another language. Thus, it's important to give the user a chance to get to other versions of your multilanguage pages than the one served them by default. You can do this by including links on each language-negotiated page to all other language versions of the page. By linking directly to mypage.html.es, you always go to the Spanish version no matter what configuration the user has set.
I recommend, however, that you don't use images of flags for these links, though this is fairly common; it's a mistaken approach, since flags represent countries, not languages. Should English be represented by a British flag or an American flag? Which language should a Canadian flag represent? (Both English and French are official languages in that country.) And flags are not necessarily even unique! It's better to use the name of each language, in that language, as the link text.
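If your pages are generated by a script, the language links can be generated too. Here is a small Python sketch under some stated assumptions: the function name language_links and the LANGUAGE_NAMES table are invented for the example, and it assumes each version's file extension is the same as its language code (as with mypage.html.es).

```python
from html import escape

# Each language's name written in that language (extend as needed).
LANGUAGE_NAMES = {"en": "English", "es": "Español", "fr": "Français"}


def language_links(page, current, codes):
    """Return HTML links to every language version except the current one."""
    links = []
    for code in codes:
        if code == current:
            continue
        links.append('<a href="%s.%s" hreflang="%s" lang="%s">%s</a>' % (
            escape(page, quote=True), code, code, code,
            escape(LANGUAGE_NAMES[code])))
    return " | ".join(links)
```

On the English version of a page offered in English and Spanish, language_links("mypage.html", "en", ["en", "es"]) produces a single link whose text is "Español", marked with both lang and hreflang attributes.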
You might also want to include a LINK tag referencing the alternative language versions, like this:
<link rel="alternate" lang="fr" hreflang="fr" title="En Français" href="mypage.html.fr">
This causes browsers that support this element to provide some kind of user interface allowing access to the different language versions. However, since support is not very good in present-day browsers, this shouldn't be the only way you link to the alternative versions.
An interesting question arises once all the pages of your site are served through language negotiation: should navigational links within the site go to the "generic" URL of each page (and thus be subject to language negotiation at each link), or to the specific language version that matches the page being viewed now? This is a tough issue. If you use "generic" URLs, it may frustrate somebody who is trying, for whatever reason, to browse the site in a language other than the one configured in the browser settings, but who keeps winding up in a different language version and has to click on the link to the right language on each page. There's a good argument for keeping the user in the current language once one has been selected by following a specific link.
On the other hand, if you do this, you're encouraging others who link to your site, as well as search engine indexers, to link to the specific-language URLs, bypassing the negotiation, since the generic URLs won't show up among your internal links. Lots of new users will then come to your site directly through such links, and the negotiation will never have a chance to proceed; you might as well not have added it at all. By linking to the generic URLs, you keep the negotiated links out there as access points to your site (although the specific-language ones will get linked and indexed too, since you do have links directly to them as well).
I finally came up with a solution to this dilemma; if you add this JavaScript to the HEAD section of your pages:
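<SCRIPT type="text/javascript">

// Append the language extension to every same-site link found in elements of the given tag.
function settaglinks(tag, lang)
{
  var elements = document.getElementsByTagName(tag);
  for (var i = 0; i < elements.length; i++)
  {
    var href = elements[i].href;
    // Skip elements with no href and links that point to other sites.
    if (!href || href.indexOf('yourdomain.example') < 0) continue;

    if (href.slice(-5) == '.html')
    {
      // Generic page URL: mypage.html becomes mypage.html.en
      elements[i].href = href + '.' + lang;
    }
    else if (href.charAt(href.length - 1) == '/')
    {
      // Directory URL: point at its language-specific index page.
      elements[i].href = href + 'index.html.' + lang;
    }
  }
}

// Rewrite the links only when this page was reached through a language-specific URL.
function setlinks(lang)
{
  var suffix = '.' + lang;
  // Compare against the path only, so a query string or fragment doesn't hide the extension.
  if (document.location.pathname.slice(-suffix.length) == suffix)
  {
    settaglinks('a', lang);
    settaglinks('link', lang);
  }
}

</SCRIPT>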

and then insert onload="setlinks('en')" in your <BODY> tag (changing 'en' to the language code appropriate to the given page, and 'yourdomain.example' to your site's domain name), then all links will be changed from generic to language-specific versions if the user reaches the page through a language-specific URL, but not if they reach it through a generic, language-negotiated URL. At least, this will happen if the user has enabled JavaScript and the browser supports all the features used by this script; if not, it should "degrade gracefully" by leaving the links alone. The effect is that users who pick a language version explicitly by following a link to it will get links within that language from then on, while users who let their browser pick the language for them will stay at the generic URLs. Search engines (which generally ignore JavaScript) will also see and index the generic URLs.
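If you want to check the rewriting rule without loading a page in a browser, the same logic can be pulled out into a plain function. This is just a sketch for experimenting: localizeHref is a made-up name, and the domain is passed as a parameter here instead of being written into the code.

```javascript
// Hypothetical stand-alone version of the link-rewriting rule.
// Returns the language-specific URL for same-site generic links,
// and leaves every other URL unchanged.
function localizeHref(href, lang, domain) {
  if (href.indexOf(domain) < 0) return href;            // other site: leave alone
  if (href.endsWith('.html')) return href + '.' + lang; // generic page
  if (href.endsWith('/')) return href + 'index.html.' + lang; // directory
  return href;                                          // already specific, or not a page
}

console.log(localizeHref('http://yourdomain.example/mypage.html', 'es', 'yourdomain.example'));
// prints "http://yourdomain.example/mypage.html.es"
console.log(localizeHref('http://yourdomain.example/docs/', 'es', 'yourdomain.example'));
// prints "http://yourdomain.example/docs/index.html.es"
```

Links that already name a language version, such as mypage.html.fr, don't end in .html or a slash, so they pass through untouched.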
Some of the stuff I'm doing here is a long way away from my normal "Keep It Simple, Stupid" philosophy, but sometimes you've got to get just as complicated as you need to be to get the job done... but no more.

Character Sets and Encodings

I'm only bringing up character sets and encodings here to note that they are a different issue from languages, although often confused. I already have a different page on character and font issues. The specification of a particular language through an HTML attribute or an HTTP header carries no implication as to what character encoding the document is using, or what font ought to be used to display it, although browsers might possibly use different default fonts depending on language. The confusion comes from the fact that different languages do require a different character repertoire, sometimes just slightly different from one another (the same basic alphabet but different accented letters or other diacritical marks and punctuation) and sometimes radically different (a different alphabet such as Greek or Cyrillic, or a nonalphabetic writing system like Chinese or Japanese), and hence there are specific encodings and fonts associated with a given language.
It's still improper to assume that a document uses a particular character encoding just because that's the most common one used with its language. A quotation within a document might be tagged as lang="ru" because it's in Russian, but it is not in the Cyrillic alphabet because it's transliterated into the Roman alphabet (like "glasnost" and "perestroika"). Even if it uses the Cyrillic alphabet, there are a number of different encodings that include this repertoire. Thus, the use of language tagging is no substitute for correctly announcing the document's character encoding in your server's HTTP response. This can be done in a .htaccess file:
AddDefaultCharset ISO-8859-1
This line can be used if all your documents are in the ISO-8859-1 encoding (usable for most Western European languages); if you use several encodings for different-language documents, you'll have to get more advanced, perhaps using different file extensions for each and setting configurations appropriately. (When I first wrote this, several years ago, such a multiplicity of encodings was still common, but these days it's getting more popular to do everything in UTF-8, an encoding that supports the entire Unicode repertoire. Fortunately, many text editors, including the one I use, UltraEdit, support native editing in UTF-8, so this is feasible.)
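To make the separation concrete, here is a small JavaScript sketch of the two HTTP response headers involved. The function name is invented for the example; Content-Language and Content-Type are the standard headers, and the point is only that one can change without the other.

```javascript
// Hypothetical helper: language and character encoding travel in separate headers.
function languageAndCharsetHeaders(lang, charset) {
  return {
    'Content-Language': lang,
    'Content-Type': 'text/html; charset=' + charset
  };
}

// The same Russian-language page could be announced with different encodings.
console.log(languageAndCharsetHeaders('ru', 'KOI8-R'));
console.log(languageAndCharsetHeaders('ru', 'UTF-8'));
```

Whichever encoding is announced has to match the bytes actually stored in the file; the language header says nothing about that.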
