Directories and Default Index Files

Structuring Your Site

TIP: Make intelligent use of subdirectories to logically structure your site, make maintenance easier, and give parts of your site memorable URLs.
A Web site needn't be all in one directory. You can use subdirectories (what graphical-environment types tend to call "folders" these days, but we old-time computerists prefer the more technical term) within your site. That's a good way to separate your content in a logical, easily maintainable way. If you just dump everything in one directory, it will get unwieldy very fast. Subdirectories can be used for the following purposes:
  1. To separate the content of your site into logical sub-sites. If your site has sensible dividing-points, give each part a separate subdirectory. For instance, a corporate site may have one directory for marketing information on its products, another for technical-support information, and a third one for stockholder reports. A subdirectory can be divided in turn into sub-subdirectories at the next level: the marketing directory may have a subdirectory for each product line.
  2. To put graphics in a separate directory from the HTML files. When you edit the HTML files and want to upload them all to the server, you won't want to waste time by uploading the unchanged graphics over again, but you'll have trouble separating them from the HTML when they're all jumbled together.
  3. Similarly, if you have sound files, Java applets, or other multimedia add-ons, use separate directories for each kind of content to keep things straight.
TIP: Once you decide on your directory structure and file names, don't change them unless you have a really good reason!
Decide on the directory structure of your site early, when you first start working on it; it's much easier to develop and maintain a site starting with a sensible structure than to try to change the structure of a site after it's already evolved haphazardly. And if you change the file and directory names after the site's already been up for a while, you'll break any bookmarks, links, and search-engine entries that have been made to parts of your site other than the main home page. So come up with sensible names from the start, and try to avoid changing them thereafter unless absolutely necessary. Even a "trivial" change like changing all your .html files to .htm or vice-versa will break links, so avoid it!
Note: Both .html and .htm are common extensions for HTML documents. .html is generally regarded as the more "proper" extension, standing for the full document format name "HyperText Markup Language", but .htm came into use early in the history of the Web for the sake of developers using operating systems like MS-DOS or Windows 3.1 that were limited to three-letter extensions. Nowadays, with few people using such operating systems on the Internet, and with modern FTP programs supporting an option to add the extra letter to the end of filenames on upload, there are few good reasons to use the shorter extension, and some people think URLs look "cheesy" with the short extension. Some authoring tools, especially those created by Microsoft, still default to this extension, so lots of sites use it even when the developer's system lacks the limitation that led to it. In fact, one of the common superstitions in file naming is that names should be limited to 8 letters plus a 3-letter extension; this is no longer true for the vast majority of systems in current use, and even systems that are still limited in this manner have no problem browsing Web sites whose URLs don't abide by this limitation.
Here's as good a place as any to remind you that, on UNIX servers (which is what a large portion of Web sites use), filenames are case-sensitive. A name in uppercase like INDEX.HTML is different from one in lowercase like index.html, and they're both different from each of the mixed possibilities like Index.Html and index.HTML. So, when you're creating new files and directories in a Web site, be attentive to whether you're naming them in uppercase or lowercase, and be consistent. All links to a given file will have to agree in case with the way the file is on the server. Unless there's a good reason to do otherwise, you should use all lowercase letters in your names; that's generally the way users are used to entering URLs. (Even if your server is one, such as Windows NT, which does not use case-sensitive filenames, you should still be consistent in case in your links, since the different-case versions are different URLs even if they retrieve the same file, and will be separately cached by the browser and waste memory space and download time.)
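If you'd like to catch case slips before you upload, here's a rough sketch of one way to do it in Python. It compares the filenames your links use against the names of the files as you intend to store them, and reports any link that matches a file only when case is ignored. The function name and the sample filenames (cats.html, Stamps.html, and so on) are made up for illustration, and gathering the two lists from your actual pages is left to you.

```python
def find_case_mismatches(links, server_files):
    """Return (link, actual_name) pairs where a link matches a file only if case is ignored."""
    exact = set(server_files)
    # Map each lowercased name to the name as it is actually stored.
    by_lower = {name.lower(): name for name in server_files}
    mismatches = []
    for link in links:
        if link in exact:
            continue  # exact match: fine on a case-sensitive server
        actual = by_lower.get(link.lower())
        if actual is not None:
            mismatches.append((link, actual))
    return mismatches


# Hypothetical file and link names, for illustration only.
files = ["index.html", "cats.html", "Stamps.html"]
links = ["index.html", "Cats.html", "stamps.html", "missing.html"]
print(find_case_mismatches(links, files))
# [('Cats.html', 'cats.html'), ('stamps.html', 'Stamps.html')]
```

Note that missing.html isn't reported: it matches no file at all, which is a different kind of broken link from the case problem this sketch looks for.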

Default Index Files

TIP: Use the default index file sensibly to simplify the URL of your site. Do the same for subdirectories, to simplify the URLs of your sub-sites.
Almost all Web servers have a default file, usually index.html, but sometimes default.html, welcome.html, or default.htm, that will be loaded automatically when a directory name is used as the URL. You can take advantage of this to make your URL shorter and more elegant-looking. Many users don't know this and use URLs like:
http://www.someplace.net/~msmith/marysmith.html
If Mary named her main page index.html, she'd be able to give her URL as:
http://www.someplace.net/~msmith/
Some people get this half right, and give their URL as:
http://www.someplace.net/~msmith/index.html
They used the right filename, but didn't realize that they didn't have to actually type that name. The directory name alone suffices, is easier to type, and looks nicer. (See the notes below on linking back to your home page.)
Put a default index file in every directory, even directories that don't actually need one (e.g., your graphics directory). If you don't, a user who enters the directory name as a URL will get a raw directory listing, and you may have files you'd prefer random users not see (like pages that are still under construction). A "dummy" index file prevents such snooping.
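Checking every directory by hand gets tedious on a big site, so here's a small Python sketch that walks a local copy of your site and lists the directories that lack a default index file. The list of candidate names is an assumption taken from the examples above; the name your server really uses depends on how it's configured, so adjust it to suit. The function name is my own invention too.

```python
import os

# Assumed candidate names; the real default depends on your server's configuration.
INDEX_NAMES = ("index.html", "default.html", "welcome.html", "default.htm")


def dirs_without_index(site_root, index_names=INDEX_NAMES):
    """Return paths (relative to site_root) of directories with no default index file."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(site_root):
        dirnames.sort()  # visit subdirectories in a predictable order
        # The comparison is case-sensitive, as on a UNIX server.
        if not any(name in filenames for name in index_names):
            missing.append(os.path.relpath(dirpath, site_root))
    return missing
```

Calling dirs_without_index on the top directory of your local copy returns a list such as ['graphics'] if only the graphics directory is missing its index; the top directory itself shows up as '.' if it has none.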

Final Slash in Pathnames

TIP: Don't leave out the closing slash of directory-name URLs!
Always include the final slash (/) at the end of a URL that ends in a directory name. If you use:
http://www.someplace.net/~msmith
(without the slash), the browser will first try to retrieve a file rather than a directory, and only when the server realizes that ~msmith is a directory name will it tell the browser to add the slash and try again. This takes one extra communication round between browser and server, slowing down the retrieval. Also, the browser doesn't know in advance that the address without the slash goes to the same page as the one with it, so it won't show the link in the "visited-link" color if the user already went there, and won't take advantage of a previously-cached copy of the page that may exist.
Even worse, there are a few old browsers (some versions of Mosaic, for instance) that don't handle this sort of redirection correctly. They may pull up the correct Web page without the slash, but they then fail to handle relative links from the page correctly. A link to stuff.html from the URL http://www.someplace.net/~msmith/ should end up going to http://www.someplace.net/~msmith/stuff.html, but if the slash is omitted and the browser software isn't smart enough to add it once it's redirected by the server, it will think it's really one directory level higher in the tree, and parse the relative URL as http://www.someplace.net/stuff.html. This will then cause a 404 Not Found error, and the user won't know why.
If you're using the <BASE HREF="..."> element to specify a base URL for your site, the trailing slash is even more important: without it, the browser will resolve relative references against the directory one level higher than the one you intended.
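You can see this resolution rule for yourself with a few lines of Python. This is only an illustration using the standard library's urljoin function, which follows the same general rules browsers use for relative references; it isn't a simulation of any particular browser.

```python
from urllib.parse import urljoin

with_slash = "http://www.someplace.net/~msmith/"
without_slash = "http://www.someplace.net/~msmith"

# With the slash, ~msmith is treated as a directory and the link lands inside it.
print(urljoin(with_slash, "stuff.html"))
# http://www.someplace.net/~msmith/stuff.html

# Without the slash, ~msmith is treated as a filename and gets replaced.
print(urljoin(without_slash, "stuff.html"))
# http://www.someplace.net/stuff.html
```

The second result is exactly the wrong URL a confused browser, or a BASE HREF lacking its slash, would produce.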
One very prominent site whose creators failed to heed my advice on trailing slashes is the official government posting of the Ken Starr Report on President Clinton's relations with intern Monica Lewinsky. Due to news-media hype, this report (posted to several official government sites on September 11, 1998, and shortly thereafter to various private-sector sites as well) got some of the heaviest Internet traffic ever, causing the servers to be so overloaded in the first few hours the report was up that most people couldn't connect. Unfortunately, the government added to this problem by using versions of the URLs of these sites lacking the trailing slash everywhere they publicized or linked to the sites, thus ensuring that each access of the site would have one more server transaction than would be necessary if the slash had been used. With the high level of traffic the site had at the time, this probably added long delays for many people's accesses.

Another reason to use closing slashes...

When URLs get published in print media such as newspapers, magazines, and newsletters, they often get put in sentences with periods at the end. Some readers (especially those who are novices to the Web and unaware of what characters are usually in URLs in what order) will think the period is part of the URL and type it into their browsers. If the URL ends in a slash, adding a period onto it will be treated by most servers as a reference to the "single-dot" directory entry, which refers to the current directory. This will bring up the same page as the user would have received without the extra period (though with a slightly inelegant URL). Without the closing slash, adding a period causes it to be appended to the requested filename, usually producing a 404 Not Found error.

A final note on slashes...

Having said all this, I'd better remind you not to "overcorrect" by adding slashes to URLs that aren't supposed to have them. If the URL references a file rather than a directory, there shouldn't be a slash at the end. So don't type "http://www.someplace.net/~msmith/stuff.html/"!

Linking Back Home

TIP: The home page is (usually) named index.html, but don't link to that filename!
When linking back to your main home page from other pages in your site, use <A HREF="./"> instead of <A HREF="index.html">. This "dot-slash" syntax causes the index of the present directory to be loaded under the same URL syntax that the user used to access the site in the first place (directory name alone), while the latter syntax sends the user to the URL with an unnecessary "index.html" appended to it, which the browser won't realize is the same page and will hence not show the link in the "visited" color or use cached copies. If the user links to or bookmarks the page, they'll end up propagating your less-elegant "index.html" URL instead of the cleaner directory name. (Some WYSIWYG-type editors like Microsoft FrontPage refuse to let you do links the way I recommend, even changing manually-entered "./" links to "index.html" on you. That's one of the reasons I hate such editors, and use only plain-text editors to do my own page editing.)
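If you've inherited pages full of explicit "index.html" links, a short script can find them for you. Here's a minimal sketch using Python's standard html.parser module; the class name is my own, it assumes your default index file is called index.html, and it deliberately ignores links carrying a "#fragment" or "?query" on the end, so treat it as a starting point rather than a finished tool.

```python
from html.parser import HTMLParser


class IndexLinkFinder(HTMLParser):
    """Collect <A HREF> values that name index.html explicitly."""

    def __init__(self):
        super().__init__()
        self.flagged = []  # (original href, suggested replacement)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                directory, slash, filename = value.rpartition("/")
                if filename == "index.html":
                    # "index.html" alone becomes "./"; "cats/index.html" becomes "cats/"
                    suggestion = directory + slash if slash else "./"
                    self.flagged.append((value, suggestion))


finder = IndexLinkFinder()
finder.feed('<A HREF="index.html">Home</A> <A HREF="../cats/index.html">Cats</A> <A HREF="./">OK</A>')
print(finder.flagged)
# [('index.html', './'), ('../cats/index.html', '../cats/')]
```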
Note: As a general rule, you should be consistent and link to each of your pages with one single "canonical" URL per page, so that the "visited" link color and browser caches work properly. My notes on linking to the default index and always using closing slashes in directory links are two instances of this; other cases include sites that are accessible via multiple domain or host names: http://www.yoursite.com/ and http://yoursite.com/ might both work, but you should pick one as your standard way of linking to your site instead of mixing them. Some people purposely link to multiple variants of their address as a way of getting search engines to index them multiple times, but that strikes me as another form of "spamdexing", and it's annoying as a user to wind up with lots of copies of the same page showing up in a search result.
Also, if you use the same graphic in multiple places, be sure you use the same copy of it, at the same URL, so that browsers can use the previously cached copy of it instead of reloading it each time.
You can use index files in each subdirectory if you have multiple directories, so Mary can do sub-sites on her hobbies of stamp collecting and cats as:
http://www.someplace.net/~msmith/stamps/
http://www.someplace.net/~msmith/cats/
In such a structure, the main menus of the "stamps" and "cats" subsites will be the index.html files of these respective directories, and there can be an unlimited number of other files in each of the directories. But don't confuse the structure by putting the main menu elsewhere; I've seen sites that use "stamps.html" in the parent directory as the main menu of the "stamps" subsite, with the remainder of the files in the subdirectory "stamps/". This illogical move separates the subsite menu from its related files, so I don't know what the developer was thinking when he or she did it.
If you put the main index of the subsite in the proper directory, but don't name it as the default index, you end up with "redundant" URLs like:
http://www.someplace.net/~msmith/stamps/stamps.html
I like to call such URLs "Foo-slash-foo" URLs, since they're of the form foo/foo.html (where "foo" is one of the "computer geek" community's favorite "arbitrary variable" names, representing any character string). Redundant URLs look slightly silly, and are longer than the URL you could have had by using default files and citing the URL by the directory name alone. I've even seen triply- or quadruply-redundant URLs on sites that seem to go out of their way to use overly deep directory trees and avoid using default indices, producing monstrosities like:
http://www.foocorp.com/foocorp/foo/corp/foocorp.html
Probably, the developer just wasn't thinking clearly when planning the file and directory names in such a site. You can do better!
NOTE: I thought when I came up with the above "foocorp" example that this was a contrived, exaggerated URL used for effect, and that I wasn't likely to run into one that bad in the real world... but I found that the mlb.com address of Major League Baseball's site redirects to this atrocity:
http://mlb.mlb.com/NASApp/mlb/mlb/homepage/mlb_homepage.jsp
When linking to the parent directory, use HREF="../" (two dots and a slash); when linking to a "sibling" directory, use HREF="../cats/"; to link to the index of a subdirectory below the present one, use the name without any dots or slashes before it, like HREF="stamps/". To go up two levels to a "grandparent" directory, use HREF="../../".
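To check where links like these actually lead, you can resolve them with Python's standard urljoin function, which applies the usual relative-URL rules. In this sketch the page album.html and the subdirectory rare/ are hypothetical names I've added to Mary's "stamps" sub-site for the sake of the example.

```python
from urllib.parse import urljoin

# Hypothetical page inside Mary's "stamps" sub-site.
page = "http://www.someplace.net/~msmith/stamps/album.html"

for href in ("./", "../", "../cats/", "../../", "rare/"):
    print(href, "->", urljoin(page, href))
# ./ -> http://www.someplace.net/~msmith/stamps/
# ../ -> http://www.someplace.net/~msmith/
# ../cats/ -> http://www.someplace.net/~msmith/cats/
# ../../ -> http://www.someplace.net/
# rare/ -> http://www.someplace.net/~msmith/stamps/rare/
```

Every result ends in a slash and names a directory, so each one relies on the server's default index file to supply the page.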
One thing to note: If you do the links the way I recommend here, they won't work when you browse through your Web pages on your hard disk, since your hard disk does not have any "default" filename as a directory index. You will see the raw directory when you follow such a link. But are you developing your Web site to look good on your hard disk or on the destination Web server? Unless you're creating a site to distribute on floppy or CD-ROM to run in non-networked environments, the aim of your development is to make the site work well on the server, so you should put up with a little awkwardness when you're testing it on your own machine before uploading it. When you follow a link and a raw directory comes up, that's not an error; just click on "index.html" and keep going, with the awareness that this "problem" will go away once you put the site up on the server where it belongs. If you do need a version of the site that runs correctly on a hard or floppy disk, there are some programs available to export a Web site to a disk in runnable mode, which automatically change all links to valid filenames rather than directory names. Teleport Pro and WebSnake are two such programs, available through TUCOWS.
