Introductory WWW authoring tutorial

http://www.cs.umd.edu/~pugh/intro-www-tutorial/

by William Pugh (pugh at cs.umd.edu)
Dept. of Computer Science
Univ. of Maryland, College Park

Philosophy

There are several points of contention/philosophy regarding HTML/WWW documents. The originators of the WWW intended HTML to be a mark-up language, where you categorize different parts of the document and let the browser decide how to present it. For example, rather than making a line 18 point bold text, you might mark it as a header level 2.

Among other reasons for doing things this way, you might be using a browser that doesn't support multiple font sizes or wish to have a browser that provides an outline-based view of an HTML document.

However, as people have started to do more elaborate things with the WWW, some have tried to get more control over how their documents look. For example, some extensions allow you to control the size, color or font of text.

Another issue is that companies have raced past the official WWW Consortium: different browsers support different extensions, and some extensions are poorly thought out.

Most of what is described in this document is part of the official HTML 2.0 standard. I've included a few useful extensions, marked with [Non-standard]. These are fairly widely implemented, but you shouldn't depend on them. I've ignored some obsolete extensions even though they are widely used and supported. This document was written in January, 1996. The WWW and HTML are rapidly evolving; by 1997, parts of this document will be substantially out of date (although if you stick to the things in the 2.0 specification, your pages should still look fine).

Unfortunately, I don't have the time in this mini-course to go over what makes a web page attractive, useful or easy to use. Fortunately, a number of others have done a much better job than I could do, including:

One quick note however: assume anyone reading your web page will get your documents at about 1K/second. Think real hard before having more than 20K of images on a web page.

A very useful suggestion: look at the source code for web pages; most browsers have a view source option.

Overall structure

HTML is a markup language, based on SGML. Documents contain text, tags and special characters.

Tags are delimited by angle brackets (e.g., <p>) and are case insensitive. Some tags are containers: they have a start and end tag (e.g., <h1> and </h1>). An end tag is always formed by putting a slash at the start of the tag name. Container tags must be properly nested; an overlap such as the following is not allowed: <strong> you can't <em>overlap</strong> two character styles</em>.
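As a rough illustration of the nesting rule, the sketch below puts a properly nested pair of container tags next to the overlapping form. The sentence text is invented for the example.

```html
<!-- Properly nested: em opens and closes inside strong -->
<strong>you can <em>nest</em> two character styles</strong>

<!-- Improperly nested (avoid): strong closes before em does -->
<strong>you can't <em>overlap</strong> two character styles</em>
```

Browsers may cope with the second form, but how they display it is not something you can rely on.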

Tags that a browser does not recognize (because they are extensions it doesn't handle) are generally ignored.

Tags can have attributes, such as <img src="pic.gif" align=top>, which has two attributes src and align, each of which is assigned a value. It is always safe and sometimes required to quote attribute values (particularly if they contain any unusual characters). Sometimes, an attribute is simply present or not present, rather than being assigned a value.

HTML documents can contain special characters, such as à -- these are denoted by an & followed by a code followed by a semicolon; a code can be either a name or # followed by a number. For example, the string &agrave; will produce à. A fairly complete list of special characters is given at http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_9.html#SEC9.7, but note that not all browsers implement that list.

Most importantly, if you want a less than sign to appear in a document, you must use &lt;, and if you want an ampersand to appear, use &amp;. You are probably OK if a greater than sign appears naked in your document, but you might want to use &gt; to be on the safe side.
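A small sketch of these escapes in use; the sentences are invented for illustration. A browser should display the first line with a literal less than sign, ampersand and greater than sign, and the second with an accented a.

```html
<!-- Each special character is written as an ampersand code ending in a semicolon -->
<p>If x &lt; y &amp; y &gt; 0, the loop stops.
<p>Voil&agrave;, an accented character.
```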

All white space is equivalent: a space, ten spaces, a new line or ten blank lines are all treated the same.

An HTML comment is written as <!-- Your comment goes here -->. Note that two dashes are required to both start and end a comment. Not all browsers recognize comments (this goes for a lot of HTML) and people can see your comments (by viewing the source of your document) so don't put any secrets in a comment.

The overall structure of an HTML document is:

<html>
<head>
<title>Your title goes here</title>
</head>
<body>
Your contents go here
</body>
</html>

You might be able to leave off the outer <html> ... </html>, and in more sophisticated applications, you might put more in the heading. But for now, I recommend that you not worry about it.

Body elements

The body of a document consists of the following elements:

Headers

Headers are used to indicate section headings. A <h1>level 1 header</h1> is the top/biggest/most-important section header (e.g., a chapter heading), a <h2>level 2 header</h2> is the next most significant, and so on. You can use header levels 1 through 6, and a level X header is delimited with <hX> and </hX>. You probably want to avoid skipping levels (e.g., having a level 3 header immediately inside a level 1 header, with no level 2 header in sight). This will probably work, but future browsers may allow you to look at a document in outline form and would get confused by such a structure.

The contents of a header can include text, images and line breaks (but not paragraphs, lists, horizontal rules or preformatted text).
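As a minimal sketch, a document outline that uses header levels without skipping any might look like this. The section titles are made up for the example.

```html
<h1>Chapter 1: Getting started</h1>
<h2>Installing the software</h2>
<h3>UNIX systems</h3>
<h3>Other systems</h3>
<!-- A line break is allowed inside a header; a paragraph is not -->
<h2>Running<br>the software</h2>
```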

Paragraphs

Paragraphs are delimited with <p> and </p>, although the closing </p> is optional and you will rarely see it or use it. In an earlier version of HTML, <p> was used to separate paragraphs rather than to start them. You will probably see that in some old documents, but you should avoid it. The contents of a paragraph can include text, images and line breaks (but not headers, lists, horizontal rules or preformatted text).

The exact formatting of a paragraph is up to the browser; it might put a blank line before each paragraph, or not. Empty paragraphs might be displayed as blank or ignored.

Alignment [Non-standard]

You can (in some browsers) center or right align headers or paragraphs, for example by using <h2 align=center> ... </h2> or <p align=right> (this is one of the reasons why <p> was changed to start a paragraph rather than separate paragraphs).

Line breaks

You can use <br> to cause a line break. A line break is useful when you want to start a new line but are not allowed to start a new paragraph (e.g., within a heading). A line break might also be useful because it often does not generate a blank line while <p> often will.

Multiple <br>'s may or may not cause multiple line breaks.

Horizontal rules

You can use <hr> to cause a horizontal rule on a line by itself. Here is an example:
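In HTML source, a rule between two paragraphs might be written as in this small sketch (the paragraph text is invented):

```html
<p>Text above the rule.
<hr>
<p>Text below the rule.
```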

Lists

There are three basic kinds of lists: ordered lists (i.e., enumerated), unordered lists (i.e., bulleted lists) and definition lists. Lists can be nested.

Ordered lists are denoted with <OL> ... </OL> and unordered lists are denoted with <UL> ... </UL>. Both of these must contain a series of list items, the start of each marked with <LI> (like paragraphs, you can close a list item with </LI> but it is not needed and rarely done).

Each list item can be pretty much whatever you want other than a header.

A definition list is denoted with <DL> ... </DL>. There are two types of items in a definition list: Terms (<DT>) and Definitions (<DD>). As with <LI>, you do not need to close <DT> or <DD>.

Definition list term
    A definition list term is not indented and might be displayed in bold.
Definition list definition
    A definition list definition is indented (and has no bullet or number preceding it).
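The three kinds of lists could be written roughly as follows. This is only a sketch: the list contents are invented, and the closing </li>, </dt> and </dd> tags are left off as described above.

```html
<!-- An ordered list with an unordered list nested inside one item -->
<ol>
<li>First step
<li>Second step, with a nested bulleted list:
    <ul>
    <li>a sub-point
    <li>another sub-point
    </ul>
<li>Third step
</ol>

<!-- A definition list: each term is followed by its definition -->
<dl>
<dt>HTML
<dd>HyperText Markup Language
<dt>URL
<dd>Uniform Resource Locator
</dl>
```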

Preformatted text

Preformatted text is delimited with <pre> ... </pre>. Preformatted text is displayed in a monospaced font, and spaces and new lines are significant. Within preformatted text, you can use links and horizontal rules. You might be able to use strong/emphasized text and/or images, but don't rely on it.

An important note: < and & still have special meaning in preformatted text. You can't just convert a text document to HTML by putting <pre> ... </pre> around it: you have to watch out for occurrences of < and & in your text.
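For instance, a fragment of C code placed in preformatted text still needs its less than signs and ampersands escaped. The sketch below uses an invented code fragment; the spacing and line breaks inside the block are kept as typed.

```html
<pre>
if (a &lt; b &amp;&amp; b &lt; c)
    printf("ordered\n");
</pre>
```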

Blockquote

<blockquote> ... </blockquote> denotes a chunk of your document that is quoted from elsewhere; typically a browser indents that portion of your document (headers and all).

Address

<address> ... </address> is used for information such as address, signature and authorship, often at the beginning or end of the body of a document. Typically, it is rendered in an italic typeface and may be indented. For example:
William Pugh (pugh at cs.umd.edu)

Character formats

<em> ... </em>
    Emphasis; often displayed as italics
<strong> ... </strong>
    Strong emphasis; often displayed as bold
<code> ... </code>
    Used for computer text; often displayed in teletype font
<i> ... </i>
    Italics
<b> ... </b>
    Bold
<tt> ... </tt>
    Teletype font
<big> ... </big> [Non-standard]
    Big font
<small> ... </small> [Non-standard]
    Small font

Images

You can include an image in your document using the tag <img ... >. There are several attributes you may wish to set:
src="URL"
    Must be supplied; the URL for the picture.
alt="string"
    This is text that should be displayed if the picture is not displayed (because the browser can't display pictures or the user has decided to not download images). Optional but recommended.
align=top|middle|bottom|left|right
    How the picture is aligned. If omitted, treated as bottom.
height=num [Non-standard]
    (optional): Tells the browser the height of the picture
width=num [Non-standard]
    (optional): Tells the browser the width of the picture

Normally, the image is just treated as a (big) character. An alignment of top tells the browser to align the top of the picture with the top of the line, and an alignment of middle or bottom tells the browser to align the middle or bottom of the image with the baseline of the text.

[Non-standard] An alignment of left or right introduces some serious magic. Rather than displaying the image within the current line of text, it is displayed on the left or right side of the window, and text wraps around it. Many browsers don't support this, and it is hard to predict exactly how your image will look on a different browser or with a different window width.

[Non-standard] If a browser supports left/right alignment of images, it may also support <br clear> to cause a line break to a place that is clear of left and right justified images. Alternatively, it may allow clear=left|right|all as an attribute for pretty much anything, including headers and paragraphs (this is part of the proposed HTML 3.0 standard); this will cause the browser to move the display of that element down until the left/right/both margins are free of floating images.

[Non-standard] Early web browsers didn't display anything until they had downloaded all images in the document. More recent browsers try to display the web page as soon as possible. However, until a browser starts receiving an image, it doesn't know how big the image is and how much space to reserve for it on the page. This can slow the display of the page and/or cause the page to be reformatted as the images are downloaded. In some browsers, specifying the height and width of an image in the <img> tag eliminates this problem.
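Putting the attributes together, an image tag might look like the sketch below. The file name pugh.gif, the alt text and the pixel dimensions are assumptions made up for the example; width, height and align=left are the non-standard parts.

```html
<!-- Standard attributes only: the image sits in the line like a big character -->
<img src="pugh.gif" alt="Photo of the author" align=top>

<!-- With non-standard extensions: reserve 100 x 120 pixels and let text wrap on the right -->
<img src="pugh.gif" alt="Photo of the author" align=left width=100 height=120>
```

A browser that does not understand the extra attributes should still display the image, just without the wrapping or the reserved space.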

Image formats

At the moment, there are two image formats you should be primarily concerned with:

gif
    Graphic interchange format: Maximum of 256 colors, great for images with lots of solid colors. Compression doesn't introduce any losses. One of the 256 colors can be specified as transparent. Can be interlaced, so that after receiving just a bit of the picture, your browser can display a crude approximation. Any graphical WWW browser supports GIF.
    UNISYS decided to start demanding royalties on a patent used in GIF, so this format is headed for the dustbin of history (within one to two years), to be replaced by PNG (PNG's Not GIF).
jpg/jpeg
    Full 24-bit color, great for photographs, great compression but blurs the picture a little. Not all WWW browsers support inline jpeg pictures. [Non-standard].

A progressive version of jpeg is just starting to become available (allowing rough approximations from just part of a file) [Non-standard].

Links

Finally: the stuff that makes HTML Hypertext. One of the subjects you have to master first is URL's: Uniform Resource Locators.

URL's

A URL has the following format:

protocol://machine.name[:port]/dir1/dir2/file

The protocol describes how to get/access the document. Some typical protocols are http (hypertext transfer protocol), ftp (file transfer protocol), gopher (gopher protocol) and file (a local file). The machine.name must be a standard Internet domain name. Warning: a short name such as wam might resolve appropriately on campus, but not outside of it: use fully specified names.

The port is some TCP wizardry you don't really need to know about. If omitted, the standard port for whatever protocol you are using is assumed (80 for http). The only bit of information you might find useful: if the port is less than 1024 on a UNIX system, the server must have been set up by the system administrator of that machine. If it is 1024 or greater, anyone could be running it.

The directories specify a path from the root of the web file structure. You use UNIX style pathnames even if the server or client is on a Wintel or Apple system. One frequent exception: If the first directory is ~name, that resolves to the directory that has been set up for name.

If your path specifies a directory rather than a file, you will get a default file name (typically index.html but it might be something else). If such a file doesn't exist, you might get a directory listing or an error (depends on how the server is set up).

Links to points within an HTML document

Normally, a reference to an HTML file is considered a pointer to the beginning of the document. You can also point to an arbitrary named location within an HTML document. To do so, simply append #location to the URL. To name a location, use an anchor (<a>) with a name specified:

<a name=location> ... </a>

This associates the name specified with the text inside the anchor. An anchor tag can specify a name, an href, or both.
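As a sketch of how the two halves fit together, the first anchor below names a location and the second links to it from another document. The name papers, the heading and the link text are assumptions chosen for the example.

```html
<!-- In index.html: name a location -->
<h2><a name="papers">Papers</a></h2>

<!-- In some other document: link to that named location -->
<a href="http://www.cs.umd.edu/users/pugh/index.html#papers">Bill Pugh's papers</a>
```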

Relative URL's

You can leave off various prefixes of a URL and have the URL be treated as relative to the location of the page containing the link. As with UNIX file names, you can use .. as a directory name to climb up to the parent directory. For example, within the document

http://www.cs.umd.edu/users/pugh/index.html

the following shows the interpretation of some relative URL's.

Relative URL              Absolute URL
/Department/About.html    http://www.cs.umd.edu/Department/About.html
intro-www-tutorial/       http://www.cs.umd.edu/users/pugh/intro-www-tutorial/
pugh.gif                  http://www.cs.umd.edu/users/pugh/pugh.gif
../keleher                http://www.cs.umd.edu/users/keleher
#papers                   http://www.cs.umd.edu/users/pugh/index.html#papers
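Inside that same index.html, some of the relative URLs listed above might appear in links and images roughly like this. The link text and alt text are invented for the sketch.

```html
<!-- These appear in http://www.cs.umd.edu/users/pugh/index.html -->
<a href="intro-www-tutorial/">WWW authoring tutorial</a>
<a href="/Department/About.html">About the department</a>
<a href="#papers">Jump to the papers section</a>
<img src="pugh.gif" alt="Photo of the author">
```

One advantage of relative URLs is that a group of pages can be moved to another directory or server together without editing the links between them.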

Tables [Non-standard]

Tables are very useful, but only some browsers implement them, and they are not implemented consistently. They don't all recognize the same tags/attributes, and some allow only plain text in table cells, others allow anything (including lists and other tables). There is a proposal for a table standard, and most of the browsers that implement tables implement the standard. Within this section, I'll use [Non-standard] for features that are not a part of the proposed standard.

Among other uses, you can use tables to generate multi-column documents (but this only works in browsers that allow arbitrary contents for a table cell). I've done this in the summary section below.

Overall structure of a table:

<table> <tr> row 1 <tr> row 2 ... </table>

You can also specify a caption:

<table> <caption> caption text 1 </caption> <tr> row 1 <tr> row 2 ... </table>

Overall structure of a row:

<td> first entry 1 <td> second entry 2 ...

You can close rows (</tr>) and cells (</td>) but it isn't needed (unless you have tables inside of tables).

You can also use <th> for table cells. Using <td> creates a data cell; using <th> creates a header cell. A data cell is typically displayed in normal font and left-justified. A header cell is typically displayed in a bold font and centered. If you close a header cell (optional), use </th>.

<table border>
    Draws a border around each cell.
<table border=int> [Non-standard]
    Draws a border of the specified width around each cell.
<tr align=left|center|right>
    Sets the default alignment of the row
<td align=left|center|right>
    Sets the alignment of that table cell
<tr valign=top|middle|bottom>
    Sets the default vertical alignment of the row (default is middle)
<td valign=top|middle|bottom>
    Sets the vertical alignment of that table cell
<td rowspan=int>
    Creates a cell that spans multiple rows
<td colspan=int>
    Creates a cell that spans multiple columns
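Before the larger example, here is a minimal sketch of a bordered table with one header row and two data rows. The caption and cell contents are chosen only for illustration, and the optional closing tags are left off.

```html
<table border>
<caption>Image formats</caption>
<!-- Header cells: typically bold and centered -->
<tr> <th>Format <th>Colors
<!-- Data cells: centered here because of the row's align attribute -->
<tr align=center> <td>GIF  <td>256
<tr align=center> <td>JPEG <td>24-bit
</table>
```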

Here is an example table (taken from Teach Yourself More Web Publishing with HTML in a Week by Laura Lemay):

                      Drive Belt Deflection

                                   Used Belt Deflection    Set deflection
                                   Limit     Adjust        of new belt
                                             Deflection
Alternator   Models without AC     10mm      5-7mm         5-7mm
             Models with AC        12mm      6-8mm
Power Steering Oil Pump            12.5mm    7.9mm         6-8mm

Here is the HTML to generate it:

<TABLE BORDER>
<CAPTION>Drive Belt Deflection</CAPTION>
<TR>
    <TH ROWSPAN=2 COLSPAN=2></TH>
    <TH COLSPAN=2>Used Belt Deflection</TH>
    <TH ROWSPAN=2>Set<BR>deflection<BR>of new belt</TH>
</TR>
<TR>
    <TH>Limit</TH>
    <TH>Adjust<BR>Deflection</TH>
</TR>
<TR ALIGN=CENTER>
    <TH ROWSPAN=2 ALIGN=LEFT>Alternator</TH>
    <TD ALIGN=LEFT>Models without AC</TD>
    <TD>10mm</TD> <TD>5-7mm</TD> <TD ROWSPAN=2>5-7mm</TD>
</TR>
<TR ALIGN=CENTER>
    <TD ALIGN=LEFT>Models with AC</TD>
    <TD>12mm</TD> <TD>6-8mm</TD>
</TR>
<TR ALIGN=CENTER>
    <TH COLSPAN=2 ALIGN=LEFT>Power Steering Oil Pump</TH>
    <TD>12.5mm</TD> <TD>7.9mm</TD> <TD>6-8mm</TD>
</TR>
</TABLE>

HTML checkers

This section describes several kinds of programs that are designed to help validate your web pages. Two of these, weblint and htmlcheck, you only need to run when you create your web page: they check that your HTML is correct. Simply looking at your page with a WWW browser is not sufficient. Many browsers attempt to cope with HTML errors, but different browsers are able to cope with different errors. Some errors that Netscape 1.1 used to cope with aren't tolerated by Netscape 2.0.

Another checker, MOMspider, checks your links to see if they are valid. This is useful not only when creating a web page, but also as a weekly check to see if any of the (off-site) web pages you point to have changed. A listing of HTML validation tools, including some other nice things like WWW cross-reference generators, is provided at: http://www.khoros.unm.edu/staff/neilb/weblint/validation.html

Weblint: http://www.khoros.unm.edu/staff/neilb/weblint.html

weblint should be installed real-soon-now on the departmental Unix machines. There are a number of options (type man weblint or view http://www.khoros.unm.edu/staff/neilb/weblint/manpage.html) but you can run it with just:

weblint index.html

You can also just submit a URL or HTML to a form at:

http://www.unipress.com/weblint/

Weblint looks for certain bad things in your HTML and gives you fairly useful error messages when it finds them. Some of the things that weblint complains about aren't illegal, simply things it thinks are bad style (like having a <h3> header immediately inside a <h1> header).

Htmlcheck: http://www.webtechs.com/html-val-svc/

htmlcheck is installed on the departmental Unix machines. There are a number of options (type man htmlcheck or see the WWW documentation), but you can run it with just:

htmlcheck index.html

You can also just submit a URL or HTML to a form at:

http://www.webtechs.com/html-val-svc/

htmlcheck tries to parse your document using the official specification for HTML (you can tell it which specification to use). When it finds an error, the error message may not be very useful and it may get confused so that any later error messages are worthless.

In my use, I always use weblint first and correct any errors it finds. Only then do I use htmlcheck; once I've gotten rid of the big errors that weblint finds, the error messages from htmlcheck are often more useful. I've found that htmlcheck will often find problems that weblint will miss, so I use both tools.

Momspider: http://www.ics.uci.edu/WebSoft/MOMspider/

Momspider is run once a week on the departmental machine and checks links in HTML files to make sure that they point to valid pages (it checks them even if they point to another machine).

For web pages on the CS machine, you can fill out an on-line form to have your web page checked once a week and have a report emailed to you if there are any problems.

Sometimes, you will get back an error report, but when you check the link yourself you find no problem. If the machine hosting the other page was down when Momspider ran, then you'll get a report even if the machine came back up 5 minutes later.

HTML translators

There are lots of web translators available for converting to and from HTML. A good list of them is provided at:

http://www.w3.org/hypertext/WWW/Tools/Filters.html

Some of the most useful ones for converting to HTML are:

HTML editors

In general, the ones I've played with are not ready for prime time (as of January 1996). The problems are:

All of the ones I've seen work OK as a first pass, but I'd need to spend some time cleaning up the resulting HTML code before I was happy with it. Some of them are not happy editing HTML documents created by anything other than themselves.

HTML editors are improving. Within 6 months, I expect HTML editors to be an important tool for creating HTML documents.

Internet assistant for Microsoft Word

Courtesy of Jeff Hollingsworth

Quick Reference

