Chapter 4 Building Blocks of HTML

by Jim O'Donnell

CONTENTS

HTML and SGML: A Parent-Child Relationship
- Advantages and Disadvantages
- The Strength of the HTML Standard
HTML's DTD
- Levels of HTML Conformance
- Checking Conformance of Documents with HTML Standards
The Elements of HTML

HTML pages are like annotated bibliographies: they give you the opportunity to expand on an endless variety of topics and present additional factual or thematic resources to further explore a subject.

Of course, HTML pages are also like gossip magazines: sooner or later you'll see just about everything on them.

But regardless of how anyone perceives HTML, everyone who uses it speaks the same language. Elements, tags, anchors, hyperlinks, URLs, and attributes: they're all part of the lexicon of the Web's documents. To create inspired Web pages (and to cast a critical eye on those already on the Web), you need to have an intimate familiarity with the building blocks of HTML.

This chapter answers the following questions:

How is HTML related to SGML?
What is a DTD?
What's the difference between empty and container elements?
What are the basic components of HTML?

HTML and SGML: A Parent-Child Relationship

HTML is an application of SGML (Standard Generalized Markup Language), which makes it, in effect, a small subset of what SGML can express. SGML documents are more complex and programming-like than HTML. Figure 4.1 shows how an SGML document describes the HTML standard (the figure is, in fact, the SGML declaration for HTML, the SGML document that defines HTML).

Figure 4.1 : SGML coding provides machine-level display format and function commands.

HTML resembles simplified SGML. The observation that SGML is to HTML as HTML is to plain text seems reasonable on the surface. When you take a look under the hood, though, it's easy to see how HTML shares the advantages of both systems of marking text.

Troubleshooting

SGML seems very complicated. How do I find out more about it?

SGML is not for the faint of heart, as the code in figure 4.1 suggests. SGML code constructs are not based as much in "plain English" as HTML is. The following text is written in SGML, describing how the HTML element BLOCKQUOTE is used:

<!ELEMENT (%blockquote) - - %body.content>
<!ATTLIST (%blockquote)
        %attrs;
        %needs; -- for control of text flow --
        >

How would that read in English? The BLOCKQUOTE element is a container for text in the BODY section (%body.content); it does not have any defined arguments that affect its use or how its contents are displayed (no options are listed under the %attrs; or %needs; categories).

SGML is a full-bodied language for defining text function and formatting, and many users have remained loyal to it even after the arrival of HTML. HTML's use of plain-English markup elements is a key reason for the popularity and success of the World Wide Web.

A good way to learn a little more about SGML might be the "Gentle Introduction to SGML," available at

ftp://www.ucc.ie/pub/sgml/p2sg.ps

This is a PostScript file and can be read by printing it on a PostScript printer.

Advantages and Disadvantages

In his World Wide Web Research Notebook

http://www.w3.org/hypertext/WWW/People/Connolly/drafts/webresearch.html

Daniel Connolly outlines the advantages and disadvantages of carrying over SGML practices and constructs into the current HTML standard.

These are the benefits of using SGML to define HTML:

Basing HTML on SGML makes it easy to test whether or not an HTML document conforms to the current standard. Document authors can have confidence in documents that pass automatic verification processes.
The SGML definition of HTML gives every document an Element Structure Information Set (ESIS). This form allows a standard interpretation of all HTML documents.
Like HTML, SGML provides a clear and widely supported standard for creating interchangeable documents.

These are the disadvantages of using SGML to define HTML:

SGML coding is meant to be interpreted at the machine level, and SGML documents are difficult for people to read and understand. This makes an HTML standard based on SGML difficult to understand by reading it.
Because SGML is structurally complex, it's possible to read related SGML documents and reach incorrect conclusions about SGML usage and the standards they define.
SGML is defined at a level of complexity beyond the function and purpose of HTML, and certain modular capabilities that use SGML are too complex for the level of author manageability HTML strives to provide.

The Strength of the HTML Standard

HTML's strength comes from its combination of SGML machine-level constructs (the tags and elements that tell a viewer the purpose of document text) and standard English text markup notation.

For example, the <B> container tag is mnemonically correct (it stands for bold), and it signals a format change to the document's viewing software, which changes the display format of the following text. When the viewer comes across the </B> closing tag, which tells it to turn off the bold attribute, it returns to the previous text formatting.

The versatility of SGML and HTML is becoming widely acknowledged as they are adopted as hypertext document standards by more content managers, including the federal and many state governments.

Creating the Standards

The HTML standard is constantly under development. Users and developers from around the world contribute to the ongoing discourse and testing of new ideas, concepts, and uses for HTML and its component elements. One user who has contributed an enormous amount of time and energy to this process is Daniel Connolly (connolly@w3.org) of the W3 Consortium at MIT. He outlined standards for the standards: he provided the following guidelines to help the HTML development team write the current and upcoming specifications for HTML.

The goal of any HTML specification should be to promote confidence in the fidelity of communications using HTML. This means specifications need to adhere to the following standards:

Make it clear to authors what idioms are available to express their ideas.
Make it clear to implementers how to interpret the HTML format so that authors' ideas will be represented faithfully.
Keep HTML simple enough that it can be implemented using readily available technology, and then processed interactively.
Make HTML expressive enough that it can represent a useful majority of the contemporary communications idioms in the WWW community.
Make some allowance for expressing idioms not captured by the specifications.
Address relevant interoperability issues with other applications and technologies.

You can get more information about the ongoing HTML standards process from Daniel Connolly's Web page:

http://www.w3.org/hypertext/WWW/People/Connolly/

or by reading chapter 8, "Common Conventions in HTML Documents."

HTML's DTD

It's debatable who has contributed more to the "acronymization" of our culture. In a world where ATM can have two totally different meanings (one's great for convenience banking and the other for high-speed data networking), you might expect a language like HTML (itself an acronym) to continue the tradition.

And it does. From its elements (UL stands for, appropriately enough, unordered list) to its parent language SGML, HTML is defined by acronyms. An acronym defines HTML as well: HTML's DTD.

Levels of HTML Conformance

DTD stands for Document Type Definition. It's a document that describes the HTML language, its elements, and their legal uses. The HTML DTD has several levels that pertain to different categories of use or compatibility with the HTML standard. These levels are:

Level 0. Minimal conformance to or use of HTML elements.
Level 1. Compatibility with (or use of) HTML with Level 1 extensions.
Level 2. Compatibility with (or use of) HTML with Level 2 extensions.

The HTML DTD is written in SGML and can be difficult to interpret. Figure 4.2 shows a portion of the HTML DTD for Level 0 (for the complete DTD, see appendix A, "HTML Tags"). The document coding is complex and difficult to read; it's meant to be read mainly by SGML interpreters, not by people. Don't be surprised if it makes no sense to you; it doesn't to the vast majority of people.

Figure 4.2 : The document for each level defines a measure of compatibility to the HTML specification.

Annotated versions of the HTML DTD make it easier for developers and end users to verify conformity issues. Daniel Connolly maintains one popular version, and you can find it at:

http://www.w3.org/hypertext/WWW/People/Connolly/

The Web sites listed in appendix D, "WWW Bibliography," collect other descriptions of the various HTML standards.

Checking Conformance of Documents with HTML Standards

It is possible to check your HTML documents for conformance with HTML standards. The Webtechs HTML Validation Service can be found at:

http://www.webtechs.com/html-val/svc/

As shown in figures 4.3 and 4.4, you can check for conformance at different levels, and supply the HTML document either as a URL to an existing document (see figure 4.3) or by directly inputting the HTML (see fig. 4.4).

Figure 4.3 : The Webtechs HTML Validation Service allows you to check your HTML documents for conformance to a variety of levels.

Figure 4.4 : If you want to check out a small amount of HTML, you can enter it directly rather than building a separate Web page.

After you submit your URL or HTML code, the Webtechs service analyzes it and returns a report such as that shown in figure 4.5. If the document conforms to the HTML 2.0 standard, you are invited to include the validation icon on your Web pages.

Figure 4.5 : Successfully passing the HTML validation check can be indicated on your Web pages by including a link to the Validation icon.
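
A validation service has to know which DTD to check a document against. One common way to say so, sketched here as an assumption rather than a stated requirement of the Webtechs service, is a document type declaration on the first line of the document. The public identifier shown is the one the HTML 2.0 specification defines; the page content is invented for this example.

```html
<!-- The first line names the DTD the document claims to follow -->
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Validation Test Page</TITLE>
</HEAD>
<BODY>
<H1>Validation Test Page</H1>
<P>A short document to submit to a validation service.
</BODY>
</HTML>
```

A small page like this is a handy first submission: if it passes, you know the service is working before you point it at a larger document.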

The Elements of HTML

HTML is composed of elements: instructions that tell WWW viewers to perform a defined task (make text bold, insert a paragraph break, or format and number a list in a predetermined manner). An HTML tag is an element name written inside angle brackets. Figure 4.6 shows a few typical elements and how they are written in tag format.

Figure 4.6 : HTML tags are "invisible" when the WWW viewer displays the document.

Troubleshooting

If WWW viewers read HTML tags as instructions, how did you show them in figure 4.6? Why didn't the viewer just mark up the text in the tags?

Displaying the HTML tags in the previous figure was not as easy as it looks. Because Web viewers look for tags as signals to format text, all occurrences of tags are supposed to be interpreted. To get around this handicap (after all, the software is just doing its job), HTML provides a list of text entities that viewers will interpret as certain ASCII characters. For example, to write a line that the viewer will display as

<TITLE>The Battles of World War Two</TITLE>

you must use entities for the angle bracket characters. HTML defines the "less than" bracket (<) as &lt; and the "greater than" bracket (>) as &gt;. Therefore, the previous line would be written in the HTML document as

&lt;TITLE&gt;The Battles of World War Two&lt;/TITLE&gt;
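
The same technique works anywhere in a document's text. The following is a minimal sketch with invented sample sentences. It uses the two bracket entities plus &amp;, the standard entity for the ampersand itself, which needs the same treatment because an ampersand is what begins every entity.

```html
<!-- Entities stand in for characters the viewer would otherwise interpret -->
<P>Wrap the name of the page in &lt;TITLE&gt; and &lt;/TITLE&gt; tags.
<!-- A literal ampersand is written as an entity too -->
<P>Research &amp; Development
```

A viewer displays the first paragraph with the tags visible as text and the second with an ordinary ampersand between the two words.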

As the name implies, HTML marks up text in a document by defining the specific formatting for sections of the document. HTML is a hybrid, using some elements to define the abstract value of text (such as "emphasized") and others to define the actual on-screen representation in the WWW viewer's window (such as "italicized"). This "split personality" created quite a controversy in the authoring community, spawning two camps of thought that support the different uses of HTML markup.

Unlike the file systems of some operating systems, HTML element names are case independent. You can write tags with any mixture of upper and lowercase characters. For example, you can write one tag that defines the formatting of a section of text as <BLOCKQUOTE>, <blockquote>, <BlockQuote>, or any capitalization combination. Some authors use unorthodox capitalization schemes, such as <bLocKquOtE>, but that doesn't make for easy-to-read HTML, and your site administrator probably discourages this brand of "net.hipness."

Note

This book's convention of using all uppercase characters in HTML tags is for legibility only; feel free to use whatever scheme you're most comfortable with in your own documents, or whatever conforms to your Web site's HTML document style sheet, if there is one.

Empty and Container Tags

HTML uses two types of elements: empty (or open) and container tags. These tags differ because of what they represent. Empty tags represent formatting constructs, such as line breaks and horizontal rules. These tags indicate "one time" instructions that WWW viewers can read and execute without concern for any other HTML construction or document text.

Container tags define a section of text (or of the document itself) and specify the formatting or construction for all of the selected text. A container tag has both a beginning and an ending: the ending tag is identical to the beginning tag, with the addition of a forward slash. Most containers can hold other containers or empty tags, as long as each inner container closes before the one that holds it (see fig. 4.7).

Figure 4.7 : Containers can hold other elements; the entire HTML document is actually one large container, defined by the tag <HTML>.
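
To see both kinds of tags side by side, consider this minimal sketch of a complete document. It is an illustration written for this discussion, not a reproduction of figure 4.7, and its text is invented.

```html
<!-- HTML is the outermost container; HEAD and BODY sit inside it -->
<HTML>
<HEAD>
<TITLE>The Battles of World War Two</TITLE>
</HEAD>
<BODY>
<H1>The Battles of World War Two</H1>
<!-- EM opens and closes inside the B container that holds it -->
<P>This page covers the <B>major <EM>European</EM> campaigns</B>.
<!-- HR and BR are empty tags: no closing tag and no enclosed text -->
<HR>
<P>Anzio<BR>
Normandy
</BODY>
</HTML>
```

Notice that every container closes in the reverse of the order in which it opened. The closing EM tag comes before the closing B tag, and the closing HTML tag comes last of all.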

HTML Tag Arguments

I'm not talking about disagreements between tags in HTML documents. Like command-line applications, many HTML elements use additional parameters (known as arguments or attributes) to increase their functionality. These arguments are passed on to the client software and affect the way the element is applied to the section of text (or, with empty tags, how the tag's construct is displayed in the viewing software's window).

For example, the anchor element uses arguments to define the function of the anchor (whether it's a marker or a hypertext link to another document or anchor). So, a document can contain links to specific sections of text and named anchors at those text locations (see fig. 4.8). Notice that the parameters are contained in the tag's angle brackets.

Figure 4.8 : You use anchors as both the starting and ending points of hypertext links in HTML documents.

In this example, the last line in the list

<LI><A HREF="#Anzio">Battle of Anzio</A>

is an anchor that points to a named anchor somewhere else in the document. The named anchor it points to would be found in a line such as

<A NAME="Anzio"><H1>The Battle of Anzio</H1></A>

When the user clicks the list item Battle of Anzio in the viewed document, the WWW browser jumps immediately to the associated named anchor.
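
Putting the two halves together, a document with a short table of contents might look like the following sketch. The Midway entry and the file name pacific.html are hypothetical, added only to show a second link and a link into another document.

```html
<!-- Starting points: HREF values that begin with # name an anchor in this document -->
<UL>
<LI><A HREF="#Midway">Battle of Midway</A>
<LI><A HREF="#Anzio">Battle of Anzio</A>
</UL>

<!-- Ending points: the NAME value matches the text after the # above -->
<A NAME="Midway"><H1>The Battle of Midway</H1></A>
<P>Text about Midway goes here.

<A NAME="Anzio"><H1>The Battle of Anzio</H1></A>
<P>Text about Anzio goes here.

<!-- A file name before the # points to a named anchor in another document -->
<P>See also <A HREF="pacific.html#Midway">the Pacific campaign page</A>.
```

The HREF argument makes an anchor the start of a link and the NAME argument makes it a destination, so the same element serves both purposes.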

Caution

Underlining and colored borders (such as red or green) are used in some Web pages to indicate hyperlink text and graphics, but these don't print well.

Some WWW viewers, notably Netscape Navigator and Microsoft Internet Explorer, provide support for non-standard arguments that primarily affect the display of the HTML text in the viewer's window. WWW viewers that don't support non-standard elements or arguments just ignore them. Non-standard usage is noted in chapters 15 and 16.

Note

If you incorporate non-standard HTML in your own documents, let users know with a simple statement at the head of your "entry-point" document (usually the "Welcome" or introduction page). This way, they know that a given browser displays your Web pages as you intended them to be seen. Both Netscape and Microsoft have programs that allow you to include special messages and icons on your web pages indicating that they are best viewed with their browsers.

An Overview of HTML Elements

Tables 4.1, 4.2, and 4.3 provide a brief overview of some of the more common HTML elements found in different sections of HTML documents. These tables don't include arguments, but they do include each element's tag type. The entire HTML document should be contained in the HTML container element. For a complete description of each element and its associated arguments, see appendix A.

Table 4.1 HTML Elements for Head Sections in HTML Documents

Element	Element Type	Description
BASE	empty	Base context document
HEAD	container	Document head
ISINDEX	empty	Document is a searchable index
LINK	empty	Link from this document
META	empty	Generic meta-information
NEXTID	empty	Next ID to use for link name
TITLE	container	Title of document
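
As a rough illustration of how several of these elements fit together, here is a sketch of a head section. The URL and the e-mail address are placeholders, and the arguments shown (HREF, REV, NAME, and CONTENT) are described in appendix A.

```html
<HEAD>
<TITLE>The Battles of World War Two</TITLE>
<!-- BASE records the document's own address so relative links resolve correctly -->
<BASE HREF="http://www.example.com/history/index.html">
<!-- LINK with REV="made" is a common way to identify the document's author -->
<LINK REV="made" HREF="mailto:author@example.com">
<!-- META carries information about the document; it has no closing tag -->
<META NAME="keywords" CONTENT="history, World War Two">
</HEAD>
```

Of these, only TITLE normally produces anything the reader sees; the rest is information for the viewer and for other software.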

Table 4.2 HTML Elements for Body Sections in HTML Documents

Element	Element Type	Description
A	container	Anchor: source and/or destination of a link
ADDRESS	container	Address, signature, or byline for a document or passage
B	container	Bold text
BLOCKQUOTE	container	Quoted passage
BODY	container	Document body
BR	empty	Line break
CITE	container	Name or title of cited work
CODE	container	Source code phrase
DD	empty	Definition of term
DIR	container	Directory list
DL	container	Definition list, or glossary
DT	empty	Term in definition list
EM	container	Emphasized phrase
H1	container	Heading, level 1
H2	container	Heading, level 2
H3	container	Heading, level 3
H4	container	Heading, level 4
H5	container	Heading, level 5
H6	container	Heading, level 6
HR	empty	Horizontal rule
I	container	Italic text
IMG	empty	Image; icon, glyph, or illustration
KBD	container	Keyboard phrase, such as user input
LI	empty	List item
LISTING	container	Computer listing
MENU	container	Menu list
OL	container	Ordered or numbered list
P	empty	Paragraph
PRE	container	Preformatted text
SAMP	container	Sample text or characters
SELECT	container	Selection of option(s)
STRONG	container	Strong emphasis
TT	container	Typewriter text
UL	container	Unordered list
VAR	container	Variable phrase or substitutable
XMP	container	Example section

Note

As the HTML standard changes, elements will be deprecated, or replaced by new elements with greater functionality. Deprecated elements will still be supported by existing WWW viewers but may not be in the future. Be prepared to review your older HTML documents for deprecated elements that may no longer be useful.

Table 4.3 HTML Elements for Forms in HTML Documents

Element	Element Type	Description
FORM	container	Fill-out or data-entry form
INPUT	empty	Form input datum
TEXTAREA	container	Area for text input
OPTION	empty	Selection option
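
The form elements are easiest to understand when you see them working together. The following is a minimal sketch of a feedback form; the ACTION address and the field names are hypothetical, so substitute whatever your own server expects.

```html
<!-- The ACTION URL is a placeholder for a script on your own server -->
<FORM METHOD="POST" ACTION="/cgi-bin/feedback">
<P>Your name: <INPUT TYPE="text" NAME="visitor" SIZE="30">
<P>Favorite topic:
<!-- SELECT holds the list of OPTION choices -->
<SELECT NAME="topic">
<OPTION>Battle of Anzio
<OPTION>D-Day
</SELECT>
<P>Comments:<BR>
<!-- Any text between the TEXTAREA tags appears as the field's initial contents -->
<TEXTAREA NAME="comments" ROWS="4" COLS="40"></TEXTAREA>
<P><INPUT TYPE="submit" VALUE="Send">
</FORM>
```

In practice, SELECT and TEXTAREA are both written with closing tags, as shown here, while INPUT stands alone.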