On my planet, there are two types of HTML conversion features in Word 97: the features that survive intact when the file is opened and closed and the features that don’t. For Microsoft’s official treatment of this subject, please see KB article Q157086 (“WD97: Limitations of Converting from Word Format to HTML”).
This, by the way, is how Word 97 translates its “Simple” table of contents into HTML:
Using the HTML Conversion Features of Word 97 *
The Wrong Way to Edit HTML in Word 97 *
The Right Way to Edit HTML in Word 97 *
Exporting Word Document Properties *
Parting Comments *
Endnotes *
I suppose it can be useful. Perhaps it can be used to get around this document?
My unusual classification scheme of Word 97 HTML export features becomes reasonable when you start working on a native Word 97 file, save it as HTML, and then continue to edit the file, switching back and forth between HTML Source view and Online Layout view as you go. These are the elements that will disappear if you edit for HTML in this manner:
| Word Feature | Comments |
| --- | --- |
| Text marked with character style HTML Markup | Once you switch to HTML Source view (or close and open the HTML file), the text designated as HTML will be converted to Word 97 HTML formatting. For example, the entity |
| Equation Editor objects | Once you switch to HTML Source view (or close and open the HTML file), these are converted to GIF, forever. This applies to all other OLE Objects. |
| Tables | Only simple grid-like tables are translated faithfully. Background colors and merged cells do translate. TABLE tags are marked up with a fixed width while columns are in percentages. |
| Inserted Images | These images are converted to GIF (if not already JPEG). What is really cool is that it is possible to store the images and HTML in one “seed” Word 97 file. What is not cool is that images from previously exported HTML files are not removed and “pile up” over repeated saves. |
| Fields | Once you switch to HTML Source view (or close and open the HTML file), these are converted to text. |
| Table of Contents | This translates into links to anchor tags based on H1 through H4 styles. There are some caveats (see below). |
| Footnotes | Headers and Footers are not translated at all.1 |
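The “pile up” of stale images mentioned above can be spotted with a small script. What follows is only a minimal Python sketch, assuming the exported images sit in the same folder as the HTML file and are referenced through IMG SRC attributes. The function names and the file names in the example (`Image1.gif`, `Image2.gif`, `page.html`) are hypothetical, not something Word 97 is known to guarantee.

```python
import os
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the SRC value of every IMG tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names,
        # so <IMG SRC="..."> is matched as well as <img src="...">.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.add(os.path.basename(src).lower())


def referenced_images(html_text):
    """Return the lowercased file names of all images the HTML refers to."""
    collector = ImgCollector()
    collector.feed(html_text)
    return collector.sources


def orphaned_images(html_text, filenames, extensions=(".gif", ".jpg", ".jpeg")):
    """Return image file names that the HTML no longer references."""
    used = referenced_images(html_text)
    return sorted(
        name for name in filenames
        if name.lower().endswith(extensions) and name.lower() not in used
    )


if __name__ == "__main__":
    # Hypothetical export: one image is still in use, one is left over.
    sample = '<HTML><BODY><P><IMG SRC="Image2.gif"></P></BODY></HTML>'
    folder = ["Image1.gif", "Image2.gif", "page.html"]
    print(orphaned_images(sample, folder))
```

In practice the `filenames` argument could come from `os.listdir()` on the export folder. The sketch only reports candidates; deleting them is left to the person who knows what else lives in that folder.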
The most painless way to edit for HTML in Word 97 is to work in native Word 97 for as long as possible using the styles provided for HTML conversion. The way to get to these styles is to open a new HTML file using the template Blank Web Page:
This template is available in the Word 97 Web Documents Toolkit.
Once this blank file is open, select File > Save As Word Document to start your editing session in native Word 97 format. You can verify success by looking at the list of styles under Format > Style…:
Well, now is a good time to show what the “source code” of the original Word 97 document looks like:
Note that the document is showing all characters (the so-called “hidden” characters). The style HTML Markup is colored red and can only be seen when hidden characters are showing. You can also see that the table of contents field has its default Word 97 formatting removed. This picture is a portrait of a struggle against the marketing forces behind closed source software. Hard work.
“Some” of the Properties specified by the File > Properties command are translated into HTML META tags. For example, we can guess how the following Word Properties will translate:
By the way, one of the Word Properties shown above will not translate into a META tag. It will end up in a TITLE tag. It should be obvious which value that is. What is less obvious is how to get Word 97 to specify your own META information. This is done by editing the Custom tab:
The Custom Name UnknownHead_0_1_0 (and UnknownHead_0_1_1) is actually created by Word when it is importing a file with header information it does not understand. This can “force” Word to translate custom tags to the HEAD block during export.
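One way to confirm which properties actually reached the HEAD block is to list the TITLE and META tags of the exported file. The following is a minimal Python sketch, not a record of what Word 97 emits: the class and function names are hypothetical, and the sample markup is an illustrative assumption.

```python
from html.parser import HTMLParser


class HeadInspector(HTMLParser):
    """Records the TITLE text and every META NAME/CONTENT pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attrs.get("name")
            if name:
                # A META tag without CONTENT is stored with an empty string.
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def inspect_head(html_text):
    """Return (title, {meta name: meta content}) for an HTML document."""
    inspector = HeadInspector()
    inspector.feed(html_text)
    return inspector.title.strip(), inspector.meta


if __name__ == "__main__":
    # Hypothetical export, made up for illustration.
    sample = (
        "<HTML><HEAD><TITLE>My Page</TITLE>"
        '<META NAME="Author" CONTENT="A. Writer">'
        '<META NAME="Keywords" CONTENT="word, html">'
        "</HEAD><BODY></BODY></HTML>"
    )
    title, meta = inspect_head(sample)
    print(title)
    print(meta)
```

Running this against a file before and after editing the Custom tab shows whether a custom entry made it into the HEAD block.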
Hey, this is better than nothing! The main motivation for creating HTML from Word 97 is to take advantage of inline spell checking and VBA automation features. This technique is most effective on long documents with pictures (like this one).
Since I respect W3C DTDs, I need to build HTML with very few proprietary tags. The following table lists a few caveats:
|
Word Feature |
Comments |
|
Sporadically, Word 97 decides to wrap each |
|
Colored text |
This feature translates into |
|
Character Styles |
Word 97 often translates Character Styles and Paragraph Styles in the wrong order. For example, the inline element |
|
Fonts |
This feature translates into |
|
Inserted Hyperlinks |
All you get is the |
|
Inserted Images |
No control over |
|
Paragraph Styles |
The Preformatted Paragraph Style is exported as |
|
Table of Contents |
The anchor tags ( Check for hidden bookmarks under Insert > Bookmark… as these end up as anchor tags. Make sure that Show page numbers is selected or no anchors will be inserted. Page numbers are converted to asterisks. That sucks. |
|
Tables |
The Word 97 HTML converter is unable to determine if a
Minimized
Avoid the proprietary |
|
The Comment Style |
This translates into the mysterious |
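Because the table of contents depends on anchor tags, and hidden bookmarks also end up as anchors, it can help to audit the exported file for in-page links without a target and for anchors that nothing links to. The following is a minimal Python sketch under those assumptions. The names `AnchorAudit` and `audit_anchors` are hypothetical, and so are the anchor names in the example.

```python
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Collects named anchors and in-page links from A tags."""

    def __init__(self):
        super().__init__()
        self.targets = set()  # values of <A NAME="...">
        self.links = []       # fragment part of <A HREF="#...">

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        href = attrs.get("href")
        if name:
            self.targets.add(name)
        if href and href.startswith("#"):
            self.links.append(href[1:])


def audit_anchors(html_text):
    """Return (broken_links, unused_anchors) for in-page links."""
    audit = AnchorAudit()
    audit.feed(html_text)
    linked = set(audit.links)
    broken = sorted(linked - audit.targets)
    unused = sorted(audit.targets - linked)
    return broken, unused


if __name__ == "__main__":
    # Hypothetical export: one link has no target, one anchor is never linked.
    sample = (
        '<P><A HREF="#_Toc1">Intro</A> <A HREF="#_Toc2">End</A></P>'
        '<H1><A NAME="_Toc1">Intro</A></H1>'
        '<P><A NAME="_Hidden9">x</A></P>'
    )
    print(audit_anchors(sample))
```

Unused anchors in the second list are good candidates for the hidden bookmarks worth checking under Insert > Bookmark….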
By the way, this document was prepared for HTTP service by:
CLASS attributes to make tags refer to the style sheet specified by tweaking Word 97 Custom Properties.
This file you are reading has been validated by HTML-Kit and produces 0 errors, 11 warnings and 5 “other” comments. Most of the warnings point at tag nesting problems with the endnote at the end of this document. I had to correct that problem by hand.
I am certain that there is a better way to get “clean” HTML from a robust word processor—and Word 2000 has very little to do with it!
1However, using inserted Hyperlinks and Bookmarks can be the HTML equivalent of footnotes or, rather, endnotes.