
Using the HTML Conversion Features of Word 97

On my planet, there are two types of HTML conversion features in Word 97: the features that survive intact when the file is opened and closed and the features that don’t. For Microsoft’s official treatment of this subject, please see KB article Q157086 (“WD97: Limitations of Converting from Word Format to HTML”).

This, by the way, is how Word 97 translates its “Simple” table of contents into HTML:

Using the HTML Conversion Features of Word 97 *
The Wrong Way to Edit HTML in Word 97 *
The Right Way to Edit HTML in Word 97 *
Exporting Word Document Properties *
Parting Comments *
Endnotes *

I suppose it can be useful. Perhaps it can be used to get around this document?

The “Wrong” Way to Edit HTML in Word 97

My unusual classification scheme of Word 97 HTML export features starts to make sense when you begin with a native Word 97 file, save it as HTML, and keep editing the HTML file, switching back and forth between HTML Source view and Online Layout view as you go. These are the elements that will disappear or change if you edit for HTML in this manner:

Word Feature

Comments

Text marked with character style HTML Markup

Once you switch to HTML Source view (or close and open the HTML file), the text designated as HTML will be converted to Word 97 HTML formatting. For example, the entity for the curly quote “ is converted to the generic &quot; entity.

Equation Editor objects

Once you switch to HTML Source view (or close and open the HTML file), these are converted to GIF, forever. This applies to all other OLE objects.

Tables

Only simple grid-like tables are translated faithfully. Background colors and merged cells do translate. TABLE tags are marked up with a fixed width while columns are in percentages.

Inserted Images

These images are converted to GIF (if not already JPEG). What is really cool is that it is possible to store the images and HTML in one “seed” Word 97 file.

What is not cool is that images from previously exported HTML files are not removed and “pile up” over repeated saves.

Fields

Once you switch to HTML Source view (or close and open the HTML file), these are converted to text.

Table of Contents

This translates into links to anchor tags based on H1 through H4 styles. There are some caveats (see below).

Footnotes

Headers and Footers are not translated at all. [1]
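
The image “pile up” described under Inserted Images can be checked outside Word. What follows is a minimal Python sketch, not a feature of Word 97 or any tool mentioned here: it collects the SRC of every IMG tag in an exported HTML file and reports image files that the HTML no longer references. The function names and the sample file names (Image1.gif and so on) are made up for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImageRefCollector(HTMLParser):
    """Collect the SRC value of every IMG tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Keep only the file name and compare case-insensitively.
                self.sources.add(Path(src).name.lower())


def referenced_images(html_text):
    """Return the lowercased file names of all images the HTML uses."""
    collector = ImageRefCollector()
    collector.feed(html_text)
    return collector.sources


def orphaned_images(html_text, file_names, extensions=(".gif", ".jpg")):
    """Return image file names that the HTML no longer references."""
    used = referenced_images(html_text)
    return sorted(
        name for name in file_names
        if name.lower().endswith(extensions) and name.lower() not in used
    )


# Hypothetical export: three images on disk, only one still in use.
sample = '<P><IMG SRC="Image3.gif" WIDTH=100 HEIGHT=50></P>'
on_disk = ["Image1.gif", "Image2.gif", "Image3.gif", "notes.txt"]
print(orphaned_images(sample, on_disk))  # prints ['Image1.gif', 'Image2.gif']
```

In practice the list of file names would come from the export directory (for example via os.listdir), and the report is only a list of candidates to review before deleting anything by hand.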

The “Right” Way to Edit HTML in Word 97

The most painless way to edit for HTML in Word 97 is to work in native Word 97 for as long as possible using the styles provided for HTML conversion. The way to get to these styles is to open a new HTML file using the template Blank Web Page:

This template is available in the Word 97 Web Documents Toolkit.

Once this blank file is open, select File > Save As Word Document to start your editing session in native Word 97 format. You can verify success by looking at the list of styles under Format > Style…:

Well, now is a good time to show what the “source code” of the original Word 97 document looks like:

Note that the document is showing all characters (the so-called “hidden” characters). The style HTML Markup is colored red and can only be seen when hidden characters are showing. You can also see that the table of contents field has its default Word 97 formatting removed. This picture is a portrait of a struggle against the marketing forces behind closed source software. Hard work.

Exporting Word Document Properties

“Some” of the Properties specified by the File > Properties command are translated into HTML META tags. For example, we can guess how the following Word Properties will translate:

By the way, one of the Word Properties shown above will not translate into a META tag. It will end up in a TITLE tag. It should be obvious which value that is. What is less obvious is how to get Word 97 to specify your own META information. This is done by editing the Custom tab:

The Custom Name UnknownHead_0_1_0 (and UnknownHead_0_1_1) is actually created by Word when it imports a file with header information it does not understand. Adding such a name by hand can “force” Word to write custom tags into the HEAD block during export.
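
One way to see which Properties actually made it into the export is to read the HEAD block back. The following Python sketch is only an illustration: the class and function names are my own, and the sample markup is hypothetical rather than real Word 97 output. It records the TITLE text and every META NAME/CONTENT pair it finds.

```python
from html.parser import HTMLParser


class HeadInspector(HTMLParser):
    """Record the TITLE text and every META name/content pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            # META tags without a NAME (e.g. HTTP-EQUIV) are skipped.
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def inspect_head(html_text):
    """Return (title, meta_dict) for the given HTML text."""
    inspector = HeadInspector()
    inspector.feed(html_text)
    return inspector.title.strip(), inspector.meta


# Hypothetical export; the META names here are assumptions.
sample = """<HTML><HEAD>
<TITLE>Using the HTML Conversion Features of Word 97</TITLE>
<META NAME="Author" CONTENT="A. Writer">
<META NAME="Keywords" CONTENT="Word 97, HTML">
</HEAD><BODY></BODY></HTML>"""

title, meta = inspect_head(sample)
print(title)
print(meta)
```

Comparing this output with the values typed into File > Properties shows which ones survived and which one went to the TITLE tag.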

Parting Comments

Hey, this is better than nothing! The main motivation for creating HTML from Word 97 is to take advantage of inline spell checking and VBA automation features. This technique is most effective on long documents with pictures (like this one).

Since I respect W3C DTDs, I need to build HTML with very few proprietary tags. The following table lists a few caveats:

Word Feature

Comments

Bullet lists

Sporadically, Word 97 decides to wrap each LI tag in its own UL tags.

Colored text

This feature translates into FONT tags. Yuck.

Character Styles

Word 97 often translates Character Styles and Paragraph Styles in the wrong order. For example, the inline element CODE precedes the block-level PRE tags. This usually happens when the Character Style is specified at the end or the beginning of a paragraph.

Fonts

This feature translates into FONT tags. Yuck. It is better to keep body text in the Default Paragraph Font of Normal style.

Inserted Hyperlinks

All you get is the HREF attribute of the A tag. No TARGET, no TITLE.

Inserted Images

No control over ALT or BORDER attributes. Images are saved in the same directory as the file.

Paragraph Styles

The Preformatted Paragraph Style is exported as PRE tags filled with either unnecessary BR tags (translated from Word Line Break characters) or P tags (translated from Word Paragraph characters).

Table of Contents

The anchor tags (<A></A>) may not nest correctly within the header tags.

Check for hidden bookmarks under Insert > Bookmark… as these end up as anchor tags.

Make sure that Show page numbers is selected or no anchors will be inserted. Page numbers are converted to asterisks. That sucks.

Tables

The Word 97 HTML converter is unable to determine if a TH tag is required.

Minimized <P> tags are used gratuitously in table cells with only one paragraph.

Avoid the proprietary BORDERCOLOR attribute by leaving the default border color of tables at Auto. If you have to change it back to Auto, verify that Borders and Shading… > Borders > Setting: is All.

The Comment Style

This translates into the mysterious COMMENT tag. It does not create HTML comments.
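
Several of the caveats above (FONT tags, the COMMENT tag, the BORDERCOLOR attribute) can be spotted mechanically after export. Here is a small Python sketch under my own assumptions: the set of flagged names is just the ones this table complains about, and the function name is hypothetical. It counts occurrences so you know how much cleanup is waiting.

```python
from collections import Counter
from html.parser import HTMLParser

# Markup the caveats table flags as proprietary or unwanted.
FLAGGED_TAGS = {"font", "comment"}
FLAGGED_ATTRS = {"bordercolor"}


class ProprietaryMarkupCounter(HTMLParser):
    """Count flagged tags and attributes in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag in FLAGGED_TAGS:
            self.counts[f"<{tag}>"] += 1
        for name, _value in attrs:
            if name in FLAGGED_ATTRS:
                self.counts[f"{name}="] += 1


def count_proprietary_markup(html_text):
    """Return a dict mapping each flagged item to its count."""
    counter = ProprietaryMarkupCounter()
    counter.feed(html_text)
    return dict(counter.counts)


# Hypothetical fragment, not real Word 97 output.
fragment = (
    '<TABLE BORDER=1 BORDERCOLOR="#000000"><TR><TD>'
    '<FONT COLOR="#ff0000">red</FONT></TD></TR></TABLE>'
    '<COMMENT>not an HTML comment</COMMENT>'
)
print(count_proprietary_markup(fragment))
```

This only reports; it does not rewrite anything, so the exported file stays untouched until you decide what to fix.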

By the way, this document was prepared for http service by:

  1. Exporting as HTML from Word 97.
  2. Opening in HTML-Kit.
  3. Adding a few CLASS attributes to make tags refer to the style sheet specified by tweaking Word 97 Custom Properties.
  4. Running the HTML Tidy plug-in in HTML-Kit. (May lead to a few Find/Replace operations.)
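
The Find/Replace operations in step 4 are not spelled out here. As one hypothetical example, the one-UL-per-LI output noted in the caveats table could be collapsed with a regular expression. This Python sketch is an assumption about what such a replacement might look like, not the operation I actually ran, and it will also join two lists that were meant to be separate if nothing but whitespace sits between them.

```python
import re

# A closing UL followed (after optional whitespace) by a plain opening UL.
ADJACENT_LISTS = re.compile(r"</ul>\s*<ul>", re.IGNORECASE)


def merge_adjacent_lists(html_text):
    """Join back-to-back UL blocks into one list."""
    return ADJACENT_LISTS.sub("\n", html_text)


before = "<UL><LI>one</LI></UL>\n<UL><LI>two</LI></UL>"
print(merge_adjacent_lists(before))
```

The same pattern-and-replace idea works in the Find/Replace dialog of any editor that supports regular expressions.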

This file you are reading has been validated by HTML-Kit and produces 0 errors, 11 warnings and 5 “other” comments. Most of the warnings point at tag nesting problems with the endnote at the end of this document. I had to correct that problem by hand.

I am certain that there is a better way to get “clean” HTML from a robust word processor—and Word 2000 has very little to do with it!

Endnotes

[1] However, using inserted Hyperlinks and Bookmarks can be the HTML equivalent of footnotes (or, rather, endnotes).

 
This document was last reviewed on Wednesday, August 25, 2004 at 06:44 PM PDT.
Copyright © 2008 by Bryan D. Wilhite. All rights reserved. No part of this material may be used or reproduced in any form or by any means, or stored in a database or retrieval system, without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this material for any purpose other than your own personal use is a violation of United States copyright laws.
