Clean HTML is the Microsoft Word Template that creates “Clean” HTML equivalents of the contents of Word 2000/2002 files. The template generates W3C-compliant HTML strings from Microsoft Word 2000/2002 documents.1 These HTML strings, created from highlighted portions or entire Word documents, can then be copied and pasted into other applications through the Windows Clipboard or saved to a text file. At the same time, Clean HTML translates a large number of typographic symbols (like curly quotes “” or the em dash —) and “extended” Latin characters into the appropriate HTML entities. Clean HTML takes advantage of the rich user interface of Microsoft Word 2000/2002 without sacrificing vendor-independent standards. Clean HTML also writes HTML 4.0 text files that can automatically link to external CSS and script files.
The following table summarizes the changes made to the Clean HTML template file and the “known issues.” Microsoft Word VBA Projects do not have versioning features by default, so the table marks each update by date:
| Date | Description |
|---|---|
| 9/10/04 | Updated the code to reflect locale-specific, built-in Style names for international versions of Word. |
| 9/7/04 | Clean HTML will work correctly in Office Word 2003 but the Clean HTML template has to be loaded with every editing session after a security warning prompt. Promise: A future version of this product will be based on .NET, its security technology, to avoid this issue (among others). Spending upwards of $400US on Authenticode stuff is out of the question! |
| 2/24/2003 | The way Clean HTML handles Word tables has been completely revised and is documented below. |
| 2/24/2003 | Improved support for Clean HTML Character Styles in combination with Paragraph Styles added. Support for the Word 2002 Target Frame… entry added to Hyperlink translation. |
| 8/29/2002 | Having multiple document windows open in Word 2002 may cause Clean HTML to disappear behind an open window. Flipping through windows using Alt + Tab helps. |
| 8/29/2002 | Currently Clean HTML ignores Bookmarks. This implies that a heading (e.g. a paragraph of style Heading 2) will not be translated if it is being used as a Bookmark. Note: Pasting HTML into a Microsoft Word file may cause Bookmarks to appear, even if Match Destination Formatting is used in Word 2002 and above. |
| 7/11/2002 | Fixed problems with List Bullet and List Number styles. |
| 7/11/2002 | Fixed various problems associated with the first character of the document being formatted. |
| 7/11/2002 | Had fatal problems with international versions of Microsoft Word that are not based on the English language. Clean HTML is not known to run correctly with international versions of Microsoft Word. NOTE: We believe this problem is corrected. See 9/10/04 above. |
| 7/11/2002 | “Hidden” characters are now shown during Clean HTML translation to prevent problems associated with Word 2002. |
| 1/20/2002 | The ability to save Clean HTML text files to disk was added. |
| 1/20/2002 | Clean HTML is unable to translate paragraphs highlighted within a table. Workaround: copy and paste the text into a new document without the table and run Clean HTML. |
| 1/20/2002 | Clean HTML moves through Footnote characters very, very slowly. |
Because of the awesome negative impact of “macro viruses,” Clean HTML will not deploy itself automatically after running some kind of setup program. Installation of Clean HTML must be done manually by copying Clean HTML.dot to a Word 2000/2002 Startup folder and loading it with the Templates and Add-Ins… command under the Word 2000/2002 Tools menu.2 The Templates and Add-Ins… dialog shown below should summarize the installation of Clean HTML:
After Clean HTML has been properly installed, it can be run from the Tools > Macro > Macros… dialog:
Or, if you are familiar with the customizing features of Word 2000/2002, you can assign Clean HTML to a button and place this button on one of the Tool Bars. You can set up a button that looks like this:
This means that you can run Clean HTML by pressing the button or typing Alt + c.
Clean HTML processes a Word 2000/2002 document with two logical loops. The first loop moves through the Collection of Paragraphs in a Word document. Within each Paragraph, the second loop moves through each Word.3 Additionally, there is a command to wrap the Clean HTML in document-level elements (including customizable meta elements) and save to an external file. Each loop looks for certain conditions to determine what to translate to HTML. Let’s call the first loop the Paragraph Loop—and the second loop the Word Loop. Let’s specify what each loop does:
| Condition | Response |
|---|---|
| A paragraph with Style Code Block. | Send this paragraph to the Word Loop (see below). Place the output of this loop in a This is a custom Style created for Clean HTML. |
| A paragraph with Style Hidden Block. | Clean HTML ignores this paragraph. This is a custom Style created for Clean HTML. |
| A paragraph with Style HTML Block. | Directly translate the text in this paragraph as HTML. This is a custom Style created for Clean HTML. |
| A paragraph with Style List Bullet. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in li elements. If it is the first paragraph of this style, prefix an opening ul element. Suffix the last paragraph with a closing ul element. |
| A paragraph with Style List Number. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in li elements. If it is the first paragraph of this style, prefix an opening ol element. Suffix the last paragraph with a closing ol element. |
| A paragraph with Style Normal. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in p elements. |
| A paragraph with Style Style Block. | Send this paragraph to the Word Loop (see below). Do not wrap the output of this loop in This style allows Word 2000/2002 paragraphs to be wrapped in customized HTML written in the Style HTML Block. This is a custom Style created for Clean HTML. |
| A paragraph with Style Heading 1. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h1 elements. |
| A paragraph with Style Heading 2. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h2 elements. |
| A paragraph with Style Heading 3. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h3 elements. |
| A paragraph with Style Heading 4. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h4 elements. |
| A paragraph with Style Heading 5. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h5 elements. |
| A paragraph with Style Heading 6. | Send this paragraph to the Word Loop (see below). Wrap the output of this loop in h6 elements. Additional headings are ignored. |
| A paragraph consisting of one in-line Shape. |
Translate to an HTML block containing one
If the in-line Shape is not connected to an external file, the Failing this, the formatting stops and an alert message displays.
The
and
where |
| A paragraph inside of a Table. |
If this paragraph is the first cell of the table then it is translated to a temporary marker of the form At the end of the Paragraph Loop these temporary tags are replaced with the ordered Collection of tables in the word document.
The first row of a table is translated to
The
where
The
The
where
The
or
where the
The |
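The Paragraph Loop rules for the plain block styles can be sketched outside of Word. The Python below is only an illustration under stated assumptions, not the template's VBA: `paragraph_loop` and its list of `(style, text)` pairs are hypothetical stand-ins for Word's Paragraphs collection, `html.escape` stands in for the Word Loop, and the Code Block, in-line Shape and Table rules are left out.

```python
from html import escape

# Word style name -> HTML element, following the Paragraph Loop table.
BLOCK_ELEMENTS = {
    "Normal": "p",
    **{f"Heading {n}": f"h{n}" for n in range(1, 7)},
}
LIST_ELEMENTS = {"List Bullet": "ul", "List Number": "ol"}


def paragraph_loop(paragraphs):
    """Translate (style, text) pairs into a list of HTML lines.

    `paragraphs` is a hypothetical stand-in for Word's Paragraphs
    collection; escape() stands in for the Word Loop.
    """
    out = []
    open_list = None  # "ul" or "ol" while inside a run of list paragraphs
    for style, text in paragraphs:
        if style == "Hidden Block":
            continue  # ignored entirely
        list_tag = LIST_ELEMENTS.get(style)
        # Close the current list when the run of list paragraphs ends.
        if open_list and list_tag != open_list:
            out.append(f"</{open_list}>")
            open_list = None
        if style == "HTML Block":
            out.append(text)  # passed through as raw HTML
        elif style == "Style Block":
            out.append(escape(text))  # translated but not wrapped
        elif list_tag:
            if open_list is None:
                out.append(f"<{list_tag}>")  # first paragraph of the run
                open_list = list_tag
            out.append(f"<li>{escape(text)}</li>")
        elif style in BLOCK_ELEMENTS:
            tag = BLOCK_ELEMENTS[style]
            out.append(f"<{tag}>{escape(text)}</{tag}>")
        # Any other style (for example Heading 7) is skipped in this sketch.
    if open_list:
        out.append(f"</{open_list}>")
    return out
```

Tracking the open list in one variable is a choice made for this sketch; it keeps the "prefix on the first paragraph, suffix on the last" rule in a single place.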
Clean HTML table formatting depends almost entirely on CSS Level 2 and its support in “mainstream” browsers. As of the revision date of this document, the support for CSS Level 2 is satisfactory, especially for the formatting of HTML tables. This dependency allows Clean HTML the highest level of flexibility. Clean HTML assigns a unique ID to each table, each row and even each cell on the HTML page. By using ID selectors (from CSS Level 1), a cascading style sheet can reach any table element including the table itself.
Tables translated to Clean HTML from different Word documents can still be uniquely identified by using the CSS_ID custom file property discussed below under “Translation of Word 2000/2002 File Properties into Clean HTML.”
The style block below is one of the simplest designs for HTML tables:
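<style type="text/css"><!--
/* Outer border for every table produced by Clean HTML. */
.cleanHTMLTable {
    border: solid 1px #000000;
    border-collapse: collapse;
}
/* Matching border for each header cell and data cell. */
.cleanHTMLTableHeader, .cleanHTMLTableData {
    border: solid 1px #000000;
}
-->
</style>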
Most importantly, the border-collapse property is the key to making an HTML table “look like” an HTML table rendered without Cascading Style Sheets. More advanced table formatting techniques are discussed in detail at W3.org.
The following table should translate into Clean HTML and produce the expected results:
| A Word Table with Horizontally Merged Cells | | | |
|---|---|---|---|
| Cell One | Cell Two | Cell Three | Cell Four |
| As of this writing, Clean HTML only supports horizontally merged cells across all columns of the table. | | | |
| Condition | Response |
|---|---|
| Characters formatted as Bold. | Translate to strong elements. This translation will not take place for paragraphs with heading styles (e.g. Heading 1). |
| Characters formatted as Italic. | Translate to em elements. This translation will not take place for paragraphs with heading styles (e.g. Heading 1). |
| Characters formatted as Underline (or any of the other underline character formats, e.g. Double Underline). | Translate to span elements with style attribute. (u elements have been deprecated in HTML 4.0.) |
| Characters formatted as Strikethrough (e.g.: strikethrough). | Translate to span elements with style attribute. (strike elements have been deprecated in HTML 4.0.) |
| Characters with Style Code Block. | Translate to code elements. Note that this is a paragraph-level Word style. |
| Characters with Style Code Line. | Translate to code elements. |
| Characters with Style Footnote Reference. Word automatically creates this Style when Footnotes are inserted. | Translate to Native Word 2000/2002 Endnotes are ignored. |
| Characters with Style Hyperlink. | Translate to Word Bookmark locations is not supported. |
| The Line Break character. | If this character is found in a paragraph of style Code Block, translate to carriage return and line feed characters; otherwise translate to the <br> element. |
| Characters generated by a Field object. | Translate according to the rules of the current Style names aforementioned. Warning: only fields that produce results (e.g. formulas generating string values) have been tested with Clean HTML. |
| Characters within Bookmarks. | Bookmarks are ignored by Clean HTML. |
| Characters within Comments. | Comments are ignored by Clean HTML. |
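The character-level rules of the Word Loop can be sketched in the same spirit. Everything named here is an assumption made for illustration: `word_loop`, the `(text, formats)` run pairs, and the `text-decoration` values in the span style attributes (this document does not show the exact style text Clean HTML writes). The line break is taken to be character code 11, Word's manual line break. Unlike the template, the sketch nests elements when a run carries several formats.

```python
from html import escape

# Inline wrappers, applied in this order so that bold italic text
# nests as <em><strong>...</strong></em>.
WRAPPERS = [
    ("code", "<code>", "</code>"),
    ("bold", "<strong>", "</strong>"),
    ("italic", "<em>", "</em>"),
    # Assumed style values; the source only says "span with style attribute".
    ("underline", '<span style="text-decoration: underline">', "</span>"),
    ("strikethrough", '<span style="text-decoration: line-through">', "</span>"),
]
LINE_BREAK = "\v"  # assumption: Word's manual line break, character code 11


def word_loop(runs, in_code_block=False, in_heading=False):
    """Translate (text, formats) runs into one HTML string.

    `formats` is a hypothetical set such as {"bold", "italic"}.
    """
    parts = []
    for text, formats in runs:
        if text == LINE_BREAK:
            # CR LF inside a Code Block paragraph, <br> everywhere else.
            parts.append("\r\n" if in_code_block else "<br>")
            continue
        html = escape(text)
        for name, start, end in WRAPPERS:
            if name not in formats:
                continue
            if in_heading and name in ("bold", "italic"):
                continue  # heading paragraphs suppress strong and em
            html = f"{start}{html}{end}"
        parts.append(html)
    return "".join(parts)
```

For example, `word_loop([("Title", {"bold"})], in_heading=True)` leaves the text unwrapped, which mirrors the heading exception in the table.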
The Word Loop looks for “special characters” to translate into HTML entities. The following table summarizes the characters supported by Clean HTML:
Character Codes Translated into Clean HTML

| 0–63 | 128–159 | 160–191 | 192–223 | 224–255 |
|---|---|---|---|---|
| " | € | ¡ | À | à |
| & | ƒ | ¢ | Á | á |
| < | „ | £ | Â | â |
| > | … | ¤ | Ã | ã |
| | † | ¥ | Ä | ä |
| | ‡ | ¦ | Å | å |
| | ˆ | § | Æ | æ |
| | ‰ | ¨ | Ç | ç |
| | Š | © | È | è |
| | ‹ | ª | É | é |
| | Œ | « | Ê | ê |
| | ‘ | ¬ | Ë | ë |
| | ’ | ® | Ì | ì |
| | “ | ¯ | Í | í |
| | ” | ° | Î | î |
| | • | ± | Ï | ï |
| | – | ² | Ð | ð |
| | — | ³ | Ñ | ñ |
| | ˜ | ´ | Ò | ò |
| | ™ | µ | Ó | ó |
| | š | ¶ | Ô | ô |
| | › | · | Õ | õ |
| | œ | ¸ | Ö | ö |
| | Ÿ | ¹ | × | ÷ |
| | | º | Ø | ø |
| | | » | Ù | ù |
| | | ¼ | Ú | ú |
| | | ½ | Û | û |
| | | ¾ | Ü | ü |
| | | ¿ | Ý | ý |
| | | | Þ | þ |
| | | | ß | ÿ |
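The entity translation can be approximated with Python's standard `html.entities.codepoint2name` table. This is a sketch under assumptions: the name `to_entities` is hypothetical, the standard table covers more characters than the ones listed above, and this document does not say whether Clean HTML writes named or numeric entities.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace each character that has a named HTML entity.

    Characters without an entry in codepoint2name pass through unchanged.
    """
    parts = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        parts.append(f"&{name};" if name else ch)
    return "".join(parts)
```

Under these assumptions, a string with curly quotes, an em dash and an accented letter, such as `"\u201cHi\u201d \u2014 caf\u00e9"`, becomes `&ldquo;Hi&rdquo; &mdash; caf&eacute;`.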
Here’s a “picture paragraph” of Clean HTML output:
Its original linking information, paragraph alignment and Alternative text settings are translated into Clean HTML. As of this writing, Hyperlink information assigned to objects other than Range objects is ignored. The next picture paragraph shows the Alternative text settings:
The Clean HTML output window shows the Save button. This command saves a text file of Clean HTML adding “document-level” tags according to the following rules:
| Rule | Remarks |
|---|---|
| The file will be written in Unicode text format. | Any information suggesting that the Unicode format has a negative impact on a system is currently beyond the scope of this document. |
| The HTML will be level 4.0 transitional. | This is denoted by the DOCTYPE declaration. |
| Word 2000/2002 File Properties are translated into meta and base elements. | Word 2000/2002 has both “built-in” file properties and Custom File Properties. Both of these property types are found in the Properties dialog under the File menu (see below for more details). |
| Clean HTML recognizes Custom File Properties that create references to external style sheets and script files. | Some of these Properties are undocumented (see below for more details). |
The dialog tabs shown below show a portion of the built-in and custom file properties:
The table below summarizes the translation of these properties into Clean HTML elements:
| Word Property | Clean HTML Element |
|---|---|
| General > Created, General > Modified | Each are directly translated into one |
| Summary > Title | Direct translation into the text within |
| Summary > Subject | Direct translation into one |
| Summary > Author | If a pipe-delimiter is not found then translate into one If a pipe-delimiter is found then assume that the text is a delimited string of form |
| Summary > Manager | Direct translation into one |
| Summary > Company | Direct translation into one |
| Summary > Category | Direct translation into one |
| Summary > Keywords | Direct translation into one |
| Summary > Comments | Direct translation into one |
| Summary > Hyperlink base | Direct translation into one |
| Custom > Name > [name], Custom > Value > [value] | These default, custom name-value pairs are scanned by Clean HTML and each pair directly translates into one |
| Custom > Name > CSS, Custom > Value > [URI] | This is a special name-value pair that Clean HTML translates into a Specifically, based on the example in the image above, we have: |
| Custom > Name > CSS_ID, Custom > Value > [string] | This is a special name-value pair that Clean HTML translates into uniquely identifying the id attribute of table elements, tr elements, th elements and td elements. |
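The Save rules can be pictured as a wrapper that surrounds the translated body with document-level elements. The sketch below is an assumption-heavy illustration: `wrap_document`, its parameters and the meta names in the usage line are hypothetical, and the use of a link element for the CSS property is assumed rather than taken from this document. Only the DOCTYPE string is the published W3C declaration for HTML 4.0 Transitional.

```python
from html import escape

# DOCTYPE for HTML 4.0 Transitional, as published by the W3C.
DOCTYPE = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">'
)


def wrap_document(body, title, properties, css_uri=None, base_uri=None):
    """Wrap translated body HTML in document-level elements.

    `properties` is a hypothetical dict of file-property names and
    values; the meta names Clean HTML really writes are not shown here.
    """
    head = [f"<title>{escape(title)}</title>"]
    if base_uri:
        head.append(f'<base href="{escape(base_uri)}">')
    for name, value in properties.items():
        head.append(f'<meta name="{escape(name)}" content="{escape(value)}">')
    if css_uri:
        # Assumption: the CSS custom property becomes a stylesheet link.
        head.append(
            f'<link rel="stylesheet" type="text/css" href="{escape(css_uri)}">'
        )
    lines = [DOCTYPE, "<html>", "<head>", *head, "</head>",
             "<body>", body, "</body>", "</html>"]
    return "\r\n".join(lines)
```

A call such as `wrap_document("<p>Hi</p>", "My Essay", {"keywords": "word, html"}, css_uri="site.css")` returns one string that could then be written to disk in a Unicode encoding, in line with the first rule in the table of Save rules.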
The idea behind Clean HTML is not to produce HTML that emulates the formatting of Office documents. That’s Microsoft’s job. Office XP subscribers may see some improvements on Microsoft’s previous attempts to produce useful HTML documents. In addition to product interoperability, we also need a clean HTML representation of the contents of Word documents to interact with systems based on vendor-agnostic standards.
The elements in Clean HTML are in lowercase. However, as of this writing, Clean HTML does not produce XHTML or HTML streams to be inserted into well-formed XML documents.
As of this writing, Clean HTML does not explicitly support HTML 4.01. This W3C specification is known as a “subversion” of HTML 4.0. Many of the popular browsers (like Mozilla) read the 4.01 DOCTYPE and show display “problems.” These problems may actually be the browser switching off “quirks mode” and rendering the HTML according to standards. More Clean HTML research in this area will be needed!
The specifications in this document imply that Clean HTML works best with relatively “simple” documents. A “simple” document would only have the default Word Styles (like Normal and Heading 1) plus the styles from Clean HTML. When a Word Document has a table of contents, columns, comments, text boxes, multiple versions, customized styles, form fields, etc., Clean HTML may not work properly or may produce unexpected results.
Clean HTML will not produce well-formed HTML from Word Range objects containing characters with “multiple formatting.” For example, a word formatted with Bold, Italic and Underline will produce poorly formed HTML. In terms of both traditional typography and Clean HTML processing, using multiple formatting is not recommended. However, the following table summarizes some workarounds:
| Word Formatting | Clean HTML |
|---|---|
| <em>Experimental Typography (Working With Computer Type, No 4)</em> | Experimental Typography (Working With Computer Type, No 4) |
| <em><strong>bold-italic</strong></em> | bold-italic |
| Heading | Heading Three |
Future enhancements to Clean HTML should include XHTML support. As of this writing, Clean HTML has been extensively tested on Word 2000. For obvious historical reasons, Clean HTML has not been tested extensively on Word XP (or Word 2002). A digital signature for Clean HTML will be forthcoming with increased sales of Clean HTML. For Clean HTML news, registration information, bug reports, etc., please mail.
Clean HTML is for that happy few who need to separate their lengthy prose from its visual HTML presentation. Such people enjoyed writing a good essay in Microsoft Word but did not enjoy how Microsoft decided to “help” publish that essay to the Internet. Now, with Clean HTML, we can still use Word and feel a little more secure that our word processing documents can move to the Internet very quickly, with links, endnotes, images and typographically correct characters, all under a standard Web-Consortium format, open to as many browsers as possible.
This document is rendered entirely via Clean HTML. Please support standards-enabling software and purchase Clean HTML today.
1 Clean HTML is not compatible with any other version of Microsoft Word (past or future). Clean HTML was only tested in Microsoft Windows installations capable of running Microsoft Office.
2 This manual procedure will also verify that you have enough security permissions to use Clean HTML without related errors.
3 Actually it loops through each character in each paragraph. Interestingly, there is no “Word” object in the Word VBA Object model. There is a Words Collection where each Item returns a Range Object. It turns out that the Range Object is as important to Word VBA as the Recordset Object is for Access VBA (or ADO).
4 “Top-level” implies that Word’s nested tables are not supported as of this writing.