Writing copy in Word

Perhaps the sorriest state of affairs in content management is that the most widely used word processor is viewed as incompatible with structured content management. The fault lay squarely with Microsoft, which forced organizations to choose between having content and having structure. Two seemingly inconsequential features of Word 2003 have changed that: XML export and style locking. On its own, each might have been dismissed, like many others, as arcane stuff that typical users wouldn't know or care about. And even looked at together, they may not easily be seen as a way of bridging the desktop author and the structured content repository. But together these two features are in fact a "silver bullet". Let's see why.

Word XML

Word 2003 documents can be saved as XML. The resulting document follows a published XML schema which is basically a mapping of the RTF markup to XML. As I mentioned in my first ThinCMS article, I had in the past produced Word documents by generating RTF (I still have nightmares). And I had consumed Word content using Office automation (and lots of praying). This article is about my system for extracting Word content, but let me say before moving on that the task of generating Word documents has also been revolutionized.

The Word XML format, being an XML serialization of RTF, is honestly not a pretty thing to behold. And those worn down by the ordeal of getting stuff out of Word might at first say "that's of no help - users will still just have to send us a 'save as text'." You need to say instead "yeah, it's ugly, but the content is in there, and it is in a structured form."

Why lock styles?

I'm going to start by telling you in broad terms how I now use Word. I create a Word document, lock the styles, and save it as a Word template (.dot file). I give this .dot file to content providers with instructions for saving it to their templates folder. Locking a document is done from the Word menu by selecting "Tools" and then "Protect Document".

Why do I lock styles? To get "structured" XML you need to control the semantics of the document. Users can add semantics to Word using styles. And you can control the grammar of those semantics by locking the styles. Unlike InfoPath or more "serious" semantic markup tools, you won't be able to dictate that a document must have certain semantic markup present. But you can at least control the domain of these semantics. And through the "style for following paragraph" setting, and through the text present in your Word template, you can give the user some pretty strong hints about what you expect or require.
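To make the idea of styles as semantic labels concrete, here is a minimal sketch that lists each paragraph of a Word 2003 XML fragment together with its style name. It is not part of the toolkit, which does this work in XSLT; the function name `paragraphsWithStyles` and the style name `PageTitle` are hypothetical, the code assumes a modern JavaScript runtime such as Node.js rather than Windows Script Host, and the regular expressions are a shortcut that only suits simple fragments.

```javascript
// Illustrative only: a real tool should use an XML parser or XSLT.
// Assumes the usual "w:" prefix for WordprocessingML 2003 elements.
function paragraphsWithStyles(wordXml) {
  const result = [];
  // Matches <w:p> ... </w:p> but not <w:pPr> ... </w:pPr>.
  const paragraphPattern = /<w:p(?:\s[^>]*)?>([\s\S]*?)<\/w:p>/g;
  for (const [, body] of wordXml.matchAll(paragraphPattern)) {
    // The paragraph style, if any, is recorded in <w:pStyle w:val="..."/>.
    const styleMatch = body.match(/<w:pStyle\s+w:val="([^"]*)"\s*\/>/);
    // The visible text lives in one or more <w:t> elements.
    const text = [...body.matchAll(/<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g)]
      .map((m) => m[1])
      .join("");
    result.push({ style: styleMatch ? styleMatch[1] : null, text: text });
  }
  return result;
}

// "PageTitle" is a made-up style name standing in for a locked template style.
const fragment =
  '<w:p><w:pPr><w:pStyle w:val="PageTitle"/></w:pPr>' +
  "<w:r><w:t>Writing copy in Word</w:t></w:r></w:p>";
console.log(paragraphsWithStyles(fragment));
```

With the styles locked, the set of values that can appear in the `style` field is limited to the ones defined in the template, which is what makes the later extraction step predictable.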

Just add XSLT

XSL Transformations (XSLT) is usually viewed (sometimes with suspicion) as a means of generating structured content. But it is just as much at home when used as a means of extracting structured content. That is how it will be employed in Word2WPCML. All of the heavy lifting of this tool is done by the XSL file Word2WPCML.xsl. At 18K it is big by my XSL file size standards. But this workhorse is doing several things:

  1. Extracting content as well-formed XHTML
  2. Extracting the document outline for use in building web navigation
  3. Extracting any images found in the document

The command-line XSLT utility nxslt.exe is used to apply this stylesheet to the Word XML documents being processed. nxslt.exe is a very high-quality, flexible, free, open-source command-line XSL processor. One of its key features is its support for exsl:document, an extension element for producing multiple result documents from a single transformation. An <exsl:document> element is used in word2wpcml to enable one Word XML document to generate multiple WPCML files - one for each "page" specified in the Word document.

wordml2wpcml.wsf - Orchestrating the processing pipeline

The name word2wpcml is technically no longer correct. The current version of this toolkit uses word2wpcml.xsl to generate an intermediate XML form which captures the metadata and XHTML from the Word XML. The actual WPCML files are generated in a following step. So let's look now at the overall workflow.

The application which orchestrates the overall processing is the Windows Script Host file wordml2wpcml.wsf ("the wsf" from now on). The wsf is run at a command prompt. Since it takes lots of arguments, I almost always create a bat or cmd file to avoid all that typing. Perhaps a quick sample would help explain:

cscript wordml2wpcml.wsf /path:"F:/VSS/ThinCMS/app/site/repository/activeinterface/thincms/_word2xml" /oper:CGJM /set:shared

This invocation says "process the folder …/_word2xml to extract pages and the sitemap, and instantiate the WPCML documents using the template set 'shared'".

The /oper argument is used to specify a list of processing operations. The currently implemented operations are:

  • Cleanup (C) - delete files in the destination folders
  • Generate (G) - generate content extraction files
  • Join (J) - join content extraction files with WPCML templates
  • Merge (M) - merge the sitemap files generated from each individual Word file into one
  • Hierarchize (H) - turn sitemap into a tree by specified page outline levels

The argument /oper:A is equivalent to /oper:CGJMH. "A" stands for "All operations".
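As a rough illustration of how such an operation string could be expanded into a list of steps, consider the sketch below. It is not taken from wordml2wpcml.wsf: the function name `expandOperations` is hypothetical, the handling of unknown or repeated letters is an assumption, and the code targets a modern JavaScript runtime such as Node.js rather than Windows Script Host.

```javascript
// The five letters and the meaning of "A" come from the list above;
// everything else here is an assumption made for illustration.
const ALL_OPERATIONS = "CGJMH";

function expandOperations(oper) {
  const requested = oper.toUpperCase();
  // "A" is shorthand for every operation.
  const letters = requested === "A" ? ALL_OPERATIONS : requested;
  const steps = [];
  for (const letter of letters) {
    if (!ALL_OPERATIONS.includes(letter)) {
      throw new Error("Unknown operation letter: " + letter);
    }
    // Assumption: a repeated letter is run only once.
    if (!steps.includes(letter)) steps.push(letter);
  }
  return steps;
}

console.log(expandOperations("A"));    // every operation
console.log(expandOperations("CGJM")); // everything except Hierarchize
```

This sketch keeps the letters in the order the caller wrote them. Whether the real script honors that order or always runs the operations in a fixed sequence is not stated here.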

The WSF is written in JavaScript - my language of choice for such utilities. Breaking down the processing into bite-sized pieces makes the utility easier to develop, understand, and debug.

Sitemap generation

The Generate operation creates for each Word XML file a flat list of the pages found in the file. For each page, it captures the name, caption, code, and link name specified by the content author in Word. These files are really sitemap precursors. They are saved into the "sitemaps" folder. A well-designed web site will have content organized into a tree, with each page optionally having child pages. The Hierarchize operation applies sitemap.xsl to the flat merged sitemap file to generate a final sitemap file with a tree of link nodes.
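The toolkit does this step in XSLT (sitemap.xsl), but the underlying idea of turning a flat list into a tree can be sketched in a few lines of JavaScript. The sketch assumes that each page carries a dotted numeric code such as 1.1.2 and that a page's parent is the code with its last segment removed; the function names and the `children` property are hypothetical, and the code targets a modern runtime such as Node.js.

```javascript
// Compare dotted numeric codes segment by segment, so "1.2" sorts before "1.10"
// and a parent ("1.1") always sorts before its children ("1.1.2").
function compareCodes(a, b) {
  const left = a.split(".").map(Number);
  const right = b.split(".").map(Number);
  for (let i = 0; i < Math.min(left.length, right.length); i++) {
    if (left[i] !== right[i]) return left[i] - right[i];
  }
  return left.length - right.length;
}

// Turn a flat array of { code, name, ... } pages into a tree of nodes.
function hierarchize(pages) {
  const byCode = new Map();
  const roots = [];
  const sorted = [...pages].sort((a, b) => compareCodes(a.code, b.code));
  for (const page of sorted) {
    const node = { ...page, children: [] };
    byCode.set(page.code, node);
    const lastDot = page.code.lastIndexOf(".");
    if (lastDot === -1) {
      roots.push(node); // a single-segment code is a top-level page
      continue;
    }
    const parent = byCode.get(page.code.slice(0, lastDot));
    // In this sketch a page whose parent is absent is simply left out.
    if (parent) parent.children.push(node);
  }
  return roots;
}
```

Calling `hierarchize` on the merged flat list returns the top-level pages, each with its descendants nested under `children`.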

Note that the current version of Word2WPCML is not very robust when it comes to missing nodes. If you have a content page with code 1.1 and also 1.1.2.1 and 1.1.2.2 but no 1.1.2, then you will most likely end up with a sitemap missing nodes 1.1.2.1 and 1.1.2.2. When playing the role of editor for a site, I review the provided content files and the generated sitemap and make sure nothing was dropped. In the next version I will be making it more robust by at least warning of nodes which do not have a parent.
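A check of this kind is easy to sketch. The function below is a hypothetical illustration, not the planned implementation: it assumes page codes are dotted strings whose parent is the code with its last segment removed, and it runs on a modern JavaScript runtime such as Node.js.

```javascript
// Return every code whose immediate parent code is not in the list.
// Single-segment codes (for example "1") are treated as top-level pages.
function findOrphanCodes(codes) {
  const known = new Set(codes);
  return codes.filter((code) => {
    const lastDot = code.lastIndexOf(".");
    if (lastDot === -1) return false;
    return !known.has(code.slice(0, lastDot));
  });
}

// The situation described above, with a top-level page "1" added:
const orphans = findOrphanCodes(["1", "1.1", "1.1.2.1", "1.1.2.2"]);
console.log(orphans); // [ '1.1.2.1', '1.1.2.2' ]
```

Running such a check before building the tree would let the script print a warning for each orphan instead of silently dropping it.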

As mentioned much earlier, the sitemap file is a key component of ThinCMS since it organizes all of the pages of a site into a logical structure and is used to auto-generate all navigation parts of the web site.

Sample

I have provided with the download the Word document and invocation command I used to create the content you are now observing. In the word2wpcml root folder you will find a file thincms.cmd which invokes the wordml2wpcml.wsf script with /oper:CGJMH (all operations).

Summary

My posting to the WSG CMS list last week was met with enthusiastic responses from people who would like to see more about what I do. So rather than wait until it is complete (software is never complete), I am going to just write about those pieces which I can get organized into a usable form. In the next article, I will be adding another processing operation to the script which will generate a simple XHTML preview of the collection of pages written in Word.


downloads: wordml2wpcml.zip (200K)