Exploit XML-Word Interoperability
Discover new ways you can use XML to make Word more interoperable.
by Mitch Gitman
TechEd, May 27, 2004
Are you frustrated at how difficult it is to produce structured data with Microsoft Word for use outside the Office world? Hopefully, you haven't resorted to writing your own word processor. Instead, the answer might lie with the new XML features of Microsoft Office Word 2003.
In this article, I'll examine some of the new ways you can use XML to make Word more interoperable. I'll look at these from the standpoint of someone who has bought into Microsoft's broader vision of interoperability, revolving around the .NET Framework and the XML Schema language for defining the structure and typology of XML documents.
WordprocessingML
With Word 2003, you can now save a Word document as a text .xml file and not lose any data that would otherwise be present in a binary .doc file. The two file formats are equivalent, and Word can go back and forth seamlessly, opening a file in one and saving in the other.
The foundation for this XML support lies in Word's XML representation of its document model, called WordprocessingML, or just WordML. Actually, WordML is simply an XML schema for defining XML documents. Other products in Microsoft Office 2003 have their own schemas; for example, Excel 2003 uses SpreadsheetML.
Literally, WordML comprises eight namespace-assigned schemas. In any Word-generated WordML document, you'll find these schemas declared using these xmlns:prefix="namespace" combinations:
- w="http://schemas.microsoft.com/office/word/2003/wordml"
- v="urn:schemas-microsoft-com:vml"
- w10="urn:schemas-microsoft-com:office:word"
- sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
- aml="http://schemas.microsoft.com/aml/2001/core"
- wx="http://schemas.microsoft.com/office/word/2003/auxHint"
- o="urn:schemas-microsoft-com:office:office"
- dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
Some of these schemas merit an explanation:
- w: This is the core WordML schema that defines key elements and data types such as the root wordDocument element. You could get away with manually defining an entire WordML document without straying from this namespace.
- aml: The Annotation Markup Language schema defines annotations: comments, revision history, bookmarks. Often, this is the kind of extra data that delivers enormous value during the authoring process but can be stripped from the published document.
- wx: Auxiliary hints that Word ignores but that can be helpful to an outside XML processor.
- o: Properties that are common across the Office application suite.
You can download the .xsd files for each of these namespaces. Start at the Office 2003 XML Reference Schemas Licensing page (see Resources).
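To make the point about the core w namespace concrete, here is a minimal sketch in Python (not a language this article otherwise uses) that builds a WordML document from nothing but w-namespace elements and then reads the paragraph text back. The helper names build_minimal_wordml and paragraph_texts are made up for illustration, and the document is deliberately bare: it omits the styles, properties, and other namespaces that Word itself would write.

```python
import xml.etree.ElementTree as ET

# Core WordML namespace, as listed above.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def build_minimal_wordml(paragraphs):
    """Build a WordML document using only the core w namespace (hypothetical helper)."""
    ET.register_namespace("w", W_NS)
    root = ET.Element(f"{{{W_NS}}}wordDocument")
    body = ET.SubElement(root, f"{{{W_NS}}}body")
    for text in paragraphs:
        # Each paragraph (w:p) holds one run (w:r) holding one text node (w:t).
        p = ET.SubElement(body, f"{{{W_NS}}}p")
        r = ET.SubElement(p, f"{{{W_NS}}}r")
        t = ET.SubElement(r, f"{{{W_NS}}}t")
        t.text = text
    return ET.tostring(root, encoding="unicode")


def paragraph_texts(wordml):
    """Return the concatenated text of each top-level paragraph in the body."""
    root = ET.fromstring(wordml)
    ns = {"w": W_NS}
    return [
        "".join(t.text or "" for t in p.findall(".//w:t", ns))
        for p in root.findall("./w:body/w:p", ns)
    ]


doc = build_minimal_wordml(["Hello, WordML.", "Second paragraph."])
print(paragraph_texts(doc))
```

Any namespace-aware XML library would do the same job; ElementTree is used here only because it ships with Python.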
Well, this is all very well and good, but how does WordML relate to your own XML data structures?
Template + Schema
In Word 2003, the standard technique to incorporate your own XML into Word XML is to assign a custom XML schema to a document template (.dot file). Suppose you've done that, and a user wants to save a document based on that template. He or she can mark the "Save data only" checkbox in the Save As dialog box to save not a WordML document, but a (much briefer) XML document according to the custom schema. The document in memory can also be validated against the schema.
To get started, you should download and install the Microsoft Office Word 2003 XML Toolbox (see Resources), which shows up as a toolbar on Word. The Toolbox toolbar adds some special functionality to Word and makes some existing functionality more accessible. It's more useful for developers creating templates than for end users writing documents.
For example's sake, consider a ridiculously oversimplified XML schema representing a homework assignment:
<xs:schema xmlns:tns="urn:assign.demo"
    elementFormDefault="qualified"
    targetNamespace="urn:assign.demo"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="assignment" nillable="true"
      type="tns:Assignment" />
  <xs:complexType name="Assignment">
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="studentName"
            type="xs:string" />
        <xs:attribute name="date"
            type="xs:string" />
        <xs:attribute name="assignmentNumber"
            type="xs:string" />
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:schema>
This schema was generated from source code using the Xsd.exe tool found in the .NET Framework SDK. All the attributes are typed as strings because Word won't work with other attribute types, which is not such a bad thing anyway when the user interface doesn't have any validation smarts wired in.
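For orientation, a document saved with "Save data only" against this schema would be a single assignment element carrying three attributes and some text. The Python sketch below parses one such instance; the sample values and the read_assignment helper are invented for illustration, and the snippet does not validate against the schema.

```python
import xml.etree.ElementTree as ET

ASSIGN_NS = "urn:assign.demo"

# Hypothetical instance document; the attribute values are made up.
SAMPLE = (
    '<assignment xmlns="urn:assign.demo" studentName="Pat Example" '
    'date="2004-05-27" assignmentNumber="3">The essay text goes here.</assignment>'
)


def read_assignment(xml_text):
    """Return the three attributes and the text content as a dict."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{ASSIGN_NS}}}assignment":
        raise ValueError("root element is not assignment in urn:assign.demo")
    return {
        # The schema leaves attributes unqualified, so they carry no namespace.
        "studentName": root.get("studentName"),
        "date": root.get("date"),
        "assignmentNumber": root.get("assignmentNumber"),
        "text": root.text or "",
    }


print(read_assignment(SAMPLE))
```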
In the Tools menu, select Templates and Add-Ins. Click on the XML Schema tab, and you'll see a dialog (see Figure 1). (You could also use the Toolbox to get here.)
Clicking on the Add Schema button lets you locate an XML schema to attach to this template (or to a regular document). Doing so also adds the schema to Word's Schema Library, which makes the schema available Word-wide for possibly attaching to other documents.
Once you select an XML schema, you get a dialog where you can assign a prefix (see Figure 2).
After you OK out of these dialogs, an "XML Structure Task Pane" appears on the main screen. Within this pane, you can click on the root XSD element to apply it either to the entire document or to the current selection. Once you place the root element tags, you can place child element tags in between, and so on, recursively.
However, XML attributes (three in our example schema) don't show up in the main editing pane. Instead, you have to right-click the containing element's display in either the main pane or the task pane and select the Attributes menu item.
The attributes aren't directly accessible to fill in, so you can change the simplistic schema to make it rely on child elements instead. The template author, or end user, will see the tags in Figure 3.
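One possible shape for that revised schema is sketched below, embedded in a short Python snippet that checks it is well formed and lists the child elements it declares. The element name body is an assumption made for this sketch, and the schema used to produce the figures may differ.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Assumed revision: the three attributes become child elements, and a
# hypothetical "body" element holds the assignment text.
REVISED_SCHEMA = """<xs:schema xmlns:tns="urn:assign.demo"
    elementFormDefault="qualified"
    targetNamespace="urn:assign.demo"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="assignment" type="tns:Assignment" />
  <xs:complexType name="Assignment">
    <xs:sequence>
      <xs:element name="studentName" type="xs:string" />
      <xs:element name="date" type="xs:string" />
      <xs:element name="assignmentNumber" type="xs:string" />
      <xs:element name="body" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def child_element_names(schema_text):
    """List the element names declared inside the complex type's sequence."""
    root = ET.fromstring(schema_text)
    path = f"./{{{XS}}}complexType/{{{XS}}}sequence/{{{XS}}}element"
    return [element.get("name") for element in root.findall(path)]


print(child_element_names(REVISED_SCHEMA))
```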
If those tags seem a little too much for the end user, you can go to the "XML Structure Task Pane" and uncheck "Show XML tags in the document." If you do this, you should also specify placeholder text for each element, which you can do starting from the Toolbox or from the same Attributes context menu item. Then a user would see the changes (see Figure 4).
I also got rid of leading and trailing whitespace in this template, but I still found it all too easy to create a document that confused Word about when one element ended and the next one began.
The way to prevent this trouble is to go to Tools | Document Protection and—to gloss over the ensuing specific steps—lock the entire document from editing, then unlock the portions where the XML body text goes. The editable areas will display in light yellow.
You can see that, for "form-y" input data, Word isn't that good a fit, even when you try to shoehorn it as you just did. At this point, you might want to try on a sister Office product, InfoPath 2003 (see "InfoPath: Closing the Circuit on XML Transport").
Alternatively, a simple workaround in a Web application architecture would be to pull the form data (name, date, assignment number) from the Word document and have users input it in an HTML form instead. The actual content requiring Word's authoring chops could be pasted as XML text into a form text area or, simpler yet, users could just do a file upload of the whole Word document.
So Word has its limits at clueing in users about an XML document's structure. Nonetheless, you can programmatically manipulate the XML that lies underneath. You can choose from two approaches.
Adding XML to the Word API
The first approach involves adding XML to the Word API. Now, typically when you think of programming against the Word 2003 object model, you think of programming with COM type libraries using Visual Basic for Applications (VBA)—meaning unmanaged code. And typically when you think interoperability and dynamic XML, you think .NET Framework—meaning managed code. So first off, how do you bridge that gap?
The answer is Visual Studio .NET and an add-on product, Visual Studio Tools for the Microsoft Office System (see Resources). The technical answer is that these tools use the Office 2003 primary interop assemblies: DLLs that allow managed code to call unmanaged Office type libraries.
The two namespaces of particular interest are Microsoft.Office.Interop.Word and Microsoft.Office.Core, aliased respectively as Word and Office when you create a "Word Document" or "Word Template" project in Visual Studio .NET.
For XML support, new classes such as XMLNamespace and XSLTransform have been added to the Word class hierarchy. And new members have been added to some of the most central classes in that hierarchy. They include these properties:
Application.ArbitraryXMLSupportAvailable
Application.XMLNamespaces
Document.XMLSchemaReferences
Document.XMLSaveDataOnly
Range.XML
Selection.XML
Document.XMLNodes
Range.XMLNodes
Selection.XMLNodes
And these methods:
Range.InsertXML
Selection.InsertXML
The ArbitraryXMLSupportAvailable property of an Application object indicates whether custom XML schemas can be associated with documents or templates in the running copy of Word. Adding a namespace/schema to an Application object's XMLNamespaces collection works the same as interactively placing a schema in the Schema Library. Adding a namespace/schema to a Document object's XMLSchemaReferences collection resembles interactively adding a schema to a document.
The XML and InsertXML members trade in full WordML documents, with the w:wordDocument root, even though they obviously sometimes hold just document fragments. You don't need to view the full signatures to get the idea that XML, InsertXML, and XMLNodes are dealing essentially with text strings and DOM nodes.
And you can see how, once you have access to these APIs from within the .NET Framework, you can use them to jigger up Word to dynamically read from or write to a remote SOAP Web service. First generate a client proxy class using the Wsdl.exe tool in the .NET Framework SDK. You would also have to write code to translate between the raw XML strings or DOM nodes of the relevant new members and the already deserialized objects produced by Wsdl.exe from a WSDL document defining a Web service. A much trickier issue, though, arises if the Web service is not expecting or returning WordML in particular. The preceding XML properties do take an optional Boolean DataOnly parameter, which returns just the custom XML without the WordML markup.
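The translation step can be sketched outside Word as well. The Python snippet below starts from a hand-written string of the kind the XML property might return and reduces it to a data-only assignment element. It assumes the custom-schema element appears inline in its own namespace, wrapping the w:p paragraphs; the sample string and the extract_assignment helper are illustrative, not output captured from Word.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
ASSIGN_NS = "urn:assign.demo"

# Hand-written sample: a custom assignment element wrapping WordML paragraphs.
MIXED = f"""<w:wordDocument xmlns:w="{W_NS}" xmlns:ns0="{ASSIGN_NS}">
  <w:body>
    <ns0:assignment studentName="Pat Example" date="2004-05-27" assignmentNumber="3">
      <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
      <w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>
    </ns0:assignment>
  </w:body>
</w:wordDocument>"""


def extract_assignment(wordml):
    """Strip the WordML markup and return a data-only assignment element as text."""
    root = ET.fromstring(wordml)
    source = root.find(f".//{{{ASSIGN_NS}}}assignment")
    if source is None:
        raise ValueError("no assignment element found")
    # Join the text runs of each paragraph, one output line per paragraph.
    paragraphs = [
        "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        for p in source.iter(f"{{{W_NS}}}p")
    ]
    # Serialize the custom namespace as the default namespace (no prefix).
    ET.register_namespace("", ASSIGN_NS)
    result = ET.Element(f"{{{ASSIGN_NS}}}assignment", dict(source.attrib))
    result.text = "\n".join(paragraphs)
    return ET.tostring(result, encoding="unicode")


print(extract_assignment(MIXED))
```

Going the other way, from a Web service's data-only XML into WordML that InsertXML will accept, means wrapping the text back into w:p, w:r, and w:t elements.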
Binding to the WordML Schema
This brings us to the other programmatic alternative, which relies more on the .NET Framework than on Word and its APIs: binding to the WordML schema.
When you download the Word 2003 reference schemas, they show up at this path relative to Program Files:
Microsoft Office 2003 Developer Resources\Microsoft Office 2003 XML Reference Schemas\
This folder contains a folder, WordprocessingML Schemas, which itself contains the WordML XML Schema definitions.
Now, run Xsd.exe at the command line with the /classes option, specifying each of these .xsd files as arguments. This generates a source file; for example, the generated C# .cs file runs to more than 11,000 lines. You'll find in this file a wordDocumentElt class that binds to the complex type of the root w:wordDocument element. The wordDocumentElt class has among its fields a field of type bodyElt, which in turn has among its fields an array of type pElt for paragraphs, and so on.
Now you have the entire WordML document model as a set of .NET classes. So you could use the XmlSerializer class to serialize and deserialize between WordML-conformant XML documents and full Word-document object graphs, independent of Word.
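The same binding idea can be illustrated in miniature without the .NET tools. The Python sketch below hand-writes three small classes that loosely mirror the document, body, and paragraph levels, then round-trips them through WordML. It is not the output of Xsd.exe and it does not use XmlSerializer: the class names and the serialize and deserialize functions are stand-ins, and the model covers only paragraphs and text runs.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET

W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
W = f"{{{W_NS}}}"  # ElementTree's "{namespace}" tag prefix


@dataclass
class Paragraph:
    runs: List[str] = field(default_factory=list)  # text of each w:r


@dataclass
class Body:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class WordDocument:
    body: Body = field(default_factory=Body)


def deserialize(xml_text):
    """Read a WordML string into the object graph above."""
    root = ET.fromstring(xml_text)
    if root.tag != W + "wordDocument":
        raise ValueError("root element is not w:wordDocument")
    body = Body()
    body_element = root.find(W + "body")
    if body_element is not None:
        for p in body_element.findall(W + "p"):
            runs = [
                "".join(t.text or "" for t in r.findall(W + "t"))
                for r in p.findall(W + "r")
            ]
            body.paragraphs.append(Paragraph(runs))
    return WordDocument(body)


def serialize(doc):
    """Write the object graph back out as a WordML string."""
    ET.register_namespace("w", W_NS)
    root = ET.Element(W + "wordDocument")
    body = ET.SubElement(root, W + "body")
    for paragraph in doc.body.paragraphs:
        p = ET.SubElement(body, W + "p")
        for text in paragraph.runs:
            r = ET.SubElement(p, W + "r")
            ET.SubElement(r, W + "t").text = text
    return ET.tostring(root, encoding="unicode")


original = WordDocument(Body([Paragraph(["Hello, ", "world."]), Paragraph(["Bye."])]))
round_tripped = deserialize(serialize(original))
print(round_tripped == original)
```

The generated .NET classes cover the whole schema rather than this small subset, which is where the 11,000 lines come from.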
The question is, how can you use these objects? Certainly, if you're content merely to extract the raw text from a Word document, you're better off not doing any programming at all and just saving plain text from Word. The same goes if you're more interested in text patterns than XML document structures.
Now, if you want to pass data between a wordDocumentElt object and an object graph based on a custom XML schema, and the structural relationship between the objects and/or their potential serialized forms is not trivial, you begin to understand how binding the Word XML schemas to classes could be more than just a nifty little exercise. You can see the same value if you're using some other XML Schema binding framework besides the .NET Framework's combination of Xsd.exe and XmlSerializer.
Just so you don't start seeing the value in writing your own Word-compatible word processor.
About the Author
Mitch Gitman is a Java and .NET Framework developer specializing in Web services, XML, and interoperability. Reach Mitch at mgitman@usa.net.