Building and Processing XML in Java

Overview

This web page publishes Java source code that demonstrates how to build XML Document objects, and how to parse, validate and process XML. The code published here builds an XML Document from an arithmetic expression or assignment statement (e.g., x = 3 + 4 * 5). The resulting XML Document is then "serialized" to a String object. The String is read by an XML parser, validated and converted to an XML DOM object. The DOM tree is then traversed and the expression (or statement) is evaluated. The XML construction and evaluation objects are used to construct an interactive expression processor.

This web page provides an overview for the source code published here. This code is commented and will hopefully provide the best reference for building and processing XML.

XML: Those who cannot learn from history are doomed to repeat it

The development of XML started during the early years of the World Wide Web and the dot-com boom. Hype and exaggeration were characteristic of this time and XML got its share. I recall reading some early articles on XML, which claimed that the processing instructions for a document, like a Web page, could be provided in XML. By some magic process these instructions would be executed and the document would be transformed. How, exactly, this happened was not explained.

In fact there is little that is remarkable or revolutionary about XML. I frequently get the feeling that the people who worked on the standards for XML and the XML processors (e.g., parsers) know little about computer science. I have seen no evidence, for example, that the thirty years of knowledge about parsing has had any effect on XML Schema processing.

Using XML

XML is being used in more and more applications. As XML's popularity grows, path dependence sets in and XML is adopted simply because it has become a standard. And XML is not without its good points. XML is useful as a human readable "wire protocol" that can be used to encode information that is transmitted over a computer network. For example, stock market orders and processing instructions can be encoded in XML and sent to a trading system that will execute the order. The response from the trading system can also be sent back via XML.

Of course XML is not the first general purpose "wire protocol" to be proposed. The IDL standard was published by the Object Management Group and was used to transmit data between distributed CORBA objects.

As far as I am concerned XML has two advantages over older data protocols:

XML has become a widely accepted standard. For example, XML has largely displaced IDL in distributed applications.
Extensive, freely available Java (and C++) software libraries for processing XML can be obtained from the Apache Project and Sun Microsystems.

Building and Processing XML

Learning to build XML Document objects and process XML can be frustrating. The standards for XML objects, like the DOM (Document Object Model) standard, were written by the W3C. Although these documents are extensive, I have not found them very clear. I learned to write Java XML processors from Brett McLaughlin's book Java & XML, 2nd Edition (2001, O'Reilly Press). This book provided a good start, but it does not provide help with more than basic XML construction and processing.

There are a lot of easily forgettable details when it comes to working with XML. What classes need to be called to allocate a Document object for building an XML tree? How is a DOM parser initialized? How is the parser passed the XML character stream? If I have not been working with XML for a few months, I find that I have forgotten many of these details. One of the purposes of this web page is to provide a reference software base that is not associated with anyone else's copyright. As I note below, you may also use this software in any way you like, as long as you take responsibility for any associated risk.
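As a quick reminder of those details, here is a minimal sketch that uses the JAXP classes shipped with the JDK (DocumentBuilderFactory and DocumentBuilder) rather than the Xerces DOMParser class used in the software published here. The class name DomQuickReference and its method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the published software.
class DomQuickReference {
    /** Allocates an empty Document that elements can be added to. */
    static Document newDocument() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().newDocument();
    }

    /** Parses XML held in a String into a DOM Document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // InputSource wraps a character stream; StringReader turns the String into one.
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<EXPRESSION><INT>3</INT></EXPRESSION>");
        System.out.println(doc.getDocumentElement().getTagName()); // EXPRESSION
    }
}
```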

Building an XML Document

The simplest way to build an XML document is to construct a Java String object that contains the XML information. For example:

    import java.io.ByteArrayOutputStream;
    import java.io.PrintStream;

    String buildXML()
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        PrintStream ps = new PrintStream( bos );
        // Write to ps (not System.out) so the text is captured in bos
        ps.println("<EXPRESSION>");
        ps.println("  <EQUAL>");
        ps.println("    <IDENT>x</IDENT>");
        ps.println("    <PLUS>");
        ps.println("       <INT>3</INT>");
        ps.println("       <TIMES>");
        ps.println("          <INT>4</INT>");
        ps.println("          <INT>5</INT>");
        ps.println("       </TIMES>");
        ps.println("    </PLUS>");
        ps.println("  </EQUAL>");
        ps.println("</EXPRESSION>");
        ps.flush();
        String xml = bos.toString();
        return xml;
    } // buildXML

The buildXML method will return an XML document that represents the expression

   x = 3 + 4 * 5

If the XML document being created is relatively small and will not change much, this is a reasonable way to create XML. For large documents whose contents change, this approach may have problems. It is easy to make mistakes in properly terminating XML tags (e.g., leaving off an end tag). As this example shows, indentation must be inserted by hand. Adding attributes to elements can also be awkward.

The Document object provides an alternative which is used in the software published here. The Document object supports a createElement() method which creates an XML tag. The Document object takes care of adding the termination tag. Classes available from the Apache project can be used to serialize the Document to a string, in indented format for readability.
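A minimal sketch of this approach follows. It builds the same expression tree with createElement() and serializes it with the JDK's javax.xml.transform classes, which I have substituted here for the Apache serializer classes. The class BuildWithDom and its helper methods are hypothetical names used only for this example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class, not part of the published software.
class BuildWithDom {
    /** Creates a child element (with optional text) and appends it to parent. */
    static Element add(Document doc, Element parent, String tag, String text) {
        Element e = doc.createElement(tag);
        if (text != null) {
            e.appendChild(doc.createTextNode(text));
        }
        parent.appendChild(e);
        return e;
    }

    /** Builds the tree for x = 3 + 4 * 5. */
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("EXPRESSION");
        doc.appendChild(root);
        Element equal = add(doc, root, "EQUAL", null);
        add(doc, equal, "IDENT", "x");
        Element plus = add(doc, equal, "PLUS", null);
        add(doc, plus, "INT", "3");
        Element times = add(doc, plus, "TIMES", null);
        add(doc, times, "INT", "4");
        add(doc, times, "INT", "5");
        return doc;
    }

    /** Serializes the Document to an indented String; end tags are generated for us. */
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(build()));
    }
}
```

The exact indentation produced by the Transformer depends on the JDK version, so treat the printed layout as approximate.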

The ParseExpToXML class in the software published here is passed an arithmetic expression or assignment statement (along with the parameters needed for the XML Document object). It parses the expression using a technique called recursive descent and builds an XML Document object representing the expression. The object is then serialized to a String. An example is shown below:
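To give a flavor of the technique, here is a much reduced recursive descent sketch. It is not the ParseExpToXML class: the class name ExprToDom is hypothetical, it handles only +, *, parentheses, integers, identifiers and a single assignment, and it omits the namespace handling described below. Each grammar rule is one method, and each method returns the Element for the sub-tree it recognized.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical, simplified stand-in for a recursive descent XML builder.
class ExprToDom {
    private final String src;
    private final Document doc;
    private int pos = 0;

    private ExprToDom(String src) throws Exception {
        this.src = src;
        this.doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    static Document parse(String src) throws Exception {
        ExprToDom p = new ExprToDom(src);
        Element root = p.doc.createElement("EXPRESSION");
        p.doc.appendChild(root);
        root.appendChild(p.statement());
        if (p.peek() != '\0') {
            throw new IllegalArgumentException("unexpected '" + p.peek() + "' at " + p.pos);
        }
        return p.doc;
    }

    // statement := IDENT '=' expr | expr
    private Element statement() {
        int start = pos;
        if (Character.isLetter(peek())) {
            Element ident = factor();
            if (peek() == '=') {
                pos++;
                return node("EQUAL", ident, expr());
            }
            pos = start; // not an assignment: rewind and parse as an expression
        }
        return expr();
    }

    // expr := term ('+' term)*
    private Element expr() {
        Element left = term();
        while (peek() == '+') { pos++; left = node("PLUS", left, term()); }
        return left;
    }

    // term := factor ('*' factor)*
    private Element term() {
        Element left = factor();
        while (peek() == '*') { pos++; left = node("TIMES", left, factor()); }
        return left;
    }

    // factor := INT | IDENT | '(' expr ')'
    private Element factor() {
        char c = peek();
        if (c == '(') {
            pos++;
            Element inner = expr();
            if (peek() != ')') throw new IllegalArgumentException("expected ')' at " + pos);
            pos++;
            return inner;
        }
        int start = pos;
        if (Character.isDigit(c)) {
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
            return leaf("INT", src.substring(start, pos));
        }
        if (Character.isLetter(c)) {
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            return leaf("IDENT", src.substring(start, pos));
        }
        throw new IllegalArgumentException("unexpected input at " + pos);
    }

    /** Skips white space and returns the next character, or '\0' at end of input. */
    private char peek() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private Element leaf(String tag, String text) {
        Element e = doc.createElement(tag);
        e.appendChild(doc.createTextNode(text));
        return e;
    }

    private Element node(String tag, Element left, Element right) {
        Element e = doc.createElement(tag);
        e.appendChild(left);
        e.appendChild(right);
        return e;
    }

    public static void main(String[] args) throws Exception {
        Document d = parse("x = 3 + 4 * 5");
        System.out.println(d.getDocumentElement().getFirstChild().getNodeName()); // EQUAL
    }
}
```

Because term() is called from expr(), multiplication binds more tightly than addition, so TIMES ends up nested inside PLUS as in the XML shown above.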

<ex:EXPRESSION xmlns="http://www.bearcave.com/expression"
               xmlns:ex="http://www.bearcave.com/expression"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.bearcave.com/expression expression.xsd">
    <ex:EQUAL>
        <ex:IDENT>x</ex:IDENT>
        <ex:PLUS>
            <ex:INT>3</ex:INT>
            <ex:TIMES>
                <ex:INT>4</ex:INT>
                <ex:INT>5</ex:INT>
            </ex:TIMES>
        </ex:PLUS>
    </ex:EQUAL>
</ex:EXPRESSION>

This document is built with an XML namespace and a prefix on the tags. This allows the document to be embedded in another document without having the tag names collide with the surrounding document (e.g., in the case where the surrounding document also used the XML tag EXPRESSION).
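Namespace-qualified elements are created with createElementNS() rather than createElement(). The sketch below assumes the namespace URI and the ex prefix from the document above. The class name NamespacedElements is hypothetical, and only two elements are built to keep the example short.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class, not part of the published software.
class NamespacedElements {
    static final String NS = "http://www.bearcave.com/expression";

    static Document build() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        // The qualified name carries the prefix; the URI identifies the namespace.
        Element root = doc.createElementNS(NS, "ex:EXPRESSION");
        // Declare the prefix explicitly so that it appears when the tree is serialized.
        root.setAttributeNS("http://www.w3.org/2000/xmlns/", "xmlns:ex", NS);
        doc.appendChild(root);

        Element value = doc.createElementNS(NS, "ex:INT");
        value.setTextContent("3");
        root.appendChild(value);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Element root = build().getDocumentElement();
        System.out.println(root.getPrefix() + " / " + root.getLocalName() + " / " + root.getNamespaceURI());
    }
}
```

Code that processes the tree can then match on the namespace URI and local name (for example with getElementsByTagNameNS()), so it does not depend on which prefix the document happens to use.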

Parsing by Hand vs. Using a Parser Generator

I used a hand constructed recursive descent expression parser because it makes the code easier to understand. Even for simple assignment statements and expressions some thought and effort was needed to create a recursive descent parser. Parsers can be created with much less effort using a parser generator like ANTLR. Using a parser generator, the grammar is specified, along with the actions to be taken (in this case, construction of XML Element objects) for each grammar section. My ANTLR Examples web page includes an example of an expression parser.

Validation with an XML Schema

The ParseExpToXML object builds an XML document that is designed for schema validation. This can be seen in the attributes associated with the EXPRESSION document tag.

A schema is an XML document that defines the proper structure of another XML document. In this case the schema defines the proper structure for XML that represents arithmetic expressions and assignments.

If validation is turned on when the XML document is read by the XML parser (DOMParser.parse(), for example), the parser will make sure that the document conforms to the XML definition in the schema. If the XML document does not conform to the definition in the schema, an error will be reported.

The schema that defines the format for the XML expressions generated by ParseExpToXML can be found here (expression.xsd). This schema is also included in the .tar and .jar files, below. Note that the XML schema defines the same namespace that is referenced in the XML document that is generated by ParseExpToXML.
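The sketch below shows one way to turn on schema validation, assuming the JDK's javax.xml.validation classes rather than the Xerces DOMParser features used in the software published here. The inline schema is a tiny made-up stand-in (an EXPRESSION holding one INT), not expression.xsd, and the class name ValidatingParse is hypothetical. Note the ErrorHandler: without one that throws, validation errors are reported but parsing continues.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical example class, not part of the published software.
class ValidatingParse {
    // Made-up stand-in schema: an EXPRESSION element holding exactly one integer INT.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='EXPRESSION'><xs:complexType><xs:sequence>"
        + "<xs:element name='INT' type='xs:integer'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Parses and validates xml; throws SAXException if it does not conform to XSD. */
    static Document parse(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                                     .newSchema(new StreamSource(new StringReader(XSD)));
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setSchema(schema); // validate against the schema while parsing

        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignore warnings */ }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        parse("<EXPRESSION><INT>3</INT></EXPRESSION>");
        System.out.println("first document is valid");
        try {
            parse("<EXPRESSION><INT>three</INT></EXPRESSION>");
        } catch (SAXException e) {
            System.out.println("second document rejected: " + e.getMessage());
        }
    }
}
```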

Processing the XML Document as a DOM Object

The EvalXML object reads the XML into a DOM Object. Validation is enabled, so, in theory, the XML document conforms to the schema. The DOM object implements an interface to an XML document defined by W3C. This allows XML documents to be accessed as trees. The EvalXML object walks the XML tree and evaluates the expression or statement.
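The tree walk can be sketched as a recursive method that switches on the element name. This is not the EvalXML class: EvalDom is a hypothetical name, it supports only the element names that appear in the examples on this page, it skips schema validation, and it assumes a JDK recent enough to switch on strings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical, simplified stand-in for an XML expression evaluator.
class EvalDom {
    // Values assigned by EQUAL, looked up by IDENT.
    private final Map<String, Integer> symbols = new HashMap<>();

    /** Parses the XML text and evaluates the expression or statement it contains. */
    int evalXml(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // so getLocalName() ignores any prefix
        Element root = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)))
                              .getDocumentElement();
        return eval(root);
    }

    /** Recursively evaluates the sub-tree rooted at e. */
    int eval(Element e) {
        List<Element> kids = children(e);
        switch (e.getLocalName()) {
            case "EXPRESSION":
                return eval(kids.get(0));
            case "INT":
                return Integer.parseInt(e.getTextContent().trim());
            case "IDENT": {
                String name = e.getTextContent().trim();
                Integer value = symbols.get(name);
                if (value == null) throw new IllegalStateException("undefined: " + name);
                return value;
            }
            case "PLUS":
                return eval(kids.get(0)) + eval(kids.get(1));
            case "TIMES":
                return eval(kids.get(0)) * eval(kids.get(1));
            case "EQUAL": {
                int value = eval(kids.get(1));
                symbols.put(kids.get(0).getTextContent().trim(), value);
                return value;
            }
            default:
                throw new IllegalArgumentException("unknown element " + e.getLocalName());
        }
    }

    /** Returns the child elements of e, skipping white space text nodes. */
    private static List<Element> children(Element e) {
        List<Element> out = new ArrayList<>();
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) out.add((Element) n);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<EXPRESSION><EQUAL><IDENT>x</IDENT><PLUS><INT>3</INT>"
                   + "<TIMES><INT>4</INT><INT>5</INT></TIMES></PLUS></EQUAL></EXPRESSION>";
        System.out.println(new EvalDom().evalXml(xml)); // 23
    }
}
```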

Copyright and Permissions

The source code is published with the following copyright:

  Copyright Ian Kaplan 2004, Bear Products International
 
  You may use this code for any purpose, without restriction,
  including in proprietary code for which you charge a fee.
  In using this code you acknowledge that you understand its
  function completely and accept all risk in its use.

You may include this source code in Free Software projects, but you need to keep my copyright with my software. This copyright is less restrictive than the GNU Free Software "copyleft", since you may use this software in a "closed source" project.

Java Source Code and XML Expression Schema

The software can be downloaded in either UNIX tar archive format or in Java's jar archive format.

To unpack these files use either:

    tar xvf xmlexpr.tar
    jar xvf xmlexpr.jar

Doxygen Generated Documentation

The doxygen program can be used to generate HTML formatted documentation from Java (or C++) source code. Unlike Javadoc, doxygen can be used to generate documentation that includes source code, class diagrams and UML diagrams.

Doxygen formatted documentation can be found here

Java: compile once, run anywhere (well, anywhere where all the necessary .jar files have been installed)

I have been using Java (over C++) for more and more projects because of the huge, freely available class library that is available from Sun Microsystems and the Apache Project. The opportunity to reuse this huge software base reduces development effort. But it also complicates software release. You cannot compile and run the software published here unless you have the necessary libraries installed. These libraries must also be called out in your CLASSPATH environment variable.

Below I've listed the .jar files that were in my build and class path when I built and executed the XML code published here. You will need to have these .jars, or some subset, installed on your local system.

  jaxb-api.jar      (Sun Java Architecture for XML Binding (JAXB))
  jaxb-impl.jar     (Sun Java Architecture for XML Binding (JAXB))
  jaxb-libs.jar     (Sun Java Architecture for XML Binding (JAXB))
  jaxp-api.jar      (Sun Java API for XML Processing (JAXP))
  jaxr-api.jar      (Sun Java API for XML Registries (JAXR))
  jaxr-impl.jar     (Sun Java API for XML Registries (JAXR))
  resolver.jar      (Apache Xerces)
  xercesImpl.jar    (Apache Xerces)
  xml-apis.jar      (Apache Xerces)
  xmlParserAPIs.jar (Apache Xerces)
  j2sdk1.4.2_05     (Sun Microsystems Java "Standard Edition" Software Development Kit)

These .jar files can be obtained from the java.sun.com and apache.org web sites:

Sun Microsystems Java web site. Look for the link to XML.
Apache Project. Look for the XML and then the Xerces Java 2 link.

Related Web Pages

Parsing XML with SAX

The DOMParser and the DOM object it builds are useful for processing complex XML documents. However, DOM may be impractical for very large XML documents because of its memory use. Also, the construction and traversal of a DOM object has a computational cost.

The SAXParser is an alternative to the DOMParser. This web page publishes example SAXParser code that processes prototype messages that might be used by a Trade Engine, a software system that supports computer driven trading.

Processing XML with the XML Pull Parser

As noted on the web page Parsing XML with SAX, the way SAX processes XML is "ass backward". The SAX parser calls code in the application that parses the XML document. The XML Pull Parser uses a standard parsing architecture and can be called by the parsing application.

Building an in-memory tree with the Xml Pull Parser

This web page publishes a remarkably small object that builds an in-memory tree representation of an XML document using the XmlPullParser. A tree-to-XML serializer is also included.
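The difference between the two styles described in this list can be sketched with a small element counter. The SAX version hands control to the parser, which calls back into the application. The pull version keeps control in the application, which asks for the next event. I have assumed the JDK's StAX XMLStreamReader as the pull parser here, not the XmlPullParser library used on the pages above, and the class name PushVersusPull is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class contrasting callback and pull parsing.
class PushVersusPull {
    /** SAX: the parser drives and calls startElement() for every start tag. */
    static int countWithSax(String xml, final String tag) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    if (qName.equals(tag)) count[0]++;
                }
            });
        return count[0];
    }

    /** Pull: the application drives and asks the parser for the next event. */
    static int countWithPull(String xml, String tag) throws Exception {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(tag)) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<EXPRESSION><PLUS><INT>3</INT><TIMES><INT>4</INT><INT>5</INT></TIMES></PLUS></EXPRESSION>";
        System.out.println(countWithSax(xml, "INT") + " " + countWithPull(xml, "INT")); // 3 3
    }
}
```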

Ian Kaplan, August 2004