Javanotes 5.0, Section 11.6 -- A Brief Introduction to XML

Section 11.6

A Brief Introduction to XML

When data is saved to a file or transmitted over a network, it must be represented in some way that will allow the same data to be rebuilt later, when the file is read or the transmission is received. We have seen that there are good reasons to prefer textual, character-based representations in many cases, but there are many ways to represent a given collection of data as text. In this section, we'll take a brief look at one type of character-based data representation that has become increasingly common.

XML (eXtensible Markup Language) is a syntax for creating data representation languages. There are two aspects or levels of XML. On the first level, XML specifies a strict but relatively simple syntax. Any sequence of characters that follows that syntax is a well-formed XML document. On the second level, XML provides a way of placing further restrictions on what can appear in a document. This is done by associating a DTD (Document Type Definition) with an XML document. A DTD is essentially a list of things that are allowed to appear in the XML document. A well-formed XML document that has an associated DTD and that follows the rules of the DTD is said to be a valid XML document. The idea is that XML is a general format for data representation, and a DTD specifies how to use XML to represent a particular kind of data. (There is also an alternative to DTDs, known as XML schemas, for defining valid XLM documents, but let's ignore them here.)

There is nothing magical about XML. It's certainly not perfect. It's a very verbose language, and some people think it's ugly. On the other hand it's very flexible; it can be used to represent almost any type of data. It was built from the start to support all languages and alphabets. Most important, it has become an accepted standard. There is support in just about any programming language for processing XML documents. There are standard DTDs for describing many different kinds of data. There are many ways to design a data representation language, but XML is the one that has happened to come into widespread use. In fact, it has found its way into almost every corner of information technology. For example: There are XML languages for representing mathematical expressions (MathML), musical notation (MusicXML), molecules and chemical reactions (CML), vector graphics (SVG), and many other kinds of information. XML is used by OpenOffice and recent versions of Microsoft Office in the document format for office applications such as word processing, spreadsheets, and presentations. XML site syndication languages (RSS, ATOM) make it possible for web sites, newspapers, and blogs to make a list of recent headlines available in a standard format that can be used by other web sites and by web browsers; the same format is used to publish podcasts. And XML is a common format for the electronic exchange of business information.

My purpose here is not to tell you everything there is to know about XML. I will just explain a few ways in which it can be used in your own programs. In particular, I will not say anything further about DTDs and valid XML. For many purposes, it is sufficient to use well-formed XML documents with no associated DTDs.

11.6.1 Basic XML Syntax

An XML document looks a lot like an HTML document (see Subsection 6.2.3). HTML is not itself an XML language, since it does not follow all the strict XML syntax rules, but the basic ideas are similar. Here is a short, well-formed XML document:

<?xml version="1.0"?>
<simplepaint version="1.0">
   <background red='255' green='153' blue='51'/>
   <curve>
      <color red='0' green='0' blue='255'/>
      <symmetric>false</symmetric>
      <point x='83' y='96'/>
      <point x='116' y='149'/>
      <point x='159' y='215'/>
      <point x='216' y='294'/>
      <point x='264' y='359'/>
      <point x='309' y='418'/>
      <point x='371' y='499'/>
      <point x='400' y='543'/>
   </curve>
   <curve>
      <color red='255' green='255' blue='255'/>
      <symmetric>true</symmetric>
      <point x='54' y='305'/>
      <point x='79' y='289'/>
      <point x='128' y='262'/>
      <point x='190' y='236'/>
      <point x='253' y='209'/>
      <point x='341' y='158'/>
   </curve>
</simplepaint>

The first line, which is optional, merely identifies this as an XML document. This line can also specify other information, such as the character encoding that was used to encode the characters in the document into binary form. If this document had an associated DTD, it would be specified in a "DOCTYPE" directive on the next line of the file.

Aside from the first line, the document is made up of elements, attributes, and textual content. An element starts with a tag, such as <curve> and ends with a matching end-tag such as </curve>. Between the tag and end-tag is the content of the element, which can consist of text and nested elements. (In the example, the only textual content is the true or false in the <symmetric> elements.) If an element has no content, then the opening tag and end-tag can be combined into a single empty tag, such as <point x='83' y='96'/>, which is an abbreviation for <point x='83' y='96'></point>. A tag can include attributes such as the x and y in <point x='83' y='96'/> or the version in <simplepaint version="1.0">. A document can also include a few other things, such as comments, that I will not discuss here.

The basic structure should look familiar to someone familiar with HTML. The most striking difference is that in XML, you get to choose the tags. Whereas HTML comes with a fixed, finite set of tags, with XML you can make up meaningful tag names that are appropriate to your application and that describe the data that is being represented. (For an XML document that uses a DTD, it's the author of the DTD who gets to choose the tag names.)

Every well-formed XML document follows a strict syntax. Here are some of the most important syntax rules: Tag names and attribute names in XML are case sensitive. A name must begin with a letter and can contain letters, digits and certain other characters. Spaces and ends-of-line are significant only in textual content. Every tag must either be an empty tag or have a matching end-tag. By "matching" here, I mean that elements must be properly nested; if a tag is inside some element, then the matching end-tag must also be inside that element. A document must have a root element, which contains all the other elements. The root element in the above example has tag name simplepaint. Every attribute must have a value, and that value must be enclosed in quotation marks; either single quotes or double quotes can be used for this. The special characters < and &, if they appear in attribute values or textual content, must be written as < and &. "<" and "&" are examples of entities. The entities >, ", and ' are also defined, representing >, double quote, and single quote. (Additional entities can be defined in a DTD.)

While this description will not enable you to understand everything that you might encounter in XML documents, it should allow you to design well-formed XML documents to represent data structures used in Java programs.

11.6.2 XMLEncoder and XMLDecoder

We will look at two approaches to representing data from Java programs in XML format. One approach is to design a custom XML language for the specific data structures that you want to represent. We will consider this approach in the next subsection. First, we'll look at an easy way to store data in XML files and to read those files back into a program. The technique uses the classes XMLEncoder and XMLDecoder. These classes are defined in the package java.beans. An XMLEncoder can be used to write objects to an OutputStream in XML form. An XMLDecoder can be used to read the output of an XMLEncoder and reconstruct the objects that were written by it. XMLEncoder and XMLDecoder have much the same functionality as ObjectOutputStream and ObjectInputStream and are used in much the same way. In fact, you don't even have to know anything about XML to use them. However, you do need to know a little about Java beans.

A Java bean is just an object that has certain characteristics. The class that defines a Java bean must be a public class. It must have a constructor that takes no parameters. It should have a "get" method and a "set" method for each of its important instance variables. (See Subsection 5.1.3). The last rule is a little vague. The idea is that is should be possible to inspect all aspects of the object's state by calling "get" methods, and it should be possible to set all aspects of the state by calling "set" methods. A bean is not required to implement any particular interface; it is recognized as a bean just by having the right characteristics. Usually, Java beans are passive data structures that are acted upon by other objects but don't do much themselves.

XMLEncoder and XMLDecoder can't be used with arbitrary objects; they can only be used with beans. When an XMLEncoder writes an object, it uses the "get" methods of that object to find out what information needs to be saved. When an XMLDecoder reconstructs an object, it creates the object using the constructor with no parameters and it uses "set" methods to restore the object's state to the values that were saved by the XMLEncoder. (Some standard java classes are processed using additional techniques. For example, a different constructor might be used, and other methods might be used to inspect and restore the state.)

For an example, we return to the same SimplePaint example that was used in Subsection 11.3.4. Suppose that we want to use XMLEncoder and XMLDecoder to create and read files in that program. Part of the data for a SimplePaint sketch is stored in objects of type CurveData, defined as:

private static class CurveData {
   Color color;  // The color of the curve.
   boolean symmetric;  // Are reflections also drawn?
   ArrayList<Point> points;  // The points on the curve.
}

To use such objects with XMLEncoder and XMLDecoder, we have to modify this class so that it follows the Java bean pattern. The class has to be public, and we need get and set methods for each instance variable. This gives:

public static class CurveData {
   private Color color;  // The color of the curve.
   private boolean symmetric;  // Are reflections also drawn?
   private ArrayList<Point> points;  // The points on the curve.
   public Color getColor() {
      return color;
   }
   public void setColor(Color color) {
      this.color = color;
   }
   public ArrayList<Point> getPoints() {
      return points;
   }
   public void setPoints(ArrayList<Point> points) {
      this.points = points;
   }
   public boolean isSymmetric() {
      return symmetric;
   }
   public void setSymmetric(boolean symmetric) {
      this.symmetric = symmetric;
   }
}

I didn't really need to make the instance variables private, but bean properties are usually private and are accessed only through their get and set methods.

At this point, we might define another bean class, SketchData, to hold all the necessary data for representing the user's picture. If we did that, we could write the data to a file with a single output statement. In my program, however, I decided to write the data in several pieces.

An XMLEncoder can be constructed to write to any output stream. The output stream is specified in the encoder's constructor. For example, to create an encoder for writing to a file:

XMLEncoder encoder; 
try {
   FileOutputStream stream = new FileOutputStream(selectedFile); 
   encoder = new XMLEncoder( stream );
     .
     .

Once an encoder has been created, its writeObject() method is used to write objects, coded into XML form, to the stream. In the SimplePaint program, I save the background color, the number of curves in the picture, and the data for each curve. The curve data are stored in a list of type ArrayList<CurveData> named curves. So, a complete representation of the user's picture can be created with:

   encoder.writeObject(getBackground());
   encoder.writeObject(new Integer(curves.size()));
   for (CurveData c : curves)
      encoder.writeObject(c);
   encoder.close();

When reading the data back into the program, an XMLDecoder is created to read from an input file stream. The objects are then read, using the decoder's readObject() method, in the same order in which they were written. Since the return type of readObject() is Object, the returned values must be type-cast to their correct type:

   Color bgColor = (Color)decoder.readObject();
   Integer curveCt = (Integer)decoder.readObject();
   ArrayList<CurveData> newCurves = new ArrayList<CurveData>();
   for (int i = 0; i < curveCt; i++) {
      CurveData c = (CurveData)decoder.readObject();
      newCurves.add(c);
   }
   decoder.close();
   curves = newCurves; // Replace the program's data with data from the file.
   setBackground(bgColor);
   repaint();

You can look at the sample program SimplePaintWithXMLEncoder.java to see this code in the context of a complete program. Files are created by the method doSaveAsXML() and are read by doOpenAsXML().

The XML format used by XMLEncoder and XMLDecoder is more robust than the binary format used for object streams and is more appropriate for long-term storage of objects in files.

11.6.3 Working With the DOM

The output produced by an XMLEncoder tends to be long and not very easy for a human reader to understand. It would be nice to represent data in a more compact XML format that uses meaningful tag names to describe the data and makes more sense to human readers. We'll look at yet another version of SimplePaint that does just that. See SimplePaintWithXML.java for the source code. The sample XML document shown earlier in this section was produced by this program. I designed the format of that document to represent all the data needed to reconstruct a picture in SimplePaint. The document encodes the background color of the picture and a list of curves. Each <curve> element contains the data from one object of type CurveData.

It is easy enough to write data in a customized XML format, although we have to be very careful to follow all the syntax rules. Here is how I write the data for a SimplePaint picture to a PrintWriter, out:

out.println("<?xml version=\"1.0\"?>");
out.println("<simplepaint version=\"1.0\">");
Color bgColor = getBackground();
out.println("   <background red='" + bgColor.getRed() + "' green='" +
      bgColor.getGreen() + "' blue='" + bgColor.getBlue() + "'/>");
for (CurveData c : curves) {
   out.println("   <curve>");
   out.println("      <color red='" + c.color.getRed() + "' green='" +
         c.color.getGreen() + "' blue='" + c.color.getBlue() + "'/>");
   out.println("      <symmetric>" + c.symmetric + "</symmetric>");
   for (Point pt : c.points)
      out.println("      <point x='" + pt.x + "' y='" + pt.y + "'/>");
   out.println("   </curve>");
}
out.println("</simplepaint>");

Reading the data back into the program is another matter. To reconstruct the data structure represented by the XML Document, it is necessary to parse the document and extract the data from it. Fortunately, Java has a standard API for parsing and processing XML Documents. (Actually, it has two, but we will only look at one of them.)

A well-formed XML document has a certain structure, consisting of elements containing attributes, nested elements, and textual content. It's possible to build a data structure in the computer's memory that corresponds to the structure and content of the document. Of course, there are many ways to do this, but there is one common standard representation known as the Document Object Model, or DOM. The DOM specifies how to build data structures to represent XML documents, and it specifies some standard methods for accessing the data in that structure. The data structure is a kind of tree whose structure mirrors the structure of the document. The tree is constructed from nodes of various types. There are nodes to represent elements, attributes, and text. (The tree can also contain several other types of node, representing aspects of XML that we can ignore here.) Attributes and text can be processed without directly manipulating the corresponding nodes, so we will be concerned almost entirely with element nodes.

The sample program XMLDemo.java lets you experiment with XML and the DOM. It has a text area where you can enter an XML document. Initially, the input area contains the sample XML document from this section. When you click a button named "Parse XML Input", the program will attempt to read the XML from the input box and build a DOM representation of that document. If the input is not legal XML, an error message is displayed. If it is legal, the program will traverse the DOM representation and display a list of elements, attributes, and textual content that it encounteres. (The program uses a few techniques that I won't discuss here.) Here is an applet version of the program for you to try:

In Java, the DOM representation of an XML document file can be created with just two statements. If selectedFile is a variable of type File that represents the XML file, then

DocumentBuilder docReader 
                 = DocumentBuilderFactory.newInstance().newDocumentBuilder();
xmldoc = docReader.parse(selectedFile);

will open the file, read its contents, and build the DOM representation. The classes DocumentBuilder and DocumentBuilderFactory are both defined in the package javax.xml.parsers. The method docReader.parse() does the actual work. It will throw an exception if it can't read the file or if the file does not contain a legal XML document. If it succeeds, then the value returned by docReader.parse() is an object that represents the entire XML document. (This is a very complex task! It has been coded once and for all into a method that can be used very easily in any Java program. We see the benefit of using a standardized syntax.)

The structure of the DOM data structure is defined in the package org.w3c.dom, which contains several data types that represent an XML document as a whole and the individual nodes in a document. The "org.w3c" in the name refers to the World Wide Web Consortium, W3C, which is the standards organization for the Web. DOM, like XML, is a general standard, not just a Java standard. The data types that we need here are Document, Node, Element, and NodeList. (They are defined as interfaces rather than classes, but that fact is not relevant here.) We can use methods that are defined in these data types to access the data in the DOM representation of an XML document.

An object of type Document represents an entire XML document. The return value of docReader.parse() -- xmldoc in the above example -- is of type Document. We will only need one one method from this class: If xmldoc is of type Document, then

xmldoc.getDocumentElement()

returns a value of type Element that represents the root element of the document. (Recall that this is the top-level element that contains all the other elements.) In the sample XML document from earlier in this section, the root element consists of the tag <simplepaint version="1.0">, the end-tag </simplepaint>, and everything in between. The elements that are nested inside the root element are represented by their own nodes, which are said to be children of the root node. An object of type Element contains several useful methods. If element is of type Element, then we have:

element.getTagName() -- returns a String containing the name that is used in the element's tag. For example, the name of a <curve> element is the string "curve".
element.getAttribute(attrName) -- if attrName is the name of an attribute in the element, then this method returns the value of that attribute. For the element, <point x="83" y="42"/>, element.getAttribute("x") would return the string "83". Note that the return value is always a String, even if the attribute is supposed to represent a numerical value. If the element has no attribute with the specified name, then the return value is an empty string.
element.getTextContent() -- returns a String containing all the textual content that is contained in the element. Note that this includes text that is contained inside other elements that are nested inside the element.
element.getChildNodes() -- returns a value of type NodeList that contains all the Nodes that are children of the element. The list includes nodes representing other elements and textual content that are directly nested in the element (as well as some other types of node that I don't care about here). The getChildNodes() method makes it possible to traverse the entire DOM data structure by starting with the root element, looking at children of the root element, children of the children, and so on. (There is a similar method that returns the attributes of the element, but I won't be using it here.)
element.getElementsByTagName(tagName) -- returns a NodeList that contains all the nodes representing all elements that are are nested inside element and which have the given tag name. Note that this includes elements that are nested to any level, not just elements that are directly contained inside element. The getElementsByTagName() method allows you to reach into the document and pull out specific data that you are interested in.

An object of type NodeList represents a list of Nodes. It does not use the API defined for lists in the Java Collection Framework. Instead, a value, nodeList, of type NodeList has two methods: nodeList.getLength() returns the number of nodes in the list, and nodeList.item(i) returns the node at position i, where the positions are numbered 0, 1, ..., nodeList.getLength() - 1. Note that the return value of nodeList.get() is of type Node, and it might have to be type-cast to a more specific node type before it is used.

Knowing just this much, you can do the most common types of processing of DOM representations. Let's look at a few code fragments. Suppose that in the course of processing a document you come across an Element node that represents the element

<background red='255' green='153' blue='51'/>

This element might be encountered either while traversing the document with getChildNodes() or in the result of a call to getElementsByTagName("background"). Our goal is to reconstruct the data structure represented by the document, and this element represents part of that data. In this case, the element represents a color, and the red, green, and blue components are given by the attributes of the element. If element is a variable that refers to the node, the color can be obtained by saying:

int r = Integer.parseInt( element.getAttribute("red") );
int g = Integer.parseInt( element.getAttribute("green") );
int b = Integer.parseInt( element.getAttribute("blue") );
Color bgColor = new Color(r,g,b);

Suppose now that element refers to the node that represents the element

<symmetric>true</symmetric>

In this case, the element represents the value of a boolean variable, and the value is encoded in the textual content of the element. We can recover the value from the element with:

String bool = element.getTextContent();
boolean symmetric;
if (bool.equals("true"))
   symmetric = true;
else
   symmetric = false;

Next, consider an example that uses a NodeList. Suppose we encounter an element that represents a list of Points:

<pointlist>
   <point x='17' y='42'/>   
   <point x='23' y='8'/>   
   <point x='109' y='342'/>   
   <point x='18' y='270'/>   
</pointlist>

Suppose that element refers to the node that represents the <pointlist> element. Our goal is to build the list of type ArrayList<Point> that is represented by the element. We can do this by traversing the NodeList that contains the child nodes of element:

ArrayList<Point> points = new ArrayList<Point>();
NodeList children = element.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
   Node child = children.item(i);   // One of the child nodes of element.
   if ( child instanceof Element ) {
      Element pointElement = (Element)child;  // One of the <point> elements.
      int x = Integer.parseInt( pointElement.getAttribute("x") );
      int y = Integer.parseInt( pointElement.getAttribute("y") );
      Point pt = new Point(x,y); // Create the Point represented by pointElement.
      points.add(pt);            // Add the point to the list of points.
   }
}

All the nested <point> elements are children of the <pointlist> element. The if statement in this code fragment is necessary because an element can have other children in addition to its nested elements. In this example, we only want to process the children that are elements.

All these techniques can be employed to write the file input method for the sample program SimplePaintWithXML.java. When building the data structure represented by an XML file, my approach is to start with a default data structure and then to modify and add to it as I traverse the DOM representation of the file. It's not a trivial process, but I hope that you can follow it:

Color newBackground = Color.WHITE;
ArrayList<CurveData> newCurves = new ArrayList<CurveData>();

Element rootElement = xmldoc.getDocumentElement();
   
if ( ! rootElement.getNodeName().equals("simplepaint") )
   throw new Exception("File is not a SimplePaint file.");
String version = rootElement.getAttribute("version");
try {
   double versionNumber = Double.parseDouble(version);
   if (versionNumber > 1.0)
      throw new Exception("File requires a newer version of SimplePaint.");
}
catch (NumberFormatException e) {
}

NodeList nodes = rootElement.getChildNodes();
   
for (int i = 0; i < nodes.getLength(); i++) {
   if (nodes.item(i) instanceof Element) {
      Element element = (Element)nodes.item(i);
      if (element.getTagName().equals("background")) { // Read background color.
         int r = Integer.parseInt(element.getAttribute("red"));
         int g = Integer.parseInt(element.getAttribute("green"));
         int b = Integer.parseInt(element.getAttribute("blue"));
         newBackground = new Color(r,g,b);
      }
      else if (element.getTagName().equals("curve")) { // Read data for a curve.
         CurveData curve = new CurveData();
         curve.color = Color.BLACK;
         curve.points = new ArrayList<Point>();
         newCurves.add(curve);  // Add this curve to the new list of curves.
         NodeList curveNodes = element.getChildNodes();
         for (int j = 0; j < curveNodes.getLength(); j++) {
            if (curveNodes.item(j) instanceof Element) {
               Element curveElement = (Element)curveNodes.item(j);
               if (curveElement.getTagName().equals("color")) { 
                  int r = Integer.parseInt(curveElement.getAttribute("red"));
                  int g = Integer.parseInt(curveElement.getAttribute("green"));
                  int b = Integer.parseInt(curveElement.getAttribute("blue"));
                  curve.color = new Color(r,g,b);
               }
               else if (curveElement.getTagName().equals("point")) {
                  int x = Integer.parseInt(curveElement.getAttribute("x"));
                  int y = Integer.parseInt(curveElement.getAttribute("y"));
                  curve.points.add(new Point(x,y));
               }
               else if (curveElement.getTagName().equals("symmetric")) {
                  String content = curveElement.getTextContent();
                  if (content.equals("true"))
                     curve.symmetric = true;
               }
            }
         }
      }
   }
}
curves = newCurves;  // Change picture in window to show the data from file.
setBackground(newBackground);
repaint();

XML has developed into an extremely important technology, and some applications of it are very complex. But there is a core of simple ideas that can be easily applied in Java. Knowing just the basics, you can make good use of XML in your own Java programs.

End of Chapter 11