Native Validation Interface

Written by Kohsuke KAWAGUCHI

Table of Contents

  1. Introduction
  2. Model
  3. Obtaining VGM
  4. Validation
  5. Context
  6. Error Reporting and Recovery
  7. Advanced Topics

Introduction

MSV has a native API for validation which enables better error reporting and flexible validation. This document describes this native API of MSV.

Model

The native API consists of two interfaces: Acceptor and DocumentDeclaration.

DocumentDeclaration is the VGM. Its sole purpose is to create an Acceptor which validates the top level sequence, which is usually the root element.

An Acceptor performs a validation for one content model (siblings). It can create new "child" acceptors to validate child content models, thereby validating the whole tree.

Obtaining VGM

One simple way to compile a schema into a VGM is to use the GrammarLoader.loadVGM method. This method takes a schema as an argument, compiles it into an AGM, and then wraps it in a VGM. The source code of GrammarLoader shows how you can create a VGM in other ways.

It is important to note that different schema languages may use different VGM implementations, and there may be more than one VGM implementation for a single schema language. For example, right now W3C XML Schema uses com.sun.msv.verifier.regexp.xmlschema.XSREDocDecl while all others use com.sun.msv.verifier.regexp.REDocumentDecl. So creating a VGM from an AGM is nontrivial.

Validation

Let's assume that we have a DocumentDeclaration object and see how we can perform a plain-vanilla validation by traversing a DOM tree.

At a high level, validation is done by passing information about the XML document through various methods of the Acceptor interface, creating an acceptor for each element.

The first thing to do is to create an Acceptor and use it to validate the top level, as follows:

void validate( Document dom, DocumentDeclaration docDecl ) {
  Acceptor acc = docDecl.createAcceptor();
  validateElement(dom.getDocumentElement(),acc);
}

The validateElement method is defined here as validating a given element with a given acceptor:

void validateElement( Element node, Acceptor acc ) {
  ...
}

Validation of an element is done by the createChildAcceptor method. This method creates a child acceptor, which will validate children of that element. This method takes a StartTagInfo as a parameter; this object holds the information about the element name and attributes (information about the start tag), and you are responsible for creating that object.

void validateElement( Element node, Acceptor acc ) {
  org.xml.sax.helpers.AttributesImpl atts =
    /* create SAX Attributes object from attributes of this node. */
  
  // StartTagInfo uses Attributes object for keeping attributes.
  StartTagInfo sti = new StartTagInfo(
    node.getNamespaceURI(), // information about the element name.
    node.getLocalName(),
    node.getTagName(),      // the qualified name
    atts,
    context );              // the context object; see the "Context" section
  
  Acceptor child = acc.createChildAcceptor(sti,null);
  if(child==null)  throw new InvalidException();
}

If there is a validation error (e.g., unexpected element), the createChildAcceptor method returns null.
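
The validateElement listing above leaves the construction of the SAX Attributes object as a comment. One possible way to fill it in, using only the DOM and SAX classes that ship with the JDK, is sketched below. The class name DomAttributes and the method name toSax are hypothetical. Skipping xmlns declarations and reporting every attribute type as "CDATA" are assumptions made for this sketch, not documented requirements of the MSV API.

```java
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper, not part of MSV. */
public class DomAttributes {
    private static final String XMLNS_URI = "http://www.w3.org/2000/xmlns/";

    /** Copies the attributes of a DOM element into a SAX AttributesImpl. */
    public static AttributesImpl toSax(Element node) {
        AttributesImpl atts = new AttributesImpl();
        NamedNodeMap map = node.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            Attr a = (Attr) map.item(i);
            String uri = a.getNamespaceURI();
            // Assumption: namespace declarations are not passed on as attributes.
            if (XMLNS_URI.equals(uri)) continue;
            String local = a.getLocalName();
            // Nodes created without namespace support have no local name.
            if (local == null) local = a.getName();
            // Assumption: the attribute type is always reported as "CDATA".
            atts.addAttribute(uri == null ? "" : uri, local, a.getName(),
                              "CDATA", a.getValue());
        }
        return atts;
    }
}
```

With such a helper, the first statement of validateElement could read AttributesImpl atts = DomAttributes.toSax(node);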

Once you have created a child acceptor, the next thing to do is to use it to validate the children (the attributes of that element, its child elements, and the text within it). After that, call the isAcceptState method to see if the child acceptor is "satisfied". An acceptor is satisfied when the whole content model has been matched.

  Acceptor child = acc.createChildAcceptor(sti,null);
  if(child==null)  throw new InvalidException();
  
  validateChildren(node,child,sti);
  
  // test if it's OK to end the contents here.
  if(!child.isAcceptState(null))
    throw new InvalidException();

For example, when the content model is (a,b,c) and the actual content is <a/><b/>, the acceptor won't be satisfied because it still needs to see c. So when this method returns false, it means that mandatory elements are missing.
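
The (a,b,c) example can be made concrete with a toy class. The sketch below is not MSV code and does not implement the Acceptor interface. SequenceAcceptor and its step method are hypothetical stand-ins that only model a fixed sequence of element names, to show why the state after <a/><b/> is not an accept state.

```java
import java.util.List;

/** Toy stand-in for an acceptor over a fixed sequence; not MSV code. */
public class SequenceAcceptor {
    private final List<String> expected;
    private int position = 0;

    public SequenceAcceptor(List<String> expected) {
        this.expected = expected;
    }

    /** Consumes one child element; returns false if it is not the next expected one. */
    public boolean step(String elementName) {
        if (position < expected.size() && expected.get(position).equals(elementName)) {
            position++;
            return true;
        }
        return false; // the state is left unchanged on failure
    }

    /** True only when every element of the sequence has been seen. */
    public boolean isAcceptState() {
        return position == expected.size();
    }
}
```

After step("a") and step("b"), isAcceptState() is still false. It becomes true only once step("c") has succeeded.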

Once you make sure that the child acceptor is in a valid state, then you'll pass it back to the parent acceptor. The parent acceptor will step forward (think of it as an automaton) by eating the child acceptor.

  acc.stepForward(child);

The complete code of the validateElement method will be as follows:

void validateElement( Element node, Acceptor acc ) {
  // create StartTagInfo
  StartTagInfo sti = new StartTagInfo( ... );
  
  Acceptor child = acc.createChildAcceptor(sti,null);
  if(child==null)  throw new InvalidException();
  
  validateChildren(node,child,sti);
  
  // test if it's OK to end the contents here.
  if(!child.isAcceptState(null))
    throw new InvalidException();
  
  acc.stepForward(child);
}

Let's move on to the validateChildren method. First, call the onAttribute method for each attribute:

void validateChildren( Element node, Acceptor acc, StartTagInfo sti ) {
  
  NamedNodeMap atts = node.getAttributes();
  
  for( int i=0; i<atts.getLength(); i++ ) {
    Attr a = (Attr)atts.item(i);
    if( !acc.onAttribute(a.getNamespaceURI(),a.getLocalName(), ... ) )
      throw new InvalidException();
  }
}

The onAttribute method returns false if there is an error in the attribute (e.g., the attribute is undefined, or its value is wrong).

Then, call the onEndAttributes method to indicate that there are no more attributes.

  if(!acc.onEndAttributes(sti,null))
    throw new InvalidException();

This method returns false when more attributes are required, for example when a mandatory attribute is missing.

Once you have processed the attributes, process the children (contents) of the element.

  node.normalize();
  for( Node n = node.getFirstChild(); n!=null; n=n.getNextSibling() ) {
    switch(n.getNodeType()) {
    
    case Node.ELEMENT_NODE:
      validateElement( (Element)n, acc );
      break;
    
    case Node.TEXT_NODE:
    case Node.CDATA_SECTION_NODE:
      String text = n.getNodeValue();
      
      if(!acc.onText(text,context,null,null))
        throw new InvalidException();
      break;
    }
  }

It is important to normalize the DOM tree. This is because the onText method has to be called with the whole text chunk. For example, if you have an XML like <foo>abcdef</foo>, then you cannot call the onText method twice by splitting "abcdef" into two substrings.
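
One caveat: Node.normalize() merges adjacent Text nodes, but per the DOM specification it does not merge CDATA sections with neighbouring text. For content such as <foo>abc<![CDATA[def]]></foo>, the loop above would therefore still call onText twice. A possible workaround, sketched below with JDK classes only, is to collect each run of adjacent text and CDATA nodes into one string before calling onText. The class name TextChunks is hypothetical, and treating only child elements as chunk boundaries (so comments and processing instructions do not split a chunk) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper, not part of MSV. */
public class TextChunks {
    /** Joins each run of adjacent text and CDATA children into one string. */
    public static List<String> of(Element parent) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            short type = n.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                if (current == null) current = new StringBuilder();
                current.append(n.getNodeValue());
            } else if (type == Node.ELEMENT_NODE && current != null) {
                // Assumption: only a child element ends the current chunk.
                chunks.add(current.toString());
                current = null;
            }
        }
        if (current != null) chunks.add(current.toString());
        return chunks;
    }
}
```

For <foo>abc<![CDATA[def]]><b/>x</foo> this yields the two chunks "abcdef" and "x". A real traversal would interleave these onText calls with the validateElement calls for the child elements.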

The onText method returns false if the text is invalid. Usually, this is because text is not allowed there at all, or because the text is invalid with respect to the datatype.



The following table summarizes atoms in XML documents and actions you have to take.

Atom        Action
start tag   Call createChildAcceptor and switch to the child acceptor.
end tag     Call isAcceptState, then stepForward, and switch back to the parent acceptor.
attribute   Call onAttribute. Don't forget to call onEndAttributes.
text        Call onText. Be careful with normalization.

Context

Although I didn't mention it in the previous section, you need to pass a "context" object (com.sun.msv.verifier.IDContextProvider) to some of the methods described above. This object provides contextual information (such as namespace prefix bindings and the base URI). For example, the "QName" datatype needs to resolve a namespace prefix into a namespace URI.

You have to implement a context object by yourself and pass it to methods that need it. If you are not interested in xml:base, then you can return null from the getBaseUri method. Similarly, if you don't care about entities and notations, then you can return false from the isNotation and isUnparsedEntity methods.
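
For a DOM-based traversal, a minimal context object could look like the sketch below. So that it compiles with the JDK alone, the class does not actually declare that it implements IDContextProvider. DomContext is a hypothetical name, and resolveNamespacePrefix is assumed to be the name of the prefix-resolution method; getBaseUri, isNotation and isUnparsedEntity are the methods mentioned above. Any other methods the real interface requires are not covered here.

```java
import org.w3c.dom.Node;

/** Hypothetical DOM-backed context; does not implement the MSV interface. */
public class DomContext {
    private final Node current;

    /** @param current the element currently being validated */
    public DomContext(Node current) {
        this.current = current;
    }

    /** Resolves a prefix against the in-scope namespaces of the current node. */
    public String resolveNamespacePrefix(String prefix) {
        // Assumption: the empty prefix denotes the default namespace.
        // DOM Level 3 expects null for the default namespace.
        return current.lookupNamespaceURI(prefix.isEmpty() ? null : prefix);
    }

    /** Not interested in xml:base. */
    public String getBaseUri() {
        return null;
    }

    /** Not interested in notations. */
    public boolean isNotation(String name) {
        return false;
    }

    /** Not interested in unparsed entities. */
    public boolean isUnparsedEntity(String name) {
        return false;
    }
}
```

Because prefix bindings depend on the position in the tree, a traversal along these lines would create (or update) the context for each element it visits.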

Error Reporting and Recovery

Most of the methods on the Acceptor interface return false to indicate a validation error. To obtain a more detailed error message, pass a StringRef object to those methods.

Consider the following example for the isAcceptState method:

  if(!acc.isAcceptState(null)) {
    // there was an error in the document.
    
    // create a StringRef object. This object will
    // receive error message.
    StringRef ref = new StringRef();
    
    // call the isAcceptState method again
    acc.isAcceptState(ref);
    
    // print the error message
    System.out.println(ref.str);
  }

These methods do not change the state of the acceptor when they return false. So you can call the same method again (with a valid StringRef object) to get the error message.

If you specify a StringRef object, the acceptor will recover from the error as a side-effect. For example, if the createChildAcceptor method returns null and you call the same method again with a StringRef, then it will return a valid child acceptor object.

  Acceptor child = parent.createChildAcceptor(sti,null);
  if(child==null) {
    // get the error message
    StringRef ref = new StringRef();
    child = parent.createChildAcceptor(sti,ref);
    
    System.out.println(ref.str);
    
    // the above statement will return a valid acceptor
    // so we can continue validating documents.
  }
  
  ...

The same recovery behavior applies to all other methods. This makes it possible to continue validation after seeing errors.

Note that because the error recovery is highly ad hoc, it sometimes falls into a panic mode in which a lot of false errors are reported. So you may want to implement some kind of filter that suppresses error messages until you are sure that validation is back in sync.
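
One possible filter is sketched below. It is only a heuristic and not something MSV provides: after reporting an error, it suppresses further errors until a given number of consecutive validation steps have succeeded. The class ErrorFilter, its methods, and the quietSteps threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical error filter; the "back in sync" rule is an assumed heuristic. */
public class ErrorFilter {
    private final int quietSteps;
    private int successesSinceError;
    private final List<String> reported = new ArrayList<>();

    /** @param quietSteps consecutive successes needed before errors are reported again */
    public ErrorFilter(int quietSteps) {
        this.quietSteps = quietSteps;
        this.successesSinceError = quietSteps; // start out in sync
    }

    /** Call after every acceptor method that succeeded. */
    public void onSuccess() {
        if (successesSinceError < quietSteps) successesSinceError++;
    }

    /** Call with the message from a StringRef; returns true if it was reported. */
    public boolean onError(String message) {
        boolean inSync = successesSinceError >= quietSteps;
        successesSinceError = 0;
        if (inSync) reported.add(message);
        return inSync;
    }

    /** The messages that passed the filter, in order. */
    public List<String> reported() {
        return reported;
    }
}
```

The threshold trades noise against the risk of hiding genuine errors that happen to follow closely after another error.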

Advanced Topics

Re-validation

Acceptors can always be cloned by calling the createClone method. Such a clone is useful to "bookmark" a particular element of a document.

For example, you can run the normal validation once and associate each DOM Node with a clone of the Acceptor used at that point. Later, you can use that cloned acceptor to re-validate a subtree.

Datatype Assignment

In the onText and onAttribute methods, applications can obtain the datatypes that are assigned to the text or attribute value.

To obtain this information, pass a non-null DatatypeRef object to those methods. Upon the method completion, this DatatypeRef object will receive an array of Datatypes.

When the array is null or empty, it means there was an error or the datatype was not uniquely determined. When there is only one item in the array, it means the attribute value (or the text) was validated as that datatype. If there is more than one item in the array, it means the attribute value (or the text) was validated as a <list> (of RELAX NG), and each datatype in the array indicates the datatype of the corresponding token.
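
The three cases can be summarized in code. The sketch below is illustrative only: it takes a plain Object[] as a stand-in for the array of Datatypes received through the DatatypeRef, and the DatatypeResult class and its Kind enum are hypothetical names.

```java
/** Hypothetical helper that classifies the array received via a DatatypeRef. */
public class DatatypeResult {
    public enum Kind { ERROR_OR_AMBIGUOUS, SINGLE, LIST }

    /** @param types stand-in for the array of Datatypes; may be null */
    public static Kind classify(Object[] types) {
        if (types == null || types.length == 0) {
            // An error occurred, or the datatype was not uniquely determined.
            return Kind.ERROR_OR_AMBIGUOUS;
        }
        // One item: validated as that datatype.
        // Several items: validated as a <list>, one datatype per token.
        return types.length == 1 ? Kind.SINGLE : Kind.LIST;
    }
}
```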