Extending MathML, continued

[28 April 2014]

In an earlier post, I talked about how to extend the MathML schema by using element substitution groups. I tend to think of that as the best way, other things being equal, to extend a schema.

But it’s also possible to extend the MathML schema by using xsd:redefine; this post explains what’s involved.

Step by step

Define the extension elements

First, we make a schema document for our own namespace, including the extension elements. We can use the first version of the schema document given in the earlier post, before we messed around with the type definitions for the new elements.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:my="http://example.com/nss/sample"
  targetNamespace="http://example.com/nss/sample"
  elementFormDefault="qualified"> 

  <xs:complexType name="extension-type-1" mixed="true">
    <xs:sequence/>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attribute name="flavor"/>
    <xs:attribute name="tone"/>
  </xs:complexType>
  <xs:complexType name="extension-type-2" mixed="true">
    <xs:sequence/>
    <xs:attribute name="gloss" type="xs:IDREF" use="required"/>
  </xs:complexType>  

  <xs:element name="ext1" type="my:extension-type-1"/>
  <xs:element name="ext2" type="my:extension-type-2"/>

</xs:schema>

Make a new MathML schema document

Next, we make a new top-level schema document to use when we import the MathML namespace. The new document will use xsd:redefine to point to the standard top-level or root schema document for MathML, and specify some changes.

In particular, we decide that we want to add our extension elements to the content-model group named mathml:Presentation-layout.class. We’ll define a group with that name, with the slightly unusual property that our new group will include a recursive reference to the existing group of that name, as a child. Such recursion is normally forbidden in content-model groups, but when redefining them, it’s obligatory. The group will look like this:

<xs:group name="Presentation-layout.class">
  <xs:choice>
    <xs:element ref="my:ext1"/>
    <xs:element ref="my:ext2"/>
    <xs:group ref="mathml:Presentation-layout.class"/>
  </xs:choice>
</xs:group>

The new top-level schema document for MathML will wrap that group definition in an xsd:redefine element, and will contain nothing else.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:mathml="http://www.w3.org/1998/Math/MathML"
  xmlns:my="http://example.com/nss/sample"
  targetNamespace="http://www.w3.org/1998/Math/MathML"
  elementFormDefault="qualified">
  
  <xs:import namespace="http://example.com/nss/sample"/>

  <xs:redefine schemaLocation="../standard-modules/mathml2/mathml2.xsd">
    <xs:annotation>
      <xs:documentation> This is a modified copy of the MathML 2
        schema documents at http://www.w3.org/Math/XMLSchema/mathml2.tgz. 

        We use xsd:redefine to extend the Presentation-layout class to
        include two extension elements. </xs:documentation>
    </xs:annotation>

    <xs:group name="Presentation-layout.class">
      <xs:choice>
        <xs:element ref="my:ext1"/>
        <xs:element ref="my:ext2"/>
        <xs:group ref="mathml:Presentation-layout.class"/>
      </xs:choice>
    </xs:group>
  </xs:redefine>

</xs:schema>
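As a rough illustration of what the redefinition is meant to allow, an instance fragment along the following lines should validate against the extended schema. This is a sketch, not a tested example: it assumes that the MathML 2 content model for mrow reaches mathml:Presentation-layout.class (directly or through another group), and that ext1 and ext2 are bound to the types extension-type-1 and extension-type-2 respectively. The prefix m and all attribute values are invented for the example.

```xml
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML"
        xmlns:my="http://example.com/nss/sample">
  <m:mrow>
    <m:mi>x</m:mi>
    <m:mo>+</m:mo>
    <!-- extension element admitted by the redefined group -->
    <my:ext1 id="note1" flavor="informal">see the note</my:ext1>
    <!-- gloss is an IDREF, so it must match some ID in the document -->
    <my:ext2 gloss="note1">y</my:ext2>
  </m:mrow>
</m:math>
```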

Point to the new MathML schema document, not the old

Finally, we make the top-level driver document for our schema import our redefinition of MathML, not the standard MathML schema documents. In a normal schema importing the MathML namespace, we might have something like this:

<xsd:import namespace="http://www.w3.org/1998/Math/MathML"
            schemaLocation="standard-modules/mathml2/mathml2.xsd"/>

In our modified schema we have:

<xsd:import namespace="http://www.w3.org/1998/Math/MathML"
            schemaLocation="local-mods/my-modified-mathml2.xsd"/>
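For concreteness, a driver document along these lines might look like the sketch below. The target namespace http://example.com/nss/driver and the file name my-extensions.xsd (standing for the schema document that defines ext1 and ext2) are hypothetical; only the MathML import is taken from the text above.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://example.com/nss/driver"
  elementFormDefault="qualified">

  <!-- hypothetical file name: schema document for the extension elements -->
  <xsd:import namespace="http://example.com/nss/sample"
              schemaLocation="local-mods/my-extensions.xsd"/>

  <!-- our redefinition, in place of the standard MathML schema documents -->
  <xsd:import namespace="http://www.w3.org/1998/Math/MathML"
              schemaLocation="local-mods/my-modified-mathml2.xsd"/>

  <!-- declarations for the driver's own namespace would follow here -->
</xsd:schema>
```

Since the redefining document imports the extension namespace without a schemaLocation attribute, the driver is the one place that says where that schema document is to be found.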

Advantages and disadvantages

Using xsd:redefine has one advantage over using substitution groups: we didn’t have to muck with the type definitions of our extension elements in order to derive those types from the type used by the substitution-group head. It has a few disadvantages worth mentioning.

First, every group and type we redefine is required to refer to the group or type being redefined. So we cannot use xsd:redefine to make arbitrary changes in a group or type. This was not a problem in our example, but it can be just as constraining, in its way, as the type-derivation requirement on substitution groups.

Second, despite the restrictions on what can be done in a redefinition, xsd:redefine does not guarantee that the relation of our extended schema to the base schema is easy to explain, document, or understand. Documents valid against the base schema are not guaranteed to be valid against the extended schema, so we don’t necessarily have a clean extension. Nor is there a convenient way to ensure that documents valid against the redefinition are always valid against the base schema, in cases where we want a clean restriction. The redefined groups, or types, may turn out to violate some other constraint on schemas, so there is absolutely no guarantee that a redefinition which conforms to all the constraints in the spec, applied to a schema which conforms to all the constraints in the spec, will produce a resulting schema which conforms to the constraints in the spec. Under these circumstances, it’s not surprising that some informed observers regard the constraints on xsd:redefine as pointless.

And, third, both the XSD 1.0 and the XSD 1.1 specs are internally contradictory with regard to xsd:redefine, and it is not hard to find sets of schema documents which behave very differently in different implementations as a result of the implementors having different interpretations of the spec. It is possible to use xsd:redefine in ways that will be consistently implemented by different XSD validators, but it’s very easy to find yourself locked in to a particular validator if you’re not careful.

Careful readers will have noticed that some imports in the schema documents shown have schemaLocation attributes, and some don’t. The simplest way to achieve interoperability is to keep things very very simple, and never ever tell a validator more than once where to find a schema document for a given namespace.

My way of doing that is always to have a top-level document for the schema that performs all the imports and includes needed for the schema, and provides explicit schema location information, and to insist as a general rule that no other schema documents ever contain schema location information, unless (as in the case of an xsd:redefine) it is required for conformance to the spec. That way, the schema validator never sees two different schema locations for the same namespace, and we never need to worry about the fact that in such a situation different implementations of XSD may make different choices about which schema location to load and which to ignore.

It is especially important to avoid the situation of having one schema document in your input import a particular namespace, while another redefines that same namespace: when that happens, no two XSD processors behave the same way. (This may seem implausible, since there are surely more than two XSD processors in the world. But it turns out that there are more than two different behaviors possible in the situation described. I once tested seven validators on an example, with different ways of formulating the command line, and got nine different behaviors from the set of seven processors.)

Checking ISBN check-digits in XSD 1.1

[6 December 2012]

I recently had occasion to write an XSD 1.1 schema for a client whose data includes ISBN and ISSN values.

In a DTD, all one can plausibly say about an element which is supposed to contain an ISBN is that it contains character data, something like this:

<!ELEMENT isbn (#PCDATA) >

That accepts legal ISBN values, like “0 13 651431 6” and “978-1-4419-1901-4”, but it also accepts strings with invalid check-digits, like “0 13 561431 6” (inversion of digits is said to be the most common single error in typing ISBNs), and strings with the wrong number of digits, like “978-1-4419-19014-4”. For that matter, it also accepts strings like “@@@ call Sally and ask what the ISBN is going to be @@@”. (There may be stages in a document’s life when you want to accept that last value. But there may also be stages when you don’t want to allow anything but a legal ISBN. This post is about what to do when writing a schema for that latter set of stages in a document’s life.)

In XSD 1.0, regular-expression patterns can be used to say, more specifically, that the value of a ten-digit ISBN should be of a specific length (thirteen, actually, not ten, because we want to require hyphens or blanks as separators) and should contain only decimal digits, separators, and X (because X is a legal check-digit).

<xsd:simpleType name="ISBN-10">
  <xsd:restriction base="xsd:string">
    <xsd:length value="13"/>
    <xsd:pattern value="[0-9X \-]*"/>
  </xsd:restriction>
</xsd:simpleType>

Actually, we can do better than that. In a ten-digit ISBN, there should be ten digits: one to five digits in the so-called group identifier (which divides the world into language / country areas), one to seven digits in the publisher code (in the US, all publisher codes use at least two digits, but I have not been able to find anything that plausibly asserts this is necessarily true for all publisher codes world-wide), one to seven in the item number, and a final digit (or X) as a check digit.

<xsd:simpleType name="ISBN-10">
  <xsd:restriction base="xsd:string">
    <xsd:length value="13"/>
    <xsd:pattern 
      value="[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9X]"/>
    <xsd:pattern 
      value="[0-9]{1,5} [0-9]{1,7} [0-9]{1,7} [0-9X]"/>
  </xsd:restriction>
</xsd:simpleType>

Since the number of separators is fixed, and the total length of the string is fixed, the type definition above will only accept literals with exactly ten non-separator digits. The patterns above assume that either hyphens or blanks will be used as separators, not a mix of hyphens and blanks; they also want any X appearing as a check-digit to be uppercase.
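To make the behavior of this type concrete, here is a small set of sample values with the outcome I would expect for each, worked out by hand from the length facet and the two patterns rather than by running a validator. The wrapper element isbn-samples is invented for the example, and each isbn element is assumed to be declared with the ISBN-10 type above.

```xml
<isbn-samples>
  <!-- accepted: ten digits, three hyphens, thirteen characters -->
  <isbn>0-13-651431-6</isbn>
  <!-- accepted: blanks throughout instead of hyphens -->
  <isbn>0 13 651431 6</isbn>
  <!-- accepted, although the check-digit is wrong: these patterns do not check it -->
  <isbn>0 13 561431 6</isbn>
  <!-- rejected: hyphens and blanks mixed -->
  <isbn>0-13 651431-6</isbn>
  <!-- rejected: no separators, so the length is 10, not 13 -->
  <isbn>0136514316</isbn>
  <!-- rejected: lowercase x as check-digit -->
  <isbn>0-13-651431-x</isbn>
</isbn-samples>
```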

A similar type can be defined for thirteen-digit ISBNs, which add a three-digit industry-code prefix and another separator at the beginning:

<xsd:simpleType name="ISBN-13">
  <xsd:restriction base="xsd:string">
    <xsd:length value="17"/>
    <xsd:pattern 
      value="(978|979)-[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9]"/>
    <xsd:pattern 
      value="(978|979) [0-9]{1,5} [0-9]{1,7} [0-9]{1,7} [0-9]"/>
  </xsd:restriction>
</xsd:simpleType>

In XSD 1.0, that’s as much as we can conveniently do. (Well, almost. If we are willing to endure the associated tedium, we can check for the correct positioning of hyphens in at least the ISBNs of some areas which assign publisher codes in such a way as to ensure that ISBNs remain unique even if the separators are dropped. See the ISBN datatype defined by Roger Costello and Roger Sperberg for an illustration of the principle.)

In theory, we ought to be able to do better: the check-digit algorithm can be checked by a finite-state automaton, and the languages of ten-digit and thirteen-digit ISBNs are thus demonstrably regular languages. So in principle, there are regular expressions that can perform the check-digit calculation. When I have tried to translate from the FSA to a regular expression, however, the result has been uncomfortably long.

But in XSD 1.1, the addition of assertions makes it possible to replicate the check-digit algorithm. We can write a type definition similar to the ones given above, with an additional xsd:assertion element whose test attribute has as its value an XPath expression which will validate the check-digit.

The ISBN-10 check-digit is constructed in such a way that the sum of digit 1 × 10 + digit 2 × 9 + … + digit 8 × 3 + digit 9 × 2 + digit 10 (if digit 10 is a digit, or 10 if digit 10 is an X), modulo 11, is equal to 0. The ISBN-13 check-digit uses a similar but simpler calculation: the numeric values of digits in even-numbered positions are multiplied by three, those of the digits in odd-numbered positions by one, and the sum of these weighted values must be a multiple of ten. This calculation is well within the range of XPath 2.0; let us build up the expression in stages.
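As a worked check of the two rules, here is the arithmetic for the sample values used earlier in this post: the legal ISBN-10 “0 13 651431 6”, the same value with two digits inverted (“0 13 561431 6”), and the ISBN-13 “978-1-4419-1901-4”. The notation is LaTeX, used here only to lay out the sums.

```latex
% ISBN-10, weights 10 down to 1: legal value, then the value with 6 and 5 inverted
\begin{align*}
0\cdot10 + 1\cdot9 + 3\cdot8 + 6\cdot7 + 5\cdot6 + 1\cdot5 + 4\cdot4 + 3\cdot3 + 1\cdot2 + 6\cdot1
  &= 143 = 11\cdot13 \equiv 0 \pmod{11} \\
0\cdot10 + 1\cdot9 + 3\cdot8 + 5\cdot7 + 6\cdot6 + 1\cdot5 + 4\cdot4 + 3\cdot3 + 1\cdot2 + 6\cdot1
  &= 142 \equiv 10 \pmod{11}
\end{align*}

% ISBN-13: odd positions weighted 1, even positions weighted 3
\begin{align*}
(9 + 8 + 4 + 1 + 1 + 0 + 4) + 3\,(7 + 1 + 4 + 9 + 9 + 1)
  &= 27 + 93 = 120 \equiv 0 \pmod{10}
\end{align*}
```

The first and third sums come out to zero, so those two values pass; the inverted-digit value leaves a remainder of 10 and is caught.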

Given a candidate ISBN in variable $value, we can obtain a string of digits (or X) without the separators by deleting all hyphens and blanks, which we can do in XPath by writing:

translate($value,' -','')

We can turn that, in turn, into a sequence of numbers (the UCS code-point numbers for the characters) using the XPath 2.0 function string-to-codepoints:

string-to-codepoints(translate($value,' -',''))

For example, given the ISBN “0 13 651431 6”, as the value of $value, the expression just given evaluates to the sequence of integers (48 49 51 54 53 49 52 51 49 54). For purposes of the checksum calculation, however, we’d rather have a 0 in the ISBN appear as a 0, not a 48, in our sequence of numbers. And we need to turn X (which maps to 88) into 10. So we write the following XPath 2.0 expression:

for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)

Now the ten-digit ISBN “0 13 651431 6” and the thirteen-digit ISBN “978-1-4419-1901-4” map, respectively, to the sequences (0 1 3 6 5 1 4 3 1 6) and (9 7 8 1 4 4 1 9 1 9 0 1 4). This gives us precisely what we need for doing the arithmetic.

From the integer sequences thus created, we can extract the first digit by writing the filter expression [1], the second digit with [2], etc. It would be convenient to be able to assign the integer sequence to a variable, but that’s not possible in XPath 2.0 (at least, not using normal means). In writing the schema document, however, we can put the expression that generates the sequence into a named entity, thus:

<!ENTITY digit-sequence 
"(for $d in string-to-codepoints(translate($value,' -',''))
  return if ($d = 88) then 10 else ($d - 48))">

Now we can write the assertion for ISBN-10 thus:

<xsd:assertion test="
        ((&digit-sequence;[1] * 10
        + &digit-sequence;[2] * 9
        + &digit-sequence;[3] * 8
        + &digit-sequence;[4] * 7
        + &digit-sequence;[5] * 6
        + &digit-sequence;[6] * 5
        + &digit-sequence;[7] * 4
        + &digit-sequence;[8] * 3
        + &digit-sequence;[9] * 2
        + &digit-sequence;[10] * 1) mod 11) eq 0
        "/>

We could write a similar expression for ISBN-13, but in fact we can use simple arithmetic to simplify the expression to:

<xsd:assertion test="
        ((sum(&digit-sequence;
             [position() mod 2 = 1]) 
        + sum(for $d in (&digit-sequence;
                         [position() mod 2 = 0]) 
              return 3 * $d)
        ) mod 10) eq 0
        "/>

(Digression on entities …)

Some people, of course, frown on the use of entities in XML and claim that they are not helpful. I think examples like this one clearly show that entities can be very useful when used intelligently; it is much easier to see that the assertions given above are correct than it is in the equivalent assertions after entity expansion (post-edited to provide better legibility):

<xsd:assertion test="
  (((for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [1] * 10         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [2] * 9         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [3] * 8         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [4] * 7         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [5] * 6         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [6] * 5         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [7] * 4         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [8] * 3         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [9] * 2         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [10] * 1) 
  mod 11) 
  eq 0
  "/>      

<xsd:assertion test="
   ((sum(
      (for $d in string-to-codepoints(translate($value,' -','')) 
       return if ($d = 88) then 10 else ($d - 48)
      )[position() mod 2 = 1]
     )          
   + sum(for $d in 
          ((for $d in 
              string-to-codepoints(translate($value,' -','')) 
            return if ($d = 88) then 10 else ($d - 48)
           ) [position() mod 2 = 0]) 
         return 3 * $d)) 
   mod 10) 
   eq 0
   "/>

The use of entity references makes it far easier to be confident that the two, or ten, for-expressions all really do the same thing, and they provide a level of abstraction which, in a simple way, encapsulates the book-keeping details and allows the overall structure of the two test expressions to be more clearly exhibited.

(End of digression.)
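Returning to the schema itself: assembled into a single schema document, the ISBN-10 type might look like the sketch below. It assumes a schema document with no target namespace and a processor that supports XSD 1.1 assertions; the entity is declared in the internal DTD subset, and the weights run from 10 down to 1 as in the check-digit rule described earlier.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsd:schema [
<!-- general entity holding the expression that yields the digit sequence -->
<!ENTITY digit-sequence
"(for $d in string-to-codepoints(translate($value,' -',''))
  return if ($d = 88) then 10 else ($d - 48))">
]>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="ISBN-10">
    <xsd:restriction base="xsd:string">
      <xsd:length value="13"/>
      <xsd:pattern
        value="[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9X]"/>
      <xsd:pattern
        value="[0-9]{1,5} [0-9]{1,7} [0-9]{1,7} [0-9X]"/>
      <!-- weighted sum of the ten digits must be a multiple of 11 -->
      <xsd:assertion test="
        ((&digit-sequence;[1] * 10
        + &digit-sequence;[2] * 9
        + &digit-sequence;[3] * 8
        + &digit-sequence;[4] * 7
        + &digit-sequence;[5] * 6
        + &digit-sequence;[6] * 5
        + &digit-sequence;[7] * 4
        + &digit-sequence;[8] * 3
        + &digit-sequence;[9] * 2
        + &digit-sequence;[10] * 1) mod 11) eq 0
        "/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

Because the patterns allow X only in the last position, the mapping of X to 10 can only ever affect the check-digit term.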

The end result is an XSD 1.1 datatype that detects most typos in the recording of ISBNs. It does not, alas, ensure that the legal ISBN one types in is actually the correct ISBN for the book in question, only that it is a legal ISBN. But using machines to check what machines can check will leave more time for humans to check those things that only humans can check.