Extending MathML, continued

[28 April 2014]

In an earlier post, I talked about how to extend the MathML schema by using element substitution groups. I tend to think of that as the best way, other things being equal, to extend a schema.

But it’s also possible to extend the MathML schema by using xsd:redefine; this post explains what’s involved.

Step by step

Define the extension elements

First, we make a schema document for our own namespace, including the extension elements. We can use the first version of the schema document given in the earlier post, before we messed around with the type definitions for the new elements.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.com/nss/sample"
           targetNamespace="http://example.com/nss/sample"
           elementFormDefault="qualified">

  <xs:complexType name="extension-type-1" mixed="true">
    <xs:sequence/>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attribute name="flavor"/>
    <xs:attribute name="tone"/>
  </xs:complexType>
  <xs:complexType name="extension-type-2" mixed="true">
    <xs:sequence/>
    <xs:attribute name="gloss" type="xs:IDREF" use="required"/>
  </xs:complexType>

  <xs:element name="ext1" type="my:extension-type-1"/>
  <xs:element name="ext2" type="my:extension-type-2"/>

</xs:schema>

Make a new MathML schema document

Next, we make a new top-level schema document to use when we import the MathML namespace. The new document will use xsd:redefine to point to the standard top-level or root schema document for MathML, and specify some changes.

In particular, we decide that we want to add our extension elements to the content-model group named mathml:Presentation-layout.class. We’ll define a group with that name, with the slightly unusual property that it includes, as a child, a recursive reference to the existing group of that name. Such recursion is normally forbidden in content-model groups, but when redefining them, it’s obligatory. The group will look like this:

<xs:group name="Presentation-layout.class">
  <xs:choice>
    <xs:element ref="my:ext1"/>
    <xs:element ref="my:ext2"/>
    <xs:group ref="mathml:Presentation-layout.class"/>
  </xs:choice>
</xs:group>

The new top-level schema document for MathML will wrap that group definition in an xsd:redefine element, and will contain nothing else.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:mathml="http://www.w3.org/1998/Math/MathML"
           xmlns:my="http://example.com/nss/sample"
           targetNamespace="http://www.w3.org/1998/Math/MathML"
           elementFormDefault="qualified">

  <xs:import namespace="http://example.com/nss/sample"/>

  <xs:redefine schemaLocation="../standard-modules/mathml2/mathml2.xsd">
    <xs:annotation>
      <xs:documentation>This is a modified copy of the MathML 2
        schema documents at http://www.w3.org/Math/XMLSchema/mathml2.tgz.

        We use xsd:redefine to extend the Presentation-layout class to
        include two extension elements.</xs:documentation>
    </xs:annotation>

    <xs:group name="Presentation-layout.class">
      <xs:choice>
        <xs:element ref="my:ext1"/>
        <xs:element ref="my:ext2"/>
        <xs:group ref="mathml:Presentation-layout.class"/>
      </xs:choice>
    </xs:group>
  </xs:redefine>

</xs:schema>

Point to the new MathML schema document, not the old

Finally, we make the top-level driver document for our schema import our redefinition of MathML, not the standard MathML schema documents. In a normal schema importing the MathML namespace, we might have something like this:

<xsd:import namespace="http://www.w3.org/1998/Math/MathML"
            schemaLocation="standard-modules/mathml2/mathml2.xsd"/>

In our modified schema we have:

<xsd:import namespace="http://www.w3.org/1998/Math/MathML"
            schemaLocation="local-mods/my-modified-mathml2.xsd"/>

Advantages and disadvantages

Using xsd:redefine has one advantage over using substitution groups: we didn’t have to muck with the type definitions of our extension elements in order to derive those types from the type used by the substitution-group head. It also has a few disadvantages worth mentioning.

First, every group and type we redefine is required to refer to the group or type being redefined. So we cannot use xsd:redefine to make arbitrary changes in a group or type. This was not a problem in our example, but it can be just as constraining, in its way, as the type-derivation requirement on substitution groups.

Second, despite the restrictions on what can be done in a redefinition, xsd:redefine does not guarantee that the relation of our extended schema to the base schema is easy to explain, document, or understand. Documents valid against the base schema are not guaranteed to be valid against the extended schema, so we don’t necessarily have a clean extension. Nor is there a convenient way to ensure that documents valid against the redefinition are always valid against the base schema, in cases where we want a clean restriction. The redefined groups, or types, may turn out to violate some other constraint on schemas, so there is absolutely no guarantee that a redefinition which conforms to all the constraints in the spec, applied to a schema which conforms to all the constraints in the spec, will produce a resulting schema which conforms to the constraints in the spec. Under these circumstances, it’s not surprising that some informed observers regard the constraints on xsd:redefine as pointless.

And, third, both the XSD 1.0 and the XSD 1.1 specs are internally contradictory with regard to xsd:redefine, and it is not hard to find sets of schema documents which behave very differently in different implementations as a result of the implementors having different interpretations of the spec. It is possible to use xsd:redefine in ways that will be consistently implemented by different XSD validators, but it’s very easy to find yourself locked in to a particular validator if you’re not careful.

Careful readers will have noticed that some imports in the schema documents shown have schemaLocation attributes, and some don’t. The simplest way to achieve interoperability is to keep things very, very simple, and never ever tell a validator more than once where to find a schema document for a given namespace. My way of doing that is always to have a top-level document for the schema that performs all the imports and includes needed for the schema and provides explicit schema location information, and to insist as a general rule that no other schema documents ever contain schema location information, unless (as in the case of an xsd:redefine) it is required for conformance to the spec.

That way, the schema validator never sees two different schema locations for the same namespace, and we never need to worry about the fact that in such a situation different implementations of XSD may make different choices about which schema location to load and which to ignore. It is especially important to avoid the situation of having one schema document in your input import a particular namespace, while another redefines that same namespace: when that happens, no two XSD processors behave the same way.

(This may seem implausible, since there are surely more than two XSD processors in the world. But it turns out that there are more than two different behaviors possible in the situation described. I once tested seven validators on an example, with different ways of formulating the command line, and got nine different behaviors from the set of seven processors.)
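
To make the policy concrete: the top-level driver document might look something like this (a sketch; the target namespace and file names are invented for illustration):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com/nss/driver">

  <!-- All schema location information lives here, and only here. -->
  <xs:import namespace="http://www.w3.org/1998/Math/MathML"
             schemaLocation="local-mods/my-modified-mathml2.xsd"/>
  <xs:import namespace="http://example.com/nss/sample"
             schemaLocation="local-mods/sample-extensions.xsd"/>

</xs:schema>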

Extending MathML 2

[24 April 2014]

The other day I got an inquiry from a user having trouble getting their extensions to MathML 2 to work in their new XSD schema. I learned some things while working on their problem.

First, let’s be clear. MathML says that it is intended to be extensible. Section 7.3.2 of MathML2 reads in full:

The set of elements and attributes specified in the MathML specification are necessary for rendering common mathematical expressions. It is recognized that not all mathematical notation is covered by this set of elements, that new notations are continually invented, and that sub-communities within mathematics often have specialized notations; and furthermore that the explicit extension of a standard is a necessarily slow and conservative process. This implies that the MathML standard could never explicitly cover all the presentational forms used by every sub-community of authors and readers of mathematics, much less encode all mathematical content.

In order to facilitate the use of MathML by the widest possible audience, and to enable its smooth evolution to encompass more notational forms and more mathematical content (perhaps eventually covered by explicit extensions to the standard), the set of tags and attributes is open-ended, in the sense described in this section.

MathML is described by an XML DTD, which necessarily limits the elements and attributes to those occurring in the DTD. Renderers desiring to accept non-standard elements or attributes, and authors desiring to include these in documents, should accept or produce documents that conform to an appropriately extended XML DTD that has the standard MathML DTD as a subset.

MathML renderers are allowed, but not required, to accept non-standard elements and attributes, and to render them in any way. If a renderer does not accept some or all non-standard tags, it is encouraged either to handle them as errors as described above for elements with the wrong number of arguments, or to render their arguments as if they were arguments to an mrow, in either case rendering all standard parts of the input in the normal way.

I don’t find this passage in MathML3, but the sample embedding of MathML into XHTML does extend the document grammar to include XHTML elements, so I believe that the design principle remains true.

It’s easy enough to extend the document grammar as expressed by the DTD: just provide new declarations for the appropriate parameter entities, extending them to include your new elements, something along the following lines. Let us say that we have concluded that we want our extension elements to be legal everywhere that mml:mspace is legal, and we don’t need them anywhere else. We can write:

<!ENTITY % my.mml.extensions "my:ext1 | my:ext2">
<!ENTITY % petoken "%mspace.qname; | %my.mml.extensions;" >

<!ELEMENT my:ext1 (#PCDATA) >
<!ATTLIST my:ext1
  id     ID    #IMPLIED
  flavor CDATA #IMPLIED
  tone   CDATA #IMPLIED >

<!ELEMENT my:ext2 EMPTY >
<!ATTLIST my:ext2
  gloss IDREF #REQUIRED >
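
One detail to remember: DTDs know nothing of namespaces, so if the my: prefix is to be properly bound in instance documents, the namespace declaration must itself be declared as an ordinary (fixed) attribute, along these lines (a sketch):

<!ATTLIST my:ext1
  xmlns:my CDATA #FIXED "http://example.com/nss/sample" >
<!ATTLIST my:ext2
  xmlns:my CDATA #FIXED "http://example.com/nss/sample" >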

For XSD, it could in principle be even simpler. The simplest way to make an XSD schema easily extensible is to include wildcards at appropriate points in content models, to allow users’ extension elements to be included in valid documents. All the user has to do is supply a schema document with the declarations of their extension elements:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.com/nss/sample"
           targetNamespace="http://example.com/nss/sample"
           elementFormDefault="qualified">

  <xs:complexType name="extension-type-1" mixed="true">
    <xs:sequence/>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attribute name="flavor"/>
    <xs:attribute name="tone"/>
  </xs:complexType>
  <xs:complexType name="extension-type-2" mixed="true">
    <xs:sequence/>
    <xs:attribute name="gloss" type="xs:IDREF" use="required"/>
  </xs:complexType>

  <xs:element name="ext1" type="my:extension-type-1"/>
  <xs:element name="ext2" type="my:extension-type-2"/>

</xs:schema>

In the MathML 2 XSD, it turns out to be slightly more complicated, because despite explicitly expecting extensions to the document grammar, the designers didn’t put in the most obvious possible extensibility hook: the content models of MathML elements contain no wildcards, except in the case of the annotation element. So we have some more work to do.

Plan B is to use element substitution groups. Since we want our elements to be legal wherever mml:mspace is legal, we can just make our elements substitutable for mml:mspace. In the simplest case, we would then just write our schema document thus:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.com/nss/sample"
           xmlns:mml="http://www.w3.org/1998/Math/MathML"
           targetNamespace="http://example.com/nss/sample"
           elementFormDefault="qualified">

  <xs:import namespace="http://www.w3.org/1998/Math/MathML"/>

  <xs:element name="ext1" substitutionGroup="mml:mspace"/>
  <xs:element name="ext2" substitutionGroup="mml:mspace"/>

</xs:schema>

The wrinkle here is that when we write it this way, our extension elements get the same type as their substitution-group head, mml:mspace. If we just reinsert the declarations of my:extension-type-1 and my:extension-type-2 and the type attributes on the element declarations, the XSD validator will remind us firmly but politely (in most cases) that the types of my:ext1 and my:ext2 must be derived from that of mml:mspace. In the case of the XSD schema for MathML 2, that means they must be derived from type mml:mspace.type. For document-oriented schemas, this type-derivation requirement is a nuisance; it came from the database-oriented part of the working group that specified XSD.

Fortunately, it’s only a nuisance, not a serious obstacle. All we need to do is define our extension types in terms of changes to type mml:mspace.type. This will require a couple of intermediate types, which we’ll call bridge types. The first step in the derivation is to clear away everything we don’t want in our extension types, by restricting away any unwanted content (we’re in luck: mml:mspace.type has no content at all) and any unwanted attributes (again we’re in luck: all the attributes are optional). Since one of our extension types uses an id attribute and the other does not, we’ll define two bridge types.

<xsd:complexType name="bridge-with-id">
<xsd:complexContent>
<xsd:restriction base="mml:mspace.type">
<xsd:attribute name="width" use="prohibited"/>
<xsd:attribute name="height" use="prohibited"/>
<xsd:attribute name="depth" use="prohibited"/>
<xsd:attribute name="linebreak" use="prohibited"/>
<xsd:attribute name="class" use="prohibited"/>
<xsd:attribute name="style" use="prohibited"/>
<xsd:attribute name="xref" use="prohibited"/>
<xsd:attribute ref="xlink:href" use="prohibited"/>
</xsd:restriction>
</xsd:complexContent>
</xsd:complexType>

<xsd:complexType name="bridge-no-id">
<xsd:complexContent>
<xsd:restriction base="my:bridge-with-id">
<xsd:attribute name="id" use="prohibited"/>
</xsd:restriction>
</xsd:complexContent>
</xsd:complexType>

Now we can define our extension types in terms of these (perhaps biting our tongues at the verbosity and awkwardness of the XSD syntax):

<xs:complexType name="extension-type-1" mixed="true">
<xs:complexContent>
<xs:extension base="my:bridge-with-id">
<xs:sequence/>
<xs:attribute name="id" type="xs:ID"/>
<xs:attribute name="flavor"/>
<xs:attribute name="tone"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="extension-type-2" mixed="true">
<xs:complexContent>
<xs:extension base="my:bridge-no-id">
<xs:sequence/>
<xs:attribute name="gloss" type="xs:IDREF" use="required"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>

The reference to xlink:href requires that we import the XLink namespace (even though all we’re doing is saying we don’t want that attribute here), so we need to add another xs:import element as well as another namespace declaration.

The schema document as a whole now looks like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.com/nss/sample"
           xmlns:mml="http://www.w3.org/1998/Math/MathML"
           xmlns:xlink="http://www.w3.org/1999/xlink"
           targetNamespace="http://example.com/nss/sample"
           elementFormDefault="qualified">

  <xs:import namespace="http://www.w3.org/1998/Math/MathML"/>
  <xs:import namespace="http://www.w3.org/1999/xlink"/>

  <xs:complexType name="bridge-with-id">
    <xs:complexContent>
      <xs:restriction base="mml:mspace.type">
        <xs:attribute name="width" use="prohibited"/>
        <xs:attribute name="height" use="prohibited"/>
        <xs:attribute name="depth" use="prohibited"/>
        <xs:attribute name="linebreak" use="prohibited"/>
        <xs:attribute name="class" use="prohibited"/>
        <xs:attribute name="style" use="prohibited"/>
        <xs:attribute name="xref" use="prohibited"/>
        <xs:attribute ref="xlink:href" use="prohibited"/>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>

  <xs:complexType name="bridge-no-id">
    <xs:complexContent>
      <xs:restriction base="my:bridge-with-id">
        <xs:attribute name="id" use="prohibited"/>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>

  <xs:complexType name="extension-type-1" mixed="true">
    <xs:complexContent>
      <xs:extension base="my:bridge-with-id">
        <xs:sequence/>
        <xs:attribute name="id" type="xs:ID"/>
        <xs:attribute name="flavor"/>
        <xs:attribute name="tone"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:complexType name="extension-type-2" mixed="true">
    <xs:complexContent>
      <xs:extension base="my:bridge-no-id">
        <xs:sequence/>
        <xs:attribute name="gloss" type="xs:IDREF" use="required"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:element name="ext1" type="my:extension-type-1"/>
  <xs:element name="ext2" type="my:extension-type-2"/>

</xs:schema>
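
With this schema in hand, the extension elements are accepted anywhere mml:mspace is accepted. An instance along these lines should validate (a sketch; the content and attribute values are invented for illustration):

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"
          xmlns:my="http://example.com/nss/sample">
  <mml:mrow>
    <mml:mi>x</mml:mi>
    <my:ext1 id="note1" flavor="mild">a marginal remark</my:ext1>
  </mml:mrow>
</mml:math>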

There are two easy ways a vocabulary designer can make this process simpler:

  • Include wildcards at points where you want your grammar to be extensible.

This is a bit of a blunt instrument, but it sometimes gets the job done.

If you want to give the extender more control (perhaps they want some extension elements to be legal in some contexts and others to be legal in other contexts), give them extension hooks in the form of abstract elements with a minimally constraining type (e.g. xs:anyType), so that they don’t need to play games with type derivations, as was necessary for the MathML 2 extension.

  • Include abstract elements with minimally constraining types at points where you want your grammar to be extensible in context-appropriate ways.

As a general rule: any important element class in your document grammar (e.g. phrase-level-element or list or paragraph-level-element) is a good candidate for an abstract element intended to allow users to add new elements simply by making their new elements substitutable for the appropriate abstract element. (We have a new phrase-level element? Fine: declare <xs:element name="new-phrase" substitutionGroup="target:phrase-level-element"/> and we’re done.)
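
For concreteness, here is a minimal sketch of what such a hook might look like on the vocabulary designer’s side (all names invented for illustration): an abstract element with a minimally constraining type, referenced wherever phrase-level elements should be allowed.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:target="http://example.com/nss/target"
           targetNamespace="http://example.com/nss/target"
           elementFormDefault="qualified">

  <!-- Abstract head element: never appears in instances itself,
       but anything substitutable for it may appear wherever the
       content models reference it. -->
  <xs:element name="phrase-level-element" abstract="true"
              type="xs:anyType"/>

  <!-- A paragraph: character data mixed with any phrase-level
       element, including user-supplied members of the
       substitution group. -->
  <xs:element name="p">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="target:phrase-level-element"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>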

Of course, the determinism rules (aka the Unique Particle Attribution constraint) in XSD still make extending a complex document grammar harder than it needs to be; a contrived illustration follows below. But by providing appropriate extension hooks, the designer of a document grammar can make things a lot simpler for the user with special needs.
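
To see how the determinism rules bite, consider a contrived sketch: a content model that combines an explicit optional element with a wildcard that could also match it will be rejected, because a validator cannot attribute an incoming my:ext1 element to a unique particle.

<!-- Violates Unique Particle Attribution: my:ext1 matches both
     the explicit particle and the wildcard. -->
<xs:complexType name="too-clever" mixed="true">
  <xs:sequence>
    <xs:element ref="my:ext1" minOccurs="0"/>
    <xs:any namespace="##any" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>

This is one reason wildcards, convenient as they are, are a blunt instrument: the designer must position them so that they never compete with explicit particles.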

Balisage paper deadline 19 April 2013

[9 April 2013]

Ten days to go until the paper deadline for Balisage 2013 and for the International Symposium on Native XML user interfaces, this August in Montréal.

If you have been thinking about submitting a paper to Balisage or the Symposium (and if you’re reading this blog, I bet the thought has crossed your mind at least once!), you may be thinking “Ten days! Too late!” At times like this (when the deadline is not past, but it feels a little tight) I find it helpful to recall a remark attributed to Leonard Bernstein:

To achieve great things, two things are needed: a plan, and not quite enough time.

True, there’s not quite as much time to write your paper as you’d like. But think of this not as a reason to give up but as your opportunity to achieve great things!

Strategic Research Agenda for Multilingual Europe 2020

[April 2013]

The META Technology Council (part of the Multilingual Europe Technology Alliance) has now published its Strategic Research Agenda for Multilingual Europe 2020; it’s available in book form from Springer, but also available free (in chapter-by-chapter PDFs, not, alas, in live-text [i.e. XML or HTML] form) from the SpringerLink Open Access area.

I am not without bias (I served on the Technology Council), but it says here that the document provides an interesting look at where a group of very smart people believe research in language technology should head in the coming years.

In addition to making a case for the cultural and economic importance of language technology, the paper identifies five specific lines of action (I paraphrase the description in the document’s executive summary). First, three research areas:

  • translingual cloud — cloud services for translation (and interpretation) among European (and major non-European) languages
  • social intelligence and e-participation — tools to support multilingual understanding and dialog to enable e-participation and improve collective decision making
  • socially aware interactive assistants — pervasive multimodal assistive technology

In support of these, it also identifies two other areas where work is needed:

  • Core technologies and resources — “a system of shared, collectively maintained, interoperable tools and resources. They will ensure that our languages will be sufficiently supported and represented in future generations of IT solutions.”
  • European service platform for language technologies — an e-infrastructure to support research and innovation by testing and showcasing results and integrating research and services

Parts of the text may sound a bit bureaucratic (as the paraphrase given shows), but when the document gets technical it is rather interesting.

Well worth reading. And may the funders heed it!

Balisage Symposium 2013: Native XML user interfaces

[19 February 2013]

The organizers of Balisage have announced that the topic of the 2013 pre-conference symposium on 5 August will be Native XML User Interfaces. I will be chairing the symposium this year.

Any project that uses XML intensively will occasionally find it handy to provide specialized interfaces for specific tasks. Historical documentary editions, for example, typically work through any document being published several times: once to transcribe it, several times to proofread it, once to identify points to be annotated, once (or more) to insert the annotation, once for indexing, and so on and so forth. Producers of language corpora similarly work in multiple passes: once to identify sentence boundaries in the transcriptions (possibly by automated means), once to check and correct the sentence boundaries, once to align the sentence boundaries with the sound data, and so on and so forth.

But how?

We can train all our users to use a general-purpose XML editor, but as Henry Thompson pointed out to me some time ago, one problem with that approach is that the person doing the correction of the sentence boundary markup probably should not be changing (on purpose or by accident) the transcription of the data. If you get bored by a tedious task while working with a general-purpose editor, there is essentially no limit to the damage you could do to the document (on purpose or by accident).

Each of these specialized tasks could benefit from having a specialized user interface. But how?

We can write an editor from scratch (if we have a good user interface library and sufficient time on our hands). Java and C++ and Objective-C all have well-known user-interface toolkits; for many other languages, the user-interface toolkits are probably less well known (and possibly less mature), but they are surely there. Henry Thompson and his colleagues in Edinburgh did a lot of work in this vein using Python to implement what they called padded-cell editors.

We don’t have to write the editor from scratch: we can adapt a general-purpose open-source editor (if it’s in a language we are comfortable working in).

We can customize a general-purpose editor (if it has a good customization interface and we know how to use it). I believe a lot of organizations have commissioned customized versions of SGML and XML editors over the years; at one point, SoftQuad had a commercial product (Sculptor) whose purpose was to allow the extension and customization of Author/Editor using Scheme as the customization language.

We can write Javascript that runs in the user’s browser (if we are prepared to stock up on aspirin).

We can write elaborate user interfaces with XSLT 2.0 in the browser, using Michael Kay’s Saxon-CE.

We can write XQuery in the browser using either a plugin or an XQuery engine compiled into Javascript. (I am not sure I can say for certain how many variants of XQuery in the browser there are; I believe the initial implementations were as plugins for IE and Firefox, but the current version appears to be deployed in Javascript.)

We can write XForms, which have the advantage of being strictly declarative and (for many developers) of allowing much of the work to be done using familiar XHTML, CSS, and XPath idioms.

How well do these different approaches work? What is the experience of people and projects who have used them? And how is a person to choose rationally among these many possibilities? What are the relative strengths and weaknesses of the various options? What are the prospects for future developments in this area? These are among the questions I expect attendees at the symposium will get a better grip on.

(As any reader of this blog knows, I think XForms has a compelling story to tell for anyone interested in user interfaces to XML data, so I expect XForms to be prominent in the symposium program. But I’m interested in any and all possible solutions to the problem of developing good user interfaces for XML processing, so we are casting our nets as widely as we can in defining the topic of the symposium.)

I hope to see you there!

Balisage 2013 dates set

[10 December 2012]

The dates of Balisage 2013 have now been set: Tuesday-Friday 6-9 August 2013, with a one-day pre-conference symposium (topic currently being considered, let me know if you have preferences) on Monday 5 August. The conference organizers spent a good deal of time considering the feedback from attendees (and non-attendees) concerning possible changes of date and/or venue; the upshot is that the conference will continue to be in Montréal, in late summer, and that (for 2013, at least) we will again be in the Best Western Hotel Europa on Drummond St.

Paper submissions are due Friday 19 April 2013.

If you would like to serve as a peer reviewer for this year’s conference, you have until 15 March 2013 to sign up.

I hope to see readers of this blog in Montréal!

Checking ISBN check-digits in XSD 1.1

[6 December 2012]

I recently had occasion to write an XSD 1.1 schema for a client whose data includes ISBN and ISSN values.

In a DTD, all one can plausibly say about an element which is supposed to contain an ISBN is that it contains character data, something like this:

<!ELEMENT isbn (#PCDATA) >

That accepts legal ISBN values, like “0 13 651431 6” and “978-1-4419-1901-4”, but it also accepts strings with invalid check-digits, like “0 13 561431 6” (transposition of adjacent digits is said to be the most common single error in typing ISBNs), and strings with the wrong number of digits, like “978-1-4419-19014-4”. For that matter, it also accepts strings like “@@@ call Sally and ask what the ISBN is going to be @@@”. (There may be stages in a document’s life when you want to accept that last value. But there may also be stages when you don’t want to allow anything but a legal ISBN. This post is about what to do when writing a schema for that latter set of stages in a document’s life.)

In XSD 1.0, the length facet and regular-expression patterns can be used to say, more specifically, that the value of a ten-digit ISBN should be of a specific length (thirteen, actually, not ten, because we want to require hyphens or blanks as separators) and should contain only decimal digits, separators, and X (because X is a legal check-digit).

<xsd:simpleType name="ISBN-10">
<xsd:restriction base="xsd:string">
<xsd:length value="13"/>
<xsd:pattern value="[0-9X -]*"/>
</xsd:restriction>
</xsd:simpleType>

Actually, we can do better than that. In a ten-digit ISBN, there should be ten digits: one to five digits in the so-called group identifier (which divides the world into language/country areas), one to seven digits in the publisher code (in the US, all publisher codes use at least two digits, but I have not been able to find anything that plausibly asserts this is necessarily true for all publisher codes world-wide), one to seven in the item number, and a final digit (or X) as a check digit.

<xsd:simpleType name="ISBN-10">
<xsd:restriction base="xsd:string">
<xsd:length value="13"/>
<xsd:pattern
value="[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9X]"/>
<xsd:pattern
value="[0-9]{1,5} [0-9]{1,7} [0-9]{1,7} [0-9X]"/>
</xsd:restriction>
</xsd:simpleType>

Since the number of separators is fixed, and the total length of the string is fixed, the type definition above will only accept literals with exactly ten non-separator digits. The patterns above assume that either hyphens or blanks will be used as separators, not a mix of hyphens and blanks; they also want any X appearing as a check-digit to be uppercase.

A similar type can be defined for thirteen-digit ISBNs, which add a three-digit industry-code prefix and another separator at the beginning:

<xsd:simpleType name="ISBN-13">
<xsd:restriction base="xsd:string">
<xsd:length value="17"/>
<xsd:pattern
value="(978|979)-[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9]"/>
<xsd:pattern
value="(978|979) [0-9]{1,5} [0-9]{1,7} [0-9]{1,7} [0-9]"/>
</xsd:restriction>
</xsd:simpleType>

In XSD 1.0, that’s as much as we can conveniently do. (Well, almost. If we are willing to endure the associated tedium, we can check for the correct positioning of hyphens in at least the ISBNs of some areas which assign publisher codes in such a way as to ensure that ISBNs remain unique even if the separators are dropped. See the ISBN datatype defined by Roger Costello and Roger Sperberg for an illustration of the principle.)

In theory, we ought to be able to do better: the check-digit algorithm can be checked by a finite-state automaton, and the languages of ten-digit and thirteen-digit ISBNs are thus demonstrably regular languages. So in principle, there are regular expressions that can perform the check-digit calculation. When I have tried to translate from the FSA to a regular expression, however, the result has been uncomfortably long.

But in XSD 1.1, the addition of assertions makes it possible to replicate the check-digit algorithm. We can write a type definition similar to the ones given above, with an additional xsd:assertion element whose test attribute has as its value an XPath expression which will validate the check-digit.

The ISBN-10 check-digit is constructed in such a way that the sum of digit 1 × 10 + digit 2 × 9 + … + digit 8 × 3 + digit 9 × 2 + digit 10 (if digit 10 is a digit, or 10 if digit 10 is an X), modulo 11, is equal to 0. The ISBN-13 check-digit uses a similar but simpler calculation: the numeric values of digits in even-numbered positions are multiplied by three, those of the digits in odd-numbered positions by one, and the sum of these weighted values must be a multiple of ten. This calculation is well within the range of XPath 2.0; let us build up the expression in stages.

Given a candidate ISBN in variable $value, we can obtain a string of digits (or X) without the separators by deleting all hyphens and blanks, which we can do in XPath by writing:

translate($value,' -','')

We can turn that, in turn, into a sequence of numbers (the UCS code-point numbers for the characters) using the XPath 2.0 function string-to-codepoints:

string-to-codepoints(translate($value,' -',''))

For example, given the ISBN “0 13 651431 6” as the value of $value, the expression just given evaluates to the sequence of integers (48 49 51 54 53 49 52 51 49 54). For purposes of the checksum calculation, however, we’d rather have a 0 in the ISBN appear as a 0, not a 48, in our sequence of numbers. And we need to turn X (which maps to 88) into 10. So we write the following XPath 2.0 expression:

for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)

Now the ten-digit ISBN “0 13 651431 6” and the thirteen-digit ISBN “978-1-4419-1901-4” map, respectively, to the sequences (0 1 3 6 5 1 4 3 1 6) and (9 7 8 1 4 4 1 9 1 9 0 1 4). This gives us precisely what we need for doing the arithmetic.

From the integer sequences thus created, we can extract the first digit by writing the filter expression [1], the second digit with [2], etc. It would be convenient to be able to assign the integer sequence to a variable, but that’s not possible in XPath 2.0 (at least, not using normal means). In writing the schema document, however, we can put the expression that generates the sequence into a named entity, thus:

<!ENTITY digit-sequence
"(for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48))">

Now we can write the assertion for ISBN-10 thus:

<xsd:assertion test="
  ((&digit-sequence;[1] * 10
  + &digit-sequence;[2] * 9
  + &digit-sequence;[3] * 8
  + &digit-sequence;[4] * 7
  + &digit-sequence;[5] * 6
  + &digit-sequence;[6] * 5
  + &digit-sequence;[7] * 4
  + &digit-sequence;[8] * 3
  + &digit-sequence;[9] * 2
  + &digit-sequence;[10] * 1) mod 11) eq 0
"/>

We could write a similar expression for ISBN-13, but in fact we can use simple arithmetic to simplify the expression to:

<xsd:assertion test="
  ((sum(&digit-sequence;[position() mod 2 = 1])
  + sum(for $d in (&digit-sequence;[position() mod 2 = 0])
        return 3 * $d)
  ) mod 10) eq 0
"/>

(Digression on entities …)

Some people, of course, frown on the use of entities in XML and claim that they are not helpful. I think examples like this one clearly show that entities can be very useful when used intelligently; it is much easier to see that the assertions given above are correct than to verify the equivalent assertions after entity expansion (post-edited here for legibility):

<xsd:assertion test="
(((for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [1] * 10
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [2] * 9
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [3] * 8
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [4] * 7
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [5] * 6
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [6] * 5
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [7] * 4
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [8] * 3
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [9] * 2
+ (for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)) [10] * 1)
mod 11)
eq 0
"/>

<xsd:assertion test="
((sum(
(for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)
)[position() mod 2 = 1]
)
+ sum(for $d in
((for $d in
string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)
) [position() mod 2 = 0])
return 3 * $d))
mod 10)
eq 0
"/>

The use of entity references makes it far easier to be confident that the two, or ten, for-expressions all really do the same thing, and they provide a level of abstraction which, in a simple way, encapsulates the book-keeping details and allows the overall structure of the two test expressions to be more clearly exhibited.

(End of digression.)

The end result is an XSD 1.1 datatype that detects most typos in the recording of ISBNs. It does not, alas, ensure that the legal ISBN one types in is actually the correct ISBN, only that it is a correct ISBN. But using machines to check what machines can check will leave more time for humans to check those things that only humans can check.

Balisage 2012 – just a week to go

[31 July 2012]

Balisage 2012 is just a week away. Next Monday there is the pre-conference symposium on quality assurance and quality control in XML systems, and a week from today the conference proper starts.

I’m looking forward to pretty much all of the papers on the program, so it’s kind of hard to pick any out for particular mention. And yet, unless I want to just reproduce the program for the conference, I’m going to have to.

Several papers this year deal, one way or another, with the relation of XML and JSON. Some talk about JSON support in XML tools, some about simplifying XML so it has more appeal to the kind of person who finds JSON attractive. Hans-Jürgen Rennau has a different take: he proposes a modest generalization of the XDM data model (which underlies XPath, XSLT, and XQuery and is as close as anyone is likely to come to being the consensus model for XML) which makes the existing XDM and JSON models each a specialization of the more general model. Since XPath, XQuery, and XSLT work on XDM instances, not on serialized data, they then apply without contortions both to XML and to JSON. (Of course, they need a few modest extensions to cover the new data model, too.)

Changing the underlying data model for a technology is hard, of course, but it’s not impossible (SQL has done so, at least in some ways, and that’s one reason for its longevity). I think Rennau’s proposal merits serious discussion. It’s certainly one of the most far-reaching papers at this year’s conference.

Several talks address the relation of XML and non-XML notations for languages, and I’m looking forward to the discussions that that thread of the conference elicits. David Lee, now with MarkLogic, considers what life would be like if we marked up structure in programming languages the way we mark it up in documents. Norm Walsh continues the thread with a discussion of the general issue with particular reference to possible designs for a ‘compact syntax’ for XProc. Mark D. Flood, Matthew McCormick, and Nathan Palmer approach the problem complex from a different and enlightening angle, that of literate programming, in their case literate programming for the development of test cases for scientific function libraries. Mario Blažević offers the latest entry in the ongoing series of papers exploring how to do things with XML that were (in some form or other) part of SGML but were dropped when XML was designed. His paper shows how we might do SHORTREF in an XML context in a more general and more reliable way than was achieved when SHORTREF was bundled into SGML. And finally, Sam Wilmott opens the entire series of talks with a case study and general reflections on literate programming. I look forward to Wednesday at Balisage!

As is customary at Balisage, a few papers approach resolutely theoretical topics, either with or without overt practical applications. I’ll mention just a few: Hervé Ruellan of Canon discusses a long series of careful measurements of entropy in various data structures for XML; his paper feels in some ways like the theoretical underpinnings I wish the Efficient XML Interchange working group had had at the beginning of its work. Abel Braaksma describes the use of higher-order functions as a way to simplify XSLT stylesheet development. And Claus Huitfeldt, Fabio Vitali, and Silvio Peroni have produced a response to the paper presented in 2010 by Allen Renear and Karen Wickett of the University of Illinois claiming that documents (as we conventionally try to formalize them) do not exist. Huitfeldt and his co-authors explore the possibility of viewing documents as ‘timed abstract objects’.

Theory, practice, practice, and theory. I look forward to seeing you at Balisage.