Balisage paper deadline 19 April 2013

[9 April 2013]

Ten days to go until the paper deadline for Balisage 2013 and for the International Symposium on Native XML user interfaces, this August in Montréal.

If you have been thinking about submitting a paper to Balisage or the Symposium (and if you’re reading this blog, I bet the thought has crossed your mind at least once!), you may be thinking “Ten days! Too late!” At times like this (when the deadline is not past, but it feels a little tight) I find it helpful to recall a remark attributed to Leonard Bernstein:

To achieve great things, two things are needed; a plan, and not quite enough time.

True, there’s not quite as much time to write your paper as you’d like. But think of this not as a reason to give up but as your opportunity to achieve great things!

Strategic Research Agenda for Multilingual Europe 2020

[April 2013]

The META Technology Council (part of the Multilingual Europe Technology Alliance) has now published its Strategic Research Agenda for Multilingual Europe 2020; it’s available in book form from Springer, but also available free (in chapter-by-chapter PDFs, not, alas, in live-text [i.e. XML or HTML] form) from the SpringerLink Open Access area.

I am not without bias (I served on the Technology Council), but it says here that the document provides an interesting look at where a group of very smart people believe research in language technology should head in the next few years.

In addition to making a case for the cultural and economic importance of language technology, the paper identifies five specific lines of action (I paraphrase the description in the document’s executive summary). First, three research areas:

  • translingual cloud — cloud services for translation (and interpretation) among European (and major non-European) languages
  • social intelligence and e-participation — tools to support multilingual understanding and dialog to enable e-participation and improve collective decision making
  • socially aware interactive assistants — pervasive multimodal assistive technology

In support of these, it also identifies two other areas where work is needed:

  • Core technologies and resources — “a system of shared, collectively maintained, interoperable tools and resources. They will ensure that our languages will be sufficiently supported and represented in future generations of IT solutions.”
  • European service platform for language technologies — an e-infrastructure to support research and innovation by testing and showcasing results and integrating research and services

Parts of the text may sound a bit bureaucratic (as the paraphrase given shows), but when the document gets technical it is rather interesting.

Well worth reading. And may the funders heed it!

Balisage Symposium 2013: Native XML user interfaces

[19 February 2013]

The organizers of Balisage have announced that the topic of the 2013 pre-conference symposium on 5 August will be Native XML User Interfaces. I will be chairing the symposium this year.

Any project that uses XML intensively will occasionally find it handy to provide specialized interfaces for specific tasks. Historical documentary editions, for example, typically work through any document being published several times: once to transcribe it, several times to proofread it, once to identify points to be annotated, once (or more) to insert the annotation, once for indexing, and so on and so forth. Producers of language corpora similarly work in multiple passes: once to identify sentence boundaries in the transcriptions (possibly by automated means), once to check and correct the sentence boundaries, once to align the sentence boundaries with the sound data, and so on and so forth.

But how?

We can train all our users to use a general-purpose XML editor, but as Henry Thompson pointed out to me some time ago, one problem with that approach is that the person doing the correction of the sentence boundary markup probably should not be changing (on purpose or by accident) the transcription of the data. If you get bored by a tedious task while working with a general-purpose editor, there is essentially no limit to the damage you could do to the document (on purpose or by accident).

Each of these specialized tasks could benefit from having a specialized user interface. But how?

We can write an editor from scratch (if we have a good user interface library and sufficient time on our hands). Java and C++ and Objective-C all have well known user-interface toolkits; for many other languages, the user-interface toolkits are probably less well known (and possibly less mature) but they are surely there. Henry Thompson and his colleagues in Edinburgh did a lot of work in this vein using Python to implement what they called padded-cell editors.

We don’t have to write the editor from scratch: we can adapt a general-purpose open-source editor (if it’s in a language we are comfortable working in).

We can customize a general-purpose editor (if it has a good customization interface and we know how to use it). I believe a lot of organizations have commissioned customized versions of SGML and XML editors over the years; at one point, SoftQuad had a commercial product (Sculptor) whose purpose was to allow the extension and customization of Author/Editor using Scheme as the customization language.

We can write Javascript that runs in the user’s browser (if we are prepared to stock up on aspirin).

We can write elaborate user interfaces with XSLT 2.0 in the browser, using Michael Kay’s Saxon-CE.

We can write XQuery in the browser using either a plugin or an XQuery engine compiled into Javascript. (I am not sure I can say for certain how many variants of XQuery in the browser there are; I believe the initial implementations were as plugins for IE and Firefox, but the current version appears to be deployed in Javascript.)

We can write XForms, which have the advantage of being strictly declarative and (for many developers) of allowing much of the work to be done using familiar XHTML, CSS, and XPath idioms.

How well do these different approaches work? What is the experience of people and projects who have used them? And how is a person to choose rationally among these many possibilities? What are the relative strengths and weaknesses of the various options? What are the prospects for future developments in this area? These are among the questions I expect attendees at the symposium will get a better grip on.

(As any reader of this blog knows, I think XForms has a compelling story to tell for anyone interested in user interfaces to XML data, so I expect XForms to be prominent in the symposium program. But I’m interested in any and all possible solutions to the problem of developing good user interfaces for XML processing, so we are casting our nets as widely as we can in defining the topic of the symposium.)

I hope to see you there!

Balisage 2013 dates set

[10 December 2012]

The dates of Balisage 2013 have now been set: Tuesday-Friday 6-9 August 2013, with a one-day pre-conference symposium (topic still under consideration; let me know if you have preferences) on Monday 5 August. The conference organizers spent a good deal of time considering the feedback from attendees (and non-attendees) concerning possible changes of date and/or venue; the upshot is that the conference will continue to be in Montréal, in late summer, and that (for 2013, at least) we will again be in the Best Western Hotel Europa on Drummond St.

Paper submissions are due Friday 19 April 2013.

If you would like to serve as a peer reviewer for this year’s conference, you have until 15 March 2013 to sign up.

I hope to see readers of this blog in Montréal!

Checking ISBN check-digits in XSD 1.1

[6 December 2012]

I recently had occasion to write an XSD 1.1 schema for a client whose data includes ISBN and ISSN values.

In a DTD, all one can plausibly say about an element which is supposed to contain an ISBN is that it contains character data, something like this:

<!ELEMENT isbn (#PCDATA) >

That accepts legal ISBN values, like “0 13 651431 6” and “978-1-4419-1901-4”, but it also accepts strings with invalid check-digits, like “0 13 561431 6” (inversion of digits is said to be the most common single error in typing ISBNs), and strings with the wrong number of digits, like “978-1-4419-19014-4”. For that matter, it also accepts strings like “@@@ call Sally and ask what the ISBN is going to be @@@”. (There may be stages in a document’s life when you want to accept that last value. But there may also be stages when you don’t want to allow anything but a legal ISBN. This post is about what to do when writing a schema for that latter set of stages in a document’s life.)

In XSD 1.0, regular-expression patterns can be used to say, more specifically, that the value of a ten-digit ISBN should be of a specific length (thirteen, actually, not ten, because we want to require hyphens or blanks as separators) and should contain only decimal digits, separators, and X (because X is a legal check-digit).

<xsd:simpleType name="ISBN-10">
  <xsd:restriction base="xsd:string">
    <xsd:length value="13"/>
    <xsd:pattern value="[0-9X \-]*"/>
  </xsd:restriction>
</xsd:simpleType>

Actually, we can do better than that. In a ten-digit ISBN, there should be ten digits: one to five digits in the so-called group identifier (which divides the world in language / country areas), one to seven digits in the publisher code (in the US, all publisher codes use at least two digits, but I have not been able to find anything that plausibly asserts this is necessarily true for all publisher codes world-wide), one to seven in the item number, and a final digit (or X) as a check digit.

<xsd:simpleType name="ISBN-10">
  <xsd:restriction base="xsd:string">
    <xsd:length value="13"/>
    <xsd:pattern 
      value="[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9X]"/>
    <xsd:pattern 
      value="[0-9]{1,5} [0-9]{1,7} [0-9]{1,7} [0-9X]"/>
  </xsd:restriction>
</xsd:simpleType>

Since the number of separators is fixed, and the total length of the string is fixed, the type definition above will only accept literals with exactly ten non-separator digits. The patterns above assume that either hyphens or blanks will be used as separators, not a mix of hyphens and blanks; they also want any X appearing as a check-digit to be uppercase.
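For readers who want to try the ISBN-10 type's behavior without a schema processor, here is a minimal Python sketch of the same two constraints (the length facet and the two alternative patterns). The name looks_like_isbn10 is hypothetical, and re.fullmatch stands in for the implicit anchoring of XSD patterns; this is an approximation for experimenting, not a replacement for schema validation.

```python
import re

# The two alternative patterns from the ISBN-10 type above.
# XSD patterns must match the whole literal, so fullmatch() is used below.
ISBN10_PATTERNS = [
    re.compile(r"[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9X]"),
    re.compile(r"[0-9]{1,5} [0-9]{1,7} [0-9]{1,7} [0-9X]"),
]

def looks_like_isbn10(literal):
    """Hypothetical helper: length facet (13) plus either pattern."""
    if len(literal) != 13:
        return False
    return any(p.fullmatch(literal) for p in ISBN10_PATTERNS)

print(looks_like_isbn10("0 13 651431 6"))   # well-formed literal
print(looks_like_isbn10("0 13 561431 6"))   # bad check-digit, but the pattern cannot tell
print(looks_like_isbn10("0-13 651431-6"))   # mixed separators
```

Note that the second literal still passes: the pattern and length together constrain only the shape of the string, not the check-digit.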

A similar type can be defined for thirteen-digit ISBNs, which add a three-digit industry-code prefix and another separator at the beginning:

<xsd:simpleType name="ISBN-13">
  <xsd:restriction base="xsd:string">
    <xsd:length value="17"/>
    <xsd:pattern 
      value="(978|979)-[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,7}-[0-9]"/>
    <xsd:pattern 
      value="(978|979) [0-9]{1,5} [0-9]{1,7} [0-9]{1,7} [0-9]"/>
  </xsd:restriction>
</xsd:simpleType>

In XSD 1.0, that’s as much as we can conveniently do. (Well, almost. If we are willing to endure the associated tedium, we can check for the correct positioning of hyphens in at least the ISBNs of some areas which assign publisher codes in such a way as to ensure that ISBNs remain unique even if the separators are dropped. See the ISBN datatype defined by Roger Costello and Roger Sperberg for an illustration of the principle.)

In theory, we ought to be able to do better: the check-digit algorithm can be checked by a finite-state automaton, and the languages of ten-digit and thirteen-digit ISBNs are thus demonstrably regular languages. So in principle, there are regular expressions that can perform the check-digit calculation. When I have tried to translate from the FSA to a regular expression, however, the result has been uncomfortably long.
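To make the finite-state claim concrete, here is a small Python sketch of such an automaton for ten-digit ISBNs with the separators already removed. The function name and the representation of a state as a pair (characters read, weighted sum modulo 11) are my own assumptions for illustration. Because the position runs from 0 to 10 and the remainder from 0 to 10, there are at most 121 live states plus one dead state, which is what makes the language regular.

```python
def isbn10_fsa_accepts(chars):
    """Hypothetical sketch: run a finite automaton over ten ISBN characters.

    A state is (characters read so far, weighted sum mod 11).
    Returning False early corresponds to entering the dead state.
    """
    state = (0, 0)
    for ch in chars:
        pos, rem = state
        if pos == 10:
            return False                      # more than ten characters
        if ch == "X" and pos == 9:
            val = 10                          # X is legal only as the check-digit
        elif ch in "0123456789":
            val = int(ch)
        else:
            return False                      # not a legal character here
        weight = 10 - pos                     # weights run 10, 9, ..., 1
        state = (pos + 1, (rem + weight * val) % 11)
    return state == (10, 0)                   # the single accepting state

print(isbn10_fsa_accepts("0136514316"))
print(isbn10_fsa_accepts("0135614316"))
```

Translating 121 states into a single regular expression is what produces the uncomfortable length; the automaton itself is small.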

But in XSD 1.1, the addition of assertions makes it possible to replicate the check-digit algorithm. We can write a type definition similar to the ones given above, with an additional xsd:assertion element whose test attribute has as its value an XPath expression which will validate the check-digit.

The ISBN-10 check-digit is constructed in such a way that the sum of digit 1 × 10 + digit 2 × 9 + … + digit 8 × 3 + digit 9 × 2 + digit 10 (if digit 10 is a digit, or 10 if digit 10 is an X), modulo 11, is equal to 0. The ISBN-13 check-digit uses a similar but simpler calculation: the numeric values of digits in even-numbered positions are multiplied by three, those of the digits in odd-numbered positions by one, and the sum of these weighted values must be a multiple of ten. This calculation is well within the range of XPath 2.0; let us build up the expression in stages.
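Before turning to XPath, it may help to see the two weighted sums written out in a general-purpose language. The following Python sketch assumes the digits have already been turned into integers (with X represented as 10); the function names are hypothetical and the code is only a cross-check of the arithmetic described above, not part of the schema.

```python
def isbn10_sum_ok(digits):
    """digits: ten integers; the last may be 10, standing for X."""
    weights = range(10, 0, -1)                 # 10, 9, ..., 2, 1
    total = sum(w * d for w, d in zip(weights, digits))
    return len(digits) == 10 and total % 11 == 0

def isbn13_sum_ok(digits):
    """digits: thirteen integers; positions are counted from 1."""
    total = sum(d * (3 if pos % 2 == 0 else 1)  # even positions weigh 3, odd weigh 1
                for pos, d in enumerate(digits, start=1))
    return len(digits) == 13 and total % 10 == 0

# "0 13 651431 6": weighted sum is 143, which is 11 * 13
print(isbn10_sum_ok([0, 1, 3, 6, 5, 1, 4, 3, 1, 6]))
# "0 13 561431 6" (two digits inverted): weighted sum is 142
print(isbn10_sum_ok([0, 1, 3, 5, 6, 1, 4, 3, 1, 6]))
# "978-1-4419-1901-4": 27 + 3 * 31 = 120
print(isbn13_sum_ok([9, 7, 8, 1, 4, 4, 1, 9, 1, 9, 0, 1, 4]))
```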

Given a candidate ISBN in variable $value, we can obtain a string of digits (or X) without the separators by deleting all hyphens and blanks, which we can do in XPath by writing:

translate($value,' -','')

We can turn that, in turn, into a sequence of numbers (the UCS code-point numbers for the characters) using the XPath 2.0 function string-to-codepoints:

string-to-codepoints(translate($value,' -',''))

For example, given the ISBN “0 13 651431 6”, as the value of $value, the expression just given evaluates to the sequence of integers (48 49 51 54 53 49 52 51 49 54). For purposes of the checksum calculation, however, we’d rather have a 0 in the ISBN appear as a 0, not a 48, in our sequence of numbers. And we need to turn X (which maps to 88) into 10. So we write the following XPath 2.0 expression:

for $d in string-to-codepoints(translate($value,' -',''))
return if ($d = 88) then 10 else ($d - 48)

Now the ten-digit ISBN “0 13 651431 6” and the thirteen-digit ISBN “978-1-4419-1901-4” map, respectively, to the sequences (0 1 3 6 5 1 4 3 1 6) and (9 7 8 1 4 4 1 9 1 9 0 1 4). This gives us precisely what we need for doing the arithmetic.
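For comparison, the same mapping can be sketched in Python, which makes it easy to try other values. The function name digit_sequence is borrowed from the entity name used later in this post; the Python version is an illustration only, and (like the XPath) it assumes the literal has already passed the pattern check, so that only digits, X, blanks, and hyphens occur.

```python
def digit_sequence(value):
    """Mirror of the XPath expression: drop blanks and hyphens,
    then map each character's code point to its numeric value."""
    stripped = value.translate(str.maketrans("", "", " -"))   # like translate($value,' -','')
    return [10 if cp == 88 else cp - 48                        # 88 is the code point of X
            for cp in map(ord, stripped)]                      # like string-to-codepoints()

print(digit_sequence("0 13 651431 6"))
print(digit_sequence("978-1-4419-1901-4"))
```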

From the integer sequences thus created, we can extract the first digit by writing the filter expression [1], the second digit with [2], etc. It would be convenient to be able to assign the integer sequence to a variable, but that’s not possible in XPath 2.0 (at least, not using normal means). In writing the schema document, however, we can put the expression that generates the sequence into a named entity, thus:

<!ENTITY digit-sequence 
"(for $d in string-to-codepoints(translate($value,' -',''))
  return if ($d = 88) then 10 else ($d - 48))">

Now we can write the assertion for ISBN-10 thus:

<xsd:assertion test="
        ((&digit-sequence;[1] * 10
        + &digit-sequence;[2] * 9
        + &digit-sequence;[3] * 8
        + &digit-sequence;[4] * 7
        + &digit-sequence;[5] * 6
        + &digit-sequence;[6] * 5
        + &digit-sequence;[7] * 4
        + &digit-sequence;[8] * 3
        + &digit-sequence;[9] * 2
        + &digit-sequence;[10] * 1) mod 11) eq 0
        "/>

We could write a similar expression for ISBN-13, but in fact we can use simple arithmetic to simplify the expression to:

<xsd:assertion test="
        ((sum(&digit-sequence;
             [position() mod 2 = 1]) 
        + sum(for $d in (&digit-sequence;
                         [position() mod 2 = 0]) 
              return 3 * $d)
        ) mod 10) eq 0
        "/>

(Digression on entities …)

Some people, of course, frown on the use of entities in XML and claim that they are not helpful. I think examples like this one clearly show that entities can be very useful when used intelligently; it is much easier to see that the assertions given above are correct than to see it in the equivalent assertions after entity expansion (post-edited to provide better legibility):

<xsd:assertion test="
  (((for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [1] * 10         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [2] * 9         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [3] * 8         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [4] * 7         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [5] * 6         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [6] * 5         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [7] * 4         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [8] * 3         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [9] * 2         
  + (for $d in string-to-codepoints(translate($value,' -','')) 
     return if ($d = 88) then 10 else ($d - 48)) [10] * 1) 
  mod 11) 
  eq 0
  "/>      

<xsd:assertion test="
   ((sum(
      (for $d in string-to-codepoints(translate($value,' -','')) 
       return if ($d = 88) then 10 else ($d - 48)
      )[position() mod 2 = 1]
     )          
   + sum(for $d in 
          ((for $d in 
              string-to-codepoints(translate($value,' -','')) 
            return if ($d = 88) then 10 else ($d - 48)
           ) [position() mod 2 = 0]) 
         return 3 * $d)) 
   mod 10) 
   eq 0
   "/>

The use of entity references makes it far easier to be confident that the two, or ten, for-expressions all really do the same thing, and they provide a level of abstraction which, in a simple way, encapsulates the book-keeping details and allows the overall structure of the two test expressions to be more clearly exhibited.

(End of digression.)

The end result is an XSD 1.1 datatype that detects most typos in the recording of ISBNs. It does not, alas, ensure that the legal ISBN one types in is actually the correct ISBN, only that it is a correct ISBN. But using machines to check what machines can check will leave more time for humans to check those things that only humans can check.

Balisage 2012 – just a week to go

[31 July 2012]

Balisage 2012 is just a week away. Next Monday there is the pre-conference symposium on quality assurance and quality control in XML systems, and a week from today the conference proper starts.

I’m looking forward to pretty much all of the papers on the program, so it’s kind of hard to pick any out for particular mention. And yet, unless I want to just reproduce the program for the conference, I’m going to have to.

Several papers this year deal, one way or another, with the relation of XML and JSON. Some talk about JSON support in XML tools, some about simplifying XML so it has more appeal to the kind of person who finds JSON attractive. Hans-Jürgen Rennau has a different take: he proposes a modest generalization of the XDM data model (which underlies XPath, XSLT, and XQuery and is as close as anyone is likely to come to being the consensus model for XML) which makes the existing XDM and JSON models each a specialization of the more general model. Since XPath, XQuery, and XSLT work on XDM instances, not on serialized data, they then apply without contortions both to XML and to JSON. (Of course, they need a few modest extensions to cover the new data model, too.)

Changing the underlying data model for a technology is hard, of course, but it’s not impossible (SQL has done so, at least in some ways, and that’s one reason for its longevity). I think Rennau’s proposal merits serious discussion. It’s certainly one of the most far-reaching papers at this year’s conference.

Several talks address the relation of XML and non-XML notations for languages, and I’m looking forward to the discussions that that thread of the conference elicits. David Lee, now with MarkLogic, considers what life would be like if we marked up structure in programming languages the way we mark it up in documents. Norm Walsh continues the thread with a discussion of the general issue with particular reference to possible designs for a ‘compact syntax’ for XProc. Mark D. Flood, Matthew McCormick, and Nathan Palmer approach the problem complex from a different and enlightening angle, that of literate programming, in their case literate programming for the development of test cases for scientific function libraries. Mario Blažević offers the latest entry in the ongoing series of papers exploring how to do things with XML that were (in some form or other) part of SGML but were dropped when XML was designed. His paper shows how we might do SHORTREF in an XML context in a more general and more reliable way than was achieved when SHORTREF was bundled into SGML. And finally, Sam Wilmott opens the entire series of talks with a case study and general reflections on literate programming. I look forward to Wednesday at Balisage!

As is customary at Balisage, a few papers approach resolutely theoretical topics, either with or without overt practical applications. I’ll mention just a few: Hervé Ruellan of Canon discusses a long series of careful measurements of entropy in various data structures for XML; his paper feels in some ways like the theoretical underpinnings I wish the Efficient XML Interchange working group had had at the beginning of its work. Abel Braaksma describes the use of higher-order functions as a way to simplify XSLT stylesheet development. And Claus Huitfeldt, Fabio Vitali, and Silvio Peroni have produced a response to the paper presented in 2010 by Allen Renear and Karen Wickett of the University of Illinois claiming that documents (as we conventionally try to formalize them) do not exist. Huitfeldt and his co-authors explore the possibility of viewing documents as ‘timed abstract objects’.

Theory, practice, practice, and theory. I look forward to seeing you at Balisage.

Balisage 2012 – T minus 21 days

[16 July 2012]

Hard to believe, but Balisage 2012 is only three weeks away.

On Monday 6 August there is a pre-conference symposium on quality assurance and quality control in XML. I won’t list all the scheduled talks here, but the symposium program has a good balance of theory and practice, abstract rule and concrete application, and there are several case studies from organizations with major XML publishing programs (Ontario Scholars Portal, the U.S. National Library of Medicine’s National Center for Biotechnology Information, the American Chemical Society, and Portico).

Tuesday through Friday, the conference proper will take place. Among the many talks I am looking forward to, today I’ll mention just a few.

Mary Holstege opens the conference with a talk about type introspection in XQuery; as a principal engineer at MarkLogic, she has both a deep background in the technology of XQuery and related specifications and a good understanding of how real customers with large amounts of textual data actually use XML.

Later the same day, Steven Pemberton of W3C will speak on the relation between data abstractions and their serializations, with (passing) reference to work on XForms 2.0. Steven gives dynamite talks, and I want to hear how he describes the interplay of general design problems with the concrete work of spec development.

And at the other end of the week, Friday morning Liam Quin (also of W3C) will talk about work he has been doing to characterize the body of material served as XML on the Web, in particular that part of it which is not actually well-formed XML (and thus, in the strict sense, not XML at all). Since sometimes people use the existence of non-well-formed data on the Web to support arguments that XML’s well-formedness rules are too strict for practical use, I look forward to hearing Liam’s analysis.

Of course, there is a lot more to look forward to. I hope, dear reader, that I will see you in Montréal next month!

XSLTForms 1.0RC, subforms, and a 50% speedup

[9 July 2012]

A couple of weeks ago, I took some time to explore the use of sub-forms in XSLTForms, as a possible way to speed up an XForm I had written that was a little slower than I would have liked.

The short version of the story is: WOW! Well worth learning to use.

To understand the longer version, dear Reader, you should know that one of the most common performance issues in serious uses of XForms is that forms sometimes slow down when the instance documents they are working on get big. I assume this is because browsers are profligate with resources, perhaps because some aspects of the XML DOM force them to be, perhaps because profligacy pays off most of the time. But I can’t say I really know for sure.

So one of the things that sophisticated users of XForms spend a lot of time on is finding ways to avoid loading all the instance documents at once. (This is a lot easier when you’re using an XML database as a back end, of course.) Another is finding ways to avoid loading all of the form at once; that is where sub-forms come in. The word doesn’t occur in the XForms 1.1 spec, but a number of implementations provide experimental support for sub-forms as an extension. The basic idea is that whenever certain events occur in a form, the XForms implementation will load some appropriate resource specifying some XForms widgets and bind them into the current form. When other events occur, those widgets will be unloaded again. I first saw this in a demo on the BetterFORM site a few years ago, but I see that Mark Birbeck was talking about this as long ago as 2006. And more recently, Alain Couthures has added sub-form support to XSLTForms.

Making my form use a sub-form turned out to be simpler than I had feared. I already had a full working version of the form, and it was clear which part of it I wanted to load and unload dynamically. What I had to do was just:

  1. Move the part of the form that should load dynamically (which I’ll now call the subform) into a separate XHTML + XForms document. Give it a simple XForms model, and check to make sure that it works by itself. (It doesn’t actually have to work by itself, but it’s helpful to know the subform hasn’t got fatal errors on its own.)
  2. In the main form, put an XForms xf:group where the sub-form used to be; give that group an ID.
  3. Associate a load action with the appropriate event. (In my form, I had a trigger that toggled a switch, exposing the read/write view of some material. The sub-form now has that read-write view, and the trigger now throws a load action.)
  4. Associate an unload action with another appropriate event. (In my form, this was the trigger that formerly toggled the switch back to the read-only view.)

In principle, that’s it, though I had to fiddle a bit to make everything work right. In particular I ended up adding a ref="." attribute to the outermost xf:group in the sub-form. I’m not yet sure just when this is necessary and when it’s not.

The simple example of sub-forms loading on the XSLTForms web site is very helpful here: it is small and illustrates all the moving parts clearly. (But you will need to read the source and think about what is going on; there isn’t a lot of commentary or documentation around.)

What really impressed me were the effects of this change on the performance of the form.

Since sub-form support was added fairly recently to XSLTForms, I had to upgrade from an older release of XSLTForms to the recent release 1.0RC. I did some fairly tedious timings before and after I made the change, and I can say with some evidence that this change alone gave my form about a 25% increase in speed. Then I made the changes mentioned above, to use sub-forms. That gave me another 25% increase, so that on almost all actions version 1.0RC using sub-forms was about twice as fast as the older version Beta3 using a monolithic form.

Moral 1: If you are having performance issues with an XForm, and you can see how you might use a sub-form, then try it.

Moral 2: If you are having performance issues with an XForm, and you are using XSLTForms, then try moving to 1.0RC even if you can’t see how to use a sub-form in your context. Alain Couthures has done a lot of work on performance, and it clearly helps.

Bear in mind that the precise syntax and semantics of sub-forms are a topic of discussion in the XForms working group, so (a) they are subject to change, and (b) the working group is open to suggestions for making sub-forms (or any other part of XForms) work better.