Balisage paper deadline 19 April 2013

[9 April 2013]

Ten days to go until the paper deadline for Balisage 2013 and for the International Symposium on Native XML User Interfaces, this August in Montréal.

If you have been thinking about submitting a paper to Balisage or the Symposium (and if you’re reading this blog, I bet the thought has crossed your mind at least once!), you may be thinking “Ten days! Too late!” At times like this (when the deadline is not past, but it feels a little tight) I find it helpful to recall a remark attributed to Leonard Bernstein:

To achieve great things, two things are needed: a plan, and not quite enough time.

True, there’s not quite as much time to write your paper as you’d like. But think of this not as a reason to give up but as your opportunity to achieve great things!

Balisage Symposium 2013: Native XML user interfaces

[19 February 2013]

The organizers of Balisage have announced that the topic of the 2013 pre-conference symposium on 5 August will be Native XML User Interfaces. I will be chairing the symposium this year.

Any project that uses XML intensively will occasionally find it handy to provide specialized interfaces for specific tasks. Historical documentary editions, for example, typically work through any document being published several times: once to transcribe it, several times to proofread it, once to identify points to be annotated, once (or more) to insert the annotation, once for indexing, and so on and so forth. Producers of language corpora similarly work in multiple passes: once to identify sentence boundaries in the transcriptions (possibly by automated means), once to check and correct the sentence boundaries, once to align the sentence boundaries with the sound data, and so on and so forth.

But how?

We can train all our users to use a general-purpose XML editor, but as Henry Thompson pointed out to me some time ago, one problem with that approach is that the person correcting the sentence-boundary markup probably should not be changing the transcription of the data, whether on purpose or by accident. With a general-purpose editor, a user who gets bored by a tedious task can do essentially unlimited damage to the document.

Each of these specialized tasks could benefit from having a specialized user interface. But how?

We can write an editor from scratch (if we have a good user-interface library and sufficient time on our hands). Java and C++ and Objective-C all have well-known user-interface toolkits; for many other languages, the user-interface toolkits are probably less well known (and possibly less mature), but they are surely there. Henry Thompson and his colleagues in Edinburgh did a lot of work in this vein using Python to implement what they called padded-cell editors.

We don’t have to write the editor from scratch: we can adapt a general-purpose open-source editor (if it’s in a language we are comfortable working in).

We can customize a general-purpose editor (if it has a good customization interface and we know how to use it). I believe a lot of organizations have commissioned customized versions of SGML and XML editors over the years; at one point, SoftQuad had a commercial product (Sculptor) whose purpose was to allow the extension and customization of Author/Editor using Scheme as the customization language.

We can write Javascript that runs in the user’s browser (if we are prepared to stock up on aspirin).

We can write elaborate user interfaces with XSLT 2.0 in the browser, using Michael Kay’s Saxon-CE.

We can write XQuery in the browser, using either a plugin or an XQuery engine compiled into Javascript. (I cannot say for certain how many variants of XQuery in the browser there are; I believe the initial implementations were implemented as plugins for IE and Firefox, but the current version appears to be deployed in Javascript.)

We can write XForms, which have the advantage of being strictly declarative and (for many developers) of allowing much of the work to be done using familiar XHTML, CSS, and XPath idioms.
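To make the earlier point about sentence-boundary correction concrete, here is a rough sketch (mine, not part of the symposium announcement; the element names w and s and the start/end attributes are invented for illustration, not taken from any real corpus schema) of how an XForms model might expose the boundary markup for correction while locking the transcription itself against change:

<xf:model xmlns:xf="http://www.w3.org/2002/xforms">
  <!-- load the (hypothetical) transcription to be corrected -->
  <xf:instance src="transcription.xml"/>
  <!-- the word-level transcription is visible but not editable -->
  <xf:bind nodeset="//w" readonly="true()"/>
  <!-- sentence-boundary attributes remain writable -->
  <xf:bind nodeset="//s/@start | //s/@end" readonly="false()"/>
</xf:model>

A bored or distracted user of such an editor can still misplace a sentence boundary, but cannot silently alter the transcription.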

How well do these different approaches work? What is the experience of people and projects who have used them? And how is a person to choose rationally among these many possibilities? What are the relative strengths and weaknesses of the various options? What are the prospects for future developments in this area? These are among the questions I expect attendees at the symposium will get a better grip on.

(As any reader of this blog knows, I think XForms has a compelling story to tell for anyone interested in user interfaces to XML data, so I expect XForms to be prominent in the symposium program. But I’m interested in any and all possible solutions to the problem of developing good user interfaces for XML processing, so we are casting our nets as widely as we can in defining the topic of the symposium.)

I hope to see you there!

Balisage 2013 dates set

[10 December 2012]

The dates of Balisage 2013 have now been set: Tuesday-Friday 6-9 August 2013, with a one-day pre-conference symposium (topic currently being considered, let me know if you have preferences) on Monday 5 August. The conference organizers spent a good deal of time considering the feedback from attendees (and non-attendees) concerning possible changes of date and/or venue; the upshot is that the conference will continue to be in Montréal, in late summer, and that (for 2013, at least) we will again be in the Best Western Hotel Europa on Drummond St.

Paper submissions are due Friday 19 April 2013.

If you would like to serve as a peer reviewer for this year’s conference, you have until 15 March 2013 to sign up.

I hope to see readers of this blog in Montréal!

Balisage 2012 – T minus 21 days

[16 July 2012]

Hard to believe, but Balisage 2012 is only three weeks away.

On Monday 6 August there is a pre-conference symposium on quality assurance and quality control in XML. I won’t list all the scheduled talks here, but the symposium program has a good balance of theory and practice, abstract rule and concrete application, and there are several case studies from organizations with major XML publishing programs (Ontario Scholars Portal, the U.S. National Library of Medicine’s National Center for Biotechnology Information, the American Chemical Society, and Portico).

Tuesday through Friday, the conference proper will take place. Among the many talks I am looking forward to, today I’ll mention just a few.

Mary Holstege opens the conference with a talk about type introspection in XQuery; as a principal engineer at MarkLogic, she has a deep background in the technology of XQuery and related specifications and a good understanding of how real customers with large amounts of textual data actually use XML.

Later the same day, Steven Pemberton of W3C will speak on the relation between data abstractions and their serializations, with (passing) reference to work on XForms 2.0. Steven gives dynamite talks, and I want to hear how he describes the interplay of general design problems with the concrete work of spec development.

And at the other end of the week, Friday morning Liam Quin (also of W3C) will talk about work he has been doing to characterize the body of material served as XML on the Web, in particular that part of it which is not actually well-formed XML (and thus, in the strict sense, not XML at all). Since sometimes people use the existence of non-well-formed data on the Web to support arguments that XML’s well-formedness rules are too strict for practical use, I look forward to hearing Liam’s analysis.

Of course, there is a lot more to look forward to. I hope, dear reader, that I will see you in Montréal next month!

XML, XForms, and XQuery courses in June 2012

[5 April 2012]

I’ll be teaching three courses / workshops this June.

XML for digital librarians

Recently the organizers of the ACM / IEEE-CS 2012 Joint Conference on Digital Libraries asked me to teach a pre-conference tutorial on “Making the most of XML in Digital Libraries”; JCDL will be hosted by George Washington University in Washington, DC, this year.

The tutorial description runs something like this:

The Extensible Markup Language (XML) was designed to help make electronic information device- and application-independent and thus give that information a longer useful lifetime. XML is thus a natural tool for constructing digital libraries. But where exactly does XML fit into the conceptual framework of digital libraries? Where can XML and related technologies help achieve DL goals?

This tutorial will provide participants with an introduction to basic concepts of XML and a DL-oriented overview of XML and related technologies (XML, XPath, XSLT, XQuery, the XML information set, XDM, XForms, XProc, and many more). The intent is to show how XML can be used to help digital libraries achieve their goals and to enable participants to know which XML technologies are most relevant for the work they are involved with.

The JCDL site doesn’t have a detailed schedule yet, so I don’t know the exact time and date of this tutorial.

XForms for XML users

I’ve also arranged with Mulberry Technologies of Rockville, Maryland, to use their training facilities to offer two two-day training courses immediately before and after the JCDL conference, so that people traveling to DC for JCDL can extend their trip at one or both ends to attend the courses.

One course provides an Introduction to XForms for XML users, covering a technology with huge (and largely unrecognized) potential for users of XML. XForms is built around the model / view / controller idiom, with a collection of XML documents playing the role of the model, and with the view represented (in most uses of XForms) by an XHTML document. The end result is that a few lines of XHTML, XForms, and CSS can suffice to build user interfaces that would take hundreds of lines of Javascript and thousands or tens of thousands of lines of Java or a similar language. XForms makes it feasible to write project-specific, workflow-specific, even task-specific XML editors; no serious XML project should be without XForms capabilities.
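To give a rough idea of the scale involved (this is a sketch of my own, not taken from the course materials; the note instance with its target and text elements, and the note.xml save target, are invented for the example), a complete if minimal XForms editor can look like this:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xf="http://www.w3.org/2002/xforms">
  <head>
    <title>Minimal annotation editor</title>
    <xf:model>
      <!-- the XML document being edited plays the role of the model -->
      <xf:instance>
        <note xmlns="">
          <target/>
          <text/>
        </note>
      </xf:instance>
      <!-- where and how to save the edited instance -->
      <xf:submission id="save" resource="note.xml" method="put"/>
    </xf:model>
  </head>
  <body>
    <!-- the XHTML body plays the role of the view -->
    <xf:input ref="target">
      <xf:label>Passage annotated</xf:label>
    </xf:input>
    <xf:textarea ref="text">
      <xf:label>Annotation</xf:label>
    </xf:textarea>
    <xf:submit submission="save">
      <xf:label>Save</xf:label>
    </xf:submit>
  </body>
</html>

Everything not written here (widget rendering, event handling, serializing and submitting the instance) is the XForms processor’s job, which is where the savings over hand-written Javascript or Java come from.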

The XForms course will be held just before JCDL, on Friday and Saturday 8 and 9 June.

XQuery for documents

The other course provides an introduction to XQuery for documents. Much of the interest in XQuery has come from database vendors and database users, and not surprisingly much of the public discussion of XQuery has focused on the kinds of problems familiar to users of database management systems. Those who use XML for natural-language documents have, I think, sometimes gotten the impression that XQuery must be aimed primarily at other kinds of XML and other kinds of people. This course is designed to introduce XQuery in a way that underscores its relevance to the human-readable documents that are historically the core use case for XML, with examples that assume an interest in documents rather than an interest in database management systems. This course will be held just following JCDL, on Friday and Saturday 15 and 16 June.
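For a taste of what XQuery over documents looks like, here is a small sketch (my own illustration, not drawn from the course materials; play.xml and its speech/speaker markup are assumptions made for the example): a query that reports how often each character in a play speaks.

(: For each distinct speaker in the (hypothetical) play.xml,
   count the speeches attributed to that speaker and return
   the results as XML, most talkative speaker first. :)
let $play := doc("play.xml")
for $name in distinct-values($play//speech/speaker)
let $count := count($play//speech[speaker = $name])
order by $count descending
return <speaker name="{ $name }" speeches="{ $count }"/>

Nothing here presupposes a database; the same query can be run over a single document in any XQuery processor that supports the doc() function.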

Further information on the two Black Mesa Technologies courses is available on the Black Mesa Technologies site.

Balisage 2011

[11 August 2011]

Last week, Balisage 2011 took place in Montréal.

As one of the organizers, I should not brag overmuch about the conference, but I can’t resist saying that on the whole it seemed to go fairly well. (But hey, you don’t have to take my word for it. Cornelia Davis of EMC, who gave a well-received talk about Programming Application Logic for RESTful Services Using XML Technologies, has written a blog post describing her experiences at Balisage 2011. Read it!)

Allen Renear (with collaborators) and Walter Perry gave thoughtful and thought-provoking papers on the nature of identity and the role of identifiers (rather dense, and probably not to everyone’s taste — I suspect some found them less thought-provoking than just provoking). There were case-study reports on ebook deployments, the markup of supplementary material in electronic journals (a huge issue for the maintainers of the scholarly record in science), and the revival in XML of an old SGML project whose server died. Michael Kay and O’Neil Delpratt talked about their work measuring the performance benefits of byte-code generation in Saxon. Eric van der Vlist described a small Javascript library he has written for the support of multi-ended hyperlinks. Eric Freese reported on the state of EPUB3. Michael Kay gave an impromptu evening session on Saxon-CE (XSLT 2.0 in the browser, now out for alpha testing). Jean-Yves Vion-Dury talked about a method of encrypting XML documents in such a way that a service provider can store them and perform certain operations on them (like running a restricted class of XSLT stylesheets or XQueries over them) without decrypting them, by the ingenious technique of encrypting appropriate parts (that’s the tricky bit: what exactly are the appropriate parts?) of the stylesheet or query. And there was much, much more.

Balisage 2012 will be in Montréal in August 2012. As Patrick Durusau put it:

If you see either < or > at work or anyone talks about them, you need to be at Balisage ….

Mark your calendars.

XML Prague 2011

[17 February 2011]

XML Prague is just over a month away. I’ll be there again this year, the organizers having once more generously invited me to provide some closing remarks at the end of the conference.

I’d urge anyone within easy travel distance of Prague to plan to attend, but it turns out the conference is booked to capacity already. So there’s nothing to do, if you haven’t already registered, but wait ’til next year. (Well, that’s not really true. XML Prague provides live streaming video, which means you may be able to watch some of the talks even if you can’t be in the room.)

The theme this year is Web and XML; the suggested topics include speculation on why XML never became the usual way to prepare Web pages and the relations between XML and HTML 5 and between XML and JSON. The latter two, at least, seem designed to provoke some bottle-throwing; maybe they will succeed.

The papers include one by Alain Couthures on JSON support in XForms and one by Jason Hunter on a JSON facade for MarkLogic Server; this suggests that at least some XML users plan to co-exist with JSON by the expedient of making XML tools present and work with JSON as if it were XML. When one has to deal with information provided by others which is available only in JSON form and not in XML, it will be handy to view the JSON information through XML lenses.

Other papers describe XQuery in the browser (by way of a Javascript implementation from ETH Zurich, created by compiling the Java-based MXQuery engine into Javascript using Google’s Web Toolkit), XSD in the browser (from the University of Edinburgh), and XSLT 2.0 in the browser (from Saxonica), as well as a general consideration of XML processing in the browser (Edinburgh again). Some papers are about XML and information processing outside the browser: one team is translating SPARQL into XQuery, and Uche Ogbuji of Zepheira is presenting the Akara framework under the title “Spicy Bean Fritters and XML Data Services”, which makes me eager to go have some spicy bean fritters (figurative or literal).

There are other papers on EPUB, on electronic Bibles, on XQuery optimization, and on a variety of specific applications, projects, and tools.

It should be fun. If you’ll be there, I look forward to seeing you; if you won’t make it this year, you might sample the conference using the video feed (some people I know turn off the video as distracting and just listen to the audio, which takes less bandwidth). And if not this year, then perhaps next year.

What constitutes successful format conversion?

[31 December 2010]

I wrote about the International Digital Curation Conference earlier this month, but did not provide a pointer to my own talk.

My slides are on the Web on this site; they may give some idea of the general thrust of my talk. (On slide 4, “IANAPL” expands to “I am not a preservation librarian”. On slide 20, the quotation is from an anonymous review of my paper.)

Over time, I have become more and more convinced that formal proofs of correctness are important for things we care about. The other day, for example (29 December to be exact), I saw a front-page article in the New York Times about radiation overdoses resulting from both hardware and software shortcomings in the device used to administer radiotherapy. I found it impossible not to think that formal proofs of correctness could help prevent such errors. (Among other things, formal proofs of correctness force those responsible to say with some precision what correct behavior is, for the software in question, which is likely to lead to more explicit consideration of things like error modes than might otherwise happen.)

Formal specification of the meaning of markup languages is only a small part of making formal proofs of system correctness possible. But it’s a step, I think.

International Data Curation Conference (IDCC) 6, Chicago

[13 December 2010]

Spent the early part of last week in Chicago attending the 6th International Digital Curation Conference, co-sponsored by the U.K.-based Digital Curation Centre and the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana-Champaign, “in partnership with” the Coalition for Networked Information (CNI).

I won’t try to summarize everything here, just mention some talks that caught my attention and stick in my mind.

The opening keynote by Chris Lintott talked about his experiences setting up the Galaxy Zoo, an interactive site that allows users to classify galaxies in the Sloan Digital Sky Survey by their form (clockwise spiral, counter-clockwise spiral, etc.). At the outset I was deeply skeptical, but he won me over by his anecdotes about some surprising results achieved by the mass mobilization of these citizen scientists, and by saying that if you want that kind of thing to work, you must treat the users who are helping you as full partners in the project: you must describe accurately what you are doing and how they are helping, and you must share the results with them, as members of the project team. The Galaxy Zoo has been such a success that they are now building an infrastructure for more projects of the same kind (essentially: where humans do better than software at recognizing patterns in the data, and where it’s thus useful to ask humans to do the pattern recognition on large datasets), called the Zooniverse.

Those with projects that might usefully be crowd-sourced should give the Zooniverse a look; it might make it feasible to do things you otherwise could not manage. (I wonder if one could get better training data for natural-language parsers and part-of-speech taggers that way?)

In the immediately following session, Antony Williams of the ChemSpider project described the somewhat less encouraging results of a survey of Web information about chemistry, from the point of view of a professional chemist who cares about accuracy (and in particular cares about stereochemical details). Barend Mons gave an optimistic account of how RDF can be used not just to describe Web pages but to summarize the information they contain, sentence by sentence, and the massive triple stores that result can be used to find new interesting facts. It was all very exciting, but his examples made me wonder whether you can really reduce a twenty-five- or forty-word sentence to a triple without any loss of nuance. In the question session, Michael Lesk asked an anodyne question about Doug Lenat and the Cyc project, which made me think he was a little skeptical, too. But the speaker dodged the bullet (or tried to) by drawing firm lines of difference between his approach and Cyc. John Unsworth rounded out the session by describing the MONK (Metadata offer new knowledge) project and making the argument that humanities data may resist correction and normalization more violently than scientific data, the idiosyncrasy of the data being part of the point. (As Jocelyn Penny Small of Rutgers once said in an introduction to databases for humanists, “Your job is not to use your computer database to clean up this mess. Your job is to use the computer and the database to preserve the mess in loving detail.”)

Another session that sticks in my mind is one in which Kate Zwaard of the U.S. Government Printing Office spoke about the GPO’s design of FDSys, intended to be a trusted digital repository for (U.S. federal) government documents. In developing their metadata ontology, they worked backwards from the required outcomes to identify the metadata necessary to achieve those outcomes. It reminded me of the old practice of requirements tracing (in which every feature in a software design is either traced to some accepted requirement, or dropped from the design as an unnecessary complication). In the same session, Michael Lesk talked about work he and others have done trying, with mixed success, to use explicit ontologies to help deal with problems of data integration — for example, recognizing all the questions in some database of opinion-survey questions which are relevant to some user query about exercise among the elderly. He didn’t have much of an identifiable thesis, but the way he approached the problems was almost worth the trip to Chicago by itself. I wish I could put my finger on what makes it so interesting to hear about what he’s done, but I’m not sure I can. He chooses interesting underlying questions, he finds ways to translate them into operational terms so you can measure your results, or at least check them empirically, he finds large datasets with realistic complexity to test things on, and he describes the results without special pleading or excuses. It’s just a pleasure to listen to him. The final speaker in the session was Huda Khan of Cornell, talking about the DataStaR project based there, in which they are using semantic web technologies in the service of data curation. Her remarks were also astute and interesting; in particular, she mentioned that they are working with a tool called Gloze, for mapping from XML document instances into RDF; I have got to spend some time running that tool down and learning about it.

My own session was interesting, too, if I say so myself. Aaron Hsu of Indiana gave an energetic account of the difficulties involved in providing emulators for software, with particular attention to the challenges of Windows dynamic link libraries and the particular flavor of versioning hell they invite. I talked about the application of markup theory (in particular, of one view of the meaning of markup) to problems of preservation — more on that in another post — and Maria Esteva of the University of Texas at Austin talked about visualizations for heterogeneous electronic collections, in particular visualizations to help curators get an overview of a collection and its preservation status, so that they know where it is most urgent to focus their attention.

All in all, a good conference and one I’m glad I attended. Just two nagging questions in the back of my mind, which I record here for the benefit of conference organizers generally.

(1) Would it not be easier to consult the program on the Web if it were available in HTML instead of only in PDF? (Possibly not; possibly I am the only Web user left whose screen is not the same shape as a piece of A4 paper and who sometimes resizes windows to shapes different from the browser default.)

(2) How can a conference on the topic of long-term digital preservation choose to require that papers be submitted in a closed proprietary format (here, .doc files), instead of allowing, preferring, or requiring an open non-proprietary format? What does this say about the prospects for long-term preservation of the conference proceedings? (A voice in my ear whispers “Viable just as long as Microsoft thinks the current .doc format is commercially a good idea”; I think IDCC and other conferences on long-term preservation can and should do better than that.)