Strategic Research Agenda for Multilingual Europe 2020

[April 2013]

The META Technology Council (part of the Multilingual Europe Technology Alliance) has now published its Strategic Research Agenda for Multilingual Europe 2020; it’s available in book form from Springer, but also free of charge (as chapter-by-chapter PDFs, alas not in live-text [i.e. XML or HTML] form) from the SpringerLink Open Access area.

I am not without bias (I served on the Technology Council), but it says here that the document provides an interesting look at where a group of very smart people believe research in language technology should head in the next years.

In addition to making a case for the cultural and economic importance of language technology, the paper identifies five specific lines of action (I paraphrase the description in the document’s executive summary). First, three research areas:

  • translingual cloud — cloud services for translation (and interpretation) among European (and major non-European) languages
  • social intelligence and e-participation — tools to support multilingual understanding and dialog to enable e-participation and improve collective decision making
  • socially aware interactive assistants — pervasive multimodal assistive technology

In support of these, it also identifies two other areas where work is needed:

  • Core technologies and resources — “a system of shared, collectively maintained, interoperable tools and resources. They will ensure that our languages will be sufficiently supported and represented in future generations of IT solutions.”
  • European service platform for language technologies — an e-infrastructure to support research and innovation by testing and showcasing results and integrating research and services

Parts of the text may sound a bit bureaucratic (as the paraphrase given shows), but when the document gets technical it is rather interesting.

Well worth reading. And may the funders heed it!

Balisage 2012 – just a week to go

[31 July 2012]

Balisage 2012 is just a week away. Next Monday there is the pre-conference symposium on quality assurance and quality control in XML systems, and a week from today the conference proper starts.

I’m looking forward to pretty much all of the papers on the program, so it’s kind of hard to pick any out for particular mention. And yet, unless I want to just reproduce the program for the conference, I’m going to have to.

Several papers this year deal, one way or another, with the relation of XML and JSON. Some talk about JSON support in XML tools, some about simplifying XML so it has more appeal to the kind of person who finds JSON attractive. Hans-Jürgen Rennau has a different take: he proposes a modest generalization of the XDM data model (which underlies XPath, XSLT, and XQuery and is as close as anyone is likely to come to being the consensus model for XML) which makes the existing XDM and JSON models each a specialization of the more general model. Since XPath, XQuery, and XSLT work on XDM instances, not on serialized data, they then apply without contortions both to XML and to JSON. (Of course, they need a few modest extensions to cover the new data model, too.)
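To make the general idea concrete, here is a toy sketch of such a unified data model (my own illustration, not Rennau’s actual formalization): if XML-style elements and JSON-style maps and arrays are all just kinds of “item” in one tree, a single generic traversal — think of it as a descendant axis — works over either kind of data without conversion.

```python
# Toy illustration of a data model in which XML-flavoured and
# JSON-flavoured values are both "items" in one generalized tree,
# so that one generic traversal serves both. (A sketch only; the
# class and function names here are my own, not from the paper.)

from dataclasses import dataclass, field

@dataclass
class Element:          # XML-flavoured node: name, attributes, ordered children
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

@dataclass
class MapItem:          # JSON-flavoured object: unordered name/value entries
    entries: dict = field(default_factory=dict)

@dataclass
class ArrayItem:        # JSON-flavoured array: ordered members
    members: list = field(default_factory=list)

def descendants(item):
    """A single 'descendant-or-self' walk usable on either model."""
    yield item
    if isinstance(item, Element):
        kids = item.children
    elif isinstance(item, MapItem):
        kids = item.entries.values()
    elif isinstance(item, ArrayItem):
        kids = item.members
    else:                # atomic values (strings, numbers, ...) are leaves
        kids = []
    for kid in kids:
        yield from descendants(kid)

# The same query runs over XML-ish and JSON-ish data alike:
xml_doc = Element("book", children=[Element("title", children=["Balisage"])])
json_doc = MapItem({"book": MapItem({"title": "Balisage"})})

xml_strings = [i for i in descendants(xml_doc) if isinstance(i, str)]
json_strings = [i for i in descendants(json_doc) if isinstance(i, str)]
```

In a query language built on such a model, the analogue of `descendants` is defined once, on items, and both serializations come along for free — which is the appeal of the proposal.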

Changing the underlying data model for a technology is hard, of course, but it’s not impossible (SQL has done so, at least in some ways, and that’s one reason for its longevity). I think Rennau’s proposal merits serious discussion. It’s certainly one of the most far-reaching papers at this year’s conference.

Several talks address the relation of XML and non-XML notations for languages, and I’m looking forward to the discussions that that thread of the conference elicits. David Lee, now with MarkLogic, considers what life would be like if we marked up structure in programming languages the way we mark it up in documents. Norm Walsh continues the thread with a discussion of the general issue with particular reference to possible designs for a ‘compact syntax’ for XProc. Mark D. Flood, Matthew McCormick, and Nathan Palmer approach the problem complex from a different and enlightening angle, that of literate programming, in their case literate programming for the development of test cases for scientific function libraries. Mario Blažević offers the latest entry in the ongoing series of papers exploring how to do things with XML that were (in some form or other) part of SGML but were dropped when XML was designed. His paper shows how we might do SHORTREF in an XML context in a more general and more reliable way than was achieved when SHORTREF was bundled into SGML. And finally, Sam Wilmott opens the entire series of talks with a case study and general reflections on literate programming. I look forward to Wednesday at Balisage!
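For readers who never used SGML: SHORTREF let a document type map short character strings in content to markup, so that (for example) a blank line could stand for a paragraph boundary. A rough sketch of the general mechanism — my own illustration of the idea, not Blažević’s design — might look like this:

```python
# Toy illustration of the SHORTREF idea: short character sequences in
# the text are mapped to markup expansions before (or during) parsing.
# (A sketch of the general mechanism only, not Blažević's proposal.)

import re

def apply_shortrefs(text, shortrefs):
    """Replace each short-reference string with its markup expansion."""
    # Try longer short references first, so e.g. '---' would win over '--'.
    pattern = "|".join(re.escape(s) for s in
                       sorted(shortrefs, key=len, reverse=True))
    return re.sub(pattern, lambda m: shortrefs[m.group(0)], text)

shortrefs = {
    "\n\n": "</p>\n<p>",   # blank line -> paragraph boundary
    "--":   "&mdash;",     # double hyphen -> em-dash reference
}

source = "One thought--another.\n\nA new paragraph."
result = apply_shortrefs(source, shortrefs)
```

The hard parts, of course, are the ones this sketch ignores: scoping the maps to particular element contexts (SGML’s USEMAP) and doing the substitution reliably inside a real parser rather than as a textual preprocess.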

As is customary at Balisage, a few papers approach resolutely theoretical topics, either with or without overt practical applications. I’ll mention just a few: Hervé Ruellan of Canon discusses a long series of careful measurements of entropy in various data structures for XML; his paper feels in some ways like the theoretical underpinnings I wish the Efficient XML Interchange working group had had at the beginning of its work. Abel Braaksma describes the use of higher-order functions as a way to simplify XSLT stylesheet development. And Claus Huitfeldt, Fabio Vitali, and Silvio Peroni have produced a response to the paper presented in 2010 by Allen Renear and Karen Wickett of the University of Illinois claiming that documents (as we conventionally try to formalize them) do not exist. Huitfeldt and his co-authors explore the possibility of viewing documents as ‘timed abstract objects’.

Theory, practice, practice, and theory. I look forward to seeing you at Balisage.

XML for the Long Haul

[9 June 2010]

The preliminary program for the one-day symposium on XML for the Long Haul (2 August 2010 in Montréal) is up and on the Web. Actually, it’s been up for a while, but I’ve been very busy lately and have had no time for blogging. (I know it seems implausible, but it’s true.)

The preliminary program has a couple of slots left open for invited talks, which aren’t ready to be announced yet, but even in its current form it looks good (full disclosure: I’m chairing the symposium and made the final decisions on accepting papers). We have two reports from major archives (Portico [Sheila Morrissey and others] and PubMed Central [Jeff Beck]), which face many of the same problems but take somewhat different approaches to addressing them. We also have a retrospective report from a group of authors involved in a multi-year German project on the sustainability of linguistic resources [Georg Rehm and others]; the project has now wound down, and I am hoping that the authors will be able to give a useful summary of its results.

Josh Lubell of NIST will talk about the long-term preservation of product data; for certain kinds of products (think Los Alamos and Oak Ridge) the longevity of that information is really, really important to get right. And Quinn Dombrowski and Andrew Dombrowski of the University of Chicago shed an unexpected light on the problem of choosing archival data formats by applying Montague semantics (a very interesting method of assigning meaning to utterances, usually applied to natural-language semantics and thus a little unexpected in the markup-language context) to the problem.

And the day concludes with Liam Quin of W3C providing a very cogent high-level survey of long-term preservation as an intellectual problem, and drawing out some consequences of the issues raised.

As is usual at Balisage symposia, we have reserved ample time for discussion and for an open session (aka free-for-all) at the end of the day.

As you can see, the program examines the problem from a variety of perspectives and provides ample opportunity for people who might not often hear from each other to exchange views and learn from experience in other fields.

If you have any interest at all in long-term preservation of information (for whatever value of “long-term” makes sense in your context), you should plan to be in Montréal on 2 August. See you there!