International Digital Curation Conference (IDCC) 6, Chicago

[13 December 2010]

Spent the early part of last week in Chicago attending the 6th International Digital Curation Conference, co-sponsored by the U.K.-based Digital Curation Centre and the Graduate School of Library and Information Science (GSLIS) of the University of Illinois at Urbana-Champaign, “in partnership with” the Coalition for Networked Information (CNI).

I won’t try to summarize everything here, just mention some talks that caught my attention and have stuck in my mind.

The opening keynote by Chris Lintott described his experiences setting up the Galaxy Zoo, an interactive site that allows users to classify galaxies in the Sloan Digital Sky Survey by their form (clockwise spiral, counter-clockwise spiral, etc.). At the outset I was deeply skeptical, but he won me over with his anecdotes about some surprising results achieved by the mass mobilization of these citizen scientists, and with his insistence that if you want that kind of thing to work, you must treat the users who are helping you as full partners in the project: you must describe accurately what you are doing and how they are helping, and you must share the results with them, as members of the project team. The Galaxy Zoo has been such a success that they are now building an infrastructure, called the Zooniverse, for more projects of the same kind (essentially: projects where humans do better than software at recognizing patterns in the data, and where it’s thus useful to ask humans to do the pattern recognition on large datasets).

Those with projects that might usefully be crowd-sourced should give the Zooniverse a look; it might make it feasible to do things you otherwise could not manage. (I wonder if one could get better training data for natural-language parsers and part of speech taggers that way?)
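The core aggregation step behind this kind of crowd-sourcing is easy to sketch: many volunteers label the same item, and the majority label becomes the consensus classification. The labels below are purely hypothetical, and real projects use far more sophisticated weighting than a bare majority vote:

```python
# Toy sketch of consensus labeling in citizen science (hypothetical data).
# Several volunteers classify one galaxy image; we take the most common
# label as the consensus and the vote share as a rough confidence measure.
from collections import Counter

# volunteer labels for a single image (invented for illustration)
labels = ["spiral_cw", "spiral_cw", "elliptical", "spiral_cw", "spiral_ccw"]

consensus, votes = Counter(labels).most_common(1)[0]
agreement = votes / len(labels)  # fraction of volunteers who agree

print(consensus, round(agreement, 2))  # -> spiral_cw 0.6
```

The same scheme would apply to crowd-sourced part-of-speech tagging: collect several human tags per token, keep the tokens where agreement is high, and you have (noisy but cheap) training data.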

In the immediately following session, Antony Williams of the ChemSpider project described the somewhat less encouraging results of a survey of Web information about chemistry, from the point of view of a professional chemist who cares about accuracy (and in particular cares about stereochemical details). Barend Mons gave an optimistic account of how RDF can be used not just to describe Web pages but to summarize the information they contain, sentence by sentence, and of how the massive triple stores that result can be used to discover interesting new facts. It was all very exciting, but his examples made me wonder whether you can really reduce a twenty-five- or forty-word sentence to a triple without any loss of nuance. In the question session, Michael Lesk asked an anodyne question about Doug Lenat and the Cyc project, which made me think he was a little skeptical, too. But the speaker dodged the bullet (or tried to) by drawing firm lines of difference between his approach and Cyc. John Unsworth rounded out the session by describing the MONK (Metadata Offer New Knowledge) project and arguing that humanities data may resist correction and normalization more violently than scientific data, the idiosyncrasy of the data being part of the point. (As Jocelyn Penny Small of Rutgers once said in an introduction to databases for humanists, “Your job is not to use your computer database to clean up this mess. Your job is to use the computer and the database to preserve the mess in loving detail.”)
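The worry about nuance can be made concrete with a toy example. The sentence, vocabulary, and triple below are invented for illustration, and stand in for the kind of reduction Mons described (a triple store here is just a set of subject–predicate–object tuples):

```python
# Toy illustration of reducing a sentence to an RDF-style triple
# (hypothetical sentence and vocabulary; a real system would use URIs).
sentence = ("Gene BRCA1, when mutated, is strongly associated "
            "with an elevated risk of breast cancer.")

# The triple keeps the bare assertion...
triple = ("BRCA1", "associated_with", "breast_cancer")
store = {triple}

# ...and querying the store recovers the fact, but the qualifications
# in the prose ("when mutated", "strongly", "elevated risk") are gone.
matches = [t for t in store if t[0] == "BRCA1"]
print(matches)
```

The query succeeds, which is the promise of the approach; what it returns is strictly less than what the sentence said, which is the worry.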

Another session that sticks in my mind is one in which Kate Zwaard of the U.S. Government Printing Office spoke about the GPO’s design of FDsys, intended to be a trusted digital repository for (U.S. federal) government documents. In developing their metadata ontology, they worked backwards from the required outcomes to identify the metadata necessary to achieve those outcomes. It reminded me of the old practice of requirements tracing (in which every feature in a software design is either traced to some accepted requirement or dropped from the design as an unnecessary complication). In the same session, Michael Lesk talked about work he and others have done trying, with mixed success, to use explicit ontologies to help deal with problems of data integration — for example, recognizing all the questions in a database of opinion-survey questions that are relevant to a user query about exercise among the elderly. He didn’t have much of an identifiable thesis, but the way he approached the problems was almost worth the trip to Chicago by itself. I wish I could put my finger on what makes it so interesting to hear about what he’s done, but I’m not sure I can. He chooses interesting underlying questions; he finds ways to translate them into operational terms so you can measure your results, or at least check them empirically; he finds large datasets with realistic complexity to test things on; and he describes the results without special pleading or excuses. It’s just a pleasure to listen to him. The final speaker in the session was Huda Khan of Cornell, talking about the DataStaR project based there, in which they are using semantic-web technologies in the service of data curation. Her remarks were also astute and interesting; in particular, she mentioned that they are working with a tool called Gloze for mapping XML document instances into RDF; I have got to spend some time running that tool down and learning about it.
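The requirements-tracing discipline mentioned above can be sketched in a few lines. The outcomes and metadata elements below are hypothetical, not FDsys’s actual ontology; the point is only the rule that anything untraced gets dropped:

```python
# Toy sketch of requirements tracing for a metadata ontology
# (hypothetical outcomes and elements, not the real FDsys design).
# Rule: keep a metadata element only if it traces to a required outcome.
required_outcomes = {"authenticity", "discoverability", "version_control"}

# candidate elements mapped to the outcomes they claim to serve
candidates = {
    "digital_signature": {"authenticity"},
    "subject_keywords": {"discoverability"},
    "office_of_origin": {"authenticity", "discoverability"},
    "internal_ticket_id": set(),  # traces to no required outcome
}

kept = {name for name, outcomes in candidates.items()
        if outcomes & required_outcomes}
dropped = set(candidates) - kept

print(sorted(kept))
print(sorted(dropped))
```

Run in either direction — from requirements to features, or from features back to requirements — the check is the same set intersection; the untraced element is the “unnecessary complication” that gets cut.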

My own session was interesting, too, if I say so myself. Aaron Hsu of Indiana gave an energetic account of the difficulties involved in providing emulators for software, with particular attention to the challenges of Windows dynamic-link libraries and the special flavor of versioning hell they invite. I talked about the application of markup theory (specifically, of a particular view about the meaning of markup) to problems of preservation — more on that in another post — and Maria Esteva of the University of Texas at Austin talked about visualizations for heterogeneous electronic collections, in particular visualizations that help curators get an overview of a collection and its preservation status, so they know where it’s most urgent to focus their attention.

All in all, a good conference and one I’m glad I attended. Just two nagging questions remain in the back of my mind, which I record here for the benefit of conference organizers generally.

(1) Would it not be easier to consult the program on the Web if it were available in HTML instead of only in PDF? (Possibly not; possibly I am the only Web user left whose screen is not the same shape as a piece of A4 paper and who sometimes resizes windows to shapes different from the browser default.)

(2) How can a conference on the topic of long-term digital preservation choose to require that papers be submitted in a closed proprietary format (here, .doc files), instead of allowing, preferring, or requiring an open non-proprietary format? What does this say about the prospects for long-term preservation of the conference proceedings? (A voice in my ear whispers “Viable just as long as Microsoft thinks the current .doc format is commercially a good idea”; I think IDCC and other conferences on long-term preservation can and should do better than that.)