What constitutes successful format conversion?

Towards a formalization of ‘intellectual content’

4 December 2010

Overview

a view of the problem (from outside)
recent work in markup-language semantics
a proposition
applications of the idea
some complications
conclusions

The base assumption

Mike Lesk (Lesk 1992):

Reformatting, instead of being a last resort as material physically collapses, will be a common way of life in the digital age.

Two kinds of copying:

disk to disk
format to format

The problem

[Disclaimer: IANAPL ...]

How do we know a copy succeeded?
- For media conversions, compare input and output.
- For substantive conversions, ... what?
  - inspect the output?
  - test the output selectively?
  - wave our hands and hope for the best?
  Can we do better?

Improving the situation

We can do better, if we:

define intellectual content concretely;
specify how to check it operationally.

Assumptions / limitations:

digital information
SGML or XML markup*

Markup semantics

Recent* markup theory borrows notion from software semantics. Turski and Maibaum [1987] write:

Two points deserve special attention: we expect programs to be capable of expressing a meaning and we want to be able to compare meanings. Unless we are very careful, we may very soon be forced to consider an endless chain of questions: what is the meaning of ...? what is the meaning of the meaning of ...? etc. Without going into a philosophical discussion of issues certainly transgressing any reasonable interpretation of the title of this book, we shall accept that the meaning of A is the set of sentences S true because of A. The set S may also be called the set of consequences of A. Calling sentences of S consequences of A underscores the fact that there is an underlying logic which allows one to deduce that a sentence is a consequence of A.

Usually the A itself is a set of sentences, thus we are saying in fact that the meaning of a set of sentences is the set of consequences that are deducible from it.

Markup semantics (bis)

Let's focus a bit tighter:

... we shall accept that the meaning of A is the set of sentences S true because of A. The set S may also be called the set of consequences of A. Calling sentences of S consequences of A underscores the fact that there is an underlying logic which allows one to deduce that a sentence is a consequence of A.

Example: OAI (1)

For example, consider this OAI-PMH message:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
  http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-05-01T19:20:30Z</responseDate>
  <request verb="GetRecord" 
    identifier="oai:arXiv.org:hep-th/9901001"
    metadataPrefix="oai_dc"
    >http://an.oa.org/OAI-script</request> 
  <GetRecord>
    <record>
      ...
    </record>
  </GetRecord> 
</OAI-PMH>

Example: OAI (2)

What the spec says:

This verb is used to retrieve an individual metadata record from a repository. Required arguments specify the identifier of the item from which the record is requested and the format of the metadata that should be included in the record. Depending on the level at which a repository tracks deletions, a header with a "deleted" value for the status attribute may be returned, in case the metadata format specified by the metadataPrefix is no longer available from the repository or from the specified item.

Example: OAI (3)

So we know (can infer):

There was a request q, to which this XML document (r) is a response.
The request q raised no error and no exception.
The item requested (i) exists within the repository (a).
The metadata format requested exists (or did exist) for item i.
...

Example: OAI (4)

More formally:

request_verb(q, ‘GetRecord’)

∧ errorfree(q)

∧ item_isin_repository(i, a)

∧ repository_hasformat_foritem(a, format-id, i)

∧ ...

Some technical details

Inferences generated by instantiating sentence schemata (‘skeleton sentences’)
identifiers: generate, or do without?
infinite sets of inferences

Related work: XML to RDF, XML essence testing, Gloze, Planets tool, ...

Proposal

Take intellectual content (for preservation purposes) as the meaning of the document.

Take the meaning of the document as consisting of (a) the content (character sequences) of the document and (b) the inferences licensed by the markup in the document.

Corollary

document meanings comparable by mechanical means
... including pre- and post-conversion forms

Comparing source and result

To compare a document being transformed from a source vocabulary into a target vocabulary, make:

S: list of inferences in source
T: list of inferences in target
‘ST mapping rules’: crosswalk from source concepts to target concepts. E.g.

(s_olist(x) ∨ s_ulist(x) ∨ s_deflist(x)) ⇒ t_list(x)
‘TS mapping rules’: crosswalk from target concepts to source concepts. E.g.

t_list(x) ⇒ (s_olist(x) ∨ s_ulist(x) ∨ s_deflist(x))

Information loss

Checking for information loss: for each sentence in S (source inferences):

Does this follow from T?

Yes: all information in source preserved.

No: information lost.

Information gain (noise)

Checking for noise: for each sentence in T (target inferences):

Does this follow from S?

Yes: all information in target traceable to source.

No: information added; spurious?

Proposal, rephrased

The gold standard (or ideal goal) of format conversion is

noise-free
non-lossy

transformation from source format to target format.

Conversions can be tested empirically.

Applications

Possible applications for a transformation procedure:

prove procedure correct
prove correct for one particular document
testing during development

But wait — Are formal methods helpful?

But .. formal methods are so last century!

I remember when "proving software correct" was an interesting topic, but it has not shown itself to be of much use in the long run.

A fair analogy.

But faulty empirical observation.

What formal methods give us

sometimes, proofs about software or processes
tools for improving reliability
clear statement of the problem
better understanding of where proofs are and aren't feasible

Complications

There are some complications:

Not all digital objects are XML.
Not all information should be preserved.
- character-encoding
- HTTP-equivalent (and other metadata?)
- ...
Different levels of detail (refinement, abstraction).
Different ways of carving the world (sloppy overlap).

Morals we can draw

In designing target formats:

Allow sliding scale of generality.

(author, editor, translator, composer, arranger — but also creator)
Allow open-ended annotation (e.g. for problems).
Provide documentation.
Provide complete documentation.

Ongoing work

Sketch semantic descriptions of colloquial XML vocabularies.
Develop proof-of-concept implementation.

Conclusions

What does this mean?

a proposed operational definition
theoretical (conceptual) implications
possible practical applications
definite practical implications
- Without documentation, no proof possible.
- So: insist on documentation!