What constitutes successful format conversion?
Towards a formalization of ‘intellectual content’
4 December 2010
Overview
- a view of the problem (from outside)
- recent work in markup-language semantics
- a proposition
- applications of the idea
- some complications
- conclusions
The base assumption
Mike Lesk (Lesk 1992):
Reformatting, instead of being a last resort as material physically collapses, will be a common way of life in the digital age.
Two kinds of copying:
- disk to disk
- format to format
The problem
[Disclaimer: IANAPL ...]
- How do we know a copy succeeded?
- For media conversions, compare input and output.
- For substantive conversions, ... what?
- inspect the output?
- test the output selectively?
- wave our hands and hope for the best?
Can we do better?
Improving the situation
We can do better, if we:
- define intellectual content concretely;
- specify how to check it operationally.
Assumptions / limitations:
- digital information
- SGML or XML markup*
Markup semantics
Recent* markup theory borrows a notion from software semantics.
Turski and Maibaum [1987] write:
Two points deserve special attention: we expect programs to be capable of expressing a meaning and we want to be able to compare meanings. Unless we are very careful, we may very soon be forced to consider an endless chain of questions: what is the meaning of ...? what is the meaning of the meaning of ...? etc. Without going into a philosophical discussion of issues certainly transgressing any reasonable interpretation of the title of this book, we shall accept that the meaning of A is the set of sentences S true because of A. The set S may also be called the set of consequences of A. Calling sentences of S consequences of A underscores the fact that there is an underlying logic which allows one to deduce that a sentence is a consequence of A.
Usually the A itself is a set of sentences, thus we are saying in fact that the meaning of a set of sentences is the set of consequences that are deducible from it.
Markup semantics (bis)
Let's focus a bit tighter:
... we shall accept that the meaning of A is the set of sentences S true because of A. The set S may also be called the set of consequences of A. Calling sentences of S consequences of A underscores the fact that there is an underlying logic which allows one to deduce that a sentence is a consequence of A.
Example: OAI (1)
For example, consider this OAI-PMH message:
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
                             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-05-01T19:20:30Z</responseDate>
  <request verb="GetRecord"
           identifier="oai:arXiv.org:hep-th/9901001"
           metadataPrefix="oai_dc"
          >http://an.oa.org/OAI-script</request>
  <GetRecord>
    <record>
      ...
    </record>
  </GetRecord>
</OAI-PMH>
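The inferences drawn from this message start from facts that can be read off the markup mechanically. A minimal sketch, using Python's standard ElementTree (the response is abbreviated here; the elided record content is left out):

```python
# Read the GetRecord response and extract the facts the markup asserts.
# The XML is an abbreviated version of the OAI-PMH example above.
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

response = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-05-01T19:20:30Z</responseDate>
  <request verb="GetRecord"
           identifier="oai:arXiv.org:hep-th/9901001"
           metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request>
  <GetRecord><record/></GetRecord>
</OAI-PMH>"""

root = ET.fromstring(response)
request = root.find(OAI_NS + "request")
print(request.get("verb"))            # GetRecord
print(request.get("metadataPrefix"))  # oai_dc
```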
Example: OAI (2)
What the spec says:
This verb is used to retrieve an individual metadata record from a repository. Required arguments specify the identifier of the item from which the record is requested and the format of the metadata that should be included in the record. Depending on the level at which a repository tracks deletions, a header with a "deleted" value for the status attribute may be returned, in case the metadata format specified by the metadataPrefix is no longer available from the repository or from the specified item.
Example: OAI (3)
So we know (can infer):
- There was a request q, to which this XML document (r) is a response.
- The request q raised no error and no exception.
- The item requested (i) exists within the repository (a).
- The metadata format requested exists (or did exist) for item i.
- ...
Example: OAI (4)
More formally:
request_verb(q, ‘GetRecord’)
∧ errorfree(q)
∧ item_isin_repository(i, a)
∧ repository_hasformat_foritem(a, format-id, i)
∧ ...
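One hypothetical encoding of these sentences, so that sets of inferences can be stored and compared mechanically: each ground sentence becomes a tuple of a predicate name and its arguments. The identifiers q, i, and a are generated stand-ins (the next slide raises the identifier question); oai_dc is the format from the example.

```python
# Each instantiated skeleton sentence becomes a (predicate, args...) tuple.
inferences = {
    ("request_verb", "q", "GetRecord"),
    ("errorfree", "q"),
    ("item_isin_repository", "i", "a"),
    ("repository_hasformat_foritem", "a", "oai_dc", "i"),
}

# Membership tests are then mechanical.
assert ("errorfree", "q") in inferences
```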
Some technical details
- Inferences generated by instantiating sentence schemata (‘skeleton sentences’)
- identifiers: generate, or do without?
- infinite sets of inferences
Related work: XML to RDF, XML essence testing, Gloze, Planets tool, ...
Proposal
Take intellectual content (for preservation purposes) as the meaning of the document.
Take the meaning of the document as consisting of (a) the content (character sequences) of the document and (b) the inferences licensed by the markup in the document.
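A minimal sketch of what this definition could look like operationally (all names hypothetical): a document's meaning is a pair of character content and licensed inferences, and two such pairs can be compared mechanically.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Meaning:
    content: str                          # (a) character sequences
    inferences: frozenset = frozenset()   # (b) inferences licensed by markup

    def same_as(self, other):
        """Two meanings agree iff content and inferences both agree."""
        return (self.content == other.content
                and self.inferences == other.inferences)

pre  = Meaning("Hello, world", frozenset({("para", "p1")}))
post = Meaning("Hello, world", frozenset({("para", "p1")}))
assert pre.same_as(post)
```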
Corollary
- document meanings comparable by mechanical means
- ... including pre- and post-conversion forms
Comparing source and result
To compare a document being transformed from a source vocabulary into a target vocabulary, make:
- S: list of inferences in the source
- T: list of inferences in the target
- ‘ST mapping rules’: a crosswalk from source concepts to target concepts, e.g.
  (s_olist(x) ∨ s_ulist(x) ∨ s_deflist(x)) ⇒ t_list(x)
- ‘TS mapping rules’: a crosswalk from target concepts to source concepts, e.g.
  t_list(x) ⇒ (s_olist(x) ∨ s_ulist(x) ∨ s_deflist(x))
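A sketch of the ST crosswalk, under the simplifying assumption that each rule maps one source predicate to one target predicate; the disjunctive rule above then becomes three entries.

```python
# Hypothetical ST crosswalk: source predicate -> target predicate.
ST_RULES = {
    "s_olist": "t_list",
    "s_ulist": "t_list",
    "s_deflist": "t_list",
}

def translate(inferences, rules):
    """Rewrite each (predicate, args...) atom into the target vocabulary."""
    return {(rules.get(p, p), *args) for (p, *args) in inferences}

S = {("s_olist", "x1"), ("s_deflist", "x2")}
print(translate(S, ST_RULES))  # {('t_list', 'x1'), ('t_list', 'x2')}
```

Note that the TS direction is not a simple rewrite: t_list(x) licenses only a disjunction of source predicates, so checking that direction needs genuine entailment rather than one-for-one substitution.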
Information loss
Checking for information loss: for each sentence in S (the source inferences), ask: does it follow from T?
- If every sentence does: all information in the source is preserved.
- If some sentence does not: information has been lost.
Information gain (noise)
Checking for noise: for each sentence in T (the target inferences), ask: does it follow from S?
- If every sentence does: all information in the target is traceable to the source.
- If some sentence does not: information has been added; is it spurious?
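Both checks can be sketched together. Here entailment is approximated by set membership after applying the crosswalk; a real system would need a theorem prover, since mapping rules may be disjunctive.

```python
def check_conversion(S, T, st_translate):
    """Return (lost, noise): source atoms missing from the target,
    and target atoms not traceable to the source."""
    mapped = st_translate(S)
    lost = mapped - T     # source inferences with no target counterpart
    noise = T - mapped    # target inferences not derivable from the source
    return lost, noise

S = {("s_olist", "x1")}
T = {("t_list", "x1"), ("t_note", "x9")}
lost, noise = check_conversion(
    S, T, lambda s: {("t_list", a) for (_, a) in s})
print(lost)   # set(): nothing lost
print(noise)  # {('t_note', 'x9')}: added information, perhaps spurious
```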
Proposal, rephrased
The gold standard (or ideal goal) of format conversion is a transformation from source format to target format that loses no information and adds no noise.
Conversions can be tested empirically.
Applications
Possible applications for a transformation procedure:
- prove the procedure correct in general
- prove it correct for one particular document
- test it during development
But wait — Are formal methods helpful?
But ... formal methods are so last century!
I remember when "proving software correct" was an interesting topic, but it has not shown itself to be of much use in the long run.
A fair analogy, but a faulty empirical observation.
What formal methods give us
- sometimes, proofs about software or processes
- tools for improving reliability
- clear statement of the problem
- better understanding of where proofs are
and aren't feasible
Complications
There are some complications:
- Not all digital objects are XML.
- Not all information should be preserved.
- character encoding
- HTTP-equivalent (and other metadata?)
- ...
- Different levels of detail (refinement, abstraction).
- Different ways of carving the world (sloppy overlap).
Morals we can draw
In designing target formats:
Allow sliding scale of generality.
(author, editor, translator, composer, arranger
— but also creator)
Allow open-ended annotation (e.g. for problems).
Provide documentation.
Provide complete documentation.
Ongoing work
- Sketch semantic descriptions of colloquial XML vocabularies.
- Develop proof-of-concept implementation.
Conclusions
What does this mean?
- a proposed operational definition
- theoretical (conceptual) implications
- possible practical applications
- definite practical implications
- Without documentation, no proof possible.
- So: insist on documentation!