Sending Messages to the Future

Preservation Aspects of Data Representation

C. M. Sperberg-McQueen, Black Mesa Technologies

Summer XML 2009, 28 July 2009



Why care?

What do we want from digital resources?
  • Ease of use
  • Reusability
  • Sharing
  • Secondary analysis
  • Historical record
  • Permanence

Preservation as one-way messages

The future is another country

The future is a foreign country; they do things differently there.

- After L. P. Hartley

Interoperation across temporal boundaries

Interoperation across temporal boundaries is much like interoperation across geographic boundaries.
All the usual rules of device independence and application independence apply.

Operating with minimal feedback





A threat model

Failures of the expressive function

First failure point: sender ...
  • fails to project
  • fails to say what they mean to say (poor modeling)
  • loses or deletes the data

Failures of the conative function

Second point of failure: recipient ...
  • does not listen
  • listens but does not care
  • cares but does not understand
N.B. We do not know who the recipient is, in detail. Using the message to ask the recipient to do something is ... unreliable, and possibly dangerous. Stick to description of facts.

Phatic failures

Failures of the channel:
  • Bad media
  • Obsolete hardware
  • Obsolete software
  • Disappearing repositories?
  • Invisible repositories?

Emulation, migration, normalization

Which approach, when?

Do digital objects suffer generational degeneration?

Second-generation photocopies are less clear.
Ditto second-generation photographs, copies of drawings, manuscript transcriptions of texts.
Are digital objects free of this degeneration?

Yes, digital objects suffer generational degeneration

Straight copies are usually exempt.
Every format conversion involves a better or worse match; most are potentially lossy.
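A minimal sketch of the contrast, assuming a made-up record and a made-up target format that keeps only three decimal places (neither comes from the talk): a straight copy can be confirmed identical by checksum, while a conversion into the narrower format silently drops information.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical record; the coordinate values are illustrative only.
record = b"latitude=40.1105875;longitude=-88.2072697\n"

# A straight copy: the checksum confirms that nothing changed.
straight_copy = bytes(record)
copy_is_identical = digest(straight_copy) == digest(record)


def to_fixed_places(value: str, places: int = 3) -> str:
    """Simulate a target format that stores only a few decimal places."""
    return f"{float(value):.{places}f}"


# A format conversion: parse the fields, then rewrite them in the target format.
fields = dict(pair.split("=") for pair in record.decode("ascii").strip().split(";"))
migrated = {name: to_fixed_places(value) for name, value in fields.items()}
conversion_is_lossy = migrated != fields
```

Nothing in the migrated record signals that precision was lost; only a comparison against the source reveals it.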

Metalinguistic failures

Failures in the code:
  • Character set problems
  • Data format problems
    • proprietary formats
    • ad-hoc (homebrew) formats
    • unconventional use of public formats
    • semantic failures
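As one illustration of a character set problem, the sketch below (with an arbitrary sample word, not taken from the talk) stores text as UTF-8 and then reads it back as Latin-1. No error is raised, which is what makes this kind of failure easy to miss.

```python
# Arbitrary sample text containing one non-ASCII character.
text = "Überlieferung"

# The sender writes UTF-8 bytes but does not record the encoding.
stored = text.encode("utf-8")

# The recipient guesses Latin-1. Every byte is a legal Latin-1 character,
# so decoding succeeds and returns garbled text instead of failing.
misread = stored.decode("latin-1")

# The damage is reversible only if the wrong guess is known.
recovered = misread.encode("latin-1").decode("utf-8")
```

The recovery step works here only because the wrong guess is known, which argues for recording the encoding alongside the data.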

Safeguarding semantics

Semantic failures

A rich source of problems.
  • lightning bugs
  • failure to grasp full meaning and import
  • failure to understand what the fields / elements / attributes / records mean
  • accumulating errors of translation

When automation turns bad

  <dcvalue element="contributor" qualifier="none"
    >Scanning, indexing, and description 
    sponsored by the Illinois State Library and
    the University of Illinois at Urbana-Champaign 
    Library. Geo-referencing sponsored and 
    performed by the Geographic Modeling Systems 
    Laboratory, University of Illinois at 
  <dcvalue element="contributor" 
    >United States. 
    Agricultural Adjustment Agency.</dcvalue>
  <dcvalue element="contributor" 
    >Aerial Photographs</dcvalue>
There appears to be an error here (“Aerial Photographs” is apparently not the corporate or individual name of an author of this photograph).
Can this be avoided?

How did this happen?

Dubin explains:
An obvious interpretation is that this is simple tag abuse or human error, but the history of this description reveals it to be an example of a more general and complicated problem. This is the latest in a series of descriptions each derived from an earlier version:
  1. A paper description accompanied the original photograph, which had been taken in 1938.
  2. In 1998 the photograph was scanned for inclusion in an image database made available on the web [grainger99]. A metadata record for the photograph was entered into a relational database. The fields for that database were derived from the FGDC [Federal Geographic Data Committee] Content Standard for Digital Geospatial Metadata [fgdc98].
  3. In May of 2005 an OAI 2.0 metadata record was derived from that database entry, via a mapping from the database fields into Dublin Core.
  4. Several months later the OAI record was transformed via XSLT [XSL Transformations] into a form suitable for ingestion into a DSpace installation.
  5. When the record was exported from DSpace, additional DC [Dublin Core] metadata statements had been automatically added.

Verifying emulation, migration, normalization

How do you know when an emulator is working correctly?
How do you know when a migration program has produced a correct result? How do you detect errors in migration?

Methods of verification and quality control

Naked human eyeballs.
Automated processes:
  • validation
  • supravalidation
  • false-color proofs
Semi-automated human intervention (padded cells).
The 1, 10, 100 practice.
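To make the validation step concrete, here is a rough sketch using Python's standard XML parser. The `dublin_core` wrapper element, the `ALLOWED_ELEMENTS` set, and the `validate` function are assumptions made for illustration, not a real schema or a real tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of permitted element names; a real profile would differ.
ALLOWED_ELEMENTS = {"contributor", "title", "date", "subject"}


def validate(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    problems = []
    for node in root.iter("dcvalue"):
        name = node.get("element")
        if name not in ALLOWED_ELEMENTS:
            problems.append(f"unexpected element name: {name!r}")
        if not (node.text or "").strip():
            problems.append(f"empty value for {name!r}")
    return problems


sample = """<dublin_core>
  <dcvalue element="contributor">United States. Agricultural Adjustment Agency.</dcvalue>
  <dcvalue element="contributor">Aerial Photographs</dcvalue>
</dublin_core>"""
truncated = '<dublin_core><dcvalue element="contributor">Scanning'

sample_problems = validate(sample)        # passes every check
truncated_problems = validate(truncated)  # caught: not well-formed
```

The sample passes even though "Aerial Photographs" is not a contributor: checks of this kind catch structural damage but not semantic errors.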

Formal methods of verification?

Identifying the meaning of markup

A simple example

From a formalization of the OAI-PMH vocabulary:
(∃ q : OAI-request) (∃ r : OAI-response) (∃ s : OAI-server) (∃ t : moment)
( q = (℩ q : OAI-request)(models({ ./oai:request }, q))
    ∧ s = (℩ s : OAI-server)(uri_server({ string(./oai:request) }, s))
    ∧ t = (℩ t : moment)(xsd_lv(xsd:dateTime, { string(./oai:responseDate) }, t))
    ∧ (∀ x)(uri_server(x, s) ⇒ x = { string(./oai:request) })
    ∧ { . } = r
    ∧ served_response(q,s,t,r))

(∃ q : OAI-request) (models({ preceding-sibling::oai:request }, q)
    ∧ invalid(q)
    ∧ request_error(q, { string(@code) })
    ∧ ({ string(.) } ≠ "" ⇒ error_nldesc(q, { string(.) }) ) )

(∃ q : OAI-request) (∃ s : OAI-server) (∃ d : string) (∃ i : OAI-item) (∃ p : string)
( q = (℩ q : OAI-request)(req_resp(q, { ancestor::oai:OAI-PMH }))
    ∧ s = (℩ s : OAI-server)(resp_server({ ancestor::oai:OAI-PMH }, s))
    ∧ d = (℩ d : string)(request_identifier(q, d))
    ∧ i = (℩ i : OAI-item)(item_id(i, d))
    ∧ p = (℩ p : string)(request_metadataPrefix(q, p))
    ∧ request_verb(q, "GetRecord")
    ∧ errorfree(q)
    ∧ isin_repository_item(s, i)
    ∧ hasformat_repository_item_format(s, i, p) )
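The braces in these formulas mark XPath expressions whose values are filled in from the document instance. As a rough sketch of that filling-in step, the code below pulls the values for `./oai:request` and `./oai:responseDate` out of a made-up response (the URL, identifier, and date are invented) and assembles a tuple that loosely mirrors `served_response(q,s,t,r)`. The tuple layout is an assumption for illustration, not the formalization itself.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace of OAI-PMH 2.0 responses.
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented response; every value in it is illustrative.
response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2005-05-17T12:00:00Z</responseDate>
  <request verb="GetRecord" identifier="oai:example.org:item-1"
           metadataPrefix="oai_dc">http://example.org/oai</request>
</OAI-PMH>"""

root = ET.fromstring(response)

# { ./oai:request } and { string(./oai:request) }
request = root.find("oai:request", NS)
server_uri = request.text.strip()

# { string(./oai:responseDate) }, read as a moment in time
moment = datetime.strptime(
    root.findtext("oai:responseDate", namespaces=NS), "%Y-%m-%dT%H:%M:%SZ"
)

# A hypothetical fact tuple standing in for served_response(q, s, t, r).
fact = ("served_response", dict(request.attrib), server_uri, moment, "this response")
```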

If you can't prevent degeneration, detect it

Before: ... blah blah blah ...
After: ... blah blah
(∃ p)((person(p) ∨ organization(p)) ∧ name(p,“Aerial photographs”))
blah ...

Is there a reason these are different?
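One way to sketch this comparison: derive the set of statements each version of a record licenses, then look at what was gained and lost in migration. The field names, the interpretation rules, and the source record below are hypothetical, since the talk does not say which source field held "Aerial Photographs".

```python
def statements(record, rules):
    """Return the set of (predicate, value) facts that the rules license."""
    return {(rules[field], value) for field, value in record if field in rules}


# Hypothetical interpretation rules: which field implies which kind of fact.
SOURCE_RULES = {"originator": "agent_name", "theme_keyword": "subject_term"}
DC_RULES = {"contributor": "agent_name", "subject": "subject_term"}

# Hypothetical source record and its migrated counterpart.
source_record = [
    ("originator", "United States. Agricultural Adjustment Agency."),
    ("theme_keyword", "Aerial Photographs"),
]
migrated_record = [
    ("contributor", "United States. Agricultural Adjustment Agency."),
    ("contributor", "Aerial Photographs"),
]

before = statements(source_record, SOURCE_RULES)
after = statements(migrated_record, DC_RULES)

gained = after - before  # facts the migration introduced
lost = before - after    # facts the migration dropped
```

A non-empty `gained` or `lost` set does not prove an error, but it identifies exactly which statements a human reviewer should look at.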


Preserving semantics

  • Think about what you wish to say.
  • Design the vocabulary (the format) carefully, to make instances easy to understand.
  • Document the vocabulary and your usage.
  • Avoid (undocumented) tag abuse.
  • Provide and document ancillary materials.
  • Validate early and often.
  • Verify early and often.