<!DOCTYPE TEI.2 PUBLIC '-//C. M. Sperberg-McQueen//DTD
          TEI Lite 1.0 plus SWeb (XML)//EN'
          '../../../lib/swebxml.dtd' [
<!ATTLIST list type CDATA 'bullets' >
<!ATTLIST seg  rend CDATA 'incremental' >
<!ATTLIST xref href CDATA '' >

<!ATTLIST div id ID #IMPLIED >
<!ATTLIST item id ID #IMPLIED >

<!ENTITY date.last.touched '30 December 2010'>

<!ENTITY ntilde  "&#241;" ><!-- small n, tilde -->

]>
<?xml-stylesheet type="text/xsl" href="../../../lib/bmtdocs.xsl"?> 
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title type="main">Thutmose II</title>
<title type="sub">MARCXML to TEI Header translation</title>
</titleStmt>
<publicationStmt>
<pubPlace>Espa&ntilde;ola, New Mexico</pubPlace>
<publisher>Black Mesa Technologies LLC</publisher>
<date>2009</date>
</publicationStmt>
<sourceDesc>
<p>No source; created in electronic form.</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<front>
<titlePage>
<docTitle>
<titlePart>Thutmose II</titlePart>
<titlePart>MARCXML to TEI header translation</titlePart>
</docTitle>
<docAuthor>C. M. Sperberg-McQueen, Black Mesa Technologies LLC</docAuthor>
<docDate>21 December 2010</docDate>
<docDate>rev. &date.last.touched;</docDate>
</titlePage>

<div id="navbar" type="navbar">
<head>Nearby documents</head>
<list>
<item><xref href="online.html">Online interface to Thutmose II</xref></item>
<item><xref href="progdoc.xml">Programmers' documentation</xref></item>
<item id="siteroot"><xref href="../../..">Home</xref></item>
</list>
</div>
</front>
<body>
<p>Thutmose II translates MARCXML records into TEI headers.  The
<q>II</q> indicates that this is the second
version of the program; the first version
(<xref href="../../../2009/10/thutmose/L1/userdoc.xml">Thutmose I</xref>) 
is also available.
When complete (which should be any day now), Thutmose II 
will handle all of the fields described in the 
TEI Library SIG's best practices document.</p>

<div id="usage">
<head>Usage</head>
<p>This document assumes you have an XSLT 1.0 processor and know how
to invoke it on your MARCXML data, and how to pass parameters to
the stylesheet processor.</p>
<div id="parms">
<head>Run-time parameters</head>
<p>The most important parameters are these:

<list type="glossary">

<label><code>result</code></label>
<item>Specifies which result element to use for each MARC record in the
input:<list>
<item><code>TEI</code> means to produce a complete <soCalled>tadpole</soCalled>
TEI document, consisting of a header and a rudimentary text structure.</item>
<item><code>teiHeader</code> means to produce just the header, without
the TEI wrapper element and rudimentary text.</item>
</list>
If the input contains more than one MARC record, the set of resulting
elements (whether TEI elements or teiHeader elements) will be wrapped
in a <ident>teiCorpus</ident> element regardless of the value of 
<code>result</code>.
</item>

<label><code>config-file</code></label>
<item>Supplies the URI (or local file name) of an XML document
with <list>
<item>configuration information specifying how information
from the MARC record is to be mapped into the TEI header</item>
<item>user-supplied text to be inserted at appropriate places in
the output.</item>
</list>
For example, the MARC record is assumed to describe the
source being digitized, not the TEI document itself.
The publication statement for the
TEI document will typically identify the project which is
creating the TEI document, so appropriate XML for it should be provided in
the configuration file.
The parameter defaults to <q><code>thutmose.config.xml</code></q>.
</item>

<!--*
<label><code>marcitem</code></label>
<item>Specifies whether the MARC record in the input describes 
the source of the TEI document, or the TEI document itself.
<list>
<item><code>source</code> means the MARC record describes
the source, not the TEI document itself.  
</item>
<item><code>tei-from-source</code> means the MARC record describes
the TEI document itself, and was prepared with the source of 
the TEI document in hand.
</item>
<item><code>tei-from-digital</code> means the MARC record describes
the TEI document itself, and was prepared without the source of 
the TEI document in hand, only the digital object (page scans, etc.) 
itself.
</item>
</list>
The <code>marcitem</code> determines whether certain information in
the MARC record (for example, the edition statement) is copied
into the TEI header as a child of <ident>fileDesc</ident> or
as a descendant of <ident>sourceDesc</ident>.
</item>
*-->

</list>
</p>
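<p>As a rough illustration of the wrapping rule described under
<code>result</code>, the output for an input containing two MARC
records, with <code>result</code> set to <code>teiHeader</code>,
might have the following overall shape.  This is a sketch only:  the
comments stand in for generated content, and namespace declarations
and attributes are omitted because they are not specified here.
<eg><![CDATA[
<teiCorpus>
  <!-- header generated from the first MARC record -->
  <teiHeader>
    <!-- fileDesc and other header content -->
  </teiHeader>
  <!-- header generated from the second MARC record -->
  <teiHeader>
    <!-- fileDesc and other header content -->
  </teiHeader>
</teiCorpus>
]]></eg>
With <code>result</code> set to <code>TEI</code>, each
<ident>teiHeader</ident> above would instead be wrapped, together with
its rudimentary text, in a <ident>TEI</ident> element of its own.</p>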
</div>
<div id="alsoparms">
<head>Additional parameters</head>
<p>Additional parameters include:
<list type="glossary">

<label><code>idno</code></label>
<item>Use the value specified both as the value of the <att>id</att>
attribute on the <gi>TEI</gi> element (if one is generated) and
as the value of an <gi>idno</gi> element in the publication 
statement.</item>

<label><code>year-of-publication</code></label>
<item>The year in which the TEI document is being published.
If not supplied as a parameter, this is taken from the configuration
file.  If not present there, Thutmose II tries to guess; sometimes
it may guess right.</item>

<label><code>trace-source</code></label>
<item>Specifies (if the value is <code>1</code> or <code>true</code>)
that data from the MARC record should be wrapped in TEI 
<gi>seg</gi> elements with <att>type</att> attributes
indicating which MARC field and subfield the data came from.
Useful for debugging; probably clutters the output too
much for normal use.</item>

<!--*
<label><code>beautification</code></label>
<item>Specifies whether trailing punctuation should be stripped
from certain values before the values are copied to the 
output.  <list>
<item><code>1</code> means yes, attempt to strip the non-significant
punctuation.</item>
<item><code>0</code> means no, do not attempt to strip non-significant
punctuation, just copy things as they are.</item>
</list>
Because MARC records vary so much in their punctuation practice,
and because sometimes the trailing punctuation really does belong
to the value (e.g. a title ending in the word <q>etc.</q>), 
value beautification is at best an approximation:  sometimes
it strips characters that shouldn't be stripped, and sometimes
it leaves characters that should be stripped.
Turning it off, on the other hand, will leave a large number of extraneous commas, 
slashes, and full stops in personal names
and work titles.  Either way, ideally the header should be
cleaned up by hand.</item>
*-->

<!--*
<label><code>sourceDesc</code></label>
<item>Specifies how the source description will be tagged.
Known values are:<list>
<item><code>biblFull</code></item>
</list>
Later, <code>biblStruct</code> and possibly other values
will also be supported.
</item>
*-->
</list>
</p>
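<p>One processor-independent way to fix parameter values for a project
is a small wrapper stylesheet which imports <code>mt2.xsl</code> and
redeclares the parameters; in XSLT 1.0, a declaration in the importing
stylesheet takes precedence over a declaration of the same name in the
imported one.  The sketch below assumes (without having checked the
stylesheet) that the parameters described above are declared in
<code>mt2.xsl</code> as top-level <gi>xsl:param</gi> elements with
exactly these names; the file names <code>wrapper.xsl</code> and
<code>project.config.xml</code> are hypothetical.
<eg><![CDATA[
<!-- wrapper.xsl (hypothetical name): project-specific settings -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- xsl:import must come before any other top-level element -->
  <xsl:import href="mt2.xsl"/>

  <!-- produce bare headers rather than complete TEI documents -->
  <xsl:param name="result" select="'teiHeader'"/>

  <!-- placeholder: substitute your own configuration file -->
  <xsl:param name="config-file" select="'project.config.xml'"/>

  <!-- while debugging, mark where each piece of MARC data came from -->
  <xsl:param name="trace-source" select="'1'"/>

</xsl:stylesheet>
]]></eg>
Running the wrapper in place of <code>mt2.xsl</code> applies these
values as defaults; a value passed to the processor at run time still
overrides them.  How a relative file name in <code>config-file</code>
is resolved depends on how the stylesheet loads it, so an absolute URI
may be the safer choice.</p>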
</div>
</div>

<div id="getit">
<head>Getting Thutmose II</head>
<p>Just download what you need:<list>
<item><xref href="mt2.xsl">mt2.xsl</xref>: the stylesheet itself</item>
<item><xref href="thutmose.config.xml">thutmose.config.xml</xref>: the 
default configuration file</item>
<item><xref href="userdoc.xml">userdoc.xml</xref>: this document (in TEI Lite)</item>
<item><xref href="progdoc.xml">progdoc.xml</xref>: a document aimed at people who 
want to modify or maintain the stylesheet (in TEI Lite)</item>
</list>
</p>
<p>
Many XSLT processors will be able to retrieve the stylesheets from the
Web, so you may not need to download them at all.  But if you're
going to be using Thutmose a lot, it's probably better network
citizenship to download a copy and check here periodically for
updates.
</p>
</div>

<div id="plans">
<head>Plans for the future</head>
<p>This is the second version of Thutmose.</p>

<p>Follow-on projects are expected to:<list>
<item>improve the data beautification options</item>
<item>provide some record of changes made in the course of 
beautification, so that it can be reviewed and the data can 
be edited if necessary; changes will be recorded either 
inline (in processing instructions) or in a log emitted as
a side effect of transformation</item>
<item>map from TEI headers to MARC (currently this has lower priority)</item>
</list>
</p>
</div>
<div id="gaps">
<head>Known gaps, bugs, and shortcomings</head>

<p>In the default configuration, Thutmose II handles, by design, only
the mappings specified in the Best Practices document, and only those MARC
21 fields that have not been withdrawn, marked OBSOLETE, or otherwise deprecated.</p>
<p>If your MARC records contain fields no longer present in MARC
21 (for example, 440 fields for series titles), you can add mappings
for them by customizing the configuration file.</p>
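<p>For readers unsure what such a legacy field looks like, here is a
440 field as it might appear in a MARCXML record.  The series title
and volume number are invented for illustration.  The syntax for
adding a mapping is not shown, since it is defined by the
configuration file itself.
<eg><![CDATA[
<datafield xmlns="http://www.loc.gov/MARC21/slim"
           tag="440" ind1=" " ind2="0">
  <!-- $a: series title (made-up sample data) -->
  <subfield code="a">Studies in imaginary bibliography ;</subfield>
  <!-- $v: volume or sequential designation -->
  <subfield code="v">v. 3</subfield>
</datafield>
]]></eg>
</p>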

<div id="thutmose-II">
  <head>Features planned for Thutmose II</head>
  <p>All of the following features are expected to be
  part of Thutmose II.  (That is, I don't plan to declare
  Thutmose II complete until these are all done.  The
  reader may think of this as a to-do list for the
  programmer.)</p> 
  <p><list>
    <item>
      <p>
	The <gi>respStmt</gi> elements being generated by MARC fields 100 and 700
	are, to put it gently, sub-optimal.  That is to say, they are very very bad.
      </p>
      <p>
	<hi>(23 December 2010:  this should be fixed in the medium-term future.)</hi>
      </p>
    </item>
    <item>
      <p>
	MARC fields 100 and 700 should tag their contents (or: the relevant
	parts of their contents)
	as personal names.
      </p>
      <p>
	<hi>(30 December 2010:  this should be fixed in the medium-term future.)</hi>
      </p>
    </item>
    <item>
      <p>
	MARC fields 110 and 720 generate <gi>author</gi>, 
	<gi>editor</gi>, or <gi>respStmt</gi> elements
	as appropriate.  Their content should also be
	tagged as a corporate name.
      </p>
      <p>
	<hi>(21 December 2010:  this should be fixed in the medium-term future.)</hi>
      </p>
    </item>
    <item>
      <p>
	MARC fields 050-099 should generate <gi>idno</gi>, 
	<gi>classDecl</gi>, <gi>taxonomy</gi>, and <gi>classCode</gi> elements
	as appropriate.  This requires some close reading of the MARC 21 documentation
	which has not yet been done; currently
	these MARC fields are ignored.
      </p>
      <p>
	<hi>(21 December 2010:  this should be fixed in the medium-term future.)</hi>
      </p>
    </item>
    <!--*
    <item>
      <p>
	MARC 5xx fields (notes) should generate a <gi>notesStmt</gi>; currently
	they are ignored.
      </p>
      <p>
	<hi>(21 December 2010:  this should be fixed in the near future.)</hi>
	Fixed 30 December 2010.
      </p>
    </item>
    *-->
    <!--* 
	<item>On some of the test records, MARC fields 100 and 700
	seem to be generating multiple <gi>author</gi>, 
	<gi>editor</gi>, and <gi>respStmt</gi> elements;
	this is probably an error in the stylesheet.
	<hi>(23 December 2010:  this should be fixed in the medium-term future.)</hi>
	</item>
    *-->

    <!--* 
	<item>Additional parameters are needed to provide information
	for the initial entry in the revision history.
	<hi>(21 December 2010:  this should be fixed in the near future.)</hi>
	</item>
	*-->
    
    
    <!--* 
	<item>[Done 22 Dec 2010]
	The various forms of title given in the MARC record
	all generate <gi>title</gi> elements; the <att>type</att> attribute
	should be given appropriate values, as described in the best-practices
	document; currently this does not happen.
	<hi>(21 December 2010:  this should be fixed in the medium-term future.)</hi>
	</item>
	*-->
    
    
  </list>
  </p>
</div>

<div id="other-gaps">
  <head>Other</head>
  <p>The following items all describe features and functionality that
  will not be part of the first complete version of Thutmose II; they
  may be added as and when I have time and inspiration.</p>
  <p>
    <list>

      <item>
	<p>
	  The user interface to the online demo version of Thutmose
	  should be refined to make it easier to see what is going on.
	  One way to do this would be to provide both a read-only 
	  view (with formatting and syntax coloring) and an editable
	  view using text widgets (as in the current interface), with
	  buttons to switch between views.
	</p>
      </item>

      <item>
	<p>
	  MARC fields 100 and 700 generate <gi>author</gi>, 
	  <gi>editor</gi>, or <gi>respStmt</gi> elements
	  as appropriate.  Some subfields, however, don't belong in
	  those elements and should be filtered out.
	  User options should be available (through the configuration
	  file) to control this filtering.
	</p>
      </item>
      
      <item>
	<p>
	  The <soCalled>beautification</soCalled> (punctuation-stripping) routines
	  which are part of Thutmose I are not yet integrated into Thutmose II.
	</p>
      </item>
      
      <!--*
	  <item>It's boring to have just one sample MARC record, in the
	  Web interface to Thutmose II.
	  It would be nicer to have a button which selects a MARC
	  record at random from some collection of MARC records, and uses
	  it to populate the input document.
	  <hi>(21 December 2010:  this should be fixed in the near future;
	  for browser security reasons, we can't just point at some public
	  source of MARC XML documents, even if we could find one on the Web;
	  instead, we'll put five or ten MARC records on this server and
	  add a button to choose one of them at random.)</hi>
	  [Fixed 23 Dec 2010]
	  </item>
	  *-->
      
    </list>
  </p>
</div>

<!--* 
<p><list>
<item>The punctuation-stripping routine used when the beautify option is chosen 
does not always do the right thing.  In the data used for testing, it is
not always clear what the right thing is.</item>
<item>The <code>tei-from-source</code> and <code>tei-from-digital</code>
options are not yet fully worked out.</item>
<item>Fields currently mapped to appropriate elements in the TEI header
include: <list>
<item>245 (title)</item>
<item>100, 110, 111 (various forms of author)</item>
<item>250 $a and $b (editionStmt)</item>
<item>300 $a, $b, and $c (extent)</item>
<item>260 $a, $b, $c (publicationStmt)</item>
<item>440, 490, and 830 $a (seriesStmt)</item>
<item>500, 546 (notesStmt)</item>
<item>600, 610, 611, 650, 651, 655 (profileDesc/textClass/keywords);
currently assumes without checking that values are from LCSH</item>
</list>
Not yet included:
<list>
<item>040 $b (teiHeader/@lang)</item>
<item>130, 240, 246 (other forms of title)</item>
<item>533 and 534 various subfields (e.g. $a author of source, $t title, 
$b edition, $e extent, $c publication statement, etc.)</item>
<item>700, 710, 711 (added tracings for author, editor, other responsibles)</item>
<item>500 (respStmt, editorialDecl, projectDesc, ...)</item>
<item>other 5XX fields</item>
<item>028 5_, 099, 766 $w idno</item>
<item>050-099 classDecl</item>
<item>6XX second indicator (classDecl/taxonomy)</item>
<item>6xx _7 $2 (classDecl/taxonomy)</item>
<item>008/35-37 langUsage</item>
<item>041, 546 langUsage/language</item>
</list>
</item>
</list>
</p>
*-->



</div>
</body>
</text>
</TEI.2>
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-default-dtd-file:"/Library/SGML/Public/Emacs/sweb.ced"
sgml-omittag:t
sgml-shorttag:t
End:
-->

