XML Summer School

Open Source XML Applications
21 September 2009

Bodleian Library

XML Tools and Applications with FLOSS

C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://www.blackmesatech.com/2009/09/flossxml

Abstract

From the very beginnings of XML, open source parsers and processors have been freely available to the XML development community. Over the last ten years, building on these basic foundations, a huge library of free/libre open source software (FLOSS) has grown up to support and capitalise on XML open standards. This class explores some of those XML tools and applications, covering both the essential and the esoteric, and shows how they can make a real impact for you and your organisation.

Overview

Organization of the presentation

Free/libre open-source software covers almost all of IT.
So does XML.
This is thus a 90-minute course in what kinds of software exist.
Needless to say, it is not complete.
From time to time, we will pause for a demo.

Who's here? (1)

So it's clearer where to linger, who ...
  • writes programs for a living?
  • writes programs from time to time?
  • has learned a programming language?
  • has other people to deal with that kind of thing?

Who's here? (2)

How many of you work / have an interest in ...
  • SQL databases and their ilk?
  • Word? other office-automation software?
  • publishing? back-of-the-book indexing?
  • Web site design?

Road map: processes

Things we do:
  • input: create XML
  • process: update, transform, change, mangle, munge, mince, macerate, puree, aggregate, mix, combine, transmit, interchange, ... our XML
  • store: save, query, extract, retrieve, manage XML
  • output: display, format, print, deliver XML
Some software specializes, some generalizes.

Road map: agents

Who does it / whom is it for?
  • humans
  • software
This sometimes matters a lot, sometimes very little.

Road map: orientation

How general is it?
  • vertical: specific industries or application areas (e.g. ‘pig-farming markup language’)
  • horizontal: no specific area, universal importance (typically infrastructure)
  • diagonal: not exactly horizontal, not exactly vertical. E.g.
    • office documents
    • slide shows
    • images, graphics
    • Web application tools
    • software development tools

What counts as open source?

You tell me.
  • included if clearly open source
  • excluded if clearly not
  • In the large gray area I have chosen to be arbitrary, inconsistent, and capricious. When the details matter to you, check the license.

Other preliminaries

Input

Document creation and validation.
  • editors
  • parsers, validators
  • conversion tools
  • XForms
  • office documents

Editors 0: overview

  • programmers' editors extended for XML
  • HTML editors extended for XML
  • GUI XML editors
  • IDEs
What most people choose ...

Editors 1: text-based XML editors

Mostly these are programmers' editors extended to handle XML.
  • jEdit, mature programmers' text editor (popular as Eclipse plugin), in Java
  • nxml.el (James Clark) a major mode for Emacs (GNU Emacs only)
  • psgml.el (Lennart Staflin et al.) an Emacs major mode for SGML and XML
  • Rinzo XML editor Eclipse plug-in, close integration with Java
  • vim does XML syntax coloring automatically nowadays (scripts on www.vim.org can help, too)
  • XED (Henry S. Thompson) XML instance editor; works very hard to make it impossible to make ill-formed documents

Editors 2: HTML/XML editors

Mostly these are HTML editors extended to handle XML.
  • Amaya (INRIA, W3C) — predominantly a “Web editor”, but extended to support XML. Generic XML editing is “still experimental” (29 Feb. 2008)
  • Quanta Plus
  • Screem (web development environment with XML-capable editor)

Editors 3: GUI XML editors

N.B. “GUI” ≠ “WYSIWYG”!
  • BXE (Liip AG): browser-based WYSIWYG XML editor (currently Mozilla only)
  • Jaxe, configurable WYSIWYM editor in Java; also available in applet form as WebJaxe
  • MlView (Gnome) source and tree views, validation, ...
  • Pollo XML editor in Java; heavy emphasis on tree structure (best for data-oriented XML?), proud of its tree widget
  • Serna Free XML Editor (Syntext) open version of commercial product
  • Vex (John Krasnay et al.) visual editor for XML, in Java
  • Xerlin (Exari?) Opensource Extensible XML Modeling Application (based on Merlot)
  • XML Copy Editor (G. N. Schmidt), syntax coloring etc., but mostly text-oriented
N.B. boundary to following list is hazy.

Editors 4: XML IDEs

Interactive development environments, mostly aimed at programmers. Boundary to preceding list hazy.
  • Butterfly XML error highlighting, incremental parsing, auto-completion, XSLT pipelines, etc.
  • Cooktop (Victor Pavlov) editor and development environment; Windows only. Freeware (as in beer).
  • OrangevoltXSLT (XSLT development environment, Eclipse plug-in)
  • XCarecrows 4 XML (Cogenit) Eclipse plug-in (XML, XSD, XSLT editor, graphic tree comparator, schema validator, XSLT transformation tool kit)
  • XPontus (Yves Zoundi), text-oriented, with validation, transforms, DTD generation, etc.
Tony Graham maintains a page on XSLT testing.

Demo: jEdit

Pause here for brief demo of jEdit.
Note:
  • syntax coloring
  • tree view
  • context-sensitive element insertion
  • limits of syntax awareness
  • treatment of entities

Demo: Serna Free XML editor

Pause here for brief demo of Serna.
Note:
  • WYSIWYG display (XSL FO)
  • consequent setup overhead
  • styling
  • tree view
  • context-sensitive element insertion
  • treatment of entities

Parsers

For software, essentially several classes of parsers / several interfaces:
  • SAX Simple API for XML
  • StAX Streaming API for XML
  • DOM Document Object Model
  • JAXP (Java API for XML Processing) provides both SAX and DOM
  • parsers with other interfaces
Many parsers supplied in program libraries, essentially invisible.
For humans, not much difference.
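
To make the difference between the event-driven and tree-based interfaces concrete, here is a minimal sketch using the SAX and DOM modules in Python's standard library (not one of the parsers listed in this talk). The sample document and the names ToolNameHandler, names_via_sax, and names_via_dom are invented for illustration.

```python
import xml.sax
from xml.dom import minidom

SAMPLE = "<tools><tool name='jEdit'/><tool name='Serna'/></tools>"


class ToolNameHandler(xml.sax.ContentHandler):
    """SAX: the parser pushes events at us; we keep only what we need."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        if name == "tool":
            self.names.append(attrs.getValue("name"))


def names_via_sax(text):
    handler = ToolNameHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def names_via_dom(text):
    # DOM: the whole document is built in memory first, then navigated.
    document = minidom.parseString(text)
    return [node.getAttribute("name")
            for node in document.getElementsByTagName("tool")]


print(names_via_sax(SAMPLE))
print(names_via_dom(SAMPLE))
```

Both functions return the same list; what differs is who drives the traversal and how much of the document has to be held in memory at once.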

Major current parsers

Other parsers

Some of these are listed mostly for historical interest.
  • Aelfred (Saxon version)
  • Aelfred 2 (GNU XML package)
  • Crimson (Sun) XML parser in JDK up to 1.4; SAX, DOM
  • GNU JAXP
  • Lark (Tim Bray) one of the first XML parsers; Larval (validating) built on same code base
  • VTD-XML
  • XML::SAX::PurePerl fallback when others unavailable
  • XML::Twig to process large documents in limited space
  • XML::Xerces (Perl interface to Xerces C++)
  • XParse-J “aspires to be the smallest Java XML parser on the planet.”
  • XOM (Elliotte Rusty Harold)
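
The idea of processing large documents in limited space can be sketched with incremental parsing in Python's standard library. This is an analogous technique, not a description of how XML::Twig or any other parser listed above works; the function name count_items and the sample data are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET


def count_items(stream, tag="item"):
    """Count <tag> elements, discarding each one's content once it is seen."""
    count = 0
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's children, text and attributes
    return count


big_document = "<list>" + "<item>x</item>" * 10000 + "</list>"
print(count_items(io.BytesIO(big_document.encode("utf-8"))))
```

The emptied elements still hang off the root, so memory use is reduced rather than constant; a stricter version would also detach them from their parent.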

Parsing infrastructure

In particular, XML catalogs and URI resolvers. Both SAX and JAXP provide hooks for user-specified entity resolvers.
Some but not all parsers support catalogs. It matters.
For more information see Norm Walsh's article XML Entity and URI Resolvers.
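
As a rough illustration of the entity-resolver hook, the sketch below uses the SAX module in Python's standard library. The dictionary LOCAL_CATALOG is a hypothetical stand-in for a real catalog (it is not the OASIS XML Catalogs format), and the URL, class names, and sample document are invented for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical stand-in for a catalog: system identifier -> local replacement.
LOCAL_CATALOG = {
    "http://example.org/chapter1.xml": b"<chapter/>",
}


class CatalogResolver(EntityResolver):
    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        if systemId not in LOCAL_CATALOG:
            return systemId  # default behaviour: let the parser fetch it
        source = InputSource()
        source.setByteStream(io.BytesIO(LOCAL_CATALOG[systemId]))
        return source


class ElementNames(ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


DOCUMENT = (
    '<!DOCTYPE book [\n'
    '  <!ENTITY chapter1 SYSTEM "http://example.org/chapter1.xml">\n'
    ']>\n'
    '<book>&chapter1;</book>'
)

resolver = CatalogResolver()
handler = ElementNames()
parser = xml.sax.make_parser()
# External general entities are not processed by default in current Pythons.
parser.setFeature(feature_external_ges, True)
parser.setEntityResolver(resolver)
parser.setContentHandler(handler)
parser.feed(DOCUMENT)
parser.close()
print(resolver.requested, handler.names)
```

The point of the hook is that the document keeps its public, stable identifier while the parser reads a local copy, with no network access.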

Validation

DTD validators

All parsers described as “validating” are DTD validators.

XSD validators

  • libxml2 / xmllint (Gnome) partial implementation
  • MSV Multi-schema validator (Sun)
  • Xerces-C (Apache)
  • Xerces-J (Apache) Apache version, JDK version*
  • XSV (Univ. Edinburgh and W3C)
Many data binding tools also produce validating code. See also

Relax NG validators

  • Jing (James Clark) also validates XSD, Schematron 1.5, NVDL
  • libxml2 / xmllint (Gnome) partial implementation
  • MSV Multi-schema validator (Sun)
  • RNV (David Tolpin) implementation of Relax NG compact syntax
See also list at RelaxNG.org, and
  • Trang (schema translator)

Schematron implementations

Conversion tools

Won't be covered here; see Lars Marius Garshol's list.

XForms 1: in the browser

XForms 2: server-side

The server does most or all of the work:

XForms 3: stand-alone

Stand-alone and embedded:
  • OpenOffice
  • X-Smiles (Helsinki Univ. of Technology) XML browser in Java, runs both stand-alone and embedded; can also run in browser as applet

Office documents

Processing (1)

  • XML and programming languages (‘data binding’)
  • Web services
  • XML messages / PKI
  • programmer tools
  • user-interface specification
  • processing tools specifically for XML

Data binding tools

See also object-relational mapping.

Web services toolkits

See also lists here

XML messages

Transmitting XML messages may involve:
  • encryption
  • digital signatures
  • canonicalization
  • compression
  • character recoding issues
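
Why canonicalization sits on this list next to signatures and compression can be shown in a few lines. The sketch below uses only Python's standard library (canonicalize requires Python 3.8 or later); the sample messages are invented, and the SHA-256 digest merely stands in for the hashing step of a real XML signature.

```python
import gzip
import hashlib
from xml.etree.ElementTree import canonicalize

# Two serializations of the same message: attribute order, quoting and
# empty-element syntax differ, the information does not.
message_a = '<order id="42" currency="EUR"><item/></order>'
message_b = "<order currency='EUR'   id='42'><item></item></order>"

canonical_a = canonicalize(message_a)
canonical_b = canonicalize(message_b)


def digest(text):
    """SHA-256 over the UTF-8 bytes; a plain hash, not a digital signature."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Compression round trip on the canonical form.
compressed = gzip.compress(canonical_a.encode("utf-8"))
restored = gzip.decompress(compressed).decode("utf-8")

print(canonical_a == canonical_b)                  # canonical forms agree
print(digest(message_a) == digest(message_b))      # raw digests do not
print(digest(canonical_a) == digest(canonical_b))  # canonical digests do
print(restored == canonical_a)
```

A digest computed over the raw bytes breaks as soon as an intermediary re-serializes the message; one computed over the canonical form does not.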

Public-key infrastructure

Most public-key infrastructure specs now implemented in standard libraries. But see also:

Compression

Other programmer tools: toolkits

  • 4Suite open-source platform for XML and RDF processing (→ Amara2)
  • lxml Python access to libxml
  • LT-XML (Language Technology Group, Univ. Edinburgh) a set of tools in C, with Python interfaces; includes sggrep

Other programmer tools: diff

File comparison (diff) tools:
  • XmlDiff (part of VM Tools; aimed at data-oriented XML only, not human-readable documents)
  • 3DM XML 3-way merging and differencing tool (Tancred Lindholm)
  • diffxml (Adrian Mouat)
  • DiffMk (Norm Walsh)
  • XMLUnit (diff class, part of larger complex)
  • JXyDiff
  • diffx (Topologi)
  • xmlpatch (includes a simple xml-diff utility)
  • xmldiff (by CoreFiling; uses xmlpp pretty-printer, then system diff)
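
The simplest strategy mentioned above (pretty-print, then run an ordinary line diff) can be sketched with Python's standard library. This illustrates the general approach only, not the code of any tool listed; the function names and sample documents are invented, and ET.indent requires Python 3.9 or later.

```python
import difflib
import xml.etree.ElementTree as ET


def pretty_lines(text):
    """Re-serialize with one element per line so layout differences vanish."""
    root = ET.fromstring(text)
    ET.indent(root)  # Python 3.9+
    return ET.tostring(root, encoding="unicode").splitlines()


def changed_lines(old, new):
    """Return only the added and removed lines of a unified diff."""
    diff = difflib.unified_diff(pretty_lines(old), pretty_lines(new), lineterm="")
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]


old = "<order><item>tea</item><item>milk</item></order>"
new = "<order>\n <item>tea</item>\n <item>sugar</item></order>"
print(changed_lines(old, new))
```

The whitespace differences between the two inputs disappear and only the changed item is reported. Like any line-based comparison, this suits data-oriented XML better than mixed-content documents.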

User-interface description languages / tools

See also XForms.

Processing (2): native-XML tools

XSLT

  • Amara2 XSLT 1.0 + EXSLT
  • Gestalt (Gnu / Colin Adams) basic-level XSLT 2.0 processor, in Eiffel
  • libxslt (Gnome / Daniel Veillard) XSLT 1.0, in C
  • Saxon (Saxonica / Michael Kay) 1.0 and 2.0
  • Xalan-C++ (Apache) XSLT 1.0
  • Xalan-Java (Apache) XSLT 1.0
And do not overlook
  • FXSL (Dimitre Novatchev) library that makes XSLT 1.0 and 2.0 into fully functional languages*

XQuery (1)

Static / on-disk / indexed XQuery:
  • BaseX (Univ. Konstanz) native XML db, Java, with visual front end
  • eXist (Wolfgang Meier) very popular XML database
  • MonetDB/XQuery (CWI, Amsterdam) XQuery front end to MonetDB SQL database
  • Oracle Berkeley DB XML (sic)
  • OrientX (Renmin University of China) native XML dbms
  • Rainbow (Worcester Polytechnic) XQuery processing system using relational technology
  • Sedna (Institute for System Programming of the Russian Academy of Sciences) native XML database in C/C++
  • XQEngine (Fatdog Software Inc.), oriented toward collections and full-text search

XQuery (2)

Dynamic / in-memory XQuery implementations:
  • GCX (DBIS, Univ. Freiburg) Garbage-collected XQuery, in-memory streaming XQuery processor
  • Saxon (Michael Kay / Saxonica)
  • Qexo (GNU) partial implementation based on Gnu Kawa, compiles queries to Java byte code (or to native code)
  • qizx/open
  • QueryMachine.XQuery (Semyon A. Chertkov) standalone XQuery implementation in .NET
  • xbird “light-weight” processor in Java
  • XQiB (FLWOR Foundation) XQuery in the browser “the same as JavaScript, just with less code” (experimental)
  • XQilla
  • XQP (Univ. Texas at Arlington) XQuery processing on a P2P system

XQuery (3)

Other XQuery implementations:
See also
  • nux (Lawrence Berkeley Lab) Java toolkit for XML processing (XQuery, update, full-text search ...)
  • XQDT (FLWOR Foundation) XQuery Development Tools
  • XQ2XML (David Carlisle) translations from XQuery into XML, XQueryX, XSLT, and ... XQuery

XPath

  • AquaPath (Todd Ditchendorf) XPath 2.0 evaluator (Mac OS X only)
  • PsychoPath (Eclipse) schema-aware XPath 2.0 processor (library; API, no UI)
  • XpathWorkbook Eclipse plug-in for testing XPath expressions

XProc processors

  • Calabash (Norman Walsh)
  • Cocoatron (Todd Ditchendorf) Mac OS X only
  • Half-Pipe (Philip Fennell) partial implementation in XSLT, atop Saxon 9
  • Tubular (Herve Quiroz)
  • xmlsh (David Lee) command line shell for XML; now includes XProc module
  • xprocxq (James Fuller) XProc in XQuery (to be integrated with eXist)
  • yax (Jörg Möbius)

XML scripting

  • xmlsh (David Lee) command line shell for XML; now includes XProc module
  • xsh and XML::XSH (Petr Pajas) XML editing shell (and Perl access)
  • virgule (HTML/XML-based scripting language “conceptually similar to Lisp”)

Storage and retrieval

  • XML and databases
  • object/relational mapping
  • native XML databases
  • XML indexing and search
  • intelligent search

XML and (SQL) databases

SQL databases with some form of XML support (see also XQuery)
  • MonetDB (CWI, Amsterdam) SQL database with XML support (also XQuery front end)
  • MySQL 5.1 and 6.0 have (limited) support for XML*
  • mysqldump can dump database contents in a simple (flat) XML form.
  • PostgreSQL (limited XML support)

Object/relational mapping

Mapping back and forth among objects, XML, and relational DBMS.
  • Cayenne (Apache project for object relational mapping, persistence and caching for Java; can serialize to XML)
  • Castor: Java objects → XML → Java objects
  • dbsql2xml maps relational data into trees; can de-normalize for nicer trees
  • DBIx::XML::DataLoader, Perl module to transfer data from an XML document into a SQL database
  • XML::Generator::DBI, Perl module for creating XML from existing DBI datasources
  • Hibernate, object/relational persistence and query service; maps between relations and objects and/or XML (in DOM4J form)
  • Hyperjaxb2 combines JAXB and Hibernate, generates Hibernate mappings from XSD schemas
  • XML-DBMS (Ron Bourret)
See also data binding.
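
A toy round trip (objects to XML, XML back to objects, objects into a relational table and out again) shows what these tools automate. The sketch uses only Python's standard library; the Tool class, the table layout, and the function names are invented for illustration and do not correspond to any package listed above.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    language: str


def tools_to_xml(tools):
    # Objects -> XML: one <tool> element per object, fields as attributes.
    root = ET.Element("tools")
    for tool in tools:
        ET.SubElement(root, "tool", name=tool.name, language=tool.language)
    return ET.tostring(root, encoding="unicode")


def xml_to_tools(text):
    # XML -> objects.
    return [Tool(e.get("name"), e.get("language"))
            for e in ET.fromstring(text).findall("tool")]


def store_tools(connection, tools):
    # Objects -> relational rows.
    connection.execute("CREATE TABLE tool (name TEXT, language TEXT)")
    connection.executemany("INSERT INTO tool VALUES (?, ?)",
                           [(t.name, t.language) for t in tools])


def load_tools(connection):
    # Relational rows -> objects, in insertion order.
    rows = connection.execute("SELECT name, language FROM tool ORDER BY rowid")
    return [Tool(*row) for row in rows]


original = [Tool("Jaxe", "Java"), Tool("libxslt", "C")]
connection = sqlite3.connect(":memory:")
store_tools(connection, xml_to_tools(tools_to_xml(original)))
print(load_tools(connection) == original)
```

Real mapping tools derive this code from schemas or class definitions and cope with nesting, types, and identity, all of which the sketch ignores.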

Smart applications

Indexing and search

See also XML and databases, XQuery.

Topic maps

  • Isidorus (Marc Wilhelm Küster) TM engine in Common Lisp
  • LTM processor (Ontopia / Bouvet) reads linear topic-map format, builds object structures, exports XML
  • mappa (Lars Heuer) Python TM engine
  • Onotoa (Hannes Niederhausen) Eclipse-based ontology editor for TM
  • Ontopia (Bouvet) topic-map engine, now open-source.
  • QuaaxTM (Johannes Schmidt) TM engine with PHP interfaces
  • RTM Ruby Topic Maps (Benjamin Bock) TM engine for Ruby
  • SharpTM (Marcel Hoyer) small TM engine for .NET
  • Tiny TiM Java TM engine with “small overhead and minimal runtime dependencies”
  • TM (Robert Barta) Perl module for topic maps reference model
  • TM++ (Inge Henriksen) embedded persistent TM engine
  • TM4J (Kal Ahmed?) umbrella for several Java-based TM packages (TM engine, desktop TM navigator, graph creation tools, integration with Web application frameworks)
  • TM4Jscript (Alexander Johannesen, Thomas Passin) TM engine in Javascript
  • TMAPIX (Lars Heuer) Java library for TMAPI-compliant engines, provides XPath-like queries
  • Versavant (Steve Newcomb) “Topic Map Application (TMA) Bus / Subject Addressing Engine” following TM reference model (rather than TM data model)
  • Wandora “desktop application to build and manage topic maps” with GUI interface
  • WP2TM (TopicObserver.com) WordPress plug-in to turn RSS feed into XTM feed
  • XTM4XMLDB (Stefan Lischke) Java TM engine, supports TMAPI atop any XMLDB database
  • ZTM Zope Topic Maps (Bouvet) Python-based tools for building TM-driven portals
See also http://www.fuzzzy.com/ (TM-driven ‘folktology’ site)

Ontology managers (etc.)

Including semantic tools I didn't know what else to do with.
  • Gnowsys (GNU) “generic distributed network based memory/knowledge management”
  • Protege

Other ‘smart application’ tools

Annotation handling:
  • ANNIS2 (Univ. Potsdam) tools for manipulating and searching data in PAULA format
  • AXE (MITH, Univ. Maryland) Ajax XML Encoder, web-based tool for tagging text, video, etc. with XML metadata
  • CATMA Computer Aided Textual Markup and Analysis (Univ. Hamburg)
  • CWB Corpus Workbench (IMS, Univ. Stuttgart) “(partial) support of structural annotations (e.g. SGML)”; central component is CQP Corpus Query Processor
  • Dexter (Boston U.) tools for stand-off annotation
  • Elan (Max Planck Institute for Psycholinguistics) complex annotations on video and audio resources
  • EXMARaLDA (Univ. Hamburg) esp. for discourse analysis
  • GATE General Architecture for Text Engineering (NLP group, Univ. Sheffield)
  • iNote (IATH, Univ. Virginia) XML annotation of images
  • ITE Interlinear Text Editor (Michel Jacobson)
  • MMAX2 multi-modal annotation
  • Monk Metadata Offer New Knowledge - “digital environment” for studying patterns in texts
  • Nite XML Toolkit (Univ. of Edinburgh) tools for managing heavily annotated corpora
  • NLTK Natural Language Tool Kit (Steven Bird, Edward Loper, Ewan Klein et al.) Python modules for NLP
  • PACX Platform for Annotated Corpora in XML
  • Rapid Miner (originally Univ. Dortmund, now Rapid-I) data mining toolkit providing simple operators combinable via GUI
  • SACODEYL (System Aided Compilation and Open Distribution of European Youth Language) tools support multi-media, transcription, annotation, and search in TEI P5 format
  • SoundIndex (Michel Jacobson) authoring tool for text/sound synchronization
  • Transcriber (for transcription and annotation of spoken language)
  • Xaira XML Aware Indexing and Retrieval Architecture (Oxford University Computing Services) generalization of Sara (British National Corpus search tool)

Output

Display, styling, web delivery (but see also above):
  • XML formatters
  • images, graphics
  • Web application development

Publishing / formatting

  • FOP (Formatting Objects Processor)
  • Scribus desktop publishing (uses XML internally)
  • xmlroff XSL formatter (focused on DocBook)

XML and graphics

  • Batik (Apache) SVG renderer
  • Inkscape GUI SVG editor
  • librsvg (Gnome) component to enable software to support SVG

Content management / publishing frameworks

Thank you

Thank you.
Questions?

Group work

Miscellaneous additional material

Credits

And thanks for their assistance to Robin Berjon (Robineko), Ron Bourret (rpbourret.com), Anthony B. Coates (Londata), Micah Dubinko (xformsinstitute.com), Michael Dyck, Betty Harvey (Electronic Commerce Connection, Inc.), Jirka Kosek, Deborah Aleyne Lapeyre (Mulberry Technologies), Steven R. Newcomb (Cool Heads), Uche Ogbuji (Zepheira), Liam Quin (W3C), B. Tommie Usdin (Mulberry Technologies), and Mohamed Zergaoui (Innovimax).