/*
Alloy version of XPath 1.0 data model
C. M. Sperberg-McQueen
19-26 January 2010
Rev. 28 January 2010 (see change log at end)
*/
/* Copyright (c) 2010 Black Mesa Technologies LLC */
module xpath10
/****************************************************************
1 Introduction
****************************************************************/
/* This Alloy model consists of a more or less direct translation into
Alloy of the specification of the XPath 1.0 data model in the XPath
1.0 specification [W3C 1999]. Note that the focus is on the data
model; no attempt is made here to describe the full semantics of XPath
1.0.
The model presented is intended to serve several purposes:
- as a simple exercise in Alloy
- as a standard library resource for use by other Alloy models
describing specifications or software building on this model
- as a tool for checking the definition of the XPath 1.0 model for
completeness and correctness
- as a tool for checking possible reformulations of the model for
completeness and correctness
The presentation assumes that the reader has at least a passing
familiarity with XML [W3C 2008] and XPath 1.0 [W3C 1999], or else a
capacity for following discussions of unfamiliar material without
discomfort. A beginner's familiarity with the Alloy notation [Jackson
2006] will also be helpful, but every important formula given in Alloy
is also paraphrased in English, so that readers unfamiliar with the
notation should be able to follow the main lines of the development
even if they miss some details.
To make it easier to check the correspondence of the Alloy formulation
to the prose of the specification, we interleave the two: first a
quotation from the spec, then the Alloy formulation, together with any
commentary that seems helpful. The extracts from the specification
are given in document order, for the benefit of readers who wish to
follow along in context. Spec passages not relevant for modeling
purposes are omitted.
Experience suggests that a few words about the assumptions of the work
and the nature of the commentary may be in order so as to reduce the
likelihood of misunderstandings by some readers.
The author's assumption is that one purpose of including an explicit
definition of a "data model" in the XPath 1.0 specification is to lay
out explicitly the formal relations among the objects in an XML
document, as that document is operated upon by an XPath 1.0 processor.
These relations have long been familiar to users of SGML and XML and
they are closely tied to properties built into XML by the rules of the
grammar given in [W3C 2008]. One purpose of formulating them
explicitly, in terms mostly independent of the XML grammar, is to
establish that any data structure representing the abstract data model
described here can be operated upon by software using this data model,
whether the data structure was created by parsing an XML document or
by other means. In such an exercise it is preferable not to leave any
important assumptions vague or implicit, and not to appeal to common
knowledge, common belief, or the XML spec itself. It is also a
natural goal to make the definition of the data model complete and
self-explanatory, free of reliance on rules given elsewhere in the
specification.
In some cases, constructs and rules given elsewhere in the XPath 1.0
spec rely upon certain properties being true of any instance of the
data model; these constructs and rules may be ill-defined or
non-sensical if those properties are not in fact guaranteed. For
example, the rest of the XPath 1.0 specification assumes that any node
in a data model instance has at most one parent, also that it has at
most one immediate left neighbor and at most one immediate right
neighbor.
In these cases, I believe the authors of the XPath 1.0 spec intended
the properties in question to be guaranteed by the definition of the
model and believed that they were in fact guaranteed; I do not believe
that they intended the implicit assumption of those properties
elsewhere in the spec to constitute an additional constraint upon
instances of the data model. Now in point of fact, the definition of
the data model in section 5 of [W3C 1999] does not in fact guarantee
all the properties assumed elsewhere. Uniqueness of the parent is
guaranteed, for example, while uniqueness of immediate neighborhood is
not, at least if the formalization presented here is correct. In
these cases I believe the failure of the model to guarantee the
property assumed by the rest of the XPath 1.0 spec reflects an
unintended flaw in the definition of the data model. I do not believe
that it reflects a decision to specify the required properties
obliquely, by formulating other rules which silently assume that
properties hold. The assumptions made in the rest of the
specification make clear, in these cases, what was intended. But they
do not constitute part of the definition of the data model and do not
fill the gaps in the definition.
In cases where the model is underspecified, the entailments of the
rest of the XPath spec have naturally enough made clear to
implementors what is expected, so in fact implementations of XPath 1.0
typically exhibit consistent behavior. This fact will suggest to some
readers that there really is no problem, because there is no
uncertainty in practice about what was intended. If I insist upon
saying that the data model does not determine a unique answer to
certain questions, it is not because I am unfamiliar with the original
working groups' intent and with current implementation practice. If I
regard the gaps in the definition as flaws to be corrected, it is not
because I believe that the gaps have led to interoperability problems
but because I believe that the creators of the spec intended the
definition of the data model to be complete and self-sufficient. They
came close enough to that goal that it would be a shame not to go the
rest of the way if we can. A tool like Alloy can help in the careful
analysis necessary.
*/
/****************************************************************
2 General
****************************************************************/
/* We begin at the beginning of section 5 of [W3C 1999], entitled
"Data Model."
XPath operates on an XML document as a tree. This section
describes how XPath models an XML document as a tree. This model
is conceptual only and does not mandate any particular
implementation. The relationship of this model to the XML
Information Set [XML Infoset] is described in [B XML Information
Set Mapping].
XML documents operated on by XPath must conform to the XML
Namespaces Recommendation [XML Names].
This Alloy formulation does not go into the details of the Namespaces
specification (but see the definition of the Name signature and the
name_eq predicate in section 3 below). */
/****************************************************************
2.1 Nodes
****************************************************************/
/*
The tree contains nodes. There are seven types of node:
- root nodes
- element nodes
- text nodes
- attribute nodes
- namespace nodes
- processing instruction nodes
- comment nodes
For every type of node, there is a way of determining a
string-value for a node of that type. For some types of node, the
string-value is part of the node; for other types of node, the
string-value is computed from the string-value of descendant
nodes.
We define an abstract signature Node, which will have Root, Element,
etc. as refinements. We'll give Node a stringvalue property, though
at the moment this formalization does nothing useful with it. Node
also has a parent property, though the spec has not yet mentioned it
(see below).
*/
abstract sig Node {
stringvalue : UCSstring,
parent : lone Node
}
// The Alloy Analyzer seems to treat 'String' as a reserved name, so
// we'll call it UCSstring.
sig UCSstring {}
/****************************************************************
2.2 Expanded names
****************************************************************/
/* The Namespaces recommendation [W3C 2009] defines 'qualified names'
(QNames) consisting of a namespace prefix (bound by the context to a
namespace URI) and a 'local name', separated by a colon. For example
'mathml:equation'. The interpreted form of a QName is an expanded
name:
Some types of node also have an expanded-name, which is a pair
consisting of a local part and a namespace URI. The local part is
a string. The namespace URI is either null or a string. The
namespace URI specified in the XML document can be a URI reference
as defined in [RFC2396]; this means it can have a fragment
identifier and can be relative. A relative URI should be resolved
into an absolute URI during namespace processing: the namespace
URIs of expanded-names of nodes in the data model should be
absolute. Two expanded-names are equal if they have the same local
part, and either both have a null namespace URI or both have
non-null namespace URIs that are equal.
We define Name in the obvious way, as an expanded name. We have no
other form of names, so there is no need to call this one an
'expanded' name; there is no other kind of name here.
*/
sig Name {
NSName : lone UCSstring,
Localname : UCSstring
}
pred name_eq[a, b : Name] {
a.NSName = b.NSName
and
a.Localname = b.Localname
}
/****************************************************************
2.3 Document order
****************************************************************/
/* The spec then defines the term 'document order', which plays a
central role in the semantics of XPath 1.0.
There is an ordering, document order, defined on all the nodes in
the document corresponding to the order in which the first
character of the XML representation of each node occurs in the XML
representation of the document after expansion of general
entities. Thus, the root node will be the first node. Element
nodes occur before their children. Thus, document order orders
element nodes in order of the occurrence of their start-tag in the
XML (after expansion of entities). The attribute nodes and
namespace nodes of an element occur before the children of the
element. The namespace nodes are defined to occur before the
attribute nodes. The relative order of namespace nodes is
implementation-dependent. The relative order of attribute nodes is
implementation-dependent. Reverse document order is the reverse of
document order.
Several aspects of this definition invite comment in the context of
formalization:
- The description given seems to imply that the full details of
the ordering follow from the cases given. It will be interesting
to see if they do.
- Note that by itself, the rule that parent nodes precede their
children does NOT guarantee, as the spec suggests, that element
nodes occur in the order of the occurrence of their start-tags in
the XML document.
The text implicitly assumes a rule that Document order for
siblings follows the order assigned by their parent's ordered list
of children (supplied in the definition of basic precedence
below). In the rather pedantic state of mind encouraged by
exercises like this one, this seems something one might have
expected to be stated explicitly. But the XPath 1.0 spec
was not written by pedants.
- It's not completely clear in what vein to take the references to
the order of character representations in XML syntax.
On the face of it, at least some of the references appear to be
informative, not normative, intended to provide context and help
confirm the reader's understanding of the normative statements.
The order being defined is intended to correspond to the order of
character representations in the XML syntax, and some of the
individual constraints have obvious analogues in the grammar of
XML. If the references to XML syntax were intended as part of the
actual specification of the ordering, there would be no need for
several of the statements in the paragraph about the relative
positions of parents, attributes, and children.
The rules concerning namespace nodes and attributes, on the other
hand, do not in the general case agree with the ordering of
character representations in the XML.
Some readers, however, may disagree and take the mentions of
serial-form XML as substantive and normative (except as overridden
by the other rules given). Such readers will correspondingly find
some aspects of the formalization here to disagree with. Note,
however, that a NORMATIVE appeal to the order of character
representations in XML syntax would, to be effective, require an
explicit account of the mapping between XML surface syntax and
data model instance which is not in fact given in the spec.
- The spec mentions an "ordering". It seems likely that a total
order is intended, but this is not explicitly specified, and in
most contexts a partial ordering is in fact counted as an
ordering.
To derive a general ordering covering all nodes from the basic cases
explicitly specified, we define two predicates, 'basic_precedes' for
the primitive cases, and 'precedes' for the transitive closure.
(Alloy note: it might be more convenient to define some
immediate-precedes relation on Node, so that the general 'precedes'
could just be "b in a.^immediate_precedes", but the current
formulation seems closer to the spec so we stick with it. The
alternative formulation may be worth experimenting with, sometime.)
*/
pred basic_precedes[disj a, b : Node] {
// the root node is first, so it precedes everything
(a in Root)
// parents precede their children
or (a + b in Element and a = b.parent)
or (a in Element and b in (Attribute + NSNode) and a = b.parent)
or (a in NSNode and b in Attribute and a.parent = b.parent)
// "The attribute nodes and namespace nodes of an element
// occur before the children of the element."
or (a in (Attribute + NSNode) and b in Element and
a.parent = b.parent)
// "The namespace nodes are defined to occur before the
// attribute nodes."
or (a in NSNode and b in Attribute and a.parent = b.parent)
// The next clause is implicit in the text; I think it is clear
// from the statement about start-tags and from the description
// of ordered child nodes that this is intended.
or (a + b in Element and a.parent = b.parent
and a.parent.chseq.idxOf[a] < a.parent.chseq.idxOf[b])
}
// a precedes b, in general, if there is a chain of basic_precedes
// relations
pred full_precedes[a, b : Node] {
basic_precedes[a,b]
or
some c : Node | basic_precedes[a,c] and full_precedes[c,b]
}
/* Note: Oops. The definition just given, being recursive, is not
suitable for Alloy. Found this out much later; I should have run the
Analyzer more frequently along the way. We'll need to provide a
different definition of document order using some sort of 'precedes'
relation on nodes, and transitive closure. */
/****************************************************************
2.4 Children, parents, descendants
****************************************************************/
/*
Root nodes and element nodes have an ordered list of child
nodes. Nodes never share children: if one node is not the same
node as another node, then none of the children of the one node
will be the same node as any of the children of another node.
We model the ordered list of child nodes with a chseq property whose
value is a sequence of nodes. It is convenient to be able to refer to
the children as a set, so we define the property ch, whose value is a
set of nodes, as the set of elements in the sequence chseq. (The
first pair of braces in an Alloy signature declaration enclose
declarations of properties of instances of that signature; the second
pair encloses constraints holding for all instances of the signature.)
*/
abstract sig NodeWithChildren extends Node {
chseq : seq Node,
ch : set Node
}{
ch = chseq.elems
}
/* The prohibition on shared children is most conveniently formulated
as a fact, i.e. a constraint holding for all instantiations of the
model. But in fact there is no need to state it as a fact: it follows
from the other rules given, in particular the statement immediately
following that every node other than the root has exactly one
parent. */
assert no_shared_children {
all disj p, q : NodeWithChildren |
// the intersection of P's children with Q's children is the empty set
no (p.ch & q.ch)
}
check no_shared_children for 7
/* "Every node other than the root node has exactly one parent,
which is either an element node or the root node. A root node
or an element node is the parent of each of its child nodes."
*/
// None of these follow from other constraints as currently
// formulated; all must be stated explicitly.
fact only_root_parentless {
all n : Node | #(n.parent) = 0 iff (n in Root)
}
fact parent_type {
univ.parent in (Element + Root)
}
fact parent_child {
~ch in parent
}
/* "The descendants of a node are the children of the node
and the descendants of the children of the node."
*/
fun descendants[ n : Node ] : set Node {
n.^ch
}
/****************************************************************
2.5 A note on node identity and element identity
****************************************************************/
/* The XML spec is programmatically vague about ontological questions;
XML may be processed using many different models. It provides a
grammar and additional constraints which together make it possible to
distinguish XML data streams from other data streams, but the XML spec
does not attempt to say what kind of thing an XML document is, or when
in the general case XML constructs are to be treated as identical,
distinct but equivalent, or different. In the following document, for
example, it may be convenient in one application to treat the two
occurrences of the string "" as denoting two occurrences of "the
same element", and in another application to treat the two occurrences
as two distinct elements.
(1)
The idea that there is only one 'b' element in this document is not
chosen at random; since XML defines a grammar for all XML documents,
it is not unnatural to hold, as some have done, that an XML document
is essentially a sequence of characters (and similarly for XML
elements). In normal usage, the term 'sequence of characters' is
taken as meaning a sequence of character-types, not of
character-tokens. We might say that there are two occurrences of *the
element* "" in document (1), rather than two elements with the
same character representation, just as we say that the integer three
occurs twice in the fourth line of Pascal's triangle (1, 3, 3, 1) and
not that there are two distinct integers named three which are here
written using the same numeral (or similar numerals?).
It might in retrospect have been useful to prescribe a usage for these
questions, but for better or worse the creators of XML did not do so.
In defining a model to be used for particular software operating on
XML documents, however, it is clearly helpful to make clear which
notion of element identity is to be used. The XPath 1.0 data model,
for example, specifies by means of the rule against shared children
(see section 2.4 above) that in document (1) there are two 'b'
elements, not one.
Oddly, the spec does NOT say explicitly that no two positions in the
list of children are occupied by the same element. Would it be
consistent with the definition of the data model to hold that in the
document
(2)
there is only one b element? It is clearly not the intended
interpretation, since the definition of the preceding-sibling and
following-sibling axes elsewhere in the XPath 1.0 specification
assumes that each element node has a unique preceding and a unique
following sibling (except the first and last children of a given
parent, which lack predecessor and successor respectively). But while
the spec includes an explicit rule about parentage, it includes no
explicit rule requiring the sequence of children to be an injection
(i.e. requiring all siblings to be distinct nodes). If there is any
way to prove, from the specification of the data model, that there are
four distinct element nodes in the model instance for document (2),
and not two, I have not seen it.
Alloy's analysis makes clear that the disjointness of siblings does
not follow from the other rules specified; the instances generated by
the Alloy Analyzer include numerous cases of the same node appearing
more than once among the children of its parent. An explicit test of
documents (1) and (2) shows the difference.
*/
lone sig A, B, C extends Name {}
pred document_1[r : Root, a, b, c : Element] {
// Three elements named in the obvious say
a.gi = A
b.gi = B
c.gi = C
// None of them has any attributes
no (a.atts + b.atts + c.atts)
// A is the document element, the single child of root.
r.chseq = {0 -> a}
// The elements have the parent-child relations shown:
a.chseq = {0 -> b} + {1 -> c}
b.chseq.isEmpty
c.chseq = {0 -> b}
}
// Owing to its violation of the no-shared-children rule,
// predicate document_1 has no instances, as can be
// seen by running the document_1 predicate in any
// suitable scope.
run document_1 for 3 // but 1 Root, 1 A, 1 B, 1 C
run document_1 for 7
pred document_2 [r : Root, a : Element] {
// rather than specify a particular configuration, we'll
// describe the constraints and let Alloy decide how many
// nodes to use for the children of A.
// element A has gi A, no attributes, and three children.
a.gi = A
and no (a.atts)
and #(a.chseq) = 3
// A is the document element, the single child of root.
and r.chseq = {0 -> a}
// Each child of A has the same GI, no children, no atts
and (all c : Node |
(c in a.ch)
implies
(c in Element and c.gi = B and #(c.atts) = 0 and #(c.ch) = 0))
}
// When we ask for instances of document_2, we can see that
// Alloy finds instances with four distinct elements, but also with
// two. And three.
run document_2 for 3 but 1 Root, 4 Element
// If the number of instances generated becomes burdensome,
// it may be convenient to select just those in which A has a
// particular number of distinct children
pred document_2_select {
some a : Element | some r : Root |
document_2[r, a]
and #(a.ch) = 3 // 1, or 2, or 3 (larger numbers have no instances)
}
run document_2_select for 7
/*
As a side note: The XML Information Set specification [W3C 2004]
similarly has no rule specifying whether an information item can
appear more than once in the ordered list of its parent's children; in
the Infoset, the prohibition against parents sharing children follows
naturally from the fact that each element information item is
specified as having a [parent] property.
If we wish to assume, for purposes of exploration, that all the
elements in a parent's sequence of children are pairwise distinct, we
can do so by specifying the predicate chseq_injective. It specifies
that for any node n, when the relation n.chseq (which maps from
integers to nodes) and its transpose ~(n.chseq) are composed, each
integer in the domain is mapped back to itself, and not to any other
integer. That is, the chseq relation is injective.
A second formulation may be clearer to some. The predicate
chseq_nodups uses the standard function hasDups to test whether the
sequence n.chseq contains duplicate items.
The assertion 'nodup_injective' asserts that these two predicates are
true in exactly the same cases.
*/
pred chseq_injective {
all n : Node | n.chseq.~(n.chseq) in iden
}
pred chseq_nodups {
all n : Node | not n.chseq.hasDups
}
assert nodup_injective {
chseq_nodups iff chseq_injective
}
check nodup_injective for 7
/****************************************************************
3 Individual node types
*****************************************************************
3.1 The root node
****************************************************************/
/* Section 5.1 of the spec describes root nodes.
The root node is the root of the tree. A root node does not occur
except as the root of the tree. The element node for the document
element is a child of the root node. The root node also has as
children processing instruction and comment nodes for processing
instructions and comments that occur in the prolog and after the
end of the document element.
The string-value of the root node is the concatenation of the
string-values of all text node descendants of the root node in
document order.
The root node does not have an expanded-name.
We formalize this with a Root signature, whose children are
constrained to include only processing instructions, comments, and
exactly one element. We ignore the rule about string-value for now.
*/
sig Root extends NodeWithChildren {}{
ch in (PI + Comment + Element)
#(ch & Element) = 1
}
/****************************************************************
3.2 Element nodes
****************************************************************/
/* Section 5.2 describes element nodes.
There is an element node for every element in the document. An
element node has an expanded-name computed by expanding the QName
of the element specified in the tag in accordance with the XML
Namespaces Recommendation [XML Names]. The namespace URI of the
element's expanded-name will be null if the QName has no prefix
and there is no applicable default namespace.
NOTE: In the notation of Appendix A.3 of [XML Names], the
local part of the expanded-name corresponds to the type
attribute of the ExpEType element; the namespace URI of the
expanded-name corresponds to the ns attribute of the ExpEType
element, and is null if the ns attribute of the ExpEType
element is omitted.
The children of an element node are the element nodes, comment
nodes, processing instruction nodes and text nodes for its
content. Entity references to both internal and external entities
are expanded. Character references are resolved.
The string-value of an element node is the concatenation of the
string-values of all text node descendants of the element node in
document order.
Again, we ignore string value rules for now. We also ignore the
treatment of unique ID values for elements.
Section 5.3 also mentions something relevant to elements:
Each element node has an associated set of attribute nodes; the
element is the parent of each of these attribute nodes; however,
an attribute node is not a child of its parent element.
*/
sig Element extends NodeWithChildren {
gi : Name,
atts : set Attribute,
nss : set NSNode
}{
ch in (Element + Comment + PI + Textnode)
all a : atts | this = a.@parent
all n : nss | this = n.@parent
}
/****************************************************************
3.3 Attribute nodes
****************************************************************/
/* Section 5.3 describes attribute nodes:
Each element node has an associated set of attribute nodes; the
element is the parent of each of these attribute nodes; however,
an attribute node is not a child of its parent element.
NOTE: This is different from the DOM, which does not treat the
element bearing an attribute as the parent of the attribute
(see [DOM]).
The constraint on ch, in the definition of the Element signature,
guarantees that no attribute node is a child of its parent.
Elements never share attribute nodes: if one element node is not
the same node as another element node, then none of the attribute
nodes of the one element node will be the same node as the
attribute nodes of another element node.
NOTE: The = operator tests whether two nodes have the same
value, not whether they are the same node. Thus attributes of
two different elements may compare as equal using =, even
though they are not the same node.
This already follows from other rules (specifically the definition of
the parent relation, which says each node has at most one parent, and
the second constraint on the Element signature. We can check that it
does in fact follow by means of a simple Alloy assertion. Alloy finds
no counter-examples.*/
assert att_parent_is_function {
all disj e, f : Element | no (e.atts & f.atts)
}
check att_parent_is_function for 5
/* Attributes have names:
An attribute node has an expanded-name and a string-value. The
expanded-name is computed by expanding the QName specified in the
tag in the XML document in accordance with the XML Namespaces
Recommendation [XML Names]. The namespace URI of the attribute's
name will be null if the QName of the attribute does not have a
prefix.
And attribute nodes are disjoint from namespace nodes (though both are
represented by attributes, within the meaning of the XML
specification).
There are no attribute nodes corresponding to attributes that
declare namespaces (see [XML Names]).
We define a name property for the name; the string-value property is
inherited from Node. The Attribute and NSNode signatures are declared
as extensions of Node, which guarantees that they are disjoint. So
nothing more is needed for attributes than the following
declaration. */
sig Attribute extends Node {
name : Name,
value : UCSstring
}
/* There's a gap in the model here: nothing in the definition
of the model says that no two attributes on the same element
have the same name. */
/****************************************************************
3.4 Namespace nodes
****************************************************************/
/* Section 5.4 describes namespace nodes:
Each element has an associated set of namespace nodes, one for
each distinct namespace prefix that is in scope for the element
(including the xml prefix, which is implicitly declared by the XML
Namespaces Recommendation [XML Names]) and one for the default
namespace if one is in scope for the element. The element is the
parent of each of these namespace nodes; however, a namespace node
is not a child of its parent element. Elements never share
namespace nodes: if one element node is not the same node as
another element node, then none of the namespace nodes of the one
element node will be the same node as the namespace nodes of
another element node. This means that an element will have a
namespace node:
- for every attribute on the element whose name starts with
xmlns:;
- for every attribute on an ancestor element whose name starts
xmlns: unless the element itself or a nearer ancestor redeclares
the prefix;
- for an xmlns attribute, if the element or some ancestor has an
xmlns attribute, and the value of the xmlns attribute for the
nearest such element is non-empty
NOTE: An attribute xmlns="" "undeclares" the default
namespace (see [XML Names]).
A namespace node has an expanded-name: the local part is the
namespace prefix (this is empty if the namespace node is for the
default namespace); the namespace URI is always null.
The string-value of a namespace node is the namespace URI that is
being bound to the namespace prefix; if it is relative, it must be
resolved just like a namespace URI in an expanded-name.
For the immediate purposes of this model there is no need to model
namespace nodes and their inheritance; it may be a helpful extension
for the future, but for now we'll just treat namespace nodes as an
unexamined primitive concept. */
sig NSNode extends Node {}
/****************************************************************
3.5 Processing instruction nodes
****************************************************************/
/* Section 5.5 of the spec describes processing-instruction nodes:
There is a processing instruction node for every processing
instruction, except for any processing instruction that occurs
within the document type declaration.
A processing instruction has an expanded-name: the local part is
the processing instruction's target; the namespace URI is
null. The string-value of a processing instruction node is the
part of the processing instruction following the target and any
whitespace. It does not include the terminating ?>.
NOTE: The XML declaration is not a processing
instruction. Therefore, there is no processing instruction
node corresponding to the XML declaration.
There seems no particular point in modeling the details here; they are
not the source of any particular obscurity or the focus of any
noticeable design concern. But it's easy enough and (unlike modeling
the details of NSNode inheritance) does not require extensive
reworking of the existing model. So let's do it. A PI has an
expanded_name and a target; the expanded name has no NSName part, and
the local name is the same as the target. */
sig PI extends Node {
expanded_name : Name,
target : UCSstring
}{
no expanded_name.NSName
expanded_name.Localname = target
}
/****************************************************************
3.6 Comment nodes
****************************************************************/
/* Section 5.6 describes comment nodes:
There is a comment node for every comment, except for any comment
that occurs within the document type declaration.
The string-value of comment is the content of the comment not
including the opening .
A comment node does not have an expanded-name.
*/
sig Comment extends Node {}
/****************************************************************
3.7 Text nodes
****************************************************************/
/* And finally section 5.7 of the spec describes text nodes:
Character data is grouped into text nodes. As much character data
as possible is grouped into each text node: a text node never has
an immediately following or preceding sibling that is a text
node. The string-value of a text node is the character data. A
text node always has at least one character of data.
Each character within a CDATA section is treated as character
data. Thus, in the source document will treated the
same as <. Both will result in a single < character in a text
node in the tree. Thus, a CDATA section is treated as if the
were removed and every occurrence of < and &
were replaced by < and & respectively.
NOTE: When a text node that contains a < character is written
out as XML, the < character must be escaped by, for example,
using <, or including it in a CDATA section.
Characters inside comments, processing instructions and attribute
values do not produce text nodes. Line-endings in external
entities are normalized to #xA as specified in the XML
Recommendation [XML].
*/
sig Textnode extends Node {}
/****************************************************************
4 Conclusion
****************************************************************/
/* Since one of the purposes of this model is to illustrate how Alloy
can be used to check a specification like section 5, it will perhaps
be helpful to define some predicates that can be used to make Alloy
generate instances of the model with various properties. The
following predicates and run commands are intended for that
purpose. */
// One document
pred single_doc {
#Root = 1
}
// One document with no duplicates among siblings
pred single_doc_nodups {
single_doc
chseq_nodups
}
// One document with duplicates among siblings
pred single_doc_yesdups {
single_doc
not chseq_nodups
}
run single_doc for 5
run single_doc_nodups for 5
run single_doc_yesdups for 5
/* If one examines enough instances of these predicates, one will
discover a flaw in the formalization of the parent relation here: we
have ensured that each node has at most one parent, and we have
ensured that when node A has node B as child or attribute, then B has
A as parent. But we have not ensured that when node B has node A as a
parent, then A has B as a child or attribute or NSNode. This
predicate will generate examples of this gap. */
pred unacknowledged_children {
parent != ~(ch + atts + nss)
}
run unacknowledged_children for 5
/* Note also that as currently defined, the parent relation is not
guaranteed acyclic; examples of elements which are their own parent
(but not their own child) crop up in the instances of unacknowledged
children.
It's probably fair to say that this problem is a failure in the
formulation of this model and not necessarily a flaw in the prose
definition of the model; it seems plausible to assume that a relation
labeled with names like parent and child is intended to be irreflexive
and acyclic unless otherwise stated (just as algebraists used to
assume that any operation denoted with + or the multiplication sign
must be taken automatically as commutative and associative, until
Hamilton developed an algebra in which it was not so).
For now, the repair is left as an exercise for the future. */
/* If parent is not guaranteed acyclic, owing to the gap just
mentioned, perhaps we should check to make sure that the ch relation
is acyclic, just in case. */
assert ch_acyclic {
no n : Node | n in n.^ch
}
check ch_acyclic for 5
// Oops. Bad news, that.
/* What else could go wrong? Well, we haven't yet checked to see
whether 'precedes' gives us a total ordering. (If an element can be
among its own descendants, we can already guess the answer, but let's
formulate an explicit test.) So let us assert that the ordering is
total. */
assert total_order {
all a, b : Node | a = b or full_precedes[a,b] or full_precedes[b,a]
}
check total_order for 5
/* And since precedes is intended to model <, not <=, we should have
this, too: */
assert precedes_unique {
all a, b : Node | not (full_precedes[a,b] and full_precedes[b,a])
}
check precedes_unique for 5
/* And now that we do check, we discover that the definition of
precedes is faulty: to keep things analysable, Alloy prohibits
recursive definitions of predicates. So we'll need to come back to
that. */
// Two documents (to illustrate that documents don't share)
pred two_doc {
#Root = 2
}
run two_doc for 7
// Assert that two documents don't share; seek counterexamples.
assert docs_disjoint {
all disj r, s : Root | no ( r.^(ch + atts) & s.^(ch + atts) )
}
check docs_disjoint for 7
// Alternative formulation of assertion
assert docs_disjoint2 {
all r, s : Root | all n : Node |
(r in n.*parent and s in n.*parent)
implies r = s
}
check docs_disjoint2 for 7
/* In describing work with an automatic analysis tool, the creator of
Alloy writes [Jackson 2006, p. xiii]:
The experience of exploring a software model with an automatic
analyzer is at once thrilling and humiliating. Most modelers have
had the benefit of review by colleagues; it's a sure way to find
flaws and catch omissions. Few modelers, however, have had the
experience of subjecting their models to continual, automatic
review. Building a model incrementally with an analyzer,
simulating and checking as you go along, is a very different
experience from using pencil and paper alone. The first reaction
tends to be amazement,: modeling is much more fun when you get
instant, visual feedback. When you simulate a partial model, you
see examples immediately that suggest new constraints to be added.
Then the sense of humiliation sets in, as you discover that
there's almost nothing you can do right. What you write down
doesn't mean exactly what you think it means. And when it does,
it doesn't have the consequences you expected. Automatic analysis
tools are far more ruthless than human reviewers. I now cringe at
the thought of all the models I wrote (and even published) that
were never analyzed, as I know how error-ridden they must be.
Slowly but surely the tool teaches you to make fewer and fewer
errors.
The application of this description to the case of this effort to
formalize the data model of XPath 1.0 is left as an exercise for the
reader.
*/
/****************************************************************
References
****************************************************************/
/*
Jackson 2006
Daniel Jackson, Software Abstractions: Logic, Language, and
Analysis (Cambridge, Mass.: MIT Press, 2006).
The Alloy system is also described in tutorials available on the
Alloy Web site, .
W3C 1999
World Wide Web Consortium (W3C), XML Path Language (XPath) Version
1.0, W3C Recommendation 16 November 1999, ed. James Clark and
Steve DeRose ([Cambridge (Mass.), Sophia-Antipolis, Tokyo]: W3C,
1999).
The version cited is available on the Web at
; the latest
version of the spec, reflecting all updates and corrections, is
available at .
W3C 2004
World Wide Web Consortium (W3C), XML Information Set (Second
Edition), W3C Recommendation 4 February 2004, ed. John Cowan and
Richard Tobin ([Cambridge (Mass.), Sophia-Antipolis, Tokyo]: W3C,
2004).
The version cited is available on the Web at
; the latest
version of the spec, reflecting all updates and corrections, is
available at .
W3C 2008
World Wide Web Consortium (W3C), Extensible Markup Language (XML)
1.0 (Fifth Edition), W3C Recommendation 26 November 2008, ed. Tim
Bray et al. ([Cambridge (Mass.), Sophia-Antipolis, Tokyo]: W3C,
2008).
The version cited is available on the Web at
; the latest version
of the spec, reflecting all updates and corrections, is available
at .
W3C 2009
World Wide Web Consortium (W3C), Namespaces in XML 1.0 (Third
Edition), W3C Recommendation 8 December 2009 ed. Tim Bray et
al. ([Cambridge (Mass.), Sophia-Antipolis, Tokyo]: W3C, 2009).
The version cited is available on the Web at
; the latest
version of the spec, reflecting all updates and corrections, is
available at .
*/
/* To do:
- Fix parent relation; it should be the inverse of (ch+atts+ns).
- Ensure that parent and ch are acyclic.
- Integrate namespace nodes in at least a rudimentary way.
- Fix document order definition to eliminate recursion.
- Model maximal-textnode (no adjacent textnode) rule.
- Extend to define relations among string values? Pro: makes the
model more complete. Con: makes the model larger and more
cumbersome, clarifies nothing (string values are already pretty
clear, no?)
- Elaborate QNames more fully?
- Model the details of namespace nodes and their inheritance.
This would also allow the modeling of namespace-wellformedness.
- Pair this model with a similar model based on the XML Infoset;
check correspondences and differences between them.
- Pair this model with a separate definition of the semantics of
XPath 1.0 expressions?
*/
/* Revision history:
18 March 2010: Add clauses to 'basic_precedes' predicate to cover
the rules saying NSNodes precede Attributes precede
children.
28 January 2010: Add nss property to Element (but don't model
NSNode propagation). Add value property to Attribute.
26 January 2010: Tweak some wording, correct some typos.
Publish model.
24 January 2010: Add to-do list. Work through Attribute, PI,
Comment, and Textnode. Decide not to elaborate
NSNode. Discover that parent and ch are not yet
guaranteed acyclic. Discover that the defintion of
'precedes' must be rewored to avoid recursion.
23 January 2010: Elaborate the prose comments to make the model more
accessible to non-Alloy users, and to introduce the
prose-formalism redundancy recommended by Z experts.
Move draft (still incomplete) to Web server.
20 January 2010: Blog about sibling-identity issue.
19 January 2010: Begin Alloy formalization of XPath 1.0 data model,
discover that the model does not by itself guarantee
that ./preceding-sibling[1] is unique.
*/