-
Notifications
You must be signed in to change notification settings - Fork 18
Primer: XPath Query Evaluation
This document is a primer on XPath evaluation basics to help understand how they are implemented and optimized in CommCare.
At a fundamental and native level an XPath query follows a straightforward pattern.
The expression we're concerned with starts with a reference to data somewhere in the session, which is an expression which refers to one of the session's data models.
/data/child/node[@value = 'filtered']/output
A reference doesn't necessarily refer to a value. A reference is conceptually a way to refer to any number of instance elements named output which follow the path to get there.
For clarity in the evaluation process it is helpful to distinguish between a reference which refers to an abstract set of nodes (like the one above), and "fully qualified" reference to a single node, like
/data[1]/child[1]/node[4]/output[1]
The process by which the value of "unqualified" reference is requested and is covered to a usable value is what we're referring to here as evaluating a query.
Query evaluation takes an input tree reference like
/data/child/node[@value = 'filtered']/output
and converts it to a nodeset, which is a set of fully qualified references that match that query, like
/data[1]/child[1]/node[1]/output[1]
/data[1]/child[1]/node[4]/output[1]
Often only a single node is a member of the nodeset, in which case the result of a query is the contents of the requested node. In other cases the nodeset will have multiple elements, and are only useful as inputs to functions like count(), sum(), or join().
In CommCare, XPath expressions are evaluated against an evaluation context, which provides a frame for evaluating a query. Among other things, the context provides
- The relative context of a query - IE: "What should a relative query (
../some_node) evaluate to right now?" - The "original" context of a query - IE: If we are in a nested query, what should
current()evaluate to? - The body of instances available for evaluation
- XPath variables (if relevant)
In order to keep state isolated, the operation to "expand" a reference into a nodeset of fully qualified references is performed within an evaluation context.
Any time a new context is needed (for instance, inside of a predicate expression where ../some_node needs to refer to a new relative reference), a new derivative evaluation context is created and used to expand and evaluate the new expressions.
The native process for performing a query is to walk each "step" of the provided query depth-first against the live instance, and including a fully qualified reference to each element which matches.
To evaluate the XPath expression
/data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name
against a sample body
<data>
<state population="1500">
<city capital="no" type="town">
<name>Woburn</name>
</city>
<city capital="yes" type="city">
<name>Boston</name>
</city>
<city capital="no" type="city">
<name>Cambridge</name>
</city>
</state>
<state population="500">
<city capital="no" type="city">
<name>Albuquerque</name>
</city>
<city capital="yes" type="city">
<name>Santa Fe</name>
</city>
</state>
<state population="2000">
<city capital="yes" type="city">
<name>Albany</name>
</city>
<city capital="no" type="city">
<name>New York</name>
</city>
</state>
</data>
we start at the root (/), and look for matches. Since we find one (data), we continue processing with just that node as our working set
/data[1]/
On the next step state[@population > 1000] we need to both check for children with matching names, and also evaluate a filter condition.
First the engine generates a candidate set of references which match the name pattern
/data[1]/state[1]
/data[1]/state[2]
/data[1]/state[3]
Then the engine identifies which of the candidate references is compatible with the filter condition, striking out any references which fail to meet the filter
/data[1]/state[1]/@population > 1000 = true
/data[1]/state[2]/@population > 1000 = false
/data[1]/state[3]/@population > 1000 = true
Once all candidate references are evaluated, we move on to the next step with our current working set
Current Working Set:
/data[1]/state[1]/
/data[1]/state[3]/
Remaining query to evaluate:
city[@type='city'][@capital = 'yes']/name
We'll evaluate each remaining step in the reference (city[@type='city'][@capital = 'yes']/name) against this working set. If at any point it is empty, an empty nodeset is returned and evaluation stops.
To evaluate the next step city[@type='city'][@capital = 'yes'], the engine once again generates a candidate set of matching references, based on the current instance.
/data[1]/state[1]/city[1]
/data[1]/state[1]/city[2]
/data[1]/state[1]/city[3]
/data[1]/state[3]/city[1]
/data[1]/state[3]/city[2]
and once again processes the filter expressions against those candidate references.
Predicates are evaluated depth-first and only if the previous predicate passed for that reference.
/data[1]/state[1]/city[1]/@type = 'town' > false
/data[1]/state[1]/city[2]/@type = 'city' > true
/data[1]/state[1]/city[2]/@capital = 'yes' > true
/data[1]/state[1]/city[3]/@type = 'city' > true
/data[1]/state[1]/city[3]/@capital = 'no' > false
/data[1]/state[3]/city[1]/@type = 'city' > true
/data[1]/state[3]/city[1]/@capital = 'yes' > true
/data[1]/state[3]/city[2]/@type = 'city' > true
/data[1]/state[3]/city[2]/@capital = 'yes' > false
moving forward with the resulting working nodeset below
Current Working Set:
/data[1]/state[1]/city[2]
/data[1]/state[3]/city[1]
Remaining query to evaluate:
name
Finally, the last step (name) is applied to each working reference, resulting in the final nodeset
/data[1]/state[1]/city[2]/name[1]
/data[1]/state[3]/city[1]/name[1]
which would produce an error if referenced alone, IE:
<output value="/data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name"/>
but can be used with an expression like
<output value="join(", ", /data/state[@population > 1000]/city[@type='city'][@capital = 'yes']/name)"/>
to produce
Boston, Albany