I have been reading through BaseX's documentation and found that it offers a token index as well as an attribute index. However, it is not clear to me what the difference between the two is.
Attributes seem to be the regular attributes as I know them:
<node attribute="value"/>
However, for tokens the documentation reads:
In many XML dialects, such as HTML or DITA, multiple tokens are stored in attribute values.
So it would almost seem as if tokens are values for attributes? So, like this:
<node attribute="token1 token2"/>
If that is the case, what is indexed in both these cases then? If the attribute index improves equality checks such as
//country[@car_code = 'J']
and a token index improves containment checks such as
//div[contains-token(@class, 'row')]
isn't a token index then simply an advanced attribute index, working with multiple values? Or am I missing something? When would one use the one or the other, and are they ever useful in combination?
Unfortunately, "token" means a few different things in different contexts in XPath, XML, XML Schema, DTDs, and other related technologies, which can make it a bit unclear when the term comes up.
Here they are referring to a token in the sense of a string made up of XML name characters.
Attributes can be defined in many ways; one of them is as holding multiple tokens separated by whitespace, with no meaning assigned to the order of those tokens. To take one of the examples you quote:
//div[contains-token(@class, 'row')]
This would match each of:
It would not match any of:
Yes. A very useful one. Writing a test for an attribute containing a value as a token, so that it matches each of the four cases it should match above but neither of the two it shouldn't, would be very fiddly, and this need comes up a lot (the example above corresponds to the CSS selector div.row, for example).
Also, note that while a very common use case for this function is with attribute values, it operates on any string, so it could also be applied to element text, the result of another string function, an entire imported document, etc.
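To make the "fiddly" part concrete, here is a minimal Python sketch of the matching rule described above. It is an emulation for illustration, not XPath itself: the function name contains_token and the sample class values are made up, and it assumes plain codepoint comparison with tokens split on XML whitespace.

```python
import re

# XML whitespace characters: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def contains_token(value: str, token: str) -> bool:
    """Rough emulation of contains-token for one string (assumption: codepoint comparison)."""
    token = token.strip(" \t\r\n")
    if not token:
        # An empty token never matches.
        return False
    tokens = [t for t in _XML_WS.split(value) if t]
    return token in tokens


# Hypothetical class attribute values, invented for this example.
samples = ["row", "row header", "header row", "  header   row  footer ", "rows", "my-row"]

token_matches = [s for s in samples if contains_token(s, "row")]
# A naive substring test also accepts "rows" and "my-row".
substring_matches = [s for s in samples if "row" in s]
```

The substring test is the obvious shortcut, and it accepts values that merely contain the letters "row", which is the kind of false positive a token test avoids.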
Really it's a matter of what you care about. Is your query "I want to match all <div>s that have a class attribute of 'row'", or is it "I want to match all <div>s that have a class attribute that contains the token 'row'"? In HTML or XHTML, considering how class is used, we'd probably be in the latter case most of the time.
In a way, they already are in combination: you are using [] and @ to identify nodes that have a particular attribute, and then using the contains-token function to specify how the values of those attributes are filtered.
We generally wouldn't care to do both an = test and a contains-token test on the same attribute, as the = should generally suffice; if we have a requirement on what the entire contents of the attribute must be, then any requirement on which tokens are present is entailed by it. Of course, all sorts of surprising rare cases can happen in coding, especially when we are bringing two or more separate criteria together. It's also more common to have both types working on separate attributes: a query would use = on one attribute and contains-token on another.
(Again, contains-token isn't really a type of index; it's a string function that works on strings and is often useful within indices.)
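As for the two indexes themselves, one way to picture the difference is by what each one is keyed on. The following is only a conceptual Python sketch, not how BaseX stores its indexes internally; the node ids and class values are hypothetical, and str.split() is used as a stand-in for whitespace tokenization.

```python
from collections import defaultdict

# Hypothetical (node id, class value) pairs standing in for attribute nodes.
attributes = [
    (1, "row"),
    (2, "row header"),
    (3, "header"),
    (4, "rows"),
]

# Attribute-style index: keyed by the whole attribute value.
attribute_index = defaultdict(list)
# Token-style index: keyed by each whitespace-separated token of the value.
token_index = defaultdict(list)

for node_id, value in attributes:
    attribute_index[value].append(node_id)
    for token in set(value.split()):
        token_index[token].append(node_id)

# Lookup in the spirit of @class = 'row'
equal_hits = attribute_index.get("row", [])
# Lookup in the spirit of contains-token(@class, 'row')
token_hits = token_index.get("row", [])
```

In this picture, an equality test is a single lookup on the full value, while a token test is a single lookup on one token, so each kind of query is served by the structure keyed the way it asks its question.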