I'm looking for a MarkLogic cts query that I can use as a source query in Data Hub to return the distinct values on a combination of json property paths.
For example, I 200K+ docs with this structure:
{
"name": "Bentley University",
"unit": null,
"type": "University or College",
"location": {
"state": {
"code": "MA",
"name": "Massachusetts"
},
"division": "New England",
"region": "Northeast",
"types": [
"School-College",
"University"
]
}
}
I would like to have cts query that returns the distinct name + location/state/code values.
I've tried using cts.jsonPropertyScope() but that return entire docs. I just want the distinct values returned.
If you had a range index on those two fields, then you could do this very easily with
cts.valueCoOcurrences().For example, with path-range-indexes on the
/nameand/location/state/codethen the query would look like this:Without indexes, then the brute force method that reads all of the documents would look something like this:
But it might be slow, and you run the risk of blowing an Expanded Tree Cache error if all of the docs can't be read at once.
You could also do some sort of iterative search to sample some documents, add their values to the set, and then use the accumulated values to exclude docs in the next search until it no longer returns any documents or hits some limit of number of searches.