I have the following structure in my documents:
doc: 1
{
"123e4567-e89b-12d3-a456-426655440000": {
"order_id": "100",
"qty": 27
},
"321e7654-e89b-21d3-a654-426655441111": {
"order_id": "234",
"qty": 12
}
}
doc: 2
{
"123e4567-e89b-12d3-a456-426655440000": {
"order_id": "101",
"qty": 27
},
"789ab763-a56b-87bb-a654-873655442222": {
"order_id": "345",
"qty": 23
}
}
Where the uuid in the document root represents a customer identifier and the nested object represents an order the customer made.
The only query I care about is a simple single-field query on customer identifier and order identifier, to find a customer's orders:
customer_idx?q=*:*&fq=123e4567-e89b-12d3-a456-426655440000.order_id:*&sort=123e4567-e89b-12d3-a456-426655440000.order_id asc&rows=3
or a particular one:
customer_idx?q=*:*&fq=123e4567-e89b-12d3-a456-426655440000.order_id:101&rows=1
Question: from a performance perspective, would it be OK to define the dynamicField on the customer identifier? In this case I will end up with hundreds of thousands or millions of fields for a particular schema.
<dynamicField name="*.order_id" type="string" indexed="true" stored="false" multiValued="false" />
I understand that a large number of indexed fields would have an impact on performance and memory consumption if I used many of them in a single query, since Lucene creates an array with one item per document for every field I query or sort on. But would it be a problem to have hundreds of thousands or millions of fields if I only ever query one of them at a time?
If not, what would be a better solution?
Thanks.
UPDATE: updated the query examples to add a filter, sort and row limit, in case it matters.
The main problem with queries like these comes when you start to sort the result set. The FieldCache (which you may be able to avoid if you're using docValues now) gets populated with one int per document in the index, describing that document's position, and even if only a small number of documents have the field, the whole array is generated. There was a patch available to create a sparse list instead, listing only those documents that do contain the field.
Anyhow, the easy fix is to transform your data structure to only use a single field for each query type:
.. so you get one cache for each field regardless of how many fields you have.
You can also break each of your documents into two separate documents, one for each customer/order_id combination, and then query them as regular documents (instead of having two values inside each document).
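As a rough sketch of that restructuring (not necessarily the exact layout intended above), the snippet below flattens one of the nested documents from the question into one document per customer/order pair. The field names `id`, `customer_id`, `order_id` and `qty`, and the helper names, are assumptions made for illustration; they would need matching field definitions in your schema.

```python
def flatten_orders(nested_doc):
    """Split {customer_uuid: {"order_id": ..., "qty": ...}} into flat documents.

    Each output document uses the same fixed field names, so the index only
    ever has a handful of fields regardless of the number of customers.
    """
    flat_docs = []
    for customer_id, order in nested_doc.items():
        flat_docs.append({
            # Hypothetical unique key: customer UUID plus order id.
            "id": f"{customer_id}_{order['order_id']}",
            "customer_id": customer_id,   # assumed field name
            "order_id": order["order_id"],
            "qty": order["qty"],
        })
    return flat_docs


def order_query_params(customer_id, order_id=None, rows=3):
    """Build query parameters equivalent to the queries in the question."""
    filters = [f'customer_id:"{customer_id}"']
    if order_id is not None:
        filters.append(f'order_id:"{order_id}"')
    return {
        "q": "*:*",
        "fq": filters,             # one filter query per fixed field
        "sort": "order_id asc",    # sorts on a single shared field
        "rows": rows,
    }


doc1 = {
    "123e4567-e89b-12d3-a456-426655440000": {"order_id": "100", "qty": 27},
    "321e7654-e89b-21d3-a654-426655441111": {"order_id": "234", "qty": 12},
}

flat = flatten_orders(doc1)
params = order_query_params("123e4567-e89b-12d3-a456-426655440000", "101", rows=1)
```

With this layout, sorting always goes through the single `order_id` field, so only one cache entry is built for it instead of one per customer-specific field.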