Using a unique identifier in a Solr indexed field name

I have the following structure in my documents:

doc: 1

{
  "123e4567-e89b-12d3-a456-426655440000": {
    "order_id": "100",
    "qty": 27
  },
  "321e7654-e89b-21d3-a654-426655441111": {
    "order_id": "234",
    "qty": 12
  }
}

doc: 2

{
  "123e4567-e89b-12d3-a456-426655440000": {
    "order_id": "101",
    "qty": 27
  },
  "789ab763-a56b-87bb-a654-873655442222": {
    "order_id": "345",
    "qty": 23
  }
}

Here, each UUID key at the document root is a customer identifier, and the nested object is an order that customer made.

The only query I care about is a simple single-field query on the customer identifier and order identifier, for example to list a customer's orders:

customer_idx?q=*:*&fq=123e4567-e89b-12d3-a456-426655440000.order_id:*&sort=123e4567-e89b-12d3-a456-426655440000.order_id asc&rows=3

or to fetch a particular one:

customer_idx?q=*:*&fq=123e4567-e89b-12d3-a456-426655440000.order_id:101&rows=1

Question: would it be OK, from a performance perspective, to define a dynamicField keyed on the customer identifier? In that case I would end up with hundreds of thousands or even millions of fields in this schema.

<dynamicField name="*.order_id" type="string" indexed="true" stored="false" multiValued="false" />

I understand that a large number of indexed fields would hurt performance and memory consumption if I used many of them in a single query, since Lucene creates an array with one entry per document for every field I query or sort on. But would it be a problem to have hundreds of thousands or millions of fields if I only ever query one of them at a time?

If not, what would be a better solution?

Thanks.

UPDATE: updated the query examples to add filter, sort and limit, in case it matters.

There is 1 answer below.

The main problem with queries like these comes when you start sorting the result set. The FieldCache (which you may be able to avoid if you're using docValues now) gets populated with an int for each document in the index (keyed by docid) describing its position, and the whole array is generated even if only a small number of documents have the field. There was a patch available to create a sparse list instead, listing only those documents that do contain the field.
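To get a feel for the scale, here is a back-of-the-envelope sketch in Python. It takes the description above at face value (one 4-byte int per document for each field you sort on); the function name and the index size of 10 million documents are made up for illustration, and real usage depends on the Solr/Lucene version and on whether docValues are enabled.

```python
def sort_cache_bytes(num_docs: int, fields_sorted_on: int, bytes_per_entry: int = 4) -> int:
    """Rough estimate: one entry per document for every field that gets sorted on."""
    return num_docs * fields_sorted_on * bytes_per_entry


# Hypothetical index size, chosen only for illustration.
NUM_DOCS = 10_000_000

one_field = sort_cache_bytes(NUM_DOCS, 1)
thousand_fields = sort_cache_bytes(NUM_DOCS, 1_000)

print(one_field / 1e6, "MB")        # 40.0 MB
print(thousand_fields / 1e9, "GB")  # 40.0 GB
```

A single sort is cheap, but if sorts on many different per-customer fields accumulate in the cache over time, the total grows linearly with the number of distinct fields touched.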

Anyhow, the easy fix is to transform your data structure to only use a single field for each query type:

customer_id:123e4567-e89b-12d3-a456-426655440000
customer_id_order_id:123e4567-e89b-12d3-a456-426655440000_101

That way you get one cache per field, regardless of how many customers you have.
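As a rough sketch of that transformation, the Python below rewrites one of the original documents into the two shared fields. It assumes both fields are multi-valued (each original document holds two customers); the function names are hypothetical, and the quoting in the filter string is only a precaution for the hyphens in the UUID.

```python
def to_shared_fields(doc: dict) -> dict:
    """Collapse per-customer keys into two fixed, multi-valued fields."""
    customer_ids = []
    customer_order_ids = []
    for customer_id, order in doc.items():
        customer_ids.append(customer_id)
        # Same "<customer>_<order>" layout as in the example values above.
        customer_order_ids.append(f"{customer_id}_{order['order_id']}")
    return {
        "customer_id": customer_ids,
        "customer_id_order_id": customer_order_ids,
    }


def order_filter(customer_id: str, order_id: str) -> str:
    """Build the fq value that looks up one specific order of one customer."""
    return f'customer_id_order_id:"{customer_id}_{order_id}"'


# "doc: 2" from the question.
doc2 = {
    "123e4567-e89b-12d3-a456-426655440000": {"order_id": "101", "qty": 27},
    "789ab763-a56b-87bb-a654-873655442222": {"order_id": "345", "qty": 23},
}

flat = to_shared_fields(doc2)
print(flat["customer_id_order_id"][0])
print(order_filter("123e4567-e89b-12d3-a456-426655440000", "101"))
```

Note that this sketch drops qty; if you need it back from the index, you would have to store it in a field of its own.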

You can also split each of your documents into two separate documents, one for each customer/order_id combination, and then query them as regular documents (instead of having two values inside each document).
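A minimal sketch of that split, again in Python. The `id` layout, the function names, and the separate `order_id` and `qty` fields are assumptions made for illustration; only `customer_id` comes from the field names suggested above. Since order_id is a string in the original data, the sort here would be lexicographic.

```python
from urllib.parse import urlencode


def split_orders(doc_id: str, doc: dict) -> list:
    """Turn one nested document into one flat document per customer/order pair."""
    flat_docs = []
    for customer_id, order in doc.items():
        flat_docs.append({
            "id": f"{doc_id}_{customer_id}",  # hypothetical unique key layout
            "customer_id": customer_id,
            "order_id": order["order_id"],
            "qty": order["qty"],
        })
    return flat_docs


def customer_orders_params(customer_id: str, rows: int = 3) -> str:
    """Query string listing one customer's orders, sorted by order_id."""
    return urlencode({
        "q": "*:*",
        "fq": f'customer_id:"{customer_id}"',
        "sort": "order_id asc",
        "rows": rows,
    })


# "doc: 1" from the question.
doc1 = {
    "123e4567-e89b-12d3-a456-426655440000": {"order_id": "100", "qty": 27},
    "321e7654-e89b-21d3-a654-426655441111": {"order_id": "234", "qty": 12},
}

docs = split_orders("1", doc1)
params = customer_orders_params("123e4567-e89b-12d3-a456-426655440000")
print(len(docs), "documents")
print("customer_idx?" + params)
```

With this layout every query filters and sorts on the same small set of fields, so the number of indexed fields stays fixed no matter how many customers there are.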