I'm working on a project that needs to index a bunch of Products and their Variants into Elasticsearch. Variants have the same schema as Products in the DB, so naturally I started by designing a mapping identical to the Product schema and indexing products and variants as their own documents.
But later, when I happened to try indexing variants as nested objects inside products, the indexing process was 3x-5x faster (tested several times locally with 1000 products & 5 variants, 2000 products & 10 variants, and 25000 products & 5 variants). The mapping looks something like this:
id: keyword
name: text
sku: keyword
price: long
color: keyword
...
variants: [
  {
    id: keyword
    name: text
    sku: keyword
    price: long
    color: keyword
    ...
  }
]
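Written out as an actual Elasticsearch 6.x mapping definition, it's roughly this (field list trimmed, variants declared as a nested type):

PUT /products
{
  "mappings": {
    "_doc": {
      "properties": {
        "id":    { "type": "keyword" },
        "name":  { "type": "text" },
        "sku":   { "type": "keyword" },
        "price": { "type": "long" },
        "color": { "type": "keyword" },
        "variants": {
          "type": "nested",
          "properties": {
            "id":    { "type": "keyword" },
            "name":  { "type": "text" },
            "sku":   { "type": "keyword" },
            "price": { "type": "long" },
            "color": { "type": "keyword" }
          }
        }
      }
    }
  }
}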
So the question is: why? Since the data size is the same, I would have expected the nested mapping to take longer to index, if anything, because it doubles the number of fields. Also, I'm using the _bulk API to index products together with their variants in each API call, so the request count is the same in both cases.
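For reference, the bulk requests look roughly like this in the two setups (index/type names, IDs and values here are made up). With separate documents, every variant is its own action:

POST /products/_doc/_bulk
{ "index": { "_id": "p1" } }
{ "id": "p1", "name": "Basic Tee", "price": 1999, "color": "red" }
{ "index": { "_id": "p1-v1" } }
{ "id": "p1-v1", "name": "Basic Tee S", "sku": "TEE-S-RED", "price": 1999, "color": "red" }
{ "index": { "_id": "p1-v2" } }
{ "id": "p1-v2", "name": "Basic Tee M", "sku": "TEE-M-RED", "price": 1999, "color": "red" }

With the nested mapping, the same data goes in as a single action per product (the source has to stay on one line in the real request body):

POST /products/_doc/_bulk
{ "index": { "_id": "p1" } }
{ "id": "p1", "name": "Basic Tee", "price": 1999, "color": "red", "variants": [ { "id": "p1-v1", "sku": "TEE-S-RED", "price": 1999, "color": "red" }, { "id": "p1-v2", "sku": "TEE-M-RED", "price": 1999, "color": "red" } ] }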
Thanks in advance for any suggestions on why this is.
PS: I'm running Elasticsearch 6.7 locally.
Just trying to answer the question, "Why is the indexing time different?"
Nested documents are indexed differently. Internally, nested documents are indexed as separate documents, but as a single block within Lucene.
Suppose your document contains two variants in the nested structure. In that case, the total number of documents indexed will be 3 (1 parent doc + 2 variants as separate docs), written internally through Lucene's addDocuments() call. This guarantees the documents are indexed in a single block and can be queried with a nested query (the nested query joins these documents at query time). That results in different indexing behavior. In your case it got faster, but if you had, say, thousands of variants per product, too many nested documents could give you indexing problems; there are limits in place to guard against that.
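For example, with the mapping from the question, a nested query would look something like this (field names taken from the question, values made up); Elasticsearch joins the parent and its nested variant documents at query time:

GET /products/_search
{
  "query": {
    "nested": {
      "path": "variants",
      "query": {
        "bool": {
          "must": [
            { "term":  { "variants.color": "red" } },
            { "range": { "variants.price": { "lte": 2000 } } }
          ]
        }
      }
    }
  }
}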