How to estimate increase in size of data after creating a new range index in MarkLogic?


I want to create a new element range index in my ML db. How can I estimate the size of this new index? I am using ML 8.0-3.2.


There are 2 best solutions below

0
On

The best thing to do is to run a test on a representative sample of data and then extrapolate.
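As a minimal sketch of the extrapolation step, assuming the index grows roughly linearly with the number of documents (an assumption that may not hold for string indexes, where distinct values are shared): the function name and the figures below are hypothetical, not measurements.

```python
def extrapolate_index_bytes(sample_index_bytes, sample_docs, total_docs):
    """Scale an index size measured on a sample up to the full document count.

    Assumes linear growth with document count, which is only an approximation.
    """
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return sample_index_bytes * total_docs / sample_docs


# Hypothetical figures: 120 MB of index growth measured on 1M sample documents,
# scaled up to a 50M-document database.
estimate = extrapolate_index_bytes(120_000_000, 1_000_000, 50_000_000)
print(f"estimated index size: {estimate / 1e9:.1f} GB")
```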

String indexes share unique values and unique tokens within a stand, so their size depends heavily on the number of distinct values, which is hard to compute in advance.

For other data types, the size depends on the number of actual values in the content. If you knew that there were on average k values per document and N documents, you'd expect about 8*N*k bytes, or 16*N*k bytes if you have positions turned on. Float indexes are half this size; point indexes are double this size if you use double precision.
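The arithmetic above can be written out as a small calculator. This is only a sketch of the rule of thumb in this answer: the function name, the `kind` labels, and the example numbers are made up for illustration, and string indexes are deliberately left out because their size depends on distinct values.

```python
# Multipliers taken from the rule of thumb above; the labels are hypothetical.
TYPE_FACTOR = {
    "default": 1.0,       # 8 bytes per value
    "float": 0.5,         # float indexes are half the size
    "double-point": 2.0,  # double-precision point indexes are double
}


def estimate_range_index_bytes(num_docs, values_per_doc, positions=False, kind="default"):
    """Rough size in bytes of a non-string range index: 8*N*k, or 16*N*k with positions."""
    if kind not in TYPE_FACTOR:
        raise ValueError(f"unknown kind: {kind!r}")
    bytes_per_value = 16 if positions else 8
    return num_docs * values_per_doc * bytes_per_value * TYPE_FACTOR[kind]


# Example with made-up numbers: 10M documents, 2 values per document on average.
for kind in TYPE_FACTOR:
    size = estimate_range_index_bytes(10_000_000, 2, positions=False, kind=kind)
    print(f"{kind}: {size / 1e6:.0f} MB")
```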

1
On

Key data is stored in MARKLOGIC_DATA_DIR (depends on your install) in the subdirectory Forests/<Forest Name>/ along with the non-key data. The key and non-key data are dependent. If your intent is to estimate how much more disk space adding a new index will take, measure the size of all the forest directories for your database without that index, then add the index, measure again, and subtract.
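One way to take the before and after measurements is to sum the file sizes under each forest directory. The sketch below uses only the Python standard library; the helper names are hypothetical, the forest path has to be filled in for your install, and it assumes you take the second measurement only once the new index has been fully built.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def index_growth_bytes(size_before, size_after):
    """Disk space attributable to the new index: after minus before."""
    return size_after - size_before


# Usage (the path is a placeholder for MARKLOGIC_DATA_DIR/Forests/<Forest Name>):
#   before = dir_size_bytes(forest_dir)   # run before adding the index
#   ...add the index and wait for it to be built...
#   after = dir_size_bytes(forest_dir)
#   print(index_growth_bytes(before, after))
```

Repeat the measurement for every forest attached to the database and add up the differences.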

Yes, I know that doesn't sound much like an 'estimate'. Anything else is a rough guess.

For a 'rough guess', the answer is 'it depends', and any guess should be validated by trying it. Basically, a typical text index costs about 8 bytes for every (term, document) pair, so its size is roughly the number of distinct terms * 8 * the average number of documents containing each term.

Each index entry will contain at least one 64 bit value for each document containing that term. In addition it will (possibly sharing with other indexes) store an encoded version of that term.
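The rough guess above amounts to counting 8 bytes for every document that contains each term. A minimal sketch, assuming you already have a term-to-document-count mapping from a sample (the function name and the sample counts are hypothetical); it ignores the storage for the encoded terms themselves, so it is a lower bound on this rule of thumb:

```python
def estimate_text_index_bytes(docs_per_term, bytes_per_posting=8):
    """Sum one 64-bit (8-byte) value per (term, document) pair.

    docs_per_term maps each distinct term to the number of documents containing it.
    """
    return sum(docs_per_term.values()) * bytes_per_posting


# Made-up counts for illustration only.
sample = {"marklogic": 120_000, "index": 450_000, "forest": 30_000}
print(estimate_text_index_bytes(sample), "bytes")
```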

This 'rough guess' may be off by 10x or more, depending on the kind of index, the distribution of the data, compression, encryption, etc. Hence, you should really compare disk usage before and after adding a similar index.