I have a Cosmos DB collection with 4 million documents (~5GB). The following query reports a charge of 2.79 RUs:
SELECT * FROM c WHERE c.type='type1' and STRINGEQUALS(c.name,'abc',false)
But the same query with case-insensitive search (by replacing false
by true
) costs 1228 RUs.
Is there an explanation for why the case-insensitive query is more than 470 times more expensive than the case-sensitive query? I'm surprised by this because the documentation states
The RU charge for StartsWith and StringEquals is slightly higher with the case-insensitive option than without it.
Details:
- Both queries return 0 results.
- The partition key is
type
. - The logical partition
type1
contains 2 million documents. - The
name
property has a different value for almost all 2 million documents. - The default indexing strategy is used (
"path": "/*"
)
This is what I found out so far:
Does Cosmos DB support efficient case-insensitive string comparison?
Unfortunately, the RU charge for case-insensitive
STRINGEQUALS
seems to be linear in the cardinality of the property (i.e. the number of different values for that property). Which is really, really bad if you have lots of documents. The query above takes almost 1 s at a throughput of 10,000 RU/s. In contrast, case-sensitive string comparison is independent of the size of the collection. See also this discussion.What if I need efficient case-insensitive string comparisons?
For small collections (< 10,000 docs) case-sensitivity doesn't make that much of a difference. (And also of course if the inclusion of the partition key restricted the size of the potential result set to a much smaller number.)
For larger collections you could store a duplicate of each property that should support efficient case-insensitive search in lower case and do a case-sensitive search on the lower-case property instead.
You can vote for the Feature Request to support efficient case-insensitive queries here.