Querying DynamoDB global secondary index with values IN a list

104 Views Asked by At

I have a DynamoDB table with many attributes and a global_secondary_index defined on one of them. This table contains millions of items, and I need to do a query which on SQL would be something like: Select * from table where field1 is IN (value1,..., valuen). field1 is the indexed one.

Note:

  • I have 10k value to check in the IN
  • I don't want to scan the table, but use the global secondary index

So far I have not been able to find any easy way to do that, either via DynamoDB AWS console or via the AWS programmatic API.

I have noticed that PartiQL offers such syntax, and I am curious if it does a table scan under the hood or just fetch the items one by one. (Anyway, PartiQL is currently not available for me)

As a workaround, I did a script to return the items one by one.

I have read such post, which does not help here: Querying DynamoDB with a partition key and list of specific sort keys

Finally I would also be curious to understand possibly why such feature does not exist natively (what are the technical constraints if any)?

Thanks

2

There are 2 best solutions below

0
Charles On

The way PartQL would do the following

Select * 
from "ddbTable"."mygsi"
where field1 is IN (value1, value2)

Is simply that behind the scenes it basically ends up as two queries to DDB.

I can't find any documented limit with a quick search, but I suspect that 10K values would be outside that limit.

The thing to understand is that a Query in DDB can only touch 1 partition. A "table scan" is broken into individual scans of each partition; sequentially by default but can be done in parallel

Also that a GSI is basically just a separate DDB table with slightly different rules (duplicate Primary Keys allowed) than the main table.

So in short, you need to write some code that would loop through the 10K values and run a query of the GSI for each value.

4
Leeroy Hannigan On

PartiQL will be broken up into individual Query operations under the hood. However, because it needs to keep a pointer to the LastEvaluatedKey then it must do each Query operation sequentially. That means that it is a slower operation than firing 10k Async requests to DynamoDB on your own.

The limit of number of values in the IN clause is 50. So a maximum of 50 keys, which can return 0-N items each.

You can use an IAM policy which will prevent PartiQL from conducting a Scan, which will safeguard you from writing a bad PartiQL Query.

Be sure to set your index explicitly, DynamoDB does not have a SQL optimiser.