DynamoDBMapper how to get all items without pagination


I have about 780K items stored in DDB.

I'm calling the DynamoDBMapper.query(...) method to get all of them. The result is correct, because I do get all of the items, but it takes about 3 minutes.

From the log, I can see that DynamoDBMapper.query(...) fetches the items page by page; each page requires an individual query call to DDB that takes about 0.7 s. All of the items came back in 292 pages, so the total duration is about 0.7 * 292 = 200 s, which is unacceptable. My code is basically like below:

    // set up the query condition; after filtering, the item count is about 780K
    DynamoDBQueryExpression<VendorAsinItem> expression = new DynamoDBQueryExpression<>();
    expression = buildFilterExpression(filters, expression);

    List<VendorAsinItem> results = new ArrayList<>();
    try {
        log.info("yrena:Start query");
        DynamoDBMapperConfig config = getTableNameConfig();
        results = getDynamoDBMapper().query( // get DynamoDBMapper instance and call query method
                VendorAsinItem.class,
                expression,
                config);
    } catch (Exception e) {
        log.error("yrena:Error ", e);
    }
    log.info("yrena:End query. Size:" + results.size());

So how can I get all items at once, without pagination? My final goal is to reduce the query duration.

1 Answer

EDIT: Just re-read the title of the question and realized that perhaps I didn't address it head on: there is no way to retrieve 780,000 items without pagination, because of a hard limit of 1 MB per page of query results.


Long form answer

780,000 items retrieved in 3 minutes, using 292 pages: that's about 1.62 pages per second.

Take a moment and let that sink in.

DynamoDB can return at most 1 MB of data per page, so you're presumably transferring about 1.6 MB of data per second (roughly 13 Mbit/s, enough to saturate a 10 Mbit pipe).
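The arithmetic behind that estimate, as a quick sketch (numbers from the question; the ~1 MB per page figure is the documented cap, and the 180 s figure assumes the full 3 minutes was spent fetching):

```java
// Back-of-the-envelope throughput check:
// 292 pages in ~180 s, assuming full ~1 MB pages.
public class ThroughputEstimate {
    public static void main(String[] args) {
        int pages = 292;
        double seconds = 180.0;                    // ~3 minutes observed
        double pagesPerSecond = pages / seconds;   // ~1.62 pages/s
        double mbPerSecond = pagesPerSecond * 1.0; // ~1 MB per full page
        double mbitPerSecond = mbPerSecond * 8.0;  // ~13 Mbit/s on the wire
        System.out.println("pages/s = " + pagesPerSecond
                + ", MB/s = " + mbPerSecond
                + ", Mbit/s = " + mbitPerSecond);
    }
}
```

If your link to AWS is only around 10 Mbit, the network alone explains the observed duration.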

Without further details about (a) the actual size of the items retrieved, (b) the bandwidth of your internet connection, (c) the number of items that get filtered out of the query results, and (d) the provisioned read capacity on the table, I would start looking at:

  1. the network bandwidth between your client and DynamoDB/AWS -- if you are not maxing that out, move on to the next item;

  2. the read capacity provisioned on the table (if you see any throttling on the requests, you may be able to increase the RCU on the table to get a speed improvement, at a monetary cost);
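To see how much capacity a query like this needs, here is a rough sketch, assuming full ~1 MB pages and eventually consistent reads (1 RCU covers two 4 KB eventually consistent reads, i.e. 8 KB; strongly consistent reads would double the cost):

```java
// Rough RCU cost of reading 292 full ~1 MB pages with
// eventually consistent reads.
public class RcuEstimate {
    public static void main(String[] args) {
        int pages = 292;
        int pageKb = 1024;                   // ~1 MB read per page
        int kbPerRcu = 8;                    // eventually consistent: 8 KB per RCU
        int rcuPerPage = pageKb / kbPerRcu;  // 128 RCU per full page
        int totalRcu = pages * rcuPerPage;   // ~37,000 RCU for the whole query
        double sustained = totalRcu / 180.0; // ~208 RCU/s over the observed run
        System.out.println("RCU/page = " + rcuPerPage
                + ", total RCU = " + totalRcu
                + ", sustained RCU/s = " + sustained);
    }
}
```

So if the table is provisioned well below ~200 RCU, throttling (not bandwidth) may be the bottleneck.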

  3. the efficiency of your query:

  • if you are applying filters, know that they are applied after the query results are read, so the query consumes RCU for items that get filtered out -- which also makes the query inefficient

  • think about whether there are ways to optimize your queries so they read less data

Finally, 780,000 items is A LOT for a query -- what percentage of the items in the table is that?

Could you create a secondary index that would essentially contain most, or all of that data that you could then simply scan instead of querying?

Unlike a query, a scan can be parallelized, so if your network bandwidth, memory, and local compute are large enough, and you're willing to provision enough capacity on the table, you could read 780,000 items significantly faster than with a query.
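A minimal sketch of such a parallel scan using DynamoDBMapper's built-in parallelScan (AWS SDK for Java v1). It reuses VendorAsinItem, getDynamoDBMapper() and getTableNameConfig() from the question; the segment count is an assumption you would tune, and this is not runnable without the AWS SDK on the classpath:

```java
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBScanExpression;
import java.util.List;

// Sketch only: parallelScan splits the table into totalSegments key ranges
// and scans them concurrently with an internal thread pool.
int totalSegments = 8; // assumption: tune to client cores, bandwidth and table RCU

DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();

List<VendorAsinItem> results = getDynamoDBMapper().parallelScan(
        VendorAsinItem.class,
        scanExpression,
        totalSegments,
        getTableNameConfig());
```

Note that the returned list still loads lazily page by page as you iterate it, just like the query result -- the speedup comes from the segments being fetched in parallel rather than from avoiding pagination.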