dsbulk: loading in batches and improving throughput


I am running dsbulk to load CSV data into Cassandra. I tried with a CSV file that has 2 million records, and dsbulk took almost 1 hour 6 minutes to load it into the DB.

    total | failed | rows/s |  p50ms |  p99ms | p999ms | batches
2,000,000 |      0 |    500 | 255.65 | 387.97 | 754.97 |    1.00

This is what I see in the console output. I am trying to increase the number of batches and the rows/sec. I have added maxConcurrentQueries and bufferSize, but dsbulk still starts with a single batch and 500 rows/sec.

How can I improve the load performance for dsbulk?



harish bollina (accepted answer):

I tried the batching and other concurrency parameters with dsbulk but couldn't see any improvement. Instead, I used the DataStax driver's Cluster and Session API to create a session and executed batch statements through it.

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

Cluster cluster = Cluster.builder()
        .addContactPoints("0.0.0.0", "0.0.0.0")
        .withCredentials("userName", "pwd")
        .withSSL()
        .build();
Session session = cluster.connect("keySpace");

BatchStatement batchStatement = new BatchStatement();
batchStatement.add(new SimpleStatement("String query with JSON Data"));
session.execute(batchStatement);

I used an ExecutorService with 10 threads, each thread inserting 1000 queries per batch (sketched below).
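A minimal sketch of that setup (not my exact code): the queries list here is a placeholder for the generated INSERT statements, and session is the one created above.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: 10 worker threads, each executing a batch of 1000 statements.
// "queries" is assumed to hold all the INSERT statements (as CQL strings) to run.
ExecutorService executor = Executors.newFixedThreadPool(10);
int batchSize = 1000;

for (int start = 0; start < queries.size(); start += batchSize) {
    List<String> chunk = queries.subList(start, Math.min(start + batchSize, queries.size()));
    executor.submit(() -> {
        BatchStatement batch = new BatchStatement();
        for (String cql : chunk) {
            batch.add(new SimpleStatement(cql));
        }
        session.execute(batch);
    });
}

executor.shutdown();
executor.awaitTermination(1, TimeUnit.HOURS); // throws InterruptedException; handle it in real code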

I tried something like the above and it worked fine for my use case; I was able to insert 2 million records in 15 minutes. I am creating the insert queries using the CQL JSON keyword (INSERT ... JSON), building the JSON from the result set. We can also use executeAsync, in which case your application thread will finish in a minute or two, but the Cassandra cluster still took the same 15 minutes to write all the records.

To read the data from the source Sybase DB, I used jdbcTemplate.queryForList, which returns the records as a List<Map<String, Object>>; each map in that list can be converted to JSON using Jackson's ObjectMapper.writeValueAsString method.
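Roughly, the read-and-convert step looks like this (a minimal sketch; the SQL and the target table name are placeholders, and jdbcTemplate/session are assumed to be configured as above):

import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.datastax.driver.core.SimpleStatement;

ObjectMapper mapper = new ObjectMapper();

// Each row from the source table comes back as a Map of column name -> value.
List<Map<String, Object>> rows = jdbcTemplate.queryForList("SELECT * FROM source_table");

for (Map<String, Object> row : rows) {
    String json = mapper.writeValueAsString(row); // throws JsonProcessingException; handle it in real code
    // INSERT ... JSON lets Cassandra map the JSON fields onto the table columns,
    // so the source column names must match the target column names.
    session.execute(new SimpleStatement("INSERT INTO target_table JSON ?", json));
    // session.executeAsync(...) can be used instead so the application thread doesn't block.
}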

Hope this will be useful to someone.

Madhavan:

You need to post the full dsbulk load command that you used, the source and target Cassandra versions, and your hardware specs so this can be triaged efficiently.

Please see the comments on this answer.

If you're doing a single-threaded operation such as loading from a single file (i.e. -url /path/to/a/single_file.csv), there is not much we can do here to improve the throughput. One thing to try is allocating more memory to the DSBulk process itself via export DSBULK_JAVA_OPTS="-Xmx10G" prior to running your load command. See if that works in your environment, and make sure your target cluster is able to handle the increased load.
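For example (a minimal sketch; the file path, keyspace and table names are placeholders):

# Give the DSBulk JVM a larger heap, then run the load
export DSBULK_JAVA_OPTS="-Xmx10G"
dsbulk load -url /path/to/a/single_file.csv -k my_keyspace -t my_table -header true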