How to speed up Spark read of Veeva CRM

I am reading data from Veeva CRM using Spark in Databricks, via spark.read.format("springml...."). I am not entirely sure how the connector works: does this read happen on a single thread, as is the case with a JDBC read, or does it run in parallel? Is there any way to speed up the read?

I tried setting numPartitions on a partition key, but I don't know whether Veeva CRM indexes any of its columns. This didn't speed up the read.

There is 1 best solution below

There is always a tradeoff when you speed things up. A single-threaded read is likely the safer default, because it keeps your Veeva CRM from being hammered with connections and data requests. If you do want more parallelism, you could use the same trick that is used to speed up JDBC-style reads: divide the data you need into partitions, call mapPartitions, and make manual JDBC calls from inside the function you pass to it to pull each partition's share of the data. The calls have to be manual because you can't use the Spark context inside mapPartitions.

Be careful which partitioning strategy you choose, as too many concurrent requests could effectively DoS your Veeva CRM. Experiment with this, but err on the side of caution if it's an operational system.
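
A minimal PySpark sketch of the mapPartitions approach described above may make it more concrete. Everything in it is an assumption made for illustration: it presumes the data can be sliced by a numeric key range, and split_ranges, fetch_range and parallel_read are hypothetical helper names, not part of Spark or of any Veeva connector. fetch_range is only a stub that stands in for whatever manual call you would make for one slice.

```python
from typing import Dict, Iterator, List, Tuple


def split_ranges(lower: int, upper: int, num_slices: int) -> List[Tuple[int, int]]:
    """Split [lower, upper) into at most num_slices contiguous half-open ranges."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    step, extra = divmod(upper - lower, num_slices)
    ranges = []
    start = lower
    for i in range(num_slices):
        # The first `extra` slices take one additional key each.
        end = start + step + (1 if i < extra else 0)
        if end > start:  # skip empty slices
            ranges.append((start, end))
        start = end
    return ranges


def fetch_range(bounds: Tuple[int, int]) -> Iterator[Dict[str, int]]:
    """Hypothetical stub: replace the body with your own manual call
    that pulls the rows whose key falls in [lo, hi)."""
    lo, hi = bounds
    for key in range(lo, hi):
        yield {"id": key}


def fetch_partition(ranges: Iterator[Tuple[int, int]]) -> Iterator[Dict[str, int]]:
    """Runs on an executor. It must not touch the SparkSession or SparkContext."""
    for bounds in ranges:
        yield from fetch_range(bounds)


def parallel_read(spark, lower: int, upper: int, num_slices: int):
    """Driver side: one Spark partition per key range, each fetched independently."""
    ranges = split_ranges(lower, upper, num_slices)
    rdd = spark.sparkContext.parallelize(ranges, len(ranges))
    return rdd.mapPartitions(fetch_partition)
```

In a Databricks notebook this could be called as parallel_read(spark, 0, 1_000_000, 4). Here num_slices is the upper bound on how many requests hit the source system at once, so start small and raise it gradually.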