Unable to load data from Cloudant into Python/Spark dataframe in Watson Studio Notebook


I am trying to load data from a Cloudant DB into a Python/Spark dataframe in a Python with Spark environment in Watson Studio. I followed the steps mentioned in this link and am stuck at Procedure 3, Step 5. I already have a Cloudant DB named 'twitterdb' and am trying to load data from it.

[Error screenshot when loading the data from the Cloudant DB]


1 Answer


Looking at the error, I see that you must have installed a Cloudant connector that does not match the Spark version available on Spark as a Service on IBM Cloud. Spark as a Service offers Spark version 2.1.2.

Now, one of the steps in the tutorial instructs you to install the Spark Cloudant package:

pixiedust.installPackage("org.apache.bahir:spark-sql-cloudant_2.11:0")

which I think installs the wrong version of the Spark Cloudant connector, as the error states it is trying to use:

/gpfs/global_fs01/sym_shared/YPProdSpark/user/s97c-0d96df4a6a0cd8-8754c7852bb5/data/libs/spark-sql-cloudant_2.11-2.2.1.jar

The right version to install/use would be https://mvnrepository.com/artifact/org.apache.bahir/spark-sql-cloudant_2.11/2.1.2
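The version mismatch is visible in the Maven coordinate itself, which packs the group, artifact, Scala binary version, and connector version into one string. A minimal sketch of how to read such a coordinate (pure Python; the coordinate strings are the ones from this answer, and `parse_coordinate` is a hypothetical helper, not part of pixiedust or Bahir):

```python
def parse_coordinate(coord):
    # A Maven coordinate has the form group:artifact:version.
    group, artifact, version = coord.split(":")
    # By convention, the suffix after the last underscore in the artifact
    # is the Scala binary version the jar was built against
    # (e.g. spark-sql-cloudant_2.11 -> Scala 2.11).
    name, _, scala_binary = artifact.rpartition("_")
    return {"group": group, "artifact": name,
            "scala_binary": scala_binary, "version": version}

wrong = parse_coordinate("org.apache.bahir:spark-sql-cloudant_2.11:2.2.1")
right = parse_coordinate("org.apache.bahir:spark-sql-cloudant_2.11:2.1.2")
# The connector version should track the Spark version of the service
# (2.1.x here), which is why the 2.2.1 jar fails against Spark 2.1.2.
print(wrong["version"], right["version"])  # → 2.2.1 2.1.2
```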

The important part is that a Spark Cloudant connector is already installed by default, under /usr/local/src/dataconnector-cloudant-2.0/spark-2.0.0/libs/.

You should uninstall your user-installed package using pixiedust.

pixiedust.packageManager.uninstallPackage("org.apache.bahir:spark-sql-cloudant_2.11:2.2.1")

Then restart the kernel and use the Cloudant connector as described below to read from your Cloudant database.

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Cloudant Spark SQL Example in Python using dataframes")\
    .config("cloudant.host","ACCOUNT.cloudant.com")\
    .config("cloudant.username", "USERNAME")\
    .config("cloudant.password","PASSWORD")\
    .config("jsonstore.rdd.partitions", 8)\
    .getOrCreate()

# 1. Load a dataframe from a Cloudant database
df = spark.read.load("n_airportcodemapping", "org.apache.bahir.cloudant")
df.cache()
df.printSchema()
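Since the session config above is pasted with ALL-CAPS placeholders, a quick sanity check before building the session can save a confusing connection error. A minimal sketch, assuming the same config keys held in a plain dict (`validate_cloudant_config` is a hypothetical helper, not part of the connector):

```python
REQUIRED_KEYS = ("cloudant.host", "cloudant.username", "cloudant.password")

def validate_cloudant_config(conf):
    # Flag keys that are missing, empty, or still contain the
    # ALL-CAPS placeholders from the example snippet above.
    placeholders = {"ACCOUNT.cloudant.com", "USERNAME", "PASSWORD"}
    problems = []
    for key in REQUIRED_KEYS:
        value = conf.get(key, "")
        if not value or value in placeholders:
            problems.append(key)
    return problems

conf = {"cloudant.host": "ACCOUNT.cloudant.com",
        "cloudant.username": "alice",
        "cloudant.password": "s3cret"}
print(validate_cloudant_config(conf))  # → ['cloudant.host']
```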

Ref:- https://github.com/apache/bahir/tree/master/sql-cloudant

Thanks, Charles.