How to set up spark-redis in Python?


I'm building a stream-processing app and I thought Redis could be a good addition. I'm trying to read the data files into DataFrames and then load them into Redis.
I've defined the SparkSession as follows:

    spark_session = SparkSession \
        .builder \
        .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.13.0") \
        .config("spark.jars.packages", "com.redislabs:spark-redis:2.3.0") \
        .config("spark.redis.host", "localhost") \
        .config("spark.redis.port", "6379") \
        .getOrCreate()
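
(Side note: I'm not sure whether chaining two .config calls with the same key keeps both values. Since spark.jars.packages takes a comma-separated list of coordinates, the single-call form would look like the sketch below, though I haven't verified that it behaves any differently:)

    from pyspark.sql import SparkSession

    # Both packages passed as one comma-separated list under a single key
    spark_session = SparkSession \
        .builder \
        .config("spark.jars.packages",
                "com.databricks:spark-xml_2.12:0.13.0,com.redislabs:spark-redis:2.3.0") \
        .config("spark.redis.host", "localhost") \
        .config("spark.redis.port", "6379") \
        .getOrCreate()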

Then I create the DataFrame:

    df = spark_session.read \
        .format('xml') \
        .options(rowTag='pm') \
        .load("data/traffic_data.xml", schema=custom_schema)
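
(For context, custom_schema is a StructType along these lines; the field names below are placeholders, not my real schema:)

    from pyspark.sql.types import StructType, StructField, StringType

    # Placeholder fields standing in for the real elements under each <pm> row
    custom_schema = StructType([
        StructField("id", StringType(), True),
        StructField("timestamp", StringType(), True),
        StructField("value", StringType(), True),
    ])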

Up to this point everything works correctly, but when I try to load the data into Redis the way the documentation shows (spark-redis docs):

    df.write \
        .format("org.apache.spark.sql.redis") \
        .option("table", "last24hour") \
        .save()

it gives me the error:

    java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.redis

Everything I've found suggests this error means the package isn't being provided, but I've added it via SparkSession.config, so I don't know where the problem is.
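
One check I can run is printing the package value that actually ended up in the runtime config:

    # If the two .config("spark.jars.packages", ...) calls overwrote each other,
    # this should print only one of the two coordinates.
    print(spark_session.sparkContext.getConf().get("spark.jars.packages"))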

I also tried launching pyspark --jars <path-to-jar> to read an example dataset and load it into Redis, and running the app with spark-submit --packages com.redislabs:spark-redis:2.3.0, but I got the same error.
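
(For the spark-submit route, the full command I'd expect to need, with both packages listed, is something like the sketch below; app.py is a placeholder for my actual entry point:)

    spark-submit \
        --packages com.databricks:spark-xml_2.12:0.13.0,com.redislabs:spark-redis:2.3.0 \
        app.py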
