I am an Apache Spark/Redis user and recently tried spark-redis for a project. The program generates PySpark DataFrames with approximately 3 million rows, which I write to a Redis database using the command
df.write \
.format("org.apache.spark.sql.redis") \
.option("table", "person") \
.option("key.column", "name") \
.save()
as suggested on the DataFrame page of the GitHub project.
However, I am getting inconsistent write times for the same Spark cluster configuration (same number of EC2 instances and the same instance types). Sometimes the write finishes very quickly, sometimes it is far too slow. Is there any way to speed up this process and get consistent write times? I wonder whether it slows down when there are already a lot of keys in the database, but that should not be an issue for a hash table, should it?
This could be a problem with your partitioning strategy.
Check the number of partitions of df before writing and see whether there is a relation between the number of partitions and the execution time.
If so, partitioning df with a suitable partitioning strategy (repartitioning to a fixed number of partitions, or repartitioning based on a column value) should resolve the problem.
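For example, a minimal sketch along these lines (the partition count of 32 and the reuse of the "name" key column are illustrative assumptions; tune them to your cluster size and data):

# Check the current partition count of the DataFrame
num_partitions = df.rdd.getNumPartitions()
print("Partitions before write:", num_partitions)

# Option 1: repartition to a fixed number of partitions
df_fixed = df.repartition(32)

# Option 2: repartition based on a column value (here the key column)
df_by_key = df.repartition(32, "name")

df_fixed.write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "person") \
    .option("key.column", "name") \
    .save()

A roughly constant partition count per run should give you more consistent write times, since each Redis write task then handles a similar amount of data.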
Hope this helps.