I am an Apache Spark/Redis user and recently tried spark-redis for a project. The program generates PySpark DataFrames with approximately 3 million rows, which I write to a Redis database using the command
df.write \
.format("org.apache.spark.sql.redis") \
.option("table", "person") \
.option("key.column", "name") \
.save()
as suggested on the DataFrame page of the GitHub project.
However, I am getting inconsistent write times for the same Spark cluster configuration (same number of EC2 instances and the same instance types). Sometimes the write finishes very quickly, sometimes it is far too slow. Is there any way to speed up this process and get consistent write times? I wonder whether it slows down when there are already a lot of keys in the database, but that should not be an issue for a hash table, should it?
This could be a problem with your partitioning strategy.
Check the number of partitions of df before writing and see whether there is a relation between the number of partitions and the execution time.
If so, partitioning df with a suitable partitioning strategy (repartitioning to a fixed number of partitions, or repartitioning based on a column value) should resolve the problem.
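For example, a minimal sketch along these lines (the partition count of 32 and the reuse of the "name" key column are illustrative assumptions; tune them to your cluster size and data):

# Check the current partition count of the DataFrame
num_partitions = df.rdd.getNumPartitions()
print("Partitions before write:", num_partitions)

# Option 1: repartition to a fixed number of partitions
df_fixed = df.repartition(32)

# Option 2: repartition based on a column value (here the key column)
df_by_key = df.repartition(32, "name")

df_fixed.write \
    .format("org.apache.spark.sql.redis") \
    .option("table", "person") \
    .option("key.column", "name") \
    .save()

A roughly constant partition count per run should give you more consistent write times, since each Redis write task then handles a similar amount of data.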
Hope this helps.