PySpark map function - send n rows instead of one to build a list


I am using Spark 3.x with Python. I have millions of rows of data in CSV files that I have to index in Apache Solr. I am using the pysolr module for this purpose:

import pysolr

def index_module(row):
    ...
    solr_client = pysolr.Solr(SOLR_URI)
    solr_client.add(row)
    ...

df = spark.read.format("csv") \
    .option("sep", ",") \
    .option("quote", "\"") \
    .option("escape", "\\") \
    .option("header", "true") \
    .load("sample.csv")

df.toJSON().map(index_module).count()

The index_module function simply takes one row of the data frame as JSON and indexes it in Solr via pysolr. pysolr supports indexing a list of documents instead of a single one. I have to update my logic so that instead of sending one document per request, I send a list of documents, which will definitely improve performance.

How can I achieve this in PySpark? Is there a better approach than map and toJSON?
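Something like the following is what I have in mind, but it is only an untested sketch: mapPartitions instead of map, with each partition accumulating rows into a batch before calling add. The helper name index_partition and the batch size of 1000 are just placeholders I made up.

import json
import pysolr

BATCH_SIZE = 1000  # arbitrary, not tuned

def index_partition(rows):
    # one Solr client per partition instead of one per row
    solr_client = pysolr.Solr(SOLR_URI)
    batch = []
    for row in rows:
        batch.append(json.loads(row))  # rows arrive as JSON strings from toJSON()
        if len(batch) >= BATCH_SIZE:
            solr_client.add(batch)     # pysolr accepts a list of documents
            yield len(batch)           # emit something so the action has rows to count
            batch = []
    if batch:                          # flush the remaining documents
        solr_client.add(batch)
        yield len(batch)

df.toJSON().mapPartitions(index_partition).count()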

Also, all of my work is done in transformation functions; I am using count just to trigger the job. Is there an alternative dummy function (of action type) in Spark to do the same?

Finally, I have to create a Solr object each time. Is there an alternative to this?
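My guess is that foreachPartition could cover both of these points, since it is itself an action and the partition function can create a single Solr client that is reused for the whole partition. Again, just a sketch with a made-up helper name:

def index_partition_foreach(rows):
    # same batching logic as above, but returns nothing (meant for foreachPartition)
    solr_client = pysolr.Solr(SOLR_URI)   # created once per partition, not per row
    batch = []
    for row in rows:
        batch.append(json.loads(row))
        if len(batch) >= BATCH_SIZE:
            solr_client.add(batch)
            batch = []
    if batch:
        solr_client.add(batch)

# foreachPartition is an action, so no dummy count() is needed to start the job
df.toJSON().foreachPartition(index_partition_foreach)

Is this the right direction, or is there a more idiomatic way?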
