I have millions of sentences that I want to encode with a model from sentence-transformers (a PyTorch model): https://www.sbert.net/
My plan is to use PySpark with `applyInPandas` (https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html). Since I have no natural grouping key, I am creating a fake group column so I can group by it and apply the encoding function to each group, roughly as in the sketch below.
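For concreteness, here is a minimal sketch of what I have in mind. The model name, column names, file paths, and `NUM_GROUPS` value are all placeholders, not a working setup:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

NUM_GROUPS = 500  # the fake-group cardinality I'm unsure about (100s vs 1,000s)

# Input is assumed to have a single string column named "sentence".
df = spark.read.parquet("sentences.parquet")
df = df.withColumn("group_id", (F.rand() * NUM_GROUPS).cast("int"))

schema = StructType([
    StructField("sentence", StringType()),
    StructField("embedding", ArrayType(FloatType())),
])

def encode(pdf: pd.DataFrame) -> pd.DataFrame:
    # The model is loaded inside the function, so as far as I can tell each
    # call pays the ~1 GB load cost unless Spark caches it somewhere --
    # this is exactly what my first question is about.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    embeddings = model.encode(pdf["sentence"].tolist(), batch_size=256)
    return pd.DataFrame({
        "sentence": pdf["sentence"],
        "embedding": embeddings.tolist(),
    })

result = df.groupBy("group_id").applyInPandas(encode, schema=schema)
result.write.parquet("embeddings.parquet")
```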
Questions:
- How does this function work behind the scenes? Is the model object shipped to each worker once and shared by all executors on it, or does every executor need its own copy?
- Given that the model is ~1 GB, does it make sense to give the group column many distinct values (thousands) or fewer (hundreds)? My understanding is that each group gets processed separately. Does that mean an executor needs to fit an entire group in memory?
- Any advice on how to set the number of executors, partitions, or executor memory for this task? The configuration I'm starting from is sketched below.
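For the last question, this is my current starting point. All the numbers below are guesses rather than tuned values, which is exactly what I'd like advice on:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.instances", "8")
    .config("spark.executor.memory", "8g")   # room for the ~1 GB model + a full group + Arrow buffers?
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "500")  # matches NUM_GROUPS in the sketch above
    .getOrCreate()
)
```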