I want to parse a series of `.csv` files using `spark.read.csv`, but I want to include the row number of each line inside the file.
I know that Spark typically doesn't order DataFrames unless explicitly told to do so, and I don't want to write my own `.csv` parser, since that would be substantially slower than Spark's own implementation. How can I add this row number in a distributed-safe fashion?
From reading about `zipWithIndex`, it seems like it could be useful, but it unfortunately appears to require the partition ordering to be stable.
Let's assume we have the following setup, which is used to create a `.csv` file on disk with contents that we control:
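A minimal sketch of such a setup; plain Python suffices to write the file. The path `/tmp/demo/people.csv`, the helper name `write_csv`, and the sample rows are illustrative assumptions:

```python
import csv
import os

def write_csv(path, header, rows):
    # Write a small .csv file whose contents we fully control, so that
    # later reads are reproducible.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

write_csv(
    "/tmp/demo/people.csv",
    header=["name", "age"],
    rows=[["alice", 34], ["bob", 29], ["carol", 41]],
)
```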
In this setup, we can reproducibly create `.csv` files and save them to disk, then retrieve them as one would when parsing them for the first time.

Our strategy to parse these files comes down to the following:
- parse each file with Spark's own CSV reader, then attach a per-row index with `zipWithIndex`

This looks like the following:
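A sketch of that strategy, assuming PySpark; the path comes from the setup above, and the variable names (`with_index`, `result`) are illustrative:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

path = "/tmp/demo/people.csv"  # the illustrative file written above

# Let Spark's own CSV parser handle quoting, escaping, and type inference.
df = spark.read.csv(path, header=True, inferSchema=True)

# zipWithIndex assigns each element its position 0..N-1 in partition order.
# Reading a single file with no shuffle in between keeps that order aligned
# with the file's record order in practice, although Spark does not formally
# guarantee DataFrame ordering.
with_index = (
    df.rdd
      .zipWithIndex()
      .map(lambda pair: Row(**pair[0].asDict(), row_number=pair[1]))
)

result = spark.createDataFrame(with_index)
result.show()
```

Note that with `header=True` the index counts data records starting at 0, not physical file lines; the header line (and any quoted multiline fields) can make the two diverge, so shift or post-process the index if you need literal line numbers.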