How to use spark for map-reduce flow to select N columns, top M rows of all csv files under a folder?


To be concrete, say we have a folder with 10,000 tab-delimited csv files in the following format (each csv file is about 10 GB):

id  name    address city...
1   Matt    add1    LA...
2   Will    add2    LA...
3   Lucy    add3    SF...
...

And we have a lookup table keyed on the "name" column above:

name    gender
Matt    M
Lucy    F
...

Now we want to take the top 100,000 rows of each csv file and output them in the following format:

id  name    gender
1   Matt    M
...

Can we use pyspark to handle this efficiently?

How can we process these 10k csv files in parallel?


BEST ANSWER

You can do that in pyspark to extract the first 1000 lines of each of your files:

top1000 = sc.textFile("YourFile.csv").map(lambda line: line.split("\t")).take(1000)
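
As a follow-up, here is a rough sketch of how one might extend the same idea to all of the files and also join the gender lookup table, using the DataFrame API instead of the RDD API. The paths ("/data/csvs", "/data/lookup.tsv", "/data/output") are hypothetical placeholders, not from the original question:

# A sketch: take the first 100,000 rows of each csv, join the name -> gender
# lookup, and keep only id, name, gender. Paths are hypothetical placeholders.
import glob
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("top-rows-with-gender").getOrCreate()

# The lookup table is small, so broadcasting it avoids shuffling the big files.
lookup = (spark.read
          .option("sep", "\t")
          .option("header", True)
          .csv("/data/lookup.tsv"))          # columns: name, gender

# glob assumes the files sit on a local/POSIX path; on HDFS or S3 you would
# list the directory with the appropriate client instead.
for path in glob.glob("/data/csvs/*.csv"):
    df = (spark.read
          .option("sep", "\t")
          .option("header", True)
          .csv(path)
          .limit(100000))                    # roughly the top 100,000 rows of this file
    joined = (df.join(F.broadcast(lookup), on="name", how="inner")
                .select("id", "name", "gender"))
    out_path = os.path.join("/data/output", os.path.basename(path))
    (joined.write
           .mode("overwrite")
           .option("sep", "\t")
           .option("header", True)
           .csv(out_path))

Each loop iteration launches its own Spark job, so the files are submitted one after another from the driver while Spark parallelizes the work inside each file; if you also want several files in flight at once, you could submit the per-file jobs from a thread pool on the driver.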