To be concrete, say we have a folder of 10k tab-delimited CSV files in the following format (each file is about 10 GB):
id  name  address  city  ...
1   Matt  add1     LA   ...
2   Will  add2     LA   ...
3   Lucy  add3     SF   ...
...
And we have a lookup table keyed on the "name" column above:
name  gender
Matt  M
Lucy  F
...
Now we want to output the top 100,000 rows of each CSV file in the following format:
id  name  gender
1   Matt  M
...
Can we use PySpark to handle this efficiently?
How can we process these 10k CSV files in parallel?
You can do that in plain Python by reading only the first 100,000 lines of each file:
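A minimal sketch of that approach, assuming the files are tab-delimited with a header row and the lookup table is small enough to hold in memory; `data_dir`, `lookup.tsv`, and `out_dir` are placeholder paths. `itertools.islice` stops after the first 100,000 rows, so the 10 GB files are never read in full, and `multiprocessing.Pool` fans the 10k files out across CPU cores:

```python
import csv
import glob
import os
from itertools import islice
from multiprocessing import Pool

DATA_DIR = "data_dir"       # placeholder: folder holding the 10k CSV files
LOOKUP_PATH = "lookup.tsv"  # placeholder: the small name -> gender table
OUT_DIR = "out_dir"         # placeholder: destination folder
N_ROWS = 100_000            # rows to take from the top of each file

def load_lookup(path):
    # The lookup table is small, so a plain dict is enough.
    with open(path, newline="") as f:
        return {row["name"]: row["gender"]
                for row in csv.DictReader(f, delimiter="\t")}

def process_file(args):
    path, lookup = args  # lookup is small, so pickling it per task is cheap
    out_path = os.path.join(OUT_DIR, os.path.basename(path))
    with open(path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        writer.writerow(["id", "name", "gender"])
        # islice stops the iterator early: only the head of the file is read.
        for row in islice(reader, N_ROWS):
            writer.writerow([row["id"], row["name"],
                             lookup.get(row["name"], "")])

if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    lookup = load_lookup(LOOKUP_PATH)
    files = glob.glob(os.path.join(DATA_DIR, "*.csv"))
    with Pool() as pool:  # defaults to one worker per CPU core
        pool.map(process_file, [(f, lookup) for f in files])
```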
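And since the question asks about PySpark: a sketch of the same join in Spark, using the same placeholder paths. The loop is there because `limit()` applies per DataFrame; reading the whole folder in one `spark.read` would cap the total row count, not the per-file count. `broadcast()` ships the small lookup table to every executor so the join needs no shuffle:

```python
import glob
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("top-rows-join").getOrCreate()

# The small name -> gender table, broadcast to all executors below.
lookup = spark.read.csv("lookup.tsv", sep="\t", header=True)

for path in glob.glob(os.path.join("data_dir", "*.csv")):
    out = (spark.read.csv(path, sep="\t", header=True)
                .limit(100_000)                   # top rows of this file only
                .join(broadcast(lookup), "name")  # map-side join, no shuffle
                .select("id", "name", "gender"))
    out.write.csv(os.path.join("out_dir", os.path.basename(path)),
                  sep="\t", header=True, mode="overwrite")
```

Each pass of the loop launches one small Spark job, so with 10k files the plain-Python pool above may well be simpler and faster; Spark pays off more when you need the full files rather than just their heads.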