I am trying to read a file with a specific name which exists in multiple .gz files within a folder.
For example:
D:/sample_datasets/gzfiles
|-my_file_1.tar.gz
|  |-my_file_1.tar
|     |-file1.csv
|     |-file2.csv
|     |-file3.csv
|-my_file_2.tar.gz
   |-my_file_2.tar
      |-file1.csv
      |-file2.csv
      |-file3.csv
I am only interested in reading the contents of file1.csv, which has the same schema across all the .gz files.
I am passing the path D:/sample_datasets/gzfiles to the wholeTextFiles() method in JavaSparkContext. However, it returns the contents of all the files within the tar, viz. file1.csv, file2.csv, and file3.csv.
Is there a way I can read only the contents of file1.csv into a Dataset or an RDD? Thanks in advance!
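To make the goal concrete, here is a minimal sketch of the per-archive step: decompress one .tar.gz stream, walk the tar headers, and return only the entry named file1.csv. It uses only the JDK, so the tar parsing is hand-rolled and deliberately simplified (it assumes plain ustar-style headers, short entry names, octal size fields, and UTF-8 text). The class name TarGzEntryReader and the method readEntry are hypothetical names chosen for illustration, not part of Spark or any library.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: pulls one named file out of a .tar.gz stream.
public class TarGzEntryReader {

    private static final int BLOCK = 512; // tar headers and data are padded to 512-byte blocks

    /**
     * Returns the text of the first regular-file entry whose base name
     * equals wantedName, or null if the archive has no such entry.
     */
    public static String readEntry(InputStream tarGz, String wantedName) throws IOException {
        DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz));
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                return null; // stream ended before an end-of-archive block
            }
            if (header[0] == 0) {
                return null; // empty name: treated as the zero-filled end-of-archive block
            }
            String name = field(header, 0, 100);              // entry path
            String sizeField = field(header, 124, 12).trim(); // size in octal ASCII
            long size = sizeField.isEmpty() ? 0 : Long.parseLong(sizeField, 8);
            long padded = (size + BLOCK - 1) / BLOCK * BLOCK;  // data is padded to a full block
            // Type flag '0' (or NUL in old archives) marks a regular file;
            // this skips directories and pax extended-header entries.
            boolean regular = header[156] == '0' || header[156] == 0;
            String baseName = name.substring(name.lastIndexOf('/') + 1);
            if (regular && baseName.equals(wantedName)) {
                byte[] data = new byte[(int) size]; // assumes the file fits in memory
                in.readFully(data);
                return new String(data, StandardCharsets.UTF_8);
            }
            skipFully(in, padded); // not the file we want: jump to the next header
        }
    }

    // Decodes a NUL-terminated header field.
    private static String field(byte[] block, int offset, int length) {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.UTF_8);
    }

    private static void skipFully(DataInputStream in, long n) throws IOException {
        while (n > 0) {
            int skipped = in.skipBytes((int) Math.min(n, Integer.MAX_VALUE));
            if (skipped <= 0) {
                throw new EOFException("truncated tar entry");
            }
            n -= skipped;
        }
    }
}
```

In Spark, a helper like this could presumably be applied per archive via JavaSparkContext.binaryFiles(), which yields (path, PortableDataStream) pairs, for example by mapping each pair to readEntry(pair._2().open(), "file1.csv") and filtering out nulls. That wiring is untested here, and a tar library such as Apache Commons Compress would be a more robust choice than the hand-written parser for real archives.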
I was able to do this using the following snippet, which I pieced together from multiple answers on SO: