How can I partition the output by column X but write the data without the column X values?
I have a DataFrame with two columns; the schema and values are shown below.
pkey string, output_value string
Values:
pkey ===== output_value
100 ===== 100-Hundred-some-text-value
100 ===== 101-Hundred-some-text-value
200 ===== 200-TwoHundred-some-text-value
300 ===== 300-ThreeHundred-some-text-value
How can I write this DataFrame partitioned by pkey so that each file contains only output_value?
Expected output:
......./target-dir/stage-100/somefilename_100.csv
......./target-dir/stage-200/somefilename_200.csv
......./target-dir/stage-300/somefilename_300.csv
somefilename_100.csv should have the below entries:
100-Hundred-some-text-value
101-Hundred-some-text-value
somefilename_200.csv should have the below entries:
200-TwoHundred-some-text-value
somefilename_300.csv should have the below entries:
300-ThreeHundred-some-text-value
I tried the code below, but it fails because the DataFrame is expected to contain both columns.
df.select('output_value')
.write()
.partitionBy('pkey')
By selecting only output_value, you are stripping the pkey column from your DataFrame at that point, so partitionBy has nothing left to partition on.
partitionBy splits your data by the pkey column and keeps that column out of the final output within the files.
Removing the select clause is enough to accomplish this.
However, the file names within each partition will start with part-*.
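
If the exact layout from the question matters (stage-100/somefilename_100.csv and so on), one option is a small post-processing step after the Spark write. The sketch below is plain Python and does not call Spark itself. It assumes the write was done with something like df.write.partitionBy("pkey").csv(...), which produces directories named pkey=<value> containing part-* files, and that the output sits on a local filesystem. The function name rename_partitions and its parameters are made up for illustration.

```python
import shutil
from pathlib import Path


def rename_partitions(src_dir, dst_dir, column="pkey", prefix="somefilename"):
    """Copy <column>=<value>/part-* files to stage-<value>/<prefix>_<value>.csv.

    Hypothetical helper: assumes src_dir holds the result of a Spark write
    partitioned by `column`, on a local filesystem.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    written = []
    for part_dir in sorted(src.glob(f"{column}=*")):
        if not part_dir.is_dir():
            continue
        # Directory names look like "pkey=100"; keep the part after "=".
        value = part_dir.name.split("=", 1)[1]
        target = dst / f"stage-{value}" / f"{prefix}_{value}.csv"
        target.parent.mkdir(parents=True, exist_ok=True)
        # A partition may hold several part files; concatenate them in name order.
        with target.open("wb") as out:
            for part in sorted(part_dir.glob("part-*")):
                with part.open("rb") as chunk:
                    shutil.copyfileobj(chunk, out)
        written.append(target)
    return written
```

Calling rename_partitions("/tmp/spark-out", "/tmp/target-dir") (both paths are placeholders) would leave the Spark output untouched and build the stage-<pkey> directories next to it. Marker files such as _SUCCESS are skipped because only part-* files are copied.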