Partition the data frame by column X and write the data without column X


How can I partition the output by column X and write the data without the column X values?

I have a data frame with two columns; the schema and values are shown below.

pkey string, output_value string

Values:

pkey | output_value
100  | 100-Hundred-some-text-value
100  | 101-Hundred-some-text-value
200  | 200-TwoHundred-some-text-value
300  | 300-ThreeHundred-some-text-value

How can I write this data frame partitioned by the pkey value, so that each file contains only output_value?

Expected output:

......./target-dir/stage-100/somefilename_100.csv

......./target-dir/stage-200/somefilename_200.csv

......./target-dir/stage-300/somefilename_300.csv

somefilename_100.csv should have the below entries:

100-Hundred-some-text-value

101-Hundred-some-text-value

somefilename_200.csv should have the below entries:

200-TwoHundred-some-text-value

somefilename_300.csv should have the below entries:

300-ThreeHundred-some-text-value

I tried the code below, but the compiler expects the data frame to have both columns.

df.select('output_value')
   .write()
   .partitionBy('pkey') 

There is 1 best solution below

Vaebhav On

By selecting only output_value, you strip the pkey column from your DataFrame at that point, so partitionBy can no longer find it.

partitionBy splits the output by the pkey column and keeps that column out of the data written inside each file.

Removing the select clause is enough to accomplish this:

df.write.partitionBy("pkey") \
        .mode("overwrite") \
        .csv("<path>")
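
To make the effect of a partitioned write concrete, here is a small plain-Python model of it. This is only an illustration of the idea, not Spark itself: the function name write_partitioned and the file name part-00000.csv are made up for the sketch. It groups rows by the partition column, moves that value into the directory name (Spark names these directories pkey=<value>), and writes only the remaining columns into the file.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, partition_col, out_dir):
    """Toy model of a partitioned write (hypothetical helper, not a Spark API)."""
    groups = defaultdict(list)
    for row in rows:
        # The partition value leaves the row and is used for the path instead.
        remaining = {k: v for k, v in row.items() if k != partition_col}
        groups[row[partition_col]].append(remaining)

    written = []
    for key, group in groups.items():
        part_dir = Path(out_dir) / f"{partition_col}={key}"
        part_dir.mkdir(parents=True, exist_ok=True)
        target = part_dir / "part-00000.csv"  # illustrative file name
        with target.open("w", newline="") as fh:
            writer = csv.writer(fh)
            for record in group:
                writer.writerow(record.values())
        written.append(target)
    return written


rows = [
    {"pkey": "100", "output_value": "100-Hundred-some-text-value"},
    {"pkey": "100", "output_value": "101-Hundred-some-text-value"},
    {"pkey": "200", "output_value": "200-TwoHundred-some-text-value"},
    {"pkey": "300", "output_value": "300-ThreeHundred-some-text-value"},
]

with tempfile.TemporaryDirectory() as tmp:
    for path in write_partitioned(rows, "pkey", tmp):
        # Show each file relative to the output directory, with its contents.
        print(path.relative_to(tmp), path.read_text().split())
```

Each file ends up holding only the output_value entries for its key, which is why the select on output_value is unnecessary in the Spark version.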

However, the file names within each partition will start with part-*.
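
If the stage-100/somefilename_100.csv layout from the question is required, one option is a post-processing step after the Spark job finishes. The sketch below is a rough idea rather than a definitive solution: relocate_partitions and its parameters are hypothetical names, it assumes the output was written to a local filesystem (not HDFS or object storage), that the partition directories are named pkey=<value> as Spark writes them, and that the CSV files have no header row, so several part files can simply be concatenated.

```python
import shutil
import tempfile
from pathlib import Path


def relocate_partitions(spark_out, target_dir, column="pkey",
                        dir_prefix="stage-", file_prefix="somefilename_"):
    """Copy <column>=<v>/part-* files into <dir_prefix><v>/<file_prefix><v>.csv."""
    spark_out, target_dir = Path(spark_out), Path(target_dir)
    results = []
    for part_dir in sorted(spark_out.glob(f"{column}=*")):
        if not part_dir.is_dir():
            continue
        value = part_dir.name.split("=", 1)[1]
        dest_dir = target_dir / f"{dir_prefix}{value}"
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / f"{file_prefix}{value}.csv"
        # A partition may hold several part files; append them in name order.
        with dest.open("wb") as out:
            for part in sorted(part_dir.glob("part-*")):
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
        results.append(dest)
    return results


# Demo on a hand-made directory that mimics a partitioned write.
with tempfile.TemporaryDirectory() as tmp:
    fake_out = Path(tmp) / "spark-out"
    (fake_out / "pkey=200").mkdir(parents=True)
    (fake_out / "pkey=200" / "part-00000-demo.csv").write_text(
        "200-TwoHundred-some-text-value\n"
    )
    for path in relocate_partitions(fake_out, Path(tmp) / "target-dir"):
        print(path.relative_to(tmp))
```

The original part-* files are left in place, so the copy can be checked before the Spark output directory is deleted.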