I have a VCF file whose multiallelic variants are expressed as multiple biallelic records. I am trying to convert the file into a PLINK bed file, so each record in the VCF must have a unique ID. Here is an example:
[-----.-----@hydra1 data]$ tabix gnomad.genomes.v3.1.2.hgdp_tgp.chr6.vcf.bgz chr6:29440751-29440751 | cut -f 1-5
chr6 29440751 rs2074464 A C
chr6 29440751 rs2074464 A G
chr6 29440751 rs2074464 A T
The first row has AF=0.000148017, the second row has AF=0.586294, and the third row has AF=0.0592066.
I would like to filter this VCF so that when multiple rows share the same ID, only the one with the highest "AF" is kept. In this example, rows 1 and 3 would be filtered out.
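To make the intended behaviour concrete, here is a minimal Python sketch of the filter logic. It is only an illustration, not a recommended solution. It assumes that records sharing an ID are adjacent in the file (as in a position-sorted VCF), that the allele frequency is stored in the INFO column as `AF=...`, and that records with a missing ID (`.`) should all be kept. The names `info_af` and `keep_highest_af` are hypothetical helpers, and the `.` values for QUAL and FILTER in the example rows are placeholders, not the real gnomAD values.

```python
from itertools import groupby


def info_af(line):
    """Return AF from the INFO column (8th field), or -1.0 if absent or unparsable."""
    info = line.rstrip("\n").split("\t")[7]
    for entry in info.split(";"):
        if entry.startswith("AF="):
            try:
                # Assumption: one ALT per record, so take the first AF value.
                return float(entry[3:].split(",")[0])
            except ValueError:
                return -1.0
    return -1.0


def keep_highest_af(lines):
    """Yield header lines unchanged; for each run of consecutive records that
    share an ID, yield only the record with the highest AF."""
    def key(line):
        if line.startswith("#"):
            return ("header", None)
        return ("record", line.split("\t", 3)[2])  # third column is the ID

    for (kind, variant_id), group in groupby(lines, key=key):
        if kind == "header" or variant_id == ".":
            # Assumption: records without an ID are never collapsed.
            yield from group
        else:
            yield max(group, key=info_af)


# The three rows from the example above (QUAL and FILTER are placeholders).
example = [
    "chr6\t29440751\trs2074464\tA\tC\t.\t.\tAF=0.000148017\n",
    "chr6\t29440751\trs2074464\tA\tG\t.\t.\tAF=0.586294\n",
    "chr6\t29440751\trs2074464\tA\tT\t.\t.\tAF=0.0592066\n",
]
kept = list(keep_highest_af(example))
print(kept[0].split("\t")[4])  # G
```

For a real file, the generator could be fed the decompressed VCF line by line (for example from standard input), since it only holds one same-ID record in memory at a time.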
I have been looking through the bcftools documentation, but I find it very brief and can't figure out a way to do this. The VCF files I'm using are massive, so I would like to use an existing package rather than manipulate the files manually.