How to sort entries with the same ID by their allele frequency (AF) in a VCF file

I have a VCF file whose multiallelic variants are expressed as multiple biallelic records. I am trying to convert the file into a PLINK bed file, so each entry in the VCF must have a unique ID. Here is an example:

[-----.-----@hydra1 data]$ tabix gnomad.genomes.v3.1.2.hgdp_tgp.chr6.vcf.bgz chr6:29440751-29440751 | cut -f 1-5
chr6    29440751    rs2074464   A   C
chr6    29440751    rs2074464   A   G
chr6    29440751    rs2074464   A   T

The first row has AF=0.000148017, the second row has AF=0.586294, and the third row has AF=0.0592066.

I would like to filter this VCF so that when there are multiple rows with the same ID, only the one with the highest AF is kept. In this example, rows 1 and 3 would be filtered out.

I have been looking through the bcftools documentation, but I find it very brief and can't figure out a way to do this. The VCF files I'm using are massive, so I would like to use a package rather than manipulate the files manually.
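
To make the desired behaviour concrete, here is a minimal Python sketch of the filter, not a tested pipeline. It assumes that records sharing an ID are adjacent in the file (as in the tabix output above, where they sit at the same position), that AF is stored as a single-valued `AF=` key in the INFO column, and that records with ID `.` should be left untouched. The function names `info_af` and `keep_highest_af` are hypothetical, made up for this illustration. It streams line by line, so the whole file is never held in memory.

```python
import re

# Match "AF=" only as a whole INFO key, so keys such as "AF_popmax=" are ignored.
AF_RE = re.compile(r"(?:^|;)AF=([^;,]+)")


def info_af(line):
    """Return AF from the INFO column (8th field), or -1.0 if missing or non-numeric."""
    info = line.rstrip("\n").split("\t", 8)[7]
    match = AF_RE.search(info)
    try:
        return float(match.group(1)) if match else -1.0
    except ValueError:
        return -1.0


def keep_highest_af(lines):
    """Yield header lines unchanged; for each run of adjacent records that
    share CHROM, POS and ID, yield only the record with the highest AF.

    Assumption: duplicate-ID records are adjacent. Records with ID "." are
    never merged with each other.
    """
    best_key, best_line, best_af = None, None, None
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos, vid = line.split("\t", 3)[:3]
        key = (chrom, pos, vid)
        af = info_af(line)
        if vid != "." and key == best_key:
            # Same variant ID as the pending record: keep whichever has the larger AF.
            if af > best_af:
                best_line, best_af = line, af
            continue
        # A new ID starts here: flush the pending record and start tracking this one.
        if best_line is not None:
            yield best_line
        best_key, best_line, best_af = key, line, af
    if best_line is not None:
        yield best_line
```

One way to use it would be a small wrapper script that calls `sys.stdout.writelines(keep_highest_af(sys.stdin))`, fed by something like `bcftools view input.vcf.bgz` and piped into `bgzip` (the file name is a placeholder). Ties in AF keep the first record encountered.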
