Extract variant positions from VCF dependent on contents of other columns

912 Views Asked by At

I have a vcf file, I am trying to extract the information from these columns:

#CHROM  POS   REF     ALT

However I would like to extract these only if the SAMPLE-1 column contains the string DeNovo (Not DeNovoSV) and that SAMPLE-1, SAMPLE-2, and SAMPLE-3 all contain PASS.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE-1  SAMPLE-2  SAMPLE-3
chr1    10230   .       AC      A       186.90  .       AC=4;AF=0.667;AN=6;DP=77;FS=0.000;MQ=26.38;MQRankSum=0.436;QD=3.89;ReadPosRankSum=0.000;SOR=1.162       GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP
:DN 0/1:6,12:0.667:18:32:PASS:2,3:4,9:69,0,30:3.9669e+01,2.7888e-03,3.2724e+01:295,0,215:Inherited  0/1:5,15:0.750:20:11:PASS:3,6:2,9:60,0,8:3.0340e+01,3.6694e-01,1.0964e+01:172,0,137:.   1/1:0,10
:1.000:10:26:PASS:0,2:0,8:93,29,0:6.1212e+01,2.6342e+01,1.0101e-02:378,0,183:.
chr1    61871   .       C       CT      60.27   .       AC=3;AF=0.500;AN=6;DP=29;FS=11.290;MQ=33.00;MQRankSum=-0.423;QD=2.51;ReadPosRankSum=0.705;SOR=0.478     GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ  0/0:5,0:0.000:5:15:PASS:.:.:0,15,182:.:0,7,93:0,15,100:DeNovo:2.9227e-07        1/1:0,2:1.000:2:5:PASS:0,1:0,1:42,6,0:2.4787e+01,4.7870e+00,1.7754e+00:29,0,9:40,6,0:.  0/1:15,7:0.318:22:26:PASS:6,3:9,4:43,0,41:2.6538e+01,9.8206e-03,4.4010e+01:65,0,250:74,0,234:.
chr1    66369   .       TA      T       116.77  .       AC=2;AF=0.500;AN=4;DP=56;FS=10.138;MQ=173.59;MQRankSum=1.468;QD=4.32;ReadPosRankSum=0.929;SOR=0.367     GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP    ./.:11,5:0.312:16:0:LowGQ:.:.   0/1:8,4:0.333:12:40:PASS:3,2:5,2:71,0,43:4.1762e+01,4.0824e-04,4.5625e+01:71,0,43       0/1:8,7:0.467:15:45:PASS:4,4:4,3:77,0,47:4.7400e+01,1.2244e-04,5.0000e+01:77,0,47
chr1    934273  .       G       C       8.67    .       AC=1;AF=0.167;AN=6;DP=26;FS=0.000;MQ=19.17;MQRankSum=-1.179;QD=0.96;ReadPosRankSum=1.666;SOR=0.223      GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP
:DPL:DN:DQ  0/1:7,2:0.222:9:11:PASS:5,1:2,1:45,0,32:1.0868e+01,3.7242e-01,3.5372e+01:45,0,60:45,0,32:DeNovoSV:4.3945e-09    0/0:6,3:0.333:9:0:LowGQ:.:.:0,0,140:.:55,0,191:46,0,186:.       0/0:8,0:
0.000:8:23:PASS:.:.:0,23,190:.:0,25,195:0,23,190:.
chr1    934274  .       G       C       8.68    .       AC=1;AF=0.167;AN=6;DP=26;FS=0.000;MQ=19.17;MQRankSum=-1.179;QD=0.96;ReadPosRankSum=1.666;SOR=0.223      GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP
:DPL:DN:DQ  0/1:7,2:0.222:9:11:PASS:5,1:2,1:45,0,32:1.0868e+01,3.7242e-01,3.5372e+01:45,0,60:45,0,32:DeNovoSV:4.3945e-09    0/0:6,3:0.333:9:0:PASS:.:.:0,0,140:.:55,0,191:46,0,186:.       0/0:8,0:
0.000:8:23:PASS:.:.:0,23,190:.:0,25,195:0,23,190:.

I have tried using bcftools, see below.

bcftools query -f '%CHROM %POS %REF %ALT\n' file.vcf | head -3
chr1 10230 AC A
chr1 61871 C CT
chr1 66369 TA T

Is there a way to use bcftools, or combine it with awk in order to get the output I am looking for in the vcf file format?

Many thanks

1

There are 1 best solutions below

0
markp-fuso On

FWIW, I'm not familiar with bcftools but if the intention is to pipe the bcftools output into awk then we could do the whole thing in awk ...

Assumptions:

  • none of the fields (in the file) contain white space
  • always going to pull data for columns #CHROM / POS / REF / ALT
  • these 4 columns are always going to be in the same fields/positions (#CHROM = $1 / POS = $2 / REF = $4 / ALT = $5)
  • DeNovo always shows up between a pair of colons (ie, we're looking for :DeNovo: in field $10)
  • the three test fields always show up in the same fields/positions (SAMPLE_1 = $10 / SAMPLE_2 = $11 / SAMPLE_3 = $12)
  • PASS always shows up between a pair of colons (ie, we're lookinf for :PASS: in fields $10 / $11 / $12)

One awk idea:

awk '
$10 ~ ":DeNovo:" &&
$10 ~ ":PASS:"   &&
$11 ~ ":PASS:"   &&
$12 ~ ":PASS:"      { print $1,$2,$4,$5 }
' file.vcf

This generates:

chr1 61871 C CT

NOTE: The obvious (?) downside to this approach is that we've hardcoded the column references; we could certainly modify the awk code to provide a more dynamic interface (replicate bcftools functionality?) but I'm not sure the added complexity is worth the effort, ymmv.


Assuming OP needs the functionality of bcftools (eg, designate a variable set of columns), one idea would be to modify the current bcftools call to include the SAMPLE_X columns and then pipe that output to awk; something like:

bcftools query -f '%CHROM %POS %REF %ALT %SAMPLE_1 %SAMPLE_2 %SAMPLE_3\n' file.vcf | awk '
$5 ~ ":DeNovo:" &&
$5 ~ ":PASS:"   &&
$6 ~ ":PASS:"   &&
$7 ~ ":PASS:"      { print $1,$2,$3,$4 }'

NOTES:

  • further improvements would include parameterizing the DeNovo and PASS strings
  • going back to the idea of eliminating bcftools and doing the whole thing in awk ... would likely need two column lists (columns to display, columns to test) along with a list of the string(s) to compare each of the 'test' columns against ... doable but now we're getting into a fairly substantial coding effort (vs what we've done so far)