extract variable string content between pipes in a VCF file


This question may look genetics-related, but it is really a programming problem.

I have the following VCF file (a specific text format, produced by a tool called VEP) with this header and column content:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample.F
chr1    10643146    .   G   GC  63.2    PASS    CSQ=|FAIL|0.00|0.00|0.01|0.00|13|40|-3|13|||MODIFIER|CASZ1|ENSG00000130940|ENST00000377022|protein_coding||19/20|||||,|FAIL|0.00|0.00|0.01|0.00|13|40|-3|13|||MODIFIER|AL139423.1|ENSG00000272078|ENST00000606802|lncRNA||1/1|||||  GT:GQ:DP:AD:VAF:PL  0/1:58:86:40,45:0.523256:63,0,59
chr1    10646034    .   G   C   64.8    PASS    CSQ=|FAIL|0.00|0.00|0.00|0.00|22|3|1|2|||MODIFIER|CASZ1|ENSG00000130940|ENST00000377022|protein_coding||17/20|||||,|FAIL|0.00|0.00|0.00|0.00|22|3|1|2|||MODIFIER|AL139423.1|ENSG00000272078|ENST00000606802|lncRNA||1/1|||||    GT:GQ:DP:AD:VAF:PL  0/1:59:27:13,14:0.518519:64,0,60

I would like to extract only the chromosomal position in the first column and the gene name in the second column, so that my final file looks like:

chr1:10643146             CASZ1

The BCFtools split-vep plugin (https://samtools.github.io/bcftools/howtos/plugin.split-vep.html) was not suitable, so I decided on a custom approach.

  1. I wrote a line that prints out the needed columns:

awk 'BEGIN {OFS ="\t" ; FS = "\t"};{print $1, $2, $8}' sample > out

  2. I am not sure which bash command is suitable for extracting field no. 13 between the pipes (i.e. in the field starting with CSQ, the string CASZ1, which comes after MODIFIER in this sample), so that from that whole long line I get only the string between the 13th and 14th pipe symbols.

From

CSQ=|FAIL|0.00|0.00|0.00|0.00|22|3|1|2|||MODIFIER|CASZ1|ENSG00000130940|ENST00000377022|protein_coding||17/20|||||,|FAIL|0.00|0.00|0.00|0.00|22|3|1|2|||MODIFIER|AL139423.1|ENSG00000272078|ENST00000606802|lncRNA||1/1||||| 

to

CASZ1

  3. I looked at solutions on SO and found this:

bash how to extract a field based on its content from a delimited string

but the problem is that the strings in field no. 13 are variable, so that approach does not work for me.

Which shell scripting approach should I use?

Thank you!


There are 2 best solutions below

$ awk -F'[\t|]' -v OFS='\t' 'NR>1{print $1":"$2, $21}' file
chr1:10643146   CASZ1
chr1:10646034   CASZ1
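
The one-liner above counts fields across the whole line: 7 tab-separated columns, then `CSQ=` as field 8, so the gene name lands in field 21. A slightly more explicit variant, as a rough sketch only, splits the INFO column separately. It assumes the columns are tab-separated, that INFO holds only the CSQ entry as in the sample, and that the gene name is the 14th pipe-delimited subfield of the first comma-separated CSQ entry. The function name `extract_gene` is made up for illustration.

```shell
# Hypothetical helper: print "chrom:pos<TAB>gene" for each variant line.
# Assumes tab-separated VCF columns and the gene name in the 14th
# pipe-delimited subfield of the first CSQ entry (as in the sample).
extract_gene() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                      # skip all header lines
        {
            split($8, entries, ",")        # one CSQ entry per transcript
            split(entries[1], csq, "|")    # subfields of the first entry
            print $1 ":" $2, csq[14]       # csq[1] is "CSQ=", csq[14] the gene
        }
    ' "$1"
}
```

Usage would be `extract_gene sample > out`. Skipping every line that starts with `#` (rather than `NR>1`) also copes with files that carry several `##` header lines.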

I tried the bcftools plugin, but got:

The field "Consequence" is not present in INFO/CSQ: "Consequence annotations from Ensembl VEP. Format: 'Allele

There are CSQ fields in my VCF, but none named Consequence.
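
To see which subfield names the header actually declares, and at which position, something like the sketch below might help. It assumes the header contains a single `##INFO=<ID=CSQ,...>` line whose description ends in a `Format:` list of pipe-separated names, as the error message suggests. The function name `list_csq_fields` is invented for this example.

```shell
# Hypothetical helper: number the CSQ subfield names declared in the header.
# Assumes one "##INFO=<ID=CSQ" line whose Description ends with
# "Format: name1|name2|..." followed by the closing quote and bracket.
list_csq_fields() {
    grep '^##INFO=<ID=CSQ' "$1" \
        | sed -e 's/.*Format: *//' -e 's/[">].*$//' -e "s/'//g" \
        | tr '|' '\n' \
        | awk '{ print NR, $0 }'           # position, then subfield name
}
```

Running `list_csq_fields sample` prints one numbered name per line, which shows whether a gene-name subfield exists and under what name.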