Grep multiple positions with/without ID


I want to grep a VCF file for multiple positions. The following works:

grep -f template_gb37 file.vcf>gb37_result

My template_gb37 has 10000 lines and it looks like this:

1   1156131 rs2887286   C   T
1   1211292 rs6685064   T   C
1   2283896 rs2840528   A   G

When the VCF has the rs ID, it works perfectly.

The problem is that the VCF I am going to grep may not have the rs ID and may have "." in its place:

File.vcf

#CHROM  POS  ID  REF  ALT ....
1   1156131 .   C   T  ....
1   1211292 .   T   C  ....
1   1211292 .   T   C  ....

Is there a way to search my multiple patterns with "rs" or just "."?

Thanks in advance


There are 2 best solutions below


I think you mean that the third field (the ID) in your file could be either . or rsNNNNNN, and you want to allow both. For that you need an "alternation", which you write with a | like this:

printf "cat\nmonkey\ndog" | grep -E "cat|dog"
cat
dog
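One thing worth checking before building longer patterns: | binds more loosely than everything around it, so the alternatives normally have to be wrapped in parentheses. Here is a minimal sketch on a made-up three-line sample (the data and the variable names are hypothetical, not taken from the question):

```shell
# Hypothetical sample, only to show how | groups inside a longer pattern.
sample=$(printf 'chr1 100 . A\nchr1 100 rs1 A\nchr2 999 rs1 A\n')

# Without parentheses the pattern means: "chr1 100 \."  OR  "rs1 A".
loose=$(printf '%s\n' "$sample" | grep -E 'chr1 100 \.|rs1 A')

# With parentheses only the third field alternates between . and rs1.
strict=$(printf '%s\n' "$sample" | grep -E 'chr1 100 (\.|rs1) A')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The ungrouped pattern also lets the chr2 line through, because that line matches the second half on its own; the grouped pattern keeps only the two chr1 lines.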

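So your pattern file "template_gb37" needs to look like this, with the parentheses around both alternatives (| has the lowest precedence, so without them each pattern would split into two independent halves):

1   1156131 (\.|rs2887286)   C   T
1   1211292 (\.|rs6685064)   T   C
1   2283896 (\.|rs2840528)   A   G

And you need to search with:

grep -Ef PATTERNFILE file.vcf

If you don't want to change your pattern file, you can edit it "on-the-fly" each time you use it. So, if "template" currently looks like this:

1   1156131 rs2887286   C   T
1   1211292 rs6685064   T   C
1   2283896 rs2840528   A   G

the following awk will edit it. Assigning to $3 makes awk rebuild the line using OFS, so FS and OFS are both set to a tab here, on the assumption that your files are tab-delimited (as VCF normally is); otherwise the separators would turn into single spaces and no longer match the VCF:

awk 'BEGIN{FS=OFS="\t"} {$3 = "(\\.|" $3 ")"} 1' template

to make it this:

1   1156131 (\.|rs2887286)   C   T
1   1211292 (\.|rs6685064)   T   C
1   2283896 (\.|rs2840528)   A   G

which means you could use my whole answer like this:

grep -Ef <( awk 'BEGIN{FS=OFS="\t"} {$3 = "(\\.|" $3 ")"} 1' template ) file.vcf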
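To try this without touching real data, here is a small end-to-end sketch. It assumes tab-delimited files, and the directory, file names and the fourth VCF line are made up for the test. It writes the patterns to a temporary file instead of using <( ), so it also runs in a plain POSIX sh:

```shell
# Throwaway, tab-delimited sample files (names and contents are hypothetical).
dir=$(mktemp -d)
printf '1\t1156131\trs2887286\tC\tT\n1\t1211292\trs6685064\tT\tC\n' > "$dir/template"
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t1156131\t.\tC\tT\n1\t1211292\trs6685064\tT\tC\n1\t1211292\t.\tG\tA\n' > "$dir/file.vcf"

# Wrap field 3 as (\.|rsID); FS/OFS keep the tabs between fields intact.
awk 'BEGIN{FS=OFS="\t"} {$3 = "(\\.|" $3 ")"} 1' "$dir/template" > "$dir/patterns"

# Search with extended regexes read from the generated pattern file.
result=$(grep -Ef "$dir/patterns" "$dir/file.vcf")
printf '%s\n' "$result"

rm -r "$dir"
```

This should print the "." line for position 1156131 and the rs6685064 line, but not the last line, whose REF/ALT differ from the template. Note that the patterns are not anchored, so a pattern for chromosome 1 can also match the same position on chromosome 11 or 21; prefixing each pattern with ^ would prevent that.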

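It seems better to use awk for this, since your data is in columns, which is exactly what awk is built for. First read the (fixed) patterns from the template and store them as array keys, adding for each one an extra key with "." in place of the rs ID. Then print the lines of the second file whose first five fields form one of those keys. The commas in the subscripts join the fields with awk's SUBSEP, so that, for example, chromosome 1 at position 1156131 cannot be confused with chromosome 11 at position 156131.

awk 'NR==FNR{a[$1,$2,$3,$4,$5]; a[$1,$2,".",$4,$5]; next}
     ($1,$2,$3,$4,$5) in a' template_gb37 file.vcf > gb37_result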
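A quick sanity check of this two-file approach on throwaway data (the directory, file names and sample lines are hypothetical). The sketch joins the key fields with commas, i.e. awk's SUBSEP, instead of plain concatenation, so that different field combinations cannot collapse into the same key:

```shell
# Throwaway sample files (names and contents are hypothetical).
dir=$(mktemp -d)
printf '1\t1156131\trs2887286\tC\tT\n' > "$dir/template"
printf '1\t1156131\t.\tC\tT\n11\t156131\t.\tC\tT\n1\t1156131\trs2887286\tC\tT\n' > "$dir/file.vcf"

# First file: store each pattern plus its "." variant as array keys.
# Second file: print lines whose first five fields form a stored key.
matched=$(awk 'NR==FNR { a[$1,$2,$3,$4,$5]; a[$1,$2,".",$4,$5]; next }
               ($1,$2,$3,$4,$5) in a' "$dir/template" "$dir/file.vcf")
printf '%s\n' "$matched"

rm -r "$dir"
```

The line for chromosome 11 is there on purpose: concatenated without a separator, its fields give the same string as the chromosome 1 pattern, so it would be reported as a match; with SUBSEP between the fields it is not.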