I'm trying to isolate, from a bam file, those sequencing reads that have insertions longer than number (let's say 50bp). I guess I can do that using the cigar but I don't know any easy way to parse it and keep only the reads that I want. This is what I need:
Read1 -> 2M1I89M53I2M
Read2 -> 2M1I144M
I should keep only Read1.
Thanks!
Most likely I'm late, but ...
Probably you want MC tag, not CIGAR. I use BWA, and information on insertions is stored in the MC tag. But I may mistake.
Use
pysam
module to parse BAM, and regular expressions to parse MC tags.Example code:
Hope that helps.