Find sequencing reads with insertions longer than number

151 Views Asked by At

I'm trying to isolate, from a bam file, those sequencing reads that have insertions longer than number (let's say 50bp). I guess I can do that using the cigar but I don't know any easy way to parse it and keep only the reads that I want. This is what I need:

Read1 -> 2M1I89M53I2M

Read2 -> 2M1I144M

I should keep only Read1.

Thanks!

1

There are 1 best solutions below

0
On

Most likely I'm late, but ...

Probably you want MC tag, not CIGAR. I use BWA, and information on insertions is stored in the MC tag. But I may mistake.

Use pysam module to parse BAM, and regular expressions to parse MC tags.

Example code:

import pysam
import re

input_file = pysam.AlignmentFile('input.bam', 'rb')
output_file = pysam.AlignmentFile('found.bam', 'wb', template = input_file)
for Read in input_file:
    try:
        TagMC = Read.get_tag('MC')
    except KeyError:
        continue
    InsertionsTags = re.findall('\d+I', TagMC)
    if not InsertionsTags: continue
    InsertionLengths = [int(Item[:-1]) for Item in InsertionsTags]
    MinLength = min(InsertionLengths)
    if MinLength > 50: output_file.write(Read)

input_file.close()
output_file.close()

Hope that helps.