Find sequencing reads with insertions longer than number

163 Views Asked by PaulaO At 30 June 2025 at 23:56

I'm trying to isolate, from a bam file, those sequencing reads that have insertions longer than number (let's say 50bp). I guess I can do that using the cigar but I don't know any easy way to parse it and keep only the reads that I want. This is what I need:

Read1 -> 2M1I89M53I2M

Read2 -> 2M1I144M

I should keep only Read1.

Thanks!

Original Q&A

There are 1 best solutions below

AudioBubble On 31 August 2022 at 12:44

Most likely I'm late, but ...

Probably you want MC tag, not CIGAR. I use BWA, and information on insertions is stored in the MC tag. But I may mistake.

Use pysam module to parse BAM, and regular expressions to parse MC tags.

Example code:

import pysam
import re

input_file = pysam.AlignmentFile('input.bam', 'rb')
output_file = pysam.AlignmentFile('found.bam', 'wb', template = input_file)
for Read in input_file:
    try:
        TagMC = Read.get_tag('MC')
    except KeyError:
        continue
    InsertionsTags = re.findall('\d+I', TagMC)
    if not InsertionsTags: continue
    InsertionLengths = [int(Item[:-1]) for Item in InsertionsTags]
    MinLength = min(InsertionLengths)
    if MinLength > 50: output_file.write(Read)

input_file.close()
output_file.close()

Hope that helps.

Find sequencing reads with insertions longer than number

There are 1 best solutions below

Related Questions in SEQUENCING

Related Questions in SAMTOOLS

Related Questions in BAM

Trending Questions

Popular # Hahtags

Popular Questions