extract lines from bam file

1.1k Views Asked by At

I have a bam file that looks like this:

samtools view pingpon.forward.bam | head
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU
K00311:84:HYCNTBBXX:1:1123:2909:4215    0   LQNS02000001.1:55-552   214 28M *   0   0   TCTAGTTCAACTGTAAATCATCCTGCCC    AAFFFJJJJJJJJJJJJJJJJJJJJJJJ    AS:i:-6 XS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:9T18   YT:Z:UU

I also have another file with the IDs I am interested in that looks like this:

K00311:84:HYCNTBBXX:1:2223:15798:5692
K00311:84:HYCNTBBXX:2:2211:11414:30696
K00311:84:HYCNTBBXX:2:2223:28879:41581

Ideally I want to extract the lines from the bam file that start with the IDs from the IDs file. At the moment I am using this code I wrote but it's not working. Any help will be appreciated! Thanks

import pysam
import re


forward = pysam.AlignmentFile('pingpon.forward.bam', "rb")
reverse = pysam.AlignmentFile('pingpon.reverse.bam', "rb")

ids = open("IDs_results_bed_reverse.txt", "w")

for line in reverse:
        if re.match("(.*)(I|i)ds(.*)", line):
            print(line)
2

There are 2 best solutions below

5
On

the first with statement is to read the id file and create a dictionary with the entries. The last with will read the entries from the other file and put in the proper entry on the dictionary.

import re

regex = re.compile('[A-Z0-9]+:\d+:[A-Z0-9]+:\d+:\d+:\d+:\d+')

with open('id.bam') as file:
    ids = {}
    for line in file:
        if regex.match(line):
            temp = line.replace('\n', '')
            ids[temp] = []

print(ids)

with open('list.bam') as file:
    for line in file:
        if regex.match(line):
            temp = line.replace('\n', '').split(' ')
            if temp[0] in ids:
                ids[temp[0]].append(line.replace('\n', ''))

print(ids)
0
On

https://www.biostars.org/p/165090/ Someone had a similar question and it got answered here.