Accessing Bam file at particular location using Pysam

1.6k Views Asked by At

I have a given chromosome number and location (chr1 and location 1599812). I want to use the pysam module of python to access the bam file to obtain read numbers information for only that particular region chr1 and location 1599812. I have tried using pileup() but it requires a range of locations whereas in my case I want only a specific location and not a range of such.

2

There are 2 best solutions below

0
On

If you set the same start and end, the pileup will refer only to that particular position. E.g. (pure samtools):

$ samtools mpileup -r chr1:808957-808957 YourFile.bam
chr1    808957  N   102 READSTRING READQUALITYSTRING

Shows 102 reads covering the position 808957 for the chromosome 1.

0
On

I don't think pileup() is what you want - according to pysam API, this function returns "an iterator over genomic positions" and specifically, "‘all’ reads which overlap the region are returned. The first base returned will be the first base of the first read ‘not’ necessarily the first base of the region used in the query."

You are saying you'd like to obtain "read numbers information" - that is, number of reads in that specific location, correct? For that purpose, count_coverage() should do the job. In your case, I think this code should give you the answer you're looking for:

import pysam

my_bam_file = '/path/to/your/bam_file.bam'
imported = pysam.AlignmentFile(my_bam_file, mode = 'rb')  # 'rb' ~ read bam
coverage = imported.count_coverage(
                  contig = '1',     # Chromosome ID; also might be "chr1" or similar 
                  start = 1599812,
                  stop = 1599813,
                  )
print(coverage)

Note that this works because, as noted in the pysam API glossary, pysam uses half-open intervals, so the range [1599812, 1599813) will include exactly one base-pair.

Running the code above will give you something like this:

> (array('L', [0]), array('L', [0]), array('L', [0]), array('L', [0]))

which is a tuple of arrays containing the number of A, C, G, and T bases in the reads covering this genomic position, respectively. If you are simply interested in the number of reads mapped to this specific genomic position total, you can then sum across this tuple:

import numpy as np

print(np.sum(coverage))