How to extract coordinates from a P-Match result?


The P-Match tool at http://www.gene-regulation.com/cgi-bin/pub/programs/pmatch/bin/p-match.cgi produced a result that I need to process in order to obtain only the sequence ID, start position, and end position. How can I extract this coordinate information from the result? An example result is below.

Scanning sequence ID:   BEST1_HUMAN

              150 (-)  1.000  0.997  GGAAAggccc                                   R05891
              354 (+)  0.988  0.981  gtgtAGACAtt                                  R06227
V$CREL_01  c-Rel  V$EVI1_05  Evi-1

Scanning sequence ID:   4F2_HUMAN

              365 (+)  1.000  1.000  gggacCTACA                                   R05884
               789 (-)  1.000  1.000  gcgCGAAA                                       R05828; R05834; R05835; R05838; R05839
V$CREL_01  c-Rel  V$E2F_02  E2F

Expected output:

Sequence ID start end
(The end position is the start position plus the length of the matched short sequence. For example, GGAAAggccc has 10 characters, so 150 + 10 = 160.)

BEST1_HUMAN 150 160
BEST1_HUMAN 354 365
4F2_HUMAN   365 375
4F2_HUMAN   789 797

Can anyone help me?

There is 1 best solution below

On BEST ANSWER

Use the snippet from this answer to split your result into evenly sized chunks (six lines per record in your example) and extract the data you need:

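def chunks(l, n):
    # Generator that yields successive n-sized chunks from l
    for i in range(0, len(l), n):
        yield l[i: i + n]

with open('p_match.txt') as f:
    for chunk in chunks(f.readlines(), 6):
        # chunk[0] is the "Scanning sequence ID: ..." header line
        sequence_id = chunk[0].split()[-1]
        # chunk[2] and chunk[3] are the two hit lines
        for i in (2, 3):
            fields = chunk[i].split()
            start = int(fields[0])
            # The matched sequence is the fifth field; indexing with [-2]
            # breaks on lines that list several accession numbers
            sequence = fields[4]
            stop = start + len(sequence)
            print(sequence_id, start, stop)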

Edit: Apparently the result can contain a variable number of start positions per sequence, so the above approach of splitting into evenly sized chunks doesn't work. In that case you could go the regex route, or split the text on the header line and go through each record line by line:

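with open('p_match.txt') as f:
    text = f.read()

# Each record starts with the "Scanning sequence ID:" header
for chunk in text.split('Scanning sequence ID:'):
    if not chunk.strip():
        continue
    lines = chunk.split('\n')
    sequence_id = lines[0].strip()
    for line in lines[1:]:
        fields = line.split()
        # Hit lines start with the numeric position; skip everything else
        if len(fields) >= 5 and fields[0].isdigit():
            start = int(fields[0])
            # The matched sequence is the fifth field; indexing with [-2]
            # breaks on lines that list several accession numbers
            sequence = fields[4]
            stop = start + len(sequence)
            print(sequence_id, start, stop)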
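
For the regex route mentioned above, here is a minimal sketch. It assumes every hit line has the layout shown in the question (position, strand in parentheses, two scores, then the matched sequence) and that the header always reads "Scanning sequence ID:". The names HEADER, HIT and extract_coordinates are made up for this illustration, and the sample text mirrors the question's example.

```python
import re

# Header line, e.g. "Scanning sequence ID:   BEST1_HUMAN"
HEADER = re.compile(r"^Scanning sequence ID:\s*(\S+)")
# Hit line: position, strand, two scores, matched sequence (assumed layout)
HIT = re.compile(r"^\s*(\d+)\s+\([+-]\)\s+[\d.]+\s+[\d.]+\s+([A-Za-z]+)")

def extract_coordinates(text):
    """Return (sequence_id, start, end) tuples found in P-Match output text."""
    results = []
    sequence_id = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            sequence_id = header.group(1)
            continue
        hit = HIT.match(line)
        if hit and sequence_id is not None:
            start = int(hit.group(1))
            # end = start + length of the matched sequence
            results.append((sequence_id, start, start + len(hit.group(2))))
    return results

sample = """Scanning sequence ID:   BEST1_HUMAN

              150 (-)  1.000  0.997  GGAAAggccc                 R05891
              354 (+)  0.988  0.981  gtgtAGACAtt                R06227
V$CREL_01c-RelV$EVI1_05Evi-1

Scanning sequence ID:   4F2_HUMAN

              365 (+)  1.000  1.000  gggacCTACA                 R05884
               789 (-)  1.000  1.000  gcgCGAAA      R05828; R05834; R05835
V$CREL_01c-RelV$E2F_02E2F
"""

for row in extract_coordinates(sample):
    print(*row)
```

Because the pattern anchors on the numeric position and the strand marker rather than on indentation or field counts from the end of the line, it should cope with any number of hits per sequence and with lines that list several accession numbers.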