The P-Match tool at http://www.gene-regulation.com/cgi-bin/pub/programs/pmatch/bin/p-match.cgi produces a result that I need to process in order to obtain only the sequence ID, start position and end position. How can I extract this coordinate information from the result? Below is an example result.
Scanning sequence ID: BEST1_HUMAN
150 (-) 1.000 0.997 GGAAAggccc R05891
354 (+) 0.988 0.981 gtgtAGACAtt R06227
V$CREL_01c-RelV$EVI1_05Evi-1
Scanning sequence ID: 4F2_HUMAN
365 (+) 1.000 1.000 gggacCTACA R05884
789 (-) 1.000 1.000 gcgCGAAA R05828; R05834; R05835; R05838; R05839
V$CREL_01c-RelV$E2F_02E2F
Expected output (the end position is the start position plus the length of the matched sequence, e.g. 150 + 10 = 160 for GGAAAggccc):
Sequence ID start end
BEST1_HUMAN 150 160
BEST1_HUMAN 354 365
4F2_HUMAN 365 375
4F2_HUMAN 789 797
Can anyone help me?
Use the snippet from this answer to split your result into evenly sized chunks and extract your desired data:
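The linked snippet is not reproduced here, so what follows is only a minimal sketch of the chunking idea. It assumes every record spans exactly four non-blank lines (the "Scanning sequence ID" header, two hit lines, and the matrix-name line), as in the example from the question. The names `chunks` and `extract_fixed` are hypothetical.

```python
def chunks(items, n):
    """Yield successive n-sized slices of a list."""
    for i in range(0, len(items), n):
        yield items[i:i + n]


def extract_fixed(lines, size=4):
    """Return (seq_id, start, end) tuples, assuming fixed-size records.

    Assumption: each record is `size` lines long, with the header first
    and the matrix-name line last; everything in between is a hit line.
    """
    rows = []
    for block in chunks(lines, size):
        # Header looks like "Scanning sequence ID: BEST1_HUMAN"
        seq_id = block[0].split(":", 1)[1].strip()
        for hit in block[1:-1]:
            fields = hit.split()
            start = int(fields[0])      # first column: start position
            matched = fields[4]         # fifth column: matched sequence
            rows.append((seq_id, start, start + len(matched)))
    return rows
```

Pass it the non-blank lines of the result, for example `extract_fixed([l for l in text.splitlines() if l.strip()])`.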
Edit: Apparently the result can contain a variable number of start positions per sequence, so splitting into evenly sized chunks does not work. You could instead go the regex route or go through the file line by line: