How to read two files in a for-loop and update values in one file based on matching values in another file?

I want to update the values in a column by reading two files simultaneously.

main_file has the following data:

contig  pos GT  PGT_phase   PID PG_phase    PI
2   1657    ./. .   .   ./. .
2   1738    0/1 .   .   0|1 935
2   1764    0/1 .   .   1|0 935
2   1782    0/1 .   .   0|1 935
2   1850    0/0 .   .   0/0 .
2   1860    0/1 .   .   1|0 935
2   1863    0/1 .   .   0|1 935
2   2969    0/1 .   .   1|0 3352
2   2971    0/0 .   .   0/0 .
2   5207    0/1 0|1 5185    1|0 1311
2   5238    0/1 .   .   0|1 1311
2   5241    0/0 .   .   0/0 .
2   5258    0/1 .   .   1|0 1311
2   5260    0/0 .   .   0/0 .
2   5319    0/0 .   .   0/0 .
2   5398    0/1 0|1 5398    1|0 1311
2   5403    0/1 0|1 5398    1|0 1311
2   5426    0/1 0|1 5398    1|0 1311
2   5427    0/1 0|1 5398    0/1 .
2   5434    0/1 0|1 5398    1|0 1311
2   5454    0/1 0|1 5398    0/1 .
2   5457    0/0 .   .   0/0 .
2   5467    0/1 0|1 5467    0|1 1311
2   5480    0/1 0|1 5467    0|1 1311
2   5483    0/0 0|1 5482    0/0 .
2   6414    0/1 .   .   0|1 1667
2   6446    0/1 0|1 6446    0|1 1667
2   6448    0/1 0|1 6446    0|1 1667
2   6465    0/1 0|1 6446    0|1 1667
2   6636    0/1 .   .   1|0 1667
2   6740    0/1 .   6740    0|1 1667
2   6748    0/1 .    6740   0|1 .

The other file, match_file, has the following kind of data:

PI      PID
1309    3617741,3617753,3617788,3618156,3618187,3618289
131     11793586
1310    
1311    5185,5398,5467,5576
1312    340692,340728
1313    18503498
1667    6740,12237,12298

What I am trying to do:

  • I want to create a new column (new_PI) with updated PI values.

How the updating works:

  • If the line in main_file has a PI value, it's simple: new_PI = main_PI, then continue.
  • If both main_PI and main_PID are '.', new_PI = '.', then continue.
  • If the PI value is '.' but the PID value is an integer, look in match_file for the PI whose list of PIDs contains that value. If a matching PID is found, new_PI = PI_match_file, then continue.
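
The three rules above can be sketched as one small function. This is only an illustration: `resolve_new_pi` and `pid_to_pi` are hypothetical names, the mapping is assumed to have been built from match_file beforehand, and returning '.' when no PID matches is an assumption, since the question does not say what should happen in that case.

```python
def resolve_new_pi(pi_main, pid_main, pid_to_pi):
    """Return new_PI for one row of main_file (hypothetical helper).

    pid_to_pi is assumed to map each PID string to its PI string,
    built beforehand from match_file.
    """
    # Rule 1: a PI that is already present is kept as-is.
    if pi_main != '.':
        return pi_main
    # Rule 2: nothing to look up when both PI and PID are missing.
    if pid_main == '.':
        return '.'
    # Rule 3: PI is missing but PID is present, so look the PID up.
    # Falling back to '.' on a miss is an assumption, not from the question.
    return pid_to_pi.get(pid_main, '.')


# Sample mapping taken from two rows of the match_file shown above.
pid_to_pi = {'5185': '1311', '5398': '1311', '6740': '1667'}
print(resolve_new_pi('1311', '5398', pid_to_pi))  # 1311 (PI already present)
print(resolve_new_pi('.', '.', pid_to_pi))        # .    (both missing)
print(resolve_new_pi('.', '6740', pid_to_pi))     # 1667 (found via PID)
```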

I have written the below code:

main_file = open("2ms01e_chr2_table.txt", 'r+')
match_file = open('updated_df_table.txt', 'r+')

main_header = main_file.readline()
match_header = match_file.readline()

main_data = main_file.read().rstrip('\n').split('\n')
match_data = match_file.read().rstrip('\n').split('\n')

file_update = open('PI_updates.txt', 'w')
file_update.write('contig   pos GT  PGT_phase   PID PG_phase    PI  new_PI\n')
file_update.close()

for line in main_data:
    main_column = line.split('\t')
    PID_main = main_column[4]
    PI_main = main_column[6]
    if PID_main == '.' and PI_main == '.':
        new_PI = '.'
        continue

    if PI_main != '.':
        new_PI = PI_main
        continue

    if PI_main == '.' and PID_main != '.':
        for line in match_data:
            match_column = line.split('\t')
            PI_match = match_column[0]
            PID_match = match_column[1].split(',')
            if PID_main in PID_match:
                new_PI = PI_match
                continue

    file_update = open('PI_updates.txt', 'a')
    file_update.write(line + '\t' + str(new_PI)+ '\n')
    file_update.close()

I am not getting any error, but it looks like my code is not reading the two files correctly.

My output should be something like this:

contig  pos    GT    PGT       PID     PG      PI     new_PI
2      5426    0/1   0|1       5398   1|0   1311       1311 
2      5427    0/1   0|1       5398   0/1   .          1311
2      5434    0/1   0|1       5398   1|0   1311       1311
2      5454    0/1   0|1       5398   0/1   .          1311
2      5457    0/0   .          .     0/0   .          .
2      5467    0/1   0|1       5467   0|1   1311       1311
2      5480    0/1   0|1       5467   0|1   1311       1311
2      5483    0/0   0|1       5482   0/0   1667       1667
2      5518    1/1   1|1       5467   1/1   .          1311
2      5519    0/0   .         .      0/0   .          .
2      5547    1/1   1|1       5467   1/1   .          1311
2      5550    ./.   .         .      ./.   .          .
2      5559    1/1   1|1       5467   1/1   .          1311
2      5561    0/0   .         .      0/0   .          .
2      5576    0/1   0|1       5576   1|0   1311       1311
2      5599    0/1   0|1       5576   1|0   1311       1311
2      5602    0/0   .         .      0/0   .          .
2      5657    0/1   .         .      1|0   1311       1311
2      5723    0/1   .         .      1|0   1311       1311
2      6414    0/1   .         .      0|1   1667       1667
2      6446    0/1  0|1      6446     0|1   1667       1667
2      6448    0/1  0|1      6446     0|1   1667       1667
2      6465    0/1  0|1      6446     0|1   1667       1667
2      6636    0/1  .          .      1|0   1667       1667
2      6740    0/1  .        6740     0|1   1667       1667
2      6748    0/1  .        6740     0|1   .          1667

Thanks in advance !

There are 2 best solutions below

2
On BEST ANSWER

Your logic is fine, but most rows never reach the lines that append to PI_updates.txt. A continue statement ends the current loop iteration and jumps to the next one, so the first two if blocks skip the file write entirely. The third if block behaves differently: its continue sits inside the inner loop, so it only ends the current inner-loop iteration and the write is still reached.
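
A tiny standalone sketch of that behaviour, unrelated to the actual files (the variable names here are made up for illustration): a continue in the outer loop skips everything after it in the loop body, while a continue in an inner loop only affects the inner loop.

```python
# continue in the outer loop: the append below is skipped for 'b'
written = []
for value in ['a', 'b', 'c']:
    if value == 'b':
        continue  # jump straight to the next outer iteration
    written.append(value)

print(written)  # ['a', 'c']

# continue in an inner loop: the outer loop body still runs to the end
reached = []
for outer in [1, 2]:
    for inner in ['x', 'y']:
        if inner == 'x':
            continue  # only skips the rest of this inner iteration
    reached.append(outer)  # still executed for every outer value

print(reached)  # [1, 2]
```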

Somewhat related, here is a quick speed win: you have two nested for loops. You could replace the iteration over match_data with a lookup in a dictionary, which can give a large speedup on bigger files. You might also want to collect the new_PI values in memory and perform a single write at the end of your code. File I/O is comparatively expensive, so it is best done as few times as possible.

Edit: (example)

# assumes main_file and match_file are already open and their
# header lines have been consumed, as in the question's code
main_data = main_file.read().rstrip('\n').split('\n')
match_data = match_file.read().rstrip('\n').split('\n')
match_map = {} # instantiate empty dict
for line in match_data:
    # partition tolerates rows that have no PIDs (e.g. PI 1310)
    PI, _, PIDs = line.partition('\t')
    # update the dict with all the PIDs from this line;
    # PIDs is a comma-separated string, so split it first
    match_map.update({PID.strip(): PI for PID in PIDs.split(',') if PID.strip()})

PI_updates = 'contig\tpos\tGT\tPGT_phase\tPID\tPG_phase\tPI\tnew_PI\n'

for line in main_data:
    # seven columns: contig, pos, GT, PGT_phase, PID, PG_phase, PI
    _, _, _, _, PID_main, _, PI_main = line.split('\t')
    if PID_main == '.' and PI_main == '.':
        new_PI = '.'
    elif PI_main != '.':
        new_PI = PI_main
    else:
        # dict.get(key, default) returns default if key is not in the dict
        new_PI = match_map.get(PID_main, 'no match found')
    # append the result to the PI_updates string
    PI_updates += line + '\t' + str(new_PI)+ '\n'

# let with statement take care of closing the file
with open('PI_updates.txt', 'w') as file_update:
    file_update.write(PI_updates)
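
To try the dictionary approach without the real files, here is a self-contained sketch run on a few rows copied from the question. It is an illustration under assumptions: `io.StringIO` stands in for the opened files, the data is assumed to be tab-separated, `build_pid_map` is a hypothetical helper name, and '.' is used as the fallback when no PID matches. It also collects the output lines in a list and joins them once, rather than concatenating strings.

```python
import io


def build_pid_map(match_lines):
    """Map every PID to its PI (hypothetical helper)."""
    pid_to_pi = {}
    for line in match_lines:
        pi, _, pids = line.rstrip('\n').partition('\t')
        for pid in pids.split(','):
            pid = pid.strip()
            if pid:  # skip rows without PIDs, such as PI 1310
                pid_to_pi[pid] = pi
    return pid_to_pi


# In-memory stand-ins for the two files (tab-separated, by assumption).
match_file = io.StringIO(
    "PI\tPID\n"
    "1310\t\n"
    "1311\t5185,5398,5467,5576\n"
    "1667\t6740,12237,12298\n"
)
main_file = io.StringIO(
    "contig\tpos\tGT\tPGT_phase\tPID\tPG_phase\tPI\n"
    "2\t5426\t0/1\t0|1\t5398\t1|0\t1311\n"
    "2\t5427\t0/1\t0|1\t5398\t0/1\t.\n"
    "2\t5457\t0/0\t.\t.\t0/0\t.\n"
    "2\t6748\t0/1\t.\t6740\t0|1\t.\n"
)

next(match_file)  # skip the match_file header
pid_to_pi = build_pid_map(match_file)

header = next(main_file).rstrip('\n')
out_lines = [header + '\tnew_PI']
for line in main_file:
    line = line.rstrip('\n')
    cols = line.split('\t')
    pid, pi = cols[4], cols[6]
    if pi != '.':
        new_pi = pi
    elif pid == '.':
        new_pi = '.'
    else:
        new_pi = pid_to_pi.get(pid, '.')  # '.' on a miss is an assumption
    out_lines.append(line + '\t' + new_pi)

# One join (and, with a real file, one write) at the very end.
print('\n'.join(out_lines))
```

With real files, the final `print` would become a single `write` inside a `with open(...)` block.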
0
On

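I should have used break rather than continue in the inner loop, so that the search stops at the first match. The continue statements in the other branches were not helpful either, since they skipped the write at the end of the loop.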

main_file = open("2ms01e_chr2_table.txt", 'r+')
match_file = open('updated_df_table.txt', 'r+')


main_header = main_file.readline()
match_header = match_file.readline()
print(match_header, "\n**")

main_data = main_file.read().rstrip('\n').split('\n')
match_data = match_file.read().rstrip('\n').replace('[', '')\
    .replace("'", "").replace(']', '').replace(" ", '')
match_data = match_data.split('\n')

file_update = open('PI_updates.txt', 'w')
file_update.write('contig   pos GT  PGT_phase   PID PG_phase    PI  new_PI\n')
file_update.close()

for line in main_data:
    main_column = line.split('\t')
    PID_main = main_column[4]
    PI_main = main_column[6]
    chrom = main_column[0]
    pos = main_column[1]
    if PID_main == '.' and PI_main == '.':
        new_PI = '.'

    if PI_main != '.':
        new_PI = PI_main

    elif PI_main == '.' and PID_main != '.':
        for line1 in match_data:
            match_column = line1.split('\t')
            PI_match = match_column[0]
            PID_match = match_column[1].split(',')
            if PID_main in PID_match:
                new_PI = PI_match
                break
            elif PID_main not in PID_match:
                new_PI = str(chrom) + '_' + str(PID_main)

    file_update = open('PI_updates.txt', 'a')
    file_update.write(line + '\t' + str(new_PI)+ '\n')
    file_update.close()
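
As a possible alternative to setting the fallback value on every non-matching row, Python's for ... else can express "use the fallback only if no row matched". The sketch below is an assumption-laden illustration, not the code used above: `find_pi` is a hypothetical helper and `match_rows` is assumed to be a list of (PI, list of PIDs) tuples already parsed from match_file.

```python
def find_pi(pid_main, chrom, match_rows):
    """Return the PI whose PID list contains pid_main (hypothetical helper).

    match_rows is assumed to be a list of (PI, [PID, ...]) tuples.
    """
    for pi_match, pid_match in match_rows:
        if pid_main in pid_match:
            new_pi = pi_match
            break  # stop at the first match
    else:
        # runs only if the loop finished without hitting break
        new_pi = chrom + '_' + pid_main
    return new_pi


match_rows = [('1311', ['5185', '5398']), ('1667', ['6740', '12237'])]
print(find_pi('6740', '2', match_rows))  # 1667
print(find_pi('5482', '2', match_rows))  # 2_5482
```

The else branch also covers the case where `match_rows` is empty, in which case the fallback is returned directly.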