I want to update the values in a column by reading two files simultaneously.
main_file has the following data:
contig pos GT PGT_phase PID PG_phase PI
2 1657 ./. . . ./. .
2 1738 0/1 . . 0|1 935
2 1764 0/1 . . 1|0 935
2 1782 0/1 . . 0|1 935
2 1850 0/0 . . 0/0 .
2 1860 0/1 . . 1|0 935
2 1863 0/1 . . 0|1 935
2 2969 0/1 . . 1|0 3352
2 2971 0/0 . . 0/0 .
2 5207 0/1 0|1 5185 1|0 1311
2 5238 0/1 . . 0|1 1311
2 5241 0/0 . . 0/0 .
2 5258 0/1 . . 1|0 1311
2 5260 0/0 . . 0/0 .
2 5319 0/0 . . 0/0 .
2 5398 0/1 0|1 5398 1|0 1311
2 5403 0/1 0|1 5398 1|0 1311
2 5426 0/1 0|1 5398 1|0 1311
2 5427 0/1 0|1 5398 0/1 .
2 5434 0/1 0|1 5398 1|0 1311
2 5454 0/1 0|1 5398 0/1 .
2 5457 0/0 . . 0/0 .
2 5467 0/1 0|1 5467 0|1 1311
2 5480 0/1 0|1 5467 0|1 1311
2 5483 0/0 0|1 5482 0/0 .
2 6414 0/1 . . 0|1 1667
2 6446 0/1 0|1 6446 0|1 1667
2 6448 0/1 0|1 6446 0|1 1667
2 6465 0/1 0|1 6446 0|1 1667
2 6636 0/1 . . 1|0 1667
2 6740 0/1 . 6740 0|1 1667
2 6748 0/1 . 6740 0|1 .
The other file, match_file, has the following type of info:
**PI PID**
1309 3617741,3617753,3617788,3618156,3618187,3618289
131 11793586
1310
1311 5185,5398,5467,5576
1312 340692,340728
1313 18503498
1667 6740,12237,12298
What I am trying to do:
- I want to create a new column (`new_PI`) with updated PI values.
How the updating works:
- If there is a PI value in the line of main_file, it's simple: `new_PI = main_PI`, and then `continue`.
- If both `main_PI` and `main_PID` are `.` in main_file, then `new_PI = .`, and `continue`.
- But if the PI value is `.` and the PID value is some integer, we look in match_file for the PI whose list of PIDs contains that value. If a matching PID is found, `new_PI = PI_match_file`, and then `continue`.
I have written the below code:
    main_file = open("2ms01e_chr2_table.txt", 'r+')
    match_file = open('updated_df_table.txt', 'r+')
    main_header = main_file.readline()
    match_header = match_file.readline()
    main_data = main_file.read().rstrip('\n').split('\n')
    match_data = match_file.read().rstrip('\n').split('\n')
    file_update = open('PI_updates.txt', 'w')
    file_update.write('contig pos GT PGT_phase PID PG_phase PI new_PI\n')
    file_update.close()
    for line in main_data:
        main_column = line.split('\t')
        PID_main = main_column[4]
        PI_main = main_column[6]
        if PID_main == '.' and PI_main == '.':
            new_PI = '.'
            continue
        if PI_main != '.':
            new_PI = PI_main
            continue
        if PI_main == '.' and PID_main != '.':
            for line in match_data:
                match_column = line.split('\t')
                PI_match = match_column[0]
                PID_match = match_column[1].split(',')
                if PID_main in PID_match:
                    new_PI = PI_match
                    continue
        file_update = open('PI_updates.txt', 'a')
        file_update.write(line + '\t' + str(new_PI)+ '\n')
        file_update.close()
I am not getting any error, but it looks like my code is not reading the two files appropriately.
My output should be something like this:
contig pos GT PGT PID PG PI new_PI
2 5426 0/1 0|1 5398 1|0 1311 1311
2 5427 0/1 0|1 5398 0/1 . 1311
2 5434 0/1 0|1 5398 1|0 1311 1311
2 5454 0/1 0|1 5398 0/1 . 1311
2 5457 0/0 . . 0/0 . .
2 5467 0/1 0|1 5467 0|1 1311 1311
2 5480 0/1 0|1 5467 0|1 1311 1311
2 5483 0/0 0|1 5482 0/0 1667 1667
2 5518 1/1 1|1 5467 1/1 . 1311
2 5519 0/0 . . 0/0 . .
2 5547 1/1 1|1 5467 1/1 . 1311
2 5550 ./. . . ./. . .
2 5559 1/1 1|1 5467 1/1 . 1311
2 5561 0/0 . . 0/0 . .
2 5576 0/1 0|1 5576 1|0 1311 1311
2 5599 0/1 0|1 5576 1|0 1311 1311
2 5602 0/0 . . 0/0 . .
2 5657 0/1 . . 1|0 1311 1311
2 5723 0/1 . . 1|0 1311 1311
2 6414 0/1 . . 0|1 1667 1667
2 6446 0/1 0|1 6446 0|1 1667 1667
2 6448 0/1 0|1 6446 0|1 1667 1667
2 6465 0/1 0|1 6446 0|1 1667 1667
2 6636 0/1 . . 1|0 1667 1667
2 6740 0/1 . 6740 0|1 1667 1667
2 6748 0/1 . 6740 0|1 . 1667
Thanks in advance !
Your code looks fine, except that it often never reaches the lines that append to the `PI_updates.txt` file. A `continue` statement ends the current loop iteration and moves on to the next one, so the file-write lines are skipped. This is not the case when the third `if` statement is entered, because the `continue` there only ends the current iteration of the inner loop.

Somewhat related, I've got a quick speed win for you: you have two nested `for` loops. You could replace the iteration over `match_data` with a lookup in a dictionary. This can offer a tremendous speedup on larger files. You might also want to store the `new_PI` values in a list and perform a single write at the end of your code. File I/O is generally expensive and should be done as seldom as possible.

Edit: (example)
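Here is a minimal sketch of the dictionary approach, not a drop-in replacement. It assumes both files are tab-separated, as the `split('\t')` calls in your code imply. The function names (`build_pid_to_pi`, `add_new_pi`, `update_file`) are my own, and I assume that a PID missing from match_file should get `.` as its `new_PI`. If a PID were listed under more than one PI, the last one read would win.

```python
def build_pid_to_pi(match_rows):
    """Map every PID to its PI. Each row looks like 'PI<TAB>pid1,pid2,...'."""
    pid_to_pi = {}
    for row in match_rows:
        fields = row.split('\t')
        if len(fields) < 2 or not fields[1].strip():
            continue  # a PI without any PIDs, like the '1310' row
        for pid in fields[1].strip().split(','):
            pid_to_pi[pid] = fields[0]
    return pid_to_pi


def add_new_pi(main_rows, pid_to_pi):
    """Return the main rows with a new_PI value appended to each."""
    updated = []
    for row in main_rows:
        columns = row.split('\t')
        pid, pi = columns[4], columns[6]
        if pi != '.':
            new_pi = pi                        # keep the existing PI
        elif pid != '.':
            new_pi = pid_to_pi.get(pid, '.')   # dictionary lookup; '.' if unknown (assumption)
        else:
            new_pi = '.'                       # neither PI nor PID is available
        updated.append(row + '\t' + new_pi)    # no continue here, so every row is kept
    return updated


def update_file(main_path, match_path, out_path):
    """Read both files, then write the whole result with a single write call."""
    with open(match_path) as match_file:
        match_file.readline()  # skip the header line
        pid_to_pi = build_pid_to_pi(match_file.read().splitlines())
    with open(main_path) as main_file:
        header = main_file.readline().rstrip('\n')
        main_rows = [r for r in main_file.read().splitlines() if r.strip()]
    rows = add_new_pi(main_rows, pid_to_pi)
    with open(out_path, 'w') as out_file:
        out_file.write('\n'.join([header + '\tnew_PI'] + rows) + '\n')
```

With the file names from your question, the call would be `update_file("2ms01e_chr2_table.txt", "updated_df_table.txt", "PI_updates.txt")`. Using `if`/`elif`/`else` instead of `continue` means each row always reaches the line that stores it, and the dictionary turns the inner loop into a single lookup per row.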