I have some large (50k line) files formatted like this
chr1 35276 35481 NR_026820_exon_1_0_chr1_35277_r 0 - 0.526829 0.473171 54 37 60 54 0 0 205
chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r 0 - 0.398892 0.601108 73 116 101 71 0 0 361
chr1 35720 36081 NR_026820_exon_2_0_chr1_35721_r 0 - 0.398892 0.601108 73 116 101 71 0 0 361
chr1 69090 70008 NM_001005484_exon_0_0_chr1_69091_f 0 + 0.571895 0.428105 212 218 175 313 0 0 918
chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 0.366775 0.633225 997 1194 1924 809 0 0 4924
chr1 139789 139847 NR_039983_exon_1_0_chr1_139790_r 0 - 0.551724 0.448276 13 12 14 19 0 0 58
chr1 140074 140566 NR_039983_exon_2_0_chr1_140075_r 0 - 0.475610 0.524390 126 144 114 108 0 0 492
chr1 323891 324060 NR_028322_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 323891 324060 NR_028325_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 323891 324060 NR_028327_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 324287 324345 NR_028322_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324287 324345 NR_028325_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324287 324345 NR_028327_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324438 326938 NR_028327_exon_2_0_chr1_324439_f 0 + 0.375200 0.624800 400 1013 549 538 0 0 2500
chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 0.378228 0.621772 678 1580 996 889 0 0 4143
chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 0.378228 0.621772 678 1580 996 889 0 0 4143
The 2nd and 3rd columns are a start and end position. I want to delete all rows that have the same start AND end positions (rest of line doesn't matter) and keep only the first time it appears. For example, I'd want to keep lines 14 and 15 in the sample data because even though the start position is the same, the end is not. 15 and 16 have identical start and end, so I'd want to delete 16.I'm trying to do it in Python, but I'm really at a loss for how to handle the two column uniqueness requirement.
Any ideas on the best/easiest way to do this?
Consider using the excellent pandas library to load and mangle such data:
Now, it's super easy: