modifying line names to avoid redundancies when files are merged in terminal

49 Views Asked by At

I have two files containing biological DNA sequence data. Each of these files are the output of a python script which assigns each DNA sequence to a sample ID based on a DNA barcode at the beginning of the sequence. The output of one of these .txt files looks like this:

>S066_1 IGJRWKL02G0QZG orig_bc=ACACGTGTCGC new_bc=ACACGTGTCGC bc_diffs=0
TTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCGTGAGAAGTTGAGGTTATGGCAAGCATCCATAAGAACCCTATAGCGAGAATAATTACTACGCTTAGAGCCAGATGGCACCGCCACTGATTTTAGGGGCCGCTGAATAGCGAGCTCCAAGACCCCTTGCGGGATTGGTCAAAATAGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTCCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCAGCGTTCTTCATCGATGACGAGTCTAG
>S045_2 IGJRWKL02H5XHD orig_bc=ATCTGACGTCA new_bc=ATCTGACGTCA bc_diffs=0
CTAAGTTCAGCGGGTAGTCTTGTCTGATATCAGGTCCAATTGAGATACCACCGACAATCATTCGATCATCAACGATACAGAATTTCCCAAATAAATCTCTCTACGCAACTAAATGCAGCGTCTCCGTACATCGCGAAATACCCTACTAAACAACGATCCACAGCTCAAACCGACAACCTCCAGTACACCTCAAGGCACACAGGGGATAGG

The first line is the sequence ID, and the second line in the DNA sequence. S_066 in the first part of the ID indicates that the sequence is from sample 066, and the _1 indicates that its the first sequence in the file (not the first sequence from S_066 per se). Because of the nuances of the DNA sequencing technology being used, I need to generate two files like this from the raw sequencing files, and the result is an output where I have two of these files, which I then use cat to merge together. So far so good.

The next downstream step in my workflow does not allow identical sample names. Right now it gets half way through, errors, and closes because it encounters some identical sequence IDs. So, it must be that the 400th sequence in both files belongs to the same sample, or something, generating identical sample IDs (i.e. both files might have S066_400).

What I would like to do is use some code to insert a number (1000,, 4971, whatever) immediately after the _ on every other line in the second file, starting with the first line. This way the IDs would no longer be confounded and I could proceed. So, it would cover S066_2 to S066_24971 or S066_49712. Part of the trouble is that the ID may be variable in length such that it could begin as S066_ or as 49BBT1_.

1

There are 1 best solutions below

3
On BEST ANSWER

Try:

awk '/^\>/ {$1=$1 "_13"} {print $0}' filename > tmp.tmp
mv tmp.tmp filename