I have a list of files each containing six lines. A representative example is shown below.
cat test.fa
>chain A
MIRLGAPQTLVLLTLLVAAVLRCQGQDVQEAGSCVQDGQRYNDKDVWKPEPCRICVCDTGTVLCDDIICEDVKDCLSPEIPFGECCPICPTDLATASGQPGPKGQKGEPGDIKDIVGPKGPPGPQGPAGEQGPRGDRGDKGEKGAPGPRGRDGEPGTPGNPGPPGPPGPPGPPGLGGNFAAQMAGGFDEKAGGAQLGVMQGPMGPMGPRGPPGPAGAPGPQGFQGNPGEPGEPGVSGPMGPRGPPGPPGKPGDDGEAGKPGKAGERGPPGPQGARGFPGTPGLPGVKGHRGYPGLDGAKGEAGAPGVKGESGSPGENGSPGPMGPRGLPGERGRTGPAGAAGARGNDGQPGPAGPPGPVGPAGGPGFPGAPGAKGE
>chain B
MMSFVQKGSWLLLALLHPTIILAQQEAVEGGCSHLGQSYADRDVWKPEPCQICVCDSGSVLCDDIICDDQELDCPNPEIPFGECCAVCPQPPTAPTRPPNGQGPQGPKGDPGPPGIPGRNGDPGIPGQPGSPGSPGPPGICESCPTGPQNYSPQYDSYDVKSGVAVGGLAGYPGPAGPPGPPGPPGTSGHPGSPGSPGYQGPPGEPGQAGPSGPPGPPGAIGPSGPAGKDGESGRPGRPGERGLPGPPGIKGPAGIPGFPGMKGHRGFDGRNGEKGETGAPGLKGENGLPGENGAPGPMGPRGAPGERGRPGLPGAAGARGNDGARGSDGQPGPPGPPGTAGFPGSPGAKGEVGPAGSPGSNGAPGQRGEPGPQGH
>chain C
MLPQIPFLLLVSLNLVHGVFYAERYQMPTGIKGPLPNTKTQFFIPYTIKSKGIAVRGEQGTPGPPGPAGPRGHPGPSGPPGKPGYGSPGLQGEPGLPGPPGPSAVGKPGVPGLPGKPGERGPYGPKGDVGPAGLPGPRGPPGPPGIPGPAGISVPGKPGQQGPTGAPGPRGFPGEKGAPGVPGMNGQKGEMGYGAPGRPGERGLPGPQGPTGPSGPPGVGKRGENGVPGQPGIKGDRGFPGEMGPIGPPGPQGPPGERGPEGIGKPGAAGAPGQPGIPGTKGLPGAPGIAGPPGPPGFGKPGLPGLKGERGPAGLPGGPGAKGEQGPAGLPGKPGLTGPPGNMGPQGPKGIPGSHGLPGPKGETGPAGPAGYPGAK
Reading row-by-row another file called test.list, I would like to substitute character position 140 in chain A of test.fa with "0" if the third column is "K" and character position 142 of chain B with "1" if the fourth column is E. Same for other rows.
cat test.list
A-B 140-142 K E
B-C 140-142 K E
A-B 299-301 K E
B-C 299-301 K E
I cannot figure out how to get a headstart. Really appreciate any help!
Interpretation of the task
Files are to be processed in pairs where a file with file-extension
.listcontains information needed to modify a corresponding file with file-extension.fa.The
.fafiles containfasta-formatted sequences for three chains identified A, B, or C. I have assumed every.fafile will contain three sequences identified A, B, and C. (if other identifiers are used, the script will need to be modified to extract identifiers for the chains).The
.listfiles contain four sets of conditions specifying changes to be made to the identified chains in the corresponding.fafile. Any changes accumulate as each condition is processed. Each condition is formatted as follows:Using the values in the above (one of four per pair of files) condition example, the instructions become:
The changed chains are then subjected to tests for the next three condition sets pertaining to that sequence file, accumulating changes with each condition set.
Approach
Since many paired files may exist, a
bash scriptcan be used to interrogate adirectorycontaining target files to identify pairs of matched files to be processed. The target directory will be passed as an argument to the script.The bash script will then pass pairs of matched files to an
awkscript, embedded in thebash script, where the values of the.listfile can be read and the corresponding changes made to the.fafiles.Because of the potential for unforeseen errors, the original
.fafiles are not overwritten. Instead files containing the substituted sequences will be written to a set of new files (in asub-directoryof the target directory).In order to track the changes and report any unmatched files, the script will also create a
log fileto record the changes made to each file and note any unmatched files.awkstepsThe
bash scriptpasses the.listfile toawkbefore the corresponding.fasequence file. Thus, when the record numberNR(equivalent to accumulated line numbers across both files but beginning with the.listfile) is equal to the record number of the current fileFNR, the script is processing the first file and an action block can be assigned to perform steps only on the data in the.listfile. Using this, the conditions data (in the.listfile) can be re-formatted into an array for easier reading later on:(the comment note examples relate to the example condition
A-B 140-142 K Eas used above). A 2-d array namedconditionsis built containing the data needed for the tests later on.A set of three pairs of short blocks follow. These extract the sequence runs for the three chains A, B, and C, storing them in an associative array indexed with the relevant letter. The first of each pair identifies the record (line) number
FNRof the line following the line in which the chain identifier is found, storing it in a variable namedaline,blineorclinefor the three chains in each.fafile:When the line after the line containing the label, for example 'chain A', is encountered (
FNR==aLine) the sequence found in that line is recorded in thechainsarray.The condition tests and substitutions are made entirely in the
awkENDblock, which executes once after the earlier blocks have been applied to each record (line) of the argument files.The
ENDblock contains a loop to cycle through each condition, making any specified changes before moving to the next condition.The logic test is performed inside the loop using an
ifconditional, which specifies that both conditions specified in the current condition line if the block of code is to be executed.The right hand side of each comparison refers to the condition letter 'K' or 'E' from the earlier example, checking if it is currently present at the position specified in the left hand side of the condition. The current character at that position is extracted from the sequence, stored in
chains[conditions[i,1]](which resolves to chain array elements e.gchains[A]orchains[B]) usingawk'ssubstrfunction to extract the single character present at positioncondition[i,3].If both conditions are met, the substitutions are made by joining
substrs of the regions flanking the target position either side of the required '0' or '1'. The sequences, altered or not, are then written to file.Implementing the script
The target files, one
.listfile for each.fa, file should be placed together in a directory. The directory name will be the argument for the script. The name part of the filenames should match for the matched pairs (e.gsequence1.fashould have a matched conditions file namedsequence1.list).(assuming the script is saved in a file named
process_list_fa.sh). The script must be made executable. This can be done on the command line using:The script is then called from the command line as follows:
The
path/to/directoryargument must represent the directory under which the matched files are stored. (don't use a trailing '/').The script will execute, reporting progress to the command line. It may halt if, for example, the number of .fa files is different to the number of .list files, and await confirmation to continue.
Once complete, the target directory will now contain a sub directory named
output. Inside theoutputdirectory will be a collection of files containg the modified sequences. Each will be named as it was originally but pre-pended withmodified_. Alog.txtfile is also created summarising the changes made, and any files skipped. This may be useful for checking the files, especially if large numbers of pairs are processed.The
process_list_fa.shscriptcopy the entire following snippet and save in a file named
process_list_fa.sh. Make the file executable (chmod +x process_list_fa.sh) and execute it passing the directory containing the sequences as an argument (./process_list_fa.sh path/directory)