Replacing fasta header without key-value pair file


I am trying to replace FASTA headers, without a key-value pair file, using bioawk:

for infile in $(ls *.faa)
do
        prefix=$(basename $infile .faa)
        bioawk -c fastx '{ print ">"$prefix"_" ++i "\n"$seq }' < ${infile} > ${prefix}_hdrrn.faa
done

Basically, I want to change the headers in my species_name.faa file to

>species_name_1
>species_name_2
...
>species_name_n

The problem is that $prefix inside the bioawk print statement does not work. The error I get is:

bioawk: illegal field $(), name "prefix" input record number 1, file source line number 1

Why is the substitution not happening?

Arun On BEST ANSWER

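This turned out to be easy with seqkit. The {nr} placeholder in the replacement string inserts the record number, and --nr-width 4 zero-pads it to four digits: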

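# Glob directly instead of parsing ls, and quote every expansion
for infile in *.faa
do
    prefix=$(basename "$infile" .faa)
    echo "prefix is ${prefix}"
    # -p '.+' matches the whole header; {nr} is expanded by seqkit, not the shell
    seqkit replace -p '.+' -r "${prefix}_{nr}" --nr-width 4 "$infile" -o "${prefix}_hdrrn.faa"
done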
echo "done"
Ed Morton On

I'm guessing that $seq is supposed to be the contents of the current input line, i.e. $0 and if so then you should probably be using:

#!/usr/bin/env bash

outdir='/some/where'
mkdir -p "$outdir" &&
for infile in *.faa
do
    prefix="${infile%.*}"
    bioawk -c fastx -v pfx="$prefix" '{ print ">" pfx "_" NR ORS $0 }' < "$infile" > "${outdir}/${prefix}_hdrrn.faa"
done

Note that I'm creating the output files in some other directory so their names don't clash with your input file names.
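To spell out why the original command failed: the awk program is in single quotes, so the shell never expands $prefix. awk then reads $prefix as a field reference whose index is the awk variable prefix, which is unset, hence the "illegal field $()" error. Here is a minimal sketch using plain awk rather than bioawk (the sample prefix and input lines are made up for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical value; in the question it is derived from the file name.
prefix='species_name'

# Single quotes stop the shell from expanding $prefix, so awk receives
# this text verbatim and reads $prefix as a field reference.
program='{ print ">" $prefix "_" NR }'
printf 'awk receives: %s\n' "$program"

# -v copies the shell value into the awk variable pfx (used without "$").
# The two input lines are made-up stand-ins for records.
renamed=$(printf 'MKV\nMAA\n' | awk -v pfx="$prefix" '{ print ">" pfx "_" NR }')
printf '%s\n' "$renamed"
```

Inside the awk program, pfx is an ordinary awk variable, so it is written without a leading "$".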

You don't actually need a shell loop though, you could just do it all in one call to awk:

#!/usr/bin/env bash

outdir='/some/where'
mkdir -p "$outdir" &&
bioawk -c fastx -v outdir="$outdir" '
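    FILENAME != prev {
        close(out)
        # Output name: insert _hdrrn before the extension
        outfile = FILENAME
        sub(/\.[^.]+$/,"_hdrrn&",outfile)
        out = outdir "/" outfile
        # Header prefix: the input file name without its extension
        pfx = FILENAME
        sub(/\.[^.]+$/,"",pfx)
        prev = FILENAME
    }
    { print ">" pfx "_" FNR ORS $0 > out }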
' *.faa

The above is assuming that bioawk, which I've never used and don't have/want a copy of, essentially behaves the same way as every other awk.
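If bioawk isn't available, the same renaming can be sketched with any POSIX awk by treating every line that starts with ">" as a header. This is a swapped-in technique, not the bioawk approach: rename_headers is a hypothetical helper name and the sample records are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename_headers PREFIX < in.faa > out.faa
rename_headers() {
    awk -v pfx="$1" '
        # Header line: replace it with the prefix plus a running number
        /^>/ { print ">" pfx "_" (++n); next }
        # Sequence line: copy it unchanged (multi-line sequences are fine)
        { print }
    '
}

# Made-up two-record input; the first sequence spans two lines
renamed=$(printf '>seqA desc\nMKV\nLLT\n>seqB\nMAA\n' | rename_headers species_name)
printf '%s\n' "$renamed"
```

Because it only touches lines beginning with ">", wrapped sequence lines keep their original layout.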

Timur Shtatland On

Use find, which is safer than ls to iterate over files. Use a Perl one-liner to change sequence headers:

find . -mindepth 1 -maxdepth 1 -name '*.faa' -exec \
  perl -i.bak -lpe 'if ( $. == 1 ) { chomp ( $species_name = `basename $ARGV ".faa"` ); } s{^>.*}{">${species_name}_" . ++$idx}e' {} \;

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak. If you want to skip writing a backup file, just use -i and skip the extension.

$. : Current input line number.
$ARGV : Current input file name.
chomp : Remove input line separator, here, newline.

s{PATTERN}{REPLACEMENT} : Replace regex PATTERN with REPLACEMENT.
/e : Evaluate REPLACEMENT as an expression in s{PATTERN}{REPLACEMENT}.
^> : Literal > at the beginning of the line.
.* : Any character repeated 0 or more times, i.e. the rest of the header line.
++$idx : first, increment $idx by 1, then return its value. When first used, $idx is not defined and is treated as 0 in this expression. Thus, the first header has suffix 1.

See also: