edit only the 1st column of a fasta header by removing strings after '-'

Asked by Rachel on 14 September 2023 at 22:08

I have a fasta file with the following header structure:

>Saurogobio_punctatus-NC_080528.1|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA

Each section is separated by a pipe '|', and the first section is a combination of species_name-accessionID.

I want to remove the accession ID after the hyphen '-' but keep everything else, like this:

>Saurogobio_punctatus|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA

I've tried:

sed -E '/^>/s/(\|[^-]*)-.*$/\1/' input.fasta > output.fasta

But this removes everything after the hyphen '-':

>Saurogobio_punctatus
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA

I've used this piece of code before to edit my header and include the taxid= before my 2nd column:

awk 'BEGIN { FS=OFS="|" } /^>/ { print $1, "taxid=", $2, $3; next } { print }' file.fa > edit_file

I was wondering if there is a way to combine these two commands, so that I edit my first column and then reprint the rest, but I don't know how to do it :(

I appreciate any help with this!

There are 3 best solutions below

markp-fuso On 14 September 2023 at 23:26

One awk idea:

awk '
BEGIN { FS=OFS="|" }
/^>/  { n=index($1,"-"); $1=substr($1,1,n-1) }
1
' input.fasta

This generates:

>Saurogobio_punctatus|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA
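
One caveat with the index/substr approach: if a header has no hyphen in its first field, index() returns 0 and substr($1,1,-1) yields an empty string, so the species name would be lost. A minimal sketch that leaves such headers untouched by using sub() on the first field instead; the function name strip_accession and the shortened sample headers are made up for illustration:

```shell
# strip_accession is a hypothetical helper name, not part of the original answer.
strip_accession() {
  awk '
    BEGIN { FS = OFS = "|" }
    # Header lines only: delete from the first "-" to the end of field 1.
    # sub() changes nothing when field 1 has no hyphen.
    /^>/ { sub(/-.*/, "", $1) }
    1
  ' "$@"
}

# Sample input: one header with an accession ID, one without, one sequence line.
printf '%s\n' \
  '>Saurogobio_punctatus-NC_080528.1|taxid=1771284|Eukaryota' \
  '>No_hyphen_here|taxid=1|Eukaryota' \
  'GCTAGC' | strip_accession
```

Because the header is already split on '|', any other field could be adjusted in the same action before the line is printed.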

jared_mamrot On 14 September 2023 at 23:28

Another potential awk option:

awk '/>/ {gsub("-[^\\|]*\\|", "|", $0); print} !/>/ {print}' input
>Saurogobio_punctatus|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA

Or, a better option, using bioinformatics software designed for the task (https://bioinf.shenwei.me/seqkit/):

seqkit replace -p "\-[^|]*" -r "" input
>Saurogobio_punctatus|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA

Cyrus On 14 September 2023 at 22:12 BEST ANSWER

I suggest using sed:

sed 's/-[^|]*//' file

Output to stdout:

>Saurogobio_punctatus|taxid=1771284|cellularorganisms,Eukaryota,Opisthokonta,Metazoa
GCTAGCGTAGCTTAATATAAAGCATAACACTGAAGATGTTAAGATGAGCCCTAA

See: The Stack Overflow Regular Expressions FAQ
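
Note that the command above is applied to every line, so a '-' in a sequence line (for example, a gap character in an aligned FASTA) would also trigger the substitution. A hedged variant that restricts it to header lines with a /^>/ address; the function name strip_header_accession and the sample input are assumptions for illustration:

```shell
# strip_header_accession is a hypothetical helper name for this sketch.
strip_header_accession() {
  # The /^>/ address limits the substitution to header lines,
  # so gap characters in sequence lines are left alone.
  sed '/^>/ s/-[^|]*//' "$@"
}

# Sample input: a header plus a sequence line containing gap characters.
printf '%s\n' \
  '>Saurogobio_punctatus-NC_080528.1|taxid=1771284|Eukaryota' \
  'GCTA--GCGT' | strip_header_accession
```

For the unaligned file in the question the two forms should behave the same; the address only matters once sequence lines can contain hyphens.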
