I have a names.dmp file which contains taxonomy ids and scientific names among other details.
I want to fetch the scientific name of a particular tax-id, for which I am running this command:
cat names.dmp | grep "scientific name" | awk '$1~/^10090$/{print $0}' | cut -d "|" -f1,2
which gives me the output:
10090 | Mus musculus
But I need this to be dynamic, i.e., set a variable id=10090 and use this variable inside the regular expression. I need an exact match of the value when using id, because otherwise I also get entries such as 210090 and 100904 in the output, which are not needed.
I am quite inexperienced when it comes to awk, so any help is appreciated.
EDIT:
Here is the example input:
10089 | Mus formosanus Kuroda, 1925 | | authority |
10089 | Mus formosanus | | synonym |
10089 | ricefield mouse | | common name |
10089 | Ryukyu mouse | | genbank common name |
10090 | house mouse | | genbank common name |
10090 | LK3 transgenic mice | | includes |
10090 | mouse | mouse <Mus musculus> | common name |
10090 | Mus musculus Linnaeus, 1758 | | authority |
10090 | Mus musculus | | scientific name |
10090 | Mus sp. 129SV | | includes |
10090 | nude mice | | includes |
10090 | transgenic mice | | includes |
10091 | Mus castaneus | | synonym |
10091 | Mus musculus castaneus | | scientific name |
10091 | Mus musculus castaneus Waterhouse, 1843 | | authority |
10091 | southeastern Asian house mouse | | genbank common name |
10092 | Mus domesticus | | synonym |
10092 | Mus musculus domesticus Schwarz & Scharz 1943 | | authority |
10092 | Mus musculus domesticus | | scientific name |
10092 | Mus musculus praetextus | | synonym |
100902 | Fusarium oxysporum f. sp. conglutinans | | scientific name |
100903 | Fusarium oxysporum f. sp. fragariae | | scientific name |
100905 | Cloning vector pACN | | scientific name |
100906 | Nitrosomonas sp. ENI-11 | | scientific name |
100907 | Chilean sea bass | | common name |
And the output I need is:
10090 | Mus musculus
When you use awk, you frequently don't need anything else:
-F'[[:space:]]*\\|[[:space:]]*': set the input field separator to a space-surrounded |.
-v id="10090": declare the awk variable id and assign it 10090 (change this if needed).
If the fourth field is scientific name and the first field equals id, print the first two fields separated by |.
As noted in the comments, this does not preserve the input field separators. If you want to preserve them, you can use the split function of GNU awk, instead of the input field separator, to save the fields in one array and the separators in another.
Finally, if your awk is not GNU awk but you want to preserve the field separators, you can use match and substr instead of split. We simply use match to find the index of the first | (a), then the index of the first space before the second | (b), and print only everything before that (substr).
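As a rough sketch of the three variants described above (not necessarily the exact commands intended), the following script runs each of them against a small sample. The names sample, out1, out2, out3, f and s are assumptions made for illustration, and the 210090 line is a made-up placeholder entry added only to exercise the exact match. The GNU awk variant is skipped when gawk is not installed.

```shell
#!/bin/sh
# Hypothetical sample file with the same layout as the example input.
# The 210090 line is a made-up placeholder, not real taxonomy data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
10090 | house mouse | | genbank common name |
10090 | Mus musculus | | scientific name |
100902 | Fusarium oxysporum f. sp. conglutinans | | scientific name |
210090 | placeholder name | | scientific name |
EOF

id=10090   # the tax-id to look up, passed to awk with -v

# 1) Split on space-surrounded pipes; compare field 1 with id by equality
#    instead of a regex, so 210090 and 100902 are not selected.
out1=$(awk -F'[[:space:]]*\\|[[:space:]]*' -v id="$id" \
  '$4 == "scientific name" && $1 == id { print $1 " | " $2 }' "$sample")

# 2) GNU awk only: four-argument split() stores the fields in f and the
#    separators in s, so the original separator can be printed back.
out2=""
if command -v gawk >/dev/null 2>&1; then
  out2=$(gawk -v id="$id" '
    { split($0, f, /[[:space:]]*\|[[:space:]]*/, s) }
    f[4] == "scientific name" && f[1] == id { print f[1] s[1] f[2] }
  ' "$sample")
fi

# 3) Non-GNU awk: locate the cut point with match() and substr().
out3=$(awk -F'[[:space:]]*\\|[[:space:]]*' -v id="$id" '
  $4 == "scientific name" && $1 == id {
    a = match($0, /\|/)                                 # index of the first |
    b = a + match(substr($0, a + 1), /[[:space:]]*\|/)  # first space before the second |
    print substr($0, 1, b - 1)                          # everything before that space
  }' "$sample")

printf 'FS variant:    %s\n' "$out1"
[ -n "$out2" ] && printf 'split variant: %s\n' "$out2"
printf 'match variant: %s\n' "$out3"

rm -f "$sample"
```

For this sample each variant should yield 10090 | Mus musculus. To run it against the real file, replace "$sample" with names.dmp and set id as needed.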