Am I using the proper command?


I am trying to write a one-line terminal command to count all the unique "gene-MIR" identifiers in a very large file. Each "gene-MIR" is followed by a series of digits, e.g. gene-MIR334223, gene-MIR633235, gene-MIR53453, and the same identifier can occur multiple times, e.g. gene-MIR342433 may show up 10 times in the file.

My question is, how do I write a command that will identify and count the unique "gene-MIR" identifiers that are present in my file?

The commands I have been using so far are:

  1. grep -c "gene-MIR" myfile.txt | uniq

  2. grep "gene-MIR" myfile.txt | sort -u

The first command gives me a count; however, I believe it ignores the digits after "MIR" and only counts how many times "gene-MIR" itself appears.

Thanks!

There are 2 best solutions below


Assuming all the entries are on separate lines, try this:

grep "gene-MIR" myfile.txt | sort | uniq -c

If the entries are mixed in with other text and the system has GNU grep, use -o to print only the matching part of each line:

grep -o 'gene-MIR[0-9]*' myfile.txt | sort | uniq -c

To get the total number of occurrences (duplicates included):

grep -o 'gene-MIR[0-9]*' myfile.txt  | wc -l
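The pipeline above counts every occurrence, duplicates included. If the goal is the number of distinct identifiers, as in the question, one option is to deduplicate before counting. This is a minimal sketch, assuming a throwaway sample file created with mktemp (the file contents and the variable names sample and unique_count are made up for illustration) and assuming every identifier has at least one digit after "gene-MIR":

```shell
# Hypothetical sample data: three distinct IDs, some repeated, mixed with other text.
sample=$(mktemp)
printf '%s\n' \
  'Nonsense' \
  'chr1 gene-MIR4232 plus' \
  'chr2 gene-MIR2334 minus' \
  'gene-MIR93284' \
  'chr1 gene-MIR4232 plus' \
  'gene-MIR2334 and gene-MIR93284 on one line' \
  'More nonsense' > "$sample"

# -o prints each match on its own line; sort -u keeps one copy of each;
# wc -l counts the remaining lines, i.e. the distinct identifiers.
unique_count=$(grep -o 'gene-MIR[0-9][0-9]*' "$sample" | sort -u | wc -l)
echo "distinct gene-MIR identifiers: $unique_count"

rm -f "$sample"
```

The pattern 'gene-MIR[0-9][0-9]*' is slightly stricter than 'gene-MIR[0-9]*': it requires at least one digit, so a bare "gene-MIR" with no number is not counted as an identifier.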

If you have information like this:

Inf1
Inf2
Inf1
Inf2

And you want to know how many distinct kinds of "Inf" there are, you always need to sort it first, because uniq only collapses duplicate lines that are adjacent. Only afterwards can you start counting.
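The effect of sorting can be checked on the four lines above. This is a small sketch; the variable names data, unsorted_kinds and sorted_kinds are illustrative only:

```shell
# The four example lines, with duplicates that are NOT adjacent.
data=$(printf 'Inf1\nInf2\nInf1\nInf2\n')

# Without sorting, uniq finds no adjacent duplicates and keeps all four lines.
unsorted_kinds=$(printf '%s\n' "$data" | uniq | wc -l)

# After sorting, identical lines sit next to each other and collapse to two.
sorted_kinds=$(printf '%s\n' "$data" | sort | uniq | wc -l)

echo "without sort: $unsorted_kinds, with sort: $sorted_kinds"
```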

Edit

I've created a similar file containing the examples mentioned in the requester's comment, as follows:

Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR4232
gene-MIR2334
gene-MIR93284
More nonsense

On that file, I've applied both commands from the question:

grep -c "gene-MIR" myfile.txt | uniq

Which results in 6, just like the following command:

grep -c "gene-MIR" myfile.txt

Why? Because the question this command answers is "How many lines contain the string "gene-MIR"?". grep -c prints a single number, so uniq has nothing left to deduplicate.
This is clearly not the requested information.

The other command is not correct either:

grep "gene-MIR" myfile.txt | sort -u

The result:

gene-MIR2334
gene-MIR4232
gene-MIR93284

Explanation:
grep "gene-MIR" ... means: show all the lines that contain "gene-MIR".
| sort -u means: sort those lines and, if the same line occurs multiple times, show it only once.

This is also not what the requester wants. Therefore I have the following proposal:

grep "gene-MIR" myfile.txt | sort | uniq -c

With the following result:

      2 gene-MIR2334
      2 gene-MIR4232
      2 gene-MIR93284

This is more what the requester is looking for, I presume.

What does it mean?
grep "gene-MIR" myfile.txt : only show the lines which contain "gene-MIR"
| sort : sort those lines, which gives an intermediate result like this:

    gene-MIR2334
    gene-MIR2334
    gene-MIR4232
    gene-MIR4232
    gene-MIR93284
    gene-MIR93284

| uniq -c : collapse adjacent identical lines into one and prefix each with its number of occurrences.

Unfortunately, the example is badly chosen, as every instance occurs exactly twice. Therefore, for clarification, I've created another "myfile.txt", as follows:

Nonsense
gene-MIR4232
gene-MIR2334
gene-MIR93284
gene-MIR2334
gene-MIR2334
gene-MIR93284
More nonsense

I've applied the same command again:

grep "gene-MIR" myfile.txt | sort | uniq -c

With the following result:

      3 gene-MIR2334
      1 gene-MIR4232
      2 gene-MIR93284

Here you can see much more clearly that the proposed command is correct.

... and your next question is: "Yes, but is it possible to sort the result by count?", to which I answer:

grep "gene-MIR" myfile.txt | sort | uniq -c | sort -n

With the following result:

      1 gene-MIR4232
      2 gene-MIR93284
      3 gene-MIR2334
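If the most frequent identifiers should come first instead, the final sort can be reversed and head can limit the output. This is a minimal sketch, assuming the second example file is recreated in a temporary file (the names sample, ranked and top_id are illustrative, not part of the original commands):

```shell
# Recreate the second example file in a temporary location.
sample=$(mktemp)
printf '%s\n' Nonsense gene-MIR4232 gene-MIR2334 gene-MIR93284 \
  gene-MIR2334 gene-MIR2334 gene-MIR93284 'More nonsense' > "$sample"

# sort -rn orders the counted lines numerically, largest count first.
ranked=$(grep "gene-MIR" "$sample" | sort | uniq -c | sort -rn)
echo "$ranked"

# head -n 1 keeps only the most frequent entry; awk extracts its identifier.
top_id=$(echo "$ranked" | head -n 1 | awk '{print $2}')
echo "most frequent: $top_id"

rm -f "$sample"
```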

Have fun!