How can one read FASTA files directly into a data frame in R using only base code? These files store biological sequences (e.g. DNA or protein) and, in their simplest layout, have 2*n lines for n individual bio-molecules (id1 through idn), so they look like this:
>id1 #(always starts with a `>`)
seq1
>id2
seq2
...
>idn
seqn
If one wants to stay in base R (instead of using dedicated packages like Biostrings and seqinr, which introduce their own classes for manipulating bio-sequences), how can one use e.g. read.table to get a simple data frame with an id and a seq column?
It certainly is possible in base R. Consider the following example and function:
However, be warned: the function does not currently handle, e.g., special characters, which are rather commonplace in sequence files (in contexts such as 5' or #5 rRNA).
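For illustration, here is a minimal sketch of one base-R approach. It is not the function referred to above. The name `read_fasta_df` and the sample records are hypothetical, and the sketch assumes that every record starts with a `>` header line. It uses `readLines` rather than `read.table`, so characters such as `'` and `#` are read as plain text.

```r
# Hypothetical helper: read a FASTA file into a data frame with id and seq columns.
read_fasta_df <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]            # drop empty lines

  is_header <- startsWith(lines, ">")      # header lines begin with ">"
  ids <- sub("^>", "", lines[is_header])   # strip the leading ">"

  # Assign every sequence line to the most recent header, so that
  # sequences wrapped over several lines are joined into one string.
  group <- cumsum(is_header)
  seq_group <- factor(group[!is_header], levels = seq_along(ids))
  seqs <- vapply(split(lines[!is_header], seq_group),
                 paste, character(1), collapse = "")

  data.frame(id = ids, seq = unname(seqs), stringsAsFactors = FALSE)
}

# Made-up example file: quote and hash characters in the headers,
# and a second sequence wrapped over two lines.
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">id1 5' region", "ACGT",
             ">id2 #5 rRNA", "GGCC", "TTAA"), tmp)
fasta_df <- read_fasta_df(tmp)
print(fasta_df)
```

If you prefer `read.table`, note that its defaults treat `'` as a quote character and `#` as the start of a comment, so you would likely need `quote = ""` and `comment.char = ""` to keep such lines intact.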