I have a file (fasta file to be specific) that I would like to index, so that I can quickly locate any substring within the file and then find the location within the original fasta file.
This would be easy to do in many cases, using a Trie or substring array, unfortunately the strings I need to index are 800+ MBs which means that doing them in memory in unacceptable, so I'm looking for a reasonable way to create this index on disk, with minimal memory usage.
(edit for clarification)
I am only interested in the headers of proteins, so for the largest database I'm interested in, this is about 800 MBs of text.
I would like to be able to find an exact substring within O(N) time based on the input string. This must be useable on 32 bit machines as it will be shipped to random people, who are not expected to have 64 bit machines.
I want to be able to index against any word break within a line, to the end of the line (though lines can be several MBs long).
Hopefully this clarifies what is needed and why the current solutions given are not illuminating.
I should also add that this needs to be done from within java, and must be done on client computers on various operating systems, so I can't use any OS Specific solution, and it must be a programatic solution.
I don't imagine that the original poster still has this problem, but anyone needing FASTA file indexing and subsequence extraction should check out fastahack: http://github.com/ekg/fastahack
It uses an index file to count newlines and sequence start offsets. Once the index is generated you can rapidly extract subsequences; the extraction is driven by fseek64.
It will work very, very well in the case that your sequences are as long as the poster's. However, if you have many thousands or millions of sequences in your FASTA file (as is the case with the outputs from short-read sequencing or some de novo assemblies) you will want to use another solution, such as a disk-backed key-value store.