On Disk Substring index

Question

On Disk Substring index

514 Views Asked by emeryc At 09 September 2008 at 23:48

I have a file (fasta file to be specific) that I would like to index, so that I can quickly locate any substring within the file and then find the location within the original fasta file.

This would be easy to do in many cases, using a Trie or substring array, unfortunately the strings I need to index are 800+ MBs which means that doing them in memory in unacceptable, so I'm looking for a reasonable way to create this index on disk, with minimal memory usage.

(edit for clarification)

I am only interested in the headers of proteins, so for the largest database I'm interested in, this is about 800 MBs of text.

I would like to be able to find an exact substring within O(N) time based on the input string. This must be useable on 32 bit machines as it will be shipped to random people, who are not expected to have 64 bit machines.

I want to be able to index against any word break within a line, to the end of the line (though lines can be several MBs long).

Hopefully this clarifies what is needed and why the current solutions given are not illuminating.

I should also add that this needs to be done from within java, and must be done on client computers on various operating systems, so I can't use any OS Specific solution, and it must be a programatic solution.

Original Q&A

There are 4 best solutions below

**nlucaroni** · Answer 1 · 2008-09-10T00:02:25.823000

I talked to a few co-workers and they just use VIM/Grep to search when they need to. Most of the time I wouldn't expect someone to search for a substring like this though.

But I don't see why MS Desktop search or spotlight or google's equivalent can't help you here.

My recommendation is splitting the file up --by gene or species, hopefully the input sequences aren't interleaved.

**Stu Thompson** · Answer 2 · 2008-09-10T07:03:43.627000

In some languages programmers have access to "direct byte arrays" or "memory maps", which are provided by the OS. In java we have java.nio.MappedByteBuffer. This allows one to work with the data as if it were a byte array in memory, when in fact it is on the disk. The size of the file one can work with is only limited by the OS's virtual memory capabilities, and is typically ~<4GB for 32-bit computers. 64-bit? In theory 16 exabytes (17.2 billion GBs), but I think modern CPUs are limited to a 40-bit (1TB) or 48-bit (128TB) address space.

This would let you easily work with the one big file.

**Adam Davis** · Answer 3 · 2008-09-13T13:25:57.317000

The FASTA file format is very sparse. The first thing I would do is generate a compact binary format, and index that - it should be maybe 20-30% the size of your current file, and the process for coding/decoding the data should be fast enough (even with 4GB) that it won't be an issue.

At that point, your file should fit within memory, even on a 32 bit machine. Let the OS page it, or make a ramdisk if you want to be certain it's all in memory.

Keep in mind that memory is only around $30 a GB (and getting cheaper) so if you have a 64 bit OS then you can even deal with the complete file in memory without encoding it into a more compact format.

Good luck!

-Adam

**Erik Garrison** · Answer 4 · 2010-05-07T17:16:44.740000

I don't imagine that the original poster still has this problem, but anyone needing FASTA file indexing and subsequence extraction should check out fastahack: http://github.com/ekg/fastahack

It uses an index file to count newlines and sequence start offsets. Once the index is generated you can rapidly extract subsequences; the extraction is driven by fseek64.

It will work very, very well in the case that your sequences are as long as the poster's. However, if you have many thousands or millions of sequences in your FASTA file (as is the case with the outputs from short-read sequencing or some de novo assemblies) you will want to use another solution, such as a disk-backed key-value store.

On Disk Substring index

There are 4 best solutions below

Related Questions in INDEXING

Related Questions in SUBSTRING

Related Questions in ON-DISK

Trending Questions

Popular # Hahtags

Popular Questions