I am trying to write a small program that searches a key-value type structure. My goal is to find the FASTEST approach possible for searching the key-value structure.
I would prefer to use C# for this program, unless another language gives me a significant advantage. Another constraint is that everything has to be on the same computer. I don't want to use an Oracle or SQL Server database, because I believe the other options will be much faster. The data is mostly read and rarely written. Whenever there are changes or updates to the data, a new set is created, and it is OK if writing the data takes time.
Assumptions:
The data is sorted in a numeric order.
The structure is as simple as this:
Char3 file: (This file will only store 3 character keys)
Key|Value
100|2,5,6,7:9:3,4,5:3,4,5:2,5,6,7
999|2,5,6,7:9:3,4:3:2,5
Char5 file: (This file will only store 5 character keys)
Key|Value
A1000|2,5,6,7:9:3,4,5:3,4,5:2,5,6,7
Char3 and Char5 follow the same storage structure but have different types of keys. The key, however, will be the same length within a given file.
I have multiple files like these; each file follows the same structure. The only variation is the key length in each file.
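Whichever storage engine wins, each matched line still has to be split into its key and its colon/comma-delimited value groups. A minimal parser sketch for the format shown above (the class and method names are my own, not from any library):

```csharp
using System;
using System.Linq;

// Parses one line of the Char3/Char5 format: "key|g1:g2:...",
// where each group is a comma-separated list of numbers,
// e.g. "100|2,5,6,7:9:3,4,5".
static class RecordParser
{
    public static string GetKey(string line) => line.Split('|')[0];

    public static int[][] GetGroups(string line) =>
        line.Split('|')[1]
            .Split(':')
            .Select(g => g.Split(',').Select(int.Parse).ToArray())
            .ToArray();
}
```

Because the key length is fixed per file, `GetKey` could also be replaced by `line.Substring(0, keyLength)` to avoid scanning for the `|`.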
The task: given a set of 1-200 keys (of varying lengths), find all the data related to each key.
I am generating this data from a database and thus can create the data in any format.
For the FileStream test I am going to pad each line in a given file to a fixed length and then use FileStream.Seek to jump directly to a given location based on that padding.
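Since the keys within a file are sorted and every padded record has the same byte length, the seek offset is just `recordIndex * recordLength`, and a key can be located with a binary search over record indexes. A sketch of that idea, assuming ASCII, space padding, and a record length that includes the newline (all of which are my assumptions about the padding scheme):

```csharp
using System;
using System.IO;
using System.Text;

static class PaddedFileSearch
{
    // Reads the record at recordIndex from a file whose lines are all
    // padded to exactly recordLength bytes (newline included).
    public static string ReadRecord(FileStream fs, int recordLength, long recordIndex)
    {
        fs.Seek(recordIndex * recordLength, SeekOrigin.Begin);
        var buffer = new byte[recordLength];
        int read = fs.Read(buffer, 0, recordLength);
        return Encoding.ASCII.GetString(buffer, 0, read).TrimEnd('\r', '\n', ' ');
    }

    // Binary search over record indexes; the key length is fixed per
    // file, so the key is the first keyLength characters of a record.
    public static string FindByKey(FileStream fs, int recordLength, int keyLength, string key)
    {
        long lo = 0, hi = fs.Length / recordLength - 1;
        while (lo <= hi)
        {
            long mid = lo + (hi - lo) / 2;
            string record = ReadRecord(fs, recordLength, mid);
            int cmp = string.CompareOrdinal(record.Substring(0, keyLength), key);
            if (cmp == 0) return record;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null; // key not present
    }
}
```

For 5000 random lookups this does O(log n) seeks per key; a memory-mapped file would let the same binary search run without any read syscalls.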
What I want to do is find out which of these approaches would be the fastest:
- FileStream - I will eventually also look at memory-mapped files. (Open to other options)
- Embedded SQL - SQLite (Open to other options)
- NoSql - ?? (Looking for suggestions)
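For the embedded-SQL leg to be a fair comparison, the lookup should go through an indexed column rather than a table scan. A sketch of a single-key lookup using the System.Data.SQLite package (the `kv` table name and schema are my own):

```csharp
using System.Data.SQLite; // NuGet package: System.Data.SQLite

static class SqliteLookup
{
    // Assumes the table was created as:
    //   CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT);
    // PRIMARY KEY gives SQLite an index, so this is a B-tree
    // lookup rather than a full scan.
    public static string Find(SQLiteConnection conn, string key)
    {
        using (var cmd = new SQLiteCommand(
            "SELECT value FROM kv WHERE key = @k", conn))
        {
            cmd.Parameters.AddWithValue("@k", key);
            return (string)cmd.ExecuteScalar();
        }
    }
}
```

Reusing one open connection (and a prepared command) across all 5000 lookups matters; opening a connection per key would measure connection overhead, not search speed.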
My question is what should I be using in each of these categories for a proper comparison? For example, if I were using FileStream without FileStream.Seek, then it would not be a proper comparison.
I would eventually also like to run the searches in parallel as much as I can. My primary requirement is SEARCH performance.
Any ideas or suggestions would be great.
Thanks,
UPDATE: I will list the option details and results as I process them.
Find 5000 random entries (by line number or some other similar characteristic) in a file that contains 10K lines, 2.28 MB.
- FileStream options - Best time: 00:00:00.0398530
Your best bet is Berkeley DB, via the C# API (which uses key-value pair storage). Berkeley DB is a library, so it links into your application. There is no separate server to install and no client/server overhead. Berkeley DB is extremely fast, scalable, and reliable, and is designed to do exactly what you describe here.
Disclaimer: I'm the Product Manager for Berkeley DB, so I'm a little biased. But I'm serious when I say that this is exactly the scenario that Berkeley DB is designed for.