Handling a LARGE dataset


What is the best solution for handling a LARGE dataset?
I have txt data broken down into multiple files, which add up to about 100 GB. The files are nothing more than

uniqID1 uniqID2 etc

ID pairs, and I want to calculate things like: 1) the number of unique uniqIDs, and 2) the list of other IDs that uniqID1 is linked to.

What is the best solution? How do I load these into a database?

thank you!

1 Answer

So if you had a table with the following columns:

        id1 varchar(10)   -- how long are your IDs? Are they numeric or text?
        id2 varchar(10)

with about five billion rows in the table, and you wanted quick answers to questions such as:

        how many unique values in column id1 are there?
        what is the set of distinct values of id1 where id2 = {some parameter}?

a relational database (that supports SQL) and a table with an index on id1 and another index on id2 would do what you need. SQLite would do the job.
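
A minimal sketch of that schema and the two queries in SQLite (the table name pairs and the literal ID value are placeholders; pick a column type that matches what your IDs actually look like):

        CREATE TABLE pairs (
            id1 TEXT NOT NULL,   -- use INTEGER instead if your IDs are numeric
            id2 TEXT NOT NULL
        );
        CREATE INDEX idx_pairs_id1 ON pairs (id1);
        CREATE INDEX idx_pairs_id2 ON pairs (id2);

        -- 1: how many unique values are there in id1?
        SELECT COUNT(DISTINCT id1) FROM pairs;

        -- 2: which distinct id1 values are linked to a given id2?
        SELECT DISTINCT id1 FROM pairs WHERE id2 = 'uniqID2';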

EDIT: To import them, it would be best to separate the two values with some character that never occurs in the values (a comma, a pipe character, or a tab), one pair per line:

         foo|bar
         moo|mar
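
A sketch of the bulk load using the stock sqlite3 command-line shell (the file name pairs.txt is an assumption; point it at each of your files in turn). Creating the indexes after the import rather than before usually makes a load of this size noticeably faster:

        .separator "|"
        .import pairs.txt pairs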

EDIT2: You don't strictly need a relational database, but it doesn't hurt anything, and your data structure is more extensible if the DB is relational.