I've got data with 500M+ records in a file with two fields: blob and c_id.
There are also two other files with the same data in other formats: a file with the 90M unique blobs, and a file with one blob plus a comma-separated list of c_ids per record.
I've got two tables:
table_a: [id, blob] # id is auto-increment
table_b: [a_id, c_id]
For each unique blob, a record in table_a must be created. For each record in the file, a record in table_b must be created with the appropriate foreign key to table_a.
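To be explicit, the schema looks roughly like this (the column types are illustrative guesses, and the unique key on `blob` is my assumption to enforce one row per unique blob; note that blob is a reserved word in MySQL, hence the backticks):

CREATE TABLE table_a (
  id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `blob` VARBINARY(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_blob (`blob`)   -- assumed: one row per unique blob
);

CREATE TABLE table_b (
  a_id INT UNSIGNED NOT NULL,   -- foreign key to table_a.id
  c_id INT UNSIGNED NOT NULL
);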
The solution I use now is to generate INSERT statements using LAST_INSERT_ID(), but it's too slow. I'd prefer to use LOAD DATA INFILE, but the auto-increment id complicates things.
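The generated statements look roughly like this per raw record (a sketch; it relies on the assumed unique key on `blob`, plus the documented trick that ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id) makes LAST_INSERT_ID() return the existing row's id):

-- Two round trips per raw record: far too slow at 500M+ rows.
INSERT INTO table_a (`blob`) VALUES ('aaaa')
  ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
INSERT INTO table_b (a_id, c_id) VALUES (LAST_INSERT_ID(), 1);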
For example:
# Raw data
c_id blob
1 aaaa
2 aaaa
3 aaaa
3 aaab
4 aaac
Desired output:
# Table_a
id blob
1 aaaa
2 aaab
3 aaac
# Table_b
c_id a_id
1 1
2 1
3 1
3 2
4 3
I am not sure how you are populating the "c_id" field for table_b, but you can do something like this: load all the data into table_a first, then build table_b by executing batch queries like:
"SELECT id, `blob` FROM table_a WHERE id > 0 ORDER BY id LIMIT 100000 INTO OUTFILE '/tmp/file1.csv' FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n'"
(the blob is exported alongside the id so raw records can be matched to their new ids; for each subsequent batch, replace 0 with the last id exported) and then using LOAD DATA INFILE on '/tmp/file1.csv'.
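Spelled out end to end, that could look like the following (a sketch under assumptions: the file names, the tab-delimited raw file, and the merge step in the middle are mine, not givens):

-- 1) Bulk-load the 90M unique blobs; auto-increment assigns the ids.
LOAD DATA INFILE '/tmp/unique_blobs.txt'
INTO TABLE table_a
LINES TERMINATED BY '\n'
(`blob`);

-- 2) Export the id/blob mapping in batches, as above. For each
--    subsequent batch, replace 0 with the last id already exported.
SELECT id, `blob`
FROM table_a
WHERE id > 0
ORDER BY id
LIMIT 100000
INTO OUTFILE '/tmp/file1.csv'
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n';

-- 3) Join the exported mapping against the raw 500M-row file outside
--    MySQL (or via a staging table) to produce a_id;c_id pairs, then
--    bulk-load the result.
LOAD DATA INFILE '/tmp/table_b.csv'
INTO TABLE table_b
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
(a_id, c_id);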