I have a CSV file representing a large matrix that I want to upload to an Apache HBase instance (running on AWS EMR, though that shouldn't matter). The CSV has ~15,000 columns and ~50,000 rows, and the cell values are integers.
The CSV looks something like this:
```
ROW_KEY   col1  col2  col3  ...  col15000
row1      0     1     125   ...  456
row2      23    23    45    ...  ...
row3      ...   ...   ...   ...  ...
...
row50000  ...
```
I'm planning to use a single column family in my HBase schema, with each CSV column (col1, col2, etc.) as a column qualifier.
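For concreteness, the table I have in mind would be created like this with happybase (a minimal sketch; the table name `matrix` and family name `cf` are placeholders I made up):

```python
import happybase

# Hypothetical host; on EMR this would be wherever the Thrift server runs.
connection = happybase.Connection('emr-master-node')

# A single column family 'cf'; every CSV column becomes a qualifier under it.
connection.create_table('matrix', {'cf': dict()})
```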
I've tried iterating over the CSV in a Python script and uploading each row with happybase, but that turned out to be quite slow at this scale.
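For reference, the script was roughly this (a sketch, not my exact code; the host, table, and file names are placeholders, and I batch puts to reduce round trips):

```python
import csv
import happybase

connection = happybase.Connection('emr-master-node')  # hypothetical host
table = connection.table('matrix')

with open('matrix.csv') as f:
    reader = csv.reader(f)          # assuming a comma delimiter
    header = next(reader)[1:]       # drop ROW_KEY, keep col1..col15000

    # batch() buffers mutations client-side and flushes every batch_size
    # puts, but with ~15,000 qualifiers per row it is still slow overall.
    with table.batch(batch_size=1000) as batch:
        for row in reader:
            row_key, values = row[0], row[1:]
            batch.put(
                row_key.encode(),
                {f'cf:{col}'.encode(): val.encode()
                 for col, val in zip(header, values)},
            )
```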
I've also looked at the ImportTsv tool, but it requires an argument that lists every column name, e.g.:
-Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf2:exp
Spelling out tens of thousands of columns in that argument doesn't seem like a reasonable solution.
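The argument could in principle be generated from the CSV header rather than typed by hand, along these lines (a sketch; `cf` is my placeholder family name):

```python
import csv

with open('matrix.csv') as f:
    header = next(csv.reader(f))[1:]  # col1 .. col15000

columns_arg = 'HBASE_ROW_KEY,' + ','.join(f'cf:{c}' for c in header)
print(f'-Dimporttsv.columns={columns_arg}')
```

But with ~15,000 qualifiers the resulting flag would run to well over 100 KB, which may exceed the shell's per-argument limit, so generating it doesn't really solve the problem.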