How to upload a wide (15,000+ columns) CSV to an Apache HBase instance


I have a CSV file representing a large matrix that I wish to upload to an Apache HBase instance (running on AWS EMR, but that shouldn't matter). The CSV contains ~15,000 columns and ~50,000 rows, and the matrix's cell values are integers.

The CSV looks something like this:

    ROW_KEY   col1  col2  col3  ...  col15000
    row1      0     1     125   456  ...
    row2      23    23    45    ...
    row3      ...   ...   ...
    ...
    row50000

I'm planning on using a single column family in my HBase schema, with each of the CSV columns (col1, col2, etc.) as a column qualifier.
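For completeness, creating such a table is straightforward; here's a minimal sketch via happybase (the host, table name 'matrix', and family name 'cf1' are placeholders, and this assumes the HBase Thrift server is running):

    import happybase

    # Connect to the HBase Thrift server (host is a placeholder).
    connection = happybase.Connection('emr-master-node')

    # One column family ('cf1'); the ~15,000 qualifiers all live inside it.
    connection.create_table('matrix', {'cf1': {}})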

I've looked into iterating over the CSV in a Python script and uploading each row with something like happybase, but that seems to take quite a while.
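For context, this is roughly what that script looks like (a sketch; the host, file, and table names are placeholders, and I'm assuming a comma-delimited file):

    import csv
    import happybase

    connection = happybase.Connection('emr-master-node')
    table = connection.table('matrix')

    with open('matrix.csv') as f:
        reader = csv.reader(f)
        header = next(reader)  # ROW_KEY, col1, col2, ..., col15000

        # Buffer puts and flush every 1,000 rows to reduce Thrift round trips.
        with table.batch(batch_size=1000) as batch:
            for row in reader:
                data = {
                    ('cf1:' + col).encode(): value.encode()
                    for col, value in zip(header[1:], row[1:])
                }
                batch.put(row[0].encode(), data)

Even with batching, each row turns into ~15,000 cells, so the whole upload is on the order of 750 million cells, which I suspect is why it's so slow over Thrift.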

I've also looked into the ImportTsv tool, but it requires an argument listing every column name, along the lines of:

    -Dimporttsv.columns=HBASE_ROW_KEY,cf1:name,cf2:exp

Spelling out tens of thousands of columns on the command line doesn't seem like a good solution.
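The column list could at least be generated from the CSV header rather than typed by hand; a sketch (again assuming a comma-delimited file and a single family 'cf1'):

    import csv

    with open('matrix.csv') as f:
        header = next(csv.reader(f))  # ROW_KEY, col1, ..., col15000

    # Build the value for -Dimporttsv.columns from the header.
    columns = ','.join(['HBASE_ROW_KEY'] + ['cf1:' + c for c in header[1:]])
    print('-Dimporttsv.columns=' + columns)

But the resulting argument would be hundreds of kilobytes long, which seems fragile at best. Is there a recommended way to bulk-load a matrix this wide into HBase?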
