I need to get data from a Postgres database into an Accumulo database. We're hoping to use sequence files to run a map/reduce job to do this, but aren't sure how to start. For internal technical reasons, we need to avoid Sqoop.
Will this be possible without Sqoop? Again, I'm really not sure where to start. Do I write a Java class that reads all the records (millions of them) over JDBC and somehow writes them out to an HDFS sequence file?
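Something like this rough sketch is what I'm imagining, if that's even the right direction (the connection details, table, and column names below are just placeholders):

```java
// Rough sketch: stream rows out of Postgres over JDBC and append them
// to an HDFS SequenceFile as (LongWritable key, Text value) pairs.
// Connection URL, table, column names, and paths are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PgToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("hdfs:///data/mytable.seq");   // placeholder output path

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {

            conn.setAutoCommit(false);      // needed so the Postgres driver honors fetchSize
            stmt.setFetchSize(10000);       // stream rows instead of loading millions into memory

            try (ResultSet rs = stmt.executeQuery("SELECT id, big_text_col FROM mytable");
                 SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                         SequenceFile.Writer.file(out),
                         SequenceFile.Writer.keyClass(LongWritable.class),
                         SequenceFile.Writer.valueClass(Text.class))) {

                LongWritable key = new LongWritable();
                Text value = new Text();
                while (rs.next()) {
                    key.set(rs.getLong("id"));
                    value.set(rs.getString("big_text_col"));  // no delimiter parsing involved
                    writer.append(key, value);
                }
            }
        }
    }
}
```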
Thanks for any input!
P.S. - I should have mentioned that using a delimited file is the problem we're having now. Some of our fields are long character fields that contain the delimiter and therefore don't parse correctly; a field may even have a tab in it. That's why we wanted to go from Postgres straight to HDFS without parsing.
You can export the data from your database as CSV, tab-delimited, pipe-delimited, or Ctrl-A (Unicode 0x0001) delimited files. Then you can copy those files into HDFS and run a very simple MapReduce job, possibly consisting of just a Mapper, configured to read the file format you used and to write out sequence files.
This would allow you to distribute the load of creating the sequence files across the servers of the Hadoop cluster.
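As a rough illustration of that kind of job (the paths, the Ctrl-A delimiter, and treating the first field as the key are assumptions on my part, not something your schema dictates), a map-only conversion might look like this:

```java
// A minimal map-only job of the kind described above: it reads
// Ctrl-A-delimited lines from HDFS and writes them back out as a
// SequenceFile of (Text key, Text value).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class DelimitedToSequenceFile {

    public static class ConvertMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            // Split on Ctrl-A (\u0001); first field becomes the key, the rest the value.
            String[] fields = line.toString().split("\u0001", 2);
            outKey.set(fields[0]);
            outValue.set(fields.length > 1 ? fields[1] : "");
            context.write(outKey, outValue);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "delimited-to-sequencefile");
        job.setJarByClass(DelimitedToSequenceFile.class);

        job.setMapperClass(ConvertMapper.class);
        job.setNumReduceTasks(0);                      // map-only: mapper output is written directly

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS dir with the exported files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // destination for the sequence files

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would run it with the HDFS input directory and an output directory as arguments, and Hadoop will parallelize the conversion across the cluster for you.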
Also, most likely this will not be a one-time deal: you will have to load data from the Postgres database into HDFS on a regular basis. Then you would be able to tweak your MapReduce job to merge the new data in.