I need to get data from a Postgres database into an Accumulo database. We're hoping to use sequence files to run a map/reduce job to do this, but aren't sure how to start. For internal technical reasons, we need to avoid Sqoop.
Will this be possible without Sqoop? Again, I'm really not sure where to start. Do I write a Java class that reads all the records (millions of them) over JDBC and somehow writes them out to an HDFS sequence file?
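Something like this rough sketch is what I'm imagining, if that's even the right direction (the connection details, table, and column names below are just placeholders):

```java
// Rough sketch: stream rows out of Postgres over JDBC and append them
// to an HDFS SequenceFile as (LongWritable key, Text value) pairs.
// Connection URL, table, column names, and paths are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PgToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("hdfs:///data/mytable.seq");   // placeholder output path

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {

            conn.setAutoCommit(false);      // needed so the Postgres driver honors fetchSize
            stmt.setFetchSize(10000);       // stream rows instead of loading millions into memory

            try (ResultSet rs = stmt.executeQuery("SELECT id, big_text_col FROM mytable");
                 SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                         SequenceFile.Writer.file(out),
                         SequenceFile.Writer.keyClass(LongWritable.class),
                         SequenceFile.Writer.valueClass(Text.class))) {

                LongWritable key = new LongWritable();
                Text value = new Text();
                while (rs.next()) {
                    key.set(rs.getLong("id"));
                    value.set(rs.getString("big_text_col"));  // no delimiter parsing involved
                    writer.append(key, value);
                }
            }
        }
    }
}
```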
Thanks for any input!
P.S. - I should have mentioned that using a delimited file is the problem we're having now. Some of our fields are long character fields that contain the delimiter and therefore don't parse correctly; a field may even have a tab in it. That's why we wanted to go from Postgres straight to HDFS without parsing.
You can export the data from your database as CSV, tab-delimited, pipe-delimited, or Ctrl-A (Unicode 0x0001) delimited files. Then you can copy those files into HDFS and run a very simple MapReduce job, possibly consisting of just a Mapper, configured to read the file format you used and to write out sequence files.
This would allow you to distribute the load of creating the sequence files across the servers of the Hadoop cluster.
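As a rough illustration of that kind of job (the paths, the Ctrl-A delimiter, and treating the first field as the key are assumptions on my part, not something your schema dictates), a map-only conversion might look like this:

```java
// A minimal map-only job of the kind described above: it reads
// Ctrl-A-delimited lines from HDFS and writes them back out as a
// SequenceFile of (Text key, Text value).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class DelimitedToSequenceFile {

    public static class ConvertMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            // Split on Ctrl-A (\u0001); first field becomes the key, the rest the value.
            String[] fields = line.toString().split("\u0001", 2);
            outKey.set(fields[0]);
            outValue.set(fields.length > 1 ? fields[1] : "");
            context.write(outKey, outValue);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "delimited-to-sequencefile");
        job.setJarByClass(DelimitedToSequenceFile.class);

        job.setMapperClass(ConvertMapper.class);
        job.setNumReduceTasks(0);                      // map-only: mapper output is written directly

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS dir with the exported files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // destination for the sequence files

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would run it with the HDFS input directory and an output directory as arguments, and Hadoop will parallelize the conversion across the cluster for you.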
Also, most likely this will not be a one-time deal: you will have to load data from the Postgres database into HDFS on a regular basis. Then you would be able to tweak your MapReduce job to merge the new data in.