We have ORC files stored in S3 and we want to load them into an AWS Aurora PostgreSQL DB.
What we found on the internet was:
- PostgreSQL supports CSV, TXT and other text formats, but not ORC.
```sql
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;
```
Can anyone please help us find a solution?
As of this writing, PostgreSQL on Aurora supports ingestion of data from S3 only from text and CSV files, via the aws_s3 extension's table_import_from_s3 function (which wraps COPY).
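For reference, once your data is in CSV form, the import could look like the following minimal sketch. The endpoint, credentials, bucket, key, and table name are all placeholders, and it assumes the cluster has an IAM role that allows it to read from the bucket:

```python
import psycopg2

# Connect to the Aurora PostgreSQL cluster (all connection details are placeholders).
conn = psycopg2.connect(
    host="my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

with conn, conn.cursor() as cur:
    # One-time setup: enable the S3 import extension (needs sufficient privileges).
    cur.execute("CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;")
    # Import a CSV object from S3 into an existing table.
    cur.execute(
        """
        SELECT aws_s3.table_import_from_s3(
            'my_table',                    -- target table (must already exist)
            '',                            -- column list ('' = all columns)
            '(FORMAT csv, HEADER true)',   -- COPY options
            aws_commons.create_s3_uri('my-bucket', 'converted/foo.csv', 'us-east-1')
        );
        """
    )

conn.close()
```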
Since your files are in ORC format, you could convert them to either CSV or TXT and then ingest the data. You could do this very easily with Athena, by simply creating a table over your original data and running a SELECT * FROM table query. As explained in the Working with Query Results, Output Files, and Query History page, this will automatically generate a CSV file containing the results. This would not be optimal, as you'd pay not only the query price but also for the storage twice (the original ORC plus the converted CSV), but it would allow you to convert the data pretty easily.
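A rough sketch of that conversion with boto3. The bucket names, database, table name, and column schema here are all assumptions, not part of your setup:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql: str) -> None:
    """Submit a query and block until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "default"},
        # Athena writes the results of a SELECT here as a CSV file.
        ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

# 1. Expose the ORC files as an external table (the schema is a placeholder).
run_query("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_orc_table (
        id bigint,
        name string
    )
    STORED AS ORC
    LOCATION 's3://my-source-bucket/orc-data/'
""")

# 2. Selecting everything materializes a CSV copy of the data in the
#    result location, ready for the Aurora S3 import shown above.
run_query("SELECT * FROM my_orc_table")
```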
A better way to do it would instead be to use a service like AWS Glue, which supports S3 as a source and can write to Aurora through a JDBC connection. Using this method would give you an actual ETL pipeline, and even if right now you only need the E(xtract) and L(oad), it would still leave the door open for any kind of Transform you might need in the future; see the sketch below.
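A minimal Glue job sketch in PySpark. The S3 path, the catalog connection name my-aurora-connection, and the database/table names are hypothetical, and it assumes you have already created a Glue connection to your Aurora cluster:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the ORC files directly from S3.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-source-bucket/orc-data/"]},
    format="orc",
)

# (Any Transform step would go here if you ever need one.)

# Load: write into Aurora PostgreSQL through the pre-created Glue JDBC connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="my-aurora-connection",
    connection_options={"dbtable": "my_table", "database": "mydb"},
)

job.commit()
```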
In the AWS blog post titled How to extract, transform, and load data for analytic processing using AWS Glue (Part 2), they show the opposite flow (Aurora → S3 via Glue), but it should still give you an idea of the process.