Using Presto on existing ORC files in S3


I would like to configure Presto to query ORC files in AWS S3. According to the documentation, a Hive Metastore must be configured; this metastore holds the schemas of all the tables. Since I have existing ORC files, is there any way to populate the metastore automatically with the schemas of the already existing tables, which should be embedded in the ORC files?

Thanks in advance for your help!


This is entirely possible, but it may occasionally fail if the ORC writer is not compatible with Trino (formerly known as PrestoSQL). That is rather unlikely, but worth noting.

The first step is to get the schema right. You can print the ORC schema with the orc-tools uber JAR and its meta command. See more: https://orc.apache.org/docs/java-tools.html

java -jar orc-tools-1.6.7-uber.jar meta bucket_00001.orc
Processing data file bucket_00001.orc [length: 78511]
Structure for bucket_00001.orc
File Version: 0.12 with PRESTO_ORIGINAL
Rows: 1500
Compression: ZLIB
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<custkey:bigint,name:string,address:string,nationkey:bigint,phone:string,acctbal:double,mktsegment:string,comment:string>

You will want the Type struct from that output. You should also make sure the schema is the same across all of your ORC files.
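The alignment check can be automated. Below is a minimal Python sketch (the function names are my own) that takes the captured `orc-tools meta` output for each file, pulls out the `Type:` line, and verifies that every file reports the same struct:

```python
# Sketch: confirm all ORC files share one schema before creating the table.
# Assumes you have already captured the text output of `orc-tools meta`
# for each file; this only parses the "Type:" line that the tool prints.

def extract_type(meta_output: str) -> str:
    """Return the struct<...> type string from one file's meta output."""
    for line in meta_output.splitlines():
        line = line.strip()
        if line.startswith("Type:"):
            return line[len("Type:"):].strip()
    raise ValueError("no Type: line found in meta output")

def schemas_aligned(meta_outputs: list[str]) -> bool:
    """True if every meta output reports an identical Type struct."""
    types = {extract_type(out) for out in meta_outputs}
    return len(types) == 1
```

If `schemas_aligned` returns False, the files were written with different schemas and a single table definition will not cover them all.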

The next step is to translate the Type struct into a CREATE TABLE statement, like so:

CREATE TABLE hive.default.customer (
   custkey BIGINT,
   name VARCHAR,
   address VARCHAR,
   nationkey BIGINT,
   phone VARCHAR,
   acctbal DOUBLE,
   mktsegment VARCHAR,
   comment VARCHAR
);
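That translation can also be scripted. Here is a rough Python sketch (the function name and the type-mapping table are my own, and it only handles flat schemas of primitive types; nested or parameterized types such as decimal(p,s) would need a real parser):

```python
# Sketch: translate an ORC "Type:" struct line, as printed by orc-tools meta,
# into a Trino CREATE TABLE statement. The mapping covers only the primitive
# types seen in this example; extend it for your own schemas.

ORC_TO_TRINO = {
    "string": "VARCHAR",
    "bigint": "BIGINT",
    "int": "INTEGER",
    "double": "DOUBLE",
    "float": "REAL",
    "boolean": "BOOLEAN",
}

def struct_to_create_table(type_line: str, table: str) -> str:
    # Strip the surrounding "struct<...>" wrapper.
    inner = type_line.strip()[len("struct<"):-1]
    columns = []
    for field in inner.split(","):  # naive split: flat schemas only
        name, orc_type = field.split(":", 1)
        columns.append(f"   {name} {ORC_TO_TRINO.get(orc_type, orc_type.upper())}")
    cols = ",\n".join(columns)
    return f"CREATE TABLE {table} (\n{cols}\n);"
```

Feeding it the Type struct from the meta output above reproduces the CREATE TABLE statement shown here.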

Once the table is created you should be able to start querying, unless the table is partitioned. In that case you will need to sync the Hive metastore (or possibly the file metastore) to add the existing partition locations to it.

Assuming you had a partition on the nationkey column above, your CREATE TABLE statement would look like this:

CREATE TABLE hive.default.customer (
   custkey BIGINT,
   name VARCHAR,
   address VARCHAR,
   phone VARCHAR,
   acctbal DOUBLE,
   mktsegment VARCHAR,
   comment VARCHAR,
   nationkey BIGINT
) WITH (
  partitioned_by = ARRAY['nationkey']
);

You would then sync the partitions, which can be done with the system.sync_partition_metadata procedure:

CALL system.sync_partition_metadata('default', 'customer', 'ADD');
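The procedure's behavior can be modeled as set operations between the partitions found on storage and those registered in the metastore. The sketch below is an illustration only (the real procedure runs inside Trino); 'ADD' registers partitions present on storage but missing from the metastore, 'DROP' removes metastore entries whose storage location is gone, and 'FULL' does both:

```python
# Sketch: what sync_partition_metadata's modes do, modeled as set operations.
# `storage` = partition directories found under the table location,
# `metastore` = partitions currently registered in the metastore.

def sync(storage: set[str], metastore: set[str], mode: str) -> set[str]:
    """Return the metastore's partition set after syncing with `mode`."""
    if mode == "ADD":   # add partitions on storage that the metastore lacks
        return metastore | storage
    if mode == "DROP":  # drop metastore partitions missing from storage
        return metastore & storage
    if mode == "FULL":  # both: metastore ends up matching storage exactly
        return set(storage)
    raise ValueError(f"unknown mode: {mode}")
```

For the example above, 'ADD' is the right mode since the metastore starts empty and the partition directories already exist in S3.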

There is also a demo of this in the Trino Community Broadcast (from back when Trino was still called PrestoSQL).